google-sre-interview-handbook

🚨 Incident Playbook: The BGP Route Leak

“What do you do when your dashboards are perfectly green, but the Internet thinks you no longer exist?”

In a Senior/Staff SRE interview, interviewers will test your “Edge” and “Network” intuition. They will give you a scenario where the application is flawless and the databases are fast, but traffic has suddenly dropped off a cliff.

If your troubleshooting stops at the Kubernetes ingress controller, you fail. You must prove you can debug the global routing table.


🛑 The “Application Tunnel Vision” Trap

When traffic drops, junior engineers assume the application broke and stopped accepting connections. Senior SREs look at the edge.

- The Failing Move (The "App Developer")
- "Traffic dropped by 80% in South America! The API pods must be deadlocked. I'm going to exec into the containers, check the JVM heaps, and restart the ingress controllers."

+ The Passing Move (The "Reliability Architect")
+ "Our internal API health is 100%, but ingress traffic plummeted. We aren't down; the internet is misrouting our users. I'm going to check external Looking Glass servers to see if our BGP prefixes are being hijacked or leaked by a transit provider in South America."

If you try to fix a global routing leak by restarting a Docker container, you instantly broadcast a lack of operational maturity.


🔍 1. Symptoms & Initial Triage

🛠️ 2. First 5 Commands (Localization)

You cannot SSH into an ISP’s router in another country. How do you get signal?

1. Verify from the Outside In (External Probing)
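If your own dashboards can't be trusted, probe the public endpoint from outside your network. A minimal sketch, assuming a hypothetical hostname `api.example.com` (substitute your own):

```shell
TARGET_HOST="api.example.com"   # hypothetical public endpoint

# Resolve via an outside resolver, trace the forward path, and confirm
# whether the edge is even reachable before blaming the application.
probe_outside() {
  dig +short "$TARGET_HOST" @8.8.8.8                 # does public DNS still resolve us?
  mtr --report --report-cycles 5 "$TARGET_HOST"      # where along the path do packets die?
  curl -sS -o /dev/null -w 'http=%{http_code} connect=%{time_connect}s\n' \
    "https://$TARGET_HOST/healthz"                   # hypothetical health endpoint
}
```

If `dig` resolves but `mtr` dies several AS hops away from you, the problem is routing, not your pods.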

2. Check BGP Prefix Announcements
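You want to know who the internet currently believes originates your prefix. A sketch using the public RADB whois server and the RIPEstat API, with a hypothetical prefix `203.0.113.0/24`:

```shell
PREFIX="203.0.113.0/24"   # hypothetical prefix; substitute your own

# Compare the registered route objects against what is actually being
# announced right now. A mismatch in origin AS is the leak/hijack signal.
check_origin() {
  whois -h whois.radb.net "$PREFIX"                  # who *should* originate this?
  curl -s "https://stat.ripe.net/data/routing-status/data.json?resource=$PREFIX"
}
```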

3. Inspect Edge Traffic at the Packet Level
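At the packet level, a route leak looks like a sudden drop in *new* connections from the affected region while established flows linger. A sketch (the interface name `eth0` is an assumption):

```shell
# Count inbound TLS SYNs on the edge interface; if South America vanished,
# its source ranges will be conspicuously absent from this capture.
capture_edge() {
  tcpdump -ni eth0 -c 100 'tcp[tcpflags] & tcp-syn != 0 and dst port 443'
}
```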

4. Check Edge Router Logs
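On the routers you *do* control, check whether your own BGP sessions flapped or your announcements changed. A sketch assuming an FRR-based edge (the log path is an assumption):

```shell
# Session state first, then recent BGP log churn. Stable sessions + missing
# traffic strongly implies the problem is upstream of you.
check_bgp_sessions() {
  vtysh -c 'show bgp summary'
  grep -i 'bgp' /var/log/frr/frr.log | tail -n 50
}
```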

5. Query Global Synthetic Monitors
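Finally, pull per-region reachability from your synthetic probes to bound the blast radius. The endpoint and region name below are hypothetical placeholders for whatever synthetics system you run:

```shell
# Hypothetical internal synthetics API; endpoint, params, and field names
# are assumptions, not a real product's interface.
query_synthetics() {
  curl -s "https://synthetics.internal.example.com/api/v1/checks?region=sa-east" |
    jq '.checks[] | {probe: .name, status: .status}'
}
```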

🛡️ 3. Mitigation Sequence (Stabilize)

You do not control the router that is leaking your routes. You must use Traffic Engineering to mitigate.

  1. DNS Steering (Fastest): If you use DNS-based global load balancing, immediately update your DNS records to point affected users to a different IP range/region that is not being hijacked. (Assumes you have low TTLs).
  2. Anycast Withdrawal: If using Anycast, withdraw the BGP announcements for the affected PoP (Point of Presence). This forces upstream routers to find the next best path to your other healthy datacenters.
  3. Prepend AS Paths: If a specific peering link is saturated by a leak, advertise your routes to that peer with a heavily prepended AS Path (e.g., AS_PATH: 65000 65000 65000). This makes the route look artificially “long” and ugly to the internet, forcing traffic to take a different path.
  4. Engage the NOC: Open an emergency ticket with your upstream transit provider (Tier 1 ISP) to filter out the leaked routes.
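Step 3 might look like the following FRR-style configuration fragment, assuming a hypothetical local ASN 65000 and peer 198.51.100.1:

```
route-map PREPEND-OUT permit 10
 set as-path prepend 65000 65000 65000
!
router bgp 65000
 neighbor 198.51.100.1 route-map PREPEND-OUT out
```

The tripled AS in the path makes your route look three hops longer to BGP best-path selection, so traffic drains toward your other announcements without you withdrawing anything.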

🔬 4. Root Cause Investigation

Once traffic is restored via steering, you investigate the routing anomaly.
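The key forensic artifact is the BGP update history for your prefix during the incident window. A sketch against the public RIPEstat API (prefix and timestamps are hypothetical; the JSON field names are my assumption about the response shape):

```shell
PREFIX="203.0.113.0/24"   # hypothetical prefix

# Fetch announce/withdraw events for the prefix during the incident window,
# so you can pin down exactly which AS started originating or re-exporting it.
fetch_bgp_history() {
  curl -s "https://stat.ripe.net/data/bgp-updates/data.json?resource=$PREFIX&starttime=2024-05-01T10:00&endtime=2024-05-01T12:00" |
    jq '.data.updates[] | {seen: .timestamp, type: .type, path: .attrs.path}'
}
```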

🧱 5. Prevention (The Senior Signal)

To score “Exceptional” (L5/L6), you must discuss cryptographic and policy-based routing defenses.
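The cryptographic defense to name here is RPKI: publish ROAs for your prefixes so that leaks from the wrong origin AS fail route-origin validation. A sketch for checking your own origin against published ROAs via the RIPEstat API (ASN and prefix are hypothetical):

```shell
# Query RPKI validation status for a (origin AS, prefix) pair; the exact
# response field is my assumption about the API's JSON shape.
check_rpki() {
  curl -s "https://stat.ripe.net/data/rpki-validation/data.json?resource=AS65000&prefix=203.0.113.0/24" |
    jq -r '.data.status'
}
```

Pair this with policy defenses: max-prefix limits and strict import filters on every peering session, so a fat-fingered full-table leak from a peer trips a session shutdown instead of polluting your routing.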


🚀 The “Execution Sequencing” Gap

In a high-pressure interview, it is incredibly tempting to focus on what you can control (your application code) rather than what you can’t (the global internet routing table).

But a Google Reliability Architect knows that the network is not reliable.

If you don’t sequence your debugging to verify the network edge before tearing apart your microservices, you will fail the Troubleshooting round.

I built The Complete Google SRE Career Launchpad to simulate these exact, “outside-in” infrastructure failures.

👉 Get The Complete Google SRE Career Launchpad (Gumroad)

The Full Training System Includes:

Don’t let a network trap ruin your loop. Train your reflexes.