google-sre-interview-handbook

🚨 Incident Playbook: The Kernel Panic Cascade

“At Google scale, hardware and kernels don’t just fail. They fail simultaneously, trap your workloads, and take down your control plane.”

In a senior SRE troubleshooting interview, the interviewer will eventually strip away your application metrics. They will tell you the dashboards are blank, SSH is timing out, and Kubernetes is evicting pods.

They are testing your Machine-Space Intuition. Can you debug the Operating System itself?


🛑 The “Reboot” Trap

When faced with an unresponsive node, the instinct of a junior operator is to turn it off and on again. In a Google SRE loop, this is a fatal error.

- The Failing Move (The "Script Runner")
- "I can't SSH into the box. I will go into the AWS/GCP console and hard reboot the instance. Once it comes back up, I'll check the logs."

+ The Passing Move (The "Reliability Architect")
+ "A hard reboot destroys the forensic evidence in RAM. I will cordon the node to stop new traffic, trigger a crash dump (kdump) to preserve the memory state, and isolate the node for investigation. If this is a bad kernel patch, rebooting it will just put it back into a crash loop."

If you destroy the evidence before capturing the signal, you fail the round.
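The passing move can be sketched in two commands, assuming a Kubernetes fleet, a systemd-based distro with the kdump service, and a hypothetical node name `node-17`; the `||` fallbacks let the sketch run outside a live cluster:

```shell
# Stop new workloads from landing on the suspect node (name is hypothetical);
# cordon is evidence-preserving: it changes scheduling, not the node itself
kubectl cordon node-17 \
  || echo "cordon failed -- check kubeconfig/context"

# Verify kdump is armed BEFORE anything reboots, so a panic leaves a vmcore
systemctl is-active kdump 2>/dev/null \
  || echo "kdump not active -- a panic here will destroy its own evidence"
```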


🔍 1. Symptoms & Initial Triage

🛠️ 2. First 5 Commands (Localization)

You cannot SSH into a panicked or frozen box. How do you get signal?

1. Access the Out-of-Band Console

2. Inspect the Crash Dump Directory

3. If partially alive: Check for D-State Processes

4. Check Kernel Logs for OOM Cascades

5. Trigger a Manual Crash Dump (The “Magic SysRq” Key)
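Assuming a GCP fleet (for the serial console), a distro that writes dumps to `/var/crash`, and standard procps/util-linux tooling, the five probes above might look like this; the SysRq trigger is shown commented out because it deliberately panics the kernel:

```shell
# 1. Out-of-band console -- SSH is dead, so read the serial console instead
#    (GCP shown; instance and zone names are hypothetical)
# gcloud compute instances get-serial-port-output node-17 --zone us-central1-a

# 2. Crash dump directory (location varies by distro; /var/crash is common)
ls -lh /var/crash/ 2>/dev/null || echo "no crash dumps present"

# 3. D-state (uninterruptible sleep) processes -- usually wedged on I/O
ps -eo pid,stat,wchan:32,comm 2>/dev/null | awk 'NR==1 || $2 ~ /^D/'

# 4. Kernel ring buffer: look for OOM-killer cascades
dmesg --ctime 2>/dev/null | grep -iE "out of memory|oom-kill" \
  || echo "no OOM events visible (may need root)"

# 5. Magic SysRq crash trigger -- DESTRUCTIVE: panics the kernel on purpose
#    so kdump writes a vmcore. Shown commented out for that reason.
# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger
```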

🛡️ 3. Mitigation Sequence (Stabilize)

  1. Assess the Blast Radius: Is this isolated to one node (likely a bad RAM stick or SSD), or is it 30% of the fleet in one Availability Zone?
  2. Cordon and Drain: Tell the orchestrator (Kubernetes) to stop scheduling new work to the affected instances.
  3. Halt Rollouts: If a new OS image, kernel version, or DaemonSet was recently deployed, hit the big red Freeze button immediately.
  4. Auto-Remediation: Ensure your autoscaler is spinning up fresh instances on the last known-good kernel image to replace the dying ones.
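Steps 1 through 3 can be sketched as follows (node and workload names are hypothetical; the `||` fallbacks let the sketch run outside a live cluster). Step 4 is usually autoscaler or instance-group configuration rather than a one-liner:

```shell
# Steps 1-2: cordon, then drain the affected node
kubectl cordon node-17 \
  || echo "cordon failed -- verify kubeconfig/context"
kubectl drain node-17 --ignore-daemonsets --delete-emptydir-data \
  || echo "drain failed -- node may already be unreachable"

# Step 3: freeze a suspect rollout (pause works on Deployments; for a
# DaemonSet, halt the CD pipeline or switch updateStrategy to OnDelete)
kubectl rollout pause deployment/node-agent -n kube-system \
  || echo "pause failed -- deployment may not exist in this cluster"
```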

🔬 4. Root Cause Investigation

Once the fleet is stable, you analyze the vmcore dump using tools like crash or gdb. Typical Google-scale culprits include:
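A minimal `crash` session sketch, with illustrative paths (the vmlinux must carry debug symbols matching the exact kernel that panicked):

```shell
# crash needs the vmcore plus a debug-symbol vmlinux for the panicked kernel
command -v crash >/dev/null || echo "crash utility not installed"

# Open the dump (paths are illustrative):
# crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /var/crash/<timestamp>/vmcore
#
# Useful commands inside the crash shell:
#   bt       backtrace of the task that panicked
#   log      kernel ring buffer as captured at panic time
#   ps       process table at the moment of the crash
#   kmem -i  memory summary (spots OOM-driven panics)
```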

🧱 5. Prevention (The Senior Signal)

To score “Exceptional” (L5/L6), you must engineer the system to survive the next panic automatically.
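One concrete prevention layer is baking a panic policy into the node image. A minimal sketch as sysctl settings, with illustrative values to tune per fleet (this assumes kdump is already armed, so the vmcore is captured before the automatic reboot):

```ini
# /etc/sysctl.d/99-panic-policy.conf (illustrative values)

# Reboot 10 seconds after a panic instead of hanging at the console
kernel.panic = 10

# Escalate any oops to a full panic rather than running on corrupted state
kernel.panic_on_oops = 1

# Log tasks stuck in uninterruptible (D) state for two minutes or more
kernel.hung_task_timeout_secs = 120
```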


🚀 The “Execution Sequencing” Gap

Knowing what kdump does is easy.

Knowing when to trigger it, how to isolate the node, and why you must stop the Kubernetes scheduler from placing new pods on it, all during a live 45-minute interview, is what separates the hires from the rejects.

Google SRE interviews test your Execution Sequencing under pressure. If your sequence is wrong, your technical knowledge won’t save you.

I built The Complete Google SRE Career Launchpad to simulate these exact, high-stakes infrastructure failures.

👉 Get The Complete Google SRE Career Launchpad (Gumroad)

The Full Training System Includes:

Don’t let a kernel panic freeze your interview. Train your reflexes.