
🚨 Incident Playbook: Disk I/O Saturation & The Silent Outage

“A full disk is easy to detect. A slow disk is a silent killer.”

In a Google SRE troubleshooting interview, you will often be handed a scenario where the application CPU is low, memory is stable, and the network is uncongested, yet p99 latency has spiked from 50 ms to 2 seconds.

If you start debugging the application code, you fail. The database is choking on disk I/O, and application threads are piling up while they wait for writes to complete.


🛑 The “Throughput vs. IOPS” Trap

The most common mistake candidates make when debugging storage is looking at the wrong metric.

- The Failing Move (The "Dashboard Watcher")
- "I see our cloud volume is rated for 500 MB/s, and we are only writing 40 MB/s. The disk is fine. I'm going to check the application logic for a deadlock."

+ The Passing Move (The "Reliability Architect")
+ "We are only doing 40 MB/s, but our workload is logging thousands of tiny 4KB events per second. I need to check our IOPS (Input/Output Operations Per Second) limits. If our volume is capped at 10,000 IOPS, we are hitting the ceiling, causing I/O wait times to skyrocket."

Throughput is how much water flows through the pipe; IOPS is how many individual buckets you can dip per second. Small, random writes exhaust IOPS long before they exhaust throughput.
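
The arithmetic behind the passing answer, as a quick sanity check (the numbers are the illustrative ones from this section, not a universal rule):

```shell
# 40 MB/s of 4 KB writes is 10,240 IOPS -- already past a 10,000 IOPS cap,
# even though 40 MB/s looks trivial next to a 500 MB/s throughput rating.
awk 'BEGIN { mb_per_s = 40; io_kb = 4; printf "%d IOPS\n", (mb_per_s * 1024) / io_kb }'
```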


🔍 1. Symptoms & Initial Triage
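
The classic signature of this failure mode (thresholds are illustrative): load average far above the core count while CPU utilization stays low, and tasks stuck in uninterruptible (`D`) sleep — which almost always means they are blocked on disk.

```shell
# High load + idle CPU + tasks in "D" (uninterruptible, usually disk) sleep
# is the I/O-saturation signature.
uptime        # load average far above the core count?
nproc         # ...compare against this
# Count processes currently blocked in uninterruptible sleep:
ps -eo state= | awk '$1 ~ /^D/ { c++ } END { print c + 0, "tasks in D state" }'
```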

🛠️ 2. First 5 Commands (Localization)

You need to prove the disk is the bottleneck.

1. Check I/O Wait (The Smoking Gun)
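
With the sysstat package installed, `iostat`/`mpstat` show `%iowait` directly; a portable fallback (a sketch reading the raw kernel counters) is `/proc/stat`:

```shell
# Preferred (sysstat): iostat -c 1 3   -- watch the %iowait column climb.
# Fallback: field 6 of the aggregate "cpu" line in /proc/stat is cumulative
# iowait jiffies; sample it twice to confirm it is rising fast.
awk '/^cpu /{ print "iowait jiffies:", $6 }' /proc/stat
sleep 1
awk '/^cpu /{ print "iowait jiffies:", $6 }' /proc/stat
```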

2. Check Disk Saturation and Queue Depth
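
`iostat -x` is the standard view here (`%util` pinned near 100 with a growing queue means the device is saturated); if sysstat is missing, `/proc/diskstats` exposes the raw queue depth:

```shell
# Preferred (sysstat): iostat -x 1   -- %util ~100 plus a growing aqu-sz = saturated.
# Fallback: in /proc/diskstats, field 12 is I/Os currently in flight (queue
# depth) and field 13 is total ms the device has spent doing I/O.
awk '{ printf "%-10s in-flight=%s busy_ms=%s\n", $3, $12, $13 }' /proc/diskstats
```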

3. Identify the Offending Process
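
`iotop` or `pidstat` give this per process; without them, the per-process byte counters in `/proc/<pid>/io` can be ranked directly (a rough sketch — reading other users' counters requires root):

```shell
# Preferred: iotop -obn 2   (batch mode, only threads actively doing I/O)
#        or: pidstat -d 1   (per-process kB written/s, sysstat package)
# Fallback: rank every process by cumulative write_bytes:
for p in /proc/[0-9]*/io; do
  awk -v pid="${p%/io}" '/^write_bytes/ { print $2, pid }' "$p" 2>/dev/null
done | sort -rn | head -5
```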

4. Check the Dirty Page Cache
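
Large, growing `Dirty`/`Writeback` values in `/proc/meminfo` mean the page cache is absorbing writes faster than the disk can drain them — and once dirty memory crosses the kernel's `vm.dirty_ratio` threshold, writing processes block outright:

```shell
# Dirty = pages waiting to be written; Writeback = pages being written now.
grep -E '^(Dirty|Writeback):' /proc/meminfo
# Kernel thresholds (percent of RAM): past dirty_background_ratio the flusher
# threads wake up; past dirty_ratio, writers stall.
cat /proc/sys/vm/dirty_background_ratio /proc/sys/vm/dirty_ratio
```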

5. Check Inode Exhaustion (The “Fake” Full Disk)
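
"No space left on device" with gigabytes still free usually means inodes, not bytes, are exhausted — typically millions of tiny files from a runaway job. `IUse%` at 100% is the tell:

```shell
# Inode usage vs. byte usage -- a mismatch confirms the "fake" full disk:
df -i /
df -h /
```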

🛡️ 3. Mitigation Sequence (Stabilize)

You cannot “magically” make the disk faster during an incident. You must reduce the load.

  1. Shed Load (Drop Non-Critical I/O): Temporarily disable verbose debug logging, or pause background batch jobs (like database backups or compactions) that are competing for IOPS.
  2. Increase IOPS (Cloud-Native): If you are on AWS EBS or GCP Persistent Disk, dynamically modify the volume to provision higher IOPS (e.g., on AWS, migrate from gp2 to io1/io2; on GCP, grow the disk, since PD IOPS scale with size).
  3. Route Traffic Away: If this is a database primary, fail over to a replica in a different zone that isn't experiencing the noisy-neighbor disk contention.
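
Step 2, sketched with hypothetical volume/disk names (substitute your real IDs, zones, and IOPS targets; both operations apply while the volume stays attached):

```shell
# AWS EBS: raise the IOPS ceiling in place (gp2 -> io2 with provisioned IOPS).
aws ec2 modify-volume --volume-id vol-0123456789abcdef0 \
    --volume-type io2 --iops 16000
# GCP Persistent Disk: PD IOPS scale with size, so growing the disk raises
# the ceiling without detaching it.
gcloud compute disks resize example-data-disk --size=2000GB --zone=us-central1-a
```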

🔬 4. Root Cause Investigation

Once the system is stable, find out why the disk saturated.
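
One concrete starting point, assuming the saturation was write-driven: find what grew during the incident window. Runaway debug logs, compactions, and backup spools all show up as large, freshly modified files (the path and thresholds below are illustrative):

```shell
# Files over 100 MB modified in the last 30 minutes, with sizes:
find /var -xdev -type f -mmin -30 -size +100M -exec ls -lhS {} + 2>/dev/null
```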

🧱 5. Prevention (The Senior Signal)

To score “Exceptional” (L5/L6), you must engineer the system to decouple application latency from disk latency.
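
One concrete decoupling lever is write batching: coalescing many small synchronous writes into a few large ones, so the workload spends cheap throughput instead of scarce IOPS. A minimal demonstration (paths and sizes are illustrative):

```shell
# 1,000 tiny synchronous appends: every 4 KB block is its own I/O operation.
dd if=/dev/zero of=/tmp/tiny.dat bs=4k count=1000 oflag=dsync status=none
# The same ~4 MB as a single buffered write: a handful of large I/Os instead.
dd if=/dev/zero of=/tmp/batched.dat bs=4M count=1 status=none
ls -l /tmp/tiny.dat /tmp/batched.dat
```

The same bytes hit the disk either way; only the IOPS bill differs — which is exactly why the 40 MB/s workload earlier in this playbook can cap out a 10,000 IOPS volume.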


🚀 The “Execution Sequencing” Gap

Knowing iostat is table stakes.

Knowing when to look at iowait instead of CPU, how to throttle background jobs to save the database, and why you must batch writes to preserve IOPS is what separates the hires from the rejects.

Google SRE interviews test your Execution Sequencing under pressure. If your sequence is wrong, your technical knowledge won’t save you.

I built The Complete Google SRE Career Launchpad to simulate these exact, high-stakes infrastructure failures.

👉 Get The Complete Google SRE Career Launchpad (Gumroad)

The Full Training System Includes:

Don’t let a slow disk ruin your loop. Train your reflexes.