“A full disk is easy to detect. A slow disk is a silent killer.”
In a Google SRE troubleshooting interview, you will often be handed a scenario where the application CPU is low, memory is stable, the network is uncongested, but p99 latency has spiked from 50ms to 2 seconds.
If you start debugging the application code, you fail. The database is choking on disk I/O, and the application threads are piling up waiting for the write to finish.
The most common mistake candidates make when debugging storage is looking at the wrong metric.
- The Failing Move (The "Dashboard Watcher")
- "I see our cloud volume is rated for 500 MB/s, and we are only writing 40 MB/s. The disk is fine. I'm going to check the application logic for a deadlock."
+ The Passing Move (The "Reliability Architect")
+ "We are only doing 40 MB/s, but our workload is logging thousands of tiny 4KB events per second. I need to check our IOPS (Input/Output Operations Per Second) limits. If our volume is capped at 10,000 IOPS, we are hitting the ceiling, causing I/O wait times to skyrocket."
Throughput is how much water flows through the pipe per second. IOPS is how many separate buckets you can carry per second, no matter how full each bucket is. Small, random writes exhaust IOPS long before they exhaust throughput.
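The arithmetic behind that scenario can be sketched in a few lines. This is a rough illustration using the numbers from the example above (40 MB/s of 4KB writes against a 500 MB/s, 10,000 IOPS volume), and it assumes every write is exactly 4 KB and that 1 MB = 1024 KB. The function names are hypothetical.

```python
def implied_iops(throughput_mb_s: float, io_size_kb: float) -> float:
    """IOPS needed to sustain a throughput when every operation moves io_size_kb."""
    return throughput_mb_s * 1024 / io_size_kb


def utilization(throughput_mb_s, io_size_kb, max_throughput_mb_s, max_iops):
    """Return (throughput_fraction, iops_fraction) of the volume's rated limits."""
    return (
        throughput_mb_s / max_throughput_mb_s,
        implied_iops(throughput_mb_s, io_size_kb) / max_iops,
    )


# Numbers from the scenario: 40 MB/s of 4 KB writes, 500 MB/s and 10,000 IOPS limits.
tput_frac, iops_frac = utilization(40, 4, 500, 10_000)
print(f"throughput used: {tput_frac:.0%}, IOPS used: {iops_frac:.0%}")
# prints "throughput used: 8%, IOPS used: 102%"
```

The throughput dashboard shows 8% while the IOPS budget is already over its limit, which is exactly why the "Dashboard Watcher" misses it.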
iowait.
You need to prove the disk is the bottleneck.
1. Check I/O Wait (The Smoking Gun)
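The I/O wait percentage that top and vmstat report is derived from the aggregate cpu line in /proc/stat, whose fifth counter is time spent idle while I/O was outstanding. The sketch below shows that calculation between two samples. The sample lines are made-up values, not real measurements, and the function names are hypothetical.

```python
def parse_cpu_line(line: str) -> list[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into its first 8 counters."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Order: user nice system idle iowait irq softirq steal
    return [int(v) for v in fields[1:9]]


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two /proc/stat samples."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    return 100.0 * deltas[4] / total if total else 0.0


# Hypothetical samples taken one second apart.
before = "cpu  1000 0 500 8000 500 0 0 0 0 0"
after = "cpu  1100 0 550 8450 800 0 0 0 0 0"
print(f"iowait: {iowait_percent(before, after):.1f}%")  # prints "iowait: 33.3%"
```

On a live Linux host, the two strings would be the first line of /proc/stat read about a second apart.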
top or vmstat 1
Look at the %wa (I/O wait) column. If this number is consistently high (e.g., >20%), your CPU is spending its time waiting for the disk to respond.
2. Check Disk Saturation and Queue Depth
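Disk saturation shows up in two figures: average request latency (await) and the percentage of time the device was busy (%util). Both are derived from per-device counters in /proc/diskstats. The sketch below shows that arithmetic for a combined read and write await. The sample lines are made-up values for a hypothetical device sda, and the helper names are not from any library.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    ios: int          # reads + writes completed
    io_time_ms: int   # total ms spent on those requests (reads + writes)
    busy_ms: int      # ms during which the device had I/O in flight


def parse_diskstats_line(line: str) -> DiskSample:
    """Parse one /proc/diskstats line (major minor name, then counters)."""
    f = line.split()
    return DiskSample(
        ios=int(f[3]) + int(f[7]),
        io_time_ms=int(f[6]) + int(f[10]),
        busy_ms=int(f[12]),
    )


def await_and_util(before: DiskSample, after: DiskSample, interval_ms: float):
    """Return (average ms per request, percent of the interval the disk was busy)."""
    ios = after.ios - before.ios
    avg_wait = (after.io_time_ms - before.io_time_ms) / ios if ios else 0.0
    util = 100.0 * (after.busy_ms - before.busy_ms) / interval_ms
    return avg_wait, util


# Hypothetical samples taken 1000 ms apart.
s0 = parse_diskstats_line("8 0 sda 1000 0 8000 2000 5000 0 40000 10000 0 6000 12000")
s1 = parse_diskstats_line("8 0 sda 1010 0 8080 4000 5090 0 40720 28000 12 6990 32000")
avg_wait, util = await_and_util(s0, s1, interval_ms=1000)
print(f"await={avg_wait:.0f} ms util={util:.0f}%")  # prints "await=200 ms util=99%"
```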
iostat -xz 1
Look at %util (is the disk 100% busy?) and await (how long does each request take, queue time included? Newer versions of iostat split this into r_await and w_await). If await jumps from 2ms to 200ms, your disk is saturated.
3. Identify the Offending Process
sudo iotop -oPa
Is the heavy I/O coming from the database (postgres), or is it a runaway logging agent (fluentd) competing for the same disk?
4. Check the Dirty Page Cache
cat /proc/meminfo | grep -i dirty
5. Check Inode Exhaustion (The “Fake” Full Disk)
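Inode exhaustion can also be checked programmatically, which helps if you want an alert instead of a manual check. Here is a minimal sketch, assuming a Unix host, that uses os.statvfs and its f_files (total inodes) and f_ffree (free inodes) fields. The function names are hypothetical.

```python
import os


def inode_usage_percent(total_inodes: int, free_inodes: int) -> float:
    """Percentage of inodes in use; 0.0 if the filesystem reports no inode count."""
    if total_inodes == 0:  # some filesystems do not report a fixed inode count
        return 0.0
    return 100.0 * (total_inodes - free_inodes) / total_inodes


def check_path(path: str = "/") -> float:
    """Inode usage for the filesystem that contains `path`."""
    st = os.statvfs(path)
    return inode_usage_percent(st.f_files, st.f_ffree)


print(f"inode usage on /: {check_path('/'):.1f}%")
```

A value near 100% means new files cannot be created even when byte-level free space remains.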
df -i
The disk still has free space (df -h looks fine), but you have run out of “file pointers” (inodes). The filesystem refuses to create new files.
You cannot “magically” make the disk faster during an incident. You must reduce the load.
gp2 to io1/io2).
Once the system is stable, find out why the disk saturated.
fsync()) on every single transaction instead of batching them. This instantly murders IOPS.
To score “Exceptional” (L5/L6), you must engineer the system to decouple application latency from disk latency.
blkio cgroup limits to ensure background tasks (like backups) are throttled and cannot consume more than 10% of the disk’s IOPS, protecting the primary database.
Knowing iostat is table stakes.
Knowing when to look at iowait instead of CPU, how to throttle background jobs to save the database, and why you must batch writes to preserve IOPS is what separates the hires from the rejects.
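The batching point can be made concrete. The sketch below, with hypothetical function names and record sizes, counts how many fsync() calls are issued when syncing after every record versus once per batch. Each fsync forces at least one flush to the device, so fewer calls means fewer IOPS spent on durability.

```python
import os
import tempfile


def write_with_fsync_each(path: str, records: list[bytes]) -> int:
    """Write records, forcing each one to disk; returns the number of fsync calls."""
    calls = 0
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec)
            f.flush()
            os.fsync(f.fileno())
            calls += 1
    return calls


def write_batched(path: str, records: list[bytes], batch_size: int = 50) -> int:
    """Write records, syncing once per batch; returns the number of fsync calls."""
    calls = 0
    with open(path, "wb") as f:
        for i, rec in enumerate(records, start=1):
            f.write(rec)
            if i % batch_size == 0:
                f.flush()
                os.fsync(f.fileno())
                calls += 1
        if len(records) % batch_size:  # sync the final partial batch
            f.flush()
            os.fsync(f.fileno())
            calls += 1
    return calls


records = [b"x" * 4096 for _ in range(200)]  # 200 hypothetical 4 KB events
with tempfile.TemporaryDirectory() as tmp:
    per_record = write_with_fsync_each(os.path.join(tmp, "a.log"), records)
    batched = write_batched(os.path.join(tmp, "b.log"), records, batch_size=50)
print(per_record, batched)  # prints "200 4"
```

The trade-off is durability: records in a batch are not guaranteed to be on disk until that batch's fsync returns, so a writer should only acknowledge them after the sync.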
Google SRE interviews test your Execution Sequencing under pressure. If your sequence is wrong, your technical knowledge won’t save you.
I built The Complete Google SRE Career Launchpad to simulate these exact, high-stakes infrastructure failures.
👉 Get The Complete Google SRE Interview Career Launchpad (Gumroad)
Don’t let a slow disk ruin your loop. Train your reflexes.