google-sre-interview-handbook

🐧 The Google SRE’s Tactical Linux Command Cheat Sheet

“The 20 Commands That Solve 80% of Production Incidents”

In a Google SRE interview (especially Troubleshooting and NALS), you aren’t judged on how many flags you memorize. You are judged on your command fluency and risk awareness.

- The Junior Approach
- Runs `cat /var/log/syslog | grep error` on a 50GB file, forcing a full read that saturates disk I/O and churns the page cache on a production node that is already struggling.

+ The Senior Approach
+ Runs `tail -n 5000 /var/log/syslog | grep -i error` to safely extract recent signals without threatening node stability.

Can you manipulate text streams, diagnose system state, and inspect logs without leaving the terminal? This cheat sheet covers the “Bread and Butter” commands you need to have at your fingertips.


1. The “Log Surgeon” Toolkit (Text Processing)

Google SREs live in logs. These commands turn raw text into data.

tail (The Pulse Check)
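
A minimal sketch of following a log live, assuming a throwaway temp file stands in for a real log such as `/var/log/syslog`. The file names, the sample messages, and the one-second sleeps are illustrative choices, not part of any real incident.

```shell
# Hypothetical stand-in for a live log file.
log=$(mktemp)
captured=$(mktemp)
echo "INFO service started" > "$log"

# -n 0 skips the backlog; -f keeps the stream open and prints lines as they arrive.
tail -n 0 -f "$log" > "$captured" &
tail_pid=$!

sleep 1                                   # give tail time to attach
echo "ERROR upstream timeout" >> "$log"   # simulate a new event
sleep 1                                   # give tail time to emit it

kill "$tail_pid"
wait "$tail_pid" 2>/dev/null || true      # reap the background job quietly
new_lines=$(cat "$captured")
echo "$new_lines"
rm -f "$log" "$captured"
```

In an interactive session you would simply run `tail -f` on the log and stop it with Ctrl-C; the background job here only exists so the example terminates on its own.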

head & tail -n (Safe Inspection)
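
A small sketch of bounded inspection, assuming a generated 10,000-line file in place of a real log. The log format (`LEVEL request N`) is invented for the example.

```shell
# Build a hypothetical log: every 1000th line is an ERROR, the rest are INFO.
log=$(mktemp)
awk 'BEGIN {
    for (i = 1; i <= 10000; i++) {
        level = (i % 1000 == 0) ? "ERROR" : "INFO"
        print level, "request", i
    }
}' > "$log"

# head reads only the first lines, so it stays cheap no matter how big the file is.
first_line=$(head -n 1 "$log")

# tail -n bounds the work to the most recent 5000 lines before grep ever runs.
recent_errors=$(tail -n 5000 "$log" | grep -ci error)

echo "first line: $first_line"
echo "errors in last 5000 lines: $recent_errors"
rm -f "$log"
```

The point is that both commands put a hard ceiling on how much data flows down the pipe, which is what makes them safe on a node you do not fully trust.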

awk (The Column Extractor)
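
A quick sketch of column extraction, assuming a made-up access log with five space-separated fields: client IP, method, path, status, and latency in milliseconds. Real log formats will put these in different columns.

```shell
# Hypothetical access log; the field layout is an assumption for this example.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 GET /index.html 200 12
10.0.0.2 GET /api/users 500 340
10.0.0.1 POST /api/login 200 45
10.0.0.3 GET /api/users 500 410
EOF

# Print the client IP ($1) and path ($3) of every request whose status ($4) is 500.
failed=$(awk '$4 == 500 { print $1, $3 }' "$log")
echo "$failed"

# Average latency ($5) across all requests; NR is the number of lines read.
avg_ms=$(awk '{ sum += $5 } END { print sum / NR }' "$log")
echo "average latency: ${avg_ms} ms"
rm -f "$log"
```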

sort | uniq -c | sort -nr (The “Poor Man’s MapReduce”)
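
A minimal sketch of the counting pipeline, assuming a file with one client IP per line (in practice you would usually feed it from `awk '{ print $1 }'`). The addresses are invented.

```shell
# Hypothetical list of client IPs, one per request.
ips=$(mktemp)
printf '%s\n' 10.0.0.1 10.0.0.2 10.0.0.1 10.0.0.3 10.0.0.1 10.0.0.2 > "$ips"

# sort groups identical lines, uniq -c counts each group, sort -nr ranks by count.
top_talkers=$(sort "$ips" | uniq -c | sort -nr | head -n 2)
echo "$top_talkers"

# The busiest client is the second column of the first ranked line.
top_ip=$(echo "$top_talkers" | awk 'NR == 1 { print $2 }')
echo "top talker: $top_ip"
rm -f "$ips"
```

The first `sort` matters: `uniq` only collapses adjacent duplicates, so unsorted input would give misleading counts.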


2. The “System Doctor” Toolkit (Resource Diagnosis)

Is the box sick? These commands tell you why.

top / htop (The Dashboard)

- The Failing Diagnosis
- "CPU usage is at 90%, we need to scale up."

+ The Passing Diagnosis
+ "CPU is high, but looking at `%wa` (I/O Wait), the CPU is actually just waiting on a slow disk. We have a storage bottleneck, not a compute bottleneck."
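
To make the I/O wait argument concrete, here is a rough sketch that reads the same kernel counters `top` uses, assuming a Linux host with `/proc/stat`. It reports the average since boot, whereas `top` shows the change between refreshes, so treat the number as a sanity check rather than a live reading.

```shell
if [ -r /proc/stat ]; then
    # Aggregate "cpu" line: $2 user, $3 nice, $4 system, $5 idle, $6 iowait,
    # $7 irq, $8 softirq, $9 steal (all in clock ticks since boot).
    iowait_pct=$(awk '/^cpu / {
        total = 0
        for (i = 2; i <= 9; i++) total += $i
        printf "%.1f", 100 * $6 / total
    }' /proc/stat)
else
    iowait_pct="unavailable"   # not a Linux /proc system
fi
echo "iowait share of CPU time since boot (%): $iowait_pct"
```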

lsof (The Leak Detector)
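
One classic leak `lsof` exposes is a file that was deleted while a process still holds it open, so the space is never freed. The sketch below reproduces that in the current shell, assuming Linux with `/proc`; descriptor 9 and the temp file are arbitrary choices, and the `lsof` call is skipped if the tool is not installed.

```shell
scratch=$(mktemp)
exec 9> "$scratch"     # hold the file open on descriptor 9 (arbitrary choice)
echo "payload" >&9
rm "$scratch"          # the name is gone, but the open descriptor keeps the blocks allocated

# lsof marks such files as "(deleted)" in its NAME column.
if command -v lsof >/dev/null 2>&1; then
    lsof -n -P -p "$$" 2>/dev/null | grep deleted || true
fi

# The same information is visible directly in /proc on Linux.
held_open=$(ls -l "/proc/$$/fd" 2>/dev/null | grep -c '(deleted)' || true)
echo "deleted-but-open files held by this shell: $held_open"

exec 9>&-              # closing the descriptor finally releases the space
```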

df -h vs. df -i (The Silent Killer)

- The Failing Diagnosis
- Runs `df -h`. Sees 50% disk space free. Concludes the disk is fine.

+ The Passing Diagnosis
+ Runs `df -i`. Sees 100% Inode exhaustion because an app wrote 10 million tiny 1KB files. Concludes the disk cannot accept new writes.
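
A short sketch of checking both views side by side, assuming the root filesystem `/` is the one under suspicion. Some filesystems do not report inode counts, in which case the percentage shows as `-`.

```shell
df -h /     # capacity view: how many bytes are used
df -i /     # inode view: how many file entries are used

# -P forces one line per filesystem so the columns are safe to parse.
# Column 5 of the inode report is the inode use percentage.
inode_use=$(df -P -i / | awk 'NR == 2 { print $5 }')
echo "inode use on /: $inode_use"
```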

ps (The Process Investigator)
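
A minimal sketch of one common `ps` investigation, counting zombie processes. It assumes a `ps` that accepts `-eo` with the `pid`, `ppid`, `stat`, and `comm` columns, which the common Linux implementations do.

```shell
# List processes with their parent, state, and command name.
ps -eo pid,ppid,stat,comm | head -n 5

# A state starting with "Z" is a zombie: exited, but not yet reaped by its parent.
zombies=$(ps -eo pid,ppid,stat,comm | awk 'NR > 1 && $3 ~ /^Z/ { n++ } END { print n + 0 }')
echo "zombie processes: $zombies"
```

The `ppid` column is the useful part in practice: a zombie cannot be killed, so the fix is to find and repair (or restart) the parent that is failing to reap it.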


3. The “Network Plumber” Toolkit

Is the pipe broken?

ss or netstat (The Connection Checker)
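
A small sketch of the two questions `ss` usually answers: what is listening, and how many connections are established. It assumes `ss` from iproute2 is installed and skips the check otherwise.

```shell
if command -v ss >/dev/null 2>&1; then
    # -t TCP, -l listening sockets only, -n numeric addresses and ports (no DNS lookups).
    ss -tln | head -n 5

    # -a includes all states; the first column is the state, and ss abbreviates ESTABLISHED to ESTAB.
    established=$(ss -tan | awk 'NR > 1 && $1 == "ESTAB" { n++ } END { print n + 0 }')
else
    echo "ss is not installed on this host"
    established=0
fi
echo "established TCP connections: $established"
```

Swapping the filter to `TIME-WAIT` or `CLOSE-WAIT` gives a quick read on connection churn or on an application that is not closing its sockets.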

curl -vI (The Handshake Inspector)
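
A hedged sketch of reading a `curl -vI` trace, using `https://example.com` purely as a placeholder target. It needs outbound network access and falls back to a message if the request fails.

```shell
url="https://example.com"   # placeholder target, swap in the endpoint under test

if command -v curl >/dev/null 2>&1; then
    # -v writes connection and TLS details to stderr, -I sends a HEAD request,
    # -s hides the progress meter; 2>&1 merges both streams into one trace.
    trace=$(curl -svI --max-time 5 "$url" 2>&1) || trace="curl could not reach $url"
else
    trace="curl is not installed"
fi

# "*" lines are connection notes, ">" lines are request headers, "<" lines are response headers.
echo "$trace" | grep -E '^[*<>]' | head -n 20
```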


4. Bonus: The “Emergency” Loop

Use this in a coding interview when you need to monitor something but don’t have a dashboard.

The “Watch” Command
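
Interactively, `watch -n 1 'df -i /'` reruns a command every second until you press Ctrl-C. The sketch below is a bounded loop that does the same job in a script, assuming inode use on `/` is the metric of interest; the three-sample limit is only there so the example ends.

```shell
samples=0
for _ in 1 2 3; do
    # Timestamp each sample so trends are visible when scrolling back.
    printf '%s ' "$(date +%T)"
    df -P -i / | awk 'NR == 2 { print "inode use:", $5 }'
    samples=$((samples + 1))
    sleep 1
done
echo "collected $samples samples"
```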


🚀 The Execution Gap: Knowing vs. Debugging

Knowing the commands is Step 1.

Knowing when to use them to debug a Kernel Panic, CPU Throttling, or a BGP Route Leak under the pressure of a 45-minute Google interview is Step 2.

Most candidates can list these commands. Few can sequence them into a coherent OODA Loop (Observe, Orient, Decide, Act) during an active incident simulation.

The full Linux Internals Playbook (part of the Complete SRE Career Launchpad) trains this execution muscle. It covers:

👉 Get the Full Linux Internals Playbook Here