"The 20 Commands That Solve 80% of Production Incidents"
In a Google SRE interview (especially Troubleshooting and NALSD), you aren't judged on how many flags you memorize. You are judged on your command fluency and risk awareness.
- The Junior Approach
- Runs `cat /var/log/syslog | grep error` on a 50GB file, streams all 50GB through the pipe, saturates disk I/O, churns the page cache, and destabilizes the production node.
+ The Senior Approach
+ Runs `tail -n 5000 /var/log/syslog | grep -i error` to safely extract recent signals without threatening node stability.
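The senior approach can be sketched on a synthetic log, so it is safe to run anywhere. The `recent_errors` helper name, the 10,000-line file, and the 5,000-line default window are illustrative assumptions, not part of any standard tool.

```sh
# Count error lines in only the last N lines of a log.
# recent_errors is a hypothetical helper name used for illustration.
recent_errors() {
  # $1 = log file, $2 = number of trailing lines to scan (default 5000)
  tail -n "${2:-5000}" "$1" | grep -i -c 'error'
}

# Build a synthetic 10,000-line log: every 1000th line is an error.
log=$(mktemp)
awk 'BEGIN {
  for (i = 1; i <= 10000; i++)
    print (i % 1000 == 0 ? "ERROR disk failure " i : "info ok " i)
}' > "$log"

# Only lines 5001..10000 are scanned, so 5 of the 10 errors are counted.
count=$(recent_errors "$log" 5000)
echo "$count"

rm -f "$log"
```

Bounding the read with `tail -n` keeps the cost proportional to the window you ask for rather than to the size of the file.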
Can you manipulate text streams, diagnose system state, and inspect logs without leaving the terminal? This cheat sheet covers the "Bread and Butter" commands you need to have at your fingertips.
Google SREs live in logs. These commands turn raw text into data.
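To make "raw text into data" concrete, here is a small sketch that runs this section's `awk` and `sort | uniq -c | sort -nr` pipeline against a throwaway file. It assumes the combined log format, where the HTTP status code is field 9; the sample lines and the `status_histogram` name are made up for illustration.

```sh
# Count HTTP status codes in an access log, most frequent first.
# Assumes combined log format, where the status code is field 9.
# status_histogram is a hypothetical helper name.
status_histogram() {
  awk '{print $9}' "$1" | sort | uniq -c | sort -nr
}

# Synthetic sample data: four 200s, two 500s, one 404.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [10/Oct/2024:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326
10.0.0.2 - - [10/Oct/2024:13:55:37 +0000] "GET /api/users HTTP/1.1" 500 512
10.0.0.1 - - [10/Oct/2024:13:55:38 +0000] "GET /about.html HTTP/1.1" 200 1044
10.0.0.3 - - [10/Oct/2024:13:55:39 +0000] "GET /missing HTTP/1.1" 404 162
10.0.0.2 - - [10/Oct/2024:13:55:40 +0000] "POST /api/users HTTP/1.1" 500 512
10.0.0.4 - - [10/Oct/2024:13:55:41 +0000] "GET /index.html HTTP/1.1" 200 2326
10.0.0.5 - - [10/Oct/2024:13:55:42 +0000] "GET /index.html HTTP/1.1" 200 2326
EOF

hist=$(status_histogram "$log")
printf '%s\n' "$hist"

rm -f "$log"
```

The output lists one line per status code with its count in front, ordered 200, 500, 404 for this sample. The exact left padding that `uniq -c` adds varies between implementations.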
tail (The Pulse Check)
`tail -f /var/log/syslog`
`tail -f access.log | grep 500` (Watch only the errors live).

head & tail -n (Safe Inspection)
`head -n 20 large_file.log`

awk (The Column Extractor)
`awk '{print $9}' access.log`
"... `awk` to strip away the timestamps and IPs so I can focus purely on the distribution of HTTP status codes."

sort | uniq -c | sort -nr (The "Poor Man's MapReduce")
`cat access.log | awk '{print $9}' | sort | uniq -c | sort -nr`

Is the box sick? These commands tell you why.
top / htop (The Dashboard)
- The Failing Diagnosis
- "CPU usage is at 90%, we need to scale up."
+ The Passing Diagnosis
+ "CPU is high, but looking at `%wa` (I/O Wait), the CPU is actually just waiting on a slow disk. We have a storage bottleneck, not a compute bottleneck."
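The `%wa` figure that `top` reports can be derived by hand from the aggregate `cpu` line of `/proc/stat`, which helps when explaining the diagnosis. The sketch below computes the I/O wait share between two samples; the `iowait_pct` name and the sample counter values are made up for illustration.

```sh
# Compute the I/O wait percentage between two "cpu" lines from /proc/stat.
# Field order after the "cpu" label (see proc(5)):
#   user nice system idle iowait irq softirq steal
# iowait_pct is a hypothetical helper name.
iowait_pct() {
  # $1 = earlier sample line, $2 = later sample line
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) prev[i] = $i }
    NR == 2 {
      total = 0
      for (i = 2; i <= 9; i++) total += $i - prev[i]
      # Field 6 is iowait; report its share of all ticks in the interval.
      if (total > 0) printf "%.1f\n", 100 * ($6 - prev[6]) / total
    }'
}

# Made-up samples: 1000 ticks elapsed, 650 of them spent in iowait.
before="cpu 1000 0 500 8000 500 0 0 0 0 0"
after="cpu 1100 0 550 8200 1150 0 0 0 0 0"

iowait_pct "$before" "$after"   # prints 65.0
```

On a live Linux host, the two samples could be taken with `grep '^cpu ' /proc/stat` a second apart. The counters are cumulative since boot, which is why the difference between two readings is used rather than a single snapshot.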
lsof (The Leak Detector)
`lsof -p <PID>`
"... `lsof` to see if the process is holding onto deleted files or hanging network sockets."

df -h vs. df -i (The Silent Killer)
- The Failing Diagnosis
- Runs `df -h`. Sees 50% disk space free. Concludes the disk is fine.
+ The Passing Diagnosis
+ Runs `df -i`. Sees 100% Inode exhaustion because an app wrote 10 million tiny 1KB files. Concludes the disk cannot accept new writes.
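The passing diagnosis can be turned into a quick check. The sketch below filters `df -iP`-style output for filesystems at or above an inode-usage threshold; the `inode_alert` name, the 90% threshold, and the sample table are illustrative assumptions, and mount points containing spaces are not handled.

```sh
# Print mount points whose inode usage is at or above a threshold.
# Reads `df -iP`-style output on stdin:
#   Filesystem Inodes IUsed IFree IUse% Mounted on
# inode_alert is a hypothetical helper name.
inode_alert() {
  # $1 = threshold percentage
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # "100%" -> "100"
      # Some filesystems report "-" instead of a percentage; skip those.
      if (use != "-" && use + 0 >= limit) print $6, $5
    }'
}

# Made-up sample output: /var has run out of inodes.
sample='Filesystem      Inodes   IUsed   IFree IUse% Mounted on
/dev/sda1      6553600 6553600       0  100% /var
/dev/sda2      3276800  120000 3156800    4% /
tmpfs           505000       1  504999    1% /dev/shm'

alerts=$(printf '%s\n' "$sample" | inode_alert 90)
printf '%s\n' "$alerts"   # prints: /var 100%
```

Against a real host this would be run as `df -iP | inode_alert 90`.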
ps (The Process Investigator)
`ps aux | grep <process_name>`
`D` state (Uninterruptible Sleep) or `Z` state (Zombie).
"... `kill -9` won't work here. We need to investigate the storage array."

Is the pipe broken?
ss or netstat (The Connection Checker)
`ss -s` (Summary) or `ss -plunt` (Listening ports)
"... `TIME_WAIT`. We might be exhausting our ephemeral port range."

curl -vI (The Handshake Inspector)
`curl -vI https://google.com`
The `-v` (verbose) flag shows the DNS resolution, TLS handshake, and headers; `-I` sends a HEAD request, so the payload is never downloaded.

Use this in a coding interview when you need to monitor something but don't have a dashboard.
The "Watch" Command
`watch -n 1 "ss -s | grep estab"`

Knowing the commands is Step 1.
Knowing when to use them to debug a Kernel Panic, CPU Throttling, or a BGP Route Leak under the pressure of a 45-minute Google interview is Step 2.
Most candidates can list these commands. Few can sequence them into a coherent OODA Loop (Observe, Orient, Decide, Act) during an active incident simulation.
The full Linux Internals Playbook (part of the Complete SRE Career Launchpad) trains this execution muscle. It covers:
`OOMKilled` events using `dmesg` and cgroup metrics.
`strace` to find why a process is hanging on a syscall.
`/proc/sched_debug`.