google-sre-interview-handbook

🧮 NALSD Math Traps: Where Strong Candidates Quietly Fail

“If the math doesn’t work, the design is fiction.”

In Google SRE interviews, Non-Abstract Large System Design (NALSD) is not a creativity exercise. It is a feasibility test.

Many senior candidates produce beautiful architectures that fail immediately once basic math is applied. The interviewer notices. They write it down. The candidate is rejected, completely unaware that their design violated the laws of physics.

This document outlines the most common mathematical traps that cause otherwise strong candidates to fail NALSD rounds.

🚨 Trap #1: The Availability Impossibility

Candidates love to promise “four nines” (99.99%) or better. Interviewers test whether you actually know what that costs.

- The Failing Move (The Dreamer)
- "We’ll achieve 99.99% availability by putting a load balancer in front of 3 instances."

+ The Passing Move (The Realist)
+ "99.99% allows only about 52 minutes of downtime per year. Our cloud provider's SLA is 99.9%, which alone permits about 8.8 hours of downtime per year, so a single-region architecture is mathematically incapable of hitting this SLO. We must go multi-region."

The Hidden Check: Interviewers expect you to mentally combine disk failure rates, node counts, correlated failure domains, and Mean Time To Repair (MTTR). If the stated SLO is mathematically impossible under given constraints, continuing to design is a negative signal.
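The availability arithmetic behind this trap fits in a few lines. The sketch below is a minimal illustration, not an official scoring formula: the helper names are made up for this page, the 99% per-instance figure is a hypothetical, and the parallel formula assumes failures are fully independent, which correlated failure domains routinely violate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored


def downtime_budget_minutes(slo: float) -> float:
    """Allowed downtime per year, in minutes, for an availability SLO in [0, 1]."""
    return (1.0 - slo) * MINUTES_PER_YEAR


def serial(*availabilities: float) -> float:
    """Every component must be up, so availabilities multiply."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Any one component is enough. ASSUMES independent failures."""
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


four_nines_budget = downtime_budget_minutes(0.9999)  # about 52.6 minutes/year

# "Load balancer in front of 3 instances", all inside one 99.9% region.
# The 99% per-instance availability is a hypothetical number.
one_region = serial(0.999, parallel(0.99, 0.99, 0.99))

# Two regions at 99.9% each, if (and only if) they fail independently.
two_regions = parallel(0.999, 0.999)
```

However many instances you add, `one_region` can never exceed the 99.9% of the region it lives in, so it stays below 99.99%. Only a second, independent failure domain moves the number.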

🚨 Trap #2: Ignoring Repair Bandwidth (The RTO Killer)

Candidates assume “auto-scaling” and “replication” happen instantly. They don’t.

- The Failing Move (The Cloud Architect)
- "If the US-East cluster dies, our disaster recovery plan will spin up EU-West and pull the 5PB database backup from cold storage to meet our 4-hour RTO."

+ The Passing Move (The Reliability Architect)
+ "Wait. 5 Petabytes over a dedicated 10Gbps link takes ~46 days to transfer. Moving this data during an outage to meet a 4-hour RTO is physically impossible. We need continuous delta-sync replication."

The Hidden Check: Interviewers are watching for awareness of network bandwidth limits, rebuild amplification, and background vs. foreground traffic collision.
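The transfer-time check is a single division. This is a rough sketch assuming decimal units (1 PB = 10^15 bytes) and a link that runs at full line rate with no protocol overhead, so real transfers will be slower. The function names are illustrative.

```python
PB = 10**15            # decimal petabyte, in bytes
SECONDS_PER_DAY = 86_400


def transfer_days(size_bytes: float, link_bits_per_sec: float) -> float:
    """Best-case days needed to push size_bytes through the link."""
    return size_bytes * 8 / link_bits_per_sec / SECONDS_PER_DAY


def required_gbps(size_bytes: float, window_hours: float) -> float:
    """Sustained link speed (Gbps) needed to move size_bytes inside the window."""
    return size_bytes * 8 / (window_hours * 3600) / 1e9


restore_days = transfer_days(5 * PB, 10e9)  # about 46.3 days on 10 Gbps
needed_gbps = required_gbps(5 * PB, 4)      # about 2,778 Gbps for a 4-hour RTO
```

Running the numbers in the other direction is often the more persuasive move: a 4-hour restore of 5 PB needs roughly 2.8 Tbps of sustained throughput, which makes the case for continuous replication without further argument.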

🚨 Trap #3: The Speed-of-Light Violation

You cannot out-engineer the speed of light.

- The Failing Move (The Academic)
- "We will use synchronous replication across global datacenters (US, EU, APAC) to guarantee zero data loss on every user write."

+ The Passing Move (The Operator)
+ "Global round-trip time (RTT) is roughly 200ms. Synchronous global writes put a hard floor of about 200ms under every write, no matter how fast the rest of the system is. We must trade strong consistency for eventual consistency here."

The Hidden Check: If your design assumes global coordination without calculating the latency penalty of cross-region RTT, it is silently marked down.
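The latency penalty can be bounded from physics alone. The sketch below assumes light in fiber covers roughly 200 km per millisecond (about two thirds of its speed in vacuum). The path lengths are hypothetical, and the function names are made up for this page. Observed RTTs sit above this bound because of routing detours and queuing, which is how a "global" RTT ends up near 200ms.

```python
FIBER_KM_PER_MS = 200.0  # approximate speed of light in fiber


def min_rtt_ms(path_km: float) -> float:
    """Physical lower bound on round-trip time over a fiber path."""
    return 2 * path_km / FIBER_KM_PER_MS


def sync_write_floor_ms(replica_paths_km, acks_required: int) -> float:
    """Lower bound on commit latency when the leader waits for
    acks_required remote acknowledgements."""
    rtts = sorted(min_rtt_ms(d) for d in replica_paths_km)
    return rtts[acks_required - 1]


def max_serialized_writes_per_sec(rtt_ms: float) -> float:
    """Writes to one contended record that must commit one after another."""
    return 1000.0 / rtt_ms


# Hypothetical fiber path lengths from a US leader to EU and APAC replicas.
paths_km = [7_000, 11_000]

wait_for_all = sync_write_floor_ms(paths_km, acks_required=2)  # 110.0 ms
wait_for_one = sync_write_floor_ms(paths_km, acks_required=1)  # 70.0 ms
hot_key_rate = max_serialized_writes_per_sec(200.0)            # 5.0 writes/s
```

The last number is the one candidates tend to miss: at a 200ms RTT, a single hot record that needs globally ordered writes tops out at about five updates per second.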

🚨 Trap #4: The IOPS / Throughput Fantasy

Candidates state massive load numbers without reducing them to per-node constraints.

- The Failing Move (The Scaler)
- "The service handles 100,000 QPS. We'll just route it to our backend database."

+ The Passing Move (The SRE)
+ "100k QPS of 4KB writes means 400 MB/s of throughput and, unbatched, 100k write operations per second. If these are random writes, a standard HDD maxes out at ~150 IOPS, so this workload will instantly saturate the disk queue. We need to batch writes in memory first or provision high-IOPS NVMe."

The Hidden Check: Interviewers look for decomposition: per-node QPS, tail latency amplification, and block size math. Stating big numbers without reduction is a red flag.
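The reduction from a headline QPS number to per-disk constraints looks roughly like this. It is a back-of-the-envelope sketch that assumes one write equals one disk I/O and uses decimal units (1 MB = 1,000 KB). The batch size of 256 is a hypothetical value chosen for illustration.

```python
import math


def throughput_mb_per_s(qps: int, write_kb: int) -> float:
    """Aggregate write bandwidth, treating 1 MB as 1,000 KB."""
    return qps * write_kb / 1000


def disks_needed(iops_required: int, iops_per_disk: int) -> int:
    """Disks required if IOPS is the only bottleneck."""
    return math.ceil(iops_required / iops_per_disk)


def batched_ops_per_sec(qps: int, writes_per_batch: int) -> int:
    """Disk operations per second after coalescing writes into batches."""
    return math.ceil(qps / writes_per_batch)


bandwidth = throughput_mb_per_s(100_000, 4)  # 400.0 MB/s
hdd_count = disks_needed(100_000, 150)       # 667 HDDs for unbatched random writes
batched = batched_ops_per_sec(100_000, 256)  # 391 ops/s at 256 writes per batch
```

Batching shrinks the operation count by more than two orders of magnitude, but the 400 MB/s of bandwidth does not go away. It still has to be spread across enough devices or nodes to absorb it.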

🚨 Trap #5: Infinite Budget Thinking

Candidates assume the solution to every problem is “add a cache” or “add more nodes.”

- The Failing Move (The Spender)
- "To handle the traffic spike, we'll autoscale our Kubernetes cluster from 1,000 to 10,000 nodes."

+ The Passing Move (The Custodian)
+ "Scaling to 10,000 nodes handles the load, but it increases our compute cost by 10x for a transient spike. I'd rather implement rate-limiting and graceful degradation to drop non-critical traffic and protect our budget."

The Hidden Check: NALSD always includes implicit constraints: budget, power, and operational complexity. Strong candidates surface these constraints themselves.
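The trade-off between scaling out and shedding load can be put into numbers too. Every figure below is hypothetical (the per-node capacity, the peak traffic, and the share of traffic treated as non-critical), and the function name is invented for this sketch. The point is the shape of the calculation, not the values.

```python
import math


def nodes_needed(peak_qps: int, qps_per_node: int, shed_fraction: float = 0.0) -> int:
    """Nodes required to serve the traffic that remains after load shedding."""
    served_qps = peak_qps * (1.0 - shed_fraction)
    return math.ceil(served_qps / qps_per_node)


QPS_PER_NODE = 100  # hypothetical per-node capacity

baseline_nodes = nodes_needed(100_000, QPS_PER_NODE)        # 1,000 nodes
spike_nodes = nodes_needed(1_000_000, QPS_PER_NODE)         # 10,000 nodes
degraded_nodes = nodes_needed(1_000_000, QPS_PER_NODE, 0.5)  # 5,000 nodes

cost_multiplier = spike_nodes / baseline_nodes               # 10.0
degraded_multiplier = degraded_nodes / baseline_nodes        # 5.0
```

Under these assumptions, dropping half of the spike as non-critical halves the fleet the spike demands. Whether that is the right call depends on what the dropped traffic is worth, which is the kind of constraint worth asking the interviewer about.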

🎯 The Meta-Signal Interviewers Look For

Interviewers are not testing your mental arithmetic. They are testing whether you:

Notice when math is required.
Pause architecture to validate feasibility.
Reject impossible requirements calmly.
Protect the system from wishful thinking.

Most candidates never stop drawing. The strongest candidates stop early.

🚀 The Execution Gap: Knowing vs. Doing

A quiet truth about Google NALSD rounds: They are often pass/fail on whether you realize the problem is underspecified or unsolvable. That realization is the entire point of the interview.

Knowing these traps intellectually does not prevent failure. Failure happens when:

The math appears mid-discussion.
The interviewer changes a constraint on the fly.
The clock is ticking and you panic.

At that moment, the question is not: “Do you know the formula?” It is: “Do you have the reflex to stop designing and calculate?”

That instinct only forms through repeated simulation.

I built a simulation-based preparation system specifically designed to train this reflex. It includes:

10+ NALSD Scenarios with hidden infeasible constraints designed to trip you up.
Interviewer Scorecards showing exact pass/fail signals.
Real-time mitigation-first design drills.
A structured 30-Day Preparation Blueprint.

👉 Get The Complete Google SRE Interview Career Launchpad (Gumroad)

Use this repository to sharpen your awareness.
Use the full system to build your interview-grade reflexes.
