google-sre-interview-handbook

🏗️ SRE vs. SWE System Design: The NALSD Difference

“Software Engineers design for the happy path.
Site Reliability Engineers design for the hostile path.”

The most common reason experienced Backend Engineers fail the Google SRE Non-Abstract Large System Design (NALSD) round is simple: they design the system like a Software Engineer.

They focus on API schemas, database normalization, and perfect consistency.

Google SRE interviewers don’t care about your API schema. They care about what happens when the network between your API and your database is physically severed by a backhoe.

Here is the exact difference between a “SWE” design and a “Google SRE” design, using a classic interview prompt.


🚨 The Prompt: Design a Global Payment Gateway

Requirements:


❌ The SWE Design (The “No Hire” Approach)

The standard Software Engineer immediately draws a three-tier web architecture.

The Architecture:

  1. Global Load Balancer routing traffic to the closest region.
  2. Stateless API Servers handling the business logic.
  3. A Global Relational Database (e.g., Cloud Spanner or CockroachDB) to ensure ACID transactions and prevent double-charging.

The SWE Narration:

“I’ll use a Global Load Balancer to route users to the nearest API server. The API server will validate the payload and write the transaction to a globally distributed, strongly consistent database like Spanner. Since Spanner handles ACID transactions, we guarantee the user is only charged once.”

🛑 Why the SRE Hiring Committee Rejects This:

This design is functionally correct, but it is operationally fragile.

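  1. The Physics Trap: Synchronous global writes to Spanner require cross-region consensus (Paxos), so every commit waits on at least one network round trip to a quorum of replicas. Across continents, this can add roughly 100-300 ms of latency to every single payment. If the network lags, each request holds its database connection longer, the connection pool on the API servers fills up, and the failure cascades.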
  2. Zero Isolation: If the global database goes down, or if a bad schema migration is pushed, the entire global payment system is offline instantly.
  3. No Backpressure: If a sudden spike of 100,000 QPS hits the API, the API servers will blindly forward that load to the database, potentially crushing the most expensive component in the stack.
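
To put rough numbers on points 1 and 3, Little's Law (average in-flight requests = arrival rate × latency) is a handy back-of-the-envelope tool. The sketch below is illustrative arithmetic only: the QPS and latency figures come from the list above, while `ASSUMED_POOL_SIZE` is a hypothetical number chosen for the example.

```python
def in_flight_requests(qps: float, latency_s: float) -> float:
    """Little's Law: average concurrency = arrival rate * time in system."""
    return qps * latency_s


SPIKE_QPS = 100_000        # spike size from the list above
ASSUMED_POOL_SIZE = 5_000  # hypothetical: total DB connections across the API fleet

for latency_ms in (100, 300):
    needed = in_flight_requests(SPIKE_QPS, latency_ms / 1000)
    print(
        f"{latency_ms} ms per write -> {needed:,.0f} concurrent writes "
        f"({needed / ASSUMED_POOL_SIZE:.0f}x the assumed pool)"
    )
```

Whatever the real pool size is, the shape of the result is the same: required concurrency scales linearly with write latency, so a slower network directly multiplies the number of connections the API tier needs to hold open.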

Verdict: L3 (Junior) / No Hire for Senior SRE.


✅ The SRE Design (The “Strong Hire” Approach)

The Reliability Architect does not trust the network, does not trust the database, and absolutely does not trust a “global” anything.

They design for Scarcity, Isolation, and Asynchrony.

The Architecture:

  1. Regional, Isolated Cells: The architecture is deployed as completely independent “cells” in each region. US-East shares nothing with EU-West.
  2. Idempotency Keys: The client generates a unique UUID for every payment attempt.
  3. Regional Message Queues (Kafka/PubSub): API servers do not talk to the database. They write the payment (with the Idempotency Key) to a highly available, regional append-only queue.
  4. Asynchronous Workers: Background workers pull from the queue, deduplicate using the Idempotency Key (usually via a fast in-region cache such as Redis), and then batch-write to the regional database shard.
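
The dedup step in item 4 can be sketched as follows. This is a minimal in-memory illustration, not a production design: `Payment`, `Worker`, `seen_keys`, and `ledger` are hypothetical names, with a Python set standing in for the cache and a list standing in for the regional database shard.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Payment:
    idempotency_key: str  # client-generated UUID, one per payment attempt
    amount_cents: int


@dataclass
class Worker:
    seen_keys: set = field(default_factory=set)  # stand-in for the dedup cache
    ledger: list = field(default_factory=list)   # stand-in for the regional DB shard

    def process_batch(self, batch: list[Payment]) -> int:
        """Write each unseen payment once; return how many were written."""
        fresh = []
        for payment in batch:
            if payment.idempotency_key in self.seen_keys:
                continue  # duplicate delivery or client retry: skip it
            self.seen_keys.add(payment.idempotency_key)
            fresh.append(payment)
        self.ledger.extend(fresh)  # a single batched write
        return len(fresh)


worker = Worker()
key = str(uuid.uuid4())
first = worker.process_batch([Payment(key, 1999)])
retry = worker.process_batch([Payment(key, 1999)])  # same key delivered again
print(first, retry, len(worker.ledger))
```

The retry writes nothing, so the ledger ends up with one entry. In a real system the cache check and the database write are not atomic, so it is worth also enforcing uniqueness on the key in the database itself and treating the cache as a fast path only.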

The SRE Narration:

“We cannot rely on a single global database for synchronous payments; the latency and blast-radius risks are too high. I will architect this using Regional Isolation.

The client will generate an Idempotency Key. Our API servers will do minimal validation and immediately drop the request into a Regional Kafka Queue. This allows us to return a ‘202 Accepted’ to the user in <20ms.

We will use Asynchronous Workers to process the queue. If the downstream database slows down, the workers will naturally back off, and the queue will act as a shock absorber. We will monitor Queue Depth and Oldest Message Age as our primary SLIs.

If a region fails entirely, the blast radius is contained. We can use DNS to steer new traffic to a healthy region, knowing the failed region’s queue will safely hold the pending payments until recovery.”
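
The two SLIs named in the narration, and the "workers back off" behaviour, can be sketched with an in-memory stand-in for the queue. Everything here is an assumption for illustration: `RegionalQueue` and `next_backoff` are hypothetical names, not Kafka or Pub/Sub APIs, and the backoff constants are arbitrary.

```python
import time
from collections import deque


class RegionalQueue:
    """In-memory stand-in for a regional append-only queue."""

    def __init__(self):
        self._items = deque()  # entries are (enqueue_time, payload)

    def publish(self, payload, now=None):
        enqueued_at = time.monotonic() if now is None else now
        self._items.append((enqueued_at, payload))

    def pull(self, max_items):
        count = min(max_items, len(self._items))
        return [self._items.popleft()[1] for _ in range(count)]

    def depth(self):
        """SLI 1: number of messages waiting to be processed."""
        return len(self._items)

    def oldest_message_age(self, now=None):
        """SLI 2: seconds the head-of-line message has been waiting."""
        if not self._items:
            return 0.0
        now = time.monotonic() if now is None else now
        return now - self._items[0][0]


def next_backoff(current_s, db_healthy, base_s=0.1, cap_s=30.0):
    """Reset the delay when the database is healthy; otherwise double it, up to a cap."""
    return base_s if db_healthy else min(cap_s, current_s * 2)
```

Oldest Message Age is usually the more user-facing of the two signals: depth says how much work is pending, while age says how long a customer has actually been waiting for their payment to settle.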

🟢 Why the SRE Hiring Committee Loves This:

  1. Blast Radius Containment: The “Cell-Based” architecture means a bad deploy in Europe doesn’t take down North America.
  2. Asynchronous Decoupling: The API servers are protected from database latency. The system can absorb traffic spikes, up to the queue's capacity, without dropping requests.
  3. Idempotency by Design: Instead of relying on a slow, expensive global lock to prevent double-charging, the design uses Idempotency Keys, making retries safe and cheap at every layer.

Verdict: L5 / L6 (Senior/Staff) Strong Hire.


🎯 The Core Difference

| Feature | SWE Focus | SRE Focus |
| --- | --- | --- |
| Primary Goal | Feature Completeness | System Survivability |
| Data Flow | Synchronous (RPCs) | Asynchronous (Queues) |
| Failure Handling | try/catch blocks | Circuit Breakers & Shedding |
| Scaling Strategy | Bigger Databases | Sharding & Cell Architecture |
| The “Source of Truth” | The Database Schema | The Telemetry (SLIs/SLOs) |
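
To make the Failure Handling row concrete, here is a minimal circuit breaker sketch. It is one common shape of the pattern, not a reference implementation: the class name, thresholds, and the choice to pass `now` in explicitly (to keep the example deterministic) are all assumptions.

```python
class CircuitBreaker:
    """Opens after N consecutive failures; permits a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, cooldown_s=10.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self, now):
        """Return True if a call to the dependency may be attempted."""
        if self.opened_at is None:
            return True
        # Half-open: once the cooldown has elapsed, let a trial call through.
        return now - self.opened_at >= self.cooldown_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now  # open (or re-open) the circuit
```

Unlike a try/catch block, which handles one failed call at a time, the breaker remembers recent failures and stops sending traffic to a struggling dependency, which gives it room to recover.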

🚀 Stop Designing Like a Developer. Start Architecting Like an SRE.

If you draw a “Global Database” in a Google NALSD round without immediately calculating the cross-region latency penalty, the interview is effectively over.
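
That latency calculation can start from physics alone. The sketch below computes a lower bound on one round trip over fiber; the 6,000 km path length is a hypothetical figure for two distant regions, and the 2/3 factor is the usual approximation for the speed of light in glass.

```python
SPEED_OF_LIGHT_KM_S = 299_792  # in vacuum
FIBER_FACTOR = 2 / 3           # light in optical fiber travels at roughly 2/3 c


def min_round_trip_ms(distance_km: float) -> float:
    """Lower bound on one network round trip over fiber of the given length."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * FIBER_FACTOR)
    return 2 * one_way_s * 1000


ASSUMED_DISTANCE_KM = 6_000  # hypothetical fiber path between two regions
rtt = min_round_trip_ms(ASSUMED_DISTANCE_KM)
print(f"{rtt:.0f} ms per round trip, before routing detours or queuing")
```

Under these assumptions the floor is about 60 ms per round trip. A synchronous cross-region commit needs at least one such round trip to reach a quorum, and real paths are longer than the straight-line distance, so the observed penalty will be higher.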

You must train yourself to identify Blast Radiuses, Bottlenecks, and Asynchronous Boundaries before you draw a single box.

I built a complete Simulation-Based Training System designed to break your SWE habits and build your SRE reflexes.

👉 The Complete Google SRE Interview Career Launchpad (Gumroad)

What is inside the full system:

Read the frameworks here for free. Use the bundle to simulate the pressure.