google-sre-interview-handbook

🚨 Incident Playbook: The Silent TLS Expiry Cascade

“A manual certificate renewal is not a fix. It is a scheduled outage waiting to happen.”

In a Senior SRE troubleshooting interview, the prompt will sound deceptively simple: “At 00:00 UTC, traffic to our main API dropped by 90%. CPU and memory are completely idle. Go.”

If you start with the database, the queues, or the application code, you are wasting time. Idle CPU and memory mean the requests are not reaching the application at all: the connection is being severed at the network edge.


🛑 The “Leaf vs. Chain” Trap

When candidates finally suspect a certificate issue, they make a fatal diagnostic mistake that reveals a lack of deep network experience.

- The Failing Move (The "Frontend Dev")
- "I will check if our SSL certificate expired today by looking at our internal dashboard or checking the expiry date of the domain's primary cert."

+ The Passing Move (The "Reliability Architect")
+ "I will immediately test the TLS handshake from an external network using `openssl`. I am not just checking if the leaf certificate expired; I am checking if an Intermediate CA in the chain was rotated or expired, which causes clients to silently drop the connection."

Internal metrics tell you nothing if external clients no longer trust your certificate chain. You must test from the outside in.
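As a rough illustration of testing from the outside in, the sketch below performs a fully verified TLS handshake using Python's standard `ssl` module and translates the OpenSSL verification code into a hint. It is a minimal sketch, not a replacement for `openssl`: the function names `probe_handshake` and `explain_verify_code` are hypothetical, the hint wording is my own reading of the OpenSSL codes, and the result reflects the trust store of the machine running it, so it should run from a host outside your own network.

```python
import socket
import ssl

# A few OpenSSL X.509 verification codes and what they usually suggest.
# The hint text is an interpretation, not OpenSSL's own message.
VERIFY_HINTS = {
    9: "a certificate in the chain is not yet valid",
    10: "a certificate in the chain has expired (leaf or intermediate)",
    18: "the server presented a self-signed leaf certificate",
    19: "the chain contains a self-signed certificate this client does not trust",
    20: "an issuer is missing: the server may not be sending its intermediate",
    21: "the leaf cannot be verified: the served chain looks incomplete",
    62: "the certificate does not cover the requested hostname",
}


def explain_verify_code(code: int) -> str:
    """Map an OpenSSL verify code to a human-readable hint."""
    return VERIFY_HINTS.get(code, f"unrecognized verification failure (code {code})")


def probe_handshake(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Attempt a verified handshake the way an external client would."""
    context = ssl.create_default_context()  # system trust store, hostname checks on
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return f"OK: {tls.version()} handshake verified"
    except ssl.SSLCertVerificationError as exc:
        # Must be caught before OSError, which is one of its base classes.
        return f"FAIL: {explain_verify_code(exc.verify_code)} [{exc.verify_message}]"
    except OSError as exc:
        return f"FAIL: could not complete a connection: {exc}"
```

Calling `probe_handshake("api.example.com")` (a placeholder hostname) returns a single status line. A code 10 failure while the leaf's own expiry date is still in the future points at an intermediate rather than the leaf.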


🔍 1. Symptoms & Initial Triage

🛠️ 2. First 5 Commands (Localization)

You must bypass the application and ask the network layer what certificate it is actually serving.

1. The Ultimate Source of Truth (Inspect the Handshake)

2. The Quick HTTP Check

3. Check Local Disk Certificates (If you are on the LB/Proxy)

4. Check Load Balancer Logs
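Steps 1 and 3 above both come down to reading the expiry date of every certificate in the chain, not just the first one. The sketch below assumes you have already captured one `notAfter=...` line per certificate, leaf first, for example from `openssl x509 -noout -enddate -in <file>` run against each certificate in the bundle. The helper names `parse_not_after` and `expired_positions` are hypothetical, and labelling every non-leaf entry as an intermediate is a simplification.

```python
import ssl
from datetime import datetime, timezone


def parse_not_after(line: str) -> datetime:
    """Parse a line such as 'notAfter=Jan  5 09:34:43 2018 GMT' into an aware UTC datetime."""
    _, _, value = line.strip().partition("=")
    # ssl.cert_time_to_seconds understands the certificate date format that OpenSSL prints.
    return datetime.fromtimestamp(ssl.cert_time_to_seconds(value), tz=timezone.utc)


def expired_positions(enddate_lines, now):
    """Return (index, label, not_after) for each expired certificate; index 0 is the leaf."""
    expired = []
    for index, line in enumerate(enddate_lines):
        not_after = parse_not_after(line)
        if not_after <= now:
            label = "leaf" if index == 0 else f"intermediate #{index}"
            expired.append((index, label, not_after))
    return expired
```

A typical call would be `expired_positions(lines, datetime.now(timezone.utc))`. An empty result with a handshake that still fails suggests the problem is a missing or untrusted issuer rather than an expiry.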

🛡️ 3. Mitigation Sequence (Stabilize)

You need to restore trust immediately.

  1. Route Traffic Away: If this is a regional rotation failure, immediately use DNS or Global Load Balancing to steer traffic to a healthy PoP (Point of Presence) that has a valid certificate.
  2. Emergency Manual Push: If the automation failed, manually provision a new certificate (or roll back to the previous one if a bad cert was deployed) and force a reload of the edge proxies so they actually serve it.
  3. Degrade mTLS (Internal Only): If an internal mTLS CA expired and the entire mesh is down, temporarily configure the service mesh to “Permissive Mode” (allow plaintext) to restore the data plane, provided your security policy permits this as a SEV-1 mitigation.

🔬 4. Root Cause Investigation

Why did the certificate expire in the first place?

🧱 5. Prevention (The Senior Signal)

To score “Exceptional” (L5/L6), you must prove you will never let a human track an expiration date again.
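One way to take the human out of the loop is a synthetic prober that connects like a real client and alerts on days remaining. The sketch below is a minimal version under stated assumptions: the 30-day and 7-day thresholds, the `ok`/`warn`/`page` levels, and the function names are all hypothetical choices, not a recommended policy. Note that `getpeercert()` exposes only the leaf's dates, so an expiring intermediate shows up here only once the handshake itself starts failing.

```python
import socket
import ssl
import time

WARN_DAYS = 30  # assumption: open a ticket below this many days
PAGE_DAYS = 7   # assumption: page the on-call below this many days


def days_remaining(not_after: str, now_epoch: float) -> float:
    """Days between now and a certificate 'notAfter' string."""
    return (ssl.cert_time_to_seconds(not_after) - now_epoch) / 86400


def severity(days_left: float) -> str:
    """Turn days remaining into an alert level."""
    if days_left < PAGE_DAYS:
        return "page"
    if days_left < WARN_DAYS:
        return "warn"
    return "ok"


def probe(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Handshake like an external client and grade the leaf's remaining lifetime."""
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                not_after = tls.getpeercert()["notAfter"]
    except OSError:
        # Covers ssl.SSLError too: a failed handshake is what users would see.
        return "page"
    return severity(days_remaining(not_after, time.time()))
```

Run on a schedule from several external locations, the return value of `probe` can feed whatever alerting pipeline you already have.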


🚀 The “Execution Sequencing” Gap

In an interview, identifying an expired certificate is easy.

Knowing how to verify the intermediate chain, why the proxy needs a reload, and how to architect synthetic probers is what separates the hires from the rejects.

Google SRE interviews test your Execution Sequencing under pressure. If your sequence is wrong, your technical knowledge won’t save you.

I built The Complete Google SRE Career Launchpad to simulate these exact, high-stakes infrastructure failures.

👉 Get The Complete Google SRE Interview Career Launchpad (Gumroad)

The Full Training System Includes:

Don’t let an expired certificate freeze your interview. Train your reflexes.