Availability and Recovery: When It Breaks, How Fast Do You Come Back?

The final part of the full-stack series. Twelve layers of building it well, and here’s the humbling truth they all lead to: everything fails eventually. Disks die, regions go dark, a deploy goes bad, someone fat-fingers a DELETE. Mature engineering isn’t pretending failure won’t happen — it’s deciding, in advance, how the system behaves when it does. Two halves: availability (staying up despite failures) and recovery (getting back fast when you can’t).

Backups: the floor under everything

A backup is the difference between an incident and a catastrophe. But the word means nothing until you can answer three questions:

Question	Bad answer	Good answer
How often?	“whenever I remember”	automated, scheduled, frequent
Where?	“same server as the data”	off-site, separate provider/region
Tested?	“they should work”	restored on a schedule, verified

That third row is the one everyone skips and everyone regrets. A backup you have never restored is not a backup — it’s a hope. Corrupt dumps, missing tables, an un-runnable format: you find out at the worst possible moment. So test the restore on a cadence:

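# Don't just back up: prove the backup restores into a scratch database
set -o pipefail                                   # bash: a failed mysqldump must fail the whole pipeline
mysqldump --single-transaction app_prod | gzip > backup.sql.gz   # consistent snapshot for InnoDB tables
mysql -e 'CREATE DATABASE IF NOT EXISTS restore_test'
gunzip < backup.sql.gz | mysql restore_test   # if this fails, you have no backup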

Off-site matters too: a backup on the same server as the database dies with the server. Different provider or region, always.
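Before the full restore test, a cheap pre-check catches the most obvious breakage (an empty file, a truncated upload, a dump that stopped halfway). A minimal sketch, assuming the dump came from mysqldump with its default comments enabled, in which case the last line starts with "-- Dump completed"; the helper name dump_looks_complete is hypothetical. It complements the restore test, it doesn’t replace it.

```shell
# Cheap pre-check for a gzipped mysqldump file (hypothetical helper).
# Passing this does NOT prove the backup restores; only a real restore does.
dump_looks_complete() {
    local file="$1"
    # 1. The file must exist and be non-empty.
    [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
    # 2. The gzip stream must be intact (catches truncated copies).
    gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
    # 3. Assumption: mysqldump comments are on, so the final line is its completion marker.
    gunzip -c "$file" | tail -n 1 | grep -q '^-- Dump completed' \
        || { echo "dump ended early: $file" >&2; return 1; }
}
```

Run it as `dump_looks_complete backup.sql.gz` right after the dump and again on the off-site copy; a non-zero exit is a reason to alert.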

RTO and RPO: name your tolerance

Two numbers turn “we should be reliable” into a concrete, fundable target:

RTO (Recovery Time Objective): how long can you be down? Minutes? Hours?
RPO (Recovery Point Objective): how much data can you lose? The last hour? The last five minutes?

These drive every other decision and every dollar. An RPO of five minutes means continuous replication, not a nightly dump. An RTO of two minutes means automated failover, not a human paged at 3am. Be honest: tighter numbers cost real money, so set them per the actual stakes — a side project and a payments platform deserve very different answers.
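To make the RPO arithmetic concrete, here is a small sketch. It assumes scheduled, snapshot-style backups, where the worst case is a failure just before the next backup finishes: you lose everything since the previous one started, so roughly the interval plus the backup duration. The function names are hypothetical and all values are in seconds.

```shell
# Worst-case data loss for scheduled backups: a failure just before the next
# backup finishes loses everything since the previous backup started.
worst_case_loss_seconds() {
    local interval="$1" duration="$2"
    echo $(( interval + duration ))
}

# Prints "ok" if the schedule fits the RPO, otherwise "too slow".
# Usage: check_rpo <interval_seconds> <duration_seconds> <rpo_seconds>
check_rpo() {
    local interval="$1" duration="$2" rpo="$3"
    if [ "$(worst_case_loss_seconds "$interval" "$duration")" -le "$rpo" ]; then
        echo "ok"
    else
        echo "too slow"
    fi
}
```

A nightly dump that takes ten minutes against a five-minute RPO, `check_rpo 86400 600 300`, prints "too slow"; a backup every four minutes that takes thirty seconds, `check_rpo 240 30 300`, prints "ok".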

Availability: redundancy and graceful degradation

Staying up is mostly about removing single points of failure — the stateless, load-balanced design from the scaling post is also your availability story, because any one server can die and the rest carry on. The other half is graceful degradation: when a dependency fails, lose a feature, not the whole app.

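// Search is down? Degrade to a basic query instead of returning a 500 for the whole page.
try {
    $results = $elasticClient->search($query);
} catch (ConnectionException $e) {               // whichever connection exception your search client throws
    report($e);                                  // logged for the on-call
    $like = '%' . addcslashes($query, '%_\\') . '%';   // escape LIKE wildcards in user input
    $results = Post::where('title', 'like', $like)->limit(20)->get();
}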

The non-critical queue work from earlier pays off here too: if email is down, jobs wait and retry instead of taking checkout down with them. Isolate failures so one broken dependency is a missing feature, not an outage.
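Inside the app, wait-and-retry is the queue worker’s job. For a standalone script, such as a cron task that calls a flaky dependency, the same idea can be sketched in shell. This is an illustrative helper, not part of any queue system; the name retry and its arguments are assumptions.

```shell
# Retry a command with exponential backoff (hypothetical helper).
# Usage: retry <max_attempts> <initial_delay_seconds> <command> [args...]
retry() {
    local max="$1" delay="$2" attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        sleep "$delay"
        delay=$(( delay * 2 ))          # back off: wait twice as long next time
        attempt=$(( attempt + 1 ))
    done
}
```

Something like `retry 5 2 curl -fsS https://example.com/health` would try up to five times, waiting 2, 4, 8, then 16 seconds between attempts (the URL is a placeholder). The caller decides what a final failure means; for non-critical work, that is usually "log it and move on".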

Have a runbook before you need it

At 3am, mid-incident, is the worst time to think. Write the plan while calm: how to restore from backup, how to roll back a deploy, who to call, where the logs and dashboards are. A checklist beats panicked improvisation every time. And afterward, a blameless post-mortem — what failed, why, what guardrail stops a repeat — turns one outage into permanent resilience instead of a recurring nightmare.

Caveats and best practices

Test restores on a schedule, and rehearse the failover. Game-day drills surface the gap between “we have a plan” and “the plan works” while it’s cheap to find out.
Automate rollback. The atomic releases from the deploy post mean a bad release reverts in seconds — your fastest recovery path for the most common cause of downtime (a bad deploy).
Monitor the backups themselves. A silently failing backup job is the cruelest failure: you think you’re covered right up until you need it. Alert on a missed or shrinking backup.
Right-size the investment. Multi-region active-active is overkill for most apps and a great way to over-engineer. Tested off-site backups, fast rollback, and a runbook together cover the overwhelming majority of real incidents.
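The "missed or shrinking backup" alert from the list above can be as small as this. A minimal sketch: the function name backup_status and the 10% shrink threshold are assumptions to tune for your own data, and the inputs (age and sizes) are passed in rather than read from disk so the logic stays easy to test.

```shell
# Classify the newest backup as "missed", "shrinking", or "ok" (hypothetical helper).
# Usage: backup_status <age_seconds> <max_age_seconds> <size_bytes> <previous_size_bytes>
backup_status() {
    local age="$1" max_age="$2" size="$3" prev="$4"
    if [ "$age" -gt "$max_age" ]; then
        echo "missed"            # no fresh backup within the expected window
    elif [ $(( size * 100 )) -lt $(( prev * 90 )) ]; then
        echo "shrinking"         # more than 10% smaller than the previous backup
    else
        echo "ok"
    fi
}
```

In a scheduled job, the age and sizes would come from the newest backup file’s modification time and byte count, and anything other than "ok" would go to whatever alerting you already use.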

Conclusion

Backups → automated, off-site, RESTORE-TESTED (or it's just hope)
Targets → RTO (downtime tolerance) + RPO (data-loss tolerance) drive spend
Stay up → redundancy + graceful degradation (lose a feature, not the app)
Recover → runbook, automated rollback, blameless post-mortem

And that closes the series. Thirteen layers — from the frontend pixel the user taps to the off-site backup that saves you when it all goes wrong. The vibe-coded demo was three of these; a product people can depend on is all thirteen. You don’t add them all on day one — you add each layer when the risk it covers becomes real. That judgment, layer by layer, is most of what production-ready actually means.