Availability and Recovery: When It Breaks, How Fast Do You Come Back?
Jerwin Arnado
The final part of the full-stack series. Twelve layers
of building it well, and here’s the humbling truth they all lead to: everything fails
eventually. Disks die, regions go dark, a deploy goes bad, someone fat-fingers a DELETE.
Mature engineering isn’t pretending failure won’t happen — it’s deciding, in advance, how the
system behaves when it does. Two halves: availability (staying up despite failures) and
recovery (getting back fast when you can’t).
Backups: the floor under everything
A backup is the difference between an incident and a catastrophe. But the word means nothing until you can answer three questions:
| Question | Bad answer | Good answer |
|---|---|---|
| How often? | “whenever I remember” | automated, scheduled, frequent |
| Where? | “same server as the data” | off-site, separate provider/region |
| Tested? | “they should work” | restored on a schedule, verified |
That third row is the one everyone skips and everyone regrets. A backup you have never restored is not a backup — it’s a hope. Corrupt dumps, missing tables, an un-runnable format: you find out at the worst possible moment. So test the restore on a cadence:
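```shell
# Don't just back up: prove the backup restores into a scratch database
set -o pipefail  # otherwise a failed mysqldump hides behind gzip's exit code
mysqldump app_prod | gzip > backup.sql.gz
mysql -e 'CREATE DATABASE IF NOT EXISTS restore_test'  # scratch target must exist
gunzip < backup.sql.gz | mysql restore_test  # if this fails, you have no backup
```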
Off-site matters too: a backup on the same server as the database dies with the server. Different provider or region, always.
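A full restore is the real test, but a cheap sanity check can run after every single dump. The sketch below assumes a gzip-compressed file written by mysqldump with default options, which ends with a `-- Dump completed` comment (the marker is absent if the dump was made with `--skip-comments`). The function name `dump_looks_complete` is made up for illustration.

```shell
#!/usr/bin/env bash
# Cheap pre-check before the full restore test: is the archive intact,
# and did mysqldump get as far as writing its closing comment?
dump_looks_complete() {
  local file="$1"
  # gzip -t verifies the compressed stream without writing any output.
  gzip -t "$file" 2>/dev/null || return 1
  # A truncated dump will be missing the trailing "Dump completed" marker.
  gunzip -c "$file" | tail -n 5 | grep -q 'Dump completed'
}
```

Passing this check only means the file is intact and the dump ran to the end. It says nothing about whether the data actually restores, so it complements the restore test rather than replacing it.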
RTO and RPO: name your tolerance
Two numbers turn “we should be reliable” into a concrete, fundable target:
- RTO (Recovery Time Objective): how long can you be down? Minutes? Hours?
- RPO (Recovery Point Objective): how much data can you lose? The last hour? The last five minutes?
These drive every other decision and every dollar. An RPO of five minutes means continuous replication, not a nightly dump. An RTO of two minutes means automated failover, not a human paged at 3am. Be honest: tighter numbers cost real money, so set them per the actual stakes — a side project and a payments platform deserve very different answers.
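The arithmetic behind those two numbers is simple enough to script. A rough sketch, with hypothetical function names and made-up figures: worst-case data loss is about one backup interval (the failure lands just before the next backup), and recovery time is the sum of detecting, restoring, and verifying, not the restore alone.

```shell
#!/usr/bin/env bash
# RPO: a periodic backup can lose up to one full interval of data.
meets_rpo() {
  local interval_min="$1" rpo_min="$2"
  [ "$interval_min" -le "$rpo_min" ]
}

# RTO: add up every step between "it broke" and "it is serving again".
meets_rto() {
  local detect_min="$1" restore_min="$2" verify_min="$3" rto_min="$4"
  [ $((detect_min + restore_min + verify_min)) -le "$rto_min" ]
}

# Illustrative numbers only: a nightly dump (1440 min) against a 5-minute RPO,
# and a 10 + 45 + 5 minute recovery against a 2-hour RTO.
meets_rpo 1440 5 || echo "nightly dump misses a 5-minute RPO"
meets_rto 10 45 5 120 && echo "estimated 60 min recovery fits a 2-hour RTO"
```

Plugging in your own measured restore time is the useful part: most teams have never timed one until a game-day drill forces the issue.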
Availability: redundancy and graceful degradation
Staying up is mostly about removing single points of failure — the stateless, load-balanced design from the scaling post is also your availability story, because any one server can die and the rest carry on. The other half is graceful degradation: when a dependency fails, lose a feature, not the whole app.
```php
// Search is down? Degrade to a basic query; don't 500 the whole page.
try {
    $results = $elasticClient->search($query);
} catch (ConnectionException $e) {
    report($e); // logged for the on-call
    $results = Post::where('title', 'like', "%{$query}%")->limit(20)->get();
}
```
The non-critical queue work from earlier pays off here too: if email is down, jobs wait and retry instead of taking checkout down with them. Isolate failures so one broken dependency is a missing feature, not an outage.
Have a runbook before you need it
3am, mid-incident, is the worst time to think. Write the plan while calm: how to restore from backup, how to roll back a deploy, who to call, where the logs and dashboards are. A checklist beats panicked improvisation every time. And afterward, a blameless post-mortem (what failed, why, what guardrail stops a repeat) turns one outage into lasting resilience instead of a recurring nightmare.
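A runbook entry is most useful when it is a command someone can paste. As a sketch of the "roll back a deploy" step, assume a hypothetical layout where each release lives in `releases/<timestamp>/` and a `current` symlink points at the live one. The function name and paths are illustrative, not taken from the deploy post.

```shell
#!/usr/bin/env bash
# Repoint the "current" symlink at the release just before the live one.
rollback_release() {
  local app_dir="$1" current previous
  current=$(basename "$(readlink "$app_dir/current")")
  # Release names are timestamps, so a lexical sort is chronological.
  # awk prints the entry that immediately precedes the live release.
  previous=$(ls -1 "$app_dir/releases" | sort |
    awk -v cur="$current" '$0 == cur { print prev } { prev = $0 }')
  if [ -z "$previous" ]; then
    echo "no earlier release to roll back to" >&2
    return 1
  fi
  # -n treats the existing symlink as a file instead of following it.
  ln -sfn "releases/$previous" "$app_dir/current"
  echo "rolled back: $current -> $previous"
}
```

Called as `rollback_release /var/www/app` (path hypothetical), it refuses to act when the live release is already the oldest one, which is the kind of guardrail you want in something run under stress.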
Caveats and best practices
- Test restores on a schedule, and rehearse the failover. Game-day drills surface the gap between “we have a plan” and “the plan works” while it’s cheap to find out.
- Automate rollback. The atomic releases from the deploy post mean a bad release reverts in seconds — your fastest recovery path for the most common cause of downtime (a bad deploy).
- Monitor the backups themselves. A silently failing backup job is the cruelest failure: you think you’re covered right up until you need it. Alert on a missed or shrinking backup.
- Right-size the investment. Multi-region active-active is overkill for most apps and a great way to over-engineer. Tested off-site backups, fast rollback, and a runbook together cover the overwhelming majority of real incidents.
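The "missed or shrinking backup" alert can be as small as a cron-driven check. A minimal sketch, assuming the previous backup's size is recorded somewhere and passed in; the function name `check_backup` and the 50% shrink threshold are arbitrary choices for illustration.

```shell
#!/usr/bin/env bash
# Usage: check_backup <file> <max_age_minutes> <previous_size_bytes>
check_backup() {
  local file="$1" max_age_min="$2" prev_bytes="$3" bytes
  # Missed: no file at all, or nothing written inside the expected window.
  if [ ! -f "$file" ] || [ -z "$(find "$file" -mmin "-$max_age_min")" ]; then
    echo "ALERT: backup missing or older than ${max_age_min} minutes"
    return 1
  fi
  # Arithmetic expansion strips the padding some wc implementations add.
  bytes=$(($(wc -c < "$file")))
  # Shrinking: a dump under half the previous size often means a partial export.
  if [ "$bytes" -lt $((prev_bytes / 2)) ]; then
    echo "ALERT: backup shrank from ${prev_bytes} to ${bytes} bytes"
    return 1
  fi
  echo "OK: ${bytes} bytes"
}
```

Wire the non-zero exit code into whatever already pages you. The point is that silence from the backup job gets treated as a failure, not as good news.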
Conclusion
- Backups → automated, off-site, RESTORE-TESTED (or it's just hope)
- Targets → RTO (downtime tolerance) + RPO (data-loss tolerance) drive spend
- Stay up → redundancy + graceful degradation (lose a feature, not the app)
- Recover → runbook, automated rollback, blameless post-mortem
And that closes the series. Thirteen layers — from the frontend pixel the user taps to the off-site backup that saves you when it all goes wrong. The vibe-coded demo was three of these; a product people can depend on is all thirteen. You don’t add them all on day one — you add each layer when the risk it covers becomes real. That judgment, layer by layer, is most of what production-ready actually means.