Error Tracking and Logs: Knowing Before Your Users Tell You

· Jerwin Arnado ·

Part twelve of the full-stack series. Your app is now scaled across several servers, which means that when something breaks, “SSH in and look” no longer works: which server would you even SSH into? This layer is how you see a running system. The bar is simple and unforgiving: you should find out about a problem before your users do. If your incident alerts are angry tweets, you’re flying blind.

The three pillars of observability

Observability is knowing what a system is doing from the outside. Three complementary signals:

Pillar  | Answers                                | Tool examples
Logs    | “what happened, in order?”             | structured logs → Loki, CloudWatch, Papertrail
Errors  | “what’s broken right now, and where?”  | Sentry, Bugsnag, Flare
Metrics | “how is it behaving over time?”        | Prometheus + Grafana, the host’s dashboards

You want all three. Logs are the detailed flight recorder; error tracking is the smoke alarm that pages you; metrics are the cockpit dashboard showing trends before they become fires.

Logs: structured, or barely useful

A log is a timestamped record of an event. The upgrade that changes everything is making them structured (JSON with fields) instead of free-text strings — so you can query them (“all errors for user 4567 in the last hour”) instead of grepping prose:

// Free text — human-readable, machine-hostile
Log::error("Payment failed for user {$user->id}: {$e->getMessage()}");

// Structured — queryable, filterable, aggregatable
Log::error('payment.failed', [
    'user_id' => $user->id,
    'amount'  => $amount,
    'gateway' => 'paymongo',
    'error'   => $e->getMessage(),
]);
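
To make “queryable” concrete, here is a minimal sketch in plain PHP. It assumes logs are written as one JSON object per line, with hypothetical `timestamp`, `level`, and `context.user_id` fields; the real shape depends on your logger's formatter, and in practice the aggregator runs this kind of filter for you.

```php
<?php
declare(strict_types=1);

/**
 * Return decoded log entries at $level for $userId, logged at or after $since.
 * Field names (timestamp, level, context.user_id) are illustrative assumptions.
 *
 * @param string[] $lines one JSON document per line
 * @return array<int, array<string, mixed>>
 */
function findUserLogs(array $lines, string $level, int $userId, DateTimeImmutable $since): array
{
    $matches = [];
    foreach ($lines as $line) {
        $entry = json_decode($line, true);
        if (!is_array($entry)) {
            continue; // skip blank or malformed lines
        }
        $loggedAt = strtotime((string) ($entry['timestamp'] ?? ''));
        if ($loggedAt !== false
            && $loggedAt >= $since->getTimestamp()
            && ($entry['level'] ?? null) === $level
            && ($entry['context']['user_id'] ?? null) === $userId) {
            $matches[] = $entry;
        }
    }
    return $matches;
}

$lines = [
    '{"timestamp":"2024-05-01T14:05:00+00:00","level":"error","message":"payment.failed","context":{"user_id":4567,"gateway":"paymongo"}}',
    '{"timestamp":"2024-05-01T14:06:00+00:00","level":"info","message":"payment.succeeded","context":{"user_id":4567}}',
    '{"timestamp":"2024-05-01T12:00:00+00:00","level":"error","message":"payment.failed","context":{"user_id":4567}}',
    '{"timestamp":"2024-05-01T14:07:00+00:00","level":"error","message":"payment.failed","context":{"user_id":42}}',
];
$hits = findUserLogs($lines, 'error', 4567, new DateTimeImmutable('2024-05-01T13:30:00+00:00'));
echo count($hits), " matching entries\n";
```

The same question asked of the free-text version would need a regular expression per message format, which is the practical argument for fields over prose.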

Across many servers, ship logs to a central aggregator — chasing a request across five boxes by hand is hopeless. And never log secrets: passwords, tokens, full card numbers. Logs are lower-trust than your database and often flow to third parties; a logged credential is a leaked one.
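
One way to enforce the “never log secrets” rule is to scrub the context array before it reaches the logger. The sketch below is plain PHP; the function name `redactContext` and the list of sensitive keys are assumptions for illustration, not part of any framework.

```php
<?php
declare(strict_types=1);

/**
 * Recursively replace values whose key looks sensitive before the context is logged.
 * The default key list is an illustrative assumption; extend it for your own payloads.
 *
 * @param array<int|string, mixed> $context
 * @param string[] $sensitiveKeys lowercase key names to mask
 * @return array<int|string, mixed>
 */
function redactContext(array $context, array $sensitiveKeys = ['password', 'token', 'card_number', 'authorization']): array
{
    $redacted = [];
    foreach ($context as $key => $value) {
        if (is_string($key) && in_array(strtolower($key), $sensitiveKeys, true)) {
            $redacted[$key] = '[REDACTED]';
        } elseif (is_array($value)) {
            $redacted[$key] = redactContext($value, $sensitiveKeys); // descend into nested payloads
        } else {
            $redacted[$key] = $value;
        }
    }
    return $redacted;
}

$safe = redactContext([
    'user_id'  => 4567,
    'password' => 'hunter2',
    'payment'  => ['card_number' => '4111111111111111', 'amount' => 1500],
]);
echo json_encode($safe), "\n";
```

Key-based masking only catches secrets stored under predictable names, so it complements the habit of not passing them to the logger in the first place. A log processor is one place a filter like this might live.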

Error tracking: the alarm that finds you

Logs are passive — they sit there until you look. Error tracking is active: it catches unhandled exceptions, groups duplicates, captures the stack trace plus context (user, request, release), and alerts you. The difference between “47 users hit this NullPointer since the 2pm deploy” landing in Slack versus discovering it next week in a log file is the difference between a five-minute fix and a reputation hit.

// One bootstrap call; every unhandled exception now reports with context
\Sentry\init(['dsn' => env('SENTRY_DSN'), 'environment' => env('APP_ENV')]);
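
“Groups duplicates” is what turns thousands of events into a handful of issues. The sketch below shows the idea in plain PHP with a deliberately simple fingerprint (exception class, file, line). Real trackers use richer heuristics, and the function names here are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Build a grouping key from the exception class and the place it was thrown.
 * The message is left out on purpose: it often contains per-user data.
 */
function fingerprint(Throwable $e): string
{
    return sha1(get_class($e) . '|' . basename($e->getFile()) . '|' . $e->getLine());
}

/**
 * Count occurrences per fingerprint.
 *
 * @param Throwable[] $errors
 * @return array<string, int>
 */
function groupErrors(array $errors): array
{
    $groups = [];
    foreach ($errors as $e) {
        $key = fingerprint($e);
        $groups[$key] = ($groups[$key] ?? 0) + 1;
    }
    return $groups;
}

function failingCheckout(int $userId): Throwable
{
    // Same class, same line: every call yields the same fingerprint.
    return new RuntimeException("Payment failed for user {$userId}");
}

$errors = [failingCheckout(1), failingCheckout(2), failingCheckout(3), new InvalidArgumentException('bad input')];
$groups = groupErrors($errors);
echo count($groups), " distinct issues from ", count($errors), " events\n";
```

Three failures with different user IDs collapse into one issue with a count of three, which is the number you want in the alert.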

Tag releases so an error spike maps to the deploy that caused it — that’s often the entire investigation, and it pairs with git bisect when it isn’t.
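
A minimal sketch of release tagging, assuming your deploy script exports a hypothetical `APP_RELEASE` variable (for example the commit SHA being shipped). The helper `sentryOptions` is invented for illustration; `release` and `environment` are option names the Sentry SDK accepts.

```php
<?php
declare(strict_types=1);

/**
 * Build the options array for \Sentry\init(), adding a release identifier when one is set.
 * APP_RELEASE is a hypothetical variable set at deploy time.
 *
 * @param array<string, string> $env
 * @return array<string, string>
 */
function sentryOptions(array $env): array
{
    $options = [
        'dsn'         => $env['SENTRY_DSN'] ?? '',
        'environment' => $env['APP_ENV'] ?? 'production',
    ];
    if (!empty($env['APP_RELEASE'])) {
        $options['release'] = $env['APP_RELEASE']; // events get tagged with this value
    }
    return $options;
}

$options = sentryOptions(['SENTRY_DSN' => '', 'APP_ENV' => 'staging', 'APP_RELEASE' => 'a1b2c3d']);
// In the real bootstrap this would be: \Sentry\init(sentryOptions(getenv()));
echo json_encode($options), "\n";
```

Using the commit SHA as the release value keeps the mapping from “error spike” to “diff that shipped” a single lookup.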

Catch problems in dev, not prod

The cheapest production error is the one that never ships. Turn latent issues into loud failures locally so they can’t reach users:

// AppServiceProvider::boot() — make hidden problems explode in dev
Model::preventLazyLoading(! app()->isProduction());     // N+1 throws, doesn't ship
Model::preventSilentlyDiscardingAttributes(! app()->isProduction());

This turns the N+1 query trap into an exception in development and tests, rather than a slow page in production: observability pushed left, into development.
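
To show the mechanism behind a lazy-loading guard, here is a toy model in plain PHP. It is not Eloquent's implementation; the `Post` class, `LazyLoadingViolation`, and the query-runner closure are all stand-ins invented for this sketch.

```php
<?php
declare(strict_types=1);

/** Thrown when a relation is read without having been loaded up front. */
final class LazyLoadingViolation extends LogicException {}

/** Toy stand-in for an ORM model. */
final class Post
{
    /** @var array<string, mixed> relations already loaded */
    private array $loaded = [];

    public function __construct(private bool $strict, private Closure $queryRunner) {}

    public function preload(string $relation, mixed $value): void
    {
        $this->loaded[$relation] = $value;
    }

    public function relation(string $name): mixed
    {
        if (array_key_exists($name, $this->loaded)) {
            return $this->loaded[$name];
        }
        if ($this->strict) {
            // Dev/test: fail loudly at the line that would trigger an extra query.
            throw new LazyLoadingViolation("Relation [{$name}] was lazy loaded.");
        }
        // Production: fall back to one extra query per model (the N+1 pattern).
        return $this->loaded[$name] = ($this->queryRunner)($name);
    }
}

$queries = 0;
$runner = function (string $relation) use (&$queries): string {
    $queries++;
    return "{$relation} loaded lazily";
};

$lenient = new Post(false, $runner);
$lenient->relation('author'); // silently costs a query

$strict = new Post(true, $runner);
try {
    $strict->relation('author');
    $threw = false;
} catch (LazyLoadingViolation $e) {
    $threw = true;
}
echo "extra queries: {$queries}, strict mode threw: ", var_export($threw, true), "\n";
```

The lenient path is the one that stays on in production, so a missed eager load degrades performance there instead of taking the page down.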

Metrics and health checks: the trend and the pulse

Errors tell you what broke; metrics tell you what’s about to. Track request latency, error rate, queue depth, CPU/RAM, and database connections — a slow climb in any of them is a warning you can act on before it’s an outage. Pair that with a /health endpoint your load balancer and uptime monitor both hit:

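Route::get('/health', function () {
    // getPdo() and ping() throw when the service is down, so catch rather than test truthiness
    $probe = function (callable $check): string {
        try { $check(); return 'ok'; } catch (\Throwable $e) { return 'down'; }
    };
    $status = ['db' => $probe(fn () => DB::connection()->getPdo()), 'redis' => $probe(fn () => Redis::ping())];
    return response()->json($status, in_array('down', $status, true) ? 503 : 200); // non-200 pulls the node
});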
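
On the metrics side, “a slow climb is a warning” only helps if something compares the numbers to a threshold. Below is a minimal sketch in plain PHP that summarizes one window of requests into p95 latency and error rate. The function names and the default thresholds are placeholders, not recommendations; in a real setup Prometheus and Grafana (or your host) would compute and alert on these.

```php
<?php
declare(strict_types=1);

/** Nearest-rank percentile of a list of numbers (e.g. 95 for p95). */
function percentile(array $values, float $percent): float
{
    if ($values === []) {
        throw new InvalidArgumentException('Need at least one sample.');
    }
    sort($values);
    $rank = (int) ceil($percent / 100 * count($values));
    return (float) $values[max(0, $rank - 1)];
}

/**
 * Summarize a window of requests and flag threshold breaches.
 *
 * @param array<int, array{ms: float|int, status: int}> $requests
 * @return array{p95_ms: float, error_rate: float, alerts: string[]}
 */
function summarizeWindow(array $requests, float $maxP95Ms = 800.0, float $maxErrorRate = 0.05): array
{
    $p95 = percentile(array_column($requests, 'ms'), 95);
    $errors = count(array_filter($requests, fn (array $r): bool => $r['status'] >= 500));
    $errorRate = $errors / count($requests);

    return [
        'p95_ms'     => $p95,
        'error_rate' => $errorRate,
        'alerts'     => array_values(array_filter([
            $p95 > $maxP95Ms ? 'latency' : null,
            $errorRate > $maxErrorRate ? 'error_rate' : null,
        ])),
    ];
}

$window = [
    ['ms' => 120, 'status' => 200], ['ms' => 150, 'status' => 200],
    ['ms' => 90,  'status' => 200], ['ms' => 1400, 'status' => 500],
    ['ms' => 110, 'status' => 200], ['ms' => 130, 'status' => 200],
    ['ms' => 100, 'status' => 200], ['ms' => 140, 'status' => 200],
    ['ms' => 160, 'status' => 200], ['ms' => 105, 'status' => 200],
];
$summary = summarizeWindow($window);
echo json_encode($summary), "\n";
```

A percentile is used instead of an average because one slow request in ten barely moves the mean but is exactly what a tenth of your users experience.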

Caveats and best practices

  • Alert on what’s actionable, or you’ll learn to ignore alerts. Alert fatigue is real — page a human only for things a human must act on now; everything else is a dashboard.
  • Logs need retention and rotation. Infinite logs fill the disk (a self-inflicted outage); too-short retention loses the trail before you investigate. Pick a window and a budget.
  • Add a correlation/request ID. One ID threaded through every log line for a request lets you reconstruct its whole journey across services in one query.
  • Watch security signals here too. Spikes in 403/429 responses or failed logins often mean an attack is in progress, so treat observability as a security tool, not just a debugging one.
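
The correlation ID bullet can be sketched in a few lines of plain PHP. This assumes the ID travels in an `X-Request-Id` header, which is a common convention rather than something your stack is guaranteed to use; the function names are hypothetical.

```php
<?php
declare(strict_types=1);

/**
 * Reuse the caller's request ID if a well-formed one was sent, otherwise mint a new one.
 * Validating the incoming value keeps arbitrary client input out of your logs.
 *
 * @param array<string, string> $headers incoming headers, keyed by name
 */
function resolveRequestId(array $headers): string
{
    foreach ($headers as $name => $value) {
        if (strtolower((string) $name) === 'x-request-id'
            && preg_match('/^[A-Za-z0-9-]{8,64}$/', $value) === 1) {
            return $value;
        }
    }
    return bin2hex(random_bytes(16));
}

/**
 * Build a JSON log line that always carries the request ID.
 *
 * @param array<string, mixed> $context
 */
function logLine(string $requestId, string $event, array $context = []): string
{
    return json_encode([
        'request_id' => $requestId,
        'event'      => $event,
        'context'    => $context,
    ], JSON_THROW_ON_ERROR);
}

$requestId = resolveRequestId(['X-Request-Id' => 'req-12345678']);
echo logLine($requestId, 'payment.started', ['user_id' => 4567]), "\n";
echo logLine($requestId, 'payment.failed', ['gateway' => 'paymongo']), "\n";
```

For the ID to span services, each outgoing HTTP call and queued job would need to forward the same value, so that one filter on `request_id` returns the whole journey.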

Conclusion

Pillars    → logs (what happened) + errors (what's broken) + metrics (the trend)
Logs       → structured/JSON, centralized, never secrets
Errors     → Sentry-style alerts with context + release tags
Shift-left → preventLazyLoading et al: fail loud in dev
Pulse      → metrics + /health for the balancer and uptime monitor

You can’t fix what you can’t see. Structured logs, active error alerting, and trend metrics turn a black box into a glass one — so you’re the first to know, not the last. Final post: availability and recovery — what you do when, despite all of this, it still goes down.