Getting Visibility in Your Deployed Platform: Monitoring, Logging, and Error Tracking
Deploying a platform is not the finish line. It is the point where your system leaves the controlled environment of your laptop and starts meeting real users, slow networks, expired tokens, overloaded databases, bad deploys, background jobs, webhook retries, and third-party APIs that fail at the worst possible moment.
If you cannot see what is happening, production becomes guesswork. You refresh dashboards, search logs, restart services, and hope the same incident does not happen again. Visibility is what turns that chaos into engineering work.
This guide is a practical blueprint for building that visibility. Not a giant vendor tutorial. Not a promise that one dashboard will fix everything. The goal is simple: when your deployed platform is slow, broken, or behaving strangely, you should be able to answer:
- What is failing?
- Who is affected?
- When did it start?
- Which deploy, service, request, job, or dependency caused it?
- What should we fix first?
That is the job of monitoring, observability, logging, and error tracking.
Start With the Signals
The observability stack is easier to reason about when you separate the signals:
| Signal | What it answers | Common tools |
|---|---|---|
| Errors | What crashed or threw an exception? | Sentry, Bugsnag, Rollbar |
| Traces | What happened across this request or workflow? | OpenTelemetry, Tempo, Jaeger, Honeycomb |
| Metrics | How is the system behaving over time? | Prometheus, Grafana, Datadog |
| Logs | What did the service say happened? | Loki, Elasticsearch, OpenSearch |
| Alerts | What needs human attention now? | Grafana Alerting, PagerDuty, Opsgenie |
Do not start by installing every tool. Start by deciding which questions you need to answer. A small platform usually needs four things first:
- error tracking for crashes and exceptions
- request and job traces for latency and dependency debugging
- metrics for health, saturation, and business-critical counters
- centralized logs that can be searched by service, environment, deploy, and request ID
Everything else builds on those foundations.
The Practical Architecture
A good default architecture for a cloud-native platform looks like this:
Application services
-> Sentry for errors, releases, and user-impact context
-> OpenTelemetry SDK for traces, spans, and metrics
-> OpenTelemetry Collector for routing and normalization
-> Prometheus for metrics storage and alert rules
-> Loki or ELK for logs
-> Grafana for dashboards, exploration, and alerting
-> Terraform for provisioning dashboards, data sources, folders, and alert rules
The important part is not the exact vendor mix. The important part is the flow:
- Instrument the application close to the code that handles requests, jobs, queues, and external calls.
- Attach correlation data such as trace_id, request_id, user_id, tenant_id, environment, and release.
- Ship telemetry to dedicated backends instead of leaving it trapped inside container logs.
- Use dashboards and alerts for operating the system, not for decorating the platform.
- Provision observability as code so production visibility survives team changes, redeploys, and environment rebuilds.
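The correlation step in the flow above can be sketched with Python's standard `contextvars` module. This is a minimal illustration under stated assumptions, not a prescribed implementation: the helpers `bind_context` and `with_context` are hypothetical names, and the field values are the placeholder IDs used elsewhere in this guide.

```python
import contextvars

# Holds the correlation fields for the request or job currently being handled.
_context = contextvars.ContextVar("telemetry_context", default={})


def bind_context(**fields):
    """Merge new correlation fields into the current context (hypothetical helper)."""
    merged = {**_context.get(), **fields}
    return _context.set(merged)  # the token lets the caller restore the previous state


def with_context(event):
    """Return a copy of an event dict with the bound correlation fields attached."""
    return {**_context.get(), **event}


# Bind once at the edge of the request. Every log line or error report built
# through with_context() then carries the same identifiers.
token = bind_context(
    service="api",
    environment="production",
    release="2026.06.03-9f4a21c",
    request_id="req_abc",
    tenant_id="tenant_456",
)
event = with_context({"message": "Payment authorization failed"})

_context.reset(token)  # drop the fields when the request finishes
after = with_context({"message": "outside the request"})
```

Binding the fields in one place is what keeps them consistent across signals; the alternative, passing IDs by hand into every log call, tends to drift.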
Error Tracking: Add Sentry Early
Sentry is often the fastest win because it gives you high-signal information quickly: exception stack traces, affected users, releases, browser/device context, breadcrumbs, and frequency.
The minimum useful setup is:
- configure Sentry in every frontend, backend, worker, and background-job process
- set environment to production, staging, or preview
- set release to the deployed version or commit SHA
- attach user and tenant context when safe
- enable performance tracing only where sampling is affordable
- send source maps for frontend builds
- mark deploys/releases so regressions are visible after a deploy
The mistake I see most often is treating Sentry as “just error logging.” It is more useful than that. Sentry should tell you whether a new deploy caused a spike, whether one customer is affected or everyone is affected, and whether the same exception has been happening quietly for weeks.
A useful Sentry event should answer:
service=api
environment=production
release=2026.06.03-9f4a21c
user_id=usr_123
tenant_id=tenant_456
trace_id=7b3f...
transaction=POST /api/orders
Without that context, an exception is just a stack trace. With that context, it becomes a debugging path.
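As a rough sketch of the setup list above, the Sentry options can be assembled from deploy-time environment variables. The variable names (`SENTRY_DSN`, `APP_ENV`, `GIT_SHA`, `SENTRY_TRACES_SAMPLE_RATE`) and the `build_sentry_options` helper are assumptions made for illustration. The keys `dsn`, `environment`, `release`, and `traces_sample_rate` are options accepted by `sentry_sdk.init` in the Python SDK.

```python
import os

ALLOWED_ENVIRONMENTS = {"production", "staging", "preview"}


def build_sentry_options(env=os.environ):
    """Assemble keyword arguments for sentry_sdk.init from environment variables."""
    environment = env.get("APP_ENV", "preview")
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unexpected environment: {environment!r}")
    return {
        # A missing DSN (None) leaves the SDK disabled, which suits local runs.
        "dsn": env.get("SENTRY_DSN"),
        "environment": environment,
        # Tie every event to the deployed version so regressions map to a deploy.
        "release": env.get("GIT_SHA", "unknown"),
        # Sample traces only as much as is affordable; 0.0 turns tracing off.
        "traces_sample_rate": float(env.get("SENTRY_TRACES_SAMPLE_RATE", "0.0")),
    }


options = build_sentry_options(
    {"APP_ENV": "production", "GIT_SHA": "9f4a21c", "SENTRY_TRACES_SAMPLE_RATE": "0.1"}
)
# In a real service this would be followed by sentry_sdk.init(**options), with
# per-request context added via sentry_sdk.set_user({"id": ...}) and
# sentry_sdk.set_tag("tenant_id", ...).
```

Rejecting unknown environment names early is a design choice in this sketch: a typo such as "prod" would otherwise split production events across two environments.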
Tracing: Use OpenTelemetry for the Whole Request
Logs tell you what a service said. Metrics tell you what changed over time. Traces show you the path of a single request or workflow.
OpenTelemetry is the right default because it is vendor-neutral. You can instrument once and send traces to Grafana Tempo, Jaeger, Honeycomb, Datadog, New Relic, or another backend later.
At minimum, trace:
- inbound HTTP requests
- database queries
- cache operations
- queue publish and consume operations
- external API calls
- background jobs
- scheduled tasks
The goal is not to trace every function. The goal is to see where time went and where failure entered the workflow.
A useful trace might show:
POST /api/checkout       2.8s
  validate_cart          24ms
  get_user               18ms
  reserve_inventory      480ms
  charge_card            1.9s
  publish_event          30ms
  render_response        8ms
That is immediately more useful than a log line saying “checkout slow.” You can see the payment provider is the bottleneck.
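The same reading of the trace can be done mechanically. The sketch below is plain Python over the durations from the example trace, not an OpenTelemetry API; the function names are hypothetical, and it assumes the child spans run one after another without overlapping.

```python
# Span durations in milliseconds, copied from the example trace above.
ROOT_MS = 2800
CHILD_SPANS_MS = {
    "validate_cart": 24,
    "get_user": 18,
    "reserve_inventory": 480,
    "charge_card": 1900,
    "publish_event": 30,
    "render_response": 8,
}


def slowest_span(spans):
    """Return the (name, duration) pair of the span that took the longest."""
    return max(spans.items(), key=lambda item: item[1])


def untraced_ms(root_ms, spans):
    """Time in the root span not covered by any child span (assumes no overlap)."""
    return root_ms - sum(spans.values())


name, duration = slowest_span(CHILD_SPANS_MS)
share = duration / ROOT_MS
print(f"{name}: {duration}ms ({share:.0%} of the request)")
print(f"untraced: {untraced_ms(ROOT_MS, CHILD_SPANS_MS)}ms")
```

For the example trace, charge_card accounts for about 68% of the request, and 340ms is not covered by any child span. A large untraced remainder is usually a hint that some step still lacks instrumentation.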
For most teams, the OpenTelemetry Collector is worth adding early. It gives you one place to receive telemetry, apply sampling rules, enrich attributes, and route data to the right backend.
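# Accept OTLP from applications over both HTTP and gRPC.
receivers:
  otlp:
    protocols:
      http:
      grpc:
# Stamp every span with the environment, then batch before export.
processors:
  batch:
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert
# Forward traces to Tempo over OTLP/gRPC.
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true  # plaintext; acceptable only inside a trusted network
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp]
This is not a complete production config, but it shows the shape: receive OTLP data, enrich it, batch it, and export it.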
Metrics: Prometheus for Health and Trends
Metrics answer questions over time:
- Is the API error rate rising?
- Did p95 latency change after the last deploy?
- Is the worker queue backing up?
- Is the database connection pool saturated?
- Are signups, payments, uploads, or messages dropping?
Prometheus is a strong default for service metrics because the data model is simple, the ecosystem is mature, and Grafana works with it naturally.
The metrics that matter first:
| Metric | Why it matters |
|---|---|
| request count | traffic, adoption, and sudden drops |
| error count/rate | production breakage |
| request duration | user-visible latency |
| queue depth | worker saturation |
| job duration and failures | async reliability |
| database pool usage | saturation before outage |
| external API latency/errors | vendor dependency health |
| deploy count/version | correlation with regressions |
Use metrics for aggregates, not forensic detail. A metric should tell you that POST /api/checkout has a 7% error rate. A trace or Sentry event should tell you why one checkout failed.
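To make the aggregate view concrete, here is a small sketch that computes an error rate and a p95 from raw request samples. Prometheus itself derives quantiles from histogram buckets in PromQL; this plain-Python version only illustrates what the aggregates mean. The sample data and function names are made up for the example.

```python
import math


def error_rate(statuses):
    """Fraction of responses with a 5xx status code."""
    if not statuses:
        return 0.0
    return sum(1 for status in statuses if status >= 500) / len(statuses)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical samples: 100 requests, 7 of which failed with a server error.
statuses = [200] * 93 + [500] * 7
durations_ms = list(range(1, 101))  # 1ms through 100ms, one sample each

print(f"error rate: {error_rate(statuses):.0%}")
print(f"p95: {percentile(durations_ms, 95)}ms")
```

Neither number says which request failed or why, which is exactly the division of labor described above: the aggregate raises the question, and a trace or error event answers it.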
Good alerting starts with a few boring rules:
High API error rate for 10 minutes
p95 latency above user-facing threshold
Background queue age above SLA
No successful payments in the last 15 minutes
Database connection pool above 90%
Avoid alerting on every spike. Alert when a human needs to act.
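The difference between a spike and a sustained problem can be expressed as a hold period, similar in spirit to the `for` clause in a Prometheus alerting rule. The sketch below is an illustration only; `should_alert`, the 5% threshold, and the one-sample-per-minute data are assumptions.

```python
def should_alert(samples, threshold, hold_minutes):
    """Fire only if every sample in the trailing window breaches the threshold.

    samples: list of (minute, value) pairs in ascending time order, one per minute.
    """
    if len(samples) < hold_minutes:
        return False  # not enough history to call the breach sustained
    window = samples[-hold_minutes:]
    return all(value > threshold for _, value in window)


# Error rate per minute: a one-minute spike versus a sustained breach.
spike = [(minute, 0.01) for minute in range(10)]
spike[4] = (4, 0.20)
sustained = [(minute, 0.08) for minute in range(10)]

print(should_alert(spike, threshold=0.05, hold_minutes=10))      # False
print(should_alert(sustained, threshold=0.05, hold_minutes=10))  # True
```

The hold period trades detection speed for fewer false pages, so it is worth choosing per alert instead of using one global value.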
Logs: Loki or ELK, but Make Them Structured
Logs are still essential, but only if they are searchable and correlated. Raw text logs are where debugging goes to die.
Use structured logs:
{
  "level": "error",
  "service": "api",
  "environment": "production",
  "release": "2026.06.03-9f4a21c",
  "trace_id": "7b3f...",
  "request_id": "req_abc",
  "user_id": "usr_123",
  "route": "POST /api/checkout",
  "message": "Payment authorization failed",
  "provider": "stripe",
  "duration_ms": 1932
}
The stack choice depends on your team:
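One way to produce log lines in that shape is a small JSON formatter on top of Python's standard `logging` module. This is a sketch, not the only approach: the `JsonFormatter` class and the convention of passing per-call fields under a `context` key are assumptions for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, static_fields):
        super().__init__()
        self.static_fields = static_fields  # e.g. service, environment, release

    def format(self, record):
        payload = {
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            **self.static_fields,
            # Per-call fields arrive through logging's `extra=` argument.
            **getattr(record, "context", {}),
        }
        return json.dumps(payload)


formatter = JsonFormatter(
    {"service": "api", "environment": "production", "release": "2026.06.03-9f4a21c"}
)
handler = logging.StreamHandler()
handler.setFormatter(formatter)
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment authorization failed",
    extra={"context": {"request_id": "req_abc", "provider": "stripe", "duration_ms": 1932}},
)
```

Keeping the static fields in the formatter means individual log calls only supply what is specific to that event.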
- Loki is a good fit when you already use Grafana and want cost-conscious log aggregation. It indexes labels, not every word, so label design matters.
- ELK/OpenSearch is a good fit when you need powerful full-text search, heavier indexing, and log analytics across many event shapes.
For many platforms, Loki plus Grafana is enough. For compliance-heavy environments, security analytics, or complex search requirements, ELK or OpenSearch may be worth the operational cost.
Whatever you choose, standardize labels:
service
environment
region
release
tenant_id
trace_id
request_id
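job_name
Those labels are the difference between “search all logs for checkout” and “show me production API errors for this tenant after the last deploy.” One caveat for Loki: every distinct combination of label values creates a separate stream, so keep high-cardinality values such as trace_id and request_id as fields in the log body rather than as index labels.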
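Because a label-indexed store such as Loki creates a separate stream for each combination of label values, one common approach is to index only fields with few distinct values and leave per-request identifiers in the log body. The split below is an assumption for illustration, and `split_labels` is a hypothetical helper; the right boundary depends on your own cardinality.

```python
# Assumed split: fields with few distinct values become index labels.
INDEX_LABELS = ("service", "environment", "region", "release", "job_name")


def split_labels(record):
    """Separate a structured log record into index labels and a searchable body."""
    labels = {key: value for key, value in record.items() if key in INDEX_LABELS}
    body = {key: value for key, value in record.items() if key not in INDEX_LABELS}
    return labels, body


labels, body = split_labels({
    "service": "api",
    "environment": "production",
    "release": "2026.06.03-9f4a21c",
    "tenant_id": "tenant_456",
    "request_id": "req_abc",
    "message": "Payment authorization failed",
})
```

The body fields stay searchable at query time; they just do not multiply the number of streams the store has to track.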
Grafana: Build Dashboards Around Questions
Grafana dashboards should reflect how you operate the platform. Do not build one giant dashboard with every chart. Build small dashboards around decisions.
Start with:
- Platform overview: uptime, request rate, error rate, p95 latency, active incidents
- API health: route latency, route error rate, dependency latency, saturation
- Worker health: queue depth, queue age, job duration, retries, dead-letter count
- Business health: signups, payments, uploads, messages, orders, or whatever your platform must keep doing
- Deploy health: current release, error rate by release, latency by release, rollback indicators
Dashboards are not just for incident response. They are also for confidence. After a deploy, you should know whether the platform is healthier, worse, or unchanged.
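The "healthier, worse, or unchanged" judgment behind a deploy-health panel can be sketched as a comparison of error rates by release. The release names, the sample data, and the one-percentage-point tolerance are all assumptions for illustration.

```python
from collections import defaultdict


def error_rate_by_release(requests):
    """Group (release, status) pairs by release and return {release: error_rate}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for release, status in requests:
        totals[release] += 1
        if status >= 500:
            errors[release] += 1
    return {release: errors[release] / totals[release] for release in totals}


def deploy_verdict(previous_rate, current_rate, tolerance=0.01):
    """Compare two error rates, ignoring differences within the tolerance."""
    if current_rate > previous_rate + tolerance:
        return "worse"
    if current_rate < previous_rate - tolerance:
        return "healthier"
    return "unchanged"


# Hypothetical traffic: release v1 had 1 failure in 100, release v2 has 6 in 100.
requests = (
    [("v1", 200)] * 99 + [("v1", 500)]
    + [("v2", 200)] * 94 + [("v2", 500)] * 6
)
rates = error_rate_by_release(requests)
print(deploy_verdict(rates["v1"], rates["v2"]))
```

In practice the same comparison would be a query over metrics labeled with the release, but the logic is the same: compare like with like, and ignore noise below a stated tolerance.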
Provision Grafana With Terraform
Manually clicking dashboards into existence works until someone deletes one, changes an alert, or needs the same setup in staging. Observability should be treated like infrastructure.
Terraform can provision Grafana folders, data sources, dashboards, notification policies, and alert rules. A minimal shape looks like this:
terraform {
  required_providers {
    grafana = {
      # The Grafana provider lives in the grafana namespace, not hashicorp.
      source = "grafana/grafana"
    }
  }
}

provider "grafana" {
  url  = var.grafana_url
  auth = var.grafana_auth # API token or "user:password"; keep it out of version control
}

resource "grafana_folder" "platform" {
  title = "Platform"
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = var.prometheus_url
}

resource "grafana_data_source" "loki" {
  type = "loki"
  name = "Loki"
  url  = var.loki_url
}
You do not need to model every panel in Terraform on day one, but the critical pieces should be code:
- data sources
- dashboard folders
- shared dashboards
- alert rules
- notification contact points
- environment-specific variables
The benefit is not just repeatability. It is reviewability. Alert changes should go through the same review process as application changes.
Correlation Is the Difference Between Data and Visibility
The stack only works when signals connect to each other.
When an alert fires, you should be able to move like this:
Grafana alert
-> Prometheus metric showing p95 latency spike
-> trace for a slow request
-> logs with the same trace_id
-> Sentry issue from the same release
-> deploy that introduced the regression
That workflow depends on consistent attributes:
- service.name
- deployment.environment
- service.version or release
- trace_id
- request_id
- user_id or anonymous session ID
- tenant_id when applicable
- cloud region or runtime
This is where many teams fail. They install good tools but do not standardize context. The result is five disconnected systems and a lot of manual searching.
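One lightweight way to keep context standardized is to check telemetry records for the required attributes, for example in a test suite. The sketch below is an assumption-laden illustration: the required set is a subset chosen for the example, and `missing_attributes` is a hypothetical helper.

```python
# Assumed minimum set; extend it with request_id, tenant_id, and so on as needed.
REQUIRED_ATTRIBUTES = (
    "service.name",
    "deployment.environment",
    "service.version",
    "trace_id",
)


def missing_attributes(record):
    """Return the required attributes that are absent or empty on a telemetry record."""
    return [key for key in REQUIRED_ATTRIBUTES if not record.get(key)]


complete = {
    "service.name": "api",
    "deployment.environment": "production",
    "service.version": "2026.06.03-9f4a21c",
    "trace_id": "7b3f...",
}
incomplete = {"service.name": "worker", "trace_id": ""}

print(missing_attributes(complete))
print(missing_attributes(incomplete))
```

A check like this catches the service that logs without a release or trace ID before an incident does.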
What to Build First
If you are starting from almost nothing, build in this order:
- Sentry for every runtime so exceptions and releases are visible.
- Structured logs with environment, service, release, request ID, and trace ID.
- OpenTelemetry traces for requests, jobs, database calls, queues, and external APIs.
- Prometheus metrics for traffic, errors, latency, saturation, and key business flows.
- Grafana dashboards for platform, API, workers, deploys, and business health.
- Alerts for user-impacting failures, not every technical fluctuation.
- Terraform provisioning for Grafana data sources, dashboards, and alerting rules.
This order gives you value quickly. Sentry catches obvious breakage. Logs give you detail. Traces explain workflows. Metrics show trends. Dashboards and alerts turn the data into operations.
The Production Visibility Checklist
Before calling a platform production-ready, I want these answers:
- Can I see which release is running?
- Can I see errors by release, service, route, and environment?
- Can I follow one request across services?
- Can I connect a Sentry issue to logs and traces?
- Can I see p50, p95, and p99 latency for critical routes?
- Can I see queue depth and job age?
- Can I tell whether a third-party dependency is failing?
- Can I tell whether users are blocked or only background work is delayed?
- Can I see business-critical flows such as payments, signups, uploads, or messages?
- Can I provision the observability setup again from code?
- Can an engineer on call debug an incident without asking five people where to look?
If the answer to any of these is no, the platform is not fully visible yet.
Conclusion
Observability is not a dashboard project. It is a production engineering habit.
Sentry gives you high-signal error tracking. OpenTelemetry gives you portable traces and metrics. Prometheus gives you time-series monitoring. Loki or ELK gives you searchable logs. Grafana gives you dashboards and alerting. Terraform makes the setup reproducible.
But the tools are only useful when they are connected by consistent context and aimed at real operational questions.
The best observability stack is the one that helps you move from “something is wrong” to “this deploy broke checkout for these users because this dependency got slower” in minutes, not hours.