Getting Visibility in Your Deployed Platform: Monitoring, Logging, and Error Tracking

Deploying a platform is not the finish line. It is the point where your system leaves the controlled environment of your laptop and starts meeting real users, slow networks, expired tokens, overloaded databases, bad deploys, background jobs, webhook retries, and third-party APIs that fail at the worst possible moment.

If you cannot see what is happening, production becomes guesswork. You refresh dashboards, search logs, restart services, and hope the same incident does not happen again. Visibility is what turns that chaos into engineering work.

This guide is a practical blueprint for building that visibility. Not a giant vendor tutorial. Not a promise that one dashboard will fix everything. The goal is simple: when your deployed platform is slow, broken, or behaving strangely, you should be able to answer:

What is failing?
Who is affected?
When did it start?
Which deploy, service, request, job, or dependency caused it?
What should we fix first?

That is the job of monitoring, observability, logging, and error tracking.

Start With the Signals

The observability stack is easier to reason about when you separate the signals:

Signal	What it answers	Common tools
Errors	What crashed or threw an exception?	Sentry, Bugsnag, Rollbar
Traces	What happened across this request or workflow?	OpenTelemetry, Tempo, Jaeger, Honeycomb
Metrics	How is the system behaving over time?	Prometheus, Grafana, Datadog
Logs	What did the service say happened?	Loki, Elasticsearch, OpenSearch
Alerts	What needs human attention now?	Grafana Alerting, PagerDuty, Opsgenie

Do not start by installing every tool. Start by deciding which questions you need to answer. A small platform usually needs four things first:

error tracking for crashes and exceptions
request and job traces for latency and dependency debugging
metrics for health, saturation, and business-critical counters
centralized logs that can be searched by service, environment, deploy, and request ID

Everything else builds on those foundations.

The Practical Architecture

A good default architecture for a cloud-native platform looks like this:

Application services
  -> Sentry for errors, releases, and user-impact context
  -> OpenTelemetry SDK for traces, spans, and metrics
  -> OpenTelemetry Collector for routing and normalization
  -> Prometheus for metrics storage and alert rules
  -> Loki or ELK for logs
  -> Grafana for dashboards, exploration, and alerting
  -> Terraform for provisioning dashboards, data sources, folders, and alert rules

The important part is not the exact vendor mix. The important part is the flow:

Instrument the application close to the code that handles requests, jobs, queues, and external calls.
Attach correlation data such as trace_id, request_id, user_id, tenant_id, environment, and release.
Ship telemetry to dedicated backends instead of leaving it trapped inside container logs.
Use dashboards and alerts for operating the system, not for decorating the platform.
Provision observability as code so production visibility survives team changes, redeploys, and environment rebuilds.

Error Tracking: Add Sentry Early

Sentry is often the fastest win because it gives you high-signal information quickly: exception stack traces, affected users, releases, browser/device context, breadcrumbs, and frequency.

The minimum useful setup is:

configure Sentry in every frontend, backend, worker, and background-job process
set environment to production, staging, or preview
set release to the deployed version or commit SHA
attach user and tenant context when safe
enable performance tracing at a sample rate you can afford, starting low and raising it only where you need more detail
send source maps for frontend builds
mark deploys/releases so regressions are visible after a deploy

The mistake I see most often is treating Sentry as “just error logging.” It is more useful than that. Sentry should tell you whether a new deploy caused a spike, whether one customer is affected or everyone is affected, and whether the same exception has been happening quietly for weeks.

A useful Sentry event should carry context like this:

service=api
environment=production
release=2026.06.03-9f4a21c
user_id=usr_123
tenant_id=tenant_456
trace_id=7b3f...
transaction=POST /api/orders

Without that context, an exception is just a stack trace. With that context, it becomes a debugging path.
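
As a rough sketch of the setup above, the helper below collects the Sentry options from deploy-time variables and attaches per-request context. The environment variable names (SENTRY_DSN, APP_ENV, GIT_SHA, SENTRY_TRACES_SAMPLE_RATE) and the function names are assumptions for illustration, not anything Sentry requires. The sentry_sdk import is kept lazy so the option-building logic can be tested without the SDK installed.

```python
import os


def build_sentry_options(env: dict) -> dict:
    """Collect sentry_sdk.init() keyword arguments from deploy-time variables.

    The variable names are hypothetical; use whatever your pipeline exports.
    """
    return {
        "dsn": env.get("SENTRY_DSN", ""),
        "environment": env.get("APP_ENV", "production"),
        "release": env.get("GIT_SHA", "unknown"),
        # 0.0 turns performance tracing off; raise it only where affordable.
        "traces_sample_rate": float(env.get("SENTRY_TRACES_SAMPLE_RATE", "0.0")),
    }


def init_sentry(service: str) -> None:
    """Initialize Sentry once per process and tag every event with the service."""
    import sentry_sdk  # lazy import: only needed where Sentry is installed

    sentry_sdk.init(**build_sentry_options(dict(os.environ)))
    sentry_sdk.set_tag("service", service)


def bind_request_context(user_id: str, tenant_id: str, trace_id: str) -> None:
    """Attach per-request context; intended to be called from middleware."""
    import sentry_sdk

    sentry_sdk.set_user({"id": user_id})
    sentry_sdk.set_tag("tenant_id", tenant_id)
    sentry_sdk.set_tag("trace_id", trace_id)
```

Calling init_sentry("api") at process start and bind_request_context(...) per request is one way to get the fields listed above onto every event. Only attach user and tenant identifiers when your privacy rules allow it.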

Tracing: Use OpenTelemetry for the Whole Request

Logs tell you what a service said. Metrics tell you what changed over time. Traces show you the path of a single request or workflow.

OpenTelemetry is the right default because it is vendor-neutral. You can instrument once and send traces to Grafana Tempo, Jaeger, Honeycomb, Datadog, New Relic, or another backend later.

At minimum, trace:

inbound HTTP requests
database queries
cache operations
queue publish and consume operations
external API calls
background jobs
scheduled tasks

The goal is not to trace every function. The goal is to see where time went and where failure entered the workflow.
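
To make the idea of a span concrete, here is a minimal stand-in written with the standard library only. It is not the OpenTelemetry API: the span helper and the recorded list are hypothetical names, and a real setup would use the OpenTelemetry SDK's tracer instead. What it shows is the core of the technique: wrap a unit of work, record how long it took, and record whether it failed.

```python
import time
from contextlib import contextmanager
from typing import Iterator

# In-memory sink for finished spans; a real tracer would export these instead.
recorded: list = []


@contextmanager
def span(name: str) -> Iterator[None]:
    """Time a block of work and record its name, status, and duration."""
    start = time.perf_counter()
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"  # mark the span failed, then let the error propagate
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        recorded.append({"name": name, "status": status, "duration_ms": duration_ms})


with span("charge_card"):
    time.sleep(0.01)  # placeholder for the external API call
```

Wrapping only the boundaries listed above (requests, queries, queue operations, external calls, jobs) is usually enough to see where time went without instrumenting every function.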

A useful trace might show:

POST /api/checkout  2.8s
  validate_cart       24ms
  get_user            18ms
  reserve_inventory  480ms
  charge_card        1.9s
  publish_event       30ms
  render_response      8ms

That is immediately more useful than a log line saying “checkout slow.” You can see that charge_card, the call to the payment provider, accounts for 1.9 of the 2.8 seconds.
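
The same reading of the trace can be done mechanically. The sketch below takes the span durations from the example and reports the slowest child and the time that no child span accounts for. The Span class and summarize function are hypothetical helpers, and the calculation assumes the child spans run one after another without overlapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    name: str
    duration_ms: float


def summarize(root_ms: float, children: list) -> dict:
    """Report the slowest child span and the time not covered by any child."""
    slowest = max(children, key=lambda s: s.duration_ms)
    accounted = sum(s.duration_ms for s in children)
    return {
        "slowest": slowest.name,
        "slowest_share": round(slowest.duration_ms / root_ms, 2),
        "unaccounted_ms": root_ms - accounted,
    }


checkout = [
    Span("validate_cart", 24),
    Span("get_user", 18),
    Span("reserve_inventory", 480),
    Span("charge_card", 1900),
    Span("publish_event", 30),
    Span("render_response", 8),
]
summary = summarize(2800, checkout)
print(summary)
```

For these numbers, charge_card takes roughly two thirds of the request, and about 340 ms is not covered by any child span. A gap like that often points at work that is not instrumented yet, such as middleware or serialization, though that is only a guess until you add spans there.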

For most teams, the OpenTelemetry Collector is worth adding early. It gives you one place to receive telemetry, apply sampling rules, enrich attributes, and route data to the right backend.

receivers:
  otlp:
    protocols:
      http:
      grpc:

processors:
  batch:
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  otlp:
    endpoint: tempo:4317 # OTLP over gRPC
    tls:
      insecure: true # plaintext; only acceptable inside a trusted network

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp]

This is not a complete production config, but it shows the shape: receive OTLP data, enrich it, batch it, and export it.

Metrics: Prometheus for Health and Trends

Metrics answer questions over time:

Is the API error rate rising?
Did p95 latency change after the last deploy?
Is the worker queue backing up?
Is the database connection pool saturated?
Are signups, payments, uploads, or messages dropping?

Prometheus is a strong default for service metrics because the data model is simple, the ecosystem is mature, and Grafana works with it naturally.

The metrics that matter first:

Metric	Why it matters
request count	traffic, adoption, and sudden drops
error count/rate	production breakage
request duration	user-visible latency
queue depth	worker saturation
job duration and failures	async reliability
database pool usage	saturation before outage
external API latency/errors	vendor dependency health
deploy count/version	correlation with regressions

Use metrics for aggregates, not forensic detail. A metric should tell you that POST /api/checkout has a 7% error rate. A trace or Sentry event should tell you why one checkout failed.
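
To illustrate what "aggregate" means here, the sketch below turns raw request samples into a per-route error rate and p95 latency. In practice Prometheus computes these from counters and histograms with functions such as rate() and histogram_quantile(); this is only a plain-Python illustration, and the function names and the tuple shape are assumptions.

```python
import math
from collections import defaultdict


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_routes(requests: list) -> dict:
    """Aggregate (route, status_code, duration_ms) tuples per route."""
    by_route = defaultdict(list)
    for route, status, duration_ms in requests:
        by_route[route].append((status, duration_ms))

    summary = {}
    for route, samples in by_route.items():
        # Count only server-side failures (5xx) as errors.
        errors = sum(1 for status, _ in samples if status >= 500)
        summary[route] = {
            "requests": len(samples),
            "error_rate": errors / len(samples),
            "p95_ms": percentile([d for _, d in samples], 95),
        }
    return summary
```

The output says how a route is doing overall. It deliberately throws away which individual request failed, which is the part traces and error events keep.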

Good alerting starts with a few boring rules:

High API error rate for 10 minutes
p95 latency above user-facing threshold
Background queue age above SLA
No successful payments in the last 15 minutes
Database connection pool above 90%

Avoid alerting on every spike. Alert when a human needs to act.
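
One way to avoid alerting on spikes is to require the condition to hold for a sustained window, which is what the for clause in a Prometheus alerting rule does. The sketch below shows that logic in isolation. The function name, the one-sample-per-minute assumption, and the 5% threshold in the usage lines are illustrative, not recommendations.

```python
def should_alert(error_rates: list, threshold: float, for_minutes: int) -> bool:
    """Fire only when the last `for_minutes` one-minute samples all exceed threshold."""
    if for_minutes <= 0 or len(error_rates) < for_minutes:
        return False  # not enough history to call the condition sustained
    return all(rate > threshold for rate in error_rates[-for_minutes:])


# A single bad minute after nine quiet ones does not page anyone.
spike = [0.01] * 9 + [0.20]
# Ten consecutive minutes above the threshold does.
sustained = [0.08] * 10

print(should_alert(spike, threshold=0.05, for_minutes=10))
print(should_alert(sustained, threshold=0.05, for_minutes=10))
```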

Logs: Loki or ELK, but Make Them Structured

Logs are still essential, but only if they are searchable and correlated. Raw text logs are where debugging goes to die.

Use structured logs:

{
	"level": "error",
	"service": "api",
	"environment": "production",
	"release": "2026.06.03-9f4a21c",
	"trace_id": "7b3f...",
	"request_id": "req_abc",
	"user_id": "usr_123",
	"route": "POST /api/checkout",
	"message": "Payment authorization failed",
	"provider": "stripe",
	"duration_ms": 1932
}
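
A minimal way to produce logs in that shape with Python's standard logging module is a custom formatter that renders each record as one JSON object. This is a sketch, not a full logging setup: the JsonFormatter class and the CONTEXT_FIELDS tuple are names chosen for illustration, and the context values are passed per call through the extra argument.

```python
import json
import logging

# Fields copied from the log record into the JSON output when present.
CONTEXT_FIELDS = ("service", "environment", "release", "trace_id", "request_id", "user_id", "route")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname.lower(), "message": record.getMessage()}
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error(
    "Payment authorization failed",
    extra={"service": "api", "environment": "production", "request_id": "req_abc", "route": "POST /api/checkout"},
)
```

One JSON object per line keeps the output easy for log shippers to parse, whichever backend you send it to.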

The stack choice depends on your team:

Loki is a good fit when you already use Grafana and want cost-conscious log aggregation. It indexes labels, not every word, so label design matters.
ELK/OpenSearch is a good fit when you need powerful full-text search, heavier indexing, and log analytics across many event shapes.

For many platforms, Loki plus Grafana is enough. For compliance-heavy environments, security analytics, or complex search requirements, ELK or OpenSearch may be worth the operational cost.

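Whatever you choose, standardize field names:

service
environment
region
release
tenant_id
trace_id
request_id
job_name

Those fields are the difference between “search all logs for checkout” and “show me production API errors for this tenant after the last deploy.” In Loki, promote only the low-cardinality ones (such as service, environment, and region) to index labels and keep per-request values such as trace_id and request_id in the log line itself, because every distinct label combination creates a new stream.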
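
If Loki is the backend, it usually helps to decide in code which fields become index labels and which stay in the log line, since high-cardinality values tend to be kept out of the index. The sketch below shows one such split. The INDEX_LABELS set is an assumption about which fields are low-cardinality on your platform, and split_record is a hypothetical helper, not part of any Loki client.

```python
# Assumption: these fields take only a handful of distinct values.
INDEX_LABELS = {"service", "environment", "region"}


def split_record(record: dict) -> tuple:
    """Separate index labels from fields that stay in the log line."""
    labels = {k: v for k, v in record.items() if k in INDEX_LABELS}
    fields = {k: v for k, v in record.items() if k not in INDEX_LABELS}
    return labels, fields


labels, fields = split_record({
    "service": "api",
    "environment": "production",
    "trace_id": "7b3f...",
    "request_id": "req_abc",
    "message": "Payment authorization failed",
})
print(labels)
print(fields)
```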

Grafana: Build Dashboards Around Questions

Grafana dashboards should reflect how you operate the platform. Do not build one giant dashboard with every chart. Build small dashboards around decisions.

Start with:

Platform overview: uptime, request rate, error rate, p95 latency, active incidents
API health: route latency, route error rate, dependency latency, saturation
Worker health: queue depth, queue age, job duration, retries, dead-letter count
Business health: signups, payments, uploads, messages, orders, or whatever your platform must keep doing
Deploy health: current release, error rate by release, latency by release, rollback indicators

Dashboards are not just for incident response. They are also for confidence. After a deploy, you should know whether the platform is healthier, worse, or unchanged.

Provision Grafana With Terraform

Manually clicking dashboards into existence works until someone deletes one, changes an alert, or needs the same setup in staging. Observability should be treated like infrastructure.

Terraform can provision Grafana folders, data sources, dashboards, notification policies, and alert rules. A minimal shape looks like this:

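terraform {
  required_providers {
    grafana = {
      source = "grafana/grafana"
    }
  }
}

provider "grafana" {
  url  = var.grafana_url
  auth = var.grafana_auth # API token or "user:password"; declare the variable as sensitive
}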

resource "grafana_folder" "platform" {
  title = "Platform"
}

resource "grafana_data_source" "prometheus" {
  type = "prometheus"
  name = "Prometheus"
  url  = var.prometheus_url
}

resource "grafana_data_source" "loki" {
  type = "loki"
  name = "Loki"
  url  = var.loki_url
}

You do not need to model every panel in Terraform on day one, but the critical pieces should be code:

data sources
dashboard folders
shared dashboards
alert rules
notification contact points
environment-specific variables

The benefit is not just repeatability. It is reviewability. Alert changes should go through the same review process as application changes.

Correlation Is the Difference Between Data and Visibility

The stack only works when signals connect to each other.

When an alert fires, you should be able to move like this:

Grafana alert
  -> Prometheus metric showing p95 latency spike
  -> trace for a slow request
  -> logs with the same trace_id
  -> Sentry issue from the same release
  -> deploy that introduced the regression

That workflow depends on consistent attributes:

service.name
deployment.environment
service.version or release
trace_id
request_id
user_id or anonymous session ID
tenant_id when applicable
cloud region or runtime

This is where many teams fail. They install good tools but do not standardize context. The result is five disconnected systems and a lot of manual searching.
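
One way to standardize context inside a Python service is to bind the IDs once per request and let a logging filter copy them onto every record. The sketch below uses contextvars from the standard library. The names start_request and CorrelationFilter are hypothetical, and the IDs are generated locally for illustration; with OpenTelemetry in place, the trace ID would normally come from the active span rather than from uuid.

```python
import contextvars
import logging
import uuid
from typing import Optional

request_id_var = contextvars.ContextVar("request_id", default="-")
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class CorrelationFilter(logging.Filter):
    """Copy the current request's IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        record.trace_id = trace_id_var.get()
        return True  # never drop the record, only enrich it


def start_request(trace_id: Optional[str] = None) -> str:
    """Bind IDs for the current request, reusing an upstream trace ID if given."""
    request_id = f"req_{uuid.uuid4().hex[:12]}"
    request_id_var.set(request_id)
    trace_id_var.set(trace_id or uuid.uuid4().hex)
    return request_id
```

Attach the filter to your handlers and call start_request at the top of each request or job. Because context variables are isolated per thread and per asyncio task, concurrent requests should not see each other's IDs.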

What to Build First

If you are starting from almost nothing, build in this order:

Sentry for every runtime so exceptions and releases are visible.
Structured logs with environment, service, release, request ID, and trace ID.
OpenTelemetry traces for requests, jobs, database calls, queues, and external APIs.
Prometheus metrics for traffic, errors, latency, saturation, and key business flows.
Grafana dashboards for platform, API, workers, deploys, and business health.
Alerts for user-impacting failures, not every technical fluctuation.
Terraform provisioning for Grafana data sources, dashboards, and alerting rules.

This order gives you value quickly. Sentry catches obvious breakage. Logs give you detail. Traces explain workflows. Metrics show trends. Dashboards and alerts turn the data into operations.

The Production Visibility Checklist

Before calling a platform production-ready, I want these answers:

Can I see which release is running?
Can I see errors by release, service, route, and environment?
Can I follow one request across services?
Can I connect a Sentry issue to logs and traces?
Can I see p50, p95, and p99 latency for critical routes?
Can I see queue depth and job age?
Can I tell whether a third-party dependency is failing?
Can I tell whether users are blocked or only background work is delayed?
Can I see business-critical flows such as payments, signups, uploads, or messages?
Can I provision the observability setup again from code?
Can an engineer on call debug an incident without asking five people where to look?

If any answer is no, the platform is not fully visible yet.

Conclusion

Observability is not a dashboard project. It is a production engineering habit.

Sentry gives you high-signal error tracking. OpenTelemetry gives you portable traces and metrics. Prometheus gives you time-series monitoring. Loki or ELK gives you searchable logs. Grafana gives you dashboards and alerting. Terraform makes the setup reproducible.

But the tools are only useful when they are connected by consistent context and aimed at real operational questions.

The best observability stack is the one that helps you move from “something is wrong” to “this deploy broke checkout for these users because this dependency got slower” in minutes, not hours.