Metrics Tell You Something Broke. Tracing Tells You What, Where, and Why.

June 3, 2026 · By Justyn Larry

Complacency is a killer. The monitoring stack that I built works, and it’s reliable, so leaving it alone seems like the obvious thing to do. Focusing on marketing, writing documentation, or taking time away from it all seem like good options, but there’s always a better way to do something, or a problem to solve that you didn’t realize you had.

In my spare time, I look through Reddit and Dev.to for ideas and inspiration: systems that others are using that I’m not, or that I’m not aware of. Distributed tracing jumped out at me from both forums. I can tie a system event to the metrics instead of stumbling around in logs? This is a monitoring goldmine. How had I missed this?

// WHAT EXACTLY IS DISTRIBUTED TRACING?

For any multi-step process running on your system, distributed tracing provides a timeline of exactly what happened and how long each step took. It’s like getting a receipt for the work. Each request or job gets a trace ID, and every step records a span (a named block with a start time, end time, and any attributes you want to attach). Those spans assemble into a waterfall, and you can see at a glance where time was spent, what succeeded, and what failed.
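To make the trace ID, span, and waterfall vocabulary concrete, here is a minimal standard-library sketch. It is a simplified illustration, not the OpenTelemetry data model: the `Span` class and `render_waterfall` helper are hypothetical names, and the durations are made up.

```python
import secrets
from dataclasses import dataclass, field


@dataclass
class Span:
    """One named block of work inside a trace (simplified illustration)."""
    trace_id: str
    name: str
    start_ms: int
    end_ms: int
    parent: "Span | None" = None
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms

    @property
    def depth(self) -> int:
        # Root spans sit at depth 0; each parent link adds one indent level.
        return 0 if self.parent is None else self.parent.depth + 1


def render_waterfall(spans: list) -> str:
    """Print spans in start order, indented under their parents."""
    ordered = sorted(spans, key=lambda s: (s.start_ms, s.depth))
    return "\n".join(
        f"{'  ' * s.depth}{s.name}: {s.duration_ms}ms" for s in ordered
    )


# Every span in one job shares the same trace ID.
trace_id = secrets.token_hex(16)
root = Span(trace_id, "report.generate", 0, 500)
fetch = Span(trace_id, "db.get_contacts", 0, 40, parent=root)
build = Span(trace_id, "report.build_pdf", 40, 500, parent=root,
             attributes={"pages": 3})

waterfall = render_waterfall([root, fetch, build])
print(waterfall)
```

A real tracing backend does the same assembly for you: it groups spans by trace ID and nests them by parent.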

This added visibility can take a technical team from “this seems slow” to a detailed accounting of how long a process took and what the system was actually doing when the process was lagging.

// THE ORIGINAL CORE STACK

Irin Observability runs on Prometheus, Grafana, Loki, Grafana Alloy, and Alertmanager. I’ve built a robust monitoring stack that tracks metrics for request rates, error rates, LLM call counts, and report generation status. There are also logs flowing from all the services through Loki, so overall, I believed that the stack was well-instrumented and very readable.

The alert system that I built runs through five internal services, first to process each alert through an alert annotator and then to generate a monthly report. In sequence:

An alert comes in from a client’s infrastructure
The alert annotator calls a local LLM to add a plain-English explanation for a panel on one of the dashboards
The annotated result gets pushed back into Loki
At the end of the month, the aggregation script gathers all findings for report generation
The LLM narrative layer writes a summary
The report generator assembles everything into a PDF and sends it

Each of those steps runs in a different process. Some run as Docker containers, some as host Python scripts. When I audited the reports and something didn’t look right, I had to check the logs on the Loki Log Exporter Dashboard or grep logs across multiple services, correlate timestamps manually, and piece together what happened. This was both frustrating and time-consuming. The platform should tell me what the problem is, not just that something is wrong.

// THE SOLUTION: OPENTELEMETRY

OpenTelemetry (OTel) is an open source standard for collecting telemetry data (traces, metrics, and logs) from applications. It’s vendor-neutral, well-maintained, and has solid Python libraries.

Grafana Tempo is an open source backend for storing and querying traces. It integrates directly with Grafana, so once it’s running you can navigate from a log line to a trace, or from a trace to the logs that were happening at the same time.

Getting this running involved three parts. First, I deployed Tempo as a Docker Compose service, with a config file and a Grafana datasource. Second, I wired up Grafana Alloy as the collector. Since Alloy is the agent already running on my servers to ship metrics and logs, I was able to add an OTLP receiver block that accepts traces from internal services and forwards them to Tempo. That was one config change, and the heartbeat API distributed the updated config files to all the monitored servers. The final step was to instrument the Python services. This is where things got a little more difficult, but it also taught me some valuable lessons.

// THE PYTHON IMPLEMENTATION

The OTel Python SDK has two modes. The first is auto-instrumentation, which handles the common cases automatically. If you’re running a Flask or FastAPI app, importing two libraries and calling .instrument() captures every HTTP request with no further changes. If you’re using psycopg2 for Postgres queries, one more library call and every query becomes a span.

The second mode, manual spans, is for the logic your code owns (units of work that instrumentation libraries can’t see automatically). I used these to capture the LLM call itself (duration, prompt size, whether the response parsed cleanly), each section of the aggregation script so I can see which Prometheus query is slow, and the overall per-tenant run so every trace carries a tenant name.

// LESSONS LEARNED

Short-lived scripts need an explicit flush.

The aggregation script and report generator run once and exit. The batch span processor that normally sits in front of the OTel exporter queues spans and sends them on a timer. If the process exits before the batch fires, you lose all your spans. I fixed it by adding two lines: force_flush() and shutdown() on the tracer provider, in a try/finally block before exit. I lost my first few test traces before I figured this out.

The psycopg2-binary package breaks auto-instrumentation silently.

The OTel instrumentation library checks for a package literally named psycopg2. If you installed psycopg2-binary (the same library, different distribution name), the check fails and you receive no database spans, no error message, nothing reported. The fix is one parameter: Psycopg2Instrumentor().instrument(skip_dep_check=True).
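One way to see the mismatch for yourself is to ask Python’s package metadata which distribution name is actually installed. This is a diagnostic sketch using only the standard library; the helper name is hypothetical, and it assumes the dependency check is keyed on distribution names, as described above.

```python
from importlib import metadata


def installed_psycopg2_distribution(candidates=("psycopg2", "psycopg2-binary")):
    """Return the first candidate distribution name that is installed, else None."""
    for name in candidates:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            continue
        return name
    return None


dist = installed_psycopg2_distribution()
# A check keyed on the name "psycopg2" passes only when that exact
# distribution is present, even though both provide the same module.
needs_skip_dep_check = dist == "psycopg2-binary"
print(f"installed: {dist}, skip_dep_check needed: {needs_skip_dep_check}")
```

If this reports `psycopg2-binary`, that is the case where passing `skip_dep_check=True` to `instrument()` matters.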

Background tasks break parent-child trace linkage.

My alert annotator returns a 200 response immediately and processes the alert in a background thread. The HTTP span closes when the response is sent, before the real work begins, which means each alert generates two separate traces: a brief HTTP span and an orphaned processing span. This behavior is correct, not a bug, but it looked confusing until I understood the threading model. I accepted it and correlate the two traces by alert fingerprint when necessary.

// THE BIG DIFFERENCE

This is where things get interesting, and how the original monitoring stack differs from its current iteration.

Prior to integrating distributed tracing, I knew that the report pipeline ran. That’s it: pass/fail, true/false. If something went wrong, where did it happen, and why? What was the system state at the time of the failure? Now I can open a trace in Grafana Tempo and see:

report.generate: total duration 4m 12s
  db.get_contacts: 41ms
  aggregation.run (per tenant): 2m 18s
    aggregation.stability: 39ms
    aggregation.resources: 1.2s  (slow Prometheus query range)
    aggregation.alerts: 88ms
  llm.narrative_generation: 1m 44s
    llm.build_prompt: 12ms
    llm.call attempt 1: 60s  (timeout)
    llm.call attempt 2: 44s   (success)
    llm.parse: 3ms
  report.build_pdf: 8s
  report.send_email: 2s

That waterfall tells me that the Ollama model timed out on the first attempt and succeeded on the second, that the Prometheus query for resource metrics was the slow step in aggregation, and that PDF build and email delivery were fast. I don’t have to go digging through logs in an approximate time frame to figure out what happened. The problem isn’t solved, but I know exactly what it is.

Through the alert annotator, I can now see every alert as a trace. The system shows me the dedup check against Loki, the LLM call, and the result push. I can filter by tenant, by alert name, or by whether the LLM call succeeded. A 55-second LLM call that I used to see only as a latency spike in a Prometheus histogram is now a named span with the prompt size, the response size, and whether the JSON parsed cleanly.

// THE IMPLICATIONS

If you have any experience with monitoring, you have almost certainly hit the “something seems wrong but I can’t tell what” problem. The logs are probably available, you can see the metrics, but you’re stuck sifting through them in sequence trying to reconstruct what happened.

Distributed tracing changes the diagnostic workflow from “search for clues” to “read the receipt.” The trace tells you what happened, in order, with timing, which cuts out most of the investigation time and lets you go directly to the problem at hand. The same layering applies further down the stack: reachability checks tell you something broke, system metrics tell you why, and traces take that one level deeper.

It also changes how you think about reliability. When I see the LLM call consistently timing out on the first attempt, I know to tune the timeout or check model load before it impacts the client. Being proactive in monitoring is a moving target, but it is still the goal.

// THE TOOLCHAIN

Everything I used is open source and self-hostable:

OpenTelemetry Python SDK (opentelemetry-sdk, exporter packages, auto-instrumentation libraries)
Grafana Tempo for trace storage and querying
Grafana Alloy as the collector and forwarder
Grafana for visualization, with native Tempo datasource support and log/trace correlation

If you’re already running Prometheus and Grafana for metrics, adding Tempo for traces is a natural extension of the same stack. You can use the same agent, dashboards, and query interface. You’re adding one more signal type, but no new tooling paradigm.

The monitoring stack I run for Irin clients is the same stack I use to observe both Irin and my private infrastructure. It’s what lets me catch instrumentation gotchas and gives me a reliable view of all of my systems. I built Irin because I believe that monitoring your system shouldn’t be a full-time job. If the monitoring stack does what it’s supposed to, you should be able to check it intermittently through the day. It should tell you at a glance if something’s wrong, and send an alert if the problem merits it. If it’s noisy, crowded, and you don’t know where to begin when there’s a problem, the system doesn’t work, and the real problems get drowned out.