The LLM narrates. The code decides.

June 25, 2026

Most of the “AI for observability” work I see right now hands the language model the judgment. I think that’s backwards. Feed it the alert, feed it some metrics, ask it what’s wrong, what should be done, and let it make the judgement call. Based on my experience working with language models, I decided that inverting the process provides better results.

The short version: in my alerting pipeline, the set of allowable classifications is fixed in deterministic Python, and the model has to pick from it. The LLM’s only job is to turn a structured verdict into an easily digestible sentence. It never decides whether something is bad, how bad it is, or what category of problem it is. It narrates within a decision space the code has already locked down.

// THE PROBLEM

I run a small managed monitoring service. Alertmanager fires, a webhook lands, and historically that webhook produced a line like HighMemoryUsage on host web-vm, severity warning, which is accurate, but not terribly helpful. The person reading it still has to know what HighMemoryUsage implies, whether this host always runs hot, and whether to care. I wanted plain-English context attached to the alerts without altering the alert delivery process.

The obvious move was to throw the whole alert at an LLM and ask it to explain. I tried that in the first iteration of this experiment, expecting it to be somewhat accurate, but not entirely reliable, and it did not disappoint, the model was confidently inconsistent. The same alert, fired three times, produced three different “root cause” categories. One run called a test alert a “Configuration or setup issue,” the next called it “Configuration/Testing,” the next something else again. If you’re storing that output to do any kind of aggregation later (I am, I want to know when three different clients hit the same class of problem in the same week), free-form model output fragments into noise. Grouping on a field the model changes at random won’t work.

I kind of knew from the onset that I wouldn’t get amazing results, and that it would be harder than it looked. So I started doing some research, and decided to flip the design. The little voice in the back of my head was right all along, don’t let the model make the decisions.

// THE SPLIT

I created a pipeline that has a hard wall down the middle.

On the deterministic side, Python does the classifying. The output is constrained to an eight-value enum: memory_pressure, cpu_saturation, disk_pressure, service_unavailability, network_issue, configuration_error, external_dependency, unknown. I aggregate on that field because it can only ever be one of eight strings. If nothing fits, the answer is unknown, which is itself a useful signal rather than a hallucinated/variant guess.

On the narrative side, the LLM (llama3:8b, running locally on a box on my own LAN, data/network secure) must choose its classification from that fixed eight-value set, and it writes two short fields alongside it: what the alert is, in plain English, and what it means operationally. The code defines the shape of the answer; the model only fills in a slot that already exists. It is explicitly instructed not to suggest fixes and not to invent a cause, so it performs translation instead of analysis.

The prompt returns strict JSON, grammar-constrained, so I get {what, means, likely_cause_class} every time and the enum value is validated against the allowed set on the way out. If the model returns something off-list, I capture the bug instead of storing a row.

// CONTEXT HYDRATION

A naive version of even the narration step gets you alarmist prose. When tuning the system I received a DiskFillPredicted alert, which on its face looked worth investigating. Then I looked at the host in question, which has had a flat disk-utilization baseline for months. The prediction was a rounding artifact, “your disk is about to fill” is actively misleading. The model had no way to know that from the alert alone, so it just wrote something.

I fixed it by giving the model the same context a human would look at before reacting. Prior to the LLM call, Python does a fast lookup against the metrics backend for that host’s recent baseline, and the prompt carries an explicit rule: if historical context is provided, weigh it over the alert’s literal text. A predicted-disk-fill on a host with a stable months-long baseline is informational, not urgent.

The latency budget for that lookup is five seconds. The LLM call itself takes about eighty seconds, because this runs deliberately on modest CPU-only hardware, a 4th generation Intel i7 with 16GB RAM and no GPU. That is a choice, not a constraint I am apologizing for: the whole posture of the service is that nothing leaves my LAN, so a slow local model beats a fast remote one. And the eighty seconds never reaches the person being alerted. Because the annotator rides alongside the existing path (more on that below), the raw alert lands in Slack and email instantly; the narrated version shows up as a separate annotation a minute or so later. Nothing is ever waiting on the model. Against that eighty-second call, five seconds of pre-fetch is under seven percent overhead and invisible. What mattered most was that if the metrics backend is slow or unreachable, the system fails immediately and falls back to the un-hydrated path, so the enrichment step can never block the result.

// FAIL-CLOSED

That fallback instinct runs through the whole process, the ethos is that the annotator is additive, a ‘nice-to-have.’ Alertmanager routes to it with continue: true, so it sits alongside the existing Slack and email delivery processes, and can never block them. The webhook always returns 200, even when the LLM box is down or when the JSON is malformed, and a static fallback annotation gets used instead. The worst case is that the end-user gets a less flowery alert, but never misses one. The narration is an amenity layered on top of a delivery path that doesn’t depend on it at all.

// ADVICE, FOR WHAT IT’S WORTH

“Let the model narrate, not analyze” is the easy version of the lesson, and it’s true, but it isn’t the hard part. The hard part is the enforcement: constraining the output to a fixed set you can validate, and keeping the model off the critical path so its failures cost you prose and never an alert. The model’s value is fluency, not judgment, and fluency is the most replaceable thing in the stack. Push every actual decision into code you can test, constrain anything you’ll later aggregate down to an enum, and treat the model as the last, most replaceable stage in the pipeline. If you can swap the model out tomorrow and your data stays clean, you’ve drawn the line in the right place.

Mine’s been running against live infrastructure for a few weeks. The prose is good, I’m sure when the hardware is upgraded and a more powerful model is put in place it will be better, but the reason I trust (tentatively) it is that the prose isn’t doing the work.