Uptime Kuma Tells You That a Service Broke, Not Why.

June 22, 2026

Before I get into the post: this is not intended to disparage Uptime Kuma. It’s genuinely excellent software, and truly easy to use. If you run a homelab or a small fleet and you are not using it, you probably should be. It’s free, self-hosted, beautiful, and it does the thing it was built to do better than almost anything else at any price.

There’s always a “but,” though, so before we get to it, I’m going to spend a little time on what Uptime Kuma does well.

TL;DR: Uptime Kuma is excellent at telling you when a service becomes unreachable, but it cannot explain why a service is slow or unhealthy while still responding. That requires internal metrics from tools like Prometheus, Grafana, and Alloy. Reachability monitoring and systems monitoring solve different problems, and mature environments typically use both.

Where Uptime Kuma Excels

Uptime Kuma answers one question extremely well: is this reachable? It will ping a host, hit an HTTP endpoint and check the status code, watch a TCP port, validate a TLS certificate’s expiry, query a DNS record, check for a keyword on a page, watch a Docker container, even poke a game server. It checks on a tight interval, it shows you a clean history, and when something stops responding it fires a notification through basically any channel you can name, with ninety-plus notification integrations to choose from. Status pages you can hand to your users. Two-factor auth. A genuinely nice UI.

For “tell me the moment my website, my reverse proxy, my Plex, or my Home Assistant stops answering,” it is close to perfect. The interval is short, the setup time is measured in minutes, and there’s practically no maintenance required. It has earned a famously loyal userbase for a reason.

I’m not here to tell you that it’s not the answer to your problems, or to convince you to ditch it for something else. I still think everyone running infrastructure of any size should have something like it watching their endpoints, and I run it myself in a Proxmox container on my homelab. But there is a gap I noticed while using it. This post is about the specific moment when you ask Uptime Kuma a question it’s not designed to answer, and what you do when that moment arrives.

Growing Pains

As your homelab or business grows, there typically comes a time when your database starts feeling slow. Queries that used to be instant are taking a little longer. It’s not a real problem yet, but you can tell that something is off.

The Uptime Kuma dashboards may be showing all green, the database port is answering, the HTTP healthcheck returns 200, and every light on the board is on. Uptime Kuma is correctly reporting that the service is up.

It’s not wrong. The service is reachable, but “reachable” and “healthy” mean different things, and as systems grow, they can end up living in the space between the two. If the disk that your database lives on is pinned at 100% IO utilization because a backup job and a big query are fighting over it, your queries queue behind that contention, yet from the outside the port still answers in time to pass the check. The board is green and the database is slow, and there’s no contradiction.

Uptime Kuma doesn’t see any of that, and the reason is not a missing feature; it’s the architecture. It checks your systems from the outside looking in, and it has no way to see what’s happening inside your servers: what the disk, memory, CPU, and kernel are actually doing.

What you need at that moment is something standing inside the box, analyzing the system from within, which is a different category of tool.

Reachability versus Internals

There are two types of monitoring in play here, reachability monitoring and systems monitoring, and the difference between them explains why mature setups typically end up using both.

Reachability monitoring (Uptime Kuma) asks very basic questions:

Can I get to it?
Is the port open, the page loading, the cert valid, the container running?

It only reports what it can see externally, and that’s exactly what you need when the question is whether users can reach the service. It’s easy, simple, and honest about what it knows.
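To make the outside-in idea concrete, here is a minimal Python sketch of the kind of check a reachability monitor performs. This is not Uptime Kuma’s code, only an illustration; the function name `tcp_reachable` and the two-second timeout are my own assumptions.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration: it only learns whether the port
    accepted a connection, nothing about what the machine is doing inside.
    """
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

Notice that the answer is a single yes or no. A database sitting on a saturated disk will usually still accept the connection, so a check like this stays green while queries crawl.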

Systems monitoring (the Prometheus world) asks questions that are a little more complex:

What is going on inside the machine?
How busy is each CPU core?
How much memory is actually available once you account for cache?
What is the disk IO utilization, the queue depth, the read and write latency?
How much network throughput, how many dropped packets?
Is memory slowly leaking over days?

It is an internal view that answers questions about why a service is behaving the way it is.
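As a small illustration of an inside-the-box question, here is a sketch that answers “how much memory is actually available once you account for cache?” On Linux the kernel publishes a `MemAvailable` estimate in `/proc/meminfo` for exactly this purpose. The function name and the idea of passing the file contents in as a string are my own choices for the example, not something any particular agent does.

```python
def mem_available_fraction(meminfo_text: str) -> float:
    """Return MemAvailable / MemTotal from the text of /proc/meminfo.

    Hypothetical helper: takes the file contents as a string so it can be
    tried on any machine, not only on the Linux box being inspected.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token is the number; memory fields are reported in kB.
            fields[key.strip()] = int(parts[0])
    return fields["MemAvailable"] / fields["MemTotal"]


# On a Linux host you could call it like this:
# with open("/proc/meminfo") as f:
#     print(mem_available_fraction(f.read()))
```

No outside-in check can compute this number, because it only exists on the machine itself.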

Neither monitoring system replaces the other. Reachability tells you that something is wrong. Systems metrics tell you why. The database scenario above needs both: Uptime Kuma to eventually notice when the slowness becomes an actual outage, and system metrics to explain the slowness before it becomes a problem.

The Internal View

The standard way to get the inside view on Linux is a tiny agent called node_exporter. It is a small binary that runs on the box, reads metrics straight from the kernel, and exposes them for a time-series database (Prometheus) to collect. Pair it with Grafana for dashboards, and for logs, pair Loki with a shipper. The traditional choice there was Promtail, though Grafana has since put Promtail into long-term support and now steers you toward Grafana Alloy, which can handle both metrics and logs in a single agent. I’ve written a comparison here.
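If you are curious what “exposes them” looks like in practice: node_exporter serves its metrics as plain text over HTTP (by default on port 9100 at /metrics), one sample per line. The sketch below is a deliberately simplified parser for that kind of text, written only to show the shape of the data. It assumes lines carry no trailing timestamp and it keeps the label set as part of the key; the function name is hypothetical, and a real setup would let Prometheus or Alloy do the scraping.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse simple 'series value' lines from a metrics page into a dict.

    Simplified sketch: skips comment lines (# HELP / # TYPE) and assumes
    each sample line ends with its value, with no timestamp after it.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Example against a local agent (assumes node_exporter on its default port):
# from urllib.request import urlopen
# page = urlopen("http://localhost:9100/metrics", timeout=5).read().decode()
# print(parse_metrics(page)["node_memory_MemAvailable_bytes"])
```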

With either node_exporter or Alloy running, the database scenario stops being a mystery. For the exact moment things felt slow, you can pull up:

Disk IO utilization on that box, and watch it pin to 100% right when the slowness started.
The specific disk and the read/write split, so you can see it was the backup volume contending with queries.
CPU broken out by mode, so you can rule out CPU as the cause.
Memory availability over the past week, so you can see whether pressure had been building.
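The disk utilization figure in that list comes from a counter, not a gauge. node_exporter reports the cumulative seconds a disk has spent busy (the metric is named node_disk_io_time_seconds_total), and utilization is how fast that counter grows. The sketch below shows the arithmetic with two samples. It is a simplification of what a PromQL rate() does, and the `Sample` type and function name are my own for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # when the counter was read, in seconds
    value: float      # cumulative busy seconds reported by the counter


def busy_fraction(earlier: Sample, later: Sample) -> float:
    """Fraction of wall-clock time the disk was busy between two samples."""
    elapsed = later.timestamp - earlier.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    delta = later.value - earlier.value
    if delta < 0:
        # Counter went backwards, e.g. after a reboot: treat it as restarted
        # from zero, a simplified version of counter-reset handling.
        delta = later.value
    return min(delta / elapsed, 1.0)


# 57 busy seconds accumulated over a 60 second window is 95% utilization.
print(busy_fraction(Sample(0.0, 100.0), Sample(60.0, 157.0)))
```

A value sitting near 1.0 for minutes at a time is the “pinned at 100%” picture from the database story.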

And if you have Loki collecting logs alongside the metrics, you can line up the disk IO spike against the log line where the backup job kicked off, and the whole story assembles itself in one view. Uptime Kuma tells you that the service is up, and the system metrics tell you the backup job is strangling your database disk, which is what you need to know to fix it before it affects production.

(If you’re interested, I wrote a separate post here about the five most important PromQL queries and Grafana panels you need to monitor a Linux server.)

Setting up a stack to gain access to what’s happening inside each of your servers is not as simple as setting up Uptime Kuma. That simplicity is a real part of why Uptime Kuma is so loved, and moving to system metrics comes with a cost.

node_exporter or Alloy needs to be installed on every server, with Prometheus running somewhere in your infrastructure to collect from them. Grafana dashboards must be built or imported from the Grafana community and then tweaked to fit your setup to make them readable instead of overwhelming. Alert rules must be written to fire on real problems without creating noise. Metric and possibly log retention must be set up, and then the monitoring stack needs consistent maintenance.

This is the irony nobody warns you about: you now have a monitoring stack that itself needs monitoring, which is partly why you want predictive disk alerts on the box running Prometheus.
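A predictive disk alert is less magic than it sounds. The usual idea is to fit a straight line through recent free-space readings and ask when that line hits zero, which is roughly what PromQL’s predict_linear() does for you. Here is a minimal sketch of that calculation, assuming evenly trustworthy samples; the function name and the sample format are my own for illustration.

```python
from typing import List, Optional, Tuple


def seconds_until_full(samples: List[Tuple[float, float]]) -> Optional[float]:
    """Estimate seconds from the last sample until free space reaches zero.

    samples: (timestamp_seconds, free_bytes) pairs in time order.
    Fits a least-squares line; returns None if free space is not shrinking
    or there is not enough data to fit a line.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope >= 0:
        return None  # free space is flat or growing, nothing to predict
    intercept = mean_v - slope * mean_t
    time_at_zero = -intercept / slope
    return time_at_zero - samples[-1][0]


# Losing 10 bytes every 10 seconds with 80 bytes left: about 80 seconds to go.
print(seconds_until_full([(0.0, 100.0), (10.0, 90.0), (20.0, 80.0)]))
```

An alert would then fire when the estimate drops below some threshold you pick, such as a few days, so you hear about the disk before it fills.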

None of this is hard, but it is an ongoing process with no end, and it is a different commitment than the near-zero maintenance of an Uptime Kuma container you set up once and edit when new services come online.

Uptime Kuma is the right tool for reachability and status pages, and it costs you almost nothing to run. System metrics come with a higher cost, and they become relevant when you start wondering why services that still show as up on the Uptime Kuma dashboards aren’t responding the way they should.

Most monitoring journeys follow the same progression. First you want to know if something is down. Later you want to know why it is slow. Eventually you want to know that it is going to become a problem before users notice. Each step requires more visibility than the last.

The Pitch

When you need greater visibility into why you’re having problems, you have two viable options: run the metrics/logging stack yourself (node_exporter/Promtail/Alloy, Prometheus, Grafana, Loki), which works if you have a team member to stand it up and maintain it, or let someone run it for you.

That second path is the reason I built Irin Observability. You get the inside view, run as a managed service, for small teams and homelabs that have outgrown pure reachability checks but do not want a second full-time job maintaining a metrics stack. A lightweight Alloy agent goes on each box, your dashboards and alerts come pre-built and tuned, and the Prometheus, Grafana, and Loki side lives on my infrastructure instead of yours. It is meant to work alongside services like Uptime Kuma, not to replace them. Reachability and internals are different jobs, and the mature answer is to do both.

The thing I actually want you to take away is the distinction between reachability and systems monitoring, because that understanding will outlive any particular tool. Outside-in tells you something broke, while inside-out can tell you why. Uptime Kuma is one of the best outside-in tools ever made, and it will happily keep doing that job for you forever. It just wasn’t built to be the tool that explains the why of your problems.