What We Built on the Spare Server

Field Notes from 2026-05-03. One real blind spot, two tools fragmenting or aging out, a stack of terminals to coordinate. One tiny dormant server. The Grafana platform that pulled it all onto one screen.

A customer ticket came in last month with a Redis error message in the WordPress admin: OOM command not allowed when used memory > 'maxmemory'. Five minutes of work to fix that one site — raise the per-user Redis cap, switch the eviction policy away from the upstream noeviction default. Eight hours of work to fix everything we found while we were in there. Pulling on the thread surfaced multiple sites whose object cache had been silently broken for weeks — cache writes returning false, every read a miss, no error reaching the admin notice area, no monitor trip, just degraded performance indefinitely. The earliest had been running that way for 76 days.

We documented all of that in silent Redis cache failures. But the question that didn’t go away: how many other sites are in some other silent-degraded state right now, and we just don’t have a way to ask?

That question — plus two adjacent ones we’d been answering with tools that were either fragmenting across terminals or aging out of maintenance — is what triggered this build. This post walks through what was scattered, the substrate problem we had to solve first, the platform we built on top of it, and the three dashboards we now check daily. It runs for $0 in marginal cost on a 1 vCPU box that was already paid for.

What was scattered, what was aging out, and one real blind spot

Three answers we already had, just not on one screen and not over time. Plus one we genuinely didn’t have at all, which is what kicked the build off.

Per-tenant resource attribution. top -c on every fleet host has always told us which Plesk subscription is using the most CPU and memory right now. We have a Termius window open at all times with the fleet hosts running top -c side by side — that’s how any host investigates a load spike, ours included. What we didn’t have was history. Yesterday at 3 a.m. we couldn’t tell you who was hammering which box, because nobody was watching top at 3 a.m. and nothing was recording. We also didn’t have per-Plesk-subscription disk I/O in any tool — top's per-process I/O view doesn’t cleanly aggregate to a tenant when a single subscription spawns dozens of PHP-FPM workers.

Backup state. Three independent backup layers run across the fleet: WordPress Toolkit nightly per-site to AWS S3, Plesk server-level disaster recovery to Wasabi (full restore of any site, any file, any database), and UpCloud’s own daily server-snapshot service at the hypervisor level (whole-VM rollback if a host catastrophically fails). We’ve never lost a client restore in eighteen years — that redundancy is exactly why. What we didn’t have was a single dashboard showing latest-success-per-site, per-server, per-layer, at a glance. Confirming the safety net held last night meant SSH plus log greps plus S3 listings, three different ways. Doable, but only when somebody decided to look.

Email reputation. Lightmeter Control Center ran on every fleet host for years and gave us bounce rates, queue depth, delivery analytics, and per-domain reputation flags from Postfix logs. It worked extremely well. It is also no longer maintained, and we’d been quietly looking for a successor before the gap between releases and reality got too wide. DMARC aggregate reports were a separate stream Lightmeter never parsed — those XML reports arrive in a mailbox the team only opens when something visibly breaks.

The actual blind spot: per-site Redis cache hit rate. No tool we ran exposed it. Which is exactly why the silent-Redis bug got past us — there was no metric anywhere to ask the question against, so the question never got asked.

One real blind spot. One tool aging out. One audit trail you had to hand-craft. One terminal grid that worked but couldn’t answer historical questions. That isn’t consolidation — it’s coordination, and coordination falls apart the moment everyone is busy. So we built consolidation.

The substrate problem (the part that’s actually hard)

Getting per-tenant resource attribution into Prometheus — not just visible to a human watching top, but queryable, time-series, alertable, and inclusive of disk I/O — turned out to be where most of the work lived. The other two areas were assembly. This one required us to fix something deeper first.

To get per-Plesk-subscription CPU, memory, and disk I/O on a fleet host into a metrics layer, you need to read from cgroup v2’s per-slice metrics. Specifically, /sys/fs/cgroup/user.slice/user-NNN.slice/io.stat for each Plesk subscription’s systemd user slice. Two things have to be true for this to work.

First, io.stat has to be available for each user slice. By default, cgroup v2 only delegates the cpu, memory, and pids controllers down to user slices. Disk I/O is not included unless you explicitly enable it. Second, the fix has to survive reboots and systemd reloads. A naive approach is what we tried first.

What we tried first vs. what actually works:

    # what we tried first -- reverted by systemd within seconds
    - echo +io > /sys/fs/cgroup/user.slice/cgroup.subtree_control

    # what actually works: /etc/systemd/system/user-.slice.d/50-io-accounting.conf
    + [Slice]
    + IOAccounting=yes

The naive write worked for about ten seconds, then the controller vanished from subtree_control. Took us three rounds of cat /sys/fs/cgroup/cgroup.subtree_control across two reboots before we accepted that systemd actively manages that file and was reverting our writes within seconds. The right fix is to declare the requirement at the unit level via a drop-in: IOAccounting=yes on user-.slice, then a single systemctl daemon-reload. systemd handles the controller delegation correctly across all reloads from then on. io.stat stays readable per slice, and the per-user I/O metrics we needed start flowing.

That single drop-in is the foundation everything else sits on. Without per-slice I/O attribution flowing into Prometheus, you can answer “who’s hammering this box right now” with top -c, but you can’t answer “who was hammering it at 3 a.m. last Thursday,” and you can’t graph it, alert on it, or attribute it cleanly when one subscription is running thirty PHP-FPM workers across the process tree.

The wiring

Once the substrate was real, the rest was assembly. Four moving parts, in sequence.

1. Per-server · Prometheus: Local Prometheus on each fleet host

Each fleet host runs a Prometheus instance scraping node_exporter for host metrics, a small custom cgroup exporter that reads /sys/fs/cgroup/user.slice and emits per-user CPU/memory/IO metrics tagged by Plesk subscription username, and per-vhost redis_exporter sidecars (one per Plesk subscription’s Redis socket, exposing hit rate, evictions, memory, connections).
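
There is nothing exotic inside that cgroup exporter. A minimal sketch of the shape it takes is below, assuming the standard prometheus_client library; the metric and label names are illustrative rather than what our production exporter emits, and a real version would use counter-typed metrics and tighter error handling.

    #!/usr/bin/env python3
    # Minimal sketch of a per-slice cgroup v2 exporter; names are illustrative.
    import glob
    import os
    import pwd
    import re
    import time

    from prometheus_client import Gauge, start_http_server

    CPU_USEC = Gauge("cgroup_user_cpu_usage_usec", "Cumulative CPU usage from cpu.stat", ["user"])
    MEM_BYTES = Gauge("cgroup_user_memory_bytes", "Current memory.current", ["user"])
    IO_READ = Gauge("cgroup_user_io_read_bytes", "Cumulative rbytes across devices", ["user"])
    IO_WRITE = Gauge("cgroup_user_io_write_bytes", "Cumulative wbytes across devices", ["user"])

    def collect():
        for slice_dir in glob.glob("/sys/fs/cgroup/user.slice/user-*.slice"):
            match = re.search(r"user-(\d+)\.slice$", slice_dir)
            if not match:
                continue
            uid = int(match.group(1))
            try:
                user = pwd.getpwuid(uid).pw_name  # the Plesk subscription's system user
            except KeyError:
                user = str(uid)

            # cpu.stat: "usage_usec N" plus user/system breakdowns
            try:
                with open(os.path.join(slice_dir, "cpu.stat")) as fh:
                    for line in fh:
                        if line.startswith("usage_usec"):
                            CPU_USEC.labels(user).set(int(line.split()[1]))
            except OSError:
                pass

            # memory.current: a single byte count
            try:
                with open(os.path.join(slice_dir, "memory.current")) as fh:
                    MEM_BYTES.labels(user).set(int(fh.read().strip()))
            except OSError:
                pass

            # io.stat: one line per device, "MAJ:MIN rbytes=... wbytes=... ..."
            # Only present once IOAccounting=yes (the drop-in above) is in effect.
            rbytes = wbytes = 0
            try:
                with open(os.path.join(slice_dir, "io.stat")) as fh:
                    for line in fh:
                        for kv in line.split()[1:]:
                            key, _, value = kv.partition("=")
                            if key == "rbytes":
                                rbytes += int(value)
                            elif key == "wbytes":
                                wbytes += int(value)
                IO_READ.labels(user).set(rbytes)
                IO_WRITE.labels(user).set(wbytes)
            except OSError:
                pass

    if __name__ == "__main__":
        start_http_server(9101)  # port is arbitrary for the sketch
        while True:
            collect()
            time.sleep(15)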

2. Central · Federation: Central Prometheus federates from every fleet host

A central Prometheus instance on the dormant box (the one described under "The icing" below) federates from each fleet host's Prometheus over an SSH tunnel. One query lands the whole fleet. Federation is cheap because each target Prometheus does the aggregation work locally; the central instance just pulls the rolled-up series.
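
For reference, one federation job on the central instance looks roughly like the sketch below. The job name, the match[] selector, and the local tunnel port are illustrative; the load-bearing parts are metrics_path: /federate, honor_labels: true, and pointing the target at the local end of the SSH tunnel rather than at the fleet host directly.

    # prometheus.yml on the central box: one federation job per fleet host.
    scrape_configs:
      - job_name: 'federate_host01'            # name is illustrative
        scrape_interval: 60s
        honor_labels: true                     # keep the fleet host's own job/instance labels
        metrics_path: '/federate'
        params:
          'match[]':
            - '{job=~"node|cgroup|redis.*"}'   # pull only the series families we chart
        static_configs:
          - targets: ['127.0.0.1:19090']       # SSH tunnel local port, not the host itself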

3. Postgres · Business + aggregate: Postgres for what doesn't fit Prometheus's data model

Same box. Stores everything Prometheus is bad at: business KPIs (Stripe revenue, Mercury balance, WHMCS subscription health, sales pipeline forecast), per-site real-time visitor counts via IAWP fleet-wide, backup status history, mail delivery and DMARC. Prometheus is brilliant for "every 15 seconds, what's CPU on user X" but useless for "what's our 30-day MRR." Postgres is brilliant for the latter and middling at the former. Use both.
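
As a sketch of what that looks like in practice, the backup-status half of the store reduces to a narrow table plus a staleness query. The table and column names below are ours for illustration only; the real schema is richer.

    # Illustrative only: one row per site per backup layer; Grafana's Postgres
    # datasource drives the staleness panels from queries like the one below.
    import datetime

    import psycopg2

    conn = psycopg2.connect("dbname=observability")  # connection details assumed
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS backup_status (
                site        text        NOT NULL,
                server      text        NOT NULL,
                layer       text        NOT NULL,   -- 'wptk' | 'plesk_dr' | 'upcloud_snapshot'
                finished_at timestamptz NOT NULL,
                ok          boolean     NOT NULL
            )""")
        cur.execute(
            "INSERT INTO backup_status VALUES (%s, %s, %s, %s, %s)",
            ("example.com", "host01", "wptk",
             datetime.datetime.now(datetime.timezone.utc), True),
        )
        # "Sites needing attention" reduces to: whose newest good backup is too old?
        cur.execute("""
            SELECT site, layer, max(finished_at) AS last_ok
            FROM backup_status
            WHERE ok
            GROUP BY site, layer
            HAVING max(finished_at) < now() - interval '36 hours'
        """)
        print(cur.fetchall())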

4. Grafana · One pane of glass: Grafana on top of both datasources

Same box again. Reads from both Prometheus and Postgres. Dashboards are defined as code in versioned wire-*.py scripts, not authored through the UI. Re-running a wire script overwrites the dashboard deterministically. Diff-able, replicable, no "who edited what" drift across sessions.
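
A stripped-down wire script, to make the pattern concrete: build the dashboard as a Python dict, push it through Grafana's dashboard API with overwrite set, and let the stable uid make re-runs idempotent. The panel, the PromQL expression, and the token handling below are placeholders, not a copy of our production scripts.

    #!/usr/bin/env python3
    # wire-fleet-overview.py -- sketch of the "dashboards as code" pattern.
    # Re-running it overwrites the same dashboard (matched by uid), never duplicates it.
    import os

    import requests

    GRAFANA = os.environ.get("GRAFANA_URL", "http://127.0.0.1:3000")
    TOKEN = os.environ["GRAFANA_TOKEN"]  # service-account token, assumed

    dashboard = {
        "id": None,                      # let Grafana resolve by uid
        "uid": "fleet-overview",         # stable uid => deterministic overwrite
        "title": "Fleet Overview",
        "refresh": "30s",
        "panels": [
            {
                "type": "timeseries",
                "title": "Per-subscription CPU",
                "gridPos": {"x": 0, "y": 0, "w": 24, "h": 8},
                "targets": [{
                    # metric name matches whatever the cgroup exporter emits;
                    # rate of cumulative usec / 1e6 * 100 = percent of one core
                    "expr": "rate(cgroup_user_cpu_usage_usec[5m]) / 1e6 * 100",
                }],
            },
        ],
    }

    resp = requests.post(
        f"{GRAFANA}/api/dashboards/db",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"dashboard": dashboard, "overwrite": True, "message": "wire script"},
        timeout=30,
    )
    resp.raise_for_status()
    print(resp.json())

Because the whole dashboard definition lives in the script, a git diff of the wire script is a diff of the dashboard.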

That split (Prometheus for high-cardinality infrastructure, Postgres for business and aggregate, both behind one Grafana) is the core pattern. The "dashboards as code" rule was non-negotiable for us; we'd been burned before by clicking edits that nobody could later reproduce. Source-of-truth lives in git.

Three dashboards, one pane of glass

Three dashboards came out of the build. Each one consolidates an answer that used to be reachable but scattered — plus the one Redis-shaped gap that nothing was reaching at all.

Fleet Overview

The everyday operational dashboard. Top stats line: subscriptions monitored, Redis instances up, federation tunnels healthy, active visitors fleet-wide, and (the one that surprised us) the count of sites with low cache hit rate. Per-server panels for load, memory, and disk are filterable to one or all. The middle of the dashboard is what the cgroup-v2 work newly enabled: per-Plesk-subscription CPU%, memory, and disk I/O bytes/sec as time series alongside top-N bargauges across the fleet. top -c already told us who was using the box right now; the dashboard answers “who was using it at 3 a.m. yesterday” and surfaces disk I/O attribution that top’s per-process view never made easy. The Redis cache health table at the bottom sorts sites by lowest hit rate — the metric the silent-Redis post made us want to find, now self-serve and refreshed every 30 seconds.
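
The queries behind those panels are short. The two below are sketches: the cgroup series name follows the exporter sketch in the wiring section (production names may differ), while the Redis counters are the standard redis_exporter ones.

    # Top disk writers across the fleet, attributed to the Plesk subscription user
    topk(10, sum by (user) (rate(cgroup_user_io_write_bytes[5m])))

    # Per-site Redis hit rate over the last hour (standard redis_exporter counters)
    rate(redis_keyspace_hits_total[1h])
      / (rate(redis_keyspace_hits_total[1h]) + rate(redis_keyspace_misses_total[1h]))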

Backup Health

The “did our safety net hold?” dashboard. Three independent backup tracks (WP Toolkit per-site to AWS S3, Plesk server-level disaster recovery to Wasabi, UpCloud hypervisor-level daily server snapshots) all on one screen. A header strip summarizes last night across the fleet: sites backed up, sites stale or missing, DR servers OK, snapshot age per host. Below that, a “sites needing attention” panel names any backup that didn’t land cleanly, and a per-server status table surfaces age-of-oldest-backup. The bottom of the dashboard is a storage-class breakdown of the AWS bucket over time. We watched a multi-terabyte cliff drop happen on 2026-05-01 as a 7-day-retention lifecycle policy kicked in, with the daily-cost timeseries dropping in lockstep.
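
The WPTK column of that header strip boils down to "when did the newest object for each site land in the bucket." A sketch with boto3 is below; the bucket name and the one-prefix-per-site layout are assumptions for illustration, not our actual structure.

    # Sketch: newest WPTK backup object per site in S3.
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-wptk-backups"  # illustrative

    def latest_backup(site_prefix):
        newest = None
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=BUCKET, Prefix=site_prefix):
            for obj in page.get("Contents", []):
                if newest is None or obj["LastModified"] > newest["LastModified"]:
                    newest = obj
        return newest

    obj = latest_backup("example.com/")
    if obj:
        print(obj["Key"], obj["LastModified"])  # feed this into the backup_status table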

The win we didn’t expect: catching stale WPTK backups before they’d become anyone’s problem. Last month two sites had their WPTK cron silently stop after a plugin upgrade. Plesk’s server-level DR backups still had restore points for both (that’s exactly why we run three layers), but the dashboard caught the WPTK side within 24 hours and we re-armed it before the redundancy got tested. Belt and suspenders, both visible at a glance.

Mail Health

The reputation dashboard. Per-server outbound volume, bounce rate, queue depth, RBL listing status across the fleet. A “top bounced recipient domains” panel lets us spot “the gmail throttle is back” before clients call. These are the metrics Lightmeter Control Center gave us for years; the dashboard now gives us the same view with a maintenance path we control.

The eye-opener was the panel Lightmeter never had: per-domain DMARC aggregate-report pass rate over 30 days, sorted by worst-first. Aggregate-report parsing requires a separate XML-ingestion path Lightmeter didn’t do, so those reports had been stacking up in a mailbox we only opened reactively. The first time we rendered that panel, two domains came back with single-digit DMARC pass rates that the unread mailbox had been telling us about for weeks — both with a third-party sending source whose SPF or DKIM had drifted out of alignment. We covered the original email-auth fleet rollout in a separate piece; this dashboard is what makes that work observable continuously instead of on demand.
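
The ingestion path itself is small once the reports are unpacked. A sketch of the pass-rate computation is below; it assumes the aggregate-report XML files have already been pulled out of the mailbox and un-gzipped into a directory, and it counts a record as passing if either DKIM or SPF shows an aligned pass in policy_evaluated, which is how DMARC defines a pass.

    # Sketch: per-domain DMARC pass rate from a directory of unpacked aggregate reports.
    # Fetching the reports from the mailbox and un-gzipping them is omitted here.
    import glob
    import xml.etree.ElementTree as ET
    from collections import defaultdict

    passed = defaultdict(int)
    total = defaultdict(int)

    for path in glob.glob("/var/lib/dmarc-reports/*.xml"):  # directory is illustrative
        root = ET.parse(path).getroot()
        domain = root.findtext("policy_published/domain", default="unknown")
        for record in root.findall("record"):
            count = int(record.findtext("row/count", default="0"))
            dkim = record.findtext("row/policy_evaluated/dkim")
            spf = record.findtext("row/policy_evaluated/spf")
            total[domain] += count
            if dkim == "pass" or spf == "pass":  # either aligned pass counts
                passed[domain] += count

    # Worst-first, the same sort the dashboard panel uses
    for domain in sorted(total, key=lambda d: passed[d] / max(total[d], 1)):
        pct = 100.0 * passed[domain] / max(total[domain], 1)
        print(f"{domain}: {pct:.1f}% pass across {total[domain]} messages")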

The icing: $0 marginal cost

The whole platform runs on a 1 vCPU / 1 GB RAM UpCloud instance we were already paying for as a backstop. It had been provisioned for something else, then sat dormant for months while we considered what to do with it. Postgres + Prometheus + Grafana fit comfortably; CPU rarely tops 5%, memory sits around 600 MB.

So: zero incremental UpCloud spend, zero SaaS observability bill, zero per-host agent fees. The Datadog quote we got for fleet-wide WordPress observability at the kind of granularity we wanted was multiple thousands per month. Self-hosted, dashboards-as-code, on one tiny box we already owned: free.

The honest caveat: "free" assumes you have someone who knows how to run Postgres and Prometheus. We do. If you're a small business owner reading this, this is exactly the kind of work you outsource to your managed host instead of building yourself.

Alerting comes next, after baselines are real

We deliberately ran the platform without alerts for the first 30 days. Reason: alert thresholds set on day one are guesswork. After 30 days of clean historical data, we know what "normal" looks like. What's the actual weekday-vs-weekend hit rate baseline, how often does a Plesk subscription spike to 80% memory and recover on its own, what's a normal mail-queue depth at 2 a.m. UTC. With baselines, alerts can fire on real anomalies instead of drumming up false positives every other hour.

Alerting candidates, roughly in the order we'll wire them:

  1. Mail bounce rate above 5% sustained for an hour, on a single server.
  2. Backup older than 36 hours, any site.
  3. DMARC pass rate drops by more than 10 points week-over-week, any domain.
  4. Plesk subscription consuming over 50% CPU or 800 MB sustained (sketched as a rule after the list).
  5. Redis hit rate below 70% on a site with traffic above 10 visits per hour.
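
To make the fourth candidate concrete, a Prometheus rule file for it would look something like the sketch below. The metric names follow the exporter sketch from the wiring section, and the thresholds and for: windows are exactly the numbers we expect the 30-day baselines to revise.

    # alert-rules.yml -- illustrative; the thresholds are the candidates above, not final.
    groups:
      - name: fleet-tenants
        rules:
          - alert: SubscriptionMemoryHigh
            expr: cgroup_user_memory_bytes > 800 * 1024 * 1024
            for: 15m                     # "sustained" -- the exact window comes from baselines
            labels:
              severity: warning
            annotations:
              summary: "{{ $labels.user }} has held over 800 MB for 15 minutes"
          - alert: SubscriptionCpuHigh
            expr: rate(cgroup_user_cpu_usage_usec[10m]) / 1e6 > 0.5   # more than 50% of one core
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "{{ $labels.user }} has sustained over 50% CPU"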

That's the next chapter. For now the platform's job is to teach us the shape of normal.

What this means for clients

This isn't an internal-only tool. Every dashboard above is the substrate for what we tell clients in their monthly Website Health reports. The per-site backup status, the spam-blocked counts in making spam protection visible, the per-domain email send tracking in mailmon, the three-layer monitoring story in server monitoring. All of them read from this stack.

The shift

A year ago, "is your site OK?" was a question we answered by checking. Today, it's a question the dashboard already answered for us this morning. That's not a feature we're working on. That's the dashboard above, refreshed every 30 seconds, in production.

Is this you?

Three questions about your hosting situation. Each one maps directly to an answer that was scattered, aging out, or genuinely missing for us before this build — and that we now have on one screen.

  1. On your shared host, can you tell me right now which other tenant is consuming the most disk I/O — and what they were doing at 3 a.m. yesterday? “Right now” is answerable from a terminal if your host runs top. The historical question almost never is, and it’s the one that matters when the speed complaint comes in after the fact.
  2. When did your site last back up successfully, and how would you find out without opening a ticket? Dashboards beat tickets here. If your only path to that answer is asking your provider, the answer is whatever the provider says it is — which is fine until it isn’t.
  3. What’s your domain’s DMARC aggregate-report pass rate this month? If you can’t answer in seconds, those reports are likely arriving in a mailbox nobody is parsing. We’ve found client domains sitting at sub-50% pass rates that the unread aggregate-report mailbox was telling us about; the dashboard is what made them visible without ceremony.

If those answers made you uncomfortable, that’s exactly the operating posture we walked away from with this build. For every site we host, we can answer all three at a glance, and so can our clients in their monthly report.

Get in touch if you want to talk about WordPress hosting that operates at this level. We’ve been managing WordPress infrastructure for 18 years. Take a look at our managed WordPress hosting plans to see what’s included.

The Author

Ryan Davis
