Field Notes from 2026-05-03. One Redis bug. Three blind spots. One tiny dormant server. The platform that closed the gap.
A few weeks ago, a client mentioned their site felt sluggish. We dug in. Redis was running. The LiteSpeed Cache plugin's object-cache.php drop-in was in place. Everything looked correct. Except the cache was silently doing nothing: a Plesk extension upgrade had overwritten our patched version of the drop-in, and WordPress had quietly fallen back to its default non-persistent, per-request object cache. No errors. No alerts. Just a slow site, with no surface symptom telling us why.
We documented the fix in silent Redis cache failures. But the question that didn't go away: how many other sites in our fleet are in this exact state right now, and we just don't know it?
That question, plus two adjacent ones we already knew we couldn't answer, is what triggered this build. This post walks through what we couldn't see, the substrate problem we had to solve first, the platform we built on top of it, and the three dashboards we now check daily. It runs for $0 in marginal cost on a 1 vCPU box that was already paid for.
What we couldn't see
Three blind spots. Each one was the kind of thing that "wasn't on fire so it stayed invisible," which is the worst possible posture for an operations team that markets itself as proactive.
Per-tenant resource attribution. Plesk showed per-domain disk quota. New Relic showed host-level CPU and memory. Neither answered the question we actually get asked: which customer is causing this load right now? When a shared-host customer complained about speed and we looked at the host's CPU graph and saw 60% sustained, we had no built-in way to know which of the 30 sites on that box was consuming most of it.
Backup state. WordPress Toolkit ran nightly backups to AWS S3 for every site. Plesk ran weekly disaster-recovery backups to Wasabi for every server. They ran. Mostly. Sometimes they didn't, and we'd find out because a client asked for a restore and the restore wasn't there. There was no audit trail in front of us showing the latest success per site.
Email reputation. Mail flowed through Postfix on every host, hundreds of messages a day per server, fanning out to gmail.com, outlook.com, yahoo.com. Bounce rates and DMARC pass rates lived in maillog files we'd grep manually only when something visibly broke. The DMARC reports themselves arrived in a mailbox we weren't reading.
Anything that wasn't on fire was invisible. That's not the operating posture we wanted, so we built the one we did want.
The substrate problem (the part that's actually hard)
Per-tenant resource attribution turned out to be where most of the work lived. The other two blind spots had straightforward solutions; this one required us to fix something deeper first.
To get per-Plesk-subscription CPU, memory, and disk I/O on a fleet host, you need to read from cgroup v2's per-slice metrics. Specifically, /sys/fs/cgroup/user.slice/user-NNN.slice/io.stat for each Plesk subscription's systemd user slice. Two things have to be true for this to work.
First, io.stat has to exist for each user slice. By default, systemd only delegates the cpu, memory, and pids controllers down to user slices; the io controller is not enabled unless you ask for it explicitly. Second, the fix has to survive reboots and systemd reloads. The naive approach, writing +io into the relevant cgroup.subtree_control files by hand, is what we tried first.
That write worked for about ten seconds, then the controller vanished from subtree_control again. It took us three rounds of cat /sys/fs/cgroup/cgroup.subtree_control across two reboots before we accepted that systemd actively manages that file and was reverting our writes within seconds. The right fix is to declare the requirement at the unit level via a drop-in: IOAccounting=yes on user-.slice, then a single systemctl daemon-reload. From then on systemd handles the controller delegation correctly across every reload, io.stat stays readable per slice, and the per-user I/O metrics we needed start flowing.
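For concreteness, here's a minimal sketch of that fix in script form. The drop-in directory and file name follow the standard systemd convention for prefix drop-ins; the exact names are our choice, nothing mandates them.

```python
#!/usr/bin/env python3
"""Sketch: persistently enable I/O accounting for every systemd user slice."""
import pathlib
import subprocess

# Standard systemd prefix-drop-in location; the file name is arbitrary.
dropin_dir = pathlib.Path("/etc/systemd/system/user-.slice.d")
dropin_dir.mkdir(parents=True, exist_ok=True)

# Declaring IOAccounting at the unit level tells systemd to delegate the io
# controller to every user-NNN.slice and to keep it delegated across reloads.
(dropin_dir / "10-io-accounting.conf").write_text("[Slice]\nIOAccounting=yes\n")

# One reload is enough; no reboot required.
subprocess.run(["systemctl", "daemon-reload"], check=True)
```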
That single drop-in is the foundation everything else sits on. Without per-user I/O attribution, you can't answer "which client is hammering this box right now?" That's the question that matters when a shared-host customer calls about speed.
The wiring
Once the substrate was real, the rest was assembly. Four moving parts, in sequence.
Local Prometheus on each fleet host
Each of the six fleet servers (001 through 006) runs a local Prometheus instance scraping three kinds of targets:
- node_exporter, for host-level metrics.
- A small custom cgroup exporter that reads /sys/fs/cgroup/user.slice and emits per-user CPU, memory, and I/O metrics tagged by Plesk subscription username.
- Per-vhost redis_exporter sidecars, one per Plesk subscription's Redis socket, exposing hit rate, evictions, memory, and connections.
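As a rough sketch of what that cgroup exporter does (the metric names, port, and uid-to-username mapping here are illustrative, not our production code):

```python
#!/usr/bin/env python3
"""Illustrative sketch of the per-subscription cgroup exporter."""
import pwd
import re
import time
from pathlib import Path

from prometheus_client import Gauge, start_http_server

SLICE_ROOT = Path("/sys/fs/cgroup/user.slice")

CPU_USEC = Gauge("cgroup_user_cpu_usage_usec", "Cumulative CPU time (usec)", ["user"])
MEM_BYTES = Gauge("cgroup_user_memory_bytes", "Current memory usage (bytes)", ["user"])
IO_READ = Gauge("cgroup_user_io_read_bytes", "Cumulative bytes read", ["user"])
IO_WRITE = Gauge("cgroup_user_io_written_bytes", "Cumulative bytes written", ["user"])


def collect() -> None:
    for slice_dir in SLICE_ROOT.glob("user-*.slice"):
        m = re.search(r"user-(\d+)\.slice$", slice_dir.name)
        if m is None:
            continue
        try:
            user = pwd.getpwuid(int(m.group(1))).pw_name  # Plesk subscription's system user
        except KeyError:
            continue

        # cpu.stat has a "usage_usec NNN" line; memory.current is a single number.
        for line in (slice_dir / "cpu.stat").read_text().splitlines():
            if line.startswith("usage_usec"):
                CPU_USEC.labels(user).set(int(line.split()[1]))
        MEM_BYTES.labels(user).set(int((slice_dir / "memory.current").read_text()))

        # io.stat only exists once IOAccounting is delegated (the drop-in above).
        # Format: one line per device, "MAJ:MIN rbytes=... wbytes=... rios=... wios=..."
        rbytes = wbytes = 0
        io_stat = slice_dir / "io.stat"
        if io_stat.exists():
            for line in io_stat.read_text().splitlines():
                fields = dict(kv.split("=") for kv in line.split()[1:])
                rbytes += int(fields.get("rbytes", 0))
                wbytes += int(fields.get("wbytes", 0))
        IO_READ.labels(user).set(rbytes)
        IO_WRITE.labels(user).set(wbytes)


if __name__ == "__main__":
    start_http_server(9201)  # arbitrary port for the sketch
    while True:
        collect()
        time.sleep(15)
```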
Central Prometheus federates from all six
A seventh Prometheus instance on upcl-nyc1-000 (the dormant box; more on it in the cost section below) federates from each fleet host's Prometheus over an SSH tunnel. One query now spans all six servers. Federation is cheap because each fleet-host Prometheus does the aggregation work locally; the central instance just pulls the rolled-up series.
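A minimal sketch of the central instance's side of that arrangement; the tunnel ports, job names, and match expression are illustrative, not our shipped config:

```yaml
# prometheus.yml on upcl-nyc1-000 (sketch; one locally forwarded SSH-tunnel port per fleet host)
scrape_configs:
  - job_name: "federate-fleet"
    honor_labels: true            # keep the original job/instance labels from each fleet host
    metrics_path: /federate
    params:
      "match[]":
        - '{job=~"node|cgroup|redis"}'   # pull only the series we actually dashboard
    static_configs:
      - targets:
          - "localhost:19001"     # tunnel to upcl-nyc1-001's Prometheus
          - "localhost:19002"     # ...one forwarded port per fleet host, through 006
          - "localhost:19006"
```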
Postgres for what doesn't fit Prometheus's data model
Same box. Stores everything Prometheus is bad at: business KPIs (Stripe revenue, Mercury balance, WHMCS subscription health, sales pipeline forecast), per-site real-time visitor counts via IAWP fleet-wide, backup status history, mail delivery and DMARC. Prometheus is brilliant for "every 15 seconds, what's CPU on user X" but useless for "what's our 30-day MRR." Postgres is brilliant for the latter and middling at the former. Use both.
Grafana on top of both datasources
Same box again. Reads from both Prometheus and Postgres. Dashboards are defined as code in versioned wire-*.py scripts, not authored through the UI. Re-running a wire script overwrites the dashboard deterministically. Diff-able, replicable, no "who edited what" drift across sessions.
That split (Prometheus for high-cardinality infrastructure, Postgres for business and aggregate, both behind one Grafana) is the core pattern. The "dashboards as code" rule was non-negotiable for us; we'd been burned before by clicking edits that nobody could later reproduce. Source-of-truth lives in git.
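In practice a wire script is little more than "build the dashboard JSON, POST it to Grafana with overwrite set." A stripped-down sketch; the Grafana URL, token handling, datasource uid, and the single panel shown are illustrative:

```python
#!/usr/bin/env python3
"""Sketch of the wire-*.py pattern: build dashboard JSON, push it, overwrite in place."""
import os

import requests

GRAFANA_URL = "http://localhost:3000"        # Grafana runs on the same box
TOKEN = os.environ["GRAFANA_API_TOKEN"]      # service-account token, kept out of git

dashboard = {
    "id": None,
    "uid": "fleet-overview",   # a stable uid means re-running overwrites the same dashboard
    "title": "Fleet Overview",
    "timezone": "utc",
    "panels": [
        {
            "type": "timeseries",
            "title": "CPU% by user",
            "gridPos": {"h": 8, "w": 24, "x": 0, "y": 0},
            "datasource": {"type": "prometheus", "uid": "central-prometheus"},
            "targets": [
                # usec/sec of CPU time, divided by 1e4 to express percent of one core
                {"expr": "sum by (user) (rate(cgroup_user_cpu_usage_usec[5m])) / 1e4"}
            ],
        }
    ],
}

resp = requests.post(
    f"{GRAFANA_URL}/api/dashboards/db",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"dashboard": dashboard, "overwrite": True, "message": "wired from code"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["url"])
```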
Three dashboards, three layers of visibility
Three dashboards came out of the build. Each closes one of the three blind spots.
Fleet Overview
The everyday operational dashboard. Top stats line: subscriptions monitored, Redis instances up, federation tunnels healthy, active visitors fleet-wide, and (the one that surprised us) the count of sites with low cache hit rate. Per-server panels for load, memory, and disk are filterable to a single server or the whole fleet. The middle of the dashboard is the part the cgroup-v2 work made possible: bar gauges of top users by CPU and memory across the fleet, plus CPU% and I/O bytes/sec by user as time series. We can finally point at a row and say "that one is the cause." The Redis cache health table at the bottom sorts sites by lowest hit rate. The list of suspect silent-cache sites that the original Redis post made us want to find is now self-serve, refreshed every 30 seconds.
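That cache health table boils down to one PromQL expression over redis_exporter's keyspace counters. A sketch of the same query run against the central Prometheus from a script; the port and the instance-to-site labelling are assumptions:

```python
#!/usr/bin/env python3
"""Sketch: the 'lowest hit rate first' cache health list, queried from central Prometheus."""
import requests

PROM = "http://localhost:9090"   # central Prometheus on upcl-nyc1-000 (assumed port)

# Hit rate per Redis instance over the last hour, ascending (worst first).
QUERY = """
sort(
  rate(redis_keyspace_hits_total[1h])
  / (rate(redis_keyspace_hits_total[1h]) + rate(redis_keyspace_misses_total[1h]))
)
"""

resp = requests.get(f"{PROM}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    site = sample["metric"].get("instance", "unknown")
    hit_rate = float(sample["value"][1])
    flag = "  <-- possible silent-cache suspect" if hit_rate < 0.70 else ""
    print(f"{site:40s} {hit_rate:6.1%}{flag}")
```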
Backup Health
The "did our safety net hold?" dashboard. Three independent backup tracks (WP Toolkit to AWS S3 per site, Plesk to Wasabi per server, Postgres self-backup) all on one screen. A header strip summarizes last night across the fleet: sites backed up, sites stale or missing, DR servers OK, Postgres backup age. Below that, a "sites needing attention" panel names any backup that didn't land cleanly, and a per-server status table surfaces age-of-oldest-backup. The bottom of the dashboard is a storage-class breakdown of the AWS bucket over time. We watched a multi-terabyte cliff drop happen on 2026-05-01 as a 7-day-retention lifecycle policy kicked in, with the daily-cost timeseries dropping in lockstep.
The win we didn't expect: catching stale backups before a client needs the restore. Last month we had two sites where the WPTK cron silently stopped running after a plugin upgrade. The dashboard caught both within 24 hours. The clients never knew.
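Under the hood, the "sites needing attention" panel is essentially one Postgres query. A minimal sketch; the table and column names are illustrative (the real schema lives in the wire scripts), and the 36-hour cutoff matches the alerting candidate listed further down:

```python
#!/usr/bin/env python3
"""Sketch of the 'sites needing attention' query, run outside Grafana."""
import psycopg2

conn = psycopg2.connect("dbname=fleet host=localhost")  # assumed DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT site, track, max(completed_at) AS last_ok
        FROM backup_runs
        WHERE status = 'success'
        GROUP BY site, track
        HAVING max(completed_at) < now() - interval '36 hours'
        ORDER BY last_ok ASC
        """
    )
    for site, track, last_ok in cur.fetchall():
        print(f"{site:35s} {track:12s} last good backup: {last_ok:%Y-%m-%d %H:%M}")
```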
Mail Health
The reputation dashboard. Per-server outbound volume, bounce rate, queue depth, RBL listing status across all six servers. A "top bounced recipient domains" panel lets us spot "the gmail throttle is back" before clients call. The eye-opener was the per-domain DMARC pass rate over 30 days, sorted by worst-first.
The first time we rendered that panel, two domains came back with single-digit DMARC pass rates we had been completely unaware of. Both had a third-party sending source whose SPF or DKIM records had drifted out of sync with how the provider was actually sending, so legitimate mail was failing DMARC. Without the dashboard, those would have stayed broken until somebody noticed an email never arrived. We covered the original email-auth fleet rollout in a separate piece; this dashboard is what makes that work observable in production.
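The worst-first pass-rate panel is, again, one Postgres query over the parsed aggregate reports. A sketch with an illustrative schema; we assume each parsed report row carries a message count and an overall pass/fail flag:

```python
#!/usr/bin/env python3
"""Sketch of the worst-first DMARC pass-rate panel (illustrative schema)."""
import psycopg2

conn = psycopg2.connect("dbname=fleet host=localhost")  # assumed DSN
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT domain,
               100.0 * sum(msg_count) FILTER (WHERE dmarc_pass)
                     / sum(msg_count) AS pass_pct
        FROM dmarc_records
        WHERE report_date > now() - interval '30 days'
        GROUP BY domain
        ORDER BY pass_pct ASC
        """
    )
    for domain, pass_pct in cur.fetchall():
        print(f"{domain:35s} {pass_pct:5.1f}% DMARC pass")
```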
The icing: $0 marginal cost
The whole platform runs on upcl-nyc1-000, a 1 vCPU / 1 GB RAM UpCloud box we were already paying for as a backstop. It had been provisioned for something else, then sat dormant for months while we considered what to do with it. Postgres + Prometheus + Grafana fit comfortably; CPU rarely tops 5%, memory sits around 600 MB.
So: zero incremental UpCloud spend, zero SaaS observability bill, zero per-host agent fees. The Datadog quote we got for fleet-wide WordPress observability at the kind of granularity we wanted was multiple thousands per month. Self-hosted, dashboards-as-code, on one tiny box we already owned: free.
The honest caveat: "free" assumes you have someone who knows how to run Postgres and Prometheus. We do. If you're a small business owner reading this, this is exactly the kind of work you outsource to your managed host instead of building yourself.
Alerting comes next, after baselines are real
We deliberately ran the platform without alerts for the first 30 days. Reason: alert thresholds set on day one are guesses. After 30 days of clean historical data, we know what "normal" looks like: the actual weekday-vs-weekend hit-rate baseline, how often a Plesk subscription spikes to 80% memory and recovers on its own, what the mail queue normally looks like at 2 a.m. UTC. With baselines, alerts can fire on real anomalies instead of generating false positives every other hour.
Alerting candidates, roughly in the order we'll wire them (a Prometheus-rule sketch of one of them follows the list):
- Mail bounce rate above 5% sustained for an hour, on a single server.
- Backup older than 36 hours, any site.
- DMARC pass rate drops by more than 10 points week-over-week, any domain.
- Plesk subscription consuming over 50% CPU or 800 MB sustained.
- Redis hit rate below 70% on a site with traffic above 10 visits per hour.
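Whether each of these lands as a Prometheus rule or as a Grafana alert on the Postgres side is still open, but for the flavor of it, here is the per-subscription resource candidate sketched as a Prometheus rule. The metric names are the illustrative ones from the exporter sketch earlier, and the 30-minute "sustained" window is a placeholder we'll tune against the baselines:

```yaml
# Sketch only: rule group for the per-subscription resource candidate.
groups:
  - name: per-subscription-resources
    rules:
      - alert: SubscriptionCpuHigh
        # rate() of the cumulative usec counter, converted to percent of one core
        expr: sum by (user) (rate(cgroup_user_cpu_usage_usec[10m])) / 1e4 > 50
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.user }} has been above 50% CPU for 30 minutes"
      - alert: SubscriptionMemoryHigh
        expr: max by (user) (cgroup_user_memory_bytes) > 800 * 1024 * 1024
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.user }} has been above 800 MB for 30 minutes"
```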
That's the next chapter. For now the platform's job is to teach us the shape of normal.
What this means for clients
This isn't an internal-only tool. Every dashboard above is the substrate for what we tell clients in their monthly Website Health reports. The per-site backup status, the spam-blocked counts in making spam protection visible, the per-domain email send tracking in mailmon, the three-layer monitoring story in server monitoring. All of them read from this stack.
A year ago, "is your site OK?" was a question we answered by checking. Today, it's a question the dashboard already answered for us this morning. That's not a feature we're working on. That's the dashboard above, refreshed every 30 seconds, in production.
Is this you?
Three questions about your hosting situation. Each one maps directly to one of the blind spots we just closed.
- Right now, on your shared host, can you tell me which other tenant is consuming the most disk I/O? If the honest answer is "I'd have to ask support," that's a no. The shared-host neighbor problem is real, and it's almost always invisible to tenants and to undermanned support teams alike.
- When did your site last back up successfully, and how would you find out without opening a ticket? Dashboards beat tickets here. If your only path to that answer is asking your provider, the answer is whatever the provider says it is, which is fine until it isn't.
- What's your domain's DMARC pass rate this month? If you can't answer in seconds, your email reputation is being scored by Gmail and Outlook based on data you don't have visibility into. We've watched real client domains sit at sub-50% pass rates for weeks before we built the dashboard that surfaced them.
If those answers made you uncomfortable, that's exactly where we were a year ago. The difference now is that for every site we host, we can answer all three at a glance, and so can our clients in their monthly report. No "we're working on it." It ships today.
Get in touch if you want to talk about WordPress hosting that operates at this level. We've been managing WordPress infrastructure for 18 years. Take a look at our managed WordPress hosting plans to see what's included.