Field Notes from 2026-05-03. Two stale backups in the first thirty minutes. One false alarm that accidentally validated the page path. The cleanup gap we didn’t see coming.
Two days ago we said the next chapter of the observability platform was alerting, and that we’d wait thirty days for baselines before turning anything on. We ended up turning a chunk of it on inside 48 hours, and in the first half hour after we did, three notifications hit. Two were emails about real stale backups: one site fifteen days behind, another four. The third was a Pushover priority-1 page at full volume — false alarm, cron timing bug — that proved the wake-me-up path actually pages. All three were the system working as designed, in opposite directions.
This is the writeup of what shipped, what didn’t, what we got wrong, and the architectural gap the real findings surfaced.
Defer the anomaly tier; ship the hard-fail tier now
The previous post said we’d run the platform for thirty days before wiring alerts. That reasoning was correct for one class of alert and overcorrected for another.
Anomaly alerts — traffic spikes, revenue drops, “this Plesk subscription is using more CPU than its baseline” — do need real seasonal data. Without three or four weeks of weekday-vs-weekend, time-of-day, end-of-month patterns, every threshold is a guess, and most of those guesses fire false positives. Those stayed deferred. We’ll come back to them later in May.
Binary alerts — is X broken right now, yes or no — don’t need baselines. They need a query and a threshold and a notification channel. “The most recent backup for this site is more than 36 hours old” doesn’t get more or less true based on what time of day the question is asked. So we shipped that tier alone, two days into the thirty.
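To make that concrete, here is roughly what the backup rule reduces to. This is a sketch: backup_status_daily is real (it comes up again below), but last_backup_at is an assumed column name, not the real schema.

```python
# Sketch of the binary tier: one query, one threshold, one label.
# backup_status_daily is real; last_backup_at is an assumed column name.
STALE_BACKUP_SQL = """
SELECT server, domain
FROM backup_status_daily
WHERE last_backup_at < now() - interval '36 hours'
"""

# Each returned row becomes one alert instance. The severity label picks
# the channel: this one is look-at-this-tomorrow, so it stays off Pushover.
STALE_BACKUP_RULE = {
    "title": "Backup older than 36 hours",
    "labels": {"severity": "warning"},
}
```

No baseline, no seasonality. A row either comes back or it doesn’t.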
Two-tier routing, not five
Two channels. Pushover priority-1 (full-volume, override-do-not-disturb) for the wake-me-up class: Postgres down, every ETL silent simultaneously, multi-site backup failure, IAWP poller daemon down, the metrics host itself unreachable. Email for the look-at-this-tomorrow class: a single site’s backup missed, mail queue elevated on one server, an RBL listing, a DMARC pass-rate drop on a domain that was previously passing, an individual ETL silent.
Two priorities, not five. No graduated escalation, no “page if no ack within fifteen minutes” laddering. Either it’s the kind of thing that should pull us out of bed, or it isn’t. The Grafana notification policy is one rule on severity=critical: critical goes to Pushover with repeat: 1h, everything else falls through to email with repeat: 4h, grouped by [alertname, host] so a single backup-engine outage produces one page, not ten. The whole config lives in wire-alert-rules.py, same idempotent re-run pattern as last post’s dashboard wiring — source of truth in git, not the UI.
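The policy push itself is small enough to show whole. A sketch under assumptions: the receiver names are placeholders, and the payload shape follows Grafana’s alerting provisioning API as we understand it, which can vary by version.

```python
# Sketch of the policy block inside wire-alert-rules.py. Receiver names are
# placeholders; endpoint and payload follow Grafana's alerting provisioning
# API as of recent versions.
import os
import requests

POLICY = {
    "receiver": "ses-email",                  # default: everything falls through here
    "group_by": ["alertname", "host"],        # one backup-engine outage, one notice
    "repeat_interval": "4h",
    "routes": [
        {
            "receiver": "pushover-critical",  # the wake-me-up tier
            "object_matchers": [["severity", "=", "critical"]],
            "repeat_interval": "1h",
        }
    ],
}

resp = requests.put(
    "http://localhost:3000/api/v1/provisioning/policies",
    json=POLICY,
    headers={"Authorization": f"Bearer {os.environ['GRAFANA_TOKEN']}"},
    timeout=10,
)
resp.raise_for_status()
```

Re-running replaces the whole tree, which is what keeps the git-not-UI contract cheap.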
Don’t let alerts depend on the thing you’re alerting about
The first thing we caught and undid was the most embarrassing kind of mistake. Grafana was set to relay its email through the fleet’s own Postfix — the same Postfix we wanted alerts about. If a fleet mail server went down, alerts about that outage would queue silently inside the very thing that was broken. We’d find out by client ticket, exactly the failure mode we said we were getting away from.
Wired Grafana SMTP to AWS SES instead — already in our stack for fleet transactional email, completely independent of any host’s Postfix. NodePing follows the same rule: its Pushover and email paths run through infrastructure unrelated to the thing being watched. An alert that depends on the health of the thing it’s alerting about is not an alert. It’s a hope.
Meta-monitoring the alerter
The next problem: Grafana itself can fail silently. Postgres can lock up. The metrics box can OOM. The cron that drives the data flow can stop. If Grafana stops sending alerts because Grafana is down, the absence of alerts looks identical to the absence of problems.
The fix is a heartbeat. A NodePing PUSH check expects a tick from the metrics host on a tight interval. A one-line cron on the metrics box calls Grafana’s /api/health locally and only forwards the heartbeat if Grafana responds healthy. If Grafana hangs, no heartbeat. If the box is down, no cron. NodePing notices the silence and pages via Pushover priority-1, over a path with zero dependency on Grafana.
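The production version is a curl one-liner in the crontab; spelled out in Python, with the NodePing PUSH URL as a placeholder pulled from the environment, the logic is:

```python
# Heartbeat relay, run from cron every minute on the metrics box. If Grafana
# hangs or the box is down, the push below never fires, NodePing sees silence
# and pages. (PUSH_URL is a placeholder; /api/health is Grafana's endpoint.)
import os
import requests

PUSH_URL = os.environ["NODEPING_PUSH_URL"]

try:
    r = requests.get("http://localhost:3000/api/health", timeout=5)
    healthy = r.ok and r.json().get("database") == "ok"
except (requests.RequestException, ValueError):
    healthy = False  # unreachable or hung: forward nothing

if healthy:
    requests.get(PUSH_URL, timeout=5)  # the tick NodePing expects
```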
We learned this works on day one, the hard way. The first cron interval was */3 — every three minutes. NodePing’s eval window was five. The very first heartbeat arrived after NodePing had already declared the check down, which paged us at full volume on a Saturday morning. The fix took two seconds: tighten the cron to every minute. The accidental upside: end-to-end proof that the page path works. Pushover delivered, the phone overrode do-not-disturb, the check recovered, NodePing logged the recovery. Best false alarm we’ve ever had.
First-day findings
The two stale-backup emails that arrived inside the first half hour were both correct.
The first: 15 days behind. The site had been removed from the fleet entirely two weeks earlier. The S3 prefix it backed up to was no longer being written.
The second: 4 days behind. The site had been migrated from one fleet server to another four days prior, and the old prefix on the source server was still being walked by our backup-status ETL even though no new backups would ever land in it again.
The alerts were right. Both sites had stopped backing up at the prefixes the ETL was watching. What surprised us was that both alerts were also kind of wrong — the underlying sites were fine, the live backups elsewhere were current. The backups they were complaining about were ghosts. Neither client was actually unprotected. The alerts didn’t know that.
The cleanup gap
This is the part we didn’t see coming, and it’s the most interesting thing the build taught us.
Our backup ETL discovers sites by walking S3 prefixes under each fleet server’s bucket. Every site found becomes a row in backup_status_daily with a last-seen timestamp. The ETL has only ever inserted and updated. It has never deleted.
So every offboarded or migrated site leaves a row behind, and that row’s timestamp ages past the 36-hour threshold and stays there forever. The alert was working perfectly. It was just firing on sites we’d already cleaned up. Without a sweep, the only way to silence it is hand-cleaning rows in Postgres after every offboard — a contract that dies the first time someone forgets, which is week two.
Fixed in etl-backups.py as a sweep_wptk() step at the end of every run. The in-flight S3 walk’s (server, domain) set is the inventory; any row in backup_status_daily whose (server, domain) pair no longer appears in that set gets deleted. The sweep refuses to run if discovery returned zero, which prevents a bad S3 list from truncating the table.
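Roughly, assuming psycopg2 and a (server, domain) key; the real sweep_wptk() in etl-backups.py may differ in detail:

```python
# Sketch of the sweep step, assuming psycopg2 and a (server, domain) key.
def sweep_wptk(conn, discovered):
    """Delete rows for sites this run's S3 walk no longer sees."""
    if not discovered:
        # Zero discovery means the walk failed, not that the fleet is empty.
        # Bail rather than truncate the table off a bad S3 list.
        return 0
    with conn.cursor() as cur:
        cur.execute(
            "DELETE FROM backup_status_daily WHERE (server, domain) NOT IN %s",
            (tuple(discovered),),
        )
        conn.commit()
        return cur.rowcount  # ghost rows removed this run
```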
The offboarding skill and the onboarding-migration skill have both been updated to reflect the new contract. “Delete the rolling-backup S3 prefix” is now the entire database step. The metrics layer cleans up after itself.
What shipped vs. what was teased
The previous post listed five alerting candidates in rough wiring order. Here’s the reality, two days later.
Backup older than 36 hours, any site. Live, email tier. Found two ghost rows within thirty minutes — one offboarded site, one inter-server migration — which surfaced the cleanup gap above.
DMARC pass-rate drop. Live, modified. Original: “drops by more than ten points week-over-week.” Real version: yesterday’s rate against the previous seven-day baseline, fires only on domains that were passing (above 95%) over that baseline. Without that filter, we’d page on the long tail of domains we already know are stuck at zero. A sketch of the rule’s shape follows this rundown.
Mail bounce rate / queue elevated. Live, split into two rules. Bounce-rate spike per server is one; sustained queue depth above a fixed threshold is another. Sustained-for-an-hour was too narrow as a single rule.
Plesk subscription over 50% CPU or 800 MB sustained. Deferred. The rule that most needs baselines — what counts as “sustained” varies wildly by site type and time of day. Revisit with three weeks of cgroup data.
Redis hit rate below 70% on traffic above ten visits per hour. Deferred. Hit-rate baselines are bimodal by content type and time of day; not enough data yet to set the floor without firing constantly.
Three of five shipped, two parked behind real data. About right for the confidence we have today.
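For the DMARC rule, since it changed the most between teaser and reality, here is the shape it took. A sketch: the table and column names (dmarc_daily, pass_rate) are assumptions, and the ten-point drop stands in for the real tuning.

```python
# Sketch of the modified DMARC rule. dmarc_daily, pass_rate, and the
# ten-point drop are assumptions standing in for the real names and tuning.
DMARC_DROP_SQL = """
WITH baseline AS (
    SELECT domain, avg(pass_rate) AS base_rate
    FROM dmarc_daily
    WHERE day BETWEEN current_date - 8 AND current_date - 2  -- prior 7 days
    GROUP BY domain
)
SELECT d.domain, d.pass_rate, b.base_rate
FROM dmarc_daily d
JOIN baseline b USING (domain)
WHERE d.day = current_date - 1
  AND b.base_rate > 95                  -- only domains that were passing
  AND d.pass_rate < b.base_rate - 10    -- a real drop, not jitter
"""
```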
Closing
The platform now tells us when something is broken, and it doesn’t tell us about things we’ve already cleaned up. Both halves matter. A noisy alerter gets ignored as fast as a missing one.
The anomaly tier — “this site looks weird compared to its own usual” rather than “this site is on fire” — comes next, after the seasonal data is in. For now: in roughly thirty hours we crossed from “we’d find out from a client ticket” to “we get paged before the client notices,” with one false alarm that paid for itself by validating the page path on day one.
An alert is a contract. It only works when the alerter is independent of the thing it watches, when it watches itself, and when its data layer prunes what no longer exists. Get any one of those wrong and you have noise, not coverage.
Every alert above feeds the same substrate behind our monthly Website Health reports and the per-domain email visibility we already publish to clients. Alerting just makes that substrate self-monitoring instead of dashboard-only.
Get in touch if you want to talk about WordPress hosting that operates at this level. Eighteen years of doing this; see our managed WordPress hosting plans for what’s included.