A Caddy Cert Expired Because systemd-resolved Was Selectively Broken

What do a SERVFAIL on one specific zone and one specific resolver path, instant resolution on every other zone, and forty-two hours of failed cert renewals that nobody was watching all have in common? My own stupidity.

The alerts

A little after 6:39 PM on a Sunday, my notifications fired two alerts back to back:

[Matrix Server] [DOWN] certificate has expired
[AHA] [DOWN] Child monitors down: Matrix Server

Uptime Kuma was watching the federation port for my Matrix homeserver, and the certificate served on it had just gone past its notAfter time. The federation port is the public-facing TLS endpoint that other Matrix homeservers connect to in order to deliver events. If it goes untrusted, my homeserver stops federating. Other servers can no longer send messages into rooms hosted on mine, and mine cannot push outbound. Locally-cached client traffic still works for a while, but anything cross-server breaks.

The cert was for matrix-fed.takeonme.org, served by a Caddy container on a small VPS that sits in front of Synapse. Caddy uses a Cloudflare DNS-01 challenge to renew Let's Encrypt certs automatically. Auto-renewal is one of the reasons you choose Caddy in the first place. The cert had been issued in early February and was good for ninety days. It was supposed to renew somewhere around day sixty.

It hadn't.

First reflex: are the containers up?

$ docker ps --filter name=matrix --format "table {{.Names}}\t{{.Status}}"
NAMES                  STATUS
matrix-synapse         Up 45 hours (healthy)
matrix-element         Up 45 hours (healthy)
matrix-postgres        Up 45 hours (healthy)
matrix-caddy           Up 45 hours
matrix-cinny           Up 45 hours
matrix-redis           Up 45 hours (healthy)
matrix-synapse-admin   Up 45 hours (healthy)
matrix-hookshot        Up 45 hours
matrix-maubot          Up 45 hours
matrix-draupnir        Up 45 hours

Everything was up. None of the containers had crashed or restarted. The cert problem was scoped to the TLS endpoint, not the application.

openssl s_client confirmed it:

$ echo | openssl s_client -servername matrix-fed.takeonme.org \
    -connect matrix-fed.takeonme.org:22443 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates

issuer=C=US, O=Let's Encrypt, CN=E8
subject=CN=matrix-fed.takeonme.org
notBefore=Feb  2 23:38:00 2026 GMT
notAfter=May  3 23:37:59 2026 GMT

Issued in February, expired today, with a real LE-prod issuer. Caddy had never managed to get a replacement.

What Caddy was actually doing

I pulled the Caddy logs over the last few days, filtering out the noisy proxy errors so I could see the certificate maintenance loop:

{"logger":"tls","msg":"certificate needs renewal based on ARI window",
 "subjects":["matrix-fed.takeonme.org"],
 "selected_time":1775227713,"renewal_cutoff":1775227113}

{"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal",
 "remaining":5172.357}

{"logger":"tls","msg":"certificate needs renewal based on ARI window",...}
{"logger":"tls.cache.maintenance","msg":"certificate expires soon; queuing for renewal", ...}

Caddy was queueing the renewal every ten minutes like clockwork. ARI (the ACME Renewal Information extension that lets the CA tell the client when to renew) was telling Caddy to renew. The cache maintainer agreed. But every actual attempt was failing somewhere downstream, far enough away that the ten-minute health-check loop kept marching on.
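For reference, the filtering was nothing fancier than piping the container logs through jq; the expression below is an approximation of what I ran, assuming Caddy's default JSON log encoder and the logger names shown above:

$ docker logs matrix-caddy --since 72h 2>&1 \
    | jq -c 'select((.logger // "") | startswith("tls"))'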

Further down in the log, the actual failure showed up:

{"logger":"tls.renew","msg":"renewing certificate",
 "identifier":"matrix-fed.takeonme.org","remaining":5642.66}

{"msg":"trying to solve challenge","identifier":"matrix-fed.takeonme.org",
 "challenge_type":"dns-01",
 "ca":"https://acme-staging-v02.api.letsencrypt.org/directory"}

{"msg":"cleaning up solver","identifier":"matrix-fed.takeonme.org",
 "challenge_type":"dns-01",
 "error":"no memory of presenting a DNS record for
   \"_acme-challenge.matrix-fed.takeonme.org\"
   (usually OK if presenting also failed)"}

{"logger":"tls.renew","msg":"could not get certificate from issuer",
 "error":"[matrix-fed.takeonme.org] solving challenges:
   presenting for challenge:
   could not determine zone for domain
   \"_acme-challenge.matrix-fed.takeonme.org\":
   unexpected response code 'SERVFAIL' for
   _acme-challenge.matrix-fed.takeonme.org."}

{"logger":"tls.renew","msg":"will retry","attempt":29,"retrying_in":21600,
 "elapsed":151329.147,"max_duration":2592000}

Two things stood out.

First: attempt 29, elapsed 151329 seconds. Caddy had been failing this same renewal for forty-two hours straight. The next retry was scheduled six hours out. Caddy's retry backoff escalates aggressively after repeated failures, which makes sense as a way to avoid hammering a CA, but it also means a renewal can quietly slip past the cert's expiry while the client is still patiently waiting on its long backoff timer.

Second: the CA URL was acme-staging-v02.api.letsencrypt.org. Caddy had switched to LE staging at some point. After enough failures against prod, certmagic (the library Caddy uses for cert management) starts trying staging as a fallback. Even if the renewal had succeeded against staging, browsers and other Matrix homeservers would not have trusted the resulting cert. So the failure mode was doubly broken: the immediate renewal attempts were failing, and the CA selection had drifted to one that would not have helped even if they hadn't.

But neither of those was the root cause. The root cause was the SERVFAIL.

The DNS challenge and what it actually does

The DNS-01 ACME challenge works like this. Caddy asks Let's Encrypt to issue a cert. LE responds with a token. Caddy needs to publish a value derived from that token as a TXT record at _acme-challenge.<domain> so LE can verify it from the public internet, then tell LE to check.

To publish the TXT record, the Caddy Cloudflare provider (the lego library underneath) first needs to figure out which Cloudflare zone the record belongs to. It does this by walking up the DNS tree: query _acme-challenge.matrix-fed.takeonme.org for an SOA, see if you get one, walk up to matrix-fed.takeonme.org, then takeonme.org, until you hit the actual zone. That walk is what was getting SERVFAIL.
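You can reproduce that walk by hand with SOA queries, which is a useful way to see where it falls over; roughly:

$ dig +short SOA _acme-challenge.matrix-fed.takeonme.org   # no SOA in the answer, keep walking
$ dig +short SOA matrix-fed.takeonme.org                   # still nothing, keep walking
$ dig +short SOA takeonme.org                              # SOA comes back: this is the zone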

So the question became: who is returning SERVFAIL, and why?

Bypassing the local resolver

I started with the obvious test, querying directly against the upstream resolvers without going through anything local:

$ dig @1.1.1.1 _acme-challenge.matrix-fed.takeonme.org
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN

Instant NXDOMAIN from Cloudflare's 1.1.1.1. That's the correct, expected answer for a name that doesn't currently exist (Caddy publishes the TXT record only at challenge time, then deletes it). NextDNS directly: also instant NXDOMAIN. The VPS provider's link DNS: also instant NXDOMAIN.

Everyone upstream was answering correctly and quickly. So the failing piece was somewhere closer to home.

$ dig @127.0.0.53 _acme-challenge.matrix-fed.takeonme.org
;; communications error to 127.0.0.53#53: timed out

There it is. The local stub resolver, 127.0.0.53 (systemd-resolved), was timing out. Not answering with a SERVFAIL response packet, not answering at all. The container's DNS chain runs container → Docker's embedded resolver (127.0.0.11) → host's stub resolver (127.0.0.53) → upstream. The host stub was the layer that broke.

I checked from inside the Caddy container directly:

$ docker exec matrix-caddy nslookup _acme-challenge.matrix-fed.takeonme.org
Server:		127.0.0.11
Address:	127.0.0.11:53
;; connection timed out; no servers could be reached

Same timeout, just propagated one level up.
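The chain itself is easy to confirm; inside the container only Docker's embedded resolver is configured, and on the host (a stock systemd-resolved setup) /etc/resolv.conf points at the stub:

$ docker exec matrix-caddy cat /etc/resolv.conf   # nameserver 127.0.0.11
$ readlink -f /etc/resolv.conf                    # /run/systemd/resolve/stub-resolv.conf
$ cat /etc/resolv.conf                            # nameserver 127.0.0.53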

The shape of the bug

This is where it got weird. systemd-resolved wasn't just down. Other queries through it worked fine:

$ dig @127.0.0.53 google.com               # answers immediately
$ dig @127.0.0.53 nope-xyz.example.com     # NXDOMAIN, immediately
$ dig @127.0.0.53 matrix-fed.takeonme.org  # answers immediately, real address
$ dig @127.0.0.53 nope.takeonme.org        # times out
$ dig @127.0.0.53 anything-else.takeonme.org  # also times out

Existing names under takeonme.org resolved instantly. Existing names anywhere else resolved instantly. NXDOMAIN responses for other zones came back instantly. NXDOMAIN responses for anything under takeonme.org timed out specifically. The bug was scoped to one zone, one response type. Everything else worked.

I did the basic remediation. resolvectl flush-caches. No change. systemctl restart systemd-resolved. No change. Same query, same hang.

Whatever was wrong, it survived a service restart. That ruled out a simple stuck connection or a bad in-memory cache entry. systemd-resolved was reconstructing the broken state from scratch every time, somehow.

Getting the cert renewed first, then debugging the rest

I needed to stop the bleeding. Federation was already broken because the cert was expired. The Caddy container's resolver chain ran through the broken host stub. Fixing that was the fastest way to let Caddy actually solve the challenge and renew.

Looking at the Matrix stack's compose file, I noticed the Synapse service already had this:

synapse:
  image: matrixdotorg/synapse:latest
  ...
  dns:
    - 1.1.1.1
    - 8.8.8.8

Past me had hit some other DNS problem on this host and pinned Synapse to use Cloudflare and Google directly, bypassing the host resolver. The pattern was right there. Add the same to Caddy:

caddy:
  image: ghcr.io/slothcroissant/caddy-cloudflaredns:latest
  ports:
    - "22443:22443"
  dns:
    - 1.1.1.1
    - 8.8.8.8
  environment:
    - CLOUDFLARE_API_TOKEN=${CLOUDFLARE_API_TOKEN}

Committed, pushed, redeployed via Komodo. Within thirty seconds of the new container coming up, Caddy was already attempting renewal. This time the upstream lookup actually completed:

{"logger":"tls.renew","msg":"renewing certificate"}
{"logger":"http","msg":"using ACME account",
 "account_id":"https://acme-v02.api.letsencrypt.org/acme/acct/REDACTED"}
{"msg":"trying to solve challenge","challenge_type":"dns-01",
 "ca":"https://acme-v02.api.letsencrypt.org/directory"}
{"logger":"tls.renew","msg":"certificate renewed successfully",
 "issuer":"acme-v02.api.letsencrypt.org-directory"}

Note the CA URL: acme-v02, not acme-staging-v02. The container restart wiped certmagic's in-memory state, including the staging-fallback decision, and on first renewal it went straight back to LE prod. Good.

$ echo | openssl s_client -servername matrix-fed.takeonme.org \
    -connect matrix-fed.takeonme.org:22443 2>/dev/null \
  | openssl x509 -noout -issuer -subject -dates

issuer=C=US, O=Let's Encrypt, CN=E7
subject=CN=matrix-fed.takeonme.org
notBefore=May  4 00:24:55 2026 GMT
notAfter=Aug  2 00:24:54 2026 GMT

Federation was back. Uptime Kuma went green.

But the underlying bug, the one in systemd-resolved, was untouched. Patching one container around it just meant the next service that needed the host resolver for takeonme.org would hit the same wall. I needed to actually fix the resolver.

Drilling into the real bug

I had four data points:

  • Direct queries to upstream over plain UDP: instant correct response.
  • Queries through the local stub: hang, scoped to one zone.
  • Cache flush: no effect.
  • Service restart: no effect.

The host's resolved.conf had an explicit upstream configuration:

[Resolve]
DNS=45.90.28.0#<profile>.dns.nextdns.io
DNS=2a07:a8c0::#<profile>.dns.nextdns.io
DNS=45.90.30.0#<profile>.dns.nextdns.io
DNS=2a07:a8c1::#<profile>.dns.nextdns.io
DNSOverTLS=yes

A NextDNS profile, accessed over DoT. The IPs are the standard NextDNS anycast addresses, and the hostname after # carries the profile ID so the upstream can apply the right per-profile filtering policy. DoT is the privacy/integrity transport: queries go to NextDNS over a TLS-wrapped TCP session instead of plain UDP/53.
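The same configuration is visible at runtime, which is worth checking before assuming resolved is actually using what the file says; the hostname after # is also what resolved presents as the TLS server name when it dials out on port 853:

$ resolvectl status   # Global section lists the four NextDNS servers with the #<profile> suffix
$ resolvectl dns      # compact per-link view of the same thing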

I needed to know whether the bug was in NextDNS (do their DoT endpoints have an issue with NXDOMAINs in this particular zone?) or in systemd-resolved (does the stub mishandle something specific in the response it gets back?). I tested DoT directly to NextDNS, bypassing the stub:

$ dig +tls +tls-hostname=<profile>.dns.nextdns.io \
       @45.90.28.0 nope.takeonme.org

;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 21401

Instant NXDOMAIN, over DoT, talking to the same NextDNS endpoint that systemd-resolved was supposedly using. The response came back fine at the protocol level.

For comparison, DoT to NextDNS for some other unrelated NXDOMAIN: instant. DoT to NextDNS for a real existing name under takeonme.org: instant. DoT to Cloudflare for the same NXDOMAIN: instant.

Every direct test worked. Only the path through systemd-resolved hung. So the bug was inside systemd-resolved's handling of some specific aspect of the response, for this specific zone, with these specific upstreams, over DoT. Whatever it was, it was sticky enough to survive a flush and a service restart.

I do not have a smoking-gun explanation for why this specific combination broke. I checked resolvectl status for DNSSEC validation issues, since that's a known source of weird stub behavior, and DNSSEC was disabled on the link. I checked for stuck TCP sessions in ss -tan, and there were no half-open DoT sockets to NextDNS. I checked the journal for systemd-resolved errors, and it was silent. The bug is there, it is reproducible, it survives normal recovery actions, and it is invisible to standard observability.
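For completeness, those checks amounted to roughly the following (reconstructed from shell history, so exact flags may differ):

$ resolvectl status | grep -i dnssec                       # DNSSEC not enabled on the link
$ ss -tan | grep ':853'                                    # no lingering DoT sessions to NextDNS
$ journalctl -u systemd-resolved --since=-2d --no-pager    # nothing logged around the failures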

I filed it under "systemd-resolved + DoT + NextDNS + this zone, do not stand here again" and moved on to the actual fix.

Was this fleet-wide?

Before changing anything, I needed to know how exposed the rest of the fleet was. The matrix-fed cert is one of many things in my homelab that depend on host DNS working correctly. If every server with Tailscale access was using NextDNS over DoT (some bits of my Tailscale config push NextDNS as a global nameserver), the bug could be everywhere.

I ran the same diagnostic across every other host in the fleet:

for host in <list of hosts>; do
  echo "=== $host ==="
  ssh root@$host 'dig @127.0.0.53 nope-test.takeonme.org +time=4 +tries=1' \
    | head -4
done

Every other host returned NXDOMAIN immediately. Only the matrix VPS was broken.

I checked the resolver config on each host, and the picture made sense:

  • Matrix VPS: NextDNS DoT, hardcoded in /etc/systemd/resolved.conf
  • Public status page VPS: Tailscale MagicDNS at 100.100.100.100
  • All other hosts (Ubuntu): DHCP/netplan link DNS, empty Global config

Only one host in the fleet had explicit NextDNS DoT in resolved.conf. That was the host the bug was on. Every other host was either using Tailscale's MagicDNS forwarder (which goes through Tailscale's path, not direct stub-DoT) or just whatever DNS the link was handed at boot. None of them were exposed to the DoT-NextDNS combination in the way the matrix VPS was.

This was reassuring on the immediate scope (one host, not seven), and it also told me something about the failure mode. Whatever exact code path in systemd-resolved was breaking, it required the explicit DoT-to-NextDNS-with-profile-ID configuration to trigger. The Tailscale-MagicDNS forwarder, even though it presumably terminates at NextDNS too at some layer, takes a different path through the stub and doesn't hit the bug.

The actual fix

The matrix VPS doesn't need a NextDNS profile. NextDNS profiles are for client devices, where the per-device filtering policy is the whole reason you're paying for the service. A server doesn't browse, doesn't get tracked, and doesn't benefit from ad-blocking DNS. The DoT setup on this host was a leftover from when the box was originally configured, copied from a personal-device template that got applied where it didn't belong.

The fix was to remove the offending config and let systemd-resolved fall through to the link DNS that was already there (the VPS provider's resolver on eth0):

cp /etc/systemd/resolved.conf \
   /etc/systemd/resolved.conf.bak.$(date +%Y%m%d-%H%M%S)

sed -i -E 's/^(DNS=.*|DNSOverTLS=yes)$/#\1/' \
   /etc/systemd/resolved.conf

systemctl restart systemd-resolved

Backup, comment out, restart. Three commands, fully reversible.

After the restart, resolvectl status showed Global with no DNS configured and the eth0 link providing the VPS provider's resolver. The takeonme.org NXDOMAIN test through the stub came back instantly. Everything else still resolved fine.
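Spelled out, the verification was the same pair of probes as before:

$ resolvectl status                                    # Global: no DNS servers; eth0 carries the provider resolver
$ dig @127.0.0.53 nope.takeonme.org +time=3 +tries=1   # instant NXDOMAIN again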

I left the dns: [1.1.1.1, 8.8.8.8] override on the Caddy container in place, even though the host resolver now works. It mirrors the existing pattern on the Synapse service, costs nothing, and gives a concrete second line of defense if the host resolver ever breaks again. Defense in depth is cheaper than re-debugging.

What I learned

Caddy will queue a renewal forever and never tell you. The renewal failed every ten minutes for forty-two hours, with full structured-JSON error logs explaining exactly what was wrong, and nothing in my monitoring stack was watching for those errors. Caddy doesn't expose a Prometheus metric for "ACME renewal has been failing for N attempts." It just logs and retries with backoff. If you don't have a log-based alert on tls.renew errors or a separate cert-expiry probe at the public endpoint, you find out when the cert expires.
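Even a dirt-simple external probe would have caught this with days to spare. A sketch of the kind of check I mean, using openssl's -checkend against the public endpoint; the threshold and the wiring are mine, not anything Caddy provides:

# Alert if the served cert expires within 14 days; -checkend exits non-zero in that case.
echo | openssl s_client -servername matrix-fed.takeonme.org \
    -connect matrix-fed.takeonme.org:22443 2>/dev/null \
  | openssl x509 -noout -checkend $((14*24*3600)) >/dev/null \
  || echo "ALERT: matrix-fed.takeonme.org cert expires within 14 days"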

The CA selection was the second-order failure I almost missed. Caddy's built-in fallback from prod to staging is sensible: if you're getting rate-limited by LE prod, throwing more requests at it is wasteful. But the resulting cert is untrusted, which means even a "successful" renewal under the staging issuer still leaves you with a broken endpoint. If the symptom is bad enough to trigger fallback, the symptom is worth alerting on.

Don't trust 'the resolver works' as a single test. systemd-resolved was answering correctly for nearly every query I threw at it. The one zone that mattered was the one zone that hung. A health check that runs dig google.com proves nothing about your actual workload's resolution path. If you must health-check DNS, query the names your services actually depend on.
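Concretely, that means probing through the stub for the names this box actually depends on, where the failure signal is a timeout rather than NXDOMAIN (dig still exits zero on NXDOMAIN, which is a healthy answer here); a sketch:

for name in matrix-fed.takeonme.org _acme-challenge.matrix-fed.takeonme.org; do
  dig @127.0.0.53 "$name" +time=3 +tries=1 >/dev/null \
    || echo "resolver check failed for $name"
done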

A six-hour retry interval is too long for a cert that expires in three. Caddy's exponential backoff makes sense for "ACME server is down" cases, where backing off is polite. It does not make sense for "your cert will be expired and untrusted before the next attempt fires." There should be a guard rail: if now + retry_interval > cert_notAfter, retry sooner. I do not know if Caddy has a knob for this. I am going to find out.

Configuration drift kills you in production but doesn't show up in audits. The host had NextDNS DoT in its resolved.conf because somebody (past me) configured it that way once and it worked, and it kept working for years until one specific upstream behavior changed enough to trigger an obscure stub bug. Nothing about my server-baseline checks would have flagged this config as "unusual for a server." It looked normal. It was wrong, and the wrongness was invisible until it killed a public-facing service.

The cert is good through August, the resolver is fixed, and Matrix federation is back online. There is now a follow-up project on my whiteboard: cert-expiry alerting, a fleet-wide DNS audit, and a sweep for any other dangling resolver configs hiding on hosts I haven't looked at since the day I provisioned them. I'll get to it before the next cert tries to renew.