Dead Container Took Down Prod

It Was a Quiet Weekend Until It Wasn’t

Monday morning. Coffee in hand. You open Slack and the first thing you see is a wall of alerts that went off at 02:48 AM and nobody caught until now. The app is returning 502s site-wide. docker compose ps hangs. systemctl restart is ignored like it doesn’t exist. The machine is alive — you can SSH in — but Docker and half of systemd are frozen solid.

That was our Monday. Here’s exactly what happened, why it happened in three distinct acts, and what we actually fixed — including the embarrassing part where a container we’d been meaning to remove for months quietly made everything worse.

Act I: The Slow Bleed (Sunday, 04:23 AM)

The production server runs rootless Docker. That means containers run as a non-root user, which is good for security — no container process ever runs as actual root on the host. The tradeoff is that rootless Docker needs a userspace networking layer to give containers network access, and for a while that’s been slirp4netns.

slirp4netns is a single process that handles all the outbound networking for every rootless container on the host. It translates container network calls into regular unprivileged socket calls. One process. No redundancy. A single point of failure.

At some point it started leaking memory. Slowly at first. By 04:23 AM on Sunday, when the kernel OOM killer finally stepped in, it had ballooned to 3.54 GB RSS:

Jun 21 04:23:52 prod-web-01 kernel: systemd invoked oom-killer: gfp_mask=0x...
Jun 21 04:23:52 prod-web-01 kernel: Out of memory: Killed process 837 (slirp4netns) \
  total-vm:9073940kB, anon-rss:3538292kB, file-rss:0kB, shmem-rss:0kB

When slirp4netns died, every rootless container lost outbound network access and DNS simultaneously. The frontend kept accepting incoming requests — nothing killed the nginx listener. But every backend service that needed to reach the outside world went dark. External payment gateway APIs started timing out. Requests that used to complete in 200ms now hung for 30 seconds before failing.

From the outside, this looked like “slow payment processing” and “intermittent errors.” Bad, but not a full outage. The kind of thing that generates some support tickets but doesn’t wake anyone up at 4 AM.

Here’s the thing about that 3.5 GB figure: it’s high, but not impossible. Under sustained traffic load, slirp4netns can hold a lot of state — connection tracking, socket mappings, buffered data. Whether this was a bug in the specific version, a traffic spike that exposed a latent leak, or a combination, we don’t know for certain. It’s what we observed. We’re not blaming a specific CVE; we’re describing what happened and what we changed so it can’t happen again.

Act II: The Log Flood (Monday, 02:48 AM)

Here’s where “bad” became “catastrophic.”

With the container network broken, every service that was trying to reach external endpoints was failing on every attempt. DNS timeouts. Connection refused errors. Retry loops. Each one generated a log line. And with the payment flow retrying every few seconds across multiple containers, the log volume went from “normal” to “firehose” essentially overnight.

The Docker json-file logging driver has a subtle behavior that bites you in exactly this scenario: in its default blocking mode, it buffers log output in memory when it can’t write fast enough. If the kernel is under memory pressure, writes slow down. Slow writes mean the buffer grows. A growing buffer means more memory pressure. You can see where this is going.

By 02:48 AM, the Docker logging process had consumed roughly 6.26 GB of RAM:

Jun 22 02:48:01 prod-web-01 kernel: Out of memory: Killed process 1244 (docker-logging-) \
  total-vm:12911116kB, anon-rss:6257804kB, file-rss:0kB, shmem-rss:0kB

The OOM killer shot it. When the logging daemon died, Docker had a problem: it couldn’t collect logs from any container, and it couldn’t cleanly shut down containers that were still trying to write logs. The entire Docker daemon entered a deadlocked state. docker compose commands hung indefinitely. systemctl restart docker did nothing. The machine was technically reachable but operationally frozen.

502s everywhere. No way to restart anything. The only path forward was a hard reboot of the AWS instance to free the exhausted memory and bring Docker and the network stack back up from scratch.

Act III: The Embarrassing Part

After the reboot, everything came back up — but we immediately ran into a second problem. Nginx wouldn’t start cleanly, and when it did, one location block was broken.

We run nginx as a local reverse proxy to route traffic to various services. At some point — probably six months ago — we spun up Formbricks (an open-source survey tool) to try for an internal feedback thing that never went anywhere. We never committed to it. We never removed it either. Classic “we’ll deal with that later” energy.

Here’s the nginx location block that was still sitting in our shared config:

location /surveys {
    proxy_pass http://${FORMBRICKS_CONTAINER_HOST}:${FORMBRICKS_CONTAINER_PORT};
    proxy_set_header Host $host;
    proxy_set_header X-Real-IP $remote_addr;
}

Notice the variables in proxy_pass. When proxy_pass contains a variable, nginx stops resolving the upstream hostname once at startup and instead resolves it at request time, caching the answer for the record’s TTL (or for the resolver’s valid= setting). Request-time resolution requires an explicit resolver directive, and we didn’t have one; without it, nginx logs a “no resolver defined” error and answers that location with a 502. Add a container network where DNS wasn’t fully stable yet (slirp4netns had only just been restarted), and nginx was doing its best confused-dog impression every time something touched the Formbricks hostname.

Here’s the nuance worth getting right: this broken proxy block did NOT take down incoming traffic to the whole site. That’s not how nginx works. A failed upstream lookup in a location /surveys block returns a 502 on requests to /surveys only. The site-wide 502s were caused by the upstream network being dead for everything — the payment gateways, the backend APIs, all of it. The Formbricks block was an extra mess that made restart troubleshooting harder, not the primary cause of the outage.

The lesson isn’t “that nginx block killed production.” The lesson is: dead weight you never cleaned up will find the worst possible moment to become an obstacle. We were debugging a chaotic restart and had to detangle a broken proxy block for a service that hadn’t served a real request in months. That’s what “we’ll deal with it later” costs you.

We removed the Formbricks block, tore down the unused container, and moved on. The location block was five lines. It took us 20 minutes to even realize it was there during the incident.
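If you want to catch this class of problem before an incident does, here is a minimal Python sketch of a config lint: it flags proxy_pass targets that contain a variable when the config text has no resolver directive at all. The function names and regexes are our own illustration, not an nginx tool. It is deliberately naive: it does not follow include files and does not check which scope (http, server, location) a resolver is declared in.

```python
import re
from typing import List

# Captures the target of each proxy_pass directive, up to the semicolon.
PROXY_PASS_RE = re.compile(r"^\s*proxy_pass\s+([^;]+);", re.MULTILINE)
# Matches any resolver directive in the config text.
RESOLVER_RE = re.compile(r"^\s*resolver\s+[^;]+;", re.MULTILINE)
# Matches nginx variables written as $name or ${name}.
VARIABLE_RE = re.compile(r"\$\{?\w+\}?")


def variable_proxy_targets(config_text: str) -> List[str]:
    """Return proxy_pass targets that contain at least one variable."""
    targets = PROXY_PASS_RE.findall(config_text)
    return [t.strip() for t in targets if VARIABLE_RE.search(t)]


def missing_resolver(config_text: str) -> List[str]:
    """Return variable-based proxy_pass targets when no resolver is configured."""
    if RESOLVER_RE.search(config_text):
        return []
    return variable_proxy_targets(config_text)
```

Run it over the concatenated config in CI or a pre-deploy hook; an empty list means nothing was flagged, not that the config is correct.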

What We Actually Fixed

1. Docker Log Limits — Two Different Problems, Two Different Knobs

The first thing everyone does after this kind of incident is add log rotation. That’s the right call, but it’s worth understanding what each setting actually does:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3"
  }
}

max-size and max-file control disk usage — how many log files exist and how large each one can grow before rotation. This is good hygiene. Without it, a chatty container will eventually fill your disk. We should have had this from day one.
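The disk math is worth doing once so the numbers stop being abstract. A minimal Python sketch, with a hypothetical count of 12 containers (ours is an assumption for illustration; plug in your own). Treat the result as an approximate ceiling, not an exact guarantee.

```python
def max_log_disk_mb(max_size_mb: int, max_file: int, containers: int) -> int:
    """Approximate upper bound on json-file log disk usage, in megabytes."""
    return max_size_mb * max_file * containers


# With max-size=50m and max-file=3:
per_container = max_log_disk_mb(50, 3, 1)   # one container
whole_host = max_log_disk_mb(50, 3, 12)     # hypothetical host with 12 containers
```

That works out to 150 MB per container, and 1800 MB across the hypothetical 12-container host.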

But here’s the distinction that matters for our specific incident: the memory problem came from the json-file driver’s default blocking mode. When log output arrives faster than the driver can write it to disk, it buffers in RAM while waiting. Under memory pressure, writes slow down, the buffer grows, and you get exactly what we saw.

The sharper fix for that is the non-blocking mode:

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "3",
    "mode": "non-blocking",
    "max-buffer-size": "10m"
  }
}

mode: non-blocking puts a fixed-size in-memory ring buffer between each container and the log driver, so the container’s writes never block; when that buffer is full, new messages are dropped instead of piling up. max-buffer-size sets the size of that buffer per container (the default is 1m), so budget for it multiplied by your container count. Losing some log lines during a crisis is annoying. A 6 GB RAM balloon that freezes Docker is catastrophic. You pick your poison; we know which one we prefer.
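To make the trade-off concrete, here is a small Python sketch of the general idea behind a bounded, non-blocking buffer: writers never wait, and anything that does not fit is counted and dropped. This is an illustration of the concept only, not Docker’s implementation; the class name is hypothetical, and it counts messages where Docker’s buffer is sized in bytes.

```python
import queue


class DroppingLogBuffer:
    """Bounded in-memory buffer that drops new messages once it is full."""

    def __init__(self, max_messages: int) -> None:
        self._queue = queue.Queue(maxsize=max_messages)
        self.dropped = 0

    def write(self, message: str) -> bool:
        """Queue a message without blocking; return False if it was dropped."""
        try:
            self._queue.put_nowait(message)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def drain(self) -> list:
        """Remove and return everything buffered (the slow writer's job)."""
        drained = []
        while True:
            try:
                drained.append(self._queue.get_nowait())
            except queue.Empty:
                return drained
```

The property that matters is that memory use is bounded by construction: a log storm raises the dropped counter instead of raising RSS.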

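After editing daemon.json, restart Docker to apply. On a rootless install like ours, the file lives at ~/.config/docker/daemon.json and the daemon runs as a user-level unit, so there is no sudo involved:

systemctl --user restart docker

The new defaults only apply to containers created after the change. Existing containers keep the log options they were created with, so recreate them (for example with docker compose up -d --force-recreate).

And verify both the daemon default and what a given container actually got (replace <container> with a real name):

docker info --format '{{.LoggingDriver}}'
docker inspect --format '{{json .HostConfig.LogConfig}}' <container>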
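Checking containers one at a time gets old fast. Here is a minimal Python sketch that takes the JSON printed by docker inspect and reports which containers are missing any of the four log options discussed here. The function name and the REQUIRED_OPTS set are our own choices; the only thing it relies on from Docker is the HostConfig.LogConfig.Config object in the inspect output.

```python
import json

# The log options this post recommends setting; adjust to taste.
REQUIRED_OPTS = {"max-size", "max-file", "mode", "max-buffer-size"}


def missing_log_opts(inspect_json: str) -> dict:
    """Map container name -> sorted list of log options it does not have set."""
    report = {}
    for container in json.loads(inspect_json):
        log_config = container["HostConfig"]["LogConfig"]
        configured = set(log_config.get("Config") or {})
        missing = REQUIRED_OPTS - configured
        if missing:
            report[container["Name"]] = sorted(missing)
    return report
```

Feed it the output of docker inspect $(docker ps -q). An empty dict means every running container carries all four options.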

2. The slirp4netns Problem — Don’t Just Throw It Away

The instinctive reaction here is “go back to rootful Docker, problem solved.” That’s lazy 2026 advice. Rootless Docker is good for security — container processes never run as actual root on the host, which meaningfully reduces your blast radius if something gets exploited. That’s worth keeping.

The real issue is slirp4netns specifically. It’s a single process with no restart-on-failure behavior, it’s not resource-capped by default, and it’s the only thing standing between your containers and the outside world.

Your options, roughly in order of how modern they are:

Option A: Monitor it, and make the rootless stack self-heal. Here’s the catch most guides skip: in rootless Docker you can’t put slirp4netns on its own diet. It isn’t a standalone systemd unit — it’s spawned by rootlesskit as a child of your user-level docker.service. There’s no slirp4netns.service to drop a MemoryMax onto, so don’t waste time trying.

What you can do is add a restart policy (and, if you want, a blunt cgroup memory cap) to the rootless docker.service itself via a user drop-in:

[Service]
Restart=on-failure
RestartSec=5
# Optional and blunt: this caps the WHOLE rootless Docker cgroup
# (dockerd + rootlesskit + slirp4netns + your containers), not
# slirp4netns alone. Set it high enough that normal load won't trip it.
# MemoryMax=6G

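Save that as ~/.config/systemd/user/docker.service.d/override.conf (running systemctl --user edit docker creates the same file for you), then run systemctl --user daemon-reload and systemctl --user restart docker. Restart=on-failure is the useful part: if the stack falls over, it comes back without a full machine reboot. The real win, though, is monitoring: alert on slirp4netns RSS so you find out before the OOM killer does, not after.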
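For the RSS alert, you do not need a full monitoring stack to get started. Here is a minimal Python sketch that reads VmRSS for every slirp4netns process straight out of /proc and filters by a threshold. The function names are hypothetical, the threshold is whatever you decide is abnormal for your host, and wiring the result into your alerting (cron plus a webhook, a node exporter textfile, and so on) is left out on purpose.

```python
import os
import re
from typing import Dict, Optional

# VmRSS line in /proc/<pid>/status, e.g. "VmRSS:\t 3538292 kB"
VMRSS_RE = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vmrss_kb(status_text: str) -> Optional[int]:
    """Extract VmRSS in kB from the text of /proc/<pid>/status."""
    match = VMRSS_RE.search(status_text)
    return int(match.group(1)) if match else None


def find_rss_kb(process_name: str = "slirp4netns",
                proc_root: str = "/proc") -> Dict[int, int]:
    """Return {pid: rss_kb} for every process whose comm equals process_name."""
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                if f.read().strip() != process_name:
                    continue
            with open(os.path.join(proc_root, entry, "status")) as f:
                rss = parse_vmrss_kb(f.read())
        except OSError:
            continue  # process exited or is unreadable; skip it
        if rss is not None:
            found[int(entry)] = rss
    return found


def over_threshold(rss_by_pid: Dict[int, int], limit_kb: int) -> Dict[int, int]:
    """Keep only the processes whose RSS exceeds limit_kb."""
    return {pid: rss for pid, rss in rss_by_pid.items() if rss > limit_kb}
```

Something like over_threshold(find_rss_kb(), 1024 * 1024) run every minute would have flagged our leak long before the kernel had to get involved; the 1 GiB limit there is an example, not a recommendation.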

Option B: Migrate to pasta. This is the modern move. pasta comes from the passt project (one codebase: passt is the mode for virtual machines, pasta the mode for containers and network namespaces) and is the successor to slirp4netns in newer rootless setups. It generally has a smaller memory footprint and better performance, and it is the default rootless network backend in Podman 5.x. If you’re on a newer Docker release, check whether pasta is available on your platform. On Podman, migration is mostly a config change:

[network]
default_rootless_network_cmd = "pasta"

(That’s a Podman containers.conf example — default_rootless_network_cmd is the key that actually selects the backend, not a path option. Docker’s equivalent depends on your version and distribution packaging.)

Option C: bypass4netns. For throughput-critical setups, bypass4netns lets container-to-host traffic skip the userspace networking layer entirely. More complexity, more performance. Worth researching if you care deeply about rootless networking throughput.

For our setup, we went with Option A in the short term — Restart=on-failure on the rootless docker.service plus an RSS alert on slirp4netns — while we evaluate pasta on a staging box. The important thing is: one of these gets done. Leaving slirp4netns uncapped with no restart behavior is just waiting for the next Sunday night.

3. Prune Your Configs Like You Prune Your Containers

The Formbricks block didn’t cause the outage. But it made the recovery harder, and it made us feel appropriately stupid. If a container isn’t serving traffic, it shouldn’t exist in your nginx config, your compose.yml, or anywhere else in your active infrastructure.

A good habit: when you spin something down, do the full cleanup. Stop the container, remove it from compose, remove its nginx block, remove its data volume if you don’t need it. Don’t leave ghosts in the config files. Your future self at 2 AM will thank you.
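Habits are easier to keep when something nags you. Here is a minimal Python sketch that compares the hosts named in static proxy_pass directives against a list of services you consider active, and returns the ones with no match. The function name is hypothetical and the service list is yours to supply (for example from docker ps --format '{{.Names}}'). Variable-based targets are skipped, since their hostnames are not known from the config text alone.

```python
import re
from typing import Iterable, List

# Captures the host of a static target such as http://backend:8080/path.
# Targets that start with a variable do not match and are ignored.
STATIC_UPSTREAM_RE = re.compile(
    r"^\s*proxy_pass\s+https?://([A-Za-z0-9_.-]+)(?::\d+)?[^;]*;",
    re.MULTILINE,
)


def stale_upstreams(config_text: str, active_services: Iterable[str]) -> List[str]:
    """Return proxy_pass hosts that do not match any active service name."""
    active = set(active_services)
    hosts = STATIC_UPSTREAM_RE.findall(config_text)
    return sorted({host for host in hosts if host not in active})
```

Anything it returns is a candidate ghost: either the service should be running, or the block should be deleted.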

Key Takeaways

If you only remember four things:

1. Idle services are liabilities. An unused container still shows up in your nginx config, your logs, and your cognitive load during an incident. Delete things you’re not using. Actually delete them — not “stop” them, delete them.

2. Cap your logs. Both dimensions. max-size/max-file protects your disk. mode: non-blocking + max-buffer-size protects your RAM when logs spike. Set both. They solve different problems.

3. slirp4netns is rootless Docker’s weak link. It has no restart-on-failure by default, no memory cap of its own, and it’s a single point of failure for all container networking. Either put a restart policy on the rootless docker.service and alert on slirp4netns memory, or migrate to pasta (the modern default in newer rootless setups). Don’t leave it unattended.

4. Know how your reverse proxy resolves upstream DNS. Static proxy_pass http://hostname:port is resolved at startup — if it fails, nginx fails to start. Variable-based proxy_pass http://${VAR}:port is resolved at runtime per-request — it needs a resolver directive, and a dead upstream returns 502 on that location only, not the whole server. Neither is universally better, but you need to know which one you’re using and what the failure mode is.

The bill for this incident was roughly a morning of triage, cleanup, and root-causing, followed by the work of implementing the fixes properly and testing them. It could have been much worse: the hard reboot was all we needed to regain control, and we came out of it with log limits, a restart policy on the rootless Docker service, and a shorter config than we went in with.

The embarrassing part was the Formbricks container. The expensive part was having no log limits. The subtle part was the slirp4netns failure mode that most rootless Docker tutorials never mention.

Anyway. How do you handle rootless networking in your setups? Have you migrated to pasta yet, or still on slirp4netns? Curious what you’ve run into — drop it in the comments.