Show HN: Premortem, a coding-agent-powered airplane blackbox
github.com
A few weeks ago, I was getting random OOMs on my Linux box and had no idea what was causing them. At one point, when I realized that memory was getting sucked up by some process, I kicked off a Claude Code job to see if it could figure out what was happening in real time. And it did!
In real time, the coding agent ran through a suite of system commands, figured out which jobs were causing problems, and even started to dig into the explicit function calls (Python and Node processes can both be inspected at the function-call level by sideloaded processes) before the entire system finally crashed.
Besides this being extremely cool, I realized that with a few tweaks I could make it a legitimately useful tool. The basic idea: any time certain system vitals cross a threshold, spin up a coding agent and have the agent debug what is going on as aggressively as possible, with all logs being streamed to a third-party server (in addition to being stored on disk). This basic abstraction would solve two huge problems:
- Most of the time it is very hard to figure out why exactly a machine went down. This tool would effectively act as an airplane blackbox, a sort of last record of what was going on that specifically is focused on debugging the failure as it happened. Massive speed up on figuring out system-breaking issues.
- Most of the time there are available interventions that someone could take that would prevent the system from going down at all, if a human was around when the crash was happening. For example, if I see that I’m about to OOM from vitest, I can just kill a bunch of the processes that are spiking memory and prevent the system from crashing that way.
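To make the basic idea concrete, here is a minimal sketch of a threshold watcher in Python. It is not Premortem's actual implementation: the function names, the 90% threshold, the polling interval, and `agent_cmd` (a placeholder for whatever command launches your coding agent) are all assumptions for illustration. It assumes a Linux box with `/proc/meminfo` and a POSIX `ps`, and it leaves out the log streaming to a remote server.

```python
import subprocess
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value} (values mostly in kB)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_used_fraction(fields):
    """Fraction of RAM in use, from MemTotal and MemAvailable (Linux 3.14+)."""
    return 1.0 - fields["MemAvailable"] / fields["MemTotal"]


def top_memory_processes(ps_output, limit=3):
    """Rank header-less `ps` rows of (pid, rss_kb, command) by RSS, largest first."""
    rows = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].isdigit() and parts[1].isdigit():
            rows.append((int(parts[0]), int(parts[1]), parts[2]))
    return sorted(rows, key=lambda row: row[1], reverse=True)[:limit]


def watch(agent_cmd, threshold=0.90, interval_s=5.0):
    """Poll memory; on the first breach, launch the agent once and return its process.

    agent_cmd is a hypothetical argv prefix for your agent; the prompt is
    appended as the final argument.
    """
    while True:
        with open("/proc/meminfo") as f:
            used = memory_used_fraction(parse_meminfo(f.read()))
        if used >= threshold:
            # Empty headers ("pid=") make ps omit the header line.
            ps = subprocess.run(
                ["ps", "-e", "-o", "pid=", "-o", "rss=", "-o", "comm="],
                capture_output=True, text=True,
            ).stdout
            suspects = top_memory_processes(ps)
            prompt = (
                f"Memory usage is at {used:.0%}. Top processes by RSS "
                f"(pid, kB, command): {suspects}. Diagnose what is happening."
            )
            return subprocess.Popen(agent_cmd + [prompt])
        time.sleep(interval_s)
```

The parsing and ranking helpers are kept separate from the polling loop so they can be checked against canned input without touching the live system. The ranked list is also what a human (or the agent) would look at before deciding which processes to kill.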
We now have premortem running on all of our production machines.
Hope this is useful for other folks!
> When running multiple intensive processes in parallel, pushing machines to their limits to maximize throughput, traditional monitoring only provides alerts when thresholds breach.
> Premortem continuously watches system vitals (CPU, memory, disk, processes) and spawns Claude agents to diagnose problems when thresholds are breached.
Surely you see the irony here...
Sure do! I'm not saying that it won't bring about the end even faster. But you do get some very valuable things out of the machine as it's taking its dying breaths.
I don't think you do. Your solution will also only start "firing alerts" once a threshold has been breached.
updated, thanks. This is what I get for having an AI write the README
o whoops. updating the readme lol