Why "top" missed a cron job that was killing our API latency
parth21shah.substack.com

OP here. I've been doing backend work for ~15 years, but this was the first time I really felt why eBPF matters. We had a latency spike that all the usual polling tools missed: top, CloudWatch, Datadog, everything looked normal. In the end it was a misconfigured cron job spawning ~50 short-lived workers every minute. Each one ran for ~500ms, burned the CPU, and exited before the next poll, so all our "snapshot" tools were basically blind.

I wrote the post to show this exact gap: polling gives you snapshots, tracing gives you an event stream. Anything that appears and disappears between polls is only really visible to tracing. Tools like execsnoop or auditd can catch this, but in our case the overhead felt too high to leave on 24/7 in production, so I'm currently playing with a small Rust+Aya agent that listens on ring buffers and can run continuously with less overhead.

If you just want to try the idea, the post has a few bpftrace one-liners so you can reproduce the detection logic without writing any C or Rust.
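A minimal sketch of that detection idea in bpftrace (not necessarily the exact one-liners from the post; it just traces exec and exit on standard sched tracepoints and histograms process lifetime, which is where sub-poll-interval workers show up):

    bpftrace -e '
    tracepoint:sched:sched_process_exec
    {
      // remember when each newly exec-ed program started
      @start[pid] = nsecs;
      printf("%d exec %s\n", pid, comm);
    }

    tracepoint:sched:sched_process_exit
    /@start[pid]/
    {
      // lifetime histogram in ms; workers that live and die
      // between polls cluster in the low buckets
      @lifetime_ms = hist((nsecs - @start[pid]) / 1000000);
      delete(@start[pid]);
    }'

execsnoop from BCC covers the exec half of this out of the box; the lifetime histogram is what makes the "ran ~500ms and vanished" pattern jump out.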
I could already guess the answer, and there is just so little actual content here, with way too many words to explain a simple idea. Which is what you typically get when you let an LLM write for you.
This is a great example of the blind spot between sampling-based observability and event-driven tracing.
Anything that appears + disappears between polls is effectively invisible unless you’re streaming syscalls/process events. It’s surprising how often “short-lived, high-impact” processes cause the worst production spikes.
Curious whether you’re planning to surface this at the scheduler level (run queue latency/involuntary context switches) or stick to process-lifecycle tracing?
Right now I'm sticking to process lifecycle (sched_process_fork and sched_process_exit), mostly for correlation: it's much easier to grab container ID / cgroup metadata at fork time and say "this pod/image is the bad actor" than to reconstruct that context from a firehose of sched_switch events. I agree that run queue latency / scheduler stats are the "better" signals for pure performance debugging, but scheduler switches generate a huge volume of events compared to forks, so I'm starting with fork/exec/exit + container/cgroup mapping. If you've shipped scheduler-level tracing in production, I'd love to hear how you handled filtering and aggregation.
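For anyone who wants to see the shape of that before it becomes a Rust+Aya agent, here is a rough bpftrace sketch of the fork/exit + cgroup correlation. It is only an illustration: the cgroup builtin gives a cgroup v2 ID taken from the forking parent (which children normally inherit), and mapping that ID to a pod/image is a separate out-of-band step the real agent would handle.

    bpftrace -e '
    tracepoint:sched:sched_process_fork
    {
      // record birth time and the cgroup v2 ID of the forking parent;
      // children normally inherit it, so it stands in for the pod/container
      @birth[args->child_pid] = nsecs;
      @cg[args->child_pid] = cgroup;
    }

    tracepoint:sched:sched_process_exit
    /@birth[tid]/
    {
      // aggregate per cgroup ID: exit count and total lifetime in ms
      $c = @cg[tid];
      @exits[$c] = count();
      @lifetime_ms[$c] = sum((nsecs - @birth[tid]) / 1000000);
      delete(@birth[tid]);
      delete(@cg[tid]);
    }'

In the cron scenario above, the offending cgroup would show up as roughly 50 exits per minute, all with sub-second lifetimes.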