Codex-in-Docker Debugging: From a “Weird SIGSEGV” Core Dump to a Real Fix (SEGV_PKUERR in V8)

Haohang Shi

Disclosure: I work on Proton / Timeplus. Addresses and any sensitive details are redacted. The fix is open-source and linked below.

Proton is an Apache 2.0–licensed open-source DBMS/streaming engine forked from ClickHouse. We embed V8 as the runtime for JavaScript UDFs.

We recently hit a production crash that looked like a stack-unwinding bug. The server died with SIGSEGV, the fatal handler tried to print a stack trace, and that trace capture sometimes crashed inside libunwind. Logs showed Unknown si_code. Some frames hinted at V8 builtins (Builtins_*, JSEntry*), but the unwind often stopped early with frame did not save the PC.

The surprising part: once we built a deterministic postmortem environment, Codex (GPT‑5.2, xhigh) drove the end-to-end GDB investigation and authored the fix. Humans mostly set up the harness, reviewed the diff, validated it, and merged/backported.

This post focuses on the agent-driven debugging workflow that made the result possible.

TL;DR

  • We ran everything inside a container where core dump + debug binary + symbols + source paths all matched, then installed Codex inside that container.
  • Codex stopped trusting unstable backtraces and instead classified the fault via siginfo_t / ucontext_t.
  • Root cause: SEGV_PKUERR (Intel MPK/PKU), triggered by a thread-local PKRU mismatch on V8 entry.
  • Fix: normalize PKRU on V8 entry (Linux x86_64), block profiler signals in the fatal handler, improve crash logging, and add a minimal repro.

Even if you don’t care about PKU/PKRU, the workflow generalizes: when the backtrace lies, classify the signal first.

Step 0: Make the debugging surface deterministic (Docker + symbols + source mapping)

This was the enabling step. Without it, neither humans nor agents can reason reliably from a core dump.

We ran a debug-symbol image and mounted the workspace containing the core dump and the matching source tree:

docker run -it --rm \
-w /work \
-v "$PWD:/work" \
--network host \
--name codex-debug \
--user 0 \
timeplus/timeplusd:<debug-image> \
bash

Notes:

  • We ran this in an isolated debugging container. In our case we used --user 0 to install tools in-container (e.g., gdb, Node.js), and --network host as a local convenience. Neither is required for core dump analysis.
  • If build paths differ from your mounted source paths, use GDB path substitution (or mount into the same paths used by the build).
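For example, if the release binary recorded a build directory that differs from where you mounted the tree (paths below are hypothetical), one substitution rule in GDB makes source listings resolve:

```
(gdb) set substitute-path /build/proton /work/proton
(gdb) list main    # source listings now resolve under /work/proton
```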

Important: we installed Codex inside the container (via npm in our case) so the agent could run GDB and the build loop in the same filesystem context as the artifacts.

Once the binary, symbols, core, and source paths all lined up, the agent could iterate systematically instead of guessing.

What Codex did vs. what humans did

A realistic split looked like this:
Codex executed GDB and authored the patch; humans ensured the environment was correct and the change was safe to ship.

The turning point: stop trusting the backtrace, classify the SIGSEGV

Because the backtrace was unstable (and captured in signal context), Codex pivoted away from “unwinder bug” and inspected signal metadata.

In our fatal signal handler we already had siginfo_t* info and ucontext_t* context. From the core dump:

(gdb) p info->si_code
$1 = 4
(gdb) p ((ucontext_t*)context)->uc_mcontext.gregs[REG_TRAPNO]
$2 = 14
(gdb) p/x ((ucontext_t*)context)->uc_mcontext.gregs[REG_ERR]
$3 = 0x25

Codex’s cross-check:

  • TRAPNO = 14 → page fault (#PF)
  • ERR = 0x25 includes the protection-keys (“PK”) bit (0x20)
  • si_code = 4 → SEGV_PKUERR (not the usual SEGV_MAPERR / SEGV_ACCERR)

So the log’s Unknown si_code wasn’t noise. It was the clue.

Minimal PKRU explanation

Intel MPK/PKU (Memory Protection Keys) gates page access via a CPU register called PKRU.

The only detail that mattered here: PKRU is per-thread. If a thread enters V8 with a PKRU state that denies access to pages V8 expects to read, it can fault immediately with SEGV_PKUERR. That can look “intermittent” because different threads have different register state.

The red herring: why libunwind was on top

This incident was painful because it looked like “libunwind is broken.”

We have a profiler that uses signals (e.g., SIGUSR1) to capture stack traces. During fatal SIGSEGV handling, there was a window where profiling signals could fire. That triggered unwinding inside a signal handler while the process was already compromised—and that unwinding sometimes crashed inside libunwind.

Net effect:

  • the original crash was PKU (SEGV_PKUERR)
  • the secondary crash was re-entrant unwinding, which obscured the root cause

What we shipped

Codex implemented the patch; humans reviewed and validated.

  1. Normalize PKRU on V8 entry (Linux x86_64 only).
    Capture a baseline PKRU after V8 init, restore it on each V8 entry via a small RAII guard.
  2. Harden fatal crash handling.
    Block profiler-related signals (SIGUSR1/SIGUSR2) during fatal crash handling to avoid re-entrant unwinding obscuring the primary fault.
  3. Improve diagnostics.
    Decode/log SEGV_PKUERR clearly so it no longer prints as Unknown si_code.
  4. Add a minimal repro.
    v8_pkru_repro deterministically fails without the guard and succeeds with it.

Details and exact diffs are in the PR below.

A checklist for when backtraces lie

  • Treat unstable backtraces as a symptom, not a diagnosis.
  • Classify the fault via siginfo_t (si_code) and ucontext_t (trapno + arch error bits).
  • Watch for secondary crashes inside profiling/unwinding/logging that obscure the first crash.
  • If the bug is “intermittent,” consider thread-local CPU state (e.g., PKRU) as a first-class suspect.
  • If you use signal-based profilers, harden the fatal handler to prevent re-entrant unwinding.

Links