feat: SovereignGraphGuard for Atomic Persistence & GraphFlow Recovery (#7043) by merchantmoh-debug · Pull Request #7164 · microsoft/autogen

2 min read Original article ↗

Why are these changes needed?
This PR implements SovereignGraphGuard to resolve the critical state persistence and "Zombie State" bugs identified in GraphFlow State Persistence Bug #7043.

The current GraphFlowManager state handling is susceptible to "Operational Ischemia" (process interruptions during atomic transitions), leading to states where:

Work remains (remaining > 0).
No agents are enqueued (ready is empty).
The workflow falsely reports "Digraph execution is complete."
The Solution (SovereignGraphGuard):

Zero-Capitulation Persistence: Replaces standard file writes with an atomic "Iron Seal" protocol (fsync + rename) to prevent partial state corruption during interrupts.
Ischemia Repair Protocol (Auto-Healing): Upon loading state, it runs a load_and_heal() check. If it detects a "Zombie State" (work exists but flow is stalled), it injects "adrenaline" by identifying pending agents and re-injecting them into the ready queue, forcing the workflow to resume.
Robust Serialization: Specifically patches the deque and Counter serialization issues that cause crashes when deep-copying agent locks, ensuring state can be captured safely even during complex execution.
Related issue number
Closes #7043

Checks
I've included any doc changes needed for https://microsoft.github.io/autogen/. See https://github.com/microsoft/autogen/blob/main/CONTRIBUTING.md to build and test documentation locally.
I've added tests (if relevant) corresponding to the changes introduced in this PR.
I've made sure all auto checks have passed.