Ralph loops for SDK migrations: a case study | StackToHeap


In Part 3, I introduced Ralph loops: autonomous coding agents that pick up a user story from a PRD, implement it, pass quality gates, commit, and move on to the next story without me. I talked about the concept and the vaibhav tooling that drives it.

This post is a case study. A real Ralph loop on a real project, with timestamps, commit hashes, two attempts (one that did not work well), and the manual work that came after. The task: migrating DRS, my AI code review bot, from OpenCode SDK to Pi SDK.

The migration

DRS is a TypeScript project that reviews GitLab MRs and GitHub PRs using AI agents. It was built on top of the OpenCode SDK. I wanted to migrate the entire runtime to Pi, a different coding agent SDK, while preserving full feature parity across three review flows: review-local, review-mr, and review-pr.

This is not a trivial find-and-replace. The migration touches:

  • The agent runtime and session management
  • Tool registration (custom tools like write_json_output and drs_skill)
  • Agent/skill loading and path resolution
  • Platform-specific review flows (GitLab and GitHub)
  • CI pipelines, documentation, and configuration
  • All tests

In other words, it is exactly the kind of task that is well-scoped (clear before and after), dependency-ordered (runtime before flows before cutover), and verifiable (tests and lint must pass). That makes it a good candidate for a Ralph loop.

Attempt 1: too abstract

Before the Ralph loop, I had already done some manual exploration. I created a pi-migration branch on Feb 15, spent two days manually wiring up Pi SDK, pushing 18 commits with lots of trial and error around API keys, model configurations, and agent event handling. That gave me a working proof of concept and a feel for what the migration actually involved.

On Feb 21, I set up the first Ralph loop with a PRD that had 6 user stories:

| # | Story | Time |
|---|-------|------|
| US-001 | Define SDK Parity Contract | 02:07 |
| US-002 | Migrate Core Agent Runtime to pi-mono | 02:10 |
| US-003 | Migrate Tools and Streaming/Event Integration | 02:15 |
| US-004 | Simplify Skills Support | 02:19 |
| US-005 | Update Tests, CI, and Deployment Configuration | 02:22 |
| US-006 | Decommission OpenCode SDK | 02:30 |

Six stories in 23 minutes. 26 files changed, 1,401 insertions, 825 deletions. Ralph blazed through it.

But the result was not good enough. The PRD was too abstract. Stories like “Migrate Core Agent Runtime” and “Migrate Tools and Streaming/Event Integration” were broad buckets. Ralph could pass the tests and lint checks because it wrote new tests that validated its own assumptions, but the actual integration was shallow. The stories did not map to the real migration boundaries of the codebase: the three distinct review flows (review-local, review-mr, review-pr) that each needed their own migration path with platform-specific error handling.

The first attempt taught me something important about Ralph loops: the PRD is the product. If your stories are too abstract, Ralph will produce code that technically passes gates but does not actually solve the problem. Ralph follows signs. If the signs say “migrate the runtime” without specifying what “migrated” means for each concrete flow, Ralph will do something plausible but incomplete.

Attempt 2: concrete stories

I rewrote the PRD with 9 user stories that mapped directly to the codebase’s real boundaries:

| # | Story | Started | Committed |
|---|-------|---------|-----------|
| US-001 | Introduce Pi SDK Integration Layer | 01:11 | 01:19 |
| US-002 | Migrate review-local to Pi SDK | | 01:24 |
| US-003 | Migrate GitLab MR Review Flow to Pi SDK | | 01:33 |
| US-004 | Migrate GitHub PR Review Flow to Pi SDK | | 01:38 |
| US-005 | Pi-Native Skill/Agent Path Handling | | 01:49 |
| US-006 | Migrate Existing Skills/Agents to Pi-Native Setup | | 02:00 |
| US-007 | Add Pi-Native Review Setup Enhancements | | 02:11 |
| US-008 | Hard Cutover and OpenCode Removal | | 02:36 |
| US-009 | End-to-End Validation and Documentation | | 02:40 |

All timestamps are Feb 22, 2026 UTC. The loop started at 01:11 and the last commit landed at 02:40. Nine user stories in 89 minutes.

The difference in story design was critical:

  • US-001 created the Pi SDK wrapper layer (src/pi/sdk.ts) with error mapping. Foundation first.
  • US-002 through US-004 each migrated one specific review flow. Not “migrate all flows” but “migrate review-local,” then “migrate review-mr,” then “migrate review-pr.” Each story had its own acceptance criteria about platform-specific error handling.
  • US-005 through US-007 handled the supporting infrastructure: path resolution, agent definitions, skill discovery.
  • US-008 was the hard cutover: remove OpenCode SDK, update all docs and config.
  • US-009 was validation and documentation.

The dependency ordering meant each Ralph iteration built on what the previous one committed. US-002 could use the Pi wrapper from US-001. US-003 could follow the pattern established in US-002. By the time Ralph got to US-008 (remove OpenCode), everything was already wired through Pi.

What Ralph produced

The numbers:

  • 60 files changed, +3,978 lines / -1,978 lines
  • New files: src/pi/sdk.ts (657 lines), src/pi/sdk.test.ts, src/opencode/path-config.ts, 7 Pi-native agent definitions under .pi/agents/, migration docs
  • Major rewrites: src/opencode/client.ts, src/cli/review-mr.ts, src/cli/review-pr.ts
  • New test files: 6 new test files with comprehensive coverage for each migration boundary
  • Documentation: docs/PI_MIGRATION.md, updated README, DEVELOPMENT.md, architecture docs, all CI integration guides

The progress.txt that Ralph maintained across iterations is worth reading in full. Each iteration appended what it did, what files it touched, and what it learned. By US-003, Ralph had already established a codebase pattern:

“For platform reviews, preload PR/MR metadata and changed files once at the CLI boundary, then pass them into executeUnifiedReview to avoid duplicate API calls and keep platform-specific error mapping close to fetch logic.”

US-004 (GitHub PR flow) then followed this exact pattern because it read the progress log before starting. The “signs next to the slide” worked exactly as intended.

The manual work after

Ralph’s loop finished at 02:40. But the migration was not done. Between 10:47 and 17:16 on the same day, I ran 11 more commits through interactive agent sessions:

6074206 refactor: enforce in-process runtime and add review-local integration test
529277e fix(ci): sync lockfile with pi runtime dependencies
6d9dd91 refactor(skills): remove drs_skill tool and load .drs skills natively
2787fd9 refactor(skills): rely on native pi skills prompt only
46caaa7 feat(logging): show clipped agent/tool traces without debug
ba0900e test(e2e): add opt-in live review-local LLM test
6a3f11d fix(review-local): write --output JSON even without --json
a1973b4 feat(cli): load .env automatically via dotenv
a3b99b1 fix(ci): resolve opencode key from secrets in PR review workflow
d9d6d8c chore(release): bump to 3.0.0-alpha.0 and fix PR key env
1f542c3 fix(describe): suppress verbose output when posting descriptions

These 11 commits fall into three categories, and each one tells you something about where the loop’s boundary currently sits — and how that boundary could move.

CI and deployment issues (fix(ci): sync lockfile, fix(ci): resolve opencode key). The loop had no way to know that the GitHub Actions workflow referenced a secret called OPENCODE_API_KEY that needed renaming, or that the lockfile would drift when Pi runtime dependencies changed. These are integration points that live outside the codebase.

But they do not have to stay outside the loop. If I had wired up gh cli as part of the quality gate — push to a draft PR, run gh run watch, parse the output — Ralph could have caught these itself. The snarktank/ralph implementation already supports this pattern. Next time, I plan to include a CI verification step so the loop can self-correct on deployment failures instead of leaving them for me.

Runtime behavior discovered through real usage (refactor: enforce in-process runtime, fix(review-local): write --output JSON, fix(describe): suppress verbose output, feat(cli): load .env automatically). These came from actually running the migrated code against real APIs and noticing things: verbose output cluttering the review comments, JSON output not being written in certain flag combinations, .env files not loading automatically.

This is where better e2e test coverage would close the gap. If the test suite had included a live integration test that exercised a real review flow end-to-end, these issues would have surfaced during the loop’s quality gate phase, not after. I added one of those tests (test(e2e): add opt-in live review-local LLM test) during this phase — it will be there for the next loop.

Architectural decisions (refactor(skills): remove drs_skill tool, refactor(skills): rely on native pi skills prompt only). Removing the drs_skill custom tool entirely and relying on Pi’s native skill discovery was a design choice. It emerged from testing the migrated code and realizing that the abstraction layer Ralph had preserved was no longer necessary with Pi’s native capabilities.

This is the category that is hardest to automate. These were judgment calls about the system as a whole — not just “does this work?” but “is this the right design now that we are on Pi?” Ralph follows the plan. It does not question whether the plan itself should change mid-execution. That kind of holistic reasoning is where the human still adds the most value, and where the interactive agent sessions shine.

The overall pattern: the loop handles the mechanical transformation, and interactive sessions handle the verification and design refinements. The loop got the codebase from OpenCode to Pi. The interactive sessions made it production-ready. On this migration, the split was roughly 80/20. But the 80 is the hard part to do manually — 60 files, nearly 4,000 lines of additions, all dependency-ordered and test-passing. The 20 is work I could do in an afternoon with an AI agent at my side.

And the 80/20 split is not fixed. Each piece of post-loop work I did points to a specific improvement I can make for the next loop: add gh cli CI checks, add e2e integration tests, write more concrete acceptance criteria. The loop gets better each time, not because the models improve (though they do), but because the environment and feedback loops around them improve.

The PR was merged at 19:00 UTC on Feb 22. Total wall-clock time from loop start to merge: about 18 hours, most of which was me being away or doing other things.

Where Ralph loops work well for migrations

After going through this, I think SDK/framework migrations are one of the strongest use cases for Ralph loops right now. Here is why:

Clear before and after states. A migration has a well-defined goal: the code should do the same thing but use a different dependency. This makes acceptance criteria concrete and testable.

Natural dependency ordering. Migrations have a logical sequence: adapter layer → core flows → supporting infrastructure → cutover → validation. Each step depends on the previous one. This maps perfectly to a priority-ordered PRD.

Mechanical transformation with known patterns. Most of the work is replacing API calls, updating imports, and adapting to slightly different interfaces. This is exactly what LLMs are good at: pattern-following at scale.

Quality gates catch regressions. If the migration breaks something, the existing test suite catches it. Ralph will not commit broken code because npm run check:all has to pass.

Progress log enables pattern propagation. When Ralph migrates the first platform flow (GitLab), it records the pattern. The next platform flow (GitHub) reads that pattern and follows it. This is the “signs” mechanism working at its best.

Where it does not work (yet)

These are real limitations I hit, but the “yet” matters. Most of them are shrinking as tooling and techniques improve.

Integration and end-to-end testing. Ralph cannot spin up a real GitLab instance and test that MR comments actually post correctly. It can write unit tests and pass type checks, but verifying that the migrated code works against real APIs required me to run it manually after the loop.

This is one of the most active areas of improvement. As Anthropic’s harness engineering post documents, the critical work is not writing code but designing the environment and feedback loops. The more verification you can make programmatic, the more Ralph can handle autonomously. If I had wired up a staging GitLab instance with a test MR that the loop could exercise, Ralph could have caught the CI issues itself. Geoffrey Huntley’s original formulation emphasizes this: “always look for opportunities to loop Ralph back on itself.” Every feedback signal you add — test output, build logs, even screenshots via Playwright — makes the loop more self-correcting.

In practice, this means the gap between “what Ralph can verify” and “what needs a human” is a function of how much you invest in your quality gates. The DRS project had good unit test coverage and strict TypeScript checks, which is why the loop produced usable code. Projects with comprehensive e2e test suites will get even better results.

CI and deployment feedback. Secrets, environment variables, workflow triggers — these live outside the codebase and cannot be validated by local quality gates. After the Ralph loop, I spent several commits fixing CI: the opencode key name in GitHub Actions secrets needed updating, the lockfile needed syncing with new Pi runtime dependencies, and branch triggers needed adjusting.

But this is not a fundamental limitation. Ralph loops can be given access to tools like gh cli to check CI status after pushing, read workflow run logs, and course-correct. The snarktank/ralph implementation already supports this pattern: push, check CI, fix if red, iterate. The loop I ran did not do this because I had not wired it up, but the technique exists. Future runs could include a CI verification step as part of the quality gate — push to a draft PR, wait for the workflow, parse the result, and fix failures before moving on.

Compound errors. This is the concern that comes up most often in community discussions: what if Ralph makes a mistake early on and everything built on top inherits that mistake? In my first attempt, this is exactly what happened. The abstract PRD led Ralph to build parallel abstractions that technically passed tests but did not integrate well with the existing codebase. Each subsequent story built on that shaky foundation.

The mitigation is the same as in any engineering process: small stories, strong quality gates, and early verification. The second attempt worked because each story was small enough that mistakes were contained. US-001 created the Pi wrapper. If that wrapper was wrong, US-002 (which used it for review-local) would have failed its tests immediately, not six stories later. Dependency ordering is not just about build order — it is about error detection order.

The progress log also helps here. When Ralph records what it learned in each iteration, the next Ralph reads those learnings before starting. If US-003 discovered that “platform APIs already return patches, pass them through review instructions instead of relying on fallback git commands,” then US-004 inherits that knowledge for free. Mistakes propagate, but so do corrections.

Design decisions. The choice to remove the drs_skill custom tool and rely on native Pi skill discovery was not in the PRD. It emerged from testing the migrated code and realizing the abstraction was no longer necessary. Ralph follows the plan. It does not question whether the plan is the right approach.

This is where the human role remains essential. As Huntley puts it, “software is now clay on the pottery wheel and if something isn’t right then you just throw it back on the wheel.” Ralph handles the mechanical shaping. The potter decides what shape to make. In my case, the post-loop interactive sessions were where I made architectural calls: should skills be loaded natively or through a custom tool? Should the runtime enforce in-process mode? These are judgment calls that require understanding the system holistically, not just story-by-story.

That said, the line between “mechanical” and “judgment” keeps moving. A year ago, you would not trust an LLM to reason about SDK migration boundaries. Today, with a well-written PRD, it handles them competently. The design decisions I made manually were about system-level trade-offs that crossed multiple story boundaries. As PRDs get more sophisticated and models get better at holistic reasoning, some of those decisions will move into the loop too.

PRD quality is the bottleneck. My first attempt proved this decisively. Six abstract stories produced code that passed checks but missed the point. Nine concrete stories, each mapped to a real codebase boundary, produced a working migration. The time spent rewriting the PRD was the most valuable part of the entire process.

This is the universal lesson from the Ralph community. Every practitioner who has shared results — from Huntley’s three-month programming language build to YC hackathon teams shipping six repos overnight — emphasizes the same thing: the PRD is the product. You are not writing code anymore. You are writing specifications that a loop will execute. The quality of the output is bounded by the quality of the input.

For migrations specifically, this means your stories should map to real code boundaries, not conceptual categories. “Migrate the runtime” is a conceptual category. “Migrate review-local to Pi SDK” is a code boundary. The difference is whether Ralph has to make architectural decisions (bad) or follow a pattern (good).
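For illustration, a code-boundary story might read like this. This is a hypothetical PRD fragment modeled on the post, not the actual PRD, and the file path is an assumption:

```markdown
## US-002: Migrate review-local to Pi SDK

Acceptance criteria:
- review-local runs entirely through the Pi wrapper from US-001;
  no OpenCode imports remain in the review-local CLI entry point.
- Errors from the wrapper surface with the same exit codes as before.
- `npm run check:all` passes, including the existing review-local tests.
```

Everything in it points at a file, an import, or a command — nothing requires Ralph to decide what "migrated" means.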

The numbers

| | Attempt 1 | Attempt 2 |
|---|-----------|-----------|
| Stories | 6 | 9 |
| Duration | 23 min | 89 min |
| Files changed | 26 | 60 |
| Lines added | 1,401 | 3,978 |
| Lines removed | 825 | 1,978 |
| Post-loop commits | — | 11 |
| Post-loop duration | — | ~7 hours |
| Result | Discarded | Merged (PR #84) |

The second attempt took longer because it did more real work. Each story was concrete enough that Ralph had to actually integrate with existing code rather than writing parallel abstractions. More time per story, but the output was usable.

Closing thoughts

Part 3 was about how Ralph loops work in theory. This post is about how they work in practice — on a real codebase, with a real migration, including the failure mode (attempt 1) and the manual cleanup (post-loop verification).

The takeaway is not “Ralph loops can do everything.” It is that they are very good at a specific class of problem: well-scoped transformations with clear acceptance criteria, natural dependency ordering, and existing quality gates. SDK migrations fit this perfectly.

The progression across this series:

  1. Part 1: Code from my phone using a web UI
  2. Part 2: Any AI tool, same workflow, from anywhere
  3. Part 3: Ralph loops running autonomously
  4. Part 4: A concrete case study — migrating an SDK with Ralph

The next interesting question is: what other classes of problems fit this pattern? Dependency upgrades, framework migrations, API version bumps, test coverage campaigns, documentation generation — all of these have the same structure of well-defined scope, dependency ordering, and verifiable output. I will keep experimenting.

The DRS migration is at PR #84. The Ralph loop tooling is at github.com/manojlds/vaibhav.