I’ve been using Claude Code heavily for my side hustle work, which includes quite a bit of front-end development: generating views, tweaking layouts, wiring up multi-step flows. It’s remarkably good at writing code that works. But here’s the thing about front-end development: code that works and UI that looks right are two very different problems.
Claude Code operates in a fundamentally text-based world. It reads your codebase, writes code, runs tests — all through text. When it generates a view template or a CSS change, it’s reasoning about what the output should look like based on the structure of the markup. It’s working blind.
And that’s exactly the gap I wanted to close.
The feedback loop that was missing
Think about how you develop front-end features. You write some code, switch to the browser, eyeball it, adjust, repeat. That tight visual feedback loop is what keeps layouts from drifting, what catches the subtle things — a button that’s technically rendered but visually buried, a modal that works but overlaps the header, a flow that functions but feels broken.
Claude Code doesn’t get that loop. It writes the code and moves on, relying entirely on structural reasoning to predict visual outcomes. For simple changes, that’s often fine. For anything involving layout, spacing, multi-step flows, or responsive behaviour — it’s flying instruments-only in a task that really needs a window — so I gave it one.
The approach: screenshots as a verification layer
The idea is simple. After Claude Code makes front-end changes, it runs the relevant system tests with automatic screenshot capture enabled, then visually examines every screenshot to verify the UI looks correct. It’s a round-trip: write code, render it through a real browser, capture what the browser actually produced, and inspect the result.
This isn’t a new testing technique — screenshot testing has been around for years. What’s new is wiring it into an AI coding agent’s workflow so it can self-verify its own visual output. The agent goes from “I wrote markup that should render a card layout” to “I wrote markup, rendered it, and confirmed it actually looks like a card layout.”
The implementation has two pieces: a screenshot harness that automatically captures PNGs around every Capybara action during system tests, and a Claude Code custom command that orchestrates the whole flow — run the tests, find the screenshots, visually examine each one, and report what it sees.
How it works in practice
My setup is Ruby on Rails with Minitest system tests, but the approach generalises to any stack with browser-based testing.
The screenshot harness is a Ruby concern that hooks into Capybara’s action methods — `visit`, `click_button`, `fill_in`, and so on. When you run tests with `SCREENSHOTS=1`, it captures a PNG before and after click actions (to catch transient states like confirmation dialogs) and after everything else. The output is a structured directory of numbered screenshots that tell the story of the test:
```
tmp/screenshots/ParticipationsSystemTest/
  full_participation_flow/
    001_visit_participations.png
    002_before_click_button_Done.png
    003_after_click_button_Done.png
    004_before_click_link_Continue.png
    005_after_click_link_Continue.png
```

The custom command (`/screenshot-test`) tells Claude Code what to do with all of this. It runs the specified system tests with screenshot capture, then reads every captured PNG and describes what it sees — layout, content, visual state. It flags anything that looks broken, misaligned, or unexpected, and verifies that each screenshot makes sense for the action that triggered it.
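For the harness half, here’s a rough sketch of the mechanics. The real `auto_screenshots.rb` is in the gist at the end of this post; the module and helper names below are illustrative, not necessarily what the gist uses.

```ruby
# test/support/auto_screenshots.rb (sketch; names are illustrative)
require "fileutils"

module AutoScreenshots
  extend ActiveSupport::Concern

  # Capybara DSL actions to wrap. Click actions also get a "before" shot,
  # to catch transient states like confirmation dialogs.
  WRAPPED_ACTIONS = %i[visit click_button click_link click_on fill_in].freeze

  included do
    setup { @auto_screenshot_counter = 0 }
  end

  WRAPPED_ACTIONS.each do |action|
    define_method(action) do |*args, **kwargs, &block|
      return super(*args, **kwargs, &block) unless ENV["SCREENSHOTS"] == "1"

      if action.to_s.start_with?("click")
        capture("before_#{action}_#{args.first}")
        result = super(*args, **kwargs, &block)
        capture("after_#{action}_#{args.first}")
      else
        result = super(*args, **kwargs, &block)
        capture("#{action}_#{args.first}")
      end
      result
    end
  end

  private

  # Saves a numbered PNG under tmp/screenshots/<TestClass>/<test_name>/.
  def capture(label)
    @auto_screenshot_counter += 1
    dir = Rails.root.join("tmp", "screenshots", self.class.name,
                          name.delete_prefix("test_"))
    FileUtils.mkdir_p(dir)
    file = format("%03d_%s.png", @auto_screenshot_counter,
                  label.gsub(/\W+/, "_").squeeze("_"))
    page.save_screenshot(dir.join(file).to_s)
  end
end
```

Wrapping the Capybara actions themselves, rather than taking one screenshot at the end, is what produces the numbered sequence and the before/after pairs around clicks that catch dialogs and other transient states.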
The glue that ties it together is a few lines in my CLAUDE.md that tell Claude Code when and how to use this capability:
```markdown
### Visual Verification with Screenshots

System tests support automatic screenshot capture via `SCREENSHOTS=1`.
Use the `/screenshot-test` skill to run system tests and visually examine
every captured screenshot. This is useful for verifying UI changes,
debugging layout issues, or reviewing multi-step flows. Only use it to
verify new or modified functionality, not with the entire test suite.
```
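The command file itself is just a markdown prompt. The real `screenshot-test.md` is in the gist below; a stripped-down sketch of its shape, assuming Claude Code’s standard `$ARGUMENTS` placeholder for the test paths:

```markdown
Run the given system tests with screenshot capture enabled:

    SCREENSHOTS=1 bin/rails test $ARGUMENTS

Then find every PNG captured under tmp/screenshots/ for this run, read each
one in order, and describe what it shows: layout, content, visual state.
Flag anything that looks broken, misaligned, or unexpected, and confirm
that each screenshot makes sense for the action named in its filename.
```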
What changed
Before this, my review process for Claude Code’s front-end work was: read the diff, eyeball the code, run the tests to make sure nothing’s broken, then manually check the UI in a browser. I was the visual verification layer.
Now, Claude Code catches a significant number of visual issues before I even start reviewing. Overlapping elements, missing content in specific states, flows that navigate correctly but render awkwardly — these get flagged in the same cycle as the code change. By the time I look at a PR, the obvious visual problems have already been caught and fixed.
This shifted my review from “does this look right?” to “does this look good?” — which is a much more valuable use of my time.
Adapting this to your stack
The core pattern is stack-agnostic:
- A screenshot harness that captures browser state at meaningful points during test execution. In Rails, that’s the `AutoScreenshots` concern. In Playwright, you’d use `page.screenshot()` hooks. In Cypress, `cy.screenshot()`. The key is capturing around actions, not just at the end of a test.
- A custom command that orchestrates the run-examine-report cycle. This is the part that tells your AI coding agent to actually look at the screenshots and reason about what it sees.
- A `CLAUDE.md` entry (or equivalent project-level instruction) that tells the agent when to use the capability — after front-end changes, not on every commit.
The harness is the only part that’s language-specific. The command and the workflow pattern port directly.
The files
I’ve published all three pieces as a single gist for anyone who wants to adapt this:
- `auto_screenshots.rb` — the Rails/Capybara screenshot harness
- `screenshot-test.md` — the Claude Code custom command
- (snippet from) `CLAUDE.md` — the project-level instruction that wires it together
Drop the concern into your test helpers, add the custom command to your `.claude/commands/` directory, update your `CLAUDE.md`, and you’re off.
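For the first step, the wiring is a couple of lines in the system test base class (a sketch, assuming the `AutoScreenshots` module name from earlier):

```ruby
# test/application_system_test_case.rb
require "test_helper"
require_relative "support/auto_screenshots"

class ApplicationSystemTestCase < ActionDispatch::SystemTestCase
  include AutoScreenshots  # wraps Capybara actions when SCREENSHOTS=1
  driven_by :selenium, using: :headless_chrome, screen_size: [1400, 1400]
end
```

From there, `SCREENSHOTS=1 bin/rails test test/system/participations_test.rb` produces the PNGs on an ordinary test run, and `/screenshot-test` layers the examination step on top.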
The broader point
AI coding agents are extraordinarily capable at structural reasoning — understanding code, tracing logic, writing correct implementations (well… sometimes). Where they’ve been weakest is in tasks that require perception — seeing what something actually looks like, not just what it should look like in theory.
Giving Claude Code the ability to render its own work and inspect the result isn’t just a convenience. It closes a fundamental feedback loop. And if there’s one thing systems thinking teaches us, it’s that the quality of any system’s output is bounded by the quality of its feedback loops.
The tighter you make that loop, the better the output gets. Screenshots are a low-effort, high-leverage way to close the visual gap — and the difference in output quality is immediate.