Every visual testing tool has an image diffing engine at its core. Playwright uses pixelmatch. Percy uses their own proprietary system. Chromatic renders in their infrastructure. Most tools use some variation of the same approach: load two images, compare pixels, report differences.
They’re all… fine. Good enough. But nobody’s been pushing the boundaries of what image diffing can actually do.
I spent four years at Percy building visual testing (support + SDKs 👋🏼). When I left and came back to the space a few years later, I expected things to have moved forward. They hadn’t. The same slow diffing. The same basic pixel counting. The same workflows that made visual testing feel like a chore.
I built Honeydiff because I was disappointed. And because I knew it could be better.
Here are the numbers.
The Landscape Today
Before I get into benchmarks, let’s talk about what’s out there.
pixelmatch is everywhere. It powers Playwright’s visual comparisons, jest-image-snapshot, and dozens of other tools. It’s pure JavaScript, single-threaded, and does the job. But it doesn’t scale.
odiff was the answer to pixelmatch’s performance problems. Originally written in OCaml, then rewritten in Zig with SIMD optimizations. It’s genuinely fast, the fastest open-source option I’ve found. Visual testing tools that care about performance use odiff.
The big players? Percy, Chromatic, Applitools. They’re focused on workflow, integrations, and scale. The actual image diffing has historically been “good enough” rather than a core innovation area. I helped Percy through the BrowserStack acquisition. Trust me when I say: nobody in visual testing has been pushing the boundaries of image comparison. Until now.
When I benchmarked Honeydiff, I tested against both: pixelmatch (what most tools use) and odiff 4.1.1 (the fastest open-source option). If Honeydiff beats both, it beats everything.
Why I Built Honeydiff
I want to be clear about something: odiff is good software. The maintainers have done solid work, and it’s genuinely the fastest open-source option out there.
But I couldn’t build Vizzly on top of it. Here’s why.
I needed something built for modern visual testing at scale. Not a general-purpose image diffing library that happens to work for screenshots. I needed spatial clustering to power diff visualizations. SSIM scoring for smart auto-approval of dynamic content. Intensity statistics for understanding the nature of changes. These aren’t features you bolt on. They need to be architected from the ground up.
I needed extensibility and control. When you’re building a visual testing platform, you need to be able to add features, tune algorithms, and optimize for your specific use cases. Wrapping someone else’s library means you’re always working around their constraints.
I didn’t feel great building a business on someone else’s open source work. This is a personal thing, but it matters to me. Using odiff for a side project is fine. Building a company on top of it felt wrong. If Vizzly succeeds, the value should come from work I’ve done, not from leveraging someone else’s unpaid labor.
The project’s direction gave me pause. odiff started in OCaml, then got rewritten in Zig. That’s a major architectural shift. I watched the repo for a while and saw enough churn that I wasn’t confident about long-term stability. When your business depends on a dependency, you need to trust its trajectory. This isn’t a dig at the maintainers. Rewrites happen, priorities shift, and that’s totally valid. But combined with everything else above, it was the final push I needed to consider building something myself.
So I built Honeydiff. And in the process of building exactly what Vizzly needed, I ended up with something much better than I expected.
The Benchmark
Comparing identical white images tells you nothing useful. I used actual screenshots from production applications.
System: Apple Silicon Mac, macOS Darwin 24.6.0
Honeydiff: 0.3.0 (Rust, Rayon parallelism)
pixelmatch: 6.0.0 (JavaScript, what Playwright uses)
odiff: 4.1.1 (Zig rewrite with SIMD)
Test Images:
| Image | Dimensions | Total Pixels |
|---|---|---|
| Vizzly dashboard screenshots | 826 × 3,070 | 2.5 million |
| Full HD login screens | 1,920 × 1,080 | 2.1 million |
| Tall scrollable pages | 750 × 24,162 | 18.1 million |
All benchmarks: RGB mode, threshold 0, no anti-aliasing, no output files. Just raw comparison speed.
The Results
| Image | Honeydiff | pixelmatch | odiff 4 |
|---|---|---|---|
| 2.5M pixels (dashboard screenshots) | 20ms | 80ms | 240ms |
| 2.1M pixels (Full HD) | 15ms | 80ms | 240ms |
| 18M pixels (full-page scrollable) | 80ms | 450ms | 710ms |
A few things jump out here. pixelmatch is actually faster than odiff in these runs, but it scales the worst: its time climbs from 80ms to 450ms by the time you hit 18 million pixels. odiff grows more gently, just from a higher baseline. Honeydiff beats both across the board.
That 18-million-pixel result is worth pausing on. That’s the kind of screenshot you get when you capture an entire scrollable page. Honeydiff processes it in 80 milliseconds. Under a tenth of a second. Done before you’ve finished glancing at your terminal.
Why This Gap Exists
pixelmatch is pure JavaScript. It’s simple, readable, and works everywhere Node runs. But it’s single-threaded and doesn’t optimize for large images. Once you’re past a few million pixels, the overhead adds up.
odiff is well-engineered. SIMD optimizations, efficient memory layout, fast PNG decoding. It’s not slow by any reasonable definition. But at the time of this benchmark, it’s also single-threaded.
Honeydiff uses Rayon for work-stealing parallelism. Every row of pixels gets compared in parallel across all available CPU cores. On my M-series Mac, I see 260-309% CPU utilization during comparisons, effectively saturating 2-3 cores with automatic load balancing.
Modern CPUs have multiple cores, but most image diffing tools ignore them. Honeydiff doesn’t.
But parallelism alone doesn’t explain the speedup. The architecture goes deeper:
Lazy computation: SSIM scoring, spatial clustering, and intensity statistics are all opt-in. The basic comparison is as lean as possible.
Early exit paths: Set a maxDiffs threshold and Honeydiff stops the moment it hits that count. Perfect for quick-fail scenarios in CI.
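That early exit is the option to reach for in a CI gate. Here’s a minimal sketch of a quick-fail check; it assumes maxDiffs is passed as an option to compare(), so check the package’s TypeScript definitions for the exact shape before copying it.

// Quick-fail check: bail out as soon as a single differing pixel is found.
// The `maxDiffs` placement is assumed here, based on the description above.
import { compare } from '@vizzly-testing/honeydiff';

let result = await compare('baseline.png', 'current.png', {
  maxDiffs: 1
});

if (result.diffPercentage > 0) {
  console.error('Visual difference detected');
  process.exit(1);
}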
The result is throughput between 127 and 227 million pixels per second, depending on image characteristics.
What This Means for Visual Testing
“Fast diffing is nice, but does it matter?”
Yes. More than you’d think.
Local development becomes instant. When I’m running vizzly tdd and iterating on UI changes, I need feedback now. Not in a second. Not after I’ve context-switched to something else. At 20ms per comparison, visual feedback arrives before my eyes finish moving to the terminal.
CI stops being a bottleneck. A comprehensive visual test suite might run 200+ screenshots. With odiff, that’s 48 seconds of pure diffing time. With Honeydiff, it’s 4 seconds. The difference between “grab a coffee” and “already done.”
Rich analysis becomes viable. SSIM perceptual scoring is computationally expensive. Spatial clustering adds overhead. With a slow diffing engine, you can’t afford these features. With Honeydiff, I can add sophisticated analysis and still be faster than tools that only count pixels.
Accuracy: Matching Where It Matters
Speed means nothing if you’re getting wrong answers. So let’s talk accuracy.
In RGB mode with no anti-aliasing, Honeydiff achieves pixel-perfect parity with odiff:
Image: vizzly-baseline.png vs vizzly-with-diff.png
Mode: RGB, threshold 0, no AA
Honeydiff: 11,726 different pixels (0.46%)
odiff 4: 11,726 different pixels (0.46%)
Result: 100% match
Same algorithms. Same color space math. Same results. If you’re migrating from odiff, your existing baselines work unchanged.
Both tools handle anti-aliasing detection. Here’s a real comparison with AA enabled:
// Honeydiff
import { compare } from '@vizzly-testing/honeydiff';

let result = await compare('baseline.png', 'current.png', {
  antialiasing: true,
  outputDiff: 'honeydiff-diff.png'
});
// Result: 11,914 pixels detected

# odiff
npx odiff baseline.png current.png odiff-diff.png --aa
# Result: 702 pixels detected
[Image: Honeydiff diff output]
[Image: odiff diff output]
odiff’s aggressive AA filtering misses real changes. Honeydiff is more conservative. It flags more pixels, but those pixels represent actual differences. The images above tell the story.
But here’s the thing: AA detection is just the beginning. What actually matters is what you do with the diff data.
Beyond Pixel Counting
Every diffing tool answers the same question: “Are these images different?” That’s the wrong question.
The right question is: “What changed, where, and does it matter?” Honeydiff gives you the data to answer that.
Spatial clustering groups adjacent differences into distinct regions. Instead of “5,000 pixels changed,” you get “3 separate regions changed” with bounding boxes, centers of mass, and pixel counts for each.
{
  "diffClusters": [
    {
      "pixelCount": 4521,
      "boundingBox": { "x": 120, "y": 340, "width": 200, "height": 80 },
      "centerOfMass": [220.5, 380.2],
      "avgIntensity": 127.3
    }
  ]
}
Here’s where it gets interesting. When the same regions change across multiple builds (timestamps, session IDs, live user counts), that’s a pattern. Vizzly tracks these “hot spots” and learns which changes are expected. Next time those regions change, they get auto-approved. No more reviewing the same dynamic content build after build.
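To make that concrete, here is one way a hot-spot check could work. This is an illustrative sketch of the idea, not Vizzly’s implementation: the hotSpots list and the overlap test are mine, while the cluster shape matches the JSON above.

// Illustrative hot-spot check: flag only the clusters that fall outside
// regions already known to change on every build. Not Vizzly's actual logic.
import { compare } from '@vizzly-testing/honeydiff';

let result = await compare('baseline.png', 'current.png', { includeClusters: true });

// Regions recorded from previous builds, e.g. a timestamp in the header.
let hotSpots = [{ x: 110, y: 330, width: 220, height: 100 }];

let overlaps = (a, b) =>
  a.x < b.x + b.width && b.x < a.x + a.width &&
  a.y < b.y + b.height && b.y < a.y + a.height;

let unexpected = result.diffClusters.filter(
  (cluster) => !hotSpots.some((spot) => overlaps(cluster.boundingBox, spot))
);

// Only the clusters outside known hot spots need human review.
console.log(`${unexpected.length} unexpected regions changed`);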
SSIM perceptual scoring is the safety net. Here’s the problem it solves: a timestamp changes, the diff pixel count is small, it’s in a known hot spot. Easy auto-approve, right?
Not so fast. What if that timestamp change also pushed other content down? The pixel diff is still small, but the layout shifted. That’s a real bug hiding behind an expected change.
SSIM catches this. It measures structural similarity: whether the overall layout stayed the same, not just which pixels changed. Low diff + high SSIM = safe to auto-approve. Low diff + low SSIM = something moved, needs human eyes.
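In code, that decision is a couple of comparisons. A rough sketch using the diffPercentage and perceptualScore fields shown later in this post; the thresholds are illustrative, not tuned values.

// Rough auto-approval rule: a small diff alone isn't enough, the layout
// has to be structurally intact too. Thresholds are illustrative.
import { compare } from '@vizzly-testing/honeydiff';

let result = await compare('baseline.png', 'current.png', { includeSSIM: true });

let smallDiff = result.diffPercentage < 0.5;      // few pixels changed
let layoutIntact = result.perceptualScore > 0.98; // structure barely moved

if (smallDiff && layoutIntact) {
  console.log('Expected dynamic content, safe to auto-approve');
} else if (smallDiff && !layoutIntact) {
  console.log('Small diff but the layout shifted, needs human eyes');
}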
This is the difference between automation you hope works and automation you can actually trust. More details in the 40% problem post.
Intensity statistics reveal the nature of changes. Is it one pixel that’s way off, or a subtle shift across thousands? Min, max, mean, median, standard deviation. The data is there when you need it.
None of this is practical when your diffing engine is slow. These features are Honeydiff-only, and yes, they add overhead. Here’s what it actually costs:
| Image Size | Basic | + Clusters | + SSIM | All Features |
|---|---|---|---|---|
| 2.5M pixels (826 × 3,070) | 20ms | 25ms | 250ms | 256ms |
| 2.1M pixels (1,920 × 1,080) | 17ms | 150ms | 206ms | 347ms |
| 18M pixels (750 × 24,162) | 73ms | 2,000ms | 78ms | 2,100ms |
Clustering is the expensive one, especially on images with many diff pixels. SSIM adds consistent overhead. These aren’t fast compared to the base comparison we’ve been talking up. That’s why each feature is opt-in: you choose the tradeoff between speed and depth of analysis.
Using Honeydiff
Honeydiff powers Vizzly, but it’s not locked inside our platform. If you want fast, feature-rich image diffing in your own projects, the npm package is available:
npm install @vizzly-testing/honeydiff
import { compare, quickCompare } from '@vizzly-testing/honeydiff';

// Fast boolean check
let isIdentical = await quickCompare('baseline.png', 'current.png');

// Full comparison with all the rich data
let result = await compare('baseline.png', 'current.png', {
  antialiasing: true,
  includeClusters: true,
  includeSSIM: true,
});

console.log(result.diffPercentage); // 0.33
console.log(result.diffClusters); // spatial regions
console.log(result.perceptualScore); // SSIM score
Pre-built binaries for macOS, Linux, and Windows. Full TypeScript definitions. All the features I’ve talked about in this post. Want to run your own benchmarks? Grab two screenshots and time it against whatever you’re using now.
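If you do time it yourself, something this simple is enough. A quick sketch with Node’s built-in timer, mirroring the benchmark settings above (no anti-aliasing):

// Minimal timing harness for two screenshots of your own.
import { compare } from '@vizzly-testing/honeydiff';

let start = performance.now();
let result = await compare('baseline.png', 'current.png', { antialiasing: false });
let elapsed = performance.now() - start;

console.log(`${result.diffPercentage.toFixed(2)}% different in ${elapsed.toFixed(1)}ms`);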
The source isn’t open, and I want to be upfront about why. I love open source. I’ve built my career on it and contributed to projects I care about. But Honeydiff is Vizzly’s core competitive advantage: the novel algorithms, the architecture decisions, the optimizations. Keeping the source proprietary lets me build a sustainable business while still giving the community a genuinely useful tool.
You get the fastest image diffing available. I get to keep building Vizzly. Fair trade.
What’s Next
I’m just getting started. The foundation is solid, but the architecture I built enables features that would be impossible with slower engines.
Accessibility testing is coming. Honeydiff already has WCAG color contrast analysis built in. It can detect accessibility violations directly from screenshots. Smart gradient filtering reduces false positives by 98%+. Color blindness simulation shows how your UI appears to users with different types of color vision deficiency. Shipping to Vizzly soon.
The whole point of building my own diffing engine was to go places that general-purpose tools can’t. Accessibility testing is just the start.
The Bottom Line
The benchmarks speak for themselves: base comparisons 9-16x faster than odiff and 4-6x faster than pixelmatch, more accurate detection, and rich analysis features that other tools don’t offer. But the real story is what becomes possible when your diffing engine stops being a bottleneck.
Visual testing has always felt like it was fighting against its own infrastructure. Slow comparisons, basic pixel counts, no insight into what actually changed or why. Honeydiff changes that, and Vizzly puts it to work in a workflow designed around how developers actually build software.
Visual testing can finally keep up.
Want to experience Honeydiff? Sign up for Vizzly and try vizzly tdd to feel the difference instant visual feedback makes. Learn more in the docs.