NeuralDiff Team

Building Training Data for Visual Regression Detection

Why quality training data is the hardest part of visual regression detection, how we built a hand-validated seed dataset and a synthetic mutation pipeline designed to scale to 100,000+ labeled before/after pairs, and what we learned about data diversity.

Research · Data · Machine Learning

The Data Problem in Visual Testing

Visual regression detection sounds like a solved problem. Take two screenshots, compare the pixels, flag the differences. But anyone who's shipped a pixel-diff tool knows the truth: the hard part isn't detecting change — it's detecting meaningful change.

A button color flipping from #3b82f6 to #ef4444 is critical. A sub-pixel anti-aliasing difference between Chrome 120 and Chrome 121 is noise. A font loading a millisecond later and causing a flash of unstyled text is a false positive. A 24px layout shift pushing your CTA below the fold is a showstopper.

To build a system that distinguishes between these, you need training data that covers all of them — in proportion, at scale, with ground truth labels.

Why Existing Datasets Fall Short

The visual testing space has a data problem. Most existing datasets are:

  • Too small. Academic datasets for visual change detection typically have 50-200 samples. That's enough for a paper, not for a production system that needs to handle the long tail of real-world CSS.
  • Biased toward obvious changes. Research datasets tend to include dramatic before/after pairs — complete redesigns, swapped images, removed sections. Production regressions are usually subtle: a 2px margin change, a font-weight shift from 500 to 400, a border-radius rounding differently.
  • Missing ground truth labels. Most visual testing datasets tell you "these two images are different" but not why, not what category the change belongs to, and not whether it's an intentional change, a regression, or noise.
  • Not covering false positives. This is the biggest gap. A dataset that only includes real changes can't teach a system to recognize when two visually-different screenshots are actually functionally identical (different timestamps, animation frames, dynamic content).

Our Approach: Synthetic Mutation at Scale

Instead of collecting screenshots from the wild (which introduces labeling challenges and IP concerns), we built a synthetic data pipeline. The idea is simple: start with a known-good page, apply controlled CSS mutations, and capture the before/after pairs with exact ground truth.
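The mutation step can be sketched in a few lines. The `Mutation` record and `apply_mutation` helper below are illustrative names for the idea described above, not NeuralDiff's actual API:

```python
# Minimal sketch of controlled mutation: apply one CSS property change to a
# known-good style map and record exact ground truth alongside the pair.
from dataclasses import dataclass

@dataclass(frozen=True)
class Mutation:
    selector: str  # element the change targets
    prop: str      # CSS property to mutate
    old: str       # baseline value (checked before mutating)
    new: str       # mutated value

def apply_mutation(styles: dict, m: Mutation) -> tuple[dict, dict]:
    """Return (mutated styles, ground-truth label) for one before/after pair."""
    assert styles[m.selector][m.prop] == m.old, "baseline mismatch"
    mutated = {sel: dict(props) for sel, props in styles.items()}
    mutated[m.selector][m.prop] = m.new
    label = {"selector": m.selector, "property": m.prop,
             "before": m.old, "after": m.new}
    return mutated, label

baseline = {".cta": {"color": "#3b82f6", "margin-top": "8px"}}
after, truth = apply_mutation(
    baseline, Mutation(".cta", "color", "#3b82f6", "#ef4444"))
```

Because the mutation is applied rather than observed, the label is exact by construction: there is no human-annotation ambiguity about what changed.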

Dataset Composition — 24 Validated Cases Across 5 Categories

  • Layout Shifts: 7 cases (29%)
  • False Positives: 6 cases (25%)
  • Color Regressions: 5 cases (21%)
  • Typography Issues: 3 cases (12%)
  • Responsive Breaks: 3 cases (12%)

Each case includes a before state, an after state, the exact CSS properties that were mutated, a severity classification (critical/major/minor/info), and a label indicating whether the change is a regression, an intentional change, or benign noise.
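A single case might look like the record below. The field names are an assumed shape matching the description above, not the published schema:

```python
# Hypothetical shape of one dataset case; field names are assumptions.
CASE = {
    "id": "layout-007",
    "category": "layout-shift",
    "mutations": [{"selector": ".hero", "property": "margin-top",
                   "before": "8px", "after": "32px"}],
    "severity": "major",    # critical / major / minor / info
    "label": "regression",  # regression / intentional / noise
}

VALID_SEVERITIES = {"critical", "major", "minor", "info"}
VALID_LABELS = {"regression", "intentional", "noise"}

def validate(case: dict) -> bool:
    """Check that a case carries the full ground truth described above."""
    return (case["severity"] in VALID_SEVERITIES
            and case["label"] in VALID_LABELS
            and all({"selector", "property", "before", "after"} <= m.keys()
                    for m in case["mutations"]))
```

The three-way label (regression / intentional / noise) is what separates this from datasets that only say "these images differ."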

The Five Categories

1. Layout Shifts (7 cases)

Layout regressions are the most impactful and often the hardest to detect with pixel diffing alone. Our dataset includes:

  • Margin and padding changes that push content below the fold
  • Flexbox/grid alignment shifts
  • Position changes (absolute/relative drift)
  • Width/height changes causing reflow
  • Intentional layout changes (redesigns) that should not be flagged

Of the 7 layout cases, 5 are regressions and 2 are intentional changes. This ratio matters — a system that flags everything as a regression will score well on recall but poorly on precision.

2. Color Regressions (5 cases)

Color changes range from subtle (a slightly different shade of gray on a border) to critical (a red error state appearing where a green success state should be). Our cases cover:

  • Brand color palette violations
  • Contrast ratio regressions (accessibility impact)
  • Background color bleeds
  • Intentional theming changes

3. Typography Issues (3 cases)

Font changes are subtle but high-impact. A font-size regression from 16px to 15px is nearly invisible in a pixel diff but affects readability across every page. Cases include:

  • Font-size regressions
  • Line-height changes causing text reflow
  • Font-weight shifts (400 vs 500 is barely visible but semantically wrong)

4. Responsive Breaks (3 cases)

These test how the system handles viewport-dependent regressions:

  • Content overflowing at tablet breakpoint
  • Navigation collapse failures on mobile
  • Image scaling issues across viewports

5. False Positives (6 cases)

This is the most important category. 25% of our dataset consists of cases where the visual output differs but no actual regression occurred:

  • Dynamic content (timestamps, user names, notification counts)
  • Animation/transition states captured at different frames
  • Anti-aliasing differences between browser versions
  • Web font loading timing (FOIT/FOUT)
  • Ad/banner rotation
  • Randomly-ordered content (e.g., testimonials)
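The first item on that list is easy to reproduce synthetically: two captures of the same page that differ only in a rendered timestamp. The sketch below (function names are illustrative) generates such a pair and shows the normalization that makes them compare equal:

```python
# Sketch of one false-positive generator: identical page state, different
# rendered timestamp. A good detector should label this pair "noise".
import re
from datetime import datetime, timezone

TEMPLATE = "Last updated: {ts} · 3 unread notifications"

def render(ts: datetime) -> str:
    return TEMPLATE.format(ts=ts.strftime("%H:%M:%S"))

def mask_dynamic(text: str) -> str:
    """Normalize volatile content before comparing two captures."""
    return re.sub(r"\d{2}:\d{2}:\d{2}", "<TIME>", text)

a = render(datetime(2025, 1, 1, 9, 30, 0, tzinfo=timezone.utc))
b = render(datetime(2025, 1, 1, 9, 30, 5, tzinfo=timezone.utc))
assert a != b                                  # raw captures differ
assert mask_dynamic(a) == mask_dynamic(b)      # functionally identical
```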

Key insight: 25% of our dataset is explicitly designed to be "not a regression."

Most visual testing tools have false positive rates of 10-40%. You can't fix that without training data that includes the exact scenarios that cause false positives. If your dataset is only regressions, your system will learn to call everything a regression.
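The arithmetic behind that point: on a dataset where 25% of cases are non-regressions, a detector that flags everything gets perfect recall but its precision caps at 75%, which F1 penalizes:

```python
# Toy illustration: flag-everything scoring on a 24-case set with
# 18 real regressions and 6 false-positive scenarios.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Flagging all 24 cases: every regression caught, every noise case mislabeled.
p, r, f1 = precision_recall_f1(tp=18, fp=6, fn=0)
# p = 0.75, r = 1.0, f1 ≈ 0.857
```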

Scaling Beyond 24: The Path to 200+ Sites

24 validated cases is our hand-curated seed dataset. Every case has been manually verified, labeled, and documented. But for production confidence, we need more — specifically, we need real-world site diversity.

Our expansion pipeline works in three stages:

Data Expansion Pipeline

  1. Curate the target site list (200+ production sites). E-commerce, SaaS dashboards, content sites, admin panels: diverse layout patterns.
  2. Capture baselines across viewports (375px, 768px, 1920px). Playwright automation with headless Chrome in a consistent rendering environment.
  3. Apply the mutation matrix (174 mutation types × N sites). Controlled CSS property changes with exact ground truth, yielding ~34,800+ test pairs per viewport.

At 174 mutation types across 200 sites and 3 viewports, we're looking at over 100,000 before/after pairs with ground truth labels. That's enough to train and validate algorithms with statistical confidence.
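The back-of-envelope math for that claim:

```python
# Expansion arithmetic: mutation types × sites × viewports.
mutations, sites, viewports = 174, 200, 3

pairs_per_viewport = mutations * sites        # 34,800
total_pairs = pairs_per_viewport * viewports  # 104,400
```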

What Makes Good Training Data for Visual Testing

After building this pipeline, here are the principles we've converged on:

Proportional representation

False positives should be at least 20% of the dataset. Real-world visual noise is the primary challenge, not detection.

Exact ground truth

Every case needs the specific CSS properties changed, the old and new values, and a human-verified severity classification.

Layout diversity

E-commerce, SaaS dashboards, content sites, and admin panels all have different visual patterns. Test data must cover them all.

Severity calibration

A 1px border change and a missing CTA button are both "regressions," but they're not equally important. Data must reflect severity.
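One way to make severity count in evaluation (the weights here are an assumption for illustration, not NeuralDiff's metric) is to score recall with severity weights, so a missed critical regression costs more than a missed 1px border tweak:

```python
# Severity-weighted recall sketch; weights are illustrative assumptions.
WEIGHTS = {"critical": 8, "major": 4, "minor": 2, "info": 1}

def weighted_recall(cases) -> float:
    """cases: iterable of (severity, detected: bool) pairs."""
    total = sum(WEIGHTS[sev] for sev, _ in cases)
    caught = sum(WEIGHTS[sev] for sev, hit in cases if hit)
    return caught / total

# Catching the critical and major cases but missing a minor one
# still scores well; missing the critical one would not.
score = weighted_recall([("critical", True), ("minor", False), ("major", True)])
```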

Results So Far

Our 24-case seed dataset has already produced strong results:

Metric                 Pixel Diff   NeuralDiff
F1 Score               0.776        0.947
Recall                 92%          100%
False Positive Rate    18-40%       0%
Mutation Coverage      N/A          174/174 (100%)

The 0% false positive rate and 100% recall on the seed dataset are encouraging, but we expect both numbers to shift as we expand to real-world sites. That's the point — the expanded dataset will pressure-test the algorithms against the messy reality of production CSS.

Open Questions

Things we're still working through:

  • How to weight categories. Should avoiding a false positive count as much as detecting a true positive? Our current F1 score treats them equally, but in practice false positives erode trust faster than missed detections.
  • Temporal patterns. Some regressions only appear after multiple deployments — a gradual drift rather than a sudden break. Our current dataset is snapshot-based; we need time-series data.
  • Cross-browser coverage. We generate all test data in headless Chrome. Safari and Firefox render CSS differently, and those rendering differences are a major source of false positives in production.

What's Next

We're expanding to 200+ sites in Q2 2025 as part of our validation roadmap. The expanded dataset will be used to:

  1. Validate our F1 score holds above 0.90 on diverse real-world layouts
  2. Identify edge cases where perceptual hashing breaks down
  3. Train the cloud escalation model for ambiguous cases
  4. Publish benchmark results for reproducibility

The seed dataset and mutation pipeline are part of the NeuralDiff open-source project. We'll be publishing the expanded dataset (without proprietary site content) once validation is complete.

NeuralDiff is an open-source visual regression detection system built for AI agents.

Follow our progress on GitHub.