In test automation, we love mocking. It's fast, predictable, and clean. But relying only on mocked data can leave dangerous gaps—especially when our app behaves differently in the real world.
Here's why mixing in real data tests (even occasionally) is not just a good idea—it's essential.
Mocked APIs, fixtures, and hardcoded responses make tests reliable and fast. They eliminate third-party dependencies and ensure deterministic outcomes.
But they also create a false sense of security.
Imagine this: your tests pass, but the app is broken. Mocks only prove that your code handles the inputs you told it to expect. Testing with real, unpredictable data uncovers what mocks never simulate: messy encodings, variable payload sizes, real latency, and schema drift.
In short: it's what our users experience.
The Unicode Nightmare: A team mocked user names as simple strings like "John Smith." When they deployed, Korean names with special characters broke the entire user profile system. The display truncated mid-character, causing database errors.
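A minimal sketch of that failure mode, with an illustrative name: JavaScript's `slice()` counts UTF-16 code units, so naive truncation can stop in the middle of a surrogate pair and emit a broken character.

```javascript
// Hypothetical user name mixing Hangul and a multi-code-point emoji.
const name = "김철수 👨‍👩‍👧";

// Naive truncation: slice() counts UTF-16 code units, so it can stop
// in the middle of a surrogate pair, leaving a lone half-character.
const naive = name.slice(0, 5);

// Closer to correct: iterate by code points before slicing.
// (Still not grapheme-aware -- the family emoji spans several code points.)
const byCodePoint = [...name].slice(0, 5).join("");
```

A mock like "John Smith" never exercises this path; a single real Korean profile name does.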
The Pagination Trap: Mocked APIs returned exactly 20 items per page. In production, the API sometimes returned 19, 20, or 21 items due to real-time data changes. The "Load More" button disappeared randomly, confusing users.
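The trap in miniature: inferring "more pages exist" from a page being exactly full works against a mock that always returns 20 items, and fails when a real page legitimately contains 19. The `nextCursor` field below is a hypothetical example of an API's own pagination signal.

```javascript
// Fragile: assumes a page is full if and only if more pages exist.
function hasMoreFragile(items, pageSize) {
  return items.length === pageSize; // breaks when a real page has 19 items
}

// More robust: trust the API's own pagination metadata instead of the
// item count. `nextCursor` is an illustrative field name.
function hasMoreRobust(response) {
  return Boolean(response.nextCursor);
}
```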
The Performance Cliff: Mock responses returned in 50ms. Real API calls took 3-5 seconds during peak hours. Users saw loading spinners that never ended, leading to a 40% bounce rate spike.
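One defensive pattern real-data tests tend to force on you: bound every slow call with a timeout so the UI can fail over instead of spinning forever. A dependency-free sketch using `Promise.race`:

```javascript
// Reject if `promise` hasn't settled within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error("timeout")), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

You might wrap a dashboard fetch as `withTimeout(fetch("/api/dashboard"), 5000)` and render a fallback on rejection; a 50 ms mock will never tell you whether that path works.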
A balanced strategy uses both: run fast, mocked tests on every commit for quick feedback, and a smaller suite of real-data tests on a schedule or before releases. You can also tag your tests:
```javascript
test('User sees dashboard [mock]', async () => {
  /* fast mock test */
});

test('Dashboard loads real data [real]', async () => {
  // Uses actual API or database
});
```
Then choose when to run which set (note the extra `--` so npm forwards the flag to your test runner rather than swallowing it):

```shell
npm test -- --grep="real"
npm test -- --grep="mock"   # for CI/CD pipelines
```
Sanitized Production Data: Copy production data but scrub PII. Tools like faker.js can replace sensitive fields while maintaining data structure and relationships.
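A dependency-free sketch of the scrubbing idea (faker.js would generate more realistic replacement values). Field names here are illustrative; the point is that ids and foreign keys survive, so joins and relationships still work.

```javascript
// Replace PII deterministically while preserving record structure.
function sanitizeUser(user) {
  return {
    ...user,                              // keep id, timestamps, foreign keys
    name: `User ${user.id}`,              // deterministic, safe to re-run
    email: `user${user.id}@example.test`, // reserved test domain
    phone: null,                          // drop fields tests don't need
  };
}
```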
Synthetic Realistic Data: Generate data that mirrors production patterns—same field lengths, character distributions, and edge cases—without actual user information.
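For instance, a small pool of synthetic names can mirror the messiness of production without containing anyone's real data. The entries below are illustrative edge cases, not a complete catalogue:

```javascript
const EDGE_CASE_NAMES = [
  "John Smith",        // the happy path mocks already cover
  "김철수",             // non-Latin script
  "O'Brien",           // apostrophe (breaks naive SQL/HTML escaping)
  "Anne-Marie König",  // hyphen plus diacritics
  "a".repeat(255),     // at a typical column-length limit
  "中村 さくら 👩‍💻",    // CJK plus a multi-code-point emoji
];

// Cycle through the pool so every case appears in a large dataset.
function sampleName(i) {
  return EDGE_CASE_NAMES[i % EDGE_CASE_NAMES.length];
}
```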
Staging Environment Data: Use a dedicated staging database that mirrors production complexity but contains only test data.
Database Snapshots: Tools like pg_dump for PostgreSQL or mysqldump let you snapshot production data; pair them with a scrubbing step so the copies are sanitized before they reach a test environment.
API Mocking with Real Schemas: Tools like Prism or WireMock can validate real API schemas while still providing predictable responses.
Data Generation Libraries: Faker.js, Factory Bot, or Hypothesis can generate realistic test data that matches your production patterns.
A few cautions, especially around privacy:
GDPR/CCPA Compliance: If using production data, ensure you have proper data processing agreements and user consent. Consider anonymization techniques that preserve data utility while protecting privacy.
Data Retention Policies: Real data tests should follow the same retention policies as your production data. Don't accidentally create compliance violations by storing test data longer than allowed.
Access Controls: Limit who can run real-data tests and ensure proper audit trails. Use service accounts with minimal necessary permissions.
Real data testing isn't always the answer:
Early Development: When APIs are still changing rapidly, mocks are more flexible and won't break when endpoints change.
Flaky Infrastructure: If your staging environment is unreliable, real data tests will create false negatives that erode team confidence.
Regulated Industries: Financial services, healthcare, and other regulated sectors may have restrictions that make real data testing impractical.
Performance Testing: For load testing, you often need predictable data volumes and patterns that real data can't provide.
Start Small: Introduce one real-data test for your most critical user journey. Show the team what it catches that mocks don't.
Measure Impact: Track how many production bugs real-data tests catch vs. mock tests. Use this data to justify the additional complexity.
Address Concerns: Common pushback includes "it's too slow" or "it's too flaky." Show how you can run real-data tests separately and make them more reliable with proper setup.
Idempotent Operations: Ensure your real-data tests can run multiple times without affecting each other. Clean up after yourself or use read-only operations.
Parallel Execution: Design tests so they don't interfere with each other. Use unique test data or separate test environments.
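Unique identifiers are the simplest way to get both properties: parallel runs can't collide, and reruns are idempotent because each run creates and cleans up its own records. A sketch (prefix and id format are assumptions):

```javascript
// Generate a collision-resistant id for records created by a test run.
function uniqueTestId(prefix) {
  const rand = Math.random().toString(36).slice(2, 8);
  return `${prefix}-${Date.now()}-${rand}`;
}
```

For example, create an order under `uniqueTestId("e2e-order")`, assert on it, then delete it in a `finally` block so a failed assertion still cleans up.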
Monitoring and Alerts: Set up monitoring for your real-data tests. If they start failing due to infrastructure issues, you need to know immediately.
Even one real-data test can expose what 100 mocks can't. The key is knowing when and how to use each approach effectively.