
How to Validate Data in Generated Reports

Jun 1st 2025 14 min read
medium
python3.13.0
reporting

Verifying downloaded reports is a critical yet often overlooked part of end-to-end test automation—especially when those reports are dynamically generated and consumed by users or stakeholders. Whether it's a CSV or Excel file, ensuring that the downloaded data is structurally correct and matches expected content can help catch regressions that UI tests alone might miss.

In this blog post, we'll walk through a practical example of automating file verification using Python. We'll generate sample report data, validate both file structure and content, and demonstrate how to write robust tests to make sure our downloads don't silently fail.

Understanding the Challenge

When it comes to exporting data from web applications—whether it's in CSV, Excel, or PDF format—several issues can arise that make reliable automation testing both necessary and challenging. One of the most common problems is inconsistent data. This might happen when the exported file includes outdated values, incomplete rows, or mismatched columns due to backend logic bugs or race conditions during export.

Another frequent issue is formatting errors. For example, numeric values may appear as strings, date formats might not align with localization settings, or special characters could break the structure of a CSV file if not properly escaped. These problems often go unnoticed during manual testing but can severely impact downstream users who rely on automated report parsing.

Lastly, there's the risk of corrupted or incomplete files, especially when large datasets are involved or when export operations time out. Sometimes the export button triggers a download, but the resulting file is truncated or fails to open, leading to a silent failure that may not surface until much later.

Because of these potential pitfalls, robust QA testing for exported tables is essential. Automated checks that validate the structure, content, and integrity of downloaded files can catch subtle bugs that aren't visible through the UI. This ensures not only that the export functionality works but also that the exported data is reliable and accurate—something your end users and business intelligence tools depend on.

Strategy Overview

To ensure that data exports from our application are both functional and trustworthy, a structured testing strategy is essential. We recommend a three-step approach that covers the full spectrum of potential issues: File Integrity Verification, Structure Validation, and Content Verification. Each step plays a critical role in catching different classes of bugs and ensuring that the exported files are usable and accurate.

1. File Integrity Verification

The first check is simple but crucial—did the file actually download, and is it readable? This step verifies that the export mechanism successfully triggered a download and produced a file that isn't empty, corrupted, or unreadable. Whether it's a .csv, .xlsx, or .pdf, the file must open properly and contain a minimum expected size or content length to be considered valid. Catching failures here helps identify backend issues, timeouts, or frontend misfires that silently prevent exports from completing.
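
Concretely, this check can be as small as confirming that the file exists, has a non-trivial size, and opens with the appropriate parser. Here's a minimal sketch of that idea; the size threshold is an arbitrary placeholder:

import os

import pandas as pd

def verify_file_integrity(file_path, min_size_bytes=10):
    # the file must exist and be larger than a trivial threshold
    if not os.path.exists(file_path) or os.path.getsize(file_path) < min_size_bytes:
        return False

    # it must also open cleanly with the appropriate parser
    try:
        if file_path.endswith(".csv"):
            pd.read_csv(file_path, nrows=1)
        elif file_path.endswith(".xlsx"):
            pd.read_excel(file_path, nrows=1)
        else:
            return False
    except Exception:
        return False

    return True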

2. Structure Validation

Once a file is confirmed to be intact, the next step is to validate its structure. This includes checking that headers are present and correctly ordered, the number of columns per row is consistent, and special characters (like commas in CSV files) are properly escaped. This layer catches formatting bugs that might not break the file but can still cause parsing errors for users or scripts downstream. Automated structure checks help ensure that exported data can be reliably imported or analyzed elsewhere.
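
For CSV files in particular, a quick pass with Python's standard csv module can catch ragged rows (too many or too few fields) before any higher-level parsing. We'll build fuller structure checks with pandas later on; a sketch of this low-level pass might look like:

import csv

def check_row_lengths(file_path):
    with open(file_path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        for line_number, row in enumerate(reader, start=2):
            if len(row) != len(header):
                raise ValueError(
                    f"Row {line_number} has {len(row)} fields, expected {len(header)}"
                )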

3. Content Verification

Finally, content verification compares the actual data inside the file against expected values. This might involve querying the backend API or database to retrieve the expected dataset and comparing it to the exported file. This step ensures that the data is complete, current, and accurate—especially important when filters or dynamic queries affect the export. Without this check, it's easy to miss issues like off-by-one errors, outdated entries, or missing rows.
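
When the expected dataset can be pulled from an API or database into a DataFrame of its own, the comparison itself stays small. Here's a minimal sketch, assuming the export carries a stable key column such as ID:

from pandas.testing import assert_frame_equal

def compare_with_expected(exported_df, expected_df, key="ID"):
    # sort by the key and reset the index so row order cannot cause false failures
    exported = exported_df.sort_values(key).reset_index(drop=True)
    expected = expected_df.sort_values(key).reset_index(drop=True)
    assert_frame_equal(exported, expected, check_dtype=False)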

Setting Up Sample Data

Before diving into file verification, we need consistent and controlled test data to work with. This is handled by a Python script named setup_and_run.py. Its primary responsibility is to generate both CSV and Excel files containing simulated export data that mirrors what a production system might output—like user lists, inventory reports, or financial summaries.

To start, the script uses Python's built-in random module along with pandas to create structured tabular data. To guarantee consistency between test runs, a random seed is set before data generation. This ensures that any “random” data is actually reproducible each time the tests are executed.

Here's the part of the script where the seed is set and data is generated:

import random
import pandas as pd
from datetime import datetime, timedelta

random.seed(42)

rows = 10
data = {
    "ID": [i for i in range(1, rows + 1)],
    "Name": [f"User_{random.randint(1000, 9999)}" for _ in range(rows)],
    "Score": [round(random.uniform(0, 100), 2) for _ in range(rows)],
    "Date": [
        (datetime.today() - timedelta(days=random.randint(0, 30))).strftime("%Y-%m-%d")
        for _ in range(rows)
    ],
}

df = pd.DataFrame(data)

This snippet creates a table with four columns: ID, Name, Score, and Date. The data includes randomly generated names, floating-point scores, and recent dates, but because the random seed is fixed, the output is stable and repeatable.

After creating the data, the script writes the results to both .csv and .xlsx formats. Here's how that's done:

df.to_csv("exported_data.csv", index=False)
df.to_excel("exported_data.xlsx", index=False)  # needs an Excel engine such as openpyxl installed

By exporting to both file types, we can later validate different export pipelines and ensure our test coverage spans multiple file formats. This is especially helpful for applications that offer multiple export options to users.
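
As a quick sanity check, you can also confirm that the two exports agree with each other, for example:

import pandas as pd

# both export pipelines should describe the same table
csv_df = pd.read_csv("exported_data.csv")
xlsx_df = pd.read_excel("exported_data.xlsx")

assert list(csv_df.columns) == list(xlsx_df.columns), "Column mismatch between formats"
assert len(csv_df) == len(xlsx_df), "Row count mismatch between formats"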

To ensure test consistency, the script can also delete any existing files and regenerate fresh ones before each test run. This avoids false positives from comparing against stale or manually edited files.

import os

for file in ["exported_data.csv", "exported_data.xlsx"]:
    if os.path.exists(file):
        os.remove(file)
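
In practice it's convenient to roll the cleanup and generation steps into a single helper that tests can call on demand; the test suite later in this post relies on a helper named create_sample_data(). Here's a minimal sketch of one, limited to the two files generated above:

import os
import random
from datetime import datetime, timedelta

import pandas as pd

def create_sample_data(rows=10):
    """Delete stale exports, then regenerate them from a fixed seed."""
    for file in ["exported_data.csv", "exported_data.xlsx"]:
        if os.path.exists(file):
            os.remove(file)

    random.seed(42)
    df = pd.DataFrame({
        "ID": list(range(1, rows + 1)),
        "Name": [f"User_{random.randint(1000, 9999)}" for _ in range(rows)],
        "Score": [round(random.uniform(0, 100), 2) for _ in range(rows)],
        "Date": [
            (datetime.today() - timedelta(days=random.randint(0, 30))).strftime("%Y-%m-%d")
            for _ in range(rows)
        ],
    })
    df.to_csv("exported_data.csv", index=False)
    df.to_excel("exported_data.xlsx", index=False)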

This setup step ensures that every test starts from a clean, known state—an essential practice for reliable and reproducible QA automation. With consistent files in place, we can confidently move on to testing their integrity, structure, and content.

Building the Table Verifier

With the sample data files ready, the next step is to build the verification engine that tests those files. The table_verifier.py script is responsible for loading the exported files and systematically checking their integrity, structure, and content. This script performs the core of the QA work by ensuring the files are not corrupted, follow the expected schema, and contain valid, meaningful data.

The script begins by loading either a CSV or Excel file using pandas. It auto-detects the file type based on the extension and reads it accordingly. Here's how that's implemented:

import pandas as pd
import os

def load_file(file_path):
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"File not found: {file_path}")

    if file_path.endswith(".csv"):
        df = pd.read_csv(file_path)
    elif file_path.endswith(".xlsx"):
        df = pd.read_excel(file_path)
    else:
        raise ValueError("Unsupported file type")

    return df

Once the file is successfully loaded, the script validates the table structure. This means checking whether all the expected columns are present and whether their data types match expectations. For example, we expect "ID" to be an integer, "Name" to be a string, and so on:

EXPECTED_COLUMNS = {
    "ID": int,
    "Name": str,
    "Score": float,
    "Date": str,
}

def validate_structure(df):
    for column, expected_type in EXPECTED_COLUMNS.items():
        if column not in df.columns:
            raise ValueError(f"Missing expected column: {column}")

        # pandas often returns numpy scalar types (e.g. numpy.int64), which are
        # not subclasses of the built-in types, so unwrap them before checking
        non_null_value = df[column].dropna().iloc[0]
        if hasattr(non_null_value, "item"):
            non_null_value = non_null_value.item()
        if not isinstance(non_null_value, expected_type):
            raise TypeError(
                f"Column {column} has wrong type: "
                f"expected {expected_type}, got {type(non_null_value)}"
            )

This structure validation acts as a schema gatekeeper. If any required column is missing or has unexpected data types, the test fails immediately—preventing bad data from slipping through.

Next comes the content validation step. This step inspects the actual values inside the table to ensure they meet business rules. The script checks for nulls in required columns, ensures uniqueness for primary keys, and applies custom logic like regular expression checks or value ranges:

import re

def validate_content(df):
    if df["ID"].isnull().any():
        raise ValueError("Null values found in 'ID' column")

    if df["Name"].isnull().any():
        raise ValueError("Null values found in 'Name' column")

    if df["ID"].duplicated().any():
        raise ValueError("Duplicate IDs found")

    if not df["Score"].between(0, 100).all():
        raise ValueError("Score values must be between 0 and 100")

    # produce real booleans rather than Match/None objects so .all() is
    # unambiguous; re.fullmatch enforces the exact User_#### pattern
    if not df["Name"].apply(lambda name: re.fullmatch(r"User_\d{4}", str(name)) is not None).all():
        raise ValueError("Invalid name format found")

Each of these rules helps catch specific issues. For instance, the duplicated() check ensures no user ID is repeated. The between() check ensures scores are within a logical range. And the regex check enforces naming conventions that could be critical for integration with other systems.

Finally, a single verify_table() function ties everything together. It loads the file, then calls the structure and content validators:

def verify_table(file_path):
    df = load_file(file_path)
    validate_structure(df)
    validate_content(df)
    print(f"{file_path} passed all checks.")
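
Running the verifier against the files generated earlier is then a single call per file, for example:

if __name__ == "__main__":
    verify_table("exported_data.csv")
    verify_table("exported_data.xlsx")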

Writing Automated Tests

To bring the data export validation workflow full circle, we integrate automated tests using the test_table_verification.py script. This test file connects the sample data creation logic from setup_and_run.py and the verification logic from table_verifier.py into a reliable test suite. It ensures that exported files—both CSV and Excel—consistently meet all quality criteria.

The tests are encapsulated in a class called TestTableVerification, which allows for structured setup and test execution. Before each test runs, fresh data is generated to ensure consistency and isolation. This is handled in the setup_method() hook, which pytest calls before every test method:

from setup_and_run import create_sample_data
from table_verifier import TableVerifier


class TestTableVerification:

    def setup_method(self):
        # regenerate fresh sample files before every test
        create_sample_data()

The test for CSV verification instantiates the TableVerifier class with the CSV file path, then calls its methods to perform multiple layers of validation. Note that these tests target the fuller, class-based version of the project linked at the end of this post: TableVerifier wraps the load, structure, and content checks shown earlier as methods, and create_sample_data() produces richer report files, such as a user-analytics export, under a test_downloads/ directory. The test begins by checking the file's integrity, ensuring it's readable and non-empty. Next, it verifies that the structure matches the expected headers:

def test_csv_verification(self):
    verifier = TableVerifier("test_downloads/user_analytics.csv")

    assert verifier.verify_file_integrity()

    expected_headers = ['user_id', 'username', 'email', 'registration_date',
                        'last_login', 'total_sessions', 'subscription_type']
    structure = verifier.verify_structure(expected_headers)
    assert structure['headers_match']

    print(f"CSV test passed - {structure['actual_row_count']} rows verified")

The test will fail if the headers are misaligned or any expected columns are missing. A success message confirms how many rows were verified, making it easier to debug and validate.

For Excel file testing, the logic follows a similar structure. The test again verifies file integrity and loads the file's contents. Then, it checks a specific content rule: every entry in the subscription_type column must be 'premium'. This kind of rule is common in filtered or segmented data exports:

def test_excel_verification(self):
    verifier = TableVerifier("test_downloads/premium_users.xlsx")

    assert verifier.verify_file_integrity()
    verifier.load_file()

    assert all(verifier.data['subscription_type'] == 'premium')
    print(f"Excel test passed - all {len(verifier.data)} users are premium")

By separating structure checks from content checks, these tests ensure that each exported file is not only technically correct but also semantically accurate.

Observations and Best Practices

Running the tests reveals much more than basic pass/fail outcomes—it helps spot real data quality issues. The CSV test catches missing or reordered columns, while the Excel test confirms that filters and logic (like showing only premium users) are working as expected.

These tests are easy to customize. CSV exports benefit from extra checks like consistent delimiters or line endings, while Excel files might need validations for sheet names or formulas. Adapting tests to the file type ensures relevance and reliability.
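
For example, a delimiter check for CSV files and a sheet-name check for Excel files could look roughly like this (the expected values are illustrative):

import csv

import pandas as pd

def check_csv_delimiter(file_path, expected=","):
    with open(file_path, newline="", encoding="utf-8") as f:
        dialect = csv.Sniffer().sniff(f.read(2048))
    assert dialect.delimiter == expected, f"Unexpected delimiter: {dialect.delimiter!r}"

def check_excel_sheets(file_path, expected_sheets):
    actual = pd.ExcelFile(file_path).sheet_names
    assert actual == expected_sheets, f"Unexpected sheet names: {actual}"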

To keep your suite robust, abstract common validations (e.g., null checks, regex, uniqueness) and consider loading rules from config files. This makes it easy to scale and maintain tests across different data exports. Integrating them into CI ensures quick feedback and continuous data reliability.
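
One way to do that is to describe per-column rules in a small JSON file and apply them generically; the schema below is invented purely for illustration:

import json

def validate_with_rules(df, rules_path):
    # rules.json might look like:
    # {"required": ["ID", "Name"], "unique": ["ID"], "ranges": {"Score": [0, 100]}}
    with open(rules_path) as f:
        rules = json.load(f)

    for column in rules.get("required", []):
        assert not df[column].isnull().any(), f"Nulls found in required column {column}"

    for column in rules.get("unique", []):
        assert not df[column].duplicated().any(), f"Duplicate values found in {column}"

    for column, (low, high) in rules.get("ranges", {}).items():
        assert df[column].between(low, high).all(), f"{column} values outside [{low}, {high}]"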

Conclusion and Future Enhancements

Automating table data verification using Python scripts and tests, as shown in this post, helps catch structural issues, formatting problems, and incorrect data before it reaches end users. This approach turns what's often a manual and error-prone process into a repeatable and reliable check that can be run as part of any test suite.

Looking ahead, this system can be extended to support more file formats like JSON or XML, include advanced validations like cross-field consistency or reference lookups, and integrate directly into CI/CD pipelines to catch issues earlier in development. You could also adapt it for localized exports or multilingual datasets with minor configuration changes.

A complete and more advanced version of this project, with modular code and broader validation rules, is available on our GitHub page—feel free to explore it and adapt it to fit your needs. Thanks for reading, and happy testing!