
Building a Failure Pattern Recognition Tool with TensorFlow

May 12th 2024 16 min read

Flaky tests are the bane of every developer's existence. They waste time, erode confidence in the test suite, and can mask real regressions. This post is a hands-on guide to failure pattern recognition: using machine learning and automation to build a tool that flags likely flaky tests. We'll go step by step, from gathering historical test data to integrating the trained model into our testing framework. Let's build a weapon against flakiness and create a more stable, reliable test suite.

Data Collection: Fueling the Model

Building a robust failure prediction tool starts with gathering the right data. This data acts as the fuel that powers our machine-learning model, allowing it to identify patterns associated with flaky test behavior.

This section delves into the specific information needed for the model:

Storing this data in a structured format like a CSV file or a database table is crucial for efficient analysis and model training. By meticulously collecting and organizing this historical test data, we lay the foundation for building an effective failure prediction tool.
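As a minimal sketch of the storage step, the snippet below appends one row per test run to a CSV file using Python's standard csv module. The column names (test_name, passed, execution_time, error_message) and the helper append_test_run are assumptions made for illustration, not a schema prescribed by this post.

```python
import csv
import os

# Hypothetical column names; adapt them to what your test runner reports.
FIELDNAMES = ["test_name", "passed", "execution_time", "error_message"]


def append_test_run(csv_path, record):
    """Append one test-run record to csv_path, writing a header for a new file."""
    is_new_file = not os.path.exists(csv_path) or os.path.getsize(csv_path) == 0
    with open(csv_path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
        if is_new_file:
            writer.writeheader()
        writer.writerow(record)
```

Calling append_test_run once per executed test from a post-run hook would gradually build up the history the model needs.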

Feature Engineering: Extracting Meaningful Signals

The data collected is just raw material. To truly empower our failure prediction model, we need to transform it into a format that the model can understand and analyze effectively. This is where feature engineering comes into play.

This section explores the key steps involved in feature engineering:

Data Pre-Processing:

Handling Missing Values: Missing data points can disrupt the training process. Here, libraries like Pandas come in handy. We can utilize functions like .fillna() to impute missing values with a specific strategy (e.g., mean or median) or even drop rows with excessive missing data.

Here's a minimal Pandas example for illustrative purposes:

import pandas as pd

# Sample data with missing values
data = {'Test': ['A', 'B', None, 'C'], 'Value': [10, None, 5, 12]}
df = pd.DataFrame(data)
# Fill missing values in 'Value' column with the mean
df['Value'] = df['Value'].fillna(df['Value'].mean())

Categorical Data Conversion: Categorical data (e.g., operating system types) needs to be converted into numerical representations suitable for machine learning algorithms. Techniques like one-hot encoding can be employed here. Tools like Pandas' get_dummies function can automate this process.
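A short sketch of one-hot encoding with get_dummies is shown here. The OS column and its values are made up for the example.

```python
import pandas as pd

# Hypothetical sample: the "OS" column and its values are for illustration only.
df = pd.DataFrame({
    "Test": ["A", "B", "C", "D"],
    "OS": ["linux", "windows", "linux", "macos"],
})

# One-hot encode "OS": each distinct value becomes its own 0/1 column,
# named after the original column and the value (e.g. "OS_linux").
encoded = pd.get_dummies(df, columns=["OS"], dtype=int)
print(encoded)
```

Each row ends up with exactly one 1 across the new OS_ columns, so the model never sees an artificial ordering between operating systems.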

Normalization: Features like execution times can have vastly different scales. Techniques like min-max scaling or z-score normalization put features on comparable scales, so that no single feature dominates training simply because of its units. scikit-learn's MinMaxScaler and StandardScaler implement these two techniques, respectively.
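To make the two techniques concrete, here is a small sketch that applies them with plain Pandas arithmetic. The sample values and the Retry_Count column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical execution times (seconds) and retry counts.
df = pd.DataFrame({
    "Avg_Exec_Time": [0.12, 0.85, 0.31, 2.50],
    "Retry_Count": [0, 3, 1, 2],
})

# Min-max scaling: maps each column onto the range [0, 1].
min_max = (df - df.min()) / (df.max() - df.min())

# Z-score normalization: zero mean and unit variance per column.
# ddof=0 selects the population standard deviation, which is also
# what scikit-learn's StandardScaler uses.
z_score = (df - df.mean()) / df.std(ddof=0)

print(min_max)
print(z_score)
```

Note that a constant column would cause a division by zero in both formulas, so such columns are best dropped beforehand. Whatever scaling is applied during training should be applied with the same parameters when predicting on new tests.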

Feature Creation:

Beyond the raw data points, we can extract even more insightful features that directly target potential indicators of flakiness:

Through data pre-processing and feature creation, we transform the raw data into a set of meaningful signals that the machine learning model can leverage to learn and predict flaky test behavior. This crafted data becomes the foundation for building an accurate failure prediction tool.

Test_Name                Pass_Fail  Avg_Exec_Time  Error_Keyword_Presence  Flaky
test_add_numbers         1          0.12           0                       0
test_network_connection  0          0.85           1                       1
test_file_download       1          0.31           0                       0

Small example of the "test_data.csv" content
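Per-test features like the ones in this table could be derived from raw per-run records roughly as follows. This is a sketch under assumptions: the input column names, the Failure_Rate and Flip_Count features, and the single "connection reset" keyword are hypothetical choices, not part of the original data set.

```python
import pandas as pd

# Hypothetical raw history: one row per test execution, in chronological order.
runs = pd.DataFrame({
    "test_name": ["test_a", "test_a", "test_a", "test_b", "test_b", "test_b"],
    "passed": [1, 0, 1, 1, 1, 1],
    "execution_time": [0.10, 0.90, 0.20, 0.30, 0.32, 0.28],
    "error_message": ["", "Connection reset by peer", "", "", "", ""],
})


def count_flips(outcomes):
    """Number of times consecutive runs switched between pass and fail."""
    return int((outcomes != outcomes.shift()).iloc[1:].sum())


features = runs.groupby("test_name").agg(
    Avg_Exec_Time=("execution_time", "mean"),
    Failure_Rate=("passed", lambda s: 1.0 - s.mean()),
    Flip_Count=("passed", count_flips),
    Error_Keyword_Presence=(
        "error_message",
        lambda s: int(s.str.contains("connection reset", case=False).any()),
    ),
).reset_index()

print(features)
```

A test that alternates between passing and failing (a high Flip_Count) without code changes is a classic flakiness signal, which is why it may be worth adding alongside the average execution time.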

Model Selection & Training: Building the Prediction Engine

We've gathered the data and extracted meaningful features. Now it's time to build the core of our tool: the prediction engine! This section delves into selecting and training a machine learning model capable of identifying patterns associated with flaky tests.

Choosing the Right Model:

Many machine learning algorithms can be used for classification tasks like predicting flakiness. In this example, we'll focus on a simple yet effective model: logistic regression, which we'll build in TensorFlow as a single dense layer with a sigmoid activation. Logistic regression models the relationship between features and a binary outcome (flaky or not flaky), making it suitable for our task.

Model Architecture:

Imagine a black box that takes our extracted features as input and outputs a probability of a test being flaky. This black box is our Logistic Regression model.

The model architecture defines the structure of this black box. The input layer has one input per extracted feature (e.g., average execution time, keyword presence). These inputs feed into a single output neuron that computes a weighted sum and applies a sigmoid activation function to produce the probability (between 0 and 1) of a test being flaky.
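To open up the black box a little, the sketch below computes what a single sigmoid neuron does in plain Python. The weights and bias are hypothetical numbers chosen by hand; in the real model they are learned during training.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(features, weights, bias):
    """Weighted sum of the inputs followed by a sigmoid activation."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


# Hypothetical weights for [Avg_Exec_Time, Error_Keyword_Presence].
weights = [1.5, 2.0]
bias = -2.0

print(predict_probability([0.85, 1], weights, bias))  # slow test with an error keyword
print(predict_probability([0.12, 0], weights, bias))  # fast test without one
```

With these illustrative weights, the slow test with an error keyword receives the higher probability, which is the kind of relationship the trained model is expected to pick up from the data.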

The Training Process:

Here's where the magic happens! We take our prepared data and split it into two sets:

During training, we utilize an optimizer like Adam, which helps the model efficiently adjust its weights to minimize the difference between its predictions and the actual flaky/non-flaky labels in the training data.

A loss function, like binary cross-entropy, measures how well the model's predictions align with the actual labels. The optimizer iteratively adjusts the model's weights to minimize this loss, essentially guiding the model towards making better predictions.
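For intuition, binary cross-entropy can be written out in a few lines of plain Python. This is a sketch of the formula only; the eps clipping value is an assumption chosen to avoid log(0), and the optimizer's update rule (Adam) is not shown.

```python
import math


def binary_cross_entropy(labels, probabilities, eps=1e-7):
    """Mean binary cross-entropy; eps keeps log() away from zero."""
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


# Hypothetical labels (1 = flaky) and two sets of predicted probabilities.
labels = [0, 1, 0]
mostly_right = [0.1, 0.9, 0.2]
mostly_wrong = [0.9, 0.1, 0.8]

print(binary_cross_entropy(labels, mostly_right))
print(binary_cross_entropy(labels, mostly_wrong))
```

Confident wrong predictions are penalized far more heavily than confident correct ones are rewarded, which is what pushes the weights toward better-calibrated probabilities.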

A Glimpse of TensorFlow Code:

This code demonstrates how to build and train a machine learning model to predict flaky tests using TensorFlow. Let's break down the code section by section:

import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
data = pd.read_csv("test_data.csv")

This section imports necessary libraries: pandas for data manipulation, TensorFlow for building the model, and train_test_split from scikit-learn for splitting data into training and testing sets.

We then use pandas' read_csv function to load the preprocessed data from a CSV file named "test_data.csv" into a pandas DataFrame named data. This DataFrame is assumed to contain features extracted from historical test data, along with the corresponding labels indicating whether each test is flaky or not.

features = data[["Avg_Exec_Time", "Error_Keyword_Presence"]]
target = data["Flaky"]

Here, we separate the features (e.g., average execution time, keyword presence) and the target variable (flaky/not flaky) from the loaded DataFrame data. This essentially defines what information the model will learn from (features) and what it's trying to predict (target).

train_features, test_features, train_target, test_target = train_test_split(
    features, target, test_size=0.2
)

This section utilizes train_test_split from scikit-learn to split the data into two sets: training and testing. The training set (typically 80% of the data) is used to train the model. The testing set (20% of the data) is unseen by the model during training and is used to evaluate its performance after training is complete. This helps ensure the model generalizes well to unseen data.
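Because flaky tests are usually a small minority, it may help to make the split reproducible and to keep the flaky/non-flaky ratio the same in both sets. The sketch below uses the random_state and stratify parameters of train_test_split for that. The data frame is synthetic and exists only to make the example self-contained.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data set: 20 tests, 4 of them (20%) labelled flaky.
data = pd.DataFrame({
    "Avg_Exec_Time": [0.1 * i for i in range(20)],
    "Error_Keyword_Presence": [1 if i % 5 == 0 else 0 for i in range(20)],
    "Flaky": [1 if i % 5 == 0 else 0 for i in range(20)],
})
features = data[["Avg_Exec_Time", "Error_Keyword_Presence"]]
target = data["Flaky"]

train_features, test_features, train_target, test_target = train_test_split(
    features,
    target,
    test_size=0.2,
    random_state=42,  # fixed seed so the split is reproducible
    stratify=target,  # keep the flaky/non-flaky ratio in both sets
)

print(len(train_features), len(test_features))
print(train_target.mean(), test_target.mean())
```

Without stratification, a small test set can end up with no flaky examples at all, which makes the reported accuracy meaningless for the class we care about.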

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation="sigmoid", input_shape=(features.shape[1],))
])

This code snippet defines the architecture of our Logistic Regression model using TensorFlow's Sequential API. It creates a sequential stack of layers:

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

Here, we configure the training process by compiling the model. This involves specifying:

model.fit(train_features, train_target, epochs=10)

Finally, we train the model by calling the fit method. The epochs argument sets how many times the model passes over the entire training set (10 in this example). During each epoch, the model learns by adjusting its weights based on the errors between its predictions and the actual labels in the training data, guided by the optimizer and the loss function.

We can evaluate the model's performance on the unseen testing set using the evaluate method. This helps assess how well the model generalizes to unseen data:

loss, accuracy = model.evaluate(test_features, test_target)
print(f"Test Loss: {loss}, Test Accuracy: {accuracy}")
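Accuracy alone can look strong when flaky tests are rare, since a model that never predicts "flaky" would still score well. As a complementary check, precision and recall for the flaky class can be computed from the predicted probabilities. The function below is a plain-Python sketch; the labels and probabilities are made-up values, and the 0.5 threshold is an assumption.

```python
def precision_recall(labels, probabilities, threshold=0.5):
    """Precision and recall for the positive (flaky) class at a threshold."""
    predicted = [1 if p > threshold else 0 for p in probabilities]
    tp = sum(1 for y, y_hat in zip(labels, predicted) if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in zip(labels, predicted) if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in zip(labels, predicted) if y == 1 and y_hat == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical true labels (1 = flaky) and model outputs.
labels = [0, 0, 0, 0, 1, 1]
probabilities = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4]

precision, recall = precision_recall(labels, probabilities)
print(f"Precision: {precision:.2f}, Recall: {recall:.2f}")
```

Precision answers "of the tests we flagged, how many were really flaky?", while recall answers "of the truly flaky tests, how many did we catch?".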

Integration & Automation: Putting the Model to Work

We've trained our model to identify patterns associated with flaky tests. Now, it's time to leverage this power in our testing framework! This section explores how to integrate the model for automated analysis of new test results.

Saving the Trained Model:

Once we're satisfied with the model's performance, we should save it for future use. Keras models provide the save method to achieve this:

model.save("flaky_test_predictor.keras")

This code snippet saves the model's architecture and trained weights to a file named "flaky_test_predictor.keras". This allows us to load the pre-trained model later without retraining it from scratch.

Making Predictions on New Tests:

Here's where the magic happens! We'll create a script that runs after our test execution. This script will:

Here's an example script demonstrating these steps:

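import pandas as pd
import tensorflow as tf

# check_for_error_keywords is the helper from the "Helper function" section below;
# define or import it before calling predict_flakiness.
def predict_flakiness(test_data, model_path):
    # Load the previously saved model
    model = tf.keras.models.load_model(model_path)
    # Build a one-row frame with the same columns, in the same order, as in training
    features = pd.DataFrame({
        "Avg_Exec_Time": [test_data["execution_time"]],
        "Error_Keyword_Presence": [check_for_error_keywords(test_data["error_message"])],
    })
    # predict returns an array of shape (1, 1); take the single probability
    prediction = model.predict(features)[0][0]
    return float(prediction)

# Example usage with a sample failed test result
test_data = {"execution_time": 2.5, "error_message": "Network connection timeout"}
flakiness_probability = predict_flakiness(test_data, "flaky_test_predictor.keras")
print(f"Predicted flakiness probability: {flakiness_probability:.2f}")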

This script defines a function predict_flakiness that takes new test data and the path to the saved model as input. It loads the saved model, builds the feature row (replace the placeholder with your specific feature extraction logic based on your testing framework's data), and returns the model's predicted probability. The feature columns must match the ones used during training, both in name and in order.

Flaky Test Classification and Notification:

We can set a prediction threshold (e.g., 0.7) to classify tests as potentially flaky. Here's how we can integrate this logic:

flaky_threshold = 0.7

if flakiness_probability > flaky_threshold:
    print(f"Test flagged as potentially flaky! (probability: {flakiness_probability:.2f})")
    # Trigger notifications or further investigation (e.g., send email, log message)

This code snippet checks if the predicted flakiness probability exceeds the threshold. If it does, the test is flagged as potentially flaky, and we can trigger notifications (email, logging) or further investigation to confirm the flakiness and address the root cause.
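The same threshold check can be applied to a whole test run at once. The sketch below takes a mapping of test names to predicted probabilities, logs a warning for each test above the threshold, and returns the flagged names. The function name flag_flaky_tests, the logger name, and the sample probabilities are assumptions for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("flaky_detector")  # hypothetical logger name


def flag_flaky_tests(probabilities, threshold=0.7):
    """Return names of tests whose probability exceeds threshold, highest first."""
    flagged = sorted(
        ((name, p) for name, p in probabilities.items() if p > threshold),
        key=lambda item: item[1],
        reverse=True,
    )
    for name, p in flagged:
        logger.warning("%s flagged as potentially flaky (probability: %.2f)", name, p)
    return [name for name, _ in flagged]


# Hypothetical predictions for one test run.
results = {
    "test_network_connection": 0.91,
    "test_add_numbers": 0.05,
    "test_file_download": 0.74,
}
flagged_tests = flag_flaky_tests(results)
```

Returning the list of names keeps the notification mechanism separate, so the caller can decide whether to send an email, post to a chat channel, or simply fail the build.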

Helper function:

Here's an example implementation of the check_for_error_keywords function:

def check_for_error_keywords(error_message):
    error_keywords = ["Network timeout", "Connection reset", "Resource unavailable"]
    for keyword in error_keywords:
        if keyword.lower() in error_message.lower():
            return 1
    return 0

The check_for_error_keywords function aims to identify potential flakiness indicators within the provided error_message (string). It accomplishes this by searching for predefined keywords commonly associated with flaky tests. You'll need to customize the provided list of error_keywords to reflect the specific types of flakiness prevalent in your testing environment. The function iterates through this list, performing a case-insensitive search (using lower()) within the error message. If any keyword is found, the function returns 1, indicating a potential match. If the loop completes without finding any keywords, the function returns 0, suggesting no relevant keywords were present in the error message.

This is a basic example. Because it matches exact substrings, a message such as "Network connection timeout" does not match the keyword "Network timeout" and yields 0, so you will need to refine your error_keywords list based on your specific testing environment and the types of flakiness you're encountering. Additionally, you could consider using more advanced techniques like regular expressions for more complex keyword-matching logic.
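A regular-expression variant might look like the sketch below. The patterns and the function name check_for_error_keywords_regex are hypothetical; they are meant to show how one pattern can cover several phrasings of the same error, not to be an exhaustive list.

```python
import re

# Hypothetical patterns; tune them to the error messages your suite produces.
ERROR_PATTERN = re.compile(
    r"network\b.*\btimed?[ -]?out"  # "Network timeout", "Network connection timed out"
    r"|connection\s+reset"
    r"|resource\s+(?:temporarily\s+)?unavailable",
    re.IGNORECASE,
)


def check_for_error_keywords_regex(error_message):
    """Return 1 if the message matches any flakiness pattern, otherwise 0."""
    if not error_message:
        return 0
    return 1 if ERROR_PATTERN.search(error_message) else 0


print(check_for_error_keywords_regex("Network connection timeout"))
print(check_for_error_keywords_regex("AssertionError: expected 3, got 4"))
```

The guard for empty messages also makes the function safe to call for passing tests, which typically have no error message at all.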

Integration with Your Framework:

The specific integration approach will depend on your testing framework. You'll need to create a script or process that runs after test execution, extracts features from the new test results and utilizes the predict_flakiness function to identify potential flaky tests.

By automating this process, you can continuously monitor your test suite for flakiness and proactively address issues, improving the overall stability and reliability of your tests.
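One possible integration path, assuming your framework can emit a JUnit-style XML report (pytest, for example, does so with its --junitxml option), is to parse that report after each run. The sketch below extracts the execution time and error message for every test case into dictionaries shaped like the test_data argument of predict_flakiness. The function name extract_test_records and the sample report are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written report in the common JUnit XML layout.
SAMPLE_REPORT = """<testsuite name="suite" tests="2">
  <testcase classname="tests.test_net" name="test_network_connection" time="0.85">
    <failure message="Network connection timeout">traceback goes here</failure>
  </testcase>
  <testcase classname="tests.test_math" name="test_add_numbers" time="0.12"/>
</testsuite>"""


def extract_test_records(junit_xml):
    """Yield one dict per <testcase> with name, execution_time and error_message."""
    root = ET.fromstring(junit_xml)
    for case in root.iter("testcase"):
        # A failed or errored test case carries a <failure> or <error> child.
        problem = case.find("failure")
        if problem is None:
            problem = case.find("error")
        message = "" if problem is None else (problem.get("message") or "")
        yield {
            "name": case.get("name", ""),
            "execution_time": float(case.get("time", 0.0)),
            "error_message": message,
        }


records = list(extract_test_records(SAMPLE_REPORT))
for record in records:
    print(record)
```

Each record could then be passed to predict_flakiness, with the resulting probabilities compared against the threshold.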

Expanding the Tool: Beyond the Basics

The provided example serves as a solid foundation for building a robust automated flaky test detection system. However, there's always room for exploration and customization to tailor it to our specific needs.

Enhancing Feature Extraction:

The current implementation focuses on basic features like execution time and error keyword presence. Consider incorporating additional features that might be relevant to your testing environment. Here are some examples:

Exploring Different Algorithms:

While Logistic Regression is a good starting point, experimenting with other machine learning algorithms might yield better prediction accuracy depending on the complexity of your data. Some potential options include:

Building a User Interface:

Integrating a user interface (UI) can significantly enhance the usability and value of the tool. This UI could offer features like:

By continuously expanding the feature set, exploring diverse algorithms, and creating a user-friendly interface, we can transform this basic script into a powerful tool that streamlines flaky test detection and improves the overall stability and reliability of our software development process.

Important Consideration: Validating Model Performance

While achieving a high test accuracy can be exhilarating, it's essential to approach these results with a critical mindset. High accuracy doesn't always guarantee a flawless model. Here are some crucial factors to consider when evaluating your model's performance:

1. Beware of Overfitting

If your model performs exceptionally well on the training data but significantly worse on the test data, it may have overfit the training set. Overfitting occurs when the model learns to memorize the training data rather than capturing underlying patterns. Be vigilant for signs of overfitting, such as a large disparity between training and test accuracies.

2. Cross-Validation for Robustness:

Employ cross-validation techniques to assess your model's performance across multiple training/testing splits. This approach provides a more reliable estimate of how well your model generalizes to unseen data and helps mitigate the risk of overfitting to a particular data split.
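The mechanics of k-fold cross-validation can be sketched in plain Python as follows. The train_and_score callback is a hypothetical placeholder: it is assumed to build a fresh model, train it on the given training indices, and return a score for the given test indices. In practice, scikit-learn's KFold or cross_val_score provide the same splitting logic.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


def cross_validate(n_samples, k, train_and_score):
    """Average the score returned by train_and_score over k folds."""
    scores = [
        train_and_score(train_idx, test_idx)
        for train_idx, test_idx in k_fold_indices(n_samples, k)
    ]
    return sum(scores) / len(scores)


# Show the fold sizes for a hypothetical data set of 10 tests and 5 folds.
for train_idx, test_idx in k_fold_indices(10, 5):
    print(len(train_idx), len(test_idx))
```

Every sample lands in the test fold exactly once, so the averaged score is less sensitive to one lucky or unlucky split than a single 80/20 evaluation.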

3. Scrutinize Test Data:

Thoroughly examine your test data for any anomalies or biases that might artificially inflate model performance. Biased or non-representative test data can lead to misleading conclusions about your model's effectiveness.

4. Real-world Evaluation:

Validate your model's performance on real-world data whenever possible. The ultimate test of a model's utility lies in its ability to generalize to new, unseen instances in production environments.

Remember, while achieving high accuracy is undoubtedly a positive sign, it's crucial to complement these results with thorough validation and scrutiny. By taking these precautions, you can ensure that your model's performance reflects genuine predictive capability across a variety of scenarios and use cases.


Flaky tests can wreak havoc on our development workflow, hindering progress and confidence in our test suite's reliability. This blog post explored how to leverage machine learning with TensorFlow to build a system for automated flaky test detection. We started with the fundamentals of data preparation, model building, and integration with our testing framework. We then delved into strategies for expanding the tool's capabilities by enriching features, exploring different algorithms, and even constructing a user interface for enhanced usability.

By implementing these techniques, we can empower our team to proactively identify and address flaky tests, ultimately fostering a more stable and efficient testing environment. Remember, the journey doesn't end here! Continuous refinement and exploration will enable us to tailor this approach to our specific needs and achieve even greater test suite stability.

The code examples are also available on our GitHub page. Feel free to give them a try.