Testing ML Systems - more than just accuracy
In this article we won’t be talking about evaluating the accuracy or loss of a model; testing a model involves much more than just evaluating its performance metrics.
How is testing an ML model different from evaluating it?
While model testing and model evaluation may seem similar, they serve distinct purposes in assessing the effectiveness of an ML model. Model evaluation primarily focuses on performance metrics such as accuracy, precision and loss. These metrics are calculated on a validation dataset and provide an overall assessment of the model’s predictive capabilities. However, while evaluation metrics are necessary, they alone do not provide a comprehensive understanding of the model’s behavior.
Model testing goes beyond evaluation metrics and aims to validate the specific behaviors and functionalities of the model. It involves examining the model’s responses to different inputs, edge cases, and scenarios. By subjecting the model to various tests, we can uncover potential flaws, biases, or unintended consequences that might not be apparent through evaluation metrics alone.
What problems can be identified by testing a machine learning model?
By performing pre-train and post-train tests, we can identify and address a variety of problems -
- Bias: Testing can uncover biases in AI models that may result in unfair or discriminatory outcomes for certain groups or individuals. By examining the model’s responses to diverse inputs and evaluating its fairness across different demographic groups, bias can be detected and mitigated.
- Adversarial attacks: AI models are vulnerable to adversarial attacks, where malicious actors intentionally manipulate inputs to deceive or mislead the model’s predictions. Testing can involve subjecting the model to adversarial examples and assessing its robustness against such attacks, helping to improve its security.
- Data leakage: Data leakage occurs when unintended information from the training data is inadvertently present in the evaluation or production phases, leading to inflated performance metrics. Testing can detect and prevent data leakage by carefully partitioning and evaluating datasets to ensure the model is not exploiting leaked information.
- Robustness to different scenarios: AI models should be tested for their performance and reliability in various scenarios, including edge cases and challenging inputs that may deviate from the training data distribution. Testing can reveal how the model responds in real-world conditions and identify areas where it may struggle or exhibit unexpected behavior.
There are several other problems, such as poor generalization and indeterminate outcomes (a change in the model's behaviour after re-training with new data), that can also be caught by writing test cases.
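For instance, a minimal robustness check, sketched here under the assumption of a hypothetical scikit-learn-style model with a predict method and a purely numeric validation matrix x_val, could verify that small random perturbations of the inputs do not change the predicted classes:

import numpy as np

def test_robustness_to_noise(model, x_val):
    # Add small Gaussian noise to every (numeric) feature and check that
    # the predicted classes stay the same. The noise scale is arbitrary
    # and should be tuned to the feature ranges of the real dataset.
    rng = np.random.default_rng(seed=42)
    original_preds = model.predict(x_val)
    noisy_preds = model.predict(x_val + rng.normal(0, 0.01, size=x_val.shape))
    assert np.array_equal(original_preds, noisy_preds)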
Metamorphic testing -
Metamorphic testing is an additional approach that can be used in conjunction with pre-training and post-training tests. It is a technique where you define a set of transformation rules or metamorphic relations that the system should satisfy. These relations describe how the outputs of the system should change in response to certain transformations applied to the inputs. By applying these transformations and comparing the expected outputs with the actual outputs, you can detect potential issues or inconsistencies in the system. Metamorphic testing can help identify bugs or unintended behavior in the ML system by examining the relationships between inputs and outputs.
The most commonly used metamorphic relations are -
- Invariance: the output should remain unchanged after the perturbation.
- Increasing: the output should increase after the perturbation.
- Decreasing: the output should decrease after the perturbation.
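A minimal sketch of how these relations could be encoded as a reusable check, assuming a hypothetical model with a scikit-learn-style predict method (the transform and relation names are illustrative):

import numpy as np

def check_metamorphic_relation(model, x, transform, relation):
    # Apply `transform` to the inputs and compare the model's outputs
    # against the expected metamorphic relation.
    y_original = model.predict(x)
    y_transformed = model.predict(transform(x))
    if relation == "invariance":
        return np.array_equal(y_original, y_transformed)
    if relation == "increasing":
        return bool(np.all(y_transformed >= y_original))
    if relation == "decreasing":
        return bool(np.all(y_transformed <= y_original))
    raise ValueError("Unknown relation: " + relation)

For a loan-approval model, for example, a transform that raises every applicant's credit score would be expected to satisfy the increasing relation.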
Writing Test Cases
Two different classes of tests for Machine Learning systems:
- Pre-train tests
- Post-train tests
Pre-train tests
- Shape of model predicted output: Check if the shape of the model’s predicted output matches the expected shape based on the input data.
- Data leakage: Ensure that there is no duplication of data between the training and testing datasets.
- Temporal data leakage: This involves confirming that the model is not trained on future data points and then tested on past data points, maintaining the chronological integrity of the data.
- Output range check: For cases where the model predicts values within a specific range (e.g., probabilities), validate that the final predictions fall within the expected range of values and do not exceed the predefined boundaries.
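The shape and output-range checks above could look roughly like this (a sketch, assuming a hypothetical model with scikit-learn-style predict/predict_proba methods and a feature matrix xtest):

import numpy as np

def test_output_shape(model, xtest):
    # The model should return exactly one prediction per input row.
    preds = model.predict(xtest)
    assert preds.shape[0] == xtest.shape[0]

def test_output_range(model, xtest):
    # Predicted probabilities must stay within [0, 1].
    probas = model.predict_proba(xtest)
    assert np.all((probas >= 0) & (probas <= 1))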
Post-train tests
- Invariance tests: Test the model’s consistency by altering a single feature in a data point while keeping all other features constant. For instance, changing the gender of an individual should not impact their loan eligibility.
- Directional expectations: Verify whether there is a clear relationship between certain feature values and predictions. For example, a higher credit score should positively impact loan eligibility. Test whether the model correctly captures these expected relationships (a sketch of such a test appears in the Examples section below).
- Additional failure mode tests: Identify any specific failure modes or weaknesses in the model and design tests to check for those. This could involve scenarios where the model is prone to errors or biases.
Examples -
# Testing data leakage: no row should appear in both the training and test sets.
import pandas as pd

def test_data_leak(data_preparation):
    # `data_preparation` is expected to be a pytest fixture that returns the train/test split.
    xtrain, ytrain, xtest, ytest = data_preparation
    concat_df = pd.concat([xtrain, xtest])
    concat_df.drop_duplicates(inplace=True)
    # If no rows were dropped, the training and test sets share no duplicate rows.
    assert concat_df.shape[0] == xtrain.shape[0] + xtest.shape[0]
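The temporal data leakage check from the pre-train list could follow the same pattern. This is only a sketch, assuming (hypothetically) that both feature frames carry a date column:

def test_temporal_data_leak(data_preparation):
    xtrain, ytrain, xtest, ytest = data_preparation
    # Every training timestamp must come before every test timestamp,
    # so the model is never trained on data from the "future".
    assert xtrain["date"].max() < xtest["date"].min()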
# Testing invariance: changing only the sex feature should not change the prediction.
import numpy as np
import pytest_check as check

def test_sex_invariance(models_):
    # `models_` is expected to be a pytest fixture that returns the trained models.
    for model in models_:
        print("Checking for " + str(model.__class__.__name__))
        # The two samples differ only in the sex feature, assumed here to be
        # encoded at index 1.
        female_sample = [19, 0, 27.9, 0, 1, 2, 1, 1]
        male_sample = [19, 1, 27.9, 0, 1, 2, 1, 1]
        result_female_sample = model.predict(np.array(female_sample).reshape(1, -1))
        result_male_sample = model.predict(np.array(male_sample).reshape(1, -1))
        # Soft assertion from the pytest-check plugin.
        check.equal(result_female_sample[0], result_male_sample[0])
        # Hard assertion: the predictions must match exactly.
        assert np.array_equal(result_female_sample, result_male_sample)
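A directional-expectation test (see the post-train list above) could follow the same pattern. The sketch below assumes a hypothetical loan-approval setup: the credit-score feature sits at an assumed index, the sample values are made up, and class 1 means the loan is approved.

# Testing a directional expectation.
import numpy as np

CREDIT_SCORE_IDX = 3  # assumed position of the credit-score feature

def test_credit_score_direction(models_):
    for model in models_:
        base_sample = [35, 1, 50000, 650, 2, 0, 1, 10]  # hypothetical applicant
        better_sample = list(base_sample)
        better_sample[CREDIT_SCORE_IDX] += 100  # raise the credit score
        p_base = model.predict_proba(np.array(base_sample).reshape(1, -1))[0, 1]
        p_better = model.predict_proba(np.array(better_sample).reshape(1, -1))[0, 1]
        # A higher credit score should not lower the predicted approval probability.
        assert p_better >= p_base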
There are also existing tools for testing machine learning systems, such as Deepchecks (see the references below).
References -
- Deepchecks: Testing ML Models
- Metamorphic Testing
- Test Cases for ML