Alex Mac · BCS · 8 min read

So... how do you test an AI system?

Attended a webinar with the BCS on the topic of AI testing, with QAT Professional Bryan Jones

Attended a webinar hosted by my local BCS branch, with guest speaker Bryan Jones.

Artificial Intelligence (AI)

Will it take our jobs, or will it be a saviour and make our lives easier?

The bad - it won’t be the saviour of the world and it won’t solve all the issues in society.

The good - it won’t destroy the world, and there will always be plenty of jobs for people who need to maintain AI systems.

AI is defined by the House of Lords Select Committee as "Technologies with the ability to perform tasks that would otherwise require human intelligence, such as visual perception, speech recognition, and language translation" (AI in the UK: ready, willing and able?).

Legislation and Regulation

We were introduced to a series of legislative frameworks. Across the world there are multiple pieces of legislation, with more being developed as AI becomes more mainstream.

Problem with AI Testing

There is a serious issue of AI systems being released without adequate testing. So how do we test them?

The testing of AI is similar to traditional testing, but there are challenges specific to AI which require us to keep an open mind and approach them with a different way of thinking.

Bryan directed our attention to what is known as the Oracle Problem:

  • No Expected Results - results are based on complex patterns and models, making it hard to define what the 'correct' outcome is for a given input.
  • Non-Deterministic Outputs - different results can be given for the same input.
  • Probabilistic Answers - can make it complicated to verify the answer given.

This effectively means we don't always have expected results to compare against or draw conclusions from, so we can't simply tick outputs off against what already exists, as we could with well-established frameworks and systems.
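One practical response to the Oracle Problem is to stop asserting a single expected value and instead assert a statistical property over many runs. A minimal sketch, where `noisy_model` is a purely hypothetical stand-in for a non-deterministic AI component:

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

def noisy_model(x):
    """Hypothetical stand-in for a non-deterministic AI component."""
    return 2.0 * x + random.gauss(0, 0.1)

def test_within_tolerance(runs=500):
    # No single 'correct' answer exists, so assert a statistical
    # property instead: the mean prediction for x=3 should sit
    # inside a tolerance band around 6.
    outputs = [noisy_model(3.0) for _ in range(runs)]
    mean = statistics.mean(outputs)
    assert abs(mean - 6.0) < 0.05, f"mean {mean} outside tolerance"
    return mean

mean = test_within_tolerance()
print(f"mean over 500 runs: {mean:.3f}")
```

The tolerance band replaces the tick-box comparison: we verify a property of the distribution of outputs rather than one exact result.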

Other challenges with AI include:

  • Code doesn’t represent the algorithm - we can’t just review the code to understand what is going on.
  • Data Complexity & Volume - AI uses large datasets, with the datasets themselves being of varying quality. There are also challenges related to scale when the data fed to the system continually increases in size.
  • Self-Optimising - if we run the same data set through it twice, we might get different answers.
  • System Complexity - the problems we give to AI can be really complex.
  • Attempting to Mimic Human Abilities - AI can struggle with human concepts such as ambiguity, reading emotions and decision-making.
  • Bias - problems of bias are well known in humans, and they also apply to AI.
  • Test Coverage - how do we measure how much testing we have done, and whether it is enough? Traditional coverage metrics may be inappropriate.

Types of AI Testing

There are familiar tests, such as:

  • Unit or Component testing - can it handle out of range values, forced zeros, blanks and nulls, formatting errors, duplication, etc.
  • Integration testing - systems thinking is essential here, especially with facial recognition.
  • End-to-end testing - very familiar but beware of complexity. Involve experts to help decide what the ‘right’ answer is.
  • User Acceptance Testing - similar to what we already have, but watch out for automation bias (where humans place too much trust in automated systems).
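The component-level edge cases above translate directly into ordinary unit checks. A minimal sketch, assuming a hypothetical `clean_ages` preprocessing step that feeds a model:

```python
def clean_ages(values, lo=0, hi=120):
    """Hypothetical preprocessing step: keep valid, unique ages."""
    seen, out = set(), []
    for v in values:
        if v is None or v == "":        # blanks and nulls
            continue
        try:
            age = int(v)                # formatting errors
        except (TypeError, ValueError):
            continue
        if not lo <= age <= hi:         # out-of-range values
            continue
        if age in seen:                 # duplication
            continue
        seen.add(age)
        out.append(age)
    return out

# Unit/component checks against the edge cases listed above:
assert clean_ages([25, "40", None, "", "abc", 200, -3, 25]) == [25, 40]
assert clean_ages([]) == []
```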

We also have some new tests:

  • Model Testing - evaluating how reliable and accurate the AI model being used is.
  • Data Testing - “the code is not the algorithm, the data defines the behaviour”
  • Monitor in live - considers performance, drift (context, concept or model drift as well as data drift)

Model testing isn’t always easy:

  • Statistical in nature
  • Large data volumes
  • Complex to interpret - data scientists are important here, as the data can be extremely complex to interpret.
  • Testers - need curiosity, critical thinking, system thinking and question asking
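To make "how reliable and accurate is the model?" concrete, per-class metrics matter as much as the headline number. A minimal sketch, with invented spam/ham data for illustration:

```python
def evaluate(y_true, y_pred):
    """Overall accuracy plus per-class recall, so a model that
    ignores a rare class cannot hide behind a good average."""
    assert len(y_true) == len(y_pred)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    per_class = {}
    for cls in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == cls]
        per_class[cls] = sum(y_pred[i] == cls for i in idx) / len(idx)
    return accuracy, per_class

# A model that always predicts "ham" looks 80% accurate overall,
# yet has 0% recall on "spam" -- exactly what a tester should probe.
y_true = ["ham"] * 8 + ["spam"] * 2
y_pred = ["ham"] * 10
acc, recall = evaluate(y_true, y_pred)
print(f"accuracy={acc}, spam recall={recall['spam']}")
```

This is where the curiosity and critical thinking come in: the question is not "what is the accuracy?" but "where, and for whom, is the model wrong?"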

When it comes to data testing we need to check:

  • Bias & Fairness - the data the model is exposed to may be subject to bias, and there can be proxies to bias (such as occupations that may be mainly one specific gender)
  • Data Diversity and Representativeness - considers variety of data samples in the dataset, as well as how well the dataset truly reflects the data in the domain in which the model is being used.
  • Data Labelling and Annotation - important when training an AI; we need to ensure the data has correct classifications and values, and if labelling is too broad or too narrow it can result in inaccuracies.
  • Data Splitting Problems - a large amount of data is provided to the model, with a small amount held back as testing data.
  • Statistical analysis - can be used to check for these issues, but the tester also needs to be aware of it.
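Data splitting problems are easy to demonstrate: if the data arrives ordered, a naive hold-out slice is not representative. A minimal sketch with a hypothetical `split_balance` check and invented labels:

```python
from collections import Counter

def split_balance(labels, train_frac=0.8):
    """Hold back a test slice and compare class proportions between
    train and test -- a crude representativeness check."""
    cut = int(len(labels) * train_frac)
    train, test = labels[:cut], labels[cut:]
    def props(xs):
        return {k: v / len(xs) for k, v in Counter(xs).items()}
    return props(train), props(test)

# Sorted labels produce a badly skewed split: the held-back test
# slice contains only class "b" and never sees class "a" at all.
labels = ["a"] * 80 + ["b"] * 20
train_p, test_p = split_balance(labels)
print(train_p, test_p)
```

Shuffling or stratified splitting fixes this particular failure; the point is that the tester should check the split rather than assume it.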

Monitor in Live:

  • Performance - the accuracy can be more important than the overall speed, such as in healthcare.
  • Drift - context, concept or model drift is when the task the model was designed for changes over time. Spam email detection is an example: as spammers change the way they operate, the model can become inaccurate. Data drift is when the data itself changes over time; an example is an AI handling loans for a bank, where COVID disrupted normal practices and behaviours and changed the incoming data.
  • Remember: just because it’s working now, does not mean it always will!
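A live drift monitor can start very simply: compare a live window of a feature against its training-time baseline and alert when it moves too far. A minimal sketch with invented numbers (the loan-amount framing echoes the COVID example above):

```python
import statistics

def drift_alert(baseline, live, threshold=2.0):
    """Flag data drift when the live mean sits more than `threshold`
    baseline standard deviations from the training-time mean."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    shift = abs(statistics.mean(live) - mu) / sigma
    return shift > threshold, shift

# Toy loan amounts: a steady live window passes, a shocked one alerts.
baseline = [10, 11, 9, 10, 12, 10, 11, 9, 10, 10]
steady = [10, 11, 10, 9, 11]
shocked = [20, 22, 19, 21, 23]
assert drift_alert(baseline, steady)[0] is False
assert drift_alert(baseline, shocked)[0] is True
```

Real monitoring would use richer distribution tests, but even a crude alert like this embodies the principle: working now does not mean working always.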

Test Techniques

Further testing techniques can be used with AI:

  • Explainability - AI needs to be able to explain the parameters that led to an outcome. Example: if an applicant is rejected for a loan, they have the right to know why.
  • A/B Testing - also known as split testing; two versions of a model are compared to determine which performs better.
  • Parallel Testing - run the new system alongside the old one and compare the outcome, this is quite common in payroll systems.
  • Statistical Analysis - important to understand and use.
  • Exploratory Testing - AI is an ideal arena for exploration.
  • Use Experts - we can use human experts to check the output of AI systems.
  • Pairwise/Orthogonal Testing - a mathematical approach that covers every pair of parameter values (rather than every possible combination) to surface potential anomalies and defects.
  • Metamorphic Testing - checking that known relations hold between inputs and outputs: when the input is changed in a defined way, the output should change (or stay the same) in a predictable way.
  • Adversarial Testing - providing inputs that try and break the system or manipulate it to provide incorrect answers. This is often used in security testing to see if it reveals data that it shouldn’t.
  • Model Backtesting - widely used in the finance sector to estimate how a model would have performed on past trading, investment and fraud data. Historic data is fed into the model; if it produces similar results to what actually happened, it is considered to be working.
  • Dual Coding/Algorithm Ensemble - multiple models using different algorithms are given the same dataset and their predictions are compared. The model that gives the most expected outcome is taken as the default.
  • Coverage Data - datasets designed so that all neurons and nodes in the network are triggered.
  • Cross Validation - tests a model iteratively, with the data split so that each portion is held out in turn; the overall goal is to see how the model handles data it has not seen before.
  • Affordances Modelling - the properties of an object dictate how it will be used; here we look at how the system is actually being used by its users.
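To illustrate metamorphic testing from the list above: we may not know the 'correct' output, but we can assert a relation between outputs for related inputs. A minimal sketch, where `spam_score` is a hypothetical stand-in for a model and the relation is "word order must not change the score":

```python
import random

def spam_score(text):
    """Hypothetical stand-in for a model: fraction of flagged words."""
    flagged = {"free", "winner", "prize"}
    words = text.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def metamorphic_shuffle_check(text, trials=20):
    """Metamorphic relation: shuffling the words of the input
    must leave the score unchanged."""
    base = spam_score(text)
    for _ in range(trials):
        words = text.split()
        random.shuffle(words)
        assert spam_score(" ".join(words)) == base
    return base

score = metamorphic_shuffle_check("You are a winner claim your free prize now")
print(score)
```

The same pattern works against a real model: pick a transformation of the input whose effect on the output you can predict, and assert that relation.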

Conclusion

Old test techniques work with AI, but it is helpful to know more about AI & Data Science. It’s also important to consider that not all AI is machine learning or Large Language Models (LLM).

It is vital that a tester is curious, using critical thinking, system thinking and asking questions when it comes to testing AI systems. Of equal importance to the testers are data scientists - they are the best friends of the AI testers!

Explainability is very important and should be used early in the designs.

This was a very interesting webinar and introduced me to factors I hadn't even thought of! The webinar shows that AI still very much relies on the human touch, so the doom and gloom that AI will steal all our jobs appears to be unfounded.

AI has limitations - if what we are giving it isn't well understood, then we have to ask whether AI is the best solution at all. A key principle of AI systems appears to be that much of what they can do, and how accurate they can be, relies on the data fed to them.

From a business perspective, it was also recommended that any organisation that wants to utilise AI should at the very least have testers and data scientists available to them before implementing AI systems.

Risk assessments are also important when it comes to AI, as if something goes wrong in some contexts then it could be catastrophic, especially in cases such as healthcare. This again requires human oversight to provide verification that everything is within a tolerance and operating as it should.

Due to my interest in this, and admittedly because I know little about AI at this stage, I have ordered a book from the BCS that tells me more about AI, which can be found at this link.
