What is data poisoning and how do we stop it?

A digital representation of a human profile with binary code - data and machine learning
(Image credit: Shutterstock / Ryzhi)

The latest trend in businesses is the adoption of machine learning models to bolster AI systems. However, as this process gets more and more automated, this naturally puts them at greater risk of new emerging threats to the function and integrity of AI, including data poisoning.

About the author

Spiros Potamitis is Senior Data Scientist at Global Technology Practice at SAS.

Below, discover what data poisoning is, how it threatens business systems, and finally how to defeat it and win the fight against those who wish to manipulate data for their own gain

Machine learning models and how they work

Before we discuss data poisoning, it’s worth revisiting how machine learning models work. We train these models to make predictions by ‘feeding’ them with historical data. From these data, we already know the outcome that we would like to predict in the future and the characteristics that drive this outcome. These data ‘teach’ the model to learn from the past. The model can then use what it has learned to predict the future. As a rule of thumb, when more data are available to train the model, its predictions will be more accurate and stable.

AI systems that include machine learning models are normally developed by experienced data scientists. They thoroughly examine and explore the data, remove outliers and run several sanity and validation checks before, during and after the model development process. This means that, as far as possible, the data used for training genuinely reflect the outcomes that the developers want to achieve.

Data poisoning taints the automation process

However, what happens when this training process is automated? This does not very often occur during development, but there are many occasions when we want models to continuously learn from new operational data: ‘on the job’ learning. At that stage, it would not be difficult for someone to develop ‘misleading’ data that would directly feed into AI systems to make them produce faulty predictions.

Consider, for example, Amazon or Netflix’s recommendation engines. Think how easy it is to change the recommendations you receive by buying something for someone else. Now consider that it is possible to set up bot-based accounts to rate programs or products millions of times. This will clearly change ratings and recommendations, and ‘poison’ the recommendation engine.

This is known as data poisoning. It is particularly easy if those involved suspect that they are dealing with a self-learning system, like a recommendation engine. All they need to do is make their attack clever enough to pass the automated data checks—which is not usually very hard.

The other issue with data poisoning is that it could be a long, slow process. Hackers can afford to take their time to change the data by feeding in a few results at a time. Indeed, this is often more effective, because it is harder to detect than a massive influx of data at a single point in time—and significantly harder to undo.

Winning the fight against data poisoners

Fortunately, there are steps that organizations can take to prevent data poisoning. These include

1. Establish an end-to-end ModelOps process to monitor all aspects of model performance and data drifts, to closely inspect system function.

2. For automatic re-training of models, establish a business flow. This means that your model will have to go through a series of checks and validations by different people in the business before the updated version goes live.

3. Hire experienced data scientists and analysts. There is a growing tendency to assume that everything technical can be handled by software engineers, especially with the shortage of qualified and experienced data scientists. However, this is not the case. We need experts who really understand AI systems and machine learning algorithms, and who know what to look for when we are dealing with threats like data poisoning.

4. Use ‘open’ with caution. Opensource data are very appealing because they provide access to more data to enrich existing sources. In principle, this should make it easier to develop more accurate models. However, these data are just that: open. This makes them an easy target for fraudsters and hackers. The recent attack on PyPI, which flooded it with spam packages, shows just how simple this can be.

Humans are the unsung heroes of machine learning

It is vital that businesses follow the recommendations above so as to defend against the threat of data poisoning. However, there remains a crucial means of protection that often gets overlooked: human intervention. While businesses can automate their systems as much as they would like, it is paramount that they rely on the trained human eye to ensure effective oversight of the entire process. This prevents data poisoning from the offset, allowing organizations to innovate through insights, with their AI assistants beside them.

Spiros Potamitis is Senior Data Scientist at Global Technology Practice at SAS, specializing in the development and implementation of advanced analytics solutions across different industries. Spiros provides subject matter expertise in the areas of Forecasting, Machine Learning and AI.