Privacy-preserving artificial intelligence: training on encrypted data

In the era of artificial intelligence (AI) and big data, predictive models have become an essential tool across industries including healthcare, finance and genomics. These models rely heavily on processing sensitive information, which makes data privacy a critical concern. The key challenge lies in maximizing data utility without compromising the confidentiality and integrity of the information involved. Achieving this balance is essential for the continued advancement and acceptance of AI technologies.

Jordan Fréry

Machine Learning Tech Lead at Zama.

Collaboration and open source

Creating a robust dataset for training machine learning models presents significant challenges. For instance, while AI technologies such as ChatGPT have thrived by gathering vast amounts of data available on the internet, healthcare data cannot be compiled this freely due to privacy concerns. Constructing a healthcare dataset involves integrating data from multiple sources, such as doctors and hospitals, often across borders.

The healthcare sector is emphasized due to its societal importance, yet the principles apply broadly. For example, even a smartphone autocorrect feature, which personalizes predictions based on user data, must navigate similar privacy issues. The finance sector also encounters obstacles in data sharing due to its competitive nature.

Thus, collaboration emerges as a crucial element for safely harnessing AI's potential within our societies. However, an often overlooked aspect is the actual execution environment of AI and the underlying hardware that powers it. Today’s advanced AI models require powerful hardware, including extensive CPU/GPU resources, substantial amounts of RAM and more specialized technologies such as TPUs, ASICs and FPGAs. At the same time, user-friendly interfaces with straightforward APIs are gaining popularity, so models are increasingly run on hardware operated by someone other than the data owner. This highlights the importance of developing solutions that enable AI to operate on third-party platforms without sacrificing privacy, and the need for open-source tools that facilitate these privacy-preserving technologies.

Privacy solutions to train machine learning models

To address the privacy challenges in AI, several sophisticated solutions have been developed, each focusing on specific needs and scenarios.

Federated Learning (FL) allows for the training of machine learning models across multiple decentralized devices or servers, each holding local data samples, without actually exchanging the data. Similarly, Secure Multi-party Computation (MPC) enables multiple parties to jointly compute a function over their inputs while keeping those inputs private, ensuring that sensitive data does not leave its original environment.
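As a rough illustration of the two ideas above, the following Python sketch shows an aggregation step in the style of federated averaging and a sum computed over additive secret shares. The function names, the modulus and the toy inputs are assumptions made for this example. Real FL and MPC frameworks add networking, authentication and protection against dishonest participants.

```python
import random

# Modulus for the secret-sharing arithmetic (an illustrative choice, not a standard).
PRIME = 2**61 - 1


def federated_average(client_weights, client_sizes):
    """Average per-client model weights, weighted by local dataset size.

    Only the weights travel to the aggregator; the raw training data stays local.
    """
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]


def share(secret, n_parties, rng):
    """Split an integer into n additive shares that sum to it modulo PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n_parties - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares


def secure_sum(secrets, rng=None):
    """Compute the sum of private inputs from their shares only."""
    rng = rng or random.SystemRandom()
    n = len(secrets)
    # Each party splits its own input and hands one share to every party.
    all_shares = [share(s, n, rng) for s in secrets]
    # Party j adds up the j-th share it received from every participant.
    partial_sums = [sum(column) % PRIME for column in zip(*all_shares)]
    # Combining the partial sums reveals the total, not the individual inputs.
    return sum(partial_sums) % PRIME
```

In this sketch, a single share is a uniformly random number and reveals nothing about the input on its own. The price is that every party has to send a share to every other party, which hints at the communication cost of MPC.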

Another set of solutions focuses on manipulating data to maintain privacy while still allowing for useful analysis. Differential Privacy (DP) introduces noise to data in a way that protects individual identities but still provides accurate aggregate information. Data Anonymization (DA) removes personally identifiable information from datasets, ensuring some anonymity and mitigating the risk of data breaches.
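To make the idea of adding noise more concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. The function names and the record layout are hypothetical, and the standard-library random generator is used only for illustration. A production system would need a carefully vetted noise source and a way to track the privacy budget across queries.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy the predicate.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller epsilon means more noise and stronger privacy, while a larger epsilon gives a more accurate but less private answer.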

Finally, Homomorphic Encryption (HE) allows operations to be performed directly on encrypted data, generating an encrypted result that, when decrypted, matches the result of the same operations performed on the plaintext.
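The property described above can be demonstrated with a toy version of the Paillier cryptosystem, which supports addition on ciphertexts. This is an assumption chosen for brevity: Paillier is only additively homomorphic, it is not the scheme used by FHE libraries, and the hard-coded primes below are far too small for real use. All function names are made up for this sketch.

```python
import math
import random

# Toy key material: two well-known Mersenne primes. Not secure, illustration only.
P = 2**31 - 1
Q = 2**61 - 1


def keygen(p=P, q=Q):
    """Return the public modulus n and the private key (lam, mu)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, needs Python 3.8+
    return n, (lam, mu)


def encrypt(n, message, rng=None):
    """Encrypt an integer 0 <= message < n under the public modulus n."""
    rng = rng or random.SystemRandom()
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    # With generator g = n + 1, g**message mod n**2 simplifies to 1 + message * n.
    return ((1 + message * n) % n_sq) * pow(r, n, n_sq) % n_sq


def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n) * mu % n


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)


def multiply_by_constant(n, ciphertext, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    return pow(ciphertext, k, n * n)
```

For example, after `n, sk = keygen()`, the call `decrypt(n, sk, add_encrypted(n, encrypt(n, 12), encrypt(n, 30)))` returns 42, even though the party performing the addition never sees 12 or 30.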

The perfect fit

Each of these privacy solutions has its own set of advantages and trade-offs. FL, for instance, still requires participants to communicate with a third-party server, and the model updates they send can potentially leak some information about the underlying data. MPC operates on cryptographic principles that are robust in theory but can create significant bandwidth demands in practice.

DP involves a manual setup where noise is strategically added to the data. This setup limits the types of operations that can be performed on the data, as the noise needs to be carefully balanced to protect privacy while retaining data utility. DA, while widely used, often provides the least privacy protection. Since anonymization typically occurs on a third-party server, there is a risk that cross-referencing can expose the hidden entities within the dataset.
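The cross-referencing risk mentioned for DA can be sketched in a few lines. The records, field names and helper functions below are entirely made up for illustration. The point is only that removing names does not help when the remaining attributes, such as postcode and birth year, also appear in another dataset.

```python
def anonymize(records, identifier_fields=("name",)):
    """Drop direct identifiers and keep every other field untouched."""
    return [
        {key: value for key, value in record.items() if key not in identifier_fields}
        for record in records
    ]


def link(anonymized, public, quasi_identifiers):
    """Re-identify records whose quasi-identifiers match exactly one public entry."""
    matches = []
    for record in anonymized:
        key = tuple(record[field] for field in quasi_identifiers)
        candidates = [
            person for person in public
            if tuple(person[field] for field in quasi_identifiers) == key
        ]
        if len(candidates) == 1:
            matches.append((candidates[0]["name"], record))
    return matches


# Hypothetical toy data, not taken from any real dataset.
medical = [
    {"name": "Patient A", "zip": "75001", "birth_year": 1980, "diagnosis": "asthma"},
    {"name": "Patient B", "zip": "75002", "birth_year": 1975, "diagnosis": "diabetes"},
]
public_register = [
    {"name": "Patient A", "zip": "75001", "birth_year": 1980},
    {"name": "Someone Else", "zip": "69001", "birth_year": 1990},
]

released = anonymize(medical)
reidentified = link(released, public_register, ("zip", "birth_year"))
```

Here `reidentified` pairs "Patient A" with the asthma record, although the released data contains no names.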

HE, and specifically Fully Homomorphic Encryption (FHE), stands out by allowing computations on encrypted data that closely mimic those performed on plaintext. This capability makes FHE highly compatible with existing systems and straightforward to implement, thanks to accessible open-source libraries and compilers such as Concrete ML, which are designed to give developers easy-to-use tools for building different applications. The major drawback at the moment is that computation on encrypted data is slower than on plaintext, which can impact performance.

While all the solutions and technologies discussed here encourage collaboration and joint efforts, FHE, with its stronger protection of data privacy, can drive innovation and enable a scenario in which users no longer have to trade away their personal data to enjoy services and products.

This article was produced as part of TechRadarPro's Expert Insights channel where we feature the best and brightest minds in the technology industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you are interested in contributing find out more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro
