Do AI chatbots learn from everything we enter? That’s probably the most common question I’m asked when it comes to AI and data security.
I can see the logic in asking this. Because AI chatbots talk to us in a human-like way, it’s easy to assume that they also learn in a human-like way. There’s also a tendency to assume that, just like a human, they not only learn from a conversation but are probably awful at keeping secrets, and might well share what they’ve learnt with many other users.
Why does this matter? Because I think this assumption is driving how we approach AI and security, with people often focusing too much on the very limited risks around AI models and training data, and not enough on other, more pressing AI security issues, such as the overall security of the tools they are using.
At one end of the spectrum, it leads users to think any system is safe as long as it says it doesn’t use interactions for training; at the other end, it fuels the idea that developing in-house solutions, so data can never be used for training, is the most effective way to get a secure AI system.
AI and training data
To understand this, it’s good to think a little more about the data involved. The data used to train large language models is typically scraped from the internet, licensed from big media companies, or drawn from books obtained by potentially unethical means. As an aside, that means any data we intentionally or otherwise make available on the internet could be used to train AI models – it’s not just about interactions with AI tools.
Language models are trained infrequently, as each training run can cost millions of pounds, and the model isn’t updated as we use it.
As these are language models, the companies behind them are looking for high-quality text, so the things we tend to type into AI tools – random chats, fragments of text and so on – generally aren’t useful as training data.
In their terms and conditions, the companies will ask for permission to use our data, but this is primarily to understand how we use the tool and to help them improve its overall performance.
But even if our text did find its way into the training set, the chance of it being output by the model is close to zero. That’s because the size of the training set is vast—typically 10 trillion words. To put this into context, if you started reading it today, it would take over 120,000 years to read it all! (and yes, I should confess, ChatGPT helped me work that out).
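If you want to sanity-check that figure yourself, a few lines of Python will do it. The numbers below are my own rough assumptions (a 10-trillion-word training set, a reading speed of about 200 words a minute, 16 hours of reading a day), not anything the AI companies publish:

```python
# Rough sanity check of the "over 120,000 years" claim.
# All numbers here are illustrative assumptions, not published figures.

TRAINING_SET_WORDS = 10 * 10**12   # assume ~10 trillion words
WORDS_PER_MINUTE = 200             # assume a typical adult reading speed
READING_HOURS_PER_DAY = 16         # assume reading every waking hour

minutes_needed = TRAINING_SET_WORDS / WORDS_PER_MINUTE
days_needed = minutes_needed / (READING_HOURS_PER_DAY * 60)
years_needed = days_needed / 365

print(f"Roughly {years_needed:,.0f} years of reading")
# Prints roughly 142,694 years with these assumptions.
```

Tweak the reading speed or the hours per day and the answer moves, but it stays comfortably in the hundred-thousand-year range, which is the only point that matters here.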
Language models aren’t knowledge models, so they can’t look things up in the training set. Instead, each extra piece of data has a tiny, tiny influence on the overall output, which, in the end, is just a prediction of what text comes next. It’s partly why we don’t see lots of personal data revealed by tools such as ChatGPT and Gemini.
Data security
So, does that mean we don’t need to worry about using AI tools at all from a security perspective? Absolutely not!
Legally, we must prevent our staff and students’ private data from being used for purposes other than those necessary and contractually agreed. But we also need to understand where the actual risks lie, and they are much more about general data security than about training.
The biggest risks are actually around data being poorly secured or shared with third parties. Only a few weeks ago, security researchers found a basic security flaw that meant anyone could access the Chinese chatbot DeepSeek’s database and view users’ chat histories.
It’s tempting, perhaps, to think the best way to get secure AI is to create or host your own solutions. But securing software is really hard! And don’t forget, our word processors and spreadsheets, for example, handle our most sensitive data, but we don’t build or host our own in-house versions of them. We manage the risks through contracts, user training, policies and technical controls.
So, what’s the best approach to using AI safely and securely? It’s simple. Don’t think of AI systems as gossipy friends eager to spill your secrets. Instead, treat them like any other IT system—where security and contracts matter most.
Contracts are key, and if you take away only one thing from reading this, it should be to use AI systems with a robust agreement in place that ensures responsible and secure data handling.
Your thoughts