
The Foresight AI model uses data taken from hospital and family doctor records in England
Hannah McKay/Reuters/Bloomberg via Getty Images
An artificial intelligence model trained on the medical data of 57 million people who have used the National Health Service in England could one day assist doctors in predicting disease or forecasting hospitalisation rates, its creators have claimed. However, other researchers say there are still significant privacy and data protection concerns around such large-scale use of health data, while even the AI’s architects say they can’t guarantee that it won’t inadvertently reveal sensitive patient data.
The model, called Foresight, was first developed in 2023. That initial version used OpenAI’s GPT-3, the large language model (LLM) behind the first version of ChatGPT, and trained on 1.5 million real patient records from two London hospitals.
Now, Chris Tomlinson at University College London and his colleagues have scaled up Foresight to create what they say is the world’s first “national-scale generative AI model of health data” and the largest of its kind.
Foresight uses eight different datasets of medical information routinely collected by the NHS in England between November 2018 and December 2023 and is based on Meta’s open-source LLM Llama 2. These datasets include outpatient appointments, hospital visits and vaccination records, comprising a total of 10 billion different health events for 57 million people – essentially everyone in England.
Tomlinson says his team isn’t releasing information about how well Foresight performs because the model is still being tested, but he claims it could one day be used to do everything from making individual diagnoses to predicting broad future health trends, such as hospitalisations or heart attacks. “The real potential of Foresight is to predict disease complications before they happen, giving us a valuable window to intervene early, and enabling a shift towards more preventative healthcare at scale,” he told a press conference on 6 May.
While the potential benefits are yet to be demonstrated, there are already concerns about people’s medical data being fed to an AI at such a scale. The researchers insist all records were “de-identified” before being used to train the AI, but the risk that someone could use patterns in the data to re-identify records is well documented, particularly with large datasets.
“Building powerful generative AI models that protect patient privacy is an open, unsolved scientific problem,” says Luc Rocher at the University of Oxford. “The very richness of data that makes it valuable for AI also makes it incredibly hard to anonymise. These models should remain under strict NHS control where they can be safely used.”
“The data that goes into the model is de-identified, so the direct identifiers are removed,” said Michael Chapman at NHS Digital, speaking at the press conference. But Chapman, who oversees the data used to train Foresight, admitted that there is always a risk of re-identification: “It’s then very hard with rich health data to give 100 per cent certainty that somebody couldn’t be spotted in that dataset.”
To mitigate this risk, Chapman said the AI is operating within a custom-built “secure” NHS data environment to ensure that information isn’t leaked out of the model and is accessible only to approved researchers. Amazon Web Services and data company Databricks have also supplied “computational infrastructure”, but can’t access the data, said Tomlinson.
Yves-Alexandre de Montjoye at Imperial College London says one way to check whether models can reveal sensitive information is to verify whether they can memorise data seen during training. When asked by New Scientist whether the Foresight team had conducted these tests, Tomlinson said it hadn’t, but that it was looking at doing so in the future.
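Such a memorisation check is typically framed as a prompting test: give the model the start of a real training record and see whether it completes it with the genuine, potentially identifying details. The sketch below is purely illustrative and is not the Foresight team’s code; the model path and record text are hypothetical placeholders.

```python
# Illustrative memorisation probe, NOT the Foresight team's code.
# Assumes a locally held causal LLM fine-tuned on de-identified records;
# the model path and record text below are hypothetical placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/local-health-llm"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A prefix taken from a (synthetic) training record; a memorising model
# may reproduce the rest of the record verbatim in its completion.
prefix = "Patient admitted 2021-03-14 with chest pain; prior history of"

inputs = tokenizer(prefix, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=False)
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)

# Flag potential memorisation if the completion reproduces the true
# continuation of the record rather than a generic plausible one.
print(completion)
```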
Using such a vast dataset without communicating to people how the data has been used can also weaken public trust, says Caroline Green at the University of Oxford. “Even if it is being anonymised, it’s something that people feel very strongly about from an ethical point of view, because people usually want to keep control over their data and they want to know where it’s going.”
But existing controls give people little chance to opt out of their data being used by Foresight. All of the data used to train the model comes from nationally collected NHS datasets, and because it has been “de-identified”, existing opt-out mechanisms don’t apply, says an NHS England spokesperson, though people who have chosen not to share data from their family doctor won’t have this fed into the model.
Under the General Data Protection Regulation (GDPR), people must have the option to withdraw consent for the use of their personal data, but because of the way LLMs like Foresight are trained, it isn’t possible to remove a single record from an AI tool. The NHS England spokesperson says that “as the data used to train the model is anonymised, it is not using personal data and GDPR would therefore not apply”.
Exactly how the GDPR should address the impossibility of removing data from an LLM is an untested legal question, but the UK Information Commissioner’s Office’s website states that “de-identified” data should not be used as a synonym for anonymous data. “This is because UK data protection law doesn’t define the term, so using it can lead to confusion,” it states.
The legal position is further complicated because Foresight is currently being used only for research related to covid-19, says Tomlinson. That means exceptions to data protection laws enacted during the pandemic still apply, says Sam Smith at medConfidential, a UK data privacy organisation. “This covid-only AI almost certainly has patient data embedded in it, which cannot be let out of the lab,” he says. “Patients should have control over how their data is used.”
Ultimately, the competing rights and responsibilities around using medical data for AI leave Foresight in an uncertain position. “There is a bit of a problem when it comes to AI development, where the ethics and people are a second thought, rather than the starting point,” says Green. “But what we need is for the humans and the ethics to be the starting point, and then comes the technology.”
Article amended on 7 May 2025
We have correctly attributed comments made by an NHS England spokesperson