When seeking to solve a medical or clinical research problem, significant amounts of data are required to provide the insights and trends needed to correctly evaluate an approach or solution. When not enough data is available to model the problem reliably, synthetic data can be produced and supplied from various sources.
In the NHS, when synthetic datasets are generated, or synthetic data-producing models released, there is a gap in support for validating the privacy of the data and benchmarking the product against any standards. This leads to a dependence on the party generating the data to prove the privacy, which creates a conflict of interest.
A popular but complex approach is to use adversarial attacks on the data to establish what can be ascertained by attackers holding different levels of prior information. Roke was asked to work alongside the NHS Transformation Directorate to develop tools which can assist in determining if datasets are vulnerable to such attacks.
Adversarial attacks to recover real information from synthetic datasets are highly varied, and each is specific to the information and project artefacts that the synthetic data publishers release. To develop this proof of concept, we identified two common situations. Both are examples of Membership Inference (MI) attacks.
A suite of extensible tools was built, with example attack scenarios deployed for a range of synthetic data models. Because the suite was built for extensibility, additional attack scenarios can be added to it in future. The scenarios included in this commission were:
Attacker Scenario 1: Researchers have released a model description, as well as the synthetic data produced from it, but not the trained model itself. Using the model architecture and training description, an attacker may reconstruct fresh instances of the model (so-called ‘shadow models’), and train them against the highly realistic synthetic data. These shadow models allow the attacker to look for clues at a per-datum level that give away when details of real individuals have leaked through the synthetic data model.
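The shadow-model idea in Scenario 1 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the tooling built in this project: records are single numbers, `toy_generator` is a hypothetical stand-in for the described synthetic data model, and the "clue" is simply each record's distance to its nearest synthetic neighbour.

```python
import random

def toy_generator(train, rng, noise=0.05):
    """Hypothetical synthetic data model: emits a jittered copy of each training value."""
    return [x + rng.gauss(0, noise) for x in train]

def nearest_distance(x, synthetic):
    """Per-datum feature: distance from a record to its closest synthetic record."""
    return min(abs(x - s) for s in synthetic)

def fit_threshold(member_d, non_member_d):
    """Choose the distance cut-off that best separates shadow members from non-members."""
    def accuracy(t):
        hits = sum(d <= t for d in member_d) + sum(d > t for d in non_member_d)
        return hits / (len(member_d) + len(non_member_d))
    return max(sorted(member_d + non_member_d), key=accuracy)

def train_attack(released, n_shadows=20, seed=0):
    """Train shadow models on the released synthetic data and learn a membership cut-off."""
    rng = random.Random(seed)
    member_d, non_member_d = [], []
    for _ in range(n_shadows):
        pool = list(released)
        rng.shuffle(pool)
        half = len(pool) // 2
        members, non_members = pool[:half], pool[half:]
        # A fresh shadow model is rebuilt from the published description
        # and trained on records whose membership the attacker controls.
        shadow_output = toy_generator(members, rng)
        member_d += [nearest_distance(x, shadow_output) for x in members]
        non_member_d += [nearest_distance(x, shadow_output) for x in non_members]
    return fit_threshold(member_d, non_member_d)

def demo(seed=1):
    """Return the fraction of true members and of outsiders flagged as members."""
    rng = random.Random(seed)
    real = [rng.uniform(0, 100) for _ in range(50)]       # private training data
    outsiders = [rng.uniform(0, 100) for _ in range(50)]  # people not in the training data
    released = toy_generator(real, rng)                   # published synthetic data
    threshold = train_attack(released)
    def flagged(records):
        return sum(nearest_distance(x, released) <= threshold for x in records) / len(records)
    return flagged(real), flagged(outsiders)
```

Calling `demo()` returns two rates. A generator that leaks, as this toy one deliberately does, should flag true members far more often than outsiders; a well-protected model would bring the two rates close together.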
Attacker Scenario 2: Researchers provide an upload facility (a black-box model) for anonymising user-uploaded datasets, and have released example synthetic data and a description of their model, but not the trained model itself. By uploading many copies of the same realistic data (which may be gathered by the attacker, or may be the example synthetic data released by the researchers), the attacker can measure exactly the differences introduced by the black-box model, and train a new model to recognise when the changes introduced for anonymisation have led to a large or a small change to the input data. This attack model can then be run against the example dataset released by the researchers to recognise any records where the stochastic synthetic data model introduced little to no change. The attacker then knows these records to be a real patient's leaked data.
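A rough sketch of the measurement step in Scenario 2 follows. It rests on several assumptions that are not taken from the project: `black_box` is a hypothetical service that handles each numeric record independently and occasionally passes one through unchanged, a simple threshold stands in for the trained attack model, and the attacker is assumed to hold candidate records to compare against the released data.

```python
import random
import statistics

def black_box(records, rng, leak_prob=0.2, noise=5.0):
    """Hypothetical upload service: perturbs most records, passes a few through unchanged."""
    return [x if rng.random() < leak_prob else x + rng.gauss(0, noise) for x in records]

def measure_changes(probe, n_uploads, rng):
    """Upload the same probe data repeatedly and record the size of every change."""
    changes = []
    for _ in range(n_uploads):
        output = black_box(probe, rng)
        changes.extend(abs(o - p) for o, p in zip(output, probe))
    return changes

def fit_small_change_threshold(changes, fraction=0.05):
    """Stand-in attack model: 'little to no change' is below a fraction of the median change."""
    return fraction * statistics.median(changes)

def flag_leaks(candidates, released, threshold):
    """Flag candidate records that sit within the threshold of some released record."""
    return [c for c in candidates if min(abs(c - r) for r in released) <= threshold]

def demo(seed=2):
    rng = random.Random(seed)
    probe = [rng.uniform(0, 1000) for _ in range(100)]  # attacker's realistic data
    threshold = fit_small_change_threshold(measure_changes(probe, 20, rng))
    real = [rng.uniform(0, 1000) for _ in range(100)]   # researchers' private data
    released = black_box(real, rng)                     # published example dataset
    truly_leaked = [r for r, o in zip(real, released) if r == o]
    flagged = flag_leaks(real, released, threshold)
    return threshold, truly_leaked, flagged
```

In this toy setting every record that passed through unchanged is flagged, along with a small number of records that happened to be perturbed only slightly. A real attack model would likely use richer features than a single distance.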
The project developed tools suitable for simulating the attack scenarios outlined above, in an extensible manner. Adaptability of the tools to a wide variety of models was enabled by adopting ‘code-injection’ as the extension mechanism. Here the term means that skilled users with a good understanding of the target models add small amounts of custom code at a specified location for the developed tools to ingest; it does not refer to the security exploit of the same name. This was a reasonable assumption of capability, as the intended users of the suite will always be synthetic data researchers wishing to attack their own releases to discover vulnerabilities.
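One way such an extension point could look is a small registry that user-written model code plugs into. The names below (`MODEL_REGISTRY`, `register_model`, `build_shadow_models`, `GaussianModel`) are hypothetical and chosen only for this sketch; they are not the interface of the developed suite.

```python
import random
import statistics
from typing import Callable, Dict, List, Protocol

class SyntheticModel(Protocol):
    """Interface the (hypothetical) suite expects user-supplied models to follow."""
    def fit(self, data: List[float]) -> None: ...
    def sample(self, n: int) -> List[float]: ...

# Maps a model name to a factory that builds a fresh, untrained instance.
MODEL_REGISTRY: Dict[str, Callable[[], SyntheticModel]] = {}

def register_model(name: str):
    """Decorator a researcher uses to hand their model code to the suite."""
    def decorator(factory):
        MODEL_REGISTRY[name] = factory
        return factory
    return decorator

def build_shadow_models(name: str, data: List[float], n_shadows: int) -> List[SyntheticModel]:
    """Suite-side code: create and train fresh instances of whichever model was registered."""
    shadows = []
    for _ in range(n_shadows):
        model = MODEL_REGISTRY[name]()
        model.fit(data)
        shadows.append(model)
    return shadows

# --- user-supplied code, placed at the location the suite ingests ---
@register_model("gaussian")
class GaussianModel:
    """Toy model: fits a mean and standard deviation, then samples from a normal distribution."""
    def fit(self, data):
        self.mean = sum(data) / len(data)
        self.std = statistics.pstdev(data)
    def sample(self, n):
        return [random.gauss(self.mean, self.std) for _ in range(n)]
```

With this shape, the attack code never needs to know how a particular model works; it only calls `fit` and `sample`, so supporting a new model means writing one small adapter class.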
This project has provided the opportunity to increase assurance against synthetic data attacks, for two particular data release scenarios, and has laid the groundwork for additional scenarios to be accounted for.
The extensible platform allows these attacks to be rolled out quickly, and the technology has the potential to provide a much-needed level of assurance against highly bespoke attacks that existing open source libraries do not cover.
The adoption of this platform would improve system resilience by lowering the threat of unintentional data leaks, and by helping to ensure data can be pre-emptively protected through testing. By applying our capabilities in this area, we’re able to help support the digitalisation of organisations such as the NHS, protecting their systems and reputation.