The Simulacrum v2 is a collection of synthetic datasets based on confidential cancer patient data collected by the National Disease Registration Service (NDRS) at NHS England.
Simulacrum enables researchers to safely explore realistic synthetic cancer data and refine their research questions before applying for anonymous data releases from NDRS.
Since Simulacrum v1 was released in 2018, it has been used to facilitate many projects by the life science industry and academic research groups that use NDRS data. Simulacrum v2 incorporates more recent cancer diagnoses and now provides a more comprehensive view of a patient’s treatment pathway, with the inclusion of chemotherapy, radiotherapy and genomic test data.
This will allow researchers to explore the impact of complex treatment combinations and specific genetic mutations on patient outcomes. Research areas such as these are pivotal in guiding advancements in treatments and precision medicines.
The NDRS collates, quality assures and maintains highly valuable and detailed data on all cancer patients in England, including their treatments and, more recently, their genomic information. This data is vital for understanding the landscape of cancer in the population, informing research into new treatments and ultimately improving patient outcomes.
However, the sensitive nature of NDRS data requires strict protection measures. Direct access is not granted to external researchers without the requisite legal and ethical approvals, which can prove challenging to secure. Alternatively, researchers can request anonymous data releases through the Data Access Request Service (DARS), but this process can still be time-consuming, costly and, at times, unfeasible, due to the resources required from the NDRS team and the complexity of requests.
The Simulacrum is made up entirely of artificial patients and contains no confidential information, so it can be released without risk to patient privacy. It was developed to mirror the statistical properties and structure of NDRS data, meaning it can support researchers to safely:
- Learn about NDRS data, its structure, format and completeness
- Understand whether NDRS data is suitable for answering research questions about cancer
- Refine hypotheses and research questions based on the available data before applying for a data release
- Write code for entire analysis projects on the synthetic data before requesting that code is run on the real data and anonymous results returned
This can reduce the time and cost for researchers when applying for a data release from NDRS and enable entire analysis projects to be conducted without ever requiring direct access to the real data.
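The workflow described above, writing code against the synthetic data so that the same code can later be run on the real data, can be illustrated with a minimal sketch. The column names (`patient_id`, `diagnosis_year`, `site_code`), the sample rows and the function names are all hypothetical, invented for illustration; they are not the actual Simulacrum or NDRS schema. The sketch assumes only that the synthetic and real tables share the same structure.

```python
import csv
import io
from collections import Counter

# Hypothetical extract: these column names and rows are invented for
# illustration and are NOT the real Simulacrum or NDRS schema.
SYNTHETIC_CSV = """patient_id,diagnosis_year,site_code
1,2016,C50
2,2017,C34
3,2017,C50
4,2019,C18
"""


def count_diagnoses_by_year(rows):
    """Count records per diagnosis year, returning aggregate counts only."""
    return Counter(int(row["diagnosis_year"]) for row in rows)


def run_analysis(csv_file):
    """Run the analysis on any file-like object with the assumed columns.

    The function never refers to where the data came from, so the same
    code could be pointed at a synthetic or a real extract.
    """
    counts = count_diagnoses_by_year(csv.DictReader(csv_file))
    return dict(sorted(counts.items()))


# Develop and check the analysis on the synthetic data.
result = run_analysis(io.StringIO(SYNTHETIC_CSV))
print(result)  # {2016: 1, 2017: 2, 2019: 1}
```

Because `run_analysis` depends only on the table structure and returns aggregate counts rather than patient-level rows, code written this way is a natural candidate for being run on the real data, with only the summary results returned.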
What does Simulacrum include?
The Simulacrum v2 is based on the following extracts from datasets collected by the National Cancer Registration and Analysis Service (NCRAS) in NDRS:
- Patient and tumour tables in the National Cancer Registration Datasets (NCRD):
  - Patient and tumour characteristics for all patients in England diagnosed with cancer in years 2016-2019
- Systemic Anti-Cancer Therapy dataset (SACT):
  - Information about all systemic anti-cancer therapy treatments, e.g. chemotherapy and hormone treatments, received by cancer patients in years 2012-2022
- Radiotherapy dataset (RTDS):
  - Information about all radiotherapy treatments received by cancer patients in years 2012-2022
- Genomics testing data:
  - Genomic tests for patients diagnosed with cancer in years 2016-2019
For more detailed information about these datasets, please visit https://digital.nhs.uk/ndrs/about/ncras.
Simulacrum v2 was developed in collaboration with AstraZeneca, who provided data science, oncology and technology expertise and helped to user-test the prototype of the Simulacrum before it was released.
Development of Simulacrum v2 ran from the latest update of Simulacrum v1 in January 2021 until the release of Simulacrum v2 in April 2023.
HDI Senior Data Scientist Lora Frayling has made a short video introduction to synthetic cancer patient data and the Simulacrum.
The video is "NDRS bitesize series - The Simulacrum" by NHS England Digital, hosted on Vimeo.