Our 2020 interns built a synthetic data service
Our 2020 internship programme looked quite different this year, not least because we were all working from home in the middle of a global pandemic. Instead of working individually on separate projects, our interns worked as a team this summer to develop and deliver one big project.
Our six interns, Harry, Aisha, Faiz, Evonne, Lulu and Rebeca, spent 12 weeks this summer creating Syndasera, a prototype for an online service that generates synthetic data for researchers. Organisations that hold real confidential patient data will be able to use Syndasera to request synthetic versions of their data. Access to patient data is often restricted, for good reason; synthetic data lets researchers work more freely while still protecting patient confidentiality.
What did the interns work on?
The interns split into smaller teams to build the Syndasera prototype in four parts:
– Building the synthetic data generator
– Evaluating the synthetic data
– Building a website for the service
– Setting up a secure server
Building the synthetic data generator
Evonne, Harry and Lulu worked on creating the synthetic data using a generative adversarial network (GAN). GANs can generate synthetic data from multi-dimensional datasets that mix different feature types, such as numerical and categorical variables, which is important for healthcare data. The interns also explored how to add privacy protections to the GAN so that individuals could not be identified in the synthetic data. They then tested the GAN on a publicly available dataset about malaria in Uganda.
Evaluating the synthetic data
Aisha, Lulu and Rebeca evaluated the synthetic data, comparing it to the real data to make sure it is similar but does not identify people. They adapted an already published evaluation method for synthetic data to compare their synthetic malaria data with the real data. They compared the distribution of each variable between the two datasets, as well as the relationships between variables, to make sure these were replicated accurately. Lastly, because the synthetic data will be used for analysis, the interns checked that analysing it produced results similar to those from the real data.
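To give a flavour of what comparing distributions and relationships can look like in practice, here is a minimal Python sketch. It is not the interns' code, and the published evaluation method they adapted is not named here. As stand-in assumptions, it uses the two-sample Kolmogorov-Smirnov statistic for the per-variable comparison and the gap in Pearson correlation for the pairwise relationships. The function names, column names and values are all hypothetical.

```python
from bisect import bisect_right
from itertools import combinations
from math import sqrt


def ks_statistic(a, b):
    """Largest gap between the empirical CDFs of two samples (0 means identical)."""
    sa, sb = sorted(a), sorted(b)
    points = sorted(set(sa) | set(sb))
    return max(
        abs(bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb))
        for x in points
    )


def pearson(x, y):
    """Pearson correlation between two equal-length numeric columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / sqrt(vx * vy)


def compare(real, synth):
    """Compare two datasets given as {column name: list of numbers}.

    Returns the KS statistic per column and, for each pair of columns,
    the absolute difference between the real and synthetic correlations.
    """
    marginals = {col: ks_statistic(real[col], synth[col]) for col in real}
    corr_gaps = {
        (c1, c2): abs(pearson(real[c1], real[c2]) - pearson(synth[c1], synth[c2]))
        for c1, c2 in combinations(real, 2)
    }
    return marginals, corr_gaps


# Toy, made-up values for illustration only (not the malaria dataset).
real = {"age": [2, 5, 7, 11, 30], "temperature": [37.1, 38.4, 39.0, 38.2, 37.5]}
synth = {"age": [3, 4, 8, 12, 28], "temperature": [37.0, 38.6, 38.9, 38.0, 37.7]}

marginals, corr_gaps = compare(real, synth)
print(marginals)
print(corr_gaps)
```

Values close to 0 in both dictionaries suggest the synthetic data reproduces the real distributions and relationships. This sketch covers numerical columns only; categorical variables would need a different comparison, and none of this checks whether individuals can be identified.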
Building a website
Faiz and Rebeca spent the summer creating the Syndasera website. Faiz built the website using Ruby on Rails and deployed it to Heroku. Users submit requests through a sign-in section, where they can also follow the progress of their request. There is also an interactive dashboard that displays visualisations of the synthetic data, so users can quickly analyse and visualise their data on the website.
Setting up a secure server
Evonne and Rebeca explored how to set up a secure server using Amazon Web Services that could be linked to the Syndasera website. This would be where users could securely upload the confidential data they wish to synthesise, and where the synthetic data would be generated safely.
What’s next for Syndasera?
With the summer over and the internship concluded, the interns have left us with an amazing prototype. We hope that over the coming year we can bring their end-to-end model of a working synthetic data service to life. We are already in discussions with the potential stakeholders who were involved throughout the internship, and we hope to begin piloting the prototype within that time.
In the meantime, we are adapting the interns’ deep learning algorithms to work on Public Health England’s own cancer patient pathway dataset. This dataset contains information on the pathway of each cancer patient in the country, i.e. the full sequence of each patient’s medical events. Creating a synthetic version of this dataset could open up new opportunities for research, as was done with the Simulacrum.
Despite the challenges of this year and the new organisation of the summer internship, the interns came together and worked incredibly well as a team. Their work demonstrates the great potential that can be realised when interns, with their variety of skills and backgrounds, work together on a single project. This has very much inspired our planning for next year’s internship.