Please note that applications for this scheme have now closed for the 2019 intake.

What is the opportunity?

We are offering up to five internships working on data collected by the largest cancer registry in the world.

You will have the opportunity to undertake an exploratory project guided and supported by developers and analysts. There are seven projects outlined below. However, if you have a specific project of your own in mind that you would like to undertake, please let us know.

Who can apply?

We are looking for innovative and enthusiastic students who are keen to learn more about coding, using big data to solve problems and developing tools to support its use. We are inviting applications from students at all stages of their education, from undergraduate to PhD.

We can only accept applications from students studying in the UK who are eligible to work full-time in the UK.

What skills/experience do I need?

You will need to be innovative, creative, numerate and have some working knowledge of data analysis, such as via Excel or SQL. Candidates from all areas of study are welcome, particularly if they have interest or experience in one or more of these areas: data visualisation and communication, working with large datasets, data and statistical modelling, machine learning.

“Throughout the internship, I was given continuous support and had individual software development classes from my mentor, which I greatly enjoyed.” (2017 intern)

What do we offer?

Students on the placement will receive support and mentoring from a senior member of the National Cancer Registration and Analysis team. You will be working on anonymous cancer data.

Each internship will receive £1,000 per month as a salary and to cover expenses. We will cover reasonable travel expenses should this be a requirement of the placement.

When will I need to be available?

Placements are offered for two to three months between June and September. There will be a one-day induction in early June (date TBC) and a final day presentation and prize giving event at the end of September (date TBC).

Where will the placement be?

Staff are based across the UK – see project details below. Your location will be dependent on the staff member you are working with. The induction events will be in Cambridge or London.

How to apply for a summer placement

The summer placement is open to any student at a UK university who is eligible to work full-time in the UK.

To apply, please send your current CV and a covering letter outlining any achievements that you feel are relevant and why you want the placement. Please also include the title of the project that you are interested in working on from the list below OR outline your own project that you wish to pursue.

Closing date is midnight on Friday 22nd February 2019. An email confirming receipt of your application will be sent within three working days.

For each placement, a number of applicants may be shortlisted and you may be required to attend an interview or complete a longer application.

“Throughout my six weeks at HDI everyone I had the pleasure of meeting during my internship was extremely friendly and helpful. From the moment I started working there I felt extremely welcome and it was clear that this would be a supportive environment” (2017 intern)


Q: I am a non-EU student studying at a University outside of the UK. Am I eligible for your Placement Scheme?

A: No, unfortunately we are not able to sponsor students from non-EEA countries.

Q: I am a non-EU student studying at a University within the UK. Am I eligible for your Placement Scheme?

A: No, unfortunately we are not able to offer the placement to non-EU students.

Q: Are students who are completing a Masters Degree eligible to apply for a summer placement?

A: Yes, we are not placing any restrictions on the level of education or range of qualifications an individual needs to be eligible.

Q: I did not get an email to confirm you have received my application. What should I do?

A: All candidates should receive an email within 3 working days of submitting an application. Please check your junk email and if you have not received confirmation please contact

“During my internship at HDI I spent a lot of time playing with data in Oracle SQL. The training and guidance I received on this was superb.” (2017 intern)

This year’s project outlines:

Data visualisation for improved engagement with public cancer data

Main Aim: The Get Data Out programme at Public Health England is seeking a data visualisation intern to derive insights from and enable public engagement with openly-available data on cancer.
Brief Summary of the work involved: The Get Data Out programme routinely publishes cancer statistics produced by PHE in a consistent Standard Output Table – a table that collects patients into groups with common characteristics, and then publishes information such as incidence, survival, treatment rates and routes to diagnosis for these standard groups. Currently Get Data Out covers brain, ovarian, pancreatic and testicular tumours, and we hope to expand this output in the near future. All data and metadata can be found on our website. Get Data Out has been welcomed by the cancer community, including analysts and charities focussed on rare and less common cancers, but we believe that more can be done to make these data accessible to as wide an audience as possible – and we hope to use data visualisation to improve engagement with this information.
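
As a rough illustration of the Standard Output Table idea described above, the Python sketch below collects patient-level records into groups with common characteristics and counts each group. The field names (tumour_site, age_band, sex) and the sample rows are hypothetical; they are not the actual Get Data Out schema or real patient data.

```python
from collections import Counter

# Hypothetical, made-up records for illustration only.
records = [
    {"tumour_site": "ovary", "age_band": "50-59", "sex": "F"},
    {"tumour_site": "ovary", "age_band": "50-59", "sex": "F"},
    {"tumour_site": "brain", "age_band": "60-69", "sex": "M"},
    {"tumour_site": "brain", "age_band": "60-69", "sex": "F"},
]


def standard_groups(rows, keys=("tumour_site", "age_band", "sex")):
    """Count the records in each group defined by the given characteristics."""
    return Counter(tuple(row[key] for key in keys) for row in rows)


table = standard_groups(records)
for group, count in sorted(table.items()):
    print(group, count)
```

A published table would report statistics such as incidence or survival for each group rather than raw counts; the grouping step is the part sketched here.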

You will have scope to influence the output of the project based on your own preliminary findings, existing skills and learning preferences. Most of our existing projects use R and RShiny.


  • Use visualisation and data analysis tools to explore Get Data Out data.
  • Design and produce visualisations based on Get Data Out data. These could include:
    • Exploratory visualisations to enable analysts and researchers to seek out something new from this dataset,
    • Explanatory visualisations to engage the public with particular findings from the data,
    • Dashboards to give service commissioners, charities and clinicians an overview of key information,
    • Infographics and other public engagement tools based on the data.
  • Establish standard procedures and documentation to enable future sustainability of visualisations.
Skills Required:
  • Education in a relevant discipline, including data analysis and visualisation.
  • Some knowledge of mathematics and statistics.
  • Some programming experience, preferably using R or Python to work with data.
  • Creativity and an interest in cancer research.
  • Enthusiasm and willingness to learn.
  • Verbal, written and data communication skills.
Skills Desired:
  • Knowledge of data visualisation tools, especially with a web focus (RShiny or d3.js a particular bonus).
Project base Cambridge


Analysis and quality of synthetic radiotherapy data

Brief Summary of the work involved: As part of the Simulacrum project, Health Data Insight CIC is developing a synthetic version of the National Radiotherapy Dataset.  The National Radiotherapy Dataset has been collected and organised by Public Health England since April 2016. The purpose of the Radiotherapy Dataset is to collect consistent and comparable data across all providers of radiotherapy services in England in order to provide intelligence for service planning, commissioning, clinical practice and research and the operational provision of radiotherapy services across England.

The synthetic radiotherapy dataset is in the early stages of development.  As part of the Simulacrum team, you will run quality checks on the synthetic data.  Some of these checks will involve computing metrics of comparison between distributions in the real and synthetic data.  In addition you will use clinical insight on radiotherapy treatment to sense check the synthetic data and then advise on potential improvements.  You will analyse and critique the standard techniques for generating synthetic data, and devise new and innovative approaches for fixing specific issues.
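
One possible metric of comparison between a real and a synthetic distribution is the total variation distance between the category frequencies of a single field. The Python sketch below is an assumption about what such a check could look like, not the Simulacrum team's actual method; the field values are invented for illustration.

```python
from collections import Counter


def total_variation_distance(real_values, synthetic_values):
    """Return 0.5 * sum(|p(x) - q(x)|) over every category seen in either sample.

    0.0 means the two empirical distributions match exactly;
    1.0 means they share no categories at all.
    """
    if not real_values or not synthetic_values:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    n_real, n_synth = len(real_values), len(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / n_real - synth_counts[c] / n_synth)
        for c in categories
    )


# Made-up labels for a hypothetical categorical field.
real = ["curative", "curative", "palliative", "palliative"]
synthetic = ["curative", "palliative", "palliative", "palliative"]
print(total_variation_distance(real, synthetic))
```

A check like this only compares one field at a time; relationships between fields would need separate, joint comparisons.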

During this internship you will work very closely with analysts and developers operating at the forefront of synthetic data development.  Your solutions for improving the synthetic data will be tested and the development team will work to implement your solutions in the generation process.  There will also be opportunities to learn about the details of synthetic data generation and some of the mathematical techniques used to protect patient confidentiality.


  • Explore and understand the synthetic radiotherapy dataset and how it relates in structure and content to the National Radiotherapy Dataset using clinical insight and knowledge of cancer data
  • Run quality checks on the synthetic data – computing metrics of comparison between real and synthetic data
  • Advise the Simulacrum team on potential improvements to the generation process for synthetic radiotherapy data
  • Work closely with the Simulacrum development team to test your solutions and implement your fixes in the synthetic data generation process
Skills Required:
  • Data analysis experience – preferably using SQL but also R, Python or Excel
  • Programming experience in at least one scripting language – Python, Ruby, Matlab, R etc.
  • Ability to work well in a team
  • Enthusiasm and willingness to learn
Skills Desired:
  • Knowledge and interest in clinical practice, particularly with regard to radiotherapy treatment for cancer
  • Programming experience in a querying language e.g. SQL, Postgres
  • Experience working with relational databases
Project base Cambridge


Developing Machine Learning Models for Cancer Prediction and Patient Phenotyping

Brief Summary of the work involved: Health Data Insight has worked with Public Health England and the NHS Business Services Authority to develop the methodology to create a database of England’s primary care prescriptions data. This has been linked to the Cancer Analysis System (CAS), a national database of all cancer diagnoses and treatment in England. The aim of the Index of Suspicion project is to use machine learning to identify patterns in medication prescribed prior to the diagnosis of cancer and other patient data to derive an “index of suspicion” that will predict when a patient is at increased likelihood of developing subsequent cancer.

As an intern working on this project, you will work with a team of analysts and developers to further develop and improve upon the current machine learning methods and algorithms to better understand prediagnostic prescribing and to strengthen prediagnostic indicators of cancer into a strong predictive signal. Key difficulties of the project are the size and complexity of the prescriptions dataset, with over a billion rows, and the general issues that arise from working with real patient data. As part of this internship, you will have the opportunity to learn a range of skills in machine learning in healthcare and data analysis, as well as core transferrable skills such as working as part of a team, managing and delivering projects, and developing technical solutions to meet the needs of the HDI-NCRAS team and the patients, clinicians, and other individuals and organisations who will use the findings of the work. The internship also offers valuable experience working in the competitive data science industry.

Possible specific directions for the project would be given by combinations of the following, or alternative directions suggested, to be mutually agreed:

  • Develop methodologies and computational algorithms to better understand and refine the prediagnostic prescriptions signals for cancer identified by the machine learning models. With these, identify, extract, and strengthen patterns in prescriptions data indicative of future cancer diagnosis.
  • Use high-power computing resources to develop methodologies to scale current and future machine learning and computational methods to increasingly large datasets.
  • Develop visualisations of the patient data, specifically for aims such as better identification of prediagnostic prescription markers of cancer and to explore prescribing to the machine learning cohorts.
  • Work with the analytical team to incorporate other cancer registry datasets into the machine learning models.
  • Work with clinicians to develop clinically-led computational algorithms and machine learning methods, and/or better refine the models for the specific nature of prescriptions data and of cancer as a disease.
  • Investigate alternative machine learning approaches that meet the needs of the problem.


  • Data extraction and analysis using Oracle SQL, R, Python, or other appropriate languages.
  • The refinement and development of computational algorithms and machine learning techniques.
  • Implementation of algorithms and machine learning methods using R, TensorFlow and/or Keras, Python, Theano, or other suitable languages or packages.
  • Planning and managing the delivery of the project.
  • Collaborative working with the HDI-NCRAS teams as well as clinicians and other individuals and organisations.
  • Sharing of technical knowledge with the wider analytical team through documentation, peer-to-peer learning, and seminars.
  • Supporting other analytical work within the team.

Skills Required:
  • Interest in data science and machine learning.
  • Creativity and an interest in cancer research.
  • Enthusiasm and willingness to learn.
  • Interest in working as part of a team.
  • Demonstrable ability to complete the project to a high standard.
Project base Cambridge


Science communicator internship –  communicating the value of real world data

Brief Description: We are looking for a creative scientific copywriter with graphic design skills to join a small team within Public Health England and bring a fresh perspective to how we communicate with our stakeholders about the uses and benefits of sharing data. With a creative flair for language, design and layout, you will not be afraid to challenge the status quo, producing informative and visually appealing scientific communications that strengthen how the PHE Office for Data Release (ODR) communicates with patients, the public and our customers. With you, we want to maximise the reach of our communications and make applying for PHE data, and understanding how it is used, more relevant, accessible and understandable.

Working with ODR programme managers, you will provide essential copywriting and graphic design support to develop and implement the ODR communication strategy and to support the effective development of the ODR’s visual identity through:

  • Creation of a consistent brand and visual image for ODR promotional materials, in line with corporate branding and editorial guidance
  • Development of branding for new applications portal (internet site to apply to access PHE data through)
  • Development of printed materials for conference attendance and online.
  • Copy for revised intranet and internet presence
  • Support with production of materials for upcoming cancer-related conferences

This is a challenging role that will allow you to expand your knowledge of health, public health and the utility of data for medical research and service improvement, while getting your foot in the door of the competitive health and science communications industry.

Skills Required:
  • Graphic design experience (any section)
  • A keen eye for detail and ability to transform content into visually appealing and clear content
  • Good working knowledge of Illustrator, Photoshop, InDesign or other Adobe programs
  • Proactive, highly motivated and able to show initiative and bring new ideas to the table
  • Ability to multitask and meet deadlines
  • Self-motivated and self-starter
  • Excellent verbal and written communication skills
Skills Desired:
  • Interest in scientific communication and in explaining complex topics to the public in accessible formats
  • Other relevant software tools
Project base London


Development of an epidemiology toolkit for rare cancer data using the National Cancer Registry

Brief Description of Project: We are seeking a candidate to develop a new epidemiology toolkit that will enable NCRAS analysts to apply new analyses and methods with ease, increasing the efficiency of our work. The toolkit will be targeted at rare cancers because of the issues with small numbers; if the methodology works with small numbers, the method(s) can be extended to larger cohorts.

Part A

The candidate will assess the current survival methodologies and whether they are feasible for rare cancers, which have very small numbers. This is because the standard non-parametric methods usually only work well with groups that have a large sample size.

A new adaptation (Brenner’s alternative) has recently been coded in Stata to allow non-parametric methods to produce net survival for very small groups, but this still has its disadvantages: namely, the final output must be age-standardised, and the method requires at least one person in each defined age group.

The aim of the first part of this internship programme is to (1) assess the viability of the current non-parametric methods in the production of survival and mortality statistics in rare cancers and (2) to extend or develop new methods to accurately estimate survival and mortality for such small groups.

Part B

The candidate will fit simple regression models to assess the pattern of incidence over time and either develop a new model or adapt APC models to accurately project incidence into the future. This will be developed for the rare cancer setting first, as the method will work for larger cohorts if it is robust enough for small cohorts.

The aim of the second part of this internship programme is to (1) assess the viability of trends of incidence over time for rare cancer sites and (2) to extend or develop new models to accurately project incidence into the future.
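
To give a flavour of the simplest kind of trend model mentioned in Part B, the Python sketch below fits a log-linear trend to yearly incidence counts by ordinary least squares and projects it forward. This is an illustrative stand-in chosen for brevity, not an APC model and not the method the project will use; the counts are invented.

```python
import math


def fit_log_linear_trend(years, counts):
    """Least-squares fit of log(count) = intercept + slope * year.

    Requires strictly positive counts. Rare cancers can have zero-count
    years, which this simple approach cannot handle.
    """
    if any(count <= 0 for count in counts):
        raise ValueError("counts must be strictly positive")
    n = len(years)
    logs = [math.log(count) for count in counts]
    mean_x = sum(years) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def project(intercept, slope, year):
    """Projected count for a given year under the fitted trend."""
    return math.exp(intercept + slope * year)


# Invented counts that grow by 10% per year, for illustration only.
years = [2010, 2011, 2012, 2013, 2014]
counts = [100 * 1.1 ** k for k in range(5)]
intercept, slope = fit_log_linear_trend(years, counts)
print(project(intercept, slope, 2015))
```

With very small cohorts a fit like this becomes unstable, which is one reason the rare cancer setting is a demanding test for any projection method.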


A combination of the following, depending on the intern’s interests, or an alternative project if proposed by the intern and mutually agreed:

  • Structured querying of national cancer database.
  • Analyse cancer cohorts to produce survival estimates using current non-parametric methods.
  • Analyse cancer cohorts to assess trends in data and to produce new models of incidence projection.
  • Liaise with cancer site-specific leads within NCRAS to discuss expectation of the results.
  • Extend or synthesise a new method for estimating survival for rare cancers.
  • Compare the newly developed method to current methods and results.
  • Conduct a sensitivity analysis where appropriate.
  • Produce a ‘toolkit’ program in Stata (similar to MATA) that will be circulated to the NCRAS analysts to use.

Skills Required: The project involves analysing datasets containing anonymised personal information, so information governance training will be provided. Creativity and an interest in cancer research are expected. Some knowledge of mathematics, statistics and probability is required.
Skills Desired: An interest in statistical theory and, in particular, experience with survival or mortality analyses. Experience using statistical software (such as Stata) would be highly beneficial for this internship.
Project base Birmingham


Automation of Data Production Programming Internship

Brief Summary of the work involved: The Office for Data Release (ODR), as part of Public Health England, is responsible for providing a common governance framework for responding to requests to access data held by PHE for secondary purposes, including service improvement, surveillance and ethically approved research. The ODR is responsible for ensuring data governance and protection principles are applied to each release.

More information on the role of the ODR can be found here.

A large proportion of the data releases overseen by the ODR are to access cancer data. ODR staff work closely with analysts from PHE’s National Cancer Registration and Analysis Service (NCRAS) to respond to these requests. The data for servicing these requests is held in a large collection of linked datasets in Oracle databases. ODR and NCRAS have identified that the tasks undertaken in processing requests are similar across all requests. These include cohort definition, data extraction, data linkage, identifiability checks, pseudonymisation, aggregation, quality assurance and metadata production. However, a lot of this work is duplicated for each new request and is subject to differing interpretation.

ODR have initiated a programme of work to develop automation tools to support and standardise this work, with various deliverables identified, including:

  • Automated SQL code production for extracting and pseudonymising data according to customer cohort definitions
  • Automated reporting on potential disclosure risk
  • Testing of disclosure against published standards
  • Reporting on options for data minimisation
  • Production of scripts to apply statistical disclosure control.
  • Production of documentation to accompany products, user guides, standard operating procedures etc
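
As a hedged illustration of what automated reporting on disclosure risk and a simple statistical disclosure control script might involve, the Python sketch below flags and suppresses small counts in an aggregated table. The threshold of 5, the treatment of zero counts and the group labels are all assumptions made for this example; they are not taken from the published standards the deliverables refer to, and the sketch does not address secondary suppression.

```python
def flag_disclosure_risk(counts, threshold=5):
    """Return the groups whose non-zero count falls below the threshold."""
    return sorted(group for group, n in counts.items() if 0 < n < threshold)


def suppress_small_counts(counts, threshold=5, marker="*"):
    """Return a copy of the table with small non-zero counts replaced by a marker."""
    return {
        group: (marker if 0 < n < threshold else n)
        for group, n in counts.items()
    }


# Invented aggregated counts with hypothetical group labels.
aggregated = {"group A": 120, "group B": 3, "group C": 0, "group D": 5}

print(flag_disclosure_risk(aggregated))
print(suppress_small_counts(aggregated))
```

Keeping the flagging step separate from the suppression step means the same check can feed a risk report without altering the data.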

This work will initially be developed in the context of cancer data. A further aim is that the model developed through this project will provide an exemplar to support the release of data from other data assets held by PHE.

We are offering an intern placement to help deliver this useful work. The placement would suit an individual with a good analytical background who enjoys problem solving and has good attention to detail. It offers an opportunity to support a programme of work with clearly defined expectations and to deliver operational software solutions. The outputs from this placement will improve the efficiency of both the ODR and PHE analytical teams, and the placement provides an excellent opportunity for an intern to develop skills and knowledge whilst also demonstrating competency through successful project delivery.


  • Understand and document the analytical needs and process of data release in terms of standard operating procedure.
  • Plan and prepare scripts and interfaces supporting each part of this work.
  • Improve the user experience for both analysts and external data requesters.
  • Support other analytical and ODR work and identify other potential process improvements.
  • Develop a good understanding of anonymisation and data minimisation techniques.
Skills Required:
  • Data analysis and programming experience – preferably using R, Excel, SQL.
  • Ability to work well in a team
  • Enthusiasm and willingness to learn
  • Experience working with relational databases
Skills Desired:
  • Knowledge and interest in cancer incidence and treatment.
  • Programming experience in a scripting language.
  • Project management and delivery
Project base Supervision will be driven by location


Visualisation: Unlocking the Potential of Cancer Data


Brief Summary of the work involved: From informed patient choices and symptom awareness campaigns to communicating complex discoveries in cancer research to the clinicians who will decide an individual’s treatment options, high quality patient care relies on effective communication. A good visualisation has the power to communicate a message with far greater impact than text or raw data: it can highlight key points, provide easy summaries and comparisons, and reveal patterns or trends over time.

Health Data Insight works with Public Health England’s National Cancer Registration and Analysis Service (NCRAS) to generate new insights into healthcare data to improve outcomes in healthcare. NCRAS aims to collect data on all cases of cancer diagnosed in England for the purposes of improving cancer services and outcomes, improving patient care, and to complete in-depth research into understanding all aspects of cancer causes, symptoms, progression, and treatment effects. This data is stored in large, linked datasets, the aggregate of which contains information on all areas of the cancer pathway.

The aim of this internship will be to work with a team of analysts to devise and develop one or more high-impact visualisations that convey key messages identified from the cancer datasets. This internship will combine strong technical skills with a high level of creativity and innovation. It will provide the opportunity to gain experience working in the competitive data science industry and to develop a wide range of skills in a collaborative environment working with a large public sector organisation, including managing and delivering projects, working as part of a team, and developing technical solutions to meet the needs of both the HDI-NCRAS team and the patients, clinicians, and other individuals and organisations who will use the visualisation to access and understand patient data.

Principal questions to consider would be:

• What are the key messages that patients, their families, and clinicians need to know?
• How can we effectively convey these messages in a clear, engaging, and user-friendly way using the tools available (R, D3.js, JavaScript, or p5.js)?
• How can we implement the ideas precisely and to a very high standard using appropriate technologies to create high-quality, professional visualisations? How can we use text and design to enhance the image?

The role will include:
• Liaising with members of the HDI-NCRAS team to establish areas where visualisation would be beneficial and identify key messages and to identify and work to a chosen visualisation’s requirements.
• Planning and managing the delivery of the project.
• Designing high-impact visualisations to effectively convey one or more identified messages.
• Data extraction and analysis using Oracle SQL, R, Python, or other appropriate languages.
• Technical implementation of the visualisation in a relevant language, e.g. R, D3.js, JavaScript, p5.js.
• Sharing of technical knowledge with the wider analytical team through documentation, peer-to-peer learning, and seminars.
• Supporting other analytical work within the team.

Skills Required:
  • Interest in data science and visualisation.
  • Creativity and an interest in cancer research.
  • Enthusiasm and willingness to learn.
  • Interest in working as part of a team.
  • Demonstrable ability to complete the project to a high standard.
Skills Desired:
  • A mathematics or computer science background.
  • Experience with SQL or other query language.
  • Experience with a relevant programming language or library, e.g. R, D3.js, JavaScript, p5.js.
  • Experience with data visualisation or web design.
Project base Cambridge

