BLOG - Exploring the potential of synthetic data

This week we explore how synthetic data can be used to enhance administrative data research and practice.

What is synthetic data?

Synthetic data is information generated from a model of real-world data and used in place of the actual data. Because it is based on a model, synthetic data retains the structure and some of the patterns of the original dataset whilst containing none of the original data.

Why use synthetic data?

As outlined in the recent ADR UK report on synthetic data, there are clear uses for this type of data. Two significant purposes that can benefit researchers in administrative data research are training and testing.

Synthetic data is a useful tool for training, as it allows researchers to practise handling datasets and become familiar with them without requiring full access to secure environments. It also means researchers can test their code and explore the kinds of analysis they can undertake.

For example, synthetic data can help PhD students to prepare their research whilst waiting for data access, a process that can often be lengthy! Even ‘surface level’ information about the protected data, such as variable names, can be enough to produce a synthetic structure, so a researcher can make sure their code runs without errors and check that they’re getting the expected (dummy) results. This means that when they do get the data, analysis is much quicker.
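As a rough illustration of the 'synthetic structure' idea, the sketch below builds a dummy dataset from variable names and types alone, then runs a small analysis function on it to check that the code executes. It is written in Python using only the standard library. Everything in it is an assumption made for illustration: the variable names, value ranges and the analysis function are hypothetical. It is also not how Synthpop works, since the values here are random placeholders rather than draws from a model fitted to real data.

```python
import random

# Hypothetical schema: variable names and types as they might appear in a
# data dictionary. None of these names come from a real dataset.
SCHEMA = {
    "person_id": ("id", None),
    "age": ("int", (16, 90)),
    "sex": ("category", ["F", "M"]),
    "weekly_income": ("float", (0.0, 2000.0)),
}


def make_dummy_rows(schema, n_rows, seed=0):
    """Return n_rows dictionaries filled with random placeholder values."""
    rng = random.Random(seed)  # fixed seed so the dummy data is reproducible
    rows = []
    for i in range(n_rows):
        row = {}
        for name, (kind, spec) in schema.items():
            if kind == "id":
                row[name] = i + 1
            elif kind == "int":
                row[name] = rng.randint(spec[0], spec[1])
            elif kind == "float":
                row[name] = round(rng.uniform(spec[0], spec[1]), 2)
            elif kind == "category":
                row[name] = rng.choice(spec)
            else:
                raise ValueError(f"Unknown variable type: {kind}")
        rows.append(row)
    return rows


def mean_income_by_sex(rows):
    """Example analysis step: mean weekly income for each category of sex."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["sex"], (0.0, 0))
        totals[row["sex"]] = (total + row["weekly_income"], count + 1)
    return {sex: total / count for sex, (total, count) in totals.items()}


dummy = make_dummy_rows(SCHEMA, n_rows=100)
result = mean_income_by_sex(dummy)
print(result)
```

The numbers this prints are meaningless, which is the point: the check is only that the analysis code runs against data of the right shape, so that it is ready to use once access to the real data is granted.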

Researchers can also show data controllers the synthetic data and explain exactly what they want to do with it, so that the controllers can better understand how the data will be used and be reassured.

What tools are out there?

There are various tools available or being developed for creating synthetic data. Here at SCADR, our colleagues Professor Chris Dibben, Professor Gillian Raab and Dr Beata Nowok created a synthetic data package called Synthpop several years ago (listening to synthpop music is optional!). Synthpop for R (ideally used with RStudio) allows users to create synthetic versions of confidential, individual-level data for use by researchers. To find out more about Synthpop and explore its range of resources, please visit the Synthpop website.

Since it was first made available as an open-source package in 2014, it has been widely used by a variety of groups, including National Statistical Offices. As well as supporting a variety of methods of creating synthetic data, the package provides tools for evaluating the utility/fidelity of synthetic data and for assessing disclosure risk; the disclosure risk tools are not yet fully developed but are currently being expanded. The number of downloads of the package has increased in recent years. Since mid-2020 there have consistently been over 2,000 downloads per month.

Staff at the Scottish Longitudinal Study (SLS) use the package to provide datasets for preliminary analysis to users of the SLS. We have also created synthetic datasets to use in training courses.

Further information 

SCADR researcher, Professor Gillian Raab, received £30,000 funding from Research Data Scotland at the start of 2023 to 'Review and evaluate methodology on how to measure the disclosure risks from synthetic data'. We look forward to sharing results from this project later this year.

Gillian’s work has also recently informed and contributed to the United Nations Economic Commission for Europe (UNECE) Synthetic Data for Official Statistics - Starter Guide.

This article was published on 24 Feb 2023
