BLOG - Exploring the potential of synthetic data

This week we explore how synthetic data can be used to enhance administrative data research and practice.

What is synthetic data?

Synthetic data is information generated from a model of real-world data and used in place of the actual data. Because it is based on a model, synthetic data retains the structure and some of the patterns of the original dataset whilst containing none of the original records.
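To make the idea of "generated from a model" concrete, here is a minimal sketch in Python. It assumes a deliberately simple model (just the mean and standard deviation of one variable), and the `original_ages` values are invented for illustration rather than taken from any real dataset. Real synthetic data tools use considerably richer models, so treat this as a toy illustration only.

```python
import random
import statistics

# Invented stand-in for a confidential variable; not real data.
original_ages = [23, 35, 41, 29, 52, 47, 38, 31, 60, 44]

# The "model" of the original data: its mean and standard deviation.
mu = statistics.mean(original_ages)
sigma = statistics.stdev(original_ages)

# Draw new values from the model instead of copying original records.
rng = random.Random(42)  # fixed seed so the sketch is repeatable
synthetic_ages = [rng.gauss(mu, sigma) for _ in range(1000)]

# The synthetic values follow the same broad pattern (similar average)
# without being copies of the original records.
print(round(mu, 1), round(statistics.mean(synthetic_ages), 1))
```

The synthetic values are drawn from the fitted model, not copied from the original list, which is why the broad pattern survives while the individual records do not.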

Why use synthetic data?

As outlined in the recent ADR UK report on synthetic data, there are clear uses for this type of data. Two significant purposes that can benefit researchers in administrative data research are training and testing.

Synthetic data is a useful training tool: it allows researchers to practise handling datasets and become familiar with them without requiring full access to secure environments. It also means researchers can test their code and explore the kinds of analysis they can undertake.

For example, synthetic data can help PhD students to prepare their research whilst waiting for data access, a wait that can often be lengthy! Even ‘surface level’ information about the protected data, like variable names, can be enough to produce a synthetic structure. A researcher can then make sure their code runs without errors and check that they’re getting the expected (dummy) results. This means that when they do get the data, analysis is much quicker.
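As a rough sketch of that workflow, the Python snippet below builds a dummy dataset from nothing more than a list of variable names and plausible types, then runs a small analysis function on it. Everything here is an assumption for illustration: the variable names, value ranges, and the `make_dummy_rows` and `mean_income_by_sex` helpers are hypothetical and do not come from any real dataset or package.

```python
import random

# Hypothetical schema: names, types and ranges are invented, standing in
# for the variable list of a protected dataset.
SCHEMA = {
    "person_id": ("id", None),
    "age": ("int", (0, 100)),
    "sex": ("category", ["F", "M"]),
    "income": ("float", (0.0, 80000.0)),
}

def make_dummy_rows(schema, n_rows, seed=0):
    """Generate random rows that match the schema's structure only."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_rows):
        row = {}
        for name, (kind, spec) in schema.items():
            if kind == "id":
                row[name] = i + 1
            elif kind == "int":
                row[name] = rng.randint(spec[0], spec[1])
            elif kind == "float":
                row[name] = round(rng.uniform(spec[0], spec[1]), 2)
            elif kind == "category":
                row[name] = rng.choice(spec)
            else:
                raise ValueError(f"unknown variable type: {kind}")
        rows.append(row)
    return rows

def mean_income_by_sex(rows):
    """Example analysis code to be tested before real data access."""
    totals = {}
    for row in rows:
        total, count = totals.get(row["sex"], (0.0, 0))
        totals[row["sex"]] = (total + row["income"], count + 1)
    return {key: total / count for key, (total, count) in totals.items()}

dummy = make_dummy_rows(SCHEMA, 200)
result = mean_income_by_sex(dummy)
print(result)  # dummy results: meaningless values, but the code runs
```

The numbers produced are meaningless, because the values are random. The point is only that the analysis code runs end to end against data with the right shape.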

Researchers can also show the synthetic data to data controllers and share exactly what they want to do with it. This helps data controllers better understand how the data will be used and gives them reassurance.

What tools are out there for creating synthetic data?

There are various tools that are available or being developed. Here at SCADR, our colleagues Professor Chris Dibben, Professor Gillian Raab and Dr Beata Nowok created a synthetic data package called Synthpop several years ago (listening to synthpop music is optional!). Synthpop for R (ideally used with RStudio) allows users to create synthetic versions of confidential, individual-level data for use by researchers. To find out more about Synthpop and explore their range of resources, please visit their website.

This article was published on 24 Jan 2022
