What is synthetic data?

Synthetic data simulate characteristics of the relationships between people and objects (e.g. a school or a neighbourhood) so that a real-life situation can be reconstructed without identifying the person or object involved. Synthetic data can be generated by an algorithm or a computer simulation. Their advantage is that, depending on the user’s purpose, a trade-off can be chosen between the analytical value of the dataset (its fidelity) and the risk of disclosure. A study that requires a dataset with a high analytical value (i.e. a relatively strong similarity to the original dataset) has to accept a higher risk of disclosure. Synthetic data are used for privacy protection and controlled publication. The use of synthetic data is therefore regarded as a Privacy Enhancing Technique (PET).
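
The trade-off described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy records, the two generator functions and the two measures are hypothetical and do not represent a method used by CBS. The first generator samples each column separately, which discards the relationships between characteristics. The second simply resamples whole records, which is not a genuine synthesis method but serves as a stand-in for the high-fidelity extreme.

```python
import random
from collections import Counter

# Hypothetical toy "original" data: (economic activity, size class) per unit.
ORIGINAL = (
    [("retail", "small")] * 50
    + [("manufacturing", "large")] * 50
    + [("mining", "large")]  # a combination that occurs only once
)


def synthesise_independent(records, n, rng):
    """Lower fidelity: sample each column on its own, ignoring relationships."""
    columns = list(zip(*records))
    return [tuple(rng.choice(col) for col in columns) for _ in range(n)]


def synthesise_joint(records, n, rng):
    """Higher fidelity: resample whole records, preserving relationships."""
    return [rng.choice(records) for _ in range(n)]


def total_variation(original, synthetic):
    """Distance between joint distributions: 0 = identical, 1 = no overlap."""
    p, q = Counter(original), Counter(synthetic)
    return 0.5 * sum(
        abs(p[k] / len(original) - q[k] / len(synthetic)) for k in set(p) | set(q)
    )


def unique_match_share(original, synthetic):
    """Crude risk proxy: share of synthetic records that equal a
    combination occurring exactly once in the original data."""
    uniques = {rec for rec, count in Counter(original).items() if count == 1}
    return sum(rec in uniques for rec in synthetic) / len(synthetic)


rng = random.Random(0)  # fixed seed so repeated runs give the same result
indep = synthesise_independent(ORIGINAL, 2000, rng)
joint = synthesise_joint(ORIGINAL, 2000, rng)

tv_indep = total_variation(ORIGINAL, indep)
tv_joint = total_variation(ORIGINAL, joint)
risk_indep = unique_match_share(ORIGINAL, indep)
risk_joint = unique_match_share(ORIGINAL, joint)

print(f"independent: distance={tv_indep:.3f}, unique-match share={risk_indep:.4f}")
print(f"joint:       distance={tv_joint:.3f}, unique-match share={risk_joint:.4f}")
```

A smaller distance means a higher analytical value. In this sketch the joint generator stays much closer to the original distribution, and because it copies real records it can also be expected to reproduce the one-off combination more often. A proper disclosure risk assessment would need considerably more than this single proxy.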

What applications does synthetic data have?

There are various levels of synthetic data, each with a different trade-off between analytical value and disclosure risk. Depending on this trade-off and the intended use of the dataset, synthetic data can be used for the following applications:

  1. systems testing, in situations where actual data cannot be used due to the rules governing data protection
  2. software demonstrations
  3. development of AI models
  4. sample data from a CBS data source that a user is not (or not yet) permitted to access
  5. sample data for an external data or big data source to which CBS does not (or does not yet) have access
  6. data for training exercises (technical or statistical)
  7. source data for new ideas/proofs of concept
  8. source data for agent-based models and digital twins
  9. source data for analytical purposes (policy analyses or scientific research).

CBS already sees opportunities for certain applications but rejects others, either definitively or for the time being, in some cases pending further research. It cannot be ruled out that more applications will emerge as time goes on.

In concrete terms, CBS will start deploying synthetic data for the intended uses that carry the least risk. These will be internal CBS cases in which synthetic data are generated for testing and evaluation purposes. In addition, CBS will release a synthetic dataset for educational purposes, which will offer a high degree of privacy protection. For other potential synthetic data services, CBS will first need to gain more experience while involving relevant parties in the process.

Why work with synthetic data?

CBS holds a substantial amount of data for which privacy has to be fully guaranteed. Although the demand for data and the amount of data available continue to grow, data exchange with the scientific community still does not take place to a sufficient extent. From an organisational and operational perspective, there is a need for improved data-exchange methods in response to increasingly stringent privacy regulations and the obstacles they present to data exchange. Synthetic data could play a key role in this regard. It is important to note that privacy regulations, such as the GDPR, also need to be observed in these applications. They provide guidelines on the purposes for which sensitive data can and cannot be used. CBS sees added value in using synthetic data to facilitate and simplify data sharing.

What is the added value to society of using synthetic data?

CBS seeks to use and share data securely. Synthetic data are increasingly being seen as an alternative to exchanging privacy-sensitive data. CBS regularly receives inquiries about synthetic data, and is happy to address them. As a knowledge institute, CBS positions itself as a data partner and data hub. Synthetic data can be used to strengthen both specific collaborations and the role that CBS plays in society.

Syntho pilot: what has already been done with synthetic data?

CBS carried out a Proof of Concept (PoC) with the aim of gaining experience in generating synthetic data. The PoC made use of a software package from Syntho, a Dutch startup that develops and markets software for creating synthetic data. The aim was to synthesise a section of the General Business Register (ABR), which describes the economically and statistically relevant business population in the Netherlands using a number of basic characteristics such as economic activity and size class.

We succeeded in creating a synthetic test dataset. The assessment of analytical value and disclosure risk resulted in the recommendation that this dataset should be used primarily internally, for testing the software used in the production of statistics. Valuable lessons have been learned as regards content, methodology, IT infrastructure, legal issues and software. However, more research needs to be done on assessing disclosure risks. This PoC has acted as a catalyst for a wider discussion on synthetic data and what they can mean for CBS.
