In its statistical activities, Statistics Netherlands (CBS) strives to make the greatest possible use of already generated data, including large amounts of data obtained from the Internet, traffic loops, sensors and mobile phones. Prof. dr. Piet Daas explains: ‘These can be used to compile new or more detailed statistics in addition to the mandatory official statistics. Other advantages of big data are that we’re able to speed up statistics production, but also reduce the response burden on private citizens and companies. These are some of the major benefits. At the same time, using big data raises a number of fundamental questions.’
Creating big data statistics is very different from creating statistics based on surveys: ‘In the case of survey-based statistics, first you draw up your research concept. Then you define the target population, select a sample and start collecting the data. But with big data as the basis of your statistic, you need to turn the entire process upside down and start with data that have already been generated. Does this set of data contain all the information you need for your statistic?’ As a statistical office, CBS is at the forefront of big data implementation. ‘This means we have to find the answers to any questions that arise ourselves. As a professor I’m being given the space to investigate these questions.’
One of the areas Daas focuses on is the representativeness of big data: are the data an adequate reflection of the population to be measured? He explains: ‘Big data sources often contain data on a very large group of people or companies. But does this group match the target population of your statistic? For example, we have files on shipping traffic at our disposal, containing the movements of each and every ship. This implies a high level of representativeness. However, if you want to use website data to compile a statistic on companies (for example, on how innovative or sustainable they are), you will miss data on companies that do not have a website. This could pose a problem, in which case you’ll need to find a solution to compensate for the missing data.’
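To make the coverage problem concrete, here is a minimal sketch in Python. It is not a method described by Daas or CBS. It assumes a hypothetical business register in which every company has a size class, and a hypothetical website-derived score that exists only for companies with a website. It first checks how much of each size class the source covers, then applies post-stratification, one common way to compensate for missing units. Post-stratification rests on the assumption that, within a size class, companies with and without a website are similar.

```python
from collections import defaultdict

# Hypothetical register: (company_id, size_class) for the whole target population.
population = [(1, "small"), (2, "small"), (3, "small"), (4, "small"),
              (5, "large"), (6, "large")]

# Hypothetical website-derived scores; companies without a website are absent.
observed = {1: 0.0, 5: 1.0, 6: 1.0}


def coverage_by_stratum(population, observed):
    """Share of each stratum's population that appears in the big data source."""
    totals = defaultdict(int)
    seen = defaultdict(int)
    for unit_id, stratum in population:
        totals[stratum] += 1
        if unit_id in observed:
            seen[stratum] += 1
    return {stratum: seen[stratum] / totals[stratum] for stratum in totals}


def naive_mean(observed):
    """Mean over the observed units only, ignoring who is missing."""
    return sum(observed.values()) / len(observed)


def poststratified_mean(population, observed):
    """Weight each stratum's observed mean by its share of the full population."""
    totals = defaultdict(int)
    values = defaultdict(list)
    for unit_id, stratum in population:
        totals[stratum] += 1
        if unit_id in observed:
            values[stratum].append(observed[unit_id])
    n_total = sum(totals.values())
    estimate = 0.0
    for stratum, n_stratum in totals.items():
        if not values[stratum]:
            # No observed units: the stratum mean cannot be estimated from this source.
            raise ValueError(f"no observed units in stratum {stratum!r}")
        stratum_mean = sum(values[stratum]) / len(values[stratum])
        estimate += (n_stratum / n_total) * stratum_mean
    return estimate


coverage = coverage_by_stratum(population, observed)
naive = naive_mean(observed)
adjusted = poststratified_mean(population, observed)
```

In this toy data the source covers all large companies but only a quarter of the small ones, so the naive mean leans towards the large companies, while the post-stratified estimate restores each size class to its share of the register.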
A second important question concerns the relationship between the phenomenon measured in a big data source and the purpose for which CBS wants to use the data. Causality plays a key role in this. Do the data show a genuine trend, or is it mere coincidence? Daas: ‘We compile official statistics; everything we publish must be accurate. That is why it’s imperative to investigate whether the information obtained from these data sources actually provides the answer to the statistical question asked. For example, do company website data truly reflect how innovative and sustainable a company is?’
Daas is excited about his professorship. ‘It’s a hectic work schedule at the CBS Centre for Big Data Statistics. By working at the university one day a week, I can fully focus on the theoretical side, together with the many experts working there. In Eindhoven statistical knowledge converges with computer science, creating an ideal environment for big data research. We’ve started with a literature study to lay a solid basis for answering the fundamental questions. After that, we’ll start analysing.’
Curriculum vitae of Piet Daas
Dr. Piet J.H. Daas is a biochemist and bio-informatics specialist. In 1996, he received his PhD with honours from the Radboud University Nijmegen. He then proceeded to work at Wageningen University, where the datasets used in his research kept growing in size. Since 2000, he has been a methodologist at CBS, where he has laid the foundation for research on big data statistics. In 2016, the CBS Centre for Big Data Statistics was founded to promote the use of big data in official statistics. Daas is an internationally recognised big data expert who delivers presentations and training sessions all around the world.