New methods to correct data

22/03/2018 13:30 / Author: Masja de Ree
© Sjoerd van der Hucht Fotografie
Data used to produce statistics almost always contain errors. What can you do about them? On 6 March 2018, CBS methodologist Sander Scholtus obtained his doctorate cum laude with research on new methods to assess and correct data.

Reliable statistics

CBS uses data from various sources, including surveys and existing datasets such as those of the Dutch Tax and Customs Administration. In either case, the data may contain errors, which need to be corrected in order to produce reliable statistics. Scholtus explains that this can be done in two ways. ‘The first method is called editing: you detect the errors and correct them in the best possible way. This is preferably done automatically, as manual work is good but also time-consuming. With the second method, you do not correct errors beforehand, but instead you estimate the impact of the errors on your statistic. In this way, you produce a model which you use to correct the entire statistic afterwards.’ In his doctoral research, Scholtus studied and further developed both methods.

Automatic correction

Imagine you are participating in a business survey and your company has overlooked the fact that its financial data should have been reported in thousands of euros; or you have mixed something up, e.g. entered a plus instead of a minus, or the costs instead of the revenue. This will result in systematic or one-off errors. Scholtus designed a number of new algorithms to automatically correct systematic errors. ‘We have successfully tested these algorithms in practice at CBS.’ Furthermore, Scholtus extended the Fellegi-Holt methodology, which is widely used by statisticians. ‘This method is designed to detect and correct non-systematic errors. My extension allows this method to be customised to specific datasets. It has become more flexible and thus provides more accurate results.’ The extension is currently being tested on CBS statistics.
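
The two kinds of systematic error mentioned above can be illustrated with a minimal sketch. This is not one of the algorithms Scholtus developed; it only shows the general idea, under stated assumptions. The function names, the reference value (for instance last year's figure) and the ratio bounds of 300 and 3000 are all hypothetical choices made for this illustration.

```python
import math


def correct_unit_error(reported, reference, lower=300.0, upper=3000.0):
    """Return the reported value, divided by 1000 if it looks like a unit error.

    `reported` should be in thousands of euros; `reference` is a hypothetical
    trusted value in thousands of euros. The bounds are illustrative
    assumptions, not values taken from CBS practice.
    """
    if reference <= 0:
        return reported  # no usable reference, leave the value untouched
    ratio = reported / reference
    if lower <= ratio <= upper:
        return reported / 1000.0  # most likely reported in euros
    return reported


def fix_profit_sign(revenue, costs, profit, tolerance=0.5):
    """Return profit with its sign flipped if that restores revenue - costs = profit."""
    balance = revenue - costs
    if math.isclose(balance, profit, abs_tol=tolerance):
        return profit  # the record already satisfies the rule
    if math.isclose(balance, -profit, abs_tol=tolerance):
        return -profit  # plus and minus were switched
    return profit  # some other error, not handled in this sketch


# A company reports 1,250,000 where roughly 1,300 (thousand euros) is expected.
print(correct_unit_error(1_250_000, 1_300))
# A company reports a loss of 200 although revenue exceeds costs by 200.
print(fix_profit_sign(500, 300, -200))
```

Records that fail the rule for any other reason are left unchanged here; in practice they would be passed on to a method for non-systematic errors, such as Fellegi-Holt.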

Mathematical model

In principle, CBS only uses surveys to collect data which cannot be obtained from existing data files. However, when the turnover data on Dutch businesses held by the Dutch tax authorities are compared with the turnover data CBS obtains from business surveys, there are discrepancies. This is because the Tax and Customs Administration and CBS use different definitions. Scholtus: ‘To CBS, this is a relevant problem. It is important to know exactly in which business sectors the definitions match closely enough and in which sectors we need to follow up with a survey. I studied ways to solve this problem by means of a mathematical model.’ Scholtus developed a model that can be used to estimate to what extent data from each source deviate from the actual value. ‘This model is very versatile. In statistics for which multiple sources are available, CBS will be able to determine the most valuable source for each part.’


Hard data

Mathematical models such as the one developed by Scholtus for CBS have been used in academic circles for some time. ‘One specific feature of my method stems from the fact that CBS works with fixed, hard units, such as euros. This imposes additional requirements on the model, because scale is important. CBS wants to know not only whether the values measured are sufficiently related to the actual values, but also whether they are structurally too high or too low. In this model, we answer that question by making additional efforts to retrieve actual values for a small sample of businesses, aside from the values we already have available from the data files. This has made our model more specific.’
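
To give a feel for why a small sample with known actual values helps, here is a strongly simplified sketch. It is not Scholtus's model. It assumes that each source measures the actual value as intercept + slope × actual value plus noise, and it fits that line by ordinary least squares on a hypothetical audit sample. All figures and names below are invented for illustration.

```python
def fit_line(actual, observed):
    """Ordinary least squares fit of observed = intercept + slope * actual."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_o = sum(observed) / n
    sxx = sum((a - mean_a) ** 2 for a in actual)
    sxy = sum((a - mean_a) * (o - mean_o) for a, o in zip(actual, observed))
    slope = sxy / sxx
    intercept = mean_o - slope * mean_a
    return intercept, slope


# Hypothetical audit sample: actual turnover retrieved for four businesses
# (thousands of euros), next to the values found in two sources.
actual_turnover = [100.0, 200.0, 300.0, 400.0]
tax_register = [90.0, 180.0, 270.0, 360.0]   # structurally 10% too low
survey = [110.0, 210.0, 310.0, 410.0]        # structurally 10 too high

tax_intercept, tax_slope = fit_line(actual_turnover, tax_register)
survey_intercept, survey_slope = fit_line(actual_turnover, survey)

print("tax register: intercept", round(tax_intercept, 2), "slope", round(tax_slope, 2))
print("survey:       intercept", round(survey_intercept, 2), "slope", round(survey_slope, 2))
```

In this sketch both sources are perfectly related to the actual values, yet each is off in scale: a slope other than 1 or an intercept other than 0 signals values that are structurally too high or too low. That is information which cannot be recovered from the two sources alone, only with the help of the audit sample.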

Faster and more efficient

Is CBS going to deploy the new methods? Scholtus: ‘In correcting the data, we carry out both manual and automatic corrections. The new algorithms I designed can be deployed there as well. We are not yet running models, as we need to be certain that the assumptions on which they are based are in fact correct. In my opinion, the future lies in a combination of data editing and the use of different models, for example by having a model monitor the quality of editing, as I did in my thesis. This could yield a faster and more efficient approach.’

Scholtus studied mathematics at Leiden University and has worked for CBS as a methodologist since 2006. He obtained his PhD at the Free University of Amsterdam (VU). His supervisor was Prof. Bart Bakker, department head at CBS and an endowed professor at VU.