Deriving a test set to classify patterns in hours paid in administrative data

A regression analysis procedure to detect different reporting patterns in administrative data
Dutch employers report wage information for their employees to the tax office, and they may do so at different frequencies. Employers that report monthly use different patterns to declare hours paid, depending for instance on whether an employee works full-time or part-time and on whether overtime is included. When left uncorrected, these patterns may bias estimates of the total number of hours paid and of hourly wages. With current methods we are unable to detect the pattern for about one-third of the records. We aim to improve on this in the future by using data mining methods, but that requires a test set in which at least some of these patterns are known. As a first step, in the current paper we derived such a test set. Because manual labelling of the patterns by experts is too costly, we used an automatic procedure based on linear regression analysis. We labelled only records with a small absolute prediction error. With this procedure we managed to label 56% of the sampled records in our test set.
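
The labelling rule described above (fit a linear regression, then label only records whose absolute prediction error is small) can be illustrated with a minimal sketch. It is not the procedure used in the paper: the single predictor (working days per month), the function name `label_records`, the toy data, and the 15-hour threshold are all assumptions made purely for illustration.

```python
import numpy as np

def label_records(x, y, threshold):
    """Fit y ~ a + b*x by ordinary least squares and flag records whose
    absolute prediction error is at most `threshold` (hypothetical rule)."""
    design = np.column_stack([np.ones(len(y)), x])   # intercept + predictor
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    abs_err = np.abs(y - design @ beta)
    return abs_err <= threshold, abs_err

# Toy data (assumed): working days per month and reported hours paid.
# Six records follow 8 hours per working day; the last one deviates.
working_days = np.array([20.0, 21.0, 22.0, 20.0, 21.0, 22.0, 21.0])
hours_paid = np.array([160.0, 168.0, 176.0, 160.0, 168.0, 176.0, 238.0])

labelled, abs_err = label_records(working_days, hours_paid, threshold=15.0)
share_labelled = labelled.mean()  # fraction of records that received a label
```

Records that fail the threshold are left unlabelled instead of being forced into a pattern, which matches the idea of a test set in which only some patterns are known.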