Matching datasets is a routine operation in the statistical process. The simplest form of matching is linking two datasets (or tables) on the basis of a database key: records from the two datasets (tables) match if the database key is exactly identical. For more complicated types of matching, other variables, so-called secondary keys, are used, such as names and time variables. The problem with this more extensive type of matching is that the values of secondary keys may contain errors, or that the variables do not have exactly the same definition in both datasets. This report gives a systematic overview of various matching problems, characterised by the complicating factors to be taken into account, and of the methods available to deal with them.
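The idea of matching on an error-prone secondary key can be sketched with a similarity threshold: two records are linked when their key values are sufficiently alike rather than exactly identical. The records, names and threshold below are invented for illustration; this is a minimal sketch, not a production linkage method.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Normalised string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_records(left, right, key="name", threshold=0.85):
    """Link records whose secondary key values are similar enough."""
    pairs = []
    for l in left:
        for r in right:
            score = similarity(l[key], r[key])
            if score >= threshold:
                pairs.append((l["id"], r["id"], round(score, 2)))
    return pairs

# Hypothetical survey and register records; note the slightly different spellings.
survey = [{"id": 1, "name": "Jansen, P."}, {"id": 2, "name": "De Vries, A."}]
register = [{"id": "A", "name": "Jansen, P"}, {"id": "B", "name": "Bakker, K."}]
print(match_records(survey, register))  # → [(1, 'A', 0.95)]
```

In practice the threshold trades false matches against missed matches, which is exactly the kind of complicating factor the report discusses.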
Coding is the translation, where possible, of a description of an occupation, economic activity, education level, disease, commodity, etc. into a code from a corresponding classification. Only after this translation has taken place is it possible to count, classify and produce statistics about the subjects concerned. This report describes the problems encountered during the coding process, and how they are tackled with the aid of computers, either completely automatically or interactively.
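The automatic part of the coding process can be sketched as a lookup with an approximate-match fallback: exact descriptions are coded directly, near-misses (e.g. misspellings) are coded when the match is confident, and the rest are referred to interactive coding. The mini-classification and its codes below are invented for illustration.

```python
from difflib import get_close_matches

# Hypothetical mini-classification: description -> code (codes are made up).
CLASSIFICATION = {
    "carpenter": "7115",
    "software developer": "2512",
    "nurse": "2221",
}

def code_description(text, cutoff=0.8):
    """Return (code, matched description), or None to refer the case
    to an interactive coder."""
    text = text.strip().lower()
    if text in CLASSIFICATION:          # exact match: automatic coding
        return CLASSIFICATION[text], text
    candidates = get_close_matches(text, CLASSIFICATION, n=1, cutoff=cutoff)
    if candidates:                      # approximate match: still automatic
        return CLASSIFICATION[candidates[0]], candidates[0]
    return None                         # no confident match

print(code_description("Carpenter"))          # exact match
print(code_description("softwre developer"))  # misspelled, still coded
print(code_description("astronaut"))          # referred to interactive coding
```

The cutoff plays the same role as in record matching: set too low, descriptions are miscoded; set too high, too many cases fall through to manual work.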
Both surveys and registrations nearly always contain errors. If these errors result in bias in the results, it is in the interest of Statistics Netherlands to detect and correct them. Error detection is also important to gain an insight into the quality of the data sources, the observation and the processing. This report discusses methods to detect and correct errors automatically, and methods to select records that should preferably be corrected manually (selective editing and macro-editing).
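Automatic error detection is often formulated as a set of edit rules that every record must satisfy; records violating a rule are flagged for automatic or manual correction. The rules and records below are invented for illustration only.

```python
def check_record(rec):
    """Return the list of edit rules violated by one business record."""
    violations = []
    if rec["turnover"] < 0:
        violations.append("turnover must be non-negative")
    if rec["profit"] > rec["turnover"]:
        violations.append("profit cannot exceed turnover")
    if rec["employees"] == 0 and rec["wages"] > 0:
        violations.append("wages reported without employees")
    return violations

# Hypothetical records: one clean, one with several errors.
good = {"turnover": 100, "profit": 10, "employees": 2, "wages": 40}
bad = {"turnover": -5, "profit": 10, "employees": 0, "wages": 40}
print(check_record(good))  # no violations
print(check_record(bad))   # three violations
```

Selective editing would then prioritise the flagged records by their expected influence on the published totals, so that manual correction effort goes where it matters most.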
In addition to high-quality information, administrative datasets contain measurement and representation errors. Micro-integration is a collection of methods to identify and correct these errors. Without these corrections, the published results would be biased. Three methods are distinguished: harmonisation, completion, and correction of measurement errors.
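Harmonisation, the first of the three methods, can be sketched as mapping a register concept onto the statistical target definition. The assumed definitional differences and the bonus fraction below are invented for illustration; a real harmonisation step would be derived from documented differences between the register and the target concept.

```python
def harmonise(register_rec):
    """Map a register wage concept onto the statistical target concept.

    Assumed differences (for illustration): the register reports wages
    per month excluding bonuses, while the target concept is annual
    wages including an average bonus share.
    """
    BONUS_FRACTION = 0.08  # assumed average bonus share of annual wages
    annual = register_rec["monthly_wage"] * 12
    return {"annual_wage": round(annual * (1 + BONUS_FRACTION), 2)}

print(harmonise({"monthly_wage": 3000}))
```

Completion and correction of measurement errors would follow the same pattern: rule-based transformations of register values toward the definitions used in publication.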
In sampling theory there are situations in which estimates can be improved by using explicit models, for example when the sample design is unknown, or when a sample yields only a few observations per sub-population. This report sets out how relatively simple models can be used to estimate parameters such as population totals or sub-population totals.
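A classic example of such a simple model is the ratio estimator: assume the target variable is roughly proportional to an auxiliary variable known for the whole population, estimate the ratio from the sample, and apply it to the known auxiliary total. The numbers below are invented for illustration.

```python
def ratio_estimate(sample_y, sample_x, population_x_total):
    """Ratio estimator of the population total of y.

    Model assumption: y is roughly proportional to a known auxiliary
    variable x (e.g. turnover vs. number of employees).
    """
    r = sum(sample_y) / sum(sample_x)  # estimated ratio y/x
    return r * population_x_total

# Hypothetical sample of 4 firms; the employee total is known population-wide.
y = [120, 80, 200, 100]  # observed turnover
x = [12, 8, 21, 9]       # employees, known for every population unit
print(ratio_estimate(y, x, population_x_total=500))
```

With only a few observations per sub-population, such a model borrows strength from the auxiliary information instead of relying on the sample alone.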
Sometimes data are missing from surveys and registers, for example because a respondent cannot or will not answer a question. One way to cope with such missing values is to impute valid values, so that it is still possible to produce totals. This report describes imputation methods used at Statistics Netherlands. Particular attention is paid to longitudinal imputation and to imputations for records with more than one missing value.
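Two of the simplest imputation strategies can be sketched side by side: a longitudinal imputation that projects last period's value forward with an assumed growth factor, and a cross-sectional imputation that substitutes the group mean of the respondents. Values, groups and the growth factor are invented for illustration.

```python
def impute_longitudinal(current, previous, growth):
    """Impute a missing current value from last period's value times
    an assumed average growth factor (longitudinal imputation)."""
    return previous * growth if current is None else current

def impute_group_mean(records, var, group):
    """Replace missing values by the mean of the respondents in the
    same group (a simple cross-sectional imputation)."""
    means = {}
    for g in set(r[group] for r in records):
        vals = [r[var] for r in records if r[group] == g and r[var] is not None]
        means[g] = sum(vals) / len(vals)
    for r in records:
        if r[var] is None:
            r[var] = means[r[group]]
    return records

print(impute_longitudinal(None, 100, 1.05))  # previous value grown forward
recs = [{"v": 10, "g": "a"}, {"v": None, "g": "a"}, {"v": 20, "g": "a"}]
print(impute_group_mean(recs, "v", "g"))     # missing value becomes group mean
```

Records with several missing values are harder precisely because the imputed values must also remain mutually consistent, which is why the report treats them separately.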
Representative outliers are extreme values in a sample that are assumed to have been observed correctly, and for which the population is assumed to contain other, similar elements. This report discusses estimation methods that can limit the effect of such outliers on the estimation results, and thus improve the accuracy of those results.
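One way to limit the influence of a representative outlier is winsorisation: values above a cutoff are replaced by the cutoff itself before estimating. The sample and cutoff below are invented for illustration; choosing the cutoff well is itself an estimation problem.

```python
def winsorise(values, cutoff):
    """Replace values above the cutoff by the cutoff itself, limiting
    the influence of extreme but correctly observed values."""
    return [min(v, cutoff) for v in values]

sample = [10, 12, 9, 11, 400]  # 400 is correct but extreme
print(sum(sample) / len(sample))                       # ordinary mean: 88.4
print(sum(winsorise(sample, 50)) / len(sample))        # winsorised mean: 18.4
```

The winsorised estimate is biased but far less variable; the methods in the report aim at exactly this trade-off between bias and variance.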