Soft edit rules, i.e. constraints which identify (combinations of) values that are suspicious but not necessarily incorrect, are an important element of many traditional, manual editing processes. It is desirable to use the information contained in these edit rules also in automatic editing. However, current algorithms for automatic editing are not well suited to the use of soft edits, because they treat all edit rules as hard constraints: each edit failure is attributed to an error. Recently at Statistics Netherlands, a new automatic editing method has been developed that can distinguish between hard and soft edits. A prototype implementation of the new algorithm has been written in the R programming language. This paper reports some results of an application of this prototype to data from the Dutch Structural Business Statistics. The paper also introduces and tests several size measures of soft edit failures that can be used with the new automatic editing method.
Improvement of waterflows in the National Water Balance; Water Stocks; feasibility of Water Balances per River Basin. Final report on Grant Agreement No. 50303.2010.001-2010.564.
A common problem faced by statistical institutes is that data may be missing from collected data sets. The typical way to overcome this problem is to impute the missing data. The problem of imputing missing data is complicated by the fact that statistical data collected by statistical institutes often have to satisfy certain edit rules, which for numerical data usually take the form of linear restrictions. Standard imputation methods for numerical data as described in the literature generally do not take such linear edit restrictions on the data into account. Hot-deck imputation techniques form a well-known and relatively simple-to-apply class of imputation methods. In this paper we extend this class of imputation methods so that linear edit restrictions are satisfied.
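To illustrate the general idea (a minimal sketch in base R, not the specific algorithm developed in the paper), the fragment below imputes missing components from a randomly drawn donor record and then rescales them so that a balance edit of the form x1 + x2 == total holds; the variable names and values are hypothetical.

    set.seed(1)
    donors    <- data.frame(x1 = c(30, 45, 60), x2 = c(70, 55, 40), total = c(100, 100, 100))
    recipient <- list(x1 = NA, x2 = NA, total = 120)   # observed total, missing components

    d   <- donors[sample(nrow(donors), 1), ]           # draw a random donor record (hot deck)
    imp <- c(x1 = d$x1, x2 = d$x2)

    # rescale the donor values proportionally so that the edit x1 + x2 == total holds
    imp <- imp * recipient$total / sum(imp)
    recipient$x1 <- unname(imp["x1"])
    recipient$x2 <- unname(imp["x2"])
    stopifnot(abs(recipient$x1 + recipient$x2 - recipient$total) < 1e-8)

Proportional rescaling is only one simple way to restore consistency; the paper extends hot-deck imputation itself so that the resulting records satisfy the linear edit restrictions.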
Recently, survey literature has put forward responsive and adaptive survey designs as means to make efficient tradeoffs between survey quality and survey costs. The designs, however, restrict quality-cost assessments to nonresponse error, while in mixed-mode surveys the measurement or response error plays a dominant role. Furthermore, there is both theoretical and empirical evidence that the two types of error are correlated. In this paper, we investigate adaptive survey designs that minimize both errors simultaneously in the Labour Force Survey. The design features that are selected are self-reporting versus proxy reporting, and the number of contact attempts.
An increasing number of people are active on various social media platforms. Here, people voluntarily share information, discuss topics of interest, and contact family and friends. Because the social media platform Twitter is used by a large number of people in the Netherlands and its public messages can be collected relatively easily, we investigated the content and usability of Twitter messages for statistics. This revealed that a considerable share of the messages collected, around 50%, could potentially be used to provide information on work, politics, spare-time activities and events.
We discuss the relation between trust and statistical dissemination, drawing on examples from the Netherlands and comparing with other European countries. Dutch citizens have a fair amount of confidence in official statistics, even in the recent period of political and economic upheaval. The most important reason for this seems to be the political culture in the Netherlands, which puts a strong emphasis on rational policy making based on evaluations from scientific councils, committees and official research bureaus. We discuss how this came to be, how it influences trust in the national statistical institute, and the consequences this has for the dissemination of statistical data.
Inference in official statistics is traditionally motivated from a design-based perspective, with the model-based approach being gradually adopted in specific circumstances. We take this shifting paradigm one step further, from model-based to algorithmic inference methods. Surveying a sample of the population of interest – typically enterprises or households – is fundamental to the design-based approach, where the design is the basis for inference. Model-based estimation methods may provide a viable alternative in situations where design information is not available. Estimation of the model parameters is pivotal, although in official statistics it is only an intermediate goal, as the model is ultimately used for prediction. Therefore, adopting a data-centred, algorithmic view rather than a model-centred view is possible. The algorithmic view encompasses methods generally attributed to the fields of data mining, machine learning, or statistical learning. Algorithmic methods may be useful in situations where data are not obtained through a sample survey, and where the typical models used in model-based estimation are not tenable.
Conflicting information may arise in statistical micro data due to partial imputation, where one part of the imputed record consists of the observed values of the original record and the other of the imputed values. Edit rules that involve variables from both parts of the record will often be violated. One strategy to remedy this problem is to make adjustments to the imputations such that all constraints are simultaneously satisfied and the adjustments are, in some sense, as small as possible. The minimal adjustments are obtained by minimizing a chosen distance metric subject to the constraints and we show how different choices of the distance metric result in different adjustments to the imputed data. As an extension we also consider an approach that does not aim to minimize the adjustments but to make the adjustments as uniform as possible between variables. Under this approach, even the values that are not explicitly involved in any constraints can be adjusted. The properties and interpretations of the proposed methods are illustrated using empirical business-economic data.
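As an illustration of the least-squares variant of minimal adjustment (a sketch under simplified assumptions, not the exact procedure or data used in the paper), the base R fragment below projects a partially imputed record onto the set of records satisfying a single linear edit; the variable names are hypothetical.

    x0 <- c(profit = 20, costs = 85, turnover = 100)  # partially imputed record
    A  <- rbind(c(1, 1, -1))                          # edit: profit + costs - turnover == 0
    b  <- c(0)

    # orthogonal projection of x0 onto the set {x : A x = b}
    lambda <- solve(A %*% t(A), b - A %*% x0)
    x_adj  <- x0 + as.vector(t(A) %*% lambda)

    x_adj          # minimally adjusted record
    A %*% x_adj    # zero up to rounding: the edit is now satisfied

With a weighted quadratic distance, values in which one has more confidence can be given larger weights so that they are left almost unchanged; the paper shows how such different choices of the distance metric lead to different adjustments.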
When monthly business surveys are not completely overlapping, there are two different estimators for the monthly growth rate of the turnover: (i) one that is based on the monthly estimated population totals and (ii) one that is purely based on enterprises observed on both occasions in the overlap of the corresponding surveys. The resulting estimates and variances might be quite different. This paper proposes an optimal composite estimator for the growth rate as well as the population totals.
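As a generic illustration of the idea of a composite estimator (a textbook-style sketch, not the specific optimal estimator derived in this paper), two estimators of the same growth rate can be combined as
\[
\hat{R}_c = \lambda \hat{R}_1 + (1-\lambda)\hat{R}_2,
\qquad
\lambda^{*} = \frac{\operatorname{var}(\hat{R}_2) - \operatorname{cov}(\hat{R}_1,\hat{R}_2)}
                   {\operatorname{var}(\hat{R}_1) + \operatorname{var}(\hat{R}_2) - 2\operatorname{cov}(\hat{R}_1,\hat{R}_2)},
\]
where \(\hat{R}_1\) is the estimator based on the estimated population totals, \(\hat{R}_2\) the estimator based on the overlap, and \(\lambda^{*}\) the weight that minimizes the variance of the combination.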
The quality of statistical statements strongly depends on the quality of the underlying data. Since raw data is often inconsistent or incomplete, data editing may consume a substantial amount of the resources available for statistical analyses. Although R has many features for analyzing data, the functionality for data checking and error localization based on logical restrictions (edit rules, or edits) is currently limited. The editrules package is designed to offer a user-friendly toolbox for edit definition, edit manipulation, data checking, and error localization.
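A minimal sketch of this workflow, assuming the basic interface of the editrules package (editmatrix, violatedEdits, localizeErrors) and a toy data set:

    library(editrules)

    # define two linear edits on hypothetical variables x, y, z
    E   <- editmatrix(c("x + y == z", "x >= 0"))
    dat <- data.frame(x = c(5, -1), y = c(3, 4), z = c(8, 10))

    violatedEdits(E, dat)          # which record violates which edit
    le <- localizeErrors(E, dat)   # Fellegi-Holt style error localization
    le$adapt                       # fields marked for adaptation per record

In this toy example the second record violates both edits; error localization marks the smallest (weighted) set of fields that must be changed to make the record consistent, here the single field x.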
In our research we aim to gain insight into the geospatial activity of mobile phone users. Points of interest are the correlation between calling activity and economic activity, population density based on the number of active mobile phones in an area, and movement statistics. A derived research question is how to obtain a tessellation of cell serving areas from a cell plan and how to combine different tessellations. For our research we obtained a dataset from a telecommunication company containing records of all call events on their network in the Netherlands during a two-week period. Each record contains information about the time, the serving antenna and an identification key of the phone. The dataset is large (over 600 million records) and the cell plan contains over 20,000 geo-locations of antennas. We devised a method to transform this cell plan, using the Voronoi algorithm, into an appropriate tessellation for geospatial analysis. One result of our research is a geospatial animation which clearly shows that high call intensity coincides with high population density. In addition, using k-means clustering we found useful patterns in the time series of call activity, providing insight into economic activity over time and space. Using the unique phone identification we obtained information on the movements of Dutch inhabitants.
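The following R sketch mimics the two computational steps on simulated data (real antenna coordinates and call records are obviously not included here); it assumes the deldir package as one way to compute a Voronoi tessellation.

    library(deldir)   # Voronoi/Delaunay tessellation

    set.seed(42)
    antennas <- data.frame(x = runif(50), y = runif(50))       # simulated antenna locations
    vor   <- deldir(antennas$x, antennas$y)                    # compute the tessellation
    tiles <- tile.list(vor)                                    # one serving-area polygon per antenna

    # toy hourly call counts per antenna, clustered into activity profiles
    calls    <- matrix(rpois(50 * 24, lambda = 10), nrow = 50) # 50 antennas x 24 hours
    profiles <- kmeans(calls, centers = 3)
    table(profiles$cluster)                                    # antennas per activity cluster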
In 2011, Statistics Netherlands conducted a large-scale mixed-mode experiment linked to the Crime Victimization Survey. The experiment consisted of two waves: one wave with random assignment to one of the modes web, paper, telephone and face-to-face, and one follow-up wave administered to the full sample with interviewer modes only. The objective of the experiment is to estimate total mode effects and the corresponding mode effect components arising from undercoverage, nonresponse and measurement. The estimated mode effects are used to improve methodology for mixed-mode surveys. In this paper, we define mode-specific selection and measurement bias, and we introduce and discuss estimators for these bias terms based on the experimental design. Furthermore, we investigate whether mode effect estimators based on the first wave only reproduce the estimates from the full experimental design. The proposed estimators are applied to a number of key survey variables from the Labour Force Survey and the Crime Victimization Survey.
Nonresponse in surveys may affect representativity, and therefore lead to biased estimates. A first step in exploring a possible lack of representativity is to estimate response probabilities. This paper proposes using the coefficient of variation of the response probabilities as an indicator for the lack of representativity. The usual approach for estimating response probabilities is by fitting a logit model. A drawback of this model is that it requires the values of the explanatory variables of the model to be known for all nonrespondents. This paper shows that this condition can be relaxed by computing response probabilities from weights that have been obtained from some weighting adjustment technique.
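In symbols, writing d_i for the design weight and w_i for the adjusted weight of respondent i (an assumed notation; the paper's exact definitions may differ), the implied response probabilities and the proposed indicator are
\[
\hat{\rho}_i = \frac{d_i}{w_i},
\qquad
\widehat{\mathrm{CV}}(\hat{\rho}) = \frac{S(\hat{\rho})}{\bar{\hat{\rho}}},
\]
where \(\bar{\hat{\rho}}\) and \(S(\hat{\rho})\) denote the (weighted) mean and standard deviation of the estimated response probabilities; a larger coefficient of variation signals a stronger lack of representativity.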
In this paper a (fairly) general methodology for automatic coding is suggested. Characteristic of the approach described in this paper is that each code (say, for a business activity) is characterized by one or more combinations of code words, or C-words. These combinations of C-words can be seen as definitions of the various codes used. It is assumed that the order of the C-words is irrelevant for describing a code. Because it is unlikely that people will use exactly those C-words when describing a business activity, synonyms, hyponyms and hyperonyms of the C-words are needed as well. They form the bridge between the definitions of the codes in the classification used and the descriptions provided by respondents, and are called D-words. A semantic network is used to provide a bridge between the descriptions and the codes. Some inherent difficulties with automatic coding are presented, as well as possible solutions to overcome them, either by solving or by sidestepping them.
This paper concentrates on methods for handling incompleteness caused by differences in units, variables and periods of the observed data set compared to the target one. Especially in economic statistics different unit types are used in different data sets.
This report describes the validity testing of a multi-item scale of global life satisfaction, namely the Satisfaction With Life Scale (SWLS). This scale has been proposed as an alternative to single-item life satisfaction measures.
In this paper we concentrate on methodological developments to improve the accuracy of a data set after linking economic survey and register data to a population frame. Because in economic data different unit types are used, errors may occur in relations between data unit types and statistical unit types. A population frame contains all units and their relations for a specific period. There may also be errors in the linkage of data sets to the population frame. When variables are added to a statistical unit by linking it to a data source, the effect of an incorrect linkage or relation is that the additional variables are combined with the wrong statistical unit. In the present paper we formulate a strategy for detecting and correcting errors in the linkage and relations between units of integrated data. For a Dutch case study, the detection and correction of potential errors is illustrated.
Recently Statistics Netherlands has started research into the share of illegal activities in the national income. The total contribution of illegal activities to the national income of the Netherlands increased from 1800 million euro in 1995 to almost 3500 million euro in 2008, equalling 0.6 percent of gross national income. The main illegal sector is drugs, which accounted for over 50 percent of the total income from illegal activities in 2001. In 2008 that share was down to less than 40 percent, whereas the share of illegal employment rose from about 10 percent in 1995 to 33 percent in 2008.
In this paper by Floris van Ruth, a graphical tool is presented for analysing the labour market. The state of the labour market, defined as tightness, is characterised via the interaction of supply of and demand for labour, enabling a more comprehensive analysis. Supply is defined as the proportion of the labour force holding a job, whilst demand is represented by the average of several indicators of labour demand. This approach results in a new and more general characterization of the state of the labour market, presented in an easy to interpret visual manner.
This paper by Daan Zult and Floris van Ruth describes a new type of analytical tool, developed for detecting early signals of changes in the development of exports. It is novel in several ways: it focuses on exports using a demand-pull approach, is based on a structural analysis of the demand for Dutch exports, and takes the form of a disaggregated visual tool. Sixteen selected foreign demand sectors are monitored using leading signaling indicators. The aggregate leads the growth rate of Dutch exports. The disaggregated set-up results in increased early-warning capabilities, as changes in the development of individual industries are immediately visible.
This report concerns the choice of a language and tool for the business and information architecture (BI-architecture) of Statistics Netherlands. It is a first step towards improving the current way of developing and maintaining the BI-architecture. In the first phase of the project the business architects indicated the need to replace the current language and tool. In the second phase of the project the business architects, together with the IT enterprise architects, advised that ArchiMate and BizzDesign Architect are the best language and tool combination to support the new way of working.
Since design-based methods like the generalized regression estimator have large design variances in the case of small sample sizes, model-based techniques can be considered as an alternative. In this paper a simulation study is carried out where small area estimation based on a linear mixed model is applied to the variable turnover of the Structural Business Survey of Statistics Netherlands. By applying the EBLUP, the accuracy of the estimates can be substantially improved compared to the generalized regression estimator. The EBLUP estimates, however, are biased, which is partly caused by the skewed distribution of the variable turnover. It is found that by transforming the target variable both skewness and bias can be substantially reduced, whereas the variance increases. As a result, the accuracy is slightly improved compared to the EBLUP based on the untransformed variable.
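As background, a commonly used unit-level linear mixed model for this type of small area estimation (a generic sketch; the paper's exact model specification may differ) is
\[
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + v_i + e_{ij},
\qquad v_i \sim N(0,\sigma_v^{2}), \quad e_{ij} \sim N(0,\sigma_e^{2}),
\]
with EBLUP of the mean of area \(i\)
\[
\hat{\bar{Y}}_i = \bar{\mathbf{X}}_i^{\top}\hat{\boldsymbol{\beta}}
  + \hat{\gamma}_i\bigl(\bar{y}_i - \bar{\mathbf{x}}_i^{\top}\hat{\boldsymbol{\beta}}\bigr),
\qquad
\hat{\gamma}_i = \frac{\hat{\sigma}_v^{2}}{\hat{\sigma}_v^{2} + \hat{\sigma}_e^{2}/n_i},
\]
which shrinks the direct area estimate towards the regression-synthetic estimate; the shrinkage is strongest for areas with small sample sizes \(n_i\).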
Macro-integration is widely used for the reconciliation of macro figures, usually in the form of large multi-dimensional tabulations, obtained from different sources. Traditionally these techniques have been extensively applied in the area of macro-economics, especially in the compilation of the National Accounts. Methods for macro-integration have developed over the years and have become very versatile techniques for solving integration of data from different sources at a macro level. In this paper we propose applications of macro-integration techniques in other domains than the traditional macro-economic applications. In particular, we present two possible applications for macro-integration methods: reconciliation of tables of a virtual census and combining estimates of labour market variables.
In order to produce official statistics of sufficient quality, statistical institutes carry out an extensive process of checking and correcting the data that they collect. This process is called statistical data editing. In this article, we give a brief overview of current data editing methodology. In particular, we discuss the application of selective and automatic editing procedures to improve the efficiency and timeliness of the data editing process.
The objective of this paper is to present a new formulation of the error localisation problem which can distinguish between hard and soft edits.
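One way to formalise this idea (an illustrative big-M formulation, not necessarily the one proposed in the paper) is as a mixed-integer programme in which hard edits must be restored while soft edits may remain violated at a penalty:
\[
\min_{\mathbf{x}',\,\boldsymbol{\delta},\,\mathbf{s}}\;
\sum_{j} w_j \delta_j + \sum_{k} c_k s_k
\quad\text{subject to}\quad
A\mathbf{x}' \le \mathbf{b},\qquad
C\mathbf{x}' \le \mathbf{d} + M\mathbf{s},\qquad
x'_j = x_j \ \text{whenever}\ \delta_j = 0,
\]
with \(\delta_j, s_k \in \{0,1\}\), where \(\delta_j = 1\) indicates that field \(j\) of the record is changed, \(A\mathbf{x}' \le \mathbf{b}\) are the hard edits, \(C\mathbf{x}' \le \mathbf{d}\) the soft edits, \(s_k = 1\) allows soft edit \(k\) to remain violated, and \(M\) is a sufficiently large constant.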
Survey nonresponse occurs when members of a sample cannot or will not participate in the survey. It remains a problem despite the development of statistical methods that aim to reduce nonresponse. In this paper, we address the problem of resource allocation in survey designs in which the focus is on the quality of the survey results given that there will be nonresponse. We propose a novel method in which the optimal allocation of survey resources can be determined. We demonstrate the effectiveness of our method by extensive numerical experiments.