Improving the sampling strategy for the Community Innovation Survey using machine learning algorithms
National statistical institutes (NSIs) are increasingly interested in using non-probability data to produce official statistics. Examples include information published on the internet, social media messages, sensor data, and web-scraped data. Relying on these kinds of data sources increases the impairment risk for official statistics because, for example, an NSI has no control over the availability of the data or their comparability over time. To minimize these impairment risks, this paper proposes using information extracted from such data sources to improve the sampling strategy of a probability sample. This concept is illustrated with an application to the Community Innovation Survey (CIS).
Three sources of auxiliary information are considered to improve the estimation procedure of the CIS: (1) web-scraped data indicating the likelihood of a company being innovative, (2) administrative records of businesses receiving research and development subsidies, and (3) administrative data on the number of patents associated with each business. Using data from the 2016 Community Innovation Survey, the paper studies the extent to which survey estimate accuracy can be improved by weighting to a population distribution informed by these auxiliary sources. The weighted estimates obtained through the generalized regression estimator are compared to those derived from the currently used Horvitz-Thompson estimator. This analysis assesses whether the existing weighting approach can be refined and identifies which auxiliary source yields the most accurate estimates. Additionally, the paper contributes to the ongoing discussion on using traditional and novel data sources to enhance the quality of official statistics.
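To make the comparison between the two estimators concrete, the following is a minimal sketch in Python of a Horvitz-Thompson total and a generalized regression (GREG) total in their textbook forms. It is not the implementation used for the CIS: the function names, the single auxiliary variable, and the toy numbers are illustrative assumptions, and the known population totals of the auxiliary variables are assumed to come from a register or similar source.

```python
import numpy as np


def horvitz_thompson_total(y, pi):
    """Horvitz-Thompson estimate of a population total: sum of y_i / pi_i."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    return float(np.sum(y / pi))


def greg_total(y, X, pi, x_totals):
    """GREG estimate: the HT total plus a regression adjustment.

    X holds the sampled units' auxiliary variables (one row per unit,
    including a column of ones for the intercept); x_totals holds the
    known population totals of those same columns.
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    d = 1.0 / np.asarray(pi, dtype=float)      # design weights
    XtD = X.T * d                              # X' D, with D = diag(d)
    B = np.linalg.solve(XtD @ X, XtD @ y)      # design-weighted least squares
    x_ht = XtD.sum(axis=1)                     # HT estimates of auxiliary totals
    adjustment = (np.asarray(x_totals, dtype=float) - x_ht) @ B
    return float(d @ y + adjustment)


# Toy illustration (synthetic numbers, not CIS data): a population of
# 10 units with auxiliary values x = 0..9 and target y = 2 + 3x, from
# which units 1, 4 and 7 are sampled with inclusion probability 0.3.
x_sample = np.array([1.0, 4.0, 7.0])
y_sample = 2.0 + 3.0 * x_sample
pi_sample = np.full(3, 0.3)
X_sample = np.column_stack([np.ones(3), x_sample])
x_population_totals = np.array([10.0, 45.0])   # N and the sum of x

ht_estimate = horvitz_thompson_total(y_sample, pi_sample)
greg_estimate = greg_total(y_sample, X_sample, pi_sample, x_population_totals)
```

In this toy case the target variable is an exact linear function of the auxiliary variable, so the regression adjustment recovers the true population total, whereas the Horvitz-Thompson estimate depends on which units happen to be sampled. With real survey data the gain depends on how strongly the auxiliary source correlates with the target variable, which is the question the paper examines.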