The use of non-probability data as a primary data source in official statistics is an active field of research. In the absence of a traditional sampling design, modern machine-learning algorithms might play a central role in producing accurate population estimates. This working paper presents empirical research on how class imbalance and the non-probability nature of the data affect the quality of individual predictions and population estimates.
We represent the Dutch state road network as a graph, with traffic junctions as vertices and state roads as edges, and infer Dutch freight traffic across the network from Weigh-in-Motion (WiM) road sensors. These sensors detect passing transport vehicles, and photographs of the license plates allow for record linkage of vehicle features. However, the sensors are installed on only a small, non-probability sample of edges in the network. Another complication is that detecting a vehicle from the population (the vehicle register) is a rare event. We apply extreme gradient boosting to learn the probability of vehicle detection by a WiM sensor from time features (e.g., weekend indicator, weather), edge features (e.g., PageRank of an edge’s origin vertex, general traffic intensity from loop sensors), vehicle features (e.g., mass, age) and vehicle owner features (e.g., company size, economic activity). The learned relationship is then used to predict the probability of detection on each day of the year, along each edge in the network, for each vehicle in the population. Several scenarios were designed to simulate the effects of the non-probability nature of the data and of the extreme class imbalance.
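To make the edge feature "PageRank of an edge's origin vertex" concrete, the following is a minimal sketch that computes PageRank by power iteration on a toy directed graph and attaches the origin vertex's score to each edge. The junction names, the toy edge list, the damping factor of 0.85, and the function name `pagerank` are illustrative assumptions, not the network or settings used in this research.

```python
def pagerank(edges, damping=0.85, tol=1e-12, max_iter=1000):
    """PageRank by power iteration on a directed graph given as (origin, destination) pairs."""
    nodes = sorted({v for edge in edges for v in edge})
    successors = {v: [] for v in nodes}
    for origin, destination in edges:
        successors[origin].append(destination)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by vertices without outgoing edges is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not successors[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for origin in nodes:
            if successors[origin]:
                share = damping * rank[origin] / len(successors[origin])
                for destination in successors[origin]:
                    new_rank[destination] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical toy network: junctions A..D connected by directed road segments.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "C"), ("A", "C")]
rank = pagerank(edges)

# Edge feature: the PageRank of each edge's origin vertex.
edge_feature = {(origin, destination): rank[origin] for origin, destination in edges}
```

In this toy graph, junction D has no incoming edges, so it receives only the uniform teleportation share, while junction C collects rank from three incoming edges.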
With about 27 million records and over 100 features, the model performed about halfway between random guessing and perfect prediction when trained and tested on a balanced probability sample. Training and testing on a non-probability sample caused substantial variation in model performance across test sets, confirming the risk of extrapolating to domains that are not well represented in the data. Class imbalance seriously compromised model performance, which was best detected by Matthews’ correlation coefficient and the min-max normalized 𝐹₁ of the rare class. Balancing the data improved model performance on balanced test sets but hampered inference to the entire population, as illustrated by reliability plots.
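The sensitivity of these metrics to class imbalance can be illustrated with a small sketch that compares accuracy, Matthews' correlation coefficient (MCC), and the 𝐹₁ score of the rare class for two hypothetical classifiers. The confusion-matrix counts below are invented for illustration and are not results from this research. The min-max normalization of 𝐹₁ is not reproduced, because its bounds are not specified here.

```python
import math


def accuracy(tp, fp, fn, tn):
    """Share of all records that are classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


def mcc(tp, fp, fn, tn):
    """Matthews' correlation coefficient; defined as 0 when a margin is empty."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denominator if denominator else 0.0


def f1_rare(tp, fp, fn):
    """F1 score of the rare (positive) class; defined as 0 when there are no positives at all."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical test set: 10,000 records, of which 100 are rare-class detections.
# Classifier "never" predicts the majority class for every record.
never = dict(tp=0, fp=0, fn=100, tn=9900)
# Classifier "some" finds half of the rare events at the cost of 150 false alarms.
some = dict(tp=50, fp=150, fn=50, tn=9750)

for name, counts in (("never", never), ("some", some)):
    print(
        name,
        round(accuracy(**counts), 4),
        round(mcc(**counts), 4),
        round(f1_rare(counts["tp"], counts["fp"], counts["fn"]), 4),
    )
```

Under these assumed counts, the classifier that never predicts the rare class attains the higher accuracy, while its MCC and rare-class 𝐹₁ are both zero. This is the sense in which accuracy can mask the damage done by extreme class imbalance.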
Producing official statistics using non-probability data as the primary source would benefit from a sampling design or from features that explain the data-generating mechanism. In the absence of both, and when predicting a rare event, the quality achieved with a modern machine-learning algorithm does not meet the quality standards of official statistics.