Correcting selection bias in nonprobability samples by pseudo-weighting

Statistics are often estimated from a sample rather than from the entire population. When the inclusion probabilities of the sampled units are unknown to the researcher, the sample is a nonprobability sample, and naively treating it as a simple random sample may result in selection bias. Interest in correcting selection bias is growing with the availability of new data sources. These data are often easy to collect and may qualify as so-called "Big Data" because they cover a large fraction of the population. This dissertation proposes a novel framework for correcting selection bias in nonprobability samples. The general idea is to construct a set of unit weights for the nonprobability sample by borrowing strength from a reference probability sample. If a proper set of weights is constructed, design-based estimators can be applied with these weights to estimate population parameters. To evaluate the uncertainty of the estimated population parameter, a pseudo-population bootstrap procedure is proposed for different relations between the nonprobability sample and the probability sample.
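The weighting idea can be illustrated with a deliberately simplified sketch. It is not the dissertation's probability estimation model: it assumes a single categorical covariate observed in both samples, and the function names (`cell_pseudo_weights`, `hajek_mean`) and the toy data are hypothetical. Each nonprobability unit receives the estimated population size of its cell, taken from the design weights of the reference probability sample, divided by the number of nonprobability units in that cell. A design-based (Hajek-type) weighted mean is then computed with those pseudo-weights.

```python
from collections import Counter, defaultdict


def cell_pseudo_weights(nonprob_cells, ref_cells, ref_design_weights):
    """Return one pseudo-weight per nonprobability unit.

    Assumption (illustrative only): selection depends on a single
    categorical covariate. The pseudo-weight of a unit is the estimated
    population size of its cell divided by the number of nonprobability
    units observed in that cell.
    """
    # Estimate each cell's population size from the reference sample.
    est_pop = defaultdict(float)
    for cell, d in zip(ref_cells, ref_design_weights):
        est_pop[cell] += d

    counts = Counter(nonprob_cells)
    missing = set(counts) - set(est_pop)
    if missing:
        raise ValueError(f"cells absent from the reference sample: {missing}")

    return [est_pop[c] / counts[c] for c in nonprob_cells]


def hajek_mean(values, weights):
    """Weighted mean: sum(w * y) / sum(w)."""
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


# Toy data: the nonprobability sample over-represents the "young" cell.
nonprob_cells = ["young", "young", "young", "young", "old"]
nonprob_y = [1, 1, 1, 0, 0]

# Reference probability sample: four units, each representing 25 people.
ref_cells = ["young", "young", "old", "old"]
ref_design_weights = [25.0, 25.0, 25.0, 25.0]

weights = cell_pseudo_weights(nonprob_cells, ref_cells, ref_design_weights)
naive = sum(nonprob_y) / len(nonprob_y)
corrected = hajek_mean(nonprob_y, weights)
print(naive, corrected)
```

In this toy setting the naive mean is pulled toward the over-represented cell, while the weighted mean rebalances the two cells to their estimated population shares. A model-based variant would replace the cell counts with estimated inclusion probabilities.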

Three practical challenges for pseudo-weighting are also discussed. The first is model selection: the proposed framework is flexible and many kinds of probability estimation models can be used, which raises the question of how to select a proper model for the population parameter of interest. A series of performance measures is tested, and we found that modeling the target variable when evaluating the performance of weights may be useful. The second challenge comes from the large size of the nonprobability sample. Because a large nonprobability sample is often paired with a small probability sample, the combined sample is imbalanced, which can cause problems when estimating model parameters. Several remedies for imbalanced samples are discussed, and the proposed framework is adjusted accordingly. The results show that SMOTE (Synthetic Minority Over-sampling Technique) is a promising technique for dealing with imbalanced samples. The third challenge is the scenario where not only population-level estimates but also subpopulation estimates are of interest. Several approaches to combining pseudo-weights with small area estimation are discussed. Of these approaches, we found that combining a hierarchical Bayesian model with weights is relatively stable. If both population-level and area-level estimates are of interest, aligning the weighted estimates with estimated marginal totals may be a better option.
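For readers unfamiliar with SMOTE, the core interpolation step can be sketched as follows. This is a generic, minimal version of the technique and not the adjusted procedure developed in the dissertation: it ignores design weights, handles numeric covariates only, and the function name `smote_oversample` and its parameters are hypothetical. Each synthetic point lies on the line segment between a randomly chosen minority unit (here, a unit of the small probability sample) and one of its `k` nearest minority neighbors.

```python
import random


def smote_oversample(minority, n_synthetic, k=3, seed=0):
    """Generate synthetic minority points by nearest-neighbor interpolation.

    minority: list of equal-length numeric tuples (the small class).
    n_synthetic: number of synthetic points to create.
    k: number of nearest minority neighbors considered per base point.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]

        # Rank the other minority points by squared Euclidean distance.
        others = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(base, minority[j])),
        )
        neighbor = minority[rng.choice(others[:k])]

        # Place the new point at a random position between base and neighbor.
        u = rng.random()
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(base, neighbor)))
    return synthetic


# Toy minority class: four points on the corners of the unit square.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_oversample(points, n_synthetic=5, k=2, seed=1)
print(len(new_points))
```

Because every synthetic point is a convex combination of two observed minority points, the oversampled class stays within the range of the original covariate values. How synthetic units should carry design weights is a separate question that this sketch does not address.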

Liu, A.-C. (2025). Correcting selection bias in nonprobability samples by pseudo-weighting. Dissertation, Tilburg University.