Evaluating and improving a text classifier for subpopulations.

In 2019, Statistics Netherlands (CBS) developed a beta product that uses text mining to identify cyber-related crimes from the texts of police reports. CBS is interested in developing this product into an official statistic on the proportion of cybercrime, broken down by subpopulation. In this paper we evaluate what needs to be done to achieve such output.

We use crime types as an example of such subpopulations. The paper addresses three issues: missingness of text fields, bias in population estimates due to modelling errors, and differences in model performance between crime types. Each issue is analysed and solutions are proposed. The variability in model performance across crime types appears to be the most difficult issue to tackle. The proposed method for evaluating model performance over subpopulations might also be useful in other situations where machine learning is used.
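
To make the idea of evaluating a classifier per subpopulation concrete, the following is a minimal sketch in Python. It is not the method proposed in this paper. It only assumes a hand-labelled evaluation set in which each record carries a crime type, a true cybercrime label and a predicted label. The function name `per_group_metrics`, the crime type names and the toy records are all hypothetical and chosen for illustration only.

```python
from collections import defaultdict


def per_group_metrics(records):
    """Compute precision, recall and cybercrime rates per subpopulation.

    `records` is an iterable of (group, y_true, y_pred) tuples with 0/1 labels.
    This is an illustrative helper, not the method used in the paper.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0, "tn": 0})
    for group, y_true, y_pred in records:
        if y_true == 1 and y_pred == 1:
            counts[group]["tp"] += 1
        elif y_true == 0 and y_pred == 1:
            counts[group]["fp"] += 1
        elif y_true == 1 and y_pred == 0:
            counts[group]["fn"] += 1
        else:
            counts[group]["tn"] += 1

    results = {}
    for group, c in counts.items():
        n = sum(c.values())
        predicted_pos = c["tp"] + c["fp"]
        actual_pos = c["tp"] + c["fn"]
        results[group] = {
            # Precision and recall are undefined when the denominator is zero.
            "precision": c["tp"] / predicted_pos if predicted_pos else None,
            "recall": c["tp"] / actual_pos if actual_pos else None,
            # Comparing these two rates shows the bias of a naive count
            # of predicted positives within the group.
            "predicted_rate": predicted_pos / n,
            "true_rate": actual_pos / n,
        }
    return results


# Hypothetical toy data: (crime type, true label, predicted label).
toy_records = [
    ("fraud", 1, 1), ("fraud", 1, 1), ("fraud", 1, 0), ("fraud", 0, 0),
    ("threat", 1, 1), ("threat", 0, 1), ("threat", 0, 0), ("threat", 0, 0),
]

metrics = per_group_metrics(toy_records)
for group, values in metrics.items():
    print(group, values)
```

In this toy data both crime types have the same predicted cybercrime rate, while their true rates differ. This illustrates how modelling errors can bias subpopulation estimates in different directions even when overall figures look plausible.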