Innovating statistics with data challenges

Author: Sjoertje Vos
People on bicycles on their way to work
© Hollandse Hoogte / David Rozing
The second edition of the Big Data Meets Survey Science conference, BigSurv20, was held online at the end of last year. Utrecht University organised the conference in cooperation with Statistics Netherlands (CBS). Four teams competed for the prize for the best solution to a real-world data problem. The winning team was able to use GPS data to predict the reasons why people travel. Next year, the application will be further investigated for the ‘Innovatie Onderweg in Nederland’ research project of CBS and the Directorate-General for Public Works and Water Management (Rijkswaterstaat).

Great added value

The BigSurv conference is about improving questionnaire research using techniques from computer science and big data. The first edition took place in Barcelona in 2018. In 2020, Utrecht University and CBS took the initiative in organising the conference. ‘The aim of the online conference was to bring together researchers from different disciplines,’ says Peter Lugtig, assistant professor of Methodology and Statistics at Utrecht University. ‘Researchers from the social sciences, economics and official statistics are adept at evaluating data quality. They are concerned with measurement and selection errors and non-response. Computer scientists are further advanced in the field of data science and machine learning. We believe that the cross-pollination between these fields has great added value.’

Data fusion

The online conference lasted five weeks. Thanks to sponsors, the conference was free to attend, which also allowed students and candidates from developing countries to participate. The roughly 1,800 people who registered were able to view over 200 presentations. Lugtig: ‘These covered subjects such as text, image and sensor data; optimising data collection processes with machine learning; ethics and privacy; and data fusion. Data fusion is the combination of traditional questionnaire data with new data sources, such as Twitter data. This transition from traditional questionnaire research with random sampling to using existing data sources as much as possible is one of the greatest challenges for science in general and official statistics in particular.’

Barry Schouten is the organiser of the BigSurv20 data challenge

Data Challenge

The conference participants could also register for the big data challenge. ‘This was organised by CBS,’ says Barry Schouten, a senior methodologist at CBS and professor of Innovation Survey Observation at Utrecht University. ‘We selected four actual data problems from various agencies: the renewal of the Budget Study for the European statistical office Eurostat; the Innovatie ODiN research project for CBS and Rijkswaterstaat; the measurement of the discrepancy between job vacancies and job seekers' skills, for the purposes of CBS's labour market statistics; and the measurement of possible effects of the coronavirus pandemic on the choice of study, for Studiekeuze123. Four teams each worked on one of these data questions for a month. They received guidance from project leaders and researchers at the agencies involved, which yielded good results.’

Enriching GPS data

The winner of the data challenge was the ‘Travel Escape’ team. They investigated the renewal of the Movements research project of CBS and Rijkswaterstaat. Qixiang Fang, PhD student at Utrecht University, was one of the four team members. The other members were researchers from the Netherlands, Germany and India. ‘Our data question was to use GPS data to predict the reasons why people travel: for example, whether they went to work, visited friends, or went shopping. This was no easy task,’ says Fang. ‘We enriched the GPS data with various other data, such as the distances to nearby shops, businesses and schools, information about the weather and weather forecasts, and time variables such as day of the week and time of day. We then ran machine learning models on the enriched data. This led to accurate predictions of people’s motives. In addition, we developed a web application that would allow users to check the predicted motives.’ Fang and his team presented their idea in the closing session of the BigSurv20 conference and were awarded a cash prize for coming first.
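The enrichment step Fang describes can be illustrated with a minimal Python sketch. It is not the team's code: the points of interest, the labelled stops and the function names are all hypothetical, weather features are left out, and a simple 1-nearest-neighbour rule stands in for the unspecified machine learning models.

```python
import math
from datetime import datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two coordinates, in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


# Hypothetical points of interest: (category, latitude, longitude).
POIS = [
    ("shop", 52.0900, 5.1200),
    ("business", 52.0850, 5.1700),
    ("school", 52.1000, 5.1000),
]


def enrich(stop):
    """Turn one GPS stop into a feature vector (distances plus time variables)."""
    feats = []
    for category in ("shop", "business", "school"):
        # Distance to the nearest point of interest in this category.
        feats.append(min(
            haversine_km(stop["lat"], stop["lon"], lat, lon)
            for cat, lat, lon in POIS if cat == category
        ))
    t = stop["time"]
    feats.append(float(t.weekday() >= 5))  # 1.0 on Saturday or Sunday
    feats.append(t.hour / 24.0)            # time of day scaled to [0, 1)
    return feats


def predict(train, stop):
    """Return the label of the training stop whose features are closest."""
    x = enrich(stop)
    nearest = min(train, key=lambda example: math.dist(enrich(example[0]), x))
    return nearest[1]


# Hypothetical labelled stops: (stop, travel motive).
TRAIN = [
    ({"lat": 52.0851, "lon": 5.1701, "time": datetime(2020, 11, 2, 9, 0)}, "work"),
    ({"lat": 52.0901, "lon": 5.1201, "time": datetime(2020, 11, 7, 14, 0)}, "shopping"),
    ({"lat": 52.1001, "lon": 5.1001, "time": datetime(2020, 11, 3, 8, 30)}, "education"),
]

new_stop = {"lat": 52.0852, "lon": 5.1699, "time": datetime(2020, 11, 4, 8, 45)}
print(predict(TRAIN, new_stop))
```

The sketch mixes distances in kilometres with time features scaled between 0 and 1, so a realistic model would first need to standardise the features and would be trained on far more labelled stops.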

Webscraping on LinkedIn

Schouten is very pleased with the outcome of the challenge: ‘This winning team has convincingly demonstrated that it is possible to predict motives for travel using GPS data. We will therefore continue to investigate whether it is possible to actually implement this methodology in the Movements research project. The solution of the team that came second will also be followed up. They used webscraping on LinkedIn and job sites to compare the competences requested in job ads with the skills of people on the job market. Because the information on LinkedIn is partially protected, we are going to try a variant in which people can donate their own data.’
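The comparison between requested competences and available skills can be pictured with a small Python sketch. It only illustrates the idea of counting demand against supply; the records are invented, no scraping is involved, and the function name `skill_gap` is hypothetical rather than taken from the runner-up team's solution.

```python
from collections import Counter

# Hypothetical, hand-made records: one set of skills per job ad or job seeker.
job_ads = [
    {"python", "sql"},
    {"python", "statistics"},
    {"nursing"},
]
job_seekers = [
    {"sql", "excel"},
    {"nursing", "excel"},
    {"nursing"},
]


def skill_gap(ads, seekers):
    """Per skill: number of ads requesting it minus number of seekers offering it."""
    demand = Counter(skill for ad in ads for skill in ad)
    supply = Counter(skill for person in seekers for skill in person)
    # A Counter returns 0 for skills it has not seen, so missing keys are safe.
    return {skill: demand[skill] - supply[skill] for skill in set(demand) | set(supply)}


gap = skill_gap(job_ads, job_seekers)
for skill in sorted(gap, key=gap.get, reverse=True):
    print(skill, gap[skill])
```

A positive value marks a skill that is requested more often than it is offered, and a negative value marks the reverse.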

Interested in the presentations of BigSurv20? You can view them at: www.bigsurv20.org