Over the next few months, Statistics Netherlands (CBS) will invest in a new computer system: Spark. This big data framework gives researchers and statisticians the capacity to process large volumes of data more rapidly. That makes it a fundamental tool for CBS, which nowadays needs to process ever larger databases.
Adrie Ykema, Spark project leader at CBS, says the organisation's databases keep growing in size: ‘Our researchers have reached the boundaries of our computer infrastructure. The calculations we need to perform with these vast amounts of data demand more computing power and smart processing to speed up processing time.’ Spark offers a solution in the form of a software layer that enables multiple computers to work on the same task simultaneously. This makes fast and accurate calculations on huge quantities of data possible.
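The idea of several workers sharing one task can be illustrated with a minimal sketch. It does not use Spark itself: threads from the Python standard library stand in for the separate computers, and the function names and the vehicle counts are hypothetical, chosen only for illustration. Each worker summarises its own slice of the data, and the partial results are then combined into one answer.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(records, n_partitions):
    """Divide the records into roughly equal slices, one per worker."""
    return [records[i::n_partitions] for i in range(n_partitions)]


def partial_sum(partition):
    """Work done independently on one slice: its sum and its size."""
    return sum(partition), len(partition)


def distributed_mean(records, n_workers=4):
    """Compute a mean by combining per-slice results from several workers."""
    partitions = split_into_partitions(records, n_workers)
    # Threads stand in for the separate machines of a real cluster.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial_sum, partitions))
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


# Hypothetical vehicle counts per minute from a traffic loop.
counts = [12, 15, 9, 20, 18, 11, 14, 17]
print(distributed_mean(counts))  # 14.5
```

A real cluster framework additionally has to move data between machines and recover from failures, which this sketch leaves out.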
Over the past few months, CBS has run tests with Spark. Ykema explains: ‘On a “small version” of the Spark system we have conducted three Proofs of Concept: one with traffic loop data, another with data from the Centre for Policy Related Statistics, and one with statistics from the Nature Department on dragonflies. The latter served as a sample of working on a relatively small dataset while using a highly processing-intensive method. Through these tests we were able to study what exactly becomes feasible with Spark, as well as the cost in terms of money and effort it will involve. Our conclusion is that Spark offers many benefits in performing calculations with very large datasets such as traffic loop data from the Directorate-General of Public Works and Water Management.’
Marco Puts is a big data researcher at CBS who has already gathered considerable experience with Spark outside the pilot study. He is very keen: ‘From a European project, we have sourced AIS data (AIS stands for Automatic Identification System). It is a system which was designed to increase maritime security, both at sea and on waterways, by providing overviews and information which it obtains from interactions between ships and from ship to shore. CBS has purchased this information with two different aims. Firstly, to allow experimenting with Spark and secondly, to see whether we would be able to produce statistics based on these huge amounts of data.’ According to Puts, the sky is the limit for Spark: ‘With Spark, we only need 15 minutes to process data over an entire quarter, whereas on a normal computer, the processing of just one day’s data takes us one and a half days. We could not have kept up without Spark. Another great advantage is that it is easy for us to purchase extra services whenever processing becomes too heavy due to large data volumes.’
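The figures Puts quotes can be turned into a rough speed-up estimate. The sketch below assumes a quarter of 91 days and assumes that processing time on a single computer grows in proportion to the number of days of data; neither assumption comes from CBS, so the outcome is an order-of-magnitude indication only.

```python
MINUTES_PER_DAY = 24 * 60

# Figures quoted in the article.
SINGLE_MACHINE_DAYS_PER_DATA_DAY = 1.5  # 1.5 days to process one day of data
SPARK_MINUTES_PER_QUARTER = 15          # 15 minutes to process a whole quarter

# Assumption for illustration: a quarter has 91 days.
DAYS_PER_QUARTER = 91


def single_machine_minutes(days_of_data=DAYS_PER_QUARTER):
    """Estimated minutes a single computer needs, assuming linear scaling."""
    return days_of_data * SINGLE_MACHINE_DAYS_PER_DATA_DAY * MINUTES_PER_DAY


def estimated_speedup(days_of_data=DAYS_PER_QUARTER):
    """Ratio of the single-computer estimate to the quoted Spark time."""
    return single_machine_minutes(days_of_data) / SPARK_MINUTES_PER_QUARTER


print(single_machine_minutes() / MINUTES_PER_DAY)  # 136.5 days on one computer
print(estimated_speedup())                         # 13104.0
```

Under these assumptions, a quarter of data would occupy a single computer for about 136.5 days, which puts the quoted 15 minutes at roughly a 13,000-fold reduction.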
Investing in speed
CBS is currently investing in a larger Spark system, due to become operational later this year. Ykema: ‘This investment may lead to much faster production of our statistics. Over the next few months, preparations for the commissioning of Spark will continue. We will look into how and where to store large databases, how to handle data security issues, and how technical and operational management is best organised.’
According to Ykema, it is good to see research, IT and statistical production staff all working together on the Spark project. ‘Spark requires a new way of thinking and working, and a different organisation of the statistical process. We do offer staff an online training course, but the results from working with Spark will only become clear through actual practice.’
Ripples on water
As Puts confirms, a number of CBS staff are already taking part in the edX training course series ‘Data Science and Engineering with Spark’. ‘This series of courses takes six months and offers step-by-step instructions on working with Spark. I have noticed that my colleagues are picking it up quickly. Hopefully this will spread like ripples on water, so CBS will have more and more people who know how to process big data.’