Profiling of Twitter users: a big data selectivity study

Big data may contain traces of human or economic activity that could be used for official statistics. However, a big data source is generally not a random sample of the target population, so estimates based on it may be biased.
In sample surveys, auxiliary information known for the target population is traditionally used to correct for selective non-response. A similar approach could be applied to big data, if auxiliary information can be extracted and linked to the units in the big data source. In this paper, we explore different ways of extracting auxiliary information, both from the big data source itself and from linking it with other sources of information. We apply this profiling method to a dataset of Dutch Twitter users. We show to what extent gender can be extracted from the user’s first name, short biography, tweet writing style and profile picture. We also show to what extent Twitter accounts can be matched with LinkedIn accounts, from which additional characteristics can be extracted using a web scraping robot. We discuss the potential and implications of profiling big data sources for official statistics.
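As an illustration of how auxiliary information can correct for selectivity, the following is a minimal sketch of post-stratification by a single extracted characteristic (here gender). It is not the method of this paper: the function names, the group labels, the population shares and the toy data are all hypothetical assumptions chosen for the example.

```python
from collections import Counter

def poststratification_weights(sample_groups, population_shares):
    """Weight per group = population share / sample share.

    sample_groups: group label of each unit in the (selective) source.
    population_shares: known share of each group in the target population.
    """
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return {g: population_shares[g] / (counts[g] / n) for g in counts}

def weighted_mean(values, groups, weights):
    """Mean of `values`, with each unit weighted by its group's weight."""
    total_weight = sum(weights[g] for g in groups)
    weighted_sum = sum(weights[g] * v for v, g in zip(values, groups))
    return weighted_sum / total_weight

# Hypothetical toy data: 6 accounts profiled as male, 4 as female.
groups = ["m"] * 6 + ["f"] * 4
# Hypothetical target variable: 1.0 for every male unit, 0.0 for every female unit.
values = [1.0] * 6 + [0.0] * 4
# Assumed population shares (illustrative only, not real statistics).
population_shares = {"m": 0.5, "f": 0.5}

weights = poststratification_weights(groups, population_shares)
naive_estimate = sum(values) / len(values)
corrected_estimate = weighted_mean(values, groups, weights)
```

In this toy case the unweighted mean overstates the target variable because one group is over-represented, and the weighted mean moves the estimate towards the assumed population composition. The correction is only as good as the extracted characteristic: misclassified units, or selectivity within a group, are not repaired by these weights.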