Author: May Offermans, Yvonne Gootzen, Edwin de Jonge, Jan van der Laan, Frank Pijpers, Shan Shah, Martijn Tennekes, Peter-Paul de Wolf.
Pilot Study: Mobile Phone Meta Data Records – Introduction to the research method

3. Method description in 9 steps

3.1. Processing steps taking place at the MNO

3.1.1 Step 1 Description of source data already available at the MNO and pseudonymisation

Source data
The source data according to this method are signalling data, i.e. data regarding events that take place between the device and the network and are recorded in the datacentre of the MNO. The signalling data are generated by mobile phones connecting with the mobile phone network, for example because of a phone call, a data connection being started, or a text message being received. This is registered at the switchboard and stored for up to 6 months. It is not the content of the communication but the time, duration and cell to which that particular event is assigned. A cell is part of an antenna; an antenna covers an area consisting of one or more cells.

There are various types of event-based source data, depending on which events are stored. Besides signalling data, these include for example Event Data Records (EDR) and Call Detail Records (CDR). Signalling data contains the most events. These data are often generated separately for 2G, 3G, 4G, and 5G networks. The latter is irrelevant under this method. The point is that sufficient data points of this type (i.e. events that are recorded) are available per device. The more events are available, the better the estimates. Table 3.1 is taken from a publication by Positium [2] and shows the differences in numbers of events between different types of data.

Table 3.1

From the signalling data, only a double pseudonymised ID, the exact time and the antenna ID (number of the antenna used) of each active connection are generated as source data. Then, static reference data such as the cell plan (where the antennas are located) are added.
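
As an illustration only, the shape of such a source record and its link to the cell plan could be sketched as follows. The class and field names (SignallingEvent, Antenna, pseudo_id, x, y) and the function join_cell_plan are hypothetical; the actual record layout used at the MNO is not described in this document.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class SignallingEvent:
    """One source record; all field names are hypothetical."""
    pseudo_id: str       # double pseudonymised device ID
    timestamp: datetime  # exact time of the event
    antenna_id: str      # ID of the antenna used


@dataclass(frozen=True)
class Antenna:
    """One row of the static cell plan; coordinate fields are assumed."""
    antenna_id: str
    x: float
    y: float


def join_cell_plan(events, cell_plan):
    """Pair each event with its antenna record; skip unknown antennas."""
    return [
        (event, cell_plan[event.antenna_id])
        for event in events
        if event.antenna_id in cell_plan
    ]
```

Note that the record carries no telephone number, IMSI or content of the communication, in line with the description above.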

Incidentally, the method can also handle Timing Advance data as a special form of an event, but this falls explicitly outside the scope of the pilot. Special location determination techniques such as those offered by network providers for 4G and 5G networks also fall outside the scope of this project (and are therefore not used).

Storage and pseudonymisation of source data
The method begins with the processing and storage of signalling data relating to subscribers and users of the MNO's communications services. This is an existing process at the MNO. All subsequent steps are derived from it. As mentioned above, the legal retention period of the signalling data is 6 months.

Signalling data are pseudonymised and provided with an identification number. This is an existing internal security procedure. It is a common internal action in data processing that is also used by data processors in sectors other than telephony to prevent direct traceability of individuals within an organisation. This pseudonymisation ensures that all directly traceable information (telephone number, IMSI, etc.) is removed from the data. The employees of the telecom company have signed a confidentiality agreement and are not allowed to engage in any activities that lead to disclosure.

3.1.2 Step 2 Second pseudonymisation

As a second step, the MNO pseudonymises the signalling data again, i.e. the keys are changed constantly and cannot be stored. This process is repeated every 30 days for all pseudonymised numbers. Among other things, this counteracts disclosure (by researchers themselves within the company) of groups of people (often VIPs with security) whose agendas are publicly known. Much more important is that this process is irreversible. The employees of the telecom company who process the data cannot go back to the original data.
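
The idea of a second pseudonymisation with a key that is never stored can be sketched as follows. This is a minimal illustration under stated assumptions, not the MNO's actual procedure: the use of HMAC-SHA256, the class name RotatingPseudonymiser and its methods are all hypothetical, and the 30-day schedule is assumed to be triggered from outside the class.

```python
import hashlib
import hmac
import secrets


class RotatingPseudonymiser:
    """Keyed one-way mapping whose key exists only in memory (assumed design)."""

    def __init__(self):
        # The key is generated at random and never written to disk.
        self._key = secrets.token_bytes(32)

    def rotate(self):
        """Discard the old key; assumed to be called every 30 days."""
        self._key = secrets.token_bytes(32)

    def pseudonymise(self, first_pseudonym: str) -> str:
        """Map a first-stage pseudonym to a second-stage pseudonym."""
        digest = hmac.new(
            self._key, first_pseudonym.encode("utf-8"), hashlib.sha256
        )
        return digest.hexdigest()
```

Within one period the same input always yields the same output, so events of one device can still be grouped. After a rotation the old key is gone, so pseudonyms from different periods cannot be linked to each other or traced back.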

Further on in the process, considerably more privacy safeguards are built in. Nevertheless, also in this process step, the risks of traceability of the data within the MNO have already been considered. The risks of traceability of the simulated data by the researchers themselves have been examined. We see two risks. The first risk consists of cracking the key. The second is the risk of disclosure on the basis of pseudonymised signalling data, as indicated in the recommendation of the Article 29 Working Party (hereinafter WP29).

Cracking the key is difficult and takes a huge amount of computing power. As indicated in the European WP29 advice, the tracing of individuals by estimating their location based on signalling data has become extremely difficult due to the second pseudonymisation, in which no key is produced. The literature mentions that disclosure is easily achieved if a few of the person’s locations are known. However, that theory does not apply to this particular practice.

Suppose that someone makes a deliberate illegal attempt to disclose individual (telephony) data based on estimated geodata from signalling data.

First, sorting has to take place from a dataset that contains 40 to 200 billion records per month. This is extremely difficult because the IMSI or other variable is unknown.

Sorting based on pseudonymised data from within the telecom company requires collecting extra information on someone’s agenda with at least 4 accurate locations per day, plus the associated time of day. A large part of the route has to be known in advance in order to enable tracing. It is not enough to merely know someone’s work location and place of residence. A total of 4 locations is needed in order to achieve a 95% success rate.

Then, the known elements (places and times) must be translated into a route of cells and antennas that have to be retrieved from the source dataset. There will be false positive matches that lead to the wrong antenna, and so to the wrong corresponding encrypted, and as such pseudonymised, IMSI. This results in quite a labour-intensive process. It will also require extra software to be written. The above is not possible with the existing process algorithms. “Route information” simply is not included anywhere in the process.

This implies that an illegal automated process would need to be devised within the walls of the telecom provider in order to establish the route that is taken by an individual. That route will not contain exact locations such as those enabled by GPS data but areas that may vary per location. This exercise is many times more complex and arduous than if it were taking place in a “regular” dataset such as that of a survey. We estimate that any attempt to trace such information would take at least 4-5 days’ work plus an excessive amount of computing power drawn from the existing IT facilities. Very few employees at the telecom company have access to the original signalling data. All of them have signed a confidentiality agreement. In addition, a conspiracy would need to be formed between said employees, the data analysts at the company and the employees in charge of IT security.

After all, there are security checks in place within the IT environment of the telecom provider that will identify any excessive use of computing power and there is logging of the actions performed by employees. Although the inherent residual risk of deliberate re-identification is by no means zero, it is many times harder to achieve than with a regular dataset such as information at the tax authorities or in medical file systems. The trouble involved in active disclosure combined with the security precautions effectively result in an extremely remote risk. Tracing of information using other more traditional datasets would be achieved more easily, mainly because of the matches with unique variables such as age or sex, which are completely absent in this case.

3.1.3 Step 3 Producing location estimates (MobLoc module)

In the MobLoc module, the location is estimated with the help of a Bayesian model, using the coverage area of a cell. In order to arrive at the most reliable aggregated estimates, the devices are redistributed and spread out. This means that no exact location is determined, as would be the case with GPS data, but an estimate that takes the probabilities into account. The aim of this step is to provide estimates of the number of devices per municipality, based on the signalling data. This involves the use of the antenna map (reference data). In addition, publicly available information may be proposed by Statistics Netherlands to enhance results, such as land use and elevation maps. This concerns static information.

The algorithm used for location estimation works as follows: the land is divided into grids measuring 100x100 metres. These are grids that (similar to pixels) are easily translated into the fluid borders of municipalities. In cases where a device connects to a cell, the probability of the device being located in a particular grid is estimated for all grids located nearby. This involves the use of data from the antenna map, such as location of the antennas / cells as well as the physical properties such as height and direction angle. These probabilities are multiplied by prior probabilities. These are standardised estimates of the number of devices per grid produced on the basis of auxiliary information, such as the degree of urbanisation. The estimated probabilities are called posterior probabilities. 
The cell IDs from the data are linked to these posterior probabilities, so that an estimate is made of the number of devices per municipality. The location is therefore not very precise: statistical noise is added because of this step. An additional effect of this method is that numbers of devices are "scattered" in space and thus redistributed. There is dispersion over time as well. The model works in one-hour batches. If there is only one measurement in a specific hour, this data is extrapolated over the other minutes within that hour. There is no spread between the hours as a unit of time. The dispersion in space and time adds extra noise that changes composition every hour. This means that the relationship between the counts per area and the counts per mast is a system of linear equations, with fewer equations than unknowns and therefore not uniquely invertible.
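
The calculation described above can be sketched in a few lines. This is a simplified illustration, not the MobLoc implementation: the function names, the dictionary-based representation of grids and the input values are assumptions. The sketch multiplies the connection probability per grid by the prior, normalises the result to obtain the posterior, maps grids to municipalities, and sums the resulting distributions over devices.

```python
def posterior_over_tiles(connection_prob, prior):
    """Posterior per grid: connection probability times prior, normalised."""
    unnormalised = {
        tile: p * prior.get(tile, 0.0) for tile, p in connection_prob.items()
    }
    total = sum(unnormalised.values())
    if total == 0:
        raise ValueError("no grid has a positive probability")
    return {tile: value / total for tile, value in unnormalised.items()}


def municipality_distribution(posterior, tile_to_municipality):
    """Add up the posterior probabilities of all grids in each municipality."""
    result = {}
    for tile, p in posterior.items():
        municipality = tile_to_municipality[tile]
        result[municipality] = result.get(municipality, 0.0) + p
    return result


def expected_devices(distributions):
    """Sum the per-device distributions into estimated device counts."""
    totals = {}
    for distribution in distributions:
        for municipality, p in distribution.items():
            totals[municipality] = totals.get(municipality, 0.0) + p
    return totals
```

Because each device contributes a probability distribution rather than a single location, the estimated counts per municipality are generally fractional.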

We illustrate this with an example. Suppose we obtain the respective probabilities of 0.5, 0.3 and 0.2 for device 1 in the hour 8:00-9:00 across municipalities A, B and C. This means we consider it most probable that the device was located in municipality A, but we leave the probability distribution as it is - we do not definitively assign the device to A. Suppose that for another device, device 2, in the same hour we estimate the probability distribution for municipalities A, B and C to be 0, 1 and 0. In other words, we suspect that the device was located in municipality B for the entire hour. Then we add up the probability distributions for both devices: 0.5, 1.3 and 0.2. The resulting estimate is that there were 1.3 devices in municipality B.

3.1.4 Step 4 Adding the place of residence (MobProp(erties) module)

In this step, the municipality of residence is estimated. In this pilot, an indication for the place of residence is that the mobile device is most frequently located in the vicinity of a certain antenna. Again, this is not a pinpoint location: it uses the location estimate from MobLoc. Finally, the "residential population" counts are estimated. Each mobile device is assigned to a so-called residential antenna. This is the antenna to which the device has been connected most of the time during the observation period. The mass of the device is then distributed over the municipalities in the range of the residential mast. In general, therefore, the device is assigned to more than one residential municipality - it is a probability distribution over one or more municipalities.
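
The assignment of a residential antenna and the distribution of the device over municipalities can be sketched as follows. The function names and the input formats (connection time in seconds per antenna, coverage shares per municipality) are assumptions made for illustration; the document does not specify how connection time or coverage is measured.

```python
from collections import Counter


def residential_antenna(connections):
    """Return the antenna with the largest total connection time.

    `connections` is an iterable of (antenna_id, seconds) pairs covering
    the observation period (hypothetical input format).
    """
    totals = Counter()
    for antenna_id, seconds in connections:
        totals[antenna_id] += seconds
    if not totals:
        raise ValueError("no connections observed")
    return totals.most_common(1)[0][0]


def residence_distribution(antenna_id, coverage_shares):
    """Spread one device over the municipalities in range of the antenna.

    `coverage_shares` maps antenna_id to {municipality: share}; the shares
    are normalised so that the result sums to 1.
    """
    shares = coverage_shares[antenna_id]
    total = sum(shares.values())
    return {municipality: s / total for municipality, s in shares.items()}
```

The result is a probability distribution per device, not a single municipality of residence.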

3.1.5 Step 5 Automatic data processing and condensing (MobCube module)

The dataset is condensed by making a selection of an area and a time period and processing it in MobCube. This last step is irreversible because a lot of information is destroyed (80-90%). After this step there are aggregated counts left and there are no longer any individual records. Basically, one estimated device is equivalent to one aggregated calculation.

The data from MobLoc and MobProp are both processed automatically and aggregated into statistical information. This is actually a count. An example of such a count is described in figure 3.1. If a person resides in the municipality of Maastricht and is in the municipality Utrecht at 12:00 PM, then this is counted in the table cell (column Residence) Maastricht and (row) Utrecht. If the same person is in the municipality of Eindhoven at 11:00 AM, they will be counted in the cell Maastricht (Residence column) and the cell Eindhoven (row). There is no link between the different hour counts and therefore no route (Maastricht-Eindhoven-Utrecht) can be calculated with these results.
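
A minimal sketch of this counting step is given below. The function name build_cube and the input format are hypothetical, and combining the residence and presence distributions by simple multiplication is an assumption made for illustration; the document does not state how MobCube combines the two.

```python
from collections import defaultdict


def build_cube(device_hours):
    """Aggregate per-device distributions into residence x presence counts.

    `device_hours` is an iterable of (hour, residence, presence) tuples,
    where residence and presence are {municipality: probability} mappings.
    No device identifier is carried into the result.
    """
    cube = defaultdict(float)
    for hour, residence, presence in device_hours:
        for home, p_home in residence.items():
            for here, p_here in presence.items():
                cube[(hour, home, here)] += p_home * p_here
    return dict(cube)


# One device residing in Maastricht, seen in Eindhoven at 11:00
# and in Utrecht at 12:00.
example = build_cube([
    (11, {"Maastricht": 1.0}, {"Eindhoven": 1.0}),
    (12, {"Maastricht": 1.0}, {"Utrecht": 1.0}),
])
```

The two hours end up in separate table cells with no shared key, which is why no route can be reconstructed from the output.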

Figure 3.1

3.1.6 Step 6 Output check on anonymised data and transmission (MobSafe module)

After MobCube, there is virtually no traceable data left in the dataset. It is possible that some table cells contain very small counts, for example the number of people residing in a small municipality and present in another municipality that is not nearby. Therefore, in order to avoid any risk of disclosure, a definitive statistical protection takes place in MobSafe, in which all output with a count of fewer than 15 devices is permanently removed. These table cells remain empty.
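
The suppression rule can be sketched as follows. The function name and the use of None to represent an empty table cell are assumptions for illustration; the threshold of 15 devices is taken from the text above.

```python
MIN_DEVICES = 15  # counts below this value are removed


def suppress_small_counts(cube, threshold=MIN_DEVICES):
    """Replace every count below the threshold with None (an empty cell)."""
    return {
        key: (count if count >= threshold else None)
        for key, count in cube.items()
    }
```

For example, a cell with an estimated 14.9 devices is emptied, while a cell with exactly 15 devices is kept.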

Again with this step, information is lost. The MNO finally performs a manual check as well to see whether the process has been carried out correctly. The outcome of this process at the telecom company is aggregated and anonymised information that is delivered to CBS. In terms of format, this is the same table as described in Figure 3.1. However, the minimum count of 15 devices once again destroys a large part of the data: more than 80% of the cells in this table are affected (see also chapter 4). These cells are given a missing value (a dot or a dash). The result of the technique applied in steps 1 through 6 is permanent. Because of all the measures in place, individual data cannot be deduced from the table. Moreover, linking with external datasets could not lead to disclosure. It is therefore impossible to trace the aggregated and anonymised information in this table back to individual personal data.

The checked anonymised aggregated data is sent to Statistics Netherlands via a secure connection. The information is treated as highly business-sensitive as it provides insight into the distribution across the country and at the municipal level of subscribers of the MNO. The information is therefore encrypted before it is sent.

3.2 Further processing at CBS

3.2.1 Step 7 Correction of data (MobCal(ibration))

Corrections take place at CBS in order to arrive at representative data on the Dutch population based on the received data. After all, the objective is to say something about the population and not about mobile devices or the population registered by telecom providers. This correction takes place on the basis of registration data from the Personal Records Database (BRP) according to a weighting model. The data at the municipal level (i.e. not at the personal level) may also be enriched with other CBS data, e.g. from the vacation survey or other register information.
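
The weighting model itself is not specified in this document. Purely as an illustration of the principle, the sketch below assumes the simplest possible form: one weight per municipality of residence, equal to the registered population divided by the estimated number of devices residing there. The function names and input formats are hypothetical, and the actual MobCal model may differ substantially.

```python
def calibration_weights(devices_by_residence, registered_population):
    """One weight per municipality: registered persons per estimated device."""
    return {
        municipality: registered_population[municipality] / devices
        for municipality, devices in devices_by_residence.items()
        if devices > 0 and municipality in registered_population
    }


def apply_weights(cube, weights):
    """Scale each (hour, residence, presence) count by the residence weight.

    Empty cells (None) and cells without a weight stay empty.
    """
    result = {}
    for (hour, home, here), count in cube.items():
        if count is None or home not in weights:
            result[(hour, home, here)] = None
        else:
            result[(hour, home, here)] = count * weights[home]
    return result
```

In this simplified form, a municipality with 1,000 registered residents and 250 estimated resident devices receives a weight of 4, so that each counted device stands for four persons.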

3.2.2 Step 8 Processing towards relative values

The weighting in step 7 results in an adjusted count. The counts have not yet been validated in this pilot. This means that it is uncertain whether the numbers are correct for all municipalities. For the visualisation in the beta product, the values are therefore translated into relative differences. This means that the numbers are translated into percentages of people originating from and destined for a municipality. For example, at 12:00 PM, 8% of all Amsterdam residents left for Almere. This is mainly related to the quality of the results, but may also be seen as an extra precaution against group disclosure with outside information. The format remains the same as described in figure 3.1, but it concerns ratios / percentages that can be read in the visualisation at a later stage (MobVis).
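
The conversion to relative values can be sketched as follows. The function name is hypothetical, and the choice of denominator (the sum of all non-empty cells for the same hour and municipality of residence) is an assumption; the document does not state which total is used.

```python
def to_percentages(cube):
    """Express each count as a percentage of its (hour, residence) total.

    Empty cells (None) are ignored in the totals and stay empty.
    """
    totals = {}
    for (hour, home, _here), count in cube.items():
        if count is not None:
            totals[(hour, home)] = totals.get((hour, home), 0.0) + count

    result = {}
    for (hour, home, here), count in cube.items():
        total = totals.get((hour, home), 0.0)
        if count is None or total == 0:
            result[(hour, home, here)] = None
        else:
            result[(hour, home, here)] = 100.0 * count / total
    return result
```

With hypothetical counts of 920 Amsterdam residents present in Amsterdam and 80 present in Almere in the same hour, the Almere cell becomes 8%.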

3.2.3 Step 9 Publication

The results have been made publicly available in a new visualisation (MobVis) that has been specially developed for this purpose.