# Uncovering temporal changes in Europe’s population density patterns using a data fusion approach

Sep 15, 2020

### General framework

We have developed a multi-layered dasymetric approach that models the spatial distribution of different population groups separately and according to a selection of covariates derived from novel geospatial data sources. The methodology follows four interlinked phases (Fig. 5) as follows: (1) estimation of monthly and regional stocks of population groups; (2) mapping of LU features relevant to the location of the population groups; (3) dasymetric disaggregation of population group stocks to their most likely locations within regions; and (4) quality assessment by means of a cross-comparison against independent datasets for selected countries.

### Estimation of monthly and regional population stocks

Based on their expected differences in spatial behavior, we distinguish 16 population groups. The basic idea is that a person’s location is determined by his or her main activity, as he or she is expected to spend most time there. Students, e.g., can be associated with education facilities (e.g., schools and university campuses) and workers with a range of service and production facilities depending on the economic sector they work in. In practice, we constructed monthly matrices with stocks of these population groups for each NUTS3 region in the study area for the reporting year 2011 (12 matrices, each with dimension = 1311 regions × 16 population groups). The NUTS classification is a hierarchical system of nested territorial units used for statistical data reporting in Europe. The NUTS3 level corresponds to country provinces or districts and comprises 1311 regions within the area of interest, with a median size of 1717 km2 (NUTS3 version 2010). NUTS3 are aggregations of municipalities, whereas census zones are small-area units within each municipality. In this study, census zones were only used directly in the cross-comparison exercise explained further down.

The 16 population groups include residents, employees, students, the non-working and non-studying population, and tourists, as detailed next.

Residents correspond to the number of registered residents within a region (obtained from Eurostat at NUTS3 level). Employees are subdivided in 11 economic sectors based on the NACE rev.2 classification of economic activities (Supplementary Table 4) and were obtained from Eurostat reflecting the NUTS3 region of work.

Students are broken down in two main educational levels: primary plus secondary education and tertiary education and above. Student statistics were available from Eurostat at NUTS2 level. Students below tertiary education were distributed among the respective NUTS3 regions based on the proportion of the relevant population age groups. Higher education students were downscaled to NUTS3 regions based on the number of enrolled students per NUTS3 available from the European Tertiary Education Register, reporting year 2011 (https://www.eter-project.com). In months with more than 50% of school/academic holidays, students were considered part of the non-working and non-studying population group. Country-specific holiday calendars for schools and universities were obtained from existing European-wide inventories70,71.

The non-working and non-studying ((N)) population group can be associated with residential areas in both day- and nighttime. It was calculated at NUTS2 level as:

$$N = U + left( {R – A – S} right)$$

(1)

based on the number of unemployed, (U), residents, (R), active population, (A), and students, (S), from Eurostat. (N) was then downscaled to NUTS3 level proportionally to the population size.

In our approach, the monthly variation of the total present population in a region is primarily linked to inbound and outbound flows of people that visit and leave regions for any purpose, leisure, and business alike. We refer to them as tourists and the estimation of their monthly and regional totals involved several steps. First, annual number of nights spent within each NUTS2 region (Eurostat) were disaggregated to NUTS3 regions proportionally to the number of bed places in touristic accommodations available per NUTS3 from Eurostat. The resulting NUTS3 annual number of nights spent were broken down per month using regional (NUTS2 or NUTS3) seasonal curves constructed from data procured from National Statistical Institutes. Finally, we divided the regional and monthly nights spent by the number of days of each month to obtain the average daily number of inbound tourists. A detailed account of the methodology and input data has been published elsewhere66.

From National Statistical Institutes and the Organisation for Economic Co-operation and Development, we obtained the share of inbound tourists per country of origin or groups of countries of origin. In the latter case, we split tourists per country of origin employing a model based on geographical distance and economic size (i.e., Gross Domestic Product), assuming larger and closer economies draw higher quantities of tourists. Then we summarized outbound tourists per country of origin. Finally, from Eurostat we obtained the fraction of tourism going outside the EU and added it to the previous sum to obtain the total amount of outbound tourists per EU country. Tourists from countries outside the study area represent added population to the existing stock and therefore did not need any further treatment. Tourists from the same country (domestic) or from countries within the study area (non-domestic) had to be subtracted from their countries and regions of origin to avoid double counting of total population within the study area. The share of outbound tourists per region within each country was assumed to be proportional to their demographic size. The resulting outbound tourists per region were finally subtracted from the different population groups proportional to their size.

### Mapping of land-use features

LU and LC features are widely used as covariates in dasymetric population mapping2,3,4,6,7,20. In this phase, we constructed the set of geospatial layers to be used as ancillary information in the population disaggregation process. Ultimately, we created two distinct types of input data as follows: (a) a fine-grained LULC map and (b) a set of activity density layers.

The LULC map was produced by integrating geospatial data from a wealth of sources. The map is originally based on the CORINE Land Cover (CLC) 2012 map and nomenclature, but achieves a significantly higher thematic and spatial detail. The 11 artificial LU classes from CLC are subdivided in 18 more specific classes, including Production facilities, Commercial or service facilities, Public facilities, and Airport terminals, which were instrumental for the allocation of certain population groups. The minimum mapping unit in this new map is 1 ha for artificial surfaces and 5 ha for others, as opposed to 25 ha in the original CLC data (see Supplementary Note 1 for more details). The production and validation of this novel map has been documented in a dedicated article72.

Recent findings indicate that the quality of dasymetric population mapping can be increased through the integration of Point of Interest (POI) data73. Therefore, complementary to the LULC map, we built a set of activity layers based on POI and polygon data extracted from TomTom Multinet and OpenStreetMap, to represent locations of activities and facilities associated with the presence of students and workers. The selection of these features was based on correspondence with the considered population groups (students and workers from 11 economic sectors). For each population group (e.g., workers in the manufacturing sector), the relevant features (e.g., factories) were processed into a single binary raster layer with a 100 × 100 m resolution and were treated as an additional LU class in the subsequent disaggregation step. For the disaggregation of tourists, we built a layer reporting touristic accommodation room density based on data from online booking platforms66. These activity layers were necessary to capture overlapping activities and because conceptual differences between point-based data and the polygon-based LULC map that make their integration difficult. In the Supplementary Note 1, we further discuss the adequacy of the used POI data sources.

### Dasymetric disaggregation of population

We downscaled each of the 16 population groups originally from NUTS3 regions to 100 m pixels for each month of the year using a two-tier approach. In essence, the stock of a population group within a region is first divided over the LU types relevant to that group proportional to the occurrence of these LU types within the region (Eq. 2). Consequently, population groups can be associated with multiple LU types (see Supplementary Tables 2 and 3). Conversely, specific LU types may be associated with multiple population groups, in which case they will co-occur in the same LU. In the second step, the number of persons per LU type are allocated to individual grid cells based on built-up density (Eq. 3).

$$P^{prime}_{j,r,u} = P_{j,r} * left( {frac{{Q_{r,u} !*! w_{!u}^{,j}}}{{mathop {sum }nolimits_u Q_{r,u} !*! w_{!u}^{,j}}}} right)$$

(2)

$$P^{prime}_{j,r,i} = mathop {sum }limits_u left[ {P^{prime}_{j,r,u} * left( {frac{{d_{r,u,i}}}{{mathop {sum }nolimits_i d_{r,u,i}}}} right)} right]$$

(3)

where (P^{prime}) is the estimated population of a given (j) population group, in pixel (i) within a NUTS3 region (r). (Q) is the count of 100 m grid cells in a region of a given LU class (u) and (w) is a Boolean parameter that establishes the link between LU classes and population groups (1 if population group (j) is linked with LU class (u), 0 otherwise). The links were based on expert judgment and can be consulted in Supplementary Tables 2 and 3. Finally, (d) is the built-up density based on the European Settlement Map 2012–release 2017 (https://land.copernicus.eu/pan-european/GHSL/european-settlement-map). In this dataset, built-up density is the percentage of surface covered by all roofed constructions without considering building volumes or density of activities. See Supplementary Note 1 for more details concerning this dataset.

There were some exceptions to this general approach. Residents were downscaled from the 1 km2 GEOSTAT grid (https://ec.europa.eu/eurostat/web/gisco/geodata/reference-data/population-distribution-demography/geostat), consistent with the Census 2011, and employing the same rationale as in Eqs. 2 and 3. The non-working and non-studying population layers were obtained by applying NUTS3-specific ratios between the non-working and non-studying, and the number of residents to the number of residents at 100 m level. The total number of tourists per NUTS3 was downscaled twice, generating two distinct grids for each month of the year as follows: (a) one grid reflecting their nighttime distribution (based on the touristic accommodation room density layer mentioned above) and (b) one grid reflecting their daytime distribution (based on a set of LU classes).

In total, the downscaling procedure generates 204 intermediate population grids, i.e., 12 months × 17 population groups (15 population groups + 2× tourists), at a spatial resolution of 100 × 100 m. For each month of the year, the respective nighttime population grid was the result of the sum of the gridded residents with the gridded tourists at nighttime. Conversely, the daytime population grid was the result of the sum of the 15 remainder population group grids (see Eq. 4). The final 24 grids were obtained by aggregating the 100 m pixel values to the target 1 km2 grid cells.

$$P_{i,r,t}^prime = mathop {sum }limits_{j = 1}^n P_{j,r,i}^prime ,,,,{mathrm{where}},nleft( t right) = left{ {begin{array}{*{20}{c}} {n = 2,{mathrm{if}},t = {hbox {‘}} !{mathrm{nighttime}}{hbox {‘}} } \ {n = 15,{mathrm{if}},t = {hbox {‘}} !{mathrm{daytime}}{hbox {‘}} } end{array}} right.$$

(4)

Although the final maps are provided at 1 km2 resolution, the disaggregation was executed at the native 100 × 100 m resolution of the LULC map to leverage the maximum possible available detail of the input data (resampling the LULC map to 1 × 1 km would result in a gross generalization of the LULC classes). Moreover, this allowed us to preserve the highest possible resolution for more detailed inspection of the results.

### Cross-comparison

To assess the quality of the produced grids, a series of cross-comparison analyses was performed for areas where independent night- and daytime population estimates at sub-NUTS3 level were available. The comparison was operated at the level of the native spatial units of each independent dataset, here denoted as (m). For this purpose, our population estimates were aggregated from grid level to the relevant spatial units. For each country (c) and temporal frame (t), we computed an accuracy metric herein called allocation accuracy, (AA) (Eq. 5). It can be interpreted as the percentage of the population stock that has been allocated to the correct spatial units. Metrics based on the sum of absolute errors are common in dasymetric mapping evaluation, because they are more robust in the presence of outliers or for skewed distributions6,7,74. In Eq. 5, the asterisk denotes the independent dataset.

$$AA_c^t = left[ {1 – left( {frac{{frac{{mathop {sum }nolimits_{m = 1}^n left| {P^prime – P ast _m^t} right|}}{2}}}{{mathop {sum }nolimits_{m = 1}^n P ast _m^t}}} right)} right] * 100$$

(5)

For Italy, Portugal, and Spain, we obtained Origin–Destination matrices with commuting of students and workers between municipalities from the census carried in 2011. This allowed us to recreate the likely size of the daytime population of each municipality by simply subtracting and adding the number of incoming and outgoing students and workers to the number of residents. These census-based daytime population values do not include tourists. Hence, for the purpose of the cross-comparison, we generated grids that did not take into account inbound and outbound tourists.

We further compared the population totals in our grids for Belgium with a dataset based on cellphone records from the Proximus, the leading mobile network operator in the country, accounting for nearly 40% of the mobile subscriptions. The number of cellphone records was calculated by Proximus based on signaling data, so capturing connections between the mobile devices and cell towers at high temporal frequency. The dataset consisted of mobile-phone user counts at the level of Voronoi polygons around each cell tower in the country, with a temporal breakdown of 15 min for a specific weekday outside the holiday season (i.e., Thursday, 08/10/2015). Based on this, we produced day- and nighttime frames based on the average observed counts in the periods 9:30–11:30 a.m. and 3:00–5:00 a.m., to capture core working and sleeping hours, respectively. We dissolved Voronoi polygons smaller than 1 sq. km with surrounding polygons to avoid spatial units smaller than the size of our final grids. Although this dataset cannot be used directly to predict total population, as Proximus covers a limited market share in Belgium, we assume it reflects the relative presence of population in space and time. Hence, before applying Eq. 5, we rescaled the number of mobile-phone users to match the total population in Belgium, assuming a constant market share across regions, as in previous studies47.

### Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.