Policy Research Working Paper 9979

Where Are All the Jobs? A Machine Learning Approach for High Resolution Urban Employment Prediction in Developing Countries

Samira Barzin, Paolo Avner, Jun Rentschler, Neave O’Clery

Urban, Disaster Risk Management, Resilience and Land Global Practice
March 2022

Abstract
Globally, both people and economic activity are increasingly concentrated in urban areas. Yet, for the vast majority of developing country cities, little is known about the granular spatial organization of such activity despite its key importance to policy and urban planning. This paper adapts a machine learning based algorithm to predict the spatial distribution of employment using input data from open access sources such as Open Street Map and Google Earth Engine. The algorithm is trained on 14 test cities, ranging from Buenos Aires in Argentina to Dakar in Senegal. A spatial adaptation of the random forest algorithm is used to predict within-city cells in the 14 test cities with extremely high accuracy (R-squared greater than 95 percent), and cells in out-of-sample “unseen” cities with high accuracy (mean R-squared of 63 percent). This approach uses open data to produce high resolution estimates of the distribution of urban employment for cities where such information does not exist, making evidence-based planning more accessible than ever before.

This paper is a product of the Urban, Disaster Risk Management, Resilience and Land Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at samira.barzin@maths.ox.ac.uk, pavner@worldbank.org, jrentschler@worldbank.org, and oclery@maths.ox.ac.uk.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Where Are All the Jobs? A Machine Learning Approach for High Resolution Urban Employment Prediction in Developing Countries

Samira Barzin (a,*), Paolo Avner (c), Jun Rentschler (c) and Neave O’Clery (a,b)
a Mathematical Institute, University of Oxford
b The Bartlett Centre for Advanced Spatial Analysis, University College London
c The World Bank
Corresponding Author: * samira.barzin@maths.ox.ac.uk

Keywords: development economics, urban economics, cities, firm locations, big data, satellite data, computational methods, machine learning
JEL Classification: O18, C6, R3, R11

Acknowledgements: The authors are grateful to Tatiana Peralta Quirós, who was involved in an early effort to develop a methodology to predict employment distribution in urban areas and was of great help in accessing employment datasets.
We also thank Holly Krambeck, Nancy Lozano, Lorenzo Carrera, Samuel Heroy and Sam Fankhauser for helpful feedback on an earlier version. This study was supported by the Global Facility for Disaster Reduction and Recovery (GFDRR) and by the Foreign, Commonwealth and Development Office (FCDO) through the Multi-donor Trust Fund on Sustainable Urbanization, and the Oxford Martin Programme on Informal Cities. The computational aspects of this research were supported by the Wellcome Trust Core Award Grant Number 203141/Z/16/Z and the NIHR Oxford BRC. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health. 1 Introduction Cities across developing countries have experienced unparalleled population growth in recent decades, fueled both by rural-urban migration and high population growth (Hen- derson and Kriticos (2018), Bryan et al. (2020)). These dynamics differ substantially from tra- ditional urban growth patterns experienced by European and North American cities, par- ticularly in terms of speed and the noticeable absence of structural transformation and in- dustrialization (Henderson and Kriticos, 2018). In general, urbanization is strongly associ- ated with economic growth on a national scale (Glaeser and Xiong, 2017) and agglomera- tion economies (economic benefits arising from spatial clustering of economic activity) (Bryan et al., 2020). However, negative externalities arising from high urban density, such as con- gestion and insufficient sanitation infrastructure, require well planned policy action in terms of, for example, infrastructure and institutions. It is thus crucial to improve our understanding of the dynamics and inner functionalities of cities in order to design more targeted policies to support successful urban development. In particular, gaining an insight into how jobs are spatially distributed is key to understanding urban form and internal labor market structure. These assessments matter. Metrics that estimate accessibility to jobs aim to capture in a simple fashion how an urban area’s transport systems and land use perform in connecting workers to employment opportunities, a core function of cities (Duranton and Puga, 2015). It is also useful to uncover areas of a city that are underserved by public transport systems, where residents have low access to jobs (i.e., spatial mismatch) (Gobillon and Selod, 2014) and where public interventions would be beneficial. In fact, employment accessibility studies have become a mandatory pre-requisite for World Bank financed urban transport operations in recent years. In parallel, there is increasing recognition that natural disasters and climate change impose significant costs on societies not only through damage to infrastructure assets but also through disruptions to the infrastructure services that these systems enable (Halle- gatte et al., 2019). Disruptions to commuting movements entail large opportunity costs in the form of lost time, discomfort or foregone salaries (He et al., 2021). Minimizing or even preventing these disruptions is possible through targeted maintenance of, or investments in, road infrastructure identified through network criticality analyses, but these hinge on an understanding of job locations. Finally, there is much to gain in terms of furthering our under- standing of the urban agglomeration forces at play, particularly with respect to employment density (Ciccone and Hall, 1996) and spatial clustering, i.e. 
monocentric, polycentric, or dis- persed (Lall et al., 2021a; Bertaud, 2002). However, in most developing country cities, granular information on jobs is often not available as geo-referenced business registries and employment censuses rarely exist or are incomplete and outdated. Given these exercises are costly and burdensome to organize, requiring the mobilization, training and deployment of surveyors, they are only conducted 2 infrequently and if undertaken often suffer from quality issues. Here we aim to fill this key knowledge gap by developing a new machine learning technique to predict the spatial dis- tribution of employment from openly and widely available data on amenities, roads, light intensity and other variables such as population density. A burgeoning literature has recently emerged that harnesses satellite data and so-called ‘big data’ to uncover socio-economic and spatial patterns of cities. Of particular note is a milestone study by Henderson et al. (2012) where the authors show that night light intensity derived from satellite data can be used as a proxy for income/GDP. A second set of studies focus on estimating the spatial extent of urban areas in developing countries. For instance, Henderson et al. (2021b) rely on satellite based population data to identify urban areas and estimate wages for cities in six SSA countries. Focusing on building characteristics, Lall et al. (2021b) and Henderson et al. (2021a) rely on building height data to investigate urban form for 397 cities and identify slum areas in Nairobi respectively. Another valuable source of data is derived via digital (often crowd-sourced) maps, e.g., Open Street Map (OSM). For example, Baruah et al. (2021) combine satellite based population and night lights data with OSM data on the road network to compare the urban form of cities formally under anglophone and francophone colonial rule, while Soman et al. (2020) rely on OSM data to detect informal settlements. A related strand of literature combines non-traditional data with machine learning ap- proaches in order to predict a range of socio-economic parameters thus overcoming data sparsity challenges in developing countries. For example, Jean et al. (2016) rely on high res- olution satellite imagery in combination with convolutional neural network (CNN) analysis to predict wealth across five SSA countries, where CNN is used to extract roads and other features from the raw pixels of the satellite imagery. Yeh et al. (2020) extend this work and predict asset wealth at high resolution for 20,000 villages located across SSA. Similarly, En- gstrom et al. (2021) exploit deep learning methods to predict poverty and consumption at the local level across Sri Lanka. Blumenstock et al. (2015) alternatively rely on mobile phone data to infer socio-economic status at an individual level via a machine learning classification analysis. While these studies represent just a fraction of efforts in the literature, they capture an emerging trend in the deployment of novel data and machine learning to fill gaps in data collection in low resource environments. In terms of novel approaches to specifically estimating employment, the literature is more limited. The ‘gold standard’ is census data on firms or employees, typically assem- bled by government agencies in the course of collecting taxes. It is hence normally limited to the formal sector. 
Travel surveys, in contrast, typically have much more limited coverage of the population, but can often be statistically upweighted to provide a reasonably accurate picture of employment distribution - and may also include the informal sector. However, as is the case for business registries and employment censuses, travel surveys are also scarce 3 and conducted inconsistently. To overcome this data gap, some scholars have inferred the presence of firms from imagery, including both street imagery (Straulino et al., 2021) and satel- lites (Goldblatt et al., 2020). However, these approaches are severely limited to picking up just what is ‘seen’ from the exterior or shape of buildings. Other authors have successfully inferred work locations (among other information) from mobile phone data, either through Call Detail Records, CDRs, (Kreindler and Miyauchi, 2021; Zagatti et al., 2018; Louail et al., 2015) or more granular information obtained from anonymized GPS logs collected from smartphone users through third party apps (Miyauchi et al., 2021; Yabe et al., 2020, 2021). These approaches are promising but have several drawbacks. First, they are time consuming, as obtaining agree- ment from mobile phone subscription providers or third parties for data use, accessing the data in a secure manner and processing it usually takes several months. Second, accessing this data can be costly when not made freely available for humanitarian purposes (Lu et al., 2012; Bengtsson et al., 2015). Thirdly, there are concerns that this type of data may lack pro- portional coverage over space, notably for CDRs where cell phone towers are geographically unequally distributed. Similarly, coverage is likely to vary across sub-populations, especially when the data comes in the form of GPS logs obtained from smartphones that have a low penetration share in developing country economies and are mainly owned by higher income groups. Quick to deploy, cheap and scalable alternative methodologies to understand the spatial distribution of jobs in urban areas are urgently needed. Here, we develop a novel method to predict the spatial density of employment using open freely available data from web-based sources including OpenStreetMap (OSM) and Google Earth Engine (GEE). OSM contains information on the spatial location of hundreds of amenities and other mapped attributes, ranging from bus stops to shops to the street network. These variables hold invaluable information on human activity, which we hypothe- size is highly correlated with employment density. We supplement the OSM variables with satellite data, such as population, night lights and land cover, accessed via GEE. In order to predict employment density from OSM and GEE data sources, we deploy a range of machine learning algorithms, focusing in particular on two ensemble tree methods - Random Forest (RF) and Gradient Boosting Machines (GBM) - a Regularized Generalized Linear Model and a linear regression. We select the RF algorithms specifically due to their ability to handle noise in the training data, stemming from a mix of measurement error and incompleteness in both the employment and feature data. We propose a simple extension to these models, and extend our feature space to include the spatial version of each feature based on feature val- ues in neighboring cells, thus accounting for local spatial clustering. 
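As a rough illustration of the spatial extension just described (appending neighboring-cell feature values to each cell's own values), the following is a minimal sketch assuming Python with geopandas and libpysal; the file and column names are hypothetical rather than those used in the paper.

```python
# Minimal sketch: append neighbour-averaged ("spatial lag") versions of each
# feature to a GeoDataFrame of hexagonal grid cells. File and column names
# are hypothetical; libpysal/geopandas are illustrative library choices.
import geopandas as gpd
from libpysal.weights import Queen, lag_spatial

cells = gpd.read_file("dakar_hex_cells.gpkg")                 # hypothetical file
feature_cols = ["night_lights", "population", "amenities"]    # illustrative subset

w = Queen.from_dataframe(cells)   # contiguity weights: hexagons sharing a boundary
w.transform = "r"                 # row-standardise, so the lag is the neighbour mean

for col in feature_cols:
    # mean feature value over each cell's adjacent hexagons
    cells[f"{col}_neighbour"] = lag_spatial(w, cells[col].values)
```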
We additionally include Regularized Generalized Linear Models and linear regression based approaches, which while less suited to our noisy data, enable us to delve deeper into variable importance scores, and uncover which features are most indicative of the presence of employment. Our work differs from previous work along a number of dimensions. First, we focus on 4 prediction of employment rather than income or urban extent. Second, we predict employ- ment at high spatial resolution within cities, while much similar work focuses on larger spatial scales. Third, we explicitly aim to develop a framework that can predict employment in an out-of-sample setting, where training data for the city in question is not available. This is a key feature, and rarely tackled in the literature, for usability in data scarce environments. Finally, our approach is scalable and more robust than alternative methods, particularly in compari- son to deep learning approaches which tend to be less robust to noise in the data. We train the model on data derived from open sources, and use methods that require reasonable and accessible levels of computing power. Our work relates most closely to Goldblatt et al. (2020) who explore the correlation between satellite derived features and employment/enterprise at the local level for Vietnam, and Ahlfeldt et al. (2020) who infer ’prime locations’ (defined as locations exhibiting very high density of human capital intensive services) on the basis of Google Places data. We apply our method to 14 cities, ranging from Buenos Aires in Argentina to Dakar in Senegal. The set of cities was chosen for a variety of reasons including both geographical spread and the availability of granular employment data (for training purposes). We note that the source (and quality) of this employment data varies, and is based on both census and travel surveys. We show that we can predict held back within-city cells in our test cities with extremely high accuracy (> 95% R2 ), and cells in out-of-sample cities with high accuracy (mean 63% with max 80% R2 ). The variation in out-of-sample prediction can likely be attributed to a combination of city-specific relationships between employment and our features, and varying degrees of data quality for both the feature and employment data across cities. We investigate this latter effect, finding support in the data for this hypothesis. We hope our work will be widely deployed for policy and humanitarian purposes. Re- cently, it has been noted that academic advances in the application of data science methods to development questions has yielded little real world impact to date (Burke et al., 2021). This was attributed in part to difficulties in interpretability. By using minimally processed input data sources, we avoid the complex processing required for extracting objects, for example, from raw satellite imagery data. We hope that this simplification, combined with the open nature of the input data and the widely-used random forest algorithm deployed, reduces interpretation issues and barriers to use. 2 Data We build the analyses and train the algorithms on a sample set of 14 cities for which we have access to urban employment data with spatial identifiers. This data set consists of 9 cities across Sub-Saharan Africa (SSA) and 5 cities in Latin America (LAT); see Figures 1 (a) and (c). Across cities, the data differs by data collection type and by degree of spatial 5 Table 1: Overview of Data across Cities Mean Size No. of Year No. 
of City Data Type of Polygons 500m x 500m Collected Polygons (km2 ) Grid Cells SSA Abidjan/CIV Population Census 2015 307 14.79 21,121 Dakar/SEN Travel Survey 2015 213 2.51 2,663 Dar es Salaam/TZA 101 17.73 8,436 Douala/CMR Travel Survey 2018 194 1.37 1,399 Harare/ZWE Population Census 2012 75 12.50 4,558 Kampala/UGA Firm Census 2011 184,335* 5.05* 4,721 Kigali/RWA Firm Census 2011 1,162 0.63 3,566 Kinshasa/COD Travel Survey 2018 395 3.68 7,021 Nairobi/KEN Travel Survey 2013 106 6.77 3,532 LAT Belo Horizonte/BRA Firm Census 2019 1,048 3.05 13,687 Bogotá/COL 19,915 0.02 2,190 Buenos Aires/ARG Firm Census 2004/5 11,822 0.30 16,551 Lima/PER Population Census 2010 91,875 0.01 4,988 Mexico City/MEX Firm Census 2019 476,853* 0.33* 4,161 Note: Firm Census refers to firm censuses, establishment censuses or other firm registries, or projections based on these. Data excludes units with missing employment information. * Data was provided in latitude/longitude point locations; data for Kampala was spatially matched to the 2012 Kampala travel survey polygons, data for Mexico City was spatially matched to AGEB polygons disaggregation; city specific details are provided in Table 1. The data is usually collected either via travel/commuting surveys, extrapolated via population censuses or identified from business registries, where the degree of spatial disaggregation is generally higher for the latter, i.e., resulting in smaller polygon units. For cities where we have access to multiple rounds of data, we rely on the most recently collected and/or most spatially disaggregated version of the data1 . First, we calculate the spatial employment density for each polygon provided; see Figure 1(b). In order to achieve homogeneity of spatial units across the cities and higher spatial disaggregation overall, we overlay each city with a grid net of 500m x 500m hexagonal cells. 1 See Appendix A for more information on the data sources. 6 (a) (b) (c) (d) Figure 1: Overview of City Sample and Spatial Distribution of Employment. Subfigures (a) and (c) depict cities included in the data, while subfigures (b) and (d) illustrate the spatial distribution of employment for original polygons and disaggregated into 500m x 500m hexagonal grid cells for Dakar/Senegal. We intersect the original polygon and grid cell layers and obtain spatial employment data for each grid cell through weighting employment values by the spatial intersection of grid cells and polygons. For most cities the data is provided as number of jobs per polygon, making the identification of exact job locations impossible. Hence, we assume homogeneity in spatial distribution of employment within polygons when computing grid cell values. For our 14 cities, we extract OSM data on the cities’ major street networks, public trans- port stations, shop and office locations, amenities, airports, lakes, rivers, parks, forests and coastlines. We further compile satellite data accessed via Google Earth Engine (GEE) on 7 night lights, population, air pollution, geophysical land surface, built-up land cover, the Nor- malized Difference Vegetation Index/NDVI, the Normalized Difference Water Index/NDWI and land use2 . The data on night lights, population, air pollution, NDVI and NDWI contains temporal variation and is thus averaged within its original resolution raster cells across 2015 to 2019 prior to intersecting it with the employment grid cell layer. 
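The area-weighted allocation of polygon-level employment to hexagonal grid cells described above can be sketched as follows; this is an illustrative implementation assuming Python with geopandas, with hypothetical file names, column names and projection rather than the authors' actual pipeline.

```python
# Minimal sketch: allocate polygon-level employment to 500m hexagonal grid
# cells by area-weighted intersection, assuming employment is uniformly
# distributed within each polygon. Names and CRS are hypothetical.
import geopandas as gpd

polygons = gpd.read_file("employment_polygons.gpkg").to_crs(epsg=32628)  # metric CRS
hex_grid = gpd.read_file("hex_grid_500m.gpkg").to_crs(epsg=32628)

polygons["poly_area"] = polygons.geometry.area

# intersect every polygon with every overlapping grid cell
pieces = gpd.overlay(hex_grid, polygons, how="intersection")
pieces["area_share"] = pieces.geometry.area / pieces["poly_area"]
pieces["employment_alloc"] = pieces["employment"] * pieces["area_share"]

# employment per grid cell is the sum over all intersecting polygon pieces
cell_employment = pieces.groupby("cell_id")["employment_alloc"].sum().reset_index()
```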
Both OSM and GEE data is available globally, where OSM is crowd-sourced mapped feature data and thus is assumed to exhibit heterogeneity in data quality and completeness across cities; data included via GEE is assumed to be consistent across included locations. GEE hosts an array of satellite and rasterized data from various sources. For computational arguments and ease of scale- up of the analyses, we limit our choice of satellite data to those available in rasterised form. Consequentially, in contrast to the related literature (see for example Jean et al. (2016) and Yeh et al. (2020)), we do not require deep learning methods or convolutional neural networks to detect and extract features3 from raw satellite pixels. For each city, we compute a bounding box around the overall geographical area and extract OSM data for various features located within this box. In order to obtain the grid level values from the mapped features, we count all point features, e.g. shop locations and public transport hubs in a given grid cell. For line features, e.g. roads and rivers, we extract the length per grid cell, and for polygon features, e.g. parks and airports, we calculate the spatial intersection with grid cells. For GEE derived data, feature values per grid cell are computed via a weighted spatial intersection between the original data raster cells and 500m x 500m hexagonal grid cells. Given that cells on the border of the city are cut by the external outline of the city and are thus smaller than 500m x 500m grid cells, we compute and use the density versions of the values for each grid cell for the analyses. In addition to the extracted features, we also derive further features from the raw OSM and GEE data (e.g. street intersections are derived from the OSM data on the city’s street network and terrain roughness is computed based on the earth’s surface elevation data). Figure 2 provides visualizations of a selection of OSM and GEE features for Dakar/Senegal. A visual comparison between the images contained in Figure 1 and Figure 2 suggests that both employment and features follow a broadly similar spatial distribution pattern across Dakar. The highest values of both employment and the majority of the features can be ob- served in the west and in particular the south west of the city. Commercial, administrative and institutional activities in Dakar are concentrated in the ”Plateau” and ”Fann” neighbor- hoods, towards the South and South-West of the peninsula. Industrial zones are distributed in a slightly more homogeneous manner between Dakar, Pikine and Rufisque (with the ad- dition of Guédiawaye, the four municipalities, together, constitute the Dakar metropolitan 2 See Appendix A for further details. 3 Throughout this work, we use the terms features and covariates, and the terms of response and dependent variable interchangeably. 8 (a) (b) (c) (d) (e) (f) (g) (h) (i) Figure 2: Illustration of Selected Features at 500m x 500m Hexagonal Grid Cells. Subfigures (a) to (e) are density adjusted; NDVI is not logarithm adjusted; all features are within-city scaled. area) but remain very concentrated along the coast. They are found in the vicinity of the Port and along the southern coastline. The rest of the territory is dominated by residential areas that continue to expand to the East of the peninsula. Some features, e.g. major roads, exhibit a high density towards the west of the city, but are also present towards the east, whereas other features, e.g. 
public transport hubs, appear to be nearly absent towards the city’s east. Overall, the distribution of employment within the city exhibits relatively strong spatial clustering with a Moran’s I value of 0.91. Some of the features similarly reveal strong spatial clustering patterns, e.g., night lights, while others are more spatially spread across the city, e.g. public transport hub locations4 . Given that spatially disaggregate data, i.e. features 4 The Moran’s I statistic (Moran (1948) and Anselin (1996)) allows to quantitatively describe the level of spatial 9 in lines, points and polygons, is aggregated while data obtained via larger polygons or grid cells is disaggregated to the grid cell level, i.e. air pollution data comes in 1km2 resolution, higher spatial autocorrelation is present within the latter. Hence, the calculated Moran’s I statistics are influenced both by differences in spatial clustering of the features, but also by the spatial resolution of the original data. In order to reduce the potential impact of the dif- ferent levels of spatial (dis)aggregation on the subsequent analyses, we train our models on a large set of cartographic, geographical and geological features for the predictions of spatial employment within cities. We rely on feature engineering to generate and trial multiple feature variations derived from the original raw data. In other words, we create variations of the original data and include them as additional features. These spatial density adjusted parameters (i.e., neighboring cell values), binary features derived from binned quartiles of the features’ distributions across values, binary features indicating the presence or absence of a given feature in a grid cell, and count and logarithm adjusted versions. Our assembled data set5 comprising all sample cities contains 98,594 observations6 and more than 100 features7 , where each unit of observation is a 500m x 500m hexagonal grid cell. Given that cities differ in their physical extent, and due to a higher presence of original polygons with ’NA’ employment values in some cities, the contributions of cities to the overall sample is not uniform (see Table 1 for details). 3 Methods 3.1 Dimensionality Reduction via Principal Component Analysis (PCA) Identification of patterns within large data sets with many features is often difficult, e.g. due to the presence of collinearities. Methods of Unsupervised Machine Learning are pre- dominantly employed to identify clusters of observations sharing similarities or to compress high dimensional data to reveal patterns within the data more clearly. Within this work, we specifically rely on Principal Component Analysis (PCA) for this purpose (Pearson, 1901). PCA identifies a finite number of linear combinations containing all features (Principal Compo- clustering of a variable, where Moran’s I values generally range from -1 to 1. A Moran’s I of approximately 0 indicates a completely random spatial distribution, negative values point to spatial dispersion and positive values to spatial clustering of a variable. 
Moran’s I values of visualized variables: Employment Density (log): 0.91*** , Major Roads (log): 0.57*** , Intersections (log): 0.66*** , Public Transport Hubs (log): 0.51*** , Amenities (log): 0.62*** , Shops (log): 0.59*** , Night Lights (log): 0.99*** , Population (log) : 0.98*** , NDVI: 0.74*** , Urban Land Cover: 0.76*** (with *** representing p < 0.01, ** indicating p < 0.05 and * pointing to p < 0.1 significance levels computed via Monte Carlo simulations across 500 runs respectively; all variables are analyzed post within-city standardization) 5 Data set and feature space are used interchangeably throughout this work 6 This only includes observations with non-missing information. 7 Features are reshaped according to each model’s requirement resulting in slightly different numbers of vari- ables across models. 10 nents/PCs) while preserving the covariance structure of the original feature space. Each PC consists of a linear combination of all features, where the contribution of each feature within a PC is determined by its assigned feature loading, which are higher for variables more im- portant to the variance of the overall data set. Hence, for each observation of the original data a projected point can be derived via the PCs computed. However, given that PCA is an unsupervised dimensionality reduction technique it does not allow to explicitly focus on a response variable. Additionally, PCAs are limited due to their linearity assumption of the underlying data. Thus, for this work, we limit our use of PCA to exploring and visualizing the data. 3.2 Ensemble Methods: Random Forest and Gradient Boosting Machine Algorithms Models based on decision trees offer an alternative methodological approach of com- putations for prediction problems. Decision trees split the full data into rectangular regions (’leaves’) through covariates and threshold selection to minimize the average squared error. Predictions are then formed on each partitioned data subsample and averaged over all to obtain a prediction model for the full data set (Breiman et al., 1984). In order to avoid overfit- ting, and thus reduced out-of-sample performance, a penalty term is identified to determine optimal hyperparameters, e.g. tree depth, along the trade-off of the model’s complexity and out-of-sample performance8 . While a model based on a single tree offers the advantage that it can be understood relatively intuitively, combined algorithm models generally outperform those derived from a single algorithm. Models building combinations of many decision trees (’Ensemble tree methods’) have thus emerged as reliable extensions to single decision trees delivering ro- bust performance quality. The Random Forest (RF) algorithm (Breiman, 2001) has been es- tablished as one of the most widely used ensemble tree methods across the recent literature (see for example Engstrom et al. (2021)). The RF algorithm generates mean predictions based on many randomly perturbed parallel trees where each tree is identified for a subsample of the overall data, where subsamples are most often identified via bootstrapping (’bagging’). Different from traditional decision tree models, RF models optimize their parameters for a random subset of covariates whose selection differs at each split. This introduces variation within the analyzed data for each tree along both the data splits and covariate selection and results in a better overall fit of the combined algorithm. 
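A minimal sketch of such a random forest regressor is given below, using scikit-learn as an illustrative library choice (the paper does not prescribe an implementation); the hyperparameter values, file name and column names are assumptions.

```python
# Minimal sketch: fit a random forest regressor (bagged trees with a random
# covariate subset at each split) to predict log employment density per cell.
# scikit-learn, file name, columns and hyperparameters are illustrative.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

cells = pd.read_parquet("grid_cell_features.parquet")      # hypothetical file
y = cells["log_employment_density"]
X = cells.drop(columns=["log_employment_density"])

# standard 80/20 train-test split as described in Section 3.4
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(
    n_estimators=500,       # number of bootstrapped trees averaged in the ensemble
    max_features="sqrt",    # random subset of covariates tried at each split
    min_samples_leaf=5,     # limits leaf size to curb overfitting
    n_jobs=-1,
    random_state=42,
)
rf.fit(X_train, y_train)
print("held-out R^2:", rf.score(X_test, y_test))
```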
In particular, the RF-specific partitioning of observations and the feature space allows the model to remain robust even in the context of noisy data (Perlich et al., 2003). Additionally, given that the final RF model is based on averaging across many trees, RF models also smooth out any discontinuities introduced into the feature space through subsampling. RF models generally outperform single decision tree based models and deliver robust, high predictive performance while requiring little additional manual tuning (see Athey and Imbens (2019) for more details).

Gradient Boosting Machines (GBM) are a related ensemble tree method (Friedman, 2002). GBM also merge multiple tree models but, in contrast to RF, estimate these sequentially and combine them by reiteration. Unlike RF, GBM models rely on ’boosting’ to generate the final model, i.e. each tree is fit to the whole dataset and trees are added sequentially to enhance performance cumulatively. GBM models can deliver precise predictions; however, they are noticeably more sensitive to outliers in the data and to hyperparameter choices than RF models (Dietterich, 2000). Overall, both RF and GBM are algorithms that can deliver high precision estimates. However, while GBM can outperform RF under specific data structures, it is also significantly more sensitive to noise in the data and to the choice of hyperparameters.

3.3 Feature Selection Approaches with Regularized GLM Models

Large feature spaces allow us to explore the relationship of the dependent variable with various covariates. However, a large number of features may also introduce higher levels of collinearity across the feature space and may result in overfitting, thus lowering out-of-sample performance (Athey and Imbens, 2015). In order to reduce these issues while retaining as much as possible of the information contained within the feature space, regularized linear models (also referred to as penalized (linear) models throughout the literature) amend linear estimations by adding a feature selection step to the overall estimation. In the first step, the regularized GLM model selects important features across the whole feature space according to

\hat{\beta}_{reg} = \arg\min_{\beta} \left\{ \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{K} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{K} P(\beta_j) \right\}   (1)

where \sum_{i=1}^{N} ( y_i - \sum_{j=1}^{K} x_{ij}\beta_j )^2 is the sum of squared residuals, P(\cdot) is the coefficient penalty whose form distinguishes the regularization models discussed below, and \lambda is the penalty term applied across a feature space with N observations and K features. The regularization parameter can either be set by the researcher or determined through the model via cross-validation to minimize the model’s RMSE. However, given that the regularization estimation also introduces a penalty on the important features (Varian, 2014), it is followed by a standard linear model estimated over the subset of selected nonzero variables. The best-known versions of regularization models are Lasso (Tibshirani (1996); Belloni and Chernozhukov (2013)), Ridge (Hoerl and Kennard, 1970) and Elastic Net (Zou and Hastie, 2005) models, which differ in the specifics of the penalty applied. Lasso models use absolute value penalties to shrink the coefficients of unimportant variables to zero and thus tend to perform better for models with a finite set of important covariates. Ridge models rely on quadratic penalties to reduce the coefficients of unimportant variables, but unlike Lasso models never shrink these to absolute zero.
Ridge estimations are hence more suited to models where multiple variables exhibit similar importance. Elastic Net models combine elements of both Lasso and Ridge models. Regularized models can easily be used in combination with standard econometric models, and thus remain intuitive in their interpretation.

3.4 Model Setup

In order to identify the model that best predicts employment across all sample cities, we test the algorithms outlined in Sections 3.2 and 3.3 across the whole data set and on separate SSA and LAT subsets. For all models, we follow a standard test-and-train split of the data. We generate a random 80/20 split where we train the models on 80% of the original feature space and test on the held back 20% to gain insight into the models’ out-of-sample predictive performance; the test and train sets are identical across the models. For the GBM of Section 5.1 and the Reg GLM model of Section 5.2 we further impose a k-fold cross validation (CV) set-up with 10 folds and 10 repeats. Within the training of the models, the training data is thus split into 10 equally sized disjoint subsets where a model is fitted on 9 folds and internally assessed on the remaining fold. This is iterated over all folds and repeated 10 times, so that each model effectively runs through 100 fitting processes. This set-up enables us to identify and subsequently incorporate model hyperparameters that optimize out-of-sample performance. The final model is then fit to the complete train data set. Given the bootstrapping structure of random forest algorithms, no k-fold CV is required for hyperparameter optimization. To obtain insight into out-of-sample performance, we test the model on the test data and compare the models’ performance using R^2 and the Root Mean Squared Error (RMSE), defined as

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \quad \text{and} \quad RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}

where y_i is the observed, \hat{y}_i the predicted and \bar{y} the mean response variable, n is the number of observations, 0 \leq R^2 \leq 1, and the RMSE is measured in the unit of the response variable. R^2 indicates how well the model predicts (the variation in) the response variable and the RMSE measures the average difference between predicted and observed response values.

4 Exploratory Analysis

In order to gain insight into the structure of the data, we exploit two options to summarize the data: (1) Pearson correlations to identify pairwise relationships between all variables, and (2) Principal Component Analysis (PCA) to gain a thorough insight into the underlying dynamics of the feature space.

Figure 3: Overview of the Feature Space Correlations. Subfigure (a) visualizes the pairwise Pearson correlations across the whole data set, and (b) summarizes the pairwise Pearson correlations of log(employment density) with the main features.

Figure 3(a) provides a visual representation of the pairwise Pearson correlations across the main features of the data set, and Figure 3(b) specifically depicts the pairwise Pearson correlations of the response variable of log(employment density) with the covariates at the grid cell level across the whole data set. We identify correlations in both directions, i.e. there exist both positive and negative correlations across features, and with the response variable. Overall, the majority of correlations across the feature space are positive, and the positive correlations strongly outweigh negative correlations in magnitude.
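A correlation screen of this kind can be reproduced with a few lines of standard tooling; the sketch below assumes a pandas table of grid cells with a log(employment density) column, and the file and column names are hypothetical.

```python
# Minimal sketch: rank features by pairwise Pearson correlation with
# log(employment density), as in Figure 3(b). File/column names hypothetical.
import pandas as pd

cells = pd.read_parquet("grid_cell_features.parquet")
corr = (
    cells.select_dtypes("number")
    .corr(method="pearson")["log_employment_density"]
    .drop("log_employment_density")
    .sort_values(key=abs, ascending=False)
)
print(corr.head(15))   # strongest (absolute) correlates of employment density
```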
We observe that the features derived from night lights and population data are most strongly correlated with employment. This is not unexpected given that information derived from night lights in particular have become a standard proxy for economic development across the literature, although predominantly in the context of cross-country studies or studies conducted at larger spatial units (see Hender- son et al. (2012) for example). From the cartographic features derived via OSM, we find that amenities and street inter- sections are particularly strongly correlated with employment. This could possibly point to co-location of amenities with urban employment or that amenities themselves are the place of employment in the case of the former (Ahlfeldt et al., 2015). The latter indicates that the street network is denser around high employment areas (Gudmundsson and Mohajeri, 2013). We further observe that the geographical and geological features concerning vegetation in- 14 (a) (b) Figure 4: Principal Component Analysis where (a) is the Factor Loading Plot visualizing feature importance and (b) the Monoplot of the observations across cities in SSA (blue) and LAT (red) tensity and water presence negatively correlate the strongest with employment, i.e. a higher presence of either water or vegetation indicates lower or no employment (see also Goldblatt et al. (2020)). Further, terrain roughness and elevation also tend to correlate negatively with the response variable, although with lower magnitudes. These also correlate negatively with the night lights and population variables, cautiously hinting at a reduced suitability of rough terrain for both residential and commercial buildings. In general, we observe a large number of positive correlations across the whole features space, and in particular features that are strongly correlated with the response variable also tend to be correlated themselves, and exhibit similar correlation patterns with other features. In order to gain a deeper understanding of the underlying structure of the data, we further exploit a Principle Component Analysis (PCA). However, given that a PCA is an unsu- pervised approach, it does not allow us to explicitly focus on employment. Thus, we limit our use of the PCA to exploring and visualizing the data, and rely on it for two specific reasons: (1) to analyze how homogeneous the structure of the data is across cities following within-city standardization, and (2) to understand how the features relate to each other. A PCA constructs various linear combinations of the original observations by assigning weights to each variable11 . A principal component refers to a specific combination of feature weights. Figure 4(a) shows the PCA loading plot which illustrates how strongly each variable influences principal components 1 and 2 respectively, i.e., the weight assigned by each of the principal components to a given variable. The position on the graph responds to the 11 See Section 3.1 for methodological details 15 relative weight identified for the variable by principal component 1 on x-axis and by principal component 2 on the y-axis. Night lights for example have been assigned a weight of 0.78 by principal component 1, and -0.38 by principal component 2. Given that larger weights repre- sent variables more important to the data set, variables located further away from the origin can therefore be interpreted as more influential; equivalently, variables located closely to the origin are less important. 
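A loadings table of the kind underlying Figure 4(a) can be recovered with a standard two-component PCA; the sketch below uses scikit-learn as an illustrative choice, assumes the features have already been standardized within each city as described in Section 4, and uses hypothetical file and column names (reported component weights may differ by scaling convention).

```python
# Minimal sketch: two-component PCA on the (within-city standardised) feature
# matrix, recovering component weights of the kind shown in Figure 4(a).
# scikit-learn and all names are illustrative assumptions.
import pandas as pd
from sklearn.decomposition import PCA

features = (
    pd.read_parquet("grid_cell_features.parquet")
    .select_dtypes("number")
    .drop(columns=["log_employment_density"])   # PCA is unsupervised
)

pca = PCA(n_components=2)
scores = pca.fit_transform(features)            # projected observations (Figure 4b)

weights = pd.DataFrame(
    pca.components_.T, index=features.columns, columns=["PC1", "PC2"]
)
print(weights.sort_values("PC1", key=abs, ascending=False).head(10))
```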
The PCA loadings plot highlights that data on night lights, pop- ulation, urban land cover, shops and amenities strongly contribute to the first two principal components hinting at the importance of these variables to the overall data patterns across the whole feature space. Further, each feature’s position within the graph enables us to gain further insight into the correlations across the feature space. Features whose coordinates are within close proximity are generally positively correlated, and those who are located at an approximately 90° angle from each other are weakly correlated or uncorrelated. Features that diverge, and are thus located at approximately 180° from each other are negatively cor- related. We observe that employment is most closely related to night lights, population and paved major roads, and to urban land cover and intersections data to a slightly lesser degree. This indicates that these variables correlate strongly with employment. This finding is in line with the Pearson correlations computed previously. This further suggests that night lights, population and urban land cover are important to both the variance of the whole data set and to predict employment, where the former follows from the variable’s own position and the latter from its close location to employment on the graph. Furthermore, those features that exhibit low levels of correlations with employment, e.g., terrain roughness and elevation, also contribute relatively little to the PCs; this is indicated by their close location to the ori- gin. This indicates that these variables likely do not contribute strongly to the prediction of employment given their weak correlation with employment as indicated by the PCA. Relying on this PCA weighting structure, each original observation can be transformed through weighting the observation’s variable values accordingly. Figure 4(b) shows the PCA’s monopolot where each point refers to a observation transformed into an index through weighting of its variable value according to different weighting structures computed; the position on the x-axis refers to the observation’s values weighted according to principal com- ponent 1 and the position on the y-axis according to principal component 2. The ellipsoids capture the area where the majority of the transformed observations are located where each ellipsoid refers to the observations of a specific city. Given that the city ellipsoids overlap to a large degree, i.e. the transformed observations are located largely within the same area, we can conclude that these observations are largely similar post-transformation. This allows us to conclude that the cities can be modelled jointly within one algorithm. A small degree of heterogeneity across and within the cities’ ellipsoids can be observed. This is most likely driven by differences in data collection, heterogeneity in the relationship between response variable and covariates, and city specific idiosyncrasy. 16 5 Results 5.1 Random Forest and Gradient Boosting Machine Algorithm Predictions In order to select candidate models for employment prediction, we consider our prob- lem and the data sets at hand. On one side, we have employment data that is measured very differently across the sample cities. On the other side, we have OSM data that may be incom- plete to varying degrees across cities. 
Given this statistical noise, we expect that the random forest (RF) model will be most suited due to its feature space and observation partitioning setup which makes it more robust in the presence of noise. Table 2 contains the results. Across the whole data set, RF achieves high predictive performance with R2 of 0.82 and RMSE of 0.43 (Table 2, column (1)). This result is substantially above those obtained in the related literature estimating employment factors. For instance, Goldblatt et al. (2020) predict enterprise and employment counts for Vietnamese communes and obtain R2 s ranging from 0.24 to 0.33. Our results are above but more comparable to those obtained from studies predicting poverty parameters where R2 s generally range from 0.3 to 0.8 (Jean et al. (2016) and Yeh et al. (2020)). However, unlike our approach that is based on pre-processed satellite data and mapped features these studies require complex deep learning methods to achieve their predictive performance. We do not observe noticeable differences in predictive quality across the two geographic regions. However, to test if models trained on data from each region separately achieve higher predictive performance, we retrain the algorithm on the data split accordingly (Table 2, columns (2) and (3)). The results indicate a marginal increase in the predictive performance for the model trained on SSA data, but no change in predictive quality for LAT observations. However, both of these models perform noticeably worse when assessed on data from the other region. Hence, in the cases where grid cells from the city being predicted are not contained in the training set (i.e., Latin American cities predicted from a model trained only on SSA cities), the predictions tend to be lower. Employment is often clustered in one area if the city is monocentrically organized or multiple areas for polycentric cities. These clusters are generally larger than our specific grid cell choice of 500m x 500m. We hypothesize that high employment grid cells are generally surrounded by other high employment grid cells thus forming a high employment cluster. Furthermore, given that our grid cell structure has been superimposed in an arbitrary man- ner, it may for example be that the public transport hub closest to a high employment location is located in a neighboring cell. Similarly, a grid cell exhibiting a relatively high number of public transport hubs with low or no number of public transport hubs in surrounding cells is unlikely to indicate an employment cluster. This highlights the importance of including infor- mation on surrounding cells into the analysis. We thus extend the feature space by adding feature values (e.g., mean values) from neighboring cells to mimic the structure of a spatial 17 Table 2: Performance comparison across ensemble tree models (R2 (RMSE)) Train Data All Cities SSA Cities LAT Cities All Cities SSA Cities LAT Cities (1) (2) (3) (4) (5) (6) Random Forest (RF) Spatial RF All Cities 0.82 (0.43) 0.59 (0.65) 0.43 (0.80) 0.95 (0.23) 0.62 (0.63) 0.44 (0.79) Test Data SSA Cities 0.82 (0.43) 0.85 (0.40) 0.40 (0.83) 0.95 (0.23) 0.96 (0.21) 0.41 (0.82) LAT Cities 0.82 (0.43) 0.53 (0.71) 0.82 (0.42) 0.95 (0.24) 0.54 (0.70) 0.95 (0.23) No. Obs. 
98,594 57,017 41,577 98,594 57,017 41,577 Gradient Boosting Machines (xgboost) Spatial xgboost All Cities 0.70 (0.55) 0.48 (0.77) 0.38 (0.86) 0.72 (0.53) 0.51 (0.77) 0.30 (0.89) Test Data SSA Cities 0.67 (0.57) 0.72 (0.53) 0.36 (0.89) 0.70 (0.55) 0.75 (0.50) 0.26 (0.92) LAT Cities 0.74 (0.51) 0.44 (0.83) 0.77 (0.48) 0.74 (0.51) 0.47 (0.83) 0.78 (0.47) No. Obs. 98,594 57,017 41,577 98,594 57,017 41,577 Note: Test and train data sets used across the algorithms are identical; test data is subsampled to avoid repeated observations; No. Obs. refers to the full data sets of both train and test data for each sample; lag model (Anselin, 1996) (’Spatial RF’ in Table 2, columns (4) to (6)). Overall, the results im- prove noticeably in comparison to the models trained exclusively on within-grid cell feature values and achieve very high predictive quality, both when trained on the whole data but also when trained on regional subsets with R2 of 0.95 and 0.96 across the three models. However, similar to the results of columns (1) to (3), while marginal improvements in the models are ob- served when trained on regional sub-samples, the model trained on regional samples exhibits noticeably lower predictive performance compared to the model trained on the combined data. Given that Gradient Boosting Machine algorithms often out-perform random forest mod- els, we retrain the algorithms accordingly as an alternative approach. Here we rely on the extreme gradient boosting machine specification (xgboost); Results are provided in Table 2. Across both versions of the feature space, the results obtained via xgboost remain be- low those achieved via RF. Similarly to the RF models, Spatial xgboost outperforms the non- spatial version. In contrast to the RF case, models trained on region samples result in a higher predictive performance for the LAT than SSA model. However, these results also remain be- low the predictive performance achieved via RF. Comparing both algorithms, the data and variable partitioning structures of the RF algo- rithm make it less prone to overfitting and more robust in the presence of noisy data. While xgboost models may outperform RF in settings where data is less noisy (Dietterich, 2000), within the context of our feature space which exhibits structural noise stemming from city id- 18 (a) (b) Figure 5: An illustration of predictions obtained via Spatial RF (a) and (b) show observed and predicted values assigned of test data grid cells (outlined in white) within Dakar/Senegal iosyncracy (city specific relationships between employment and features, and measurement errors in both the dependent variable and the OSM features), the RF model performs notice- ably better across all variations of the data. We can dig slightly deeper into model performance, and investigate how the model performs for small and large employment values. Typically, models tend to smooth towards the mean (performs the best for values close to the mean), while predicting less well outliers. An illustration of the observed and predicted grid cell employment values for Dakar, Senegal is provided in Figure 5. A visual comparison highlights an upwards bias for low levels of employment, e.g. the values predicted for grid cells located in the northeast of the area depicted. Some grid cells with high employment values appear to be underpredicted, e.g. in the south of Dakar, albeit to a lower degree. 
Considering our full dataset, we find that higher values of employment are predicted more accurately but exhibit a small downwards bias while low employment levels tend to be slightly overpredicted (shown in Figure 6). This could be driven by repeated observations where polygons with zero employment have been disaggregated into multiple grid cells. Overall, despite minor smoothing, the relative squared error is roughly constant across dif- 19 6 (a) (b) (c) Figure 6: Performance metrics for Spatial RF ferent levels of employment thus supporting the reliability of our results. In order to gain an understanding of which features are particularly important for the pre- diction of employment, we extract the feature importance from the Spatial RF models trained on all cities and the SSA and LAT samples. A visualization of the most important features of each model is provided in Figure 7. For the model trained on all cities, population data (both cells and neighboring cells) is the strongest predictor of employment. This is followed by data on night lights and the NDVI observed across neighboring cells. The most important OSM-derived features are neighboring amenities (neighbor version of aggregated amenity variable) which are identified as the 5th most important feature. Noticeably, 11 of the 15 most important features are satellite derived. Further, features identified as important are often relevant in both own-cell and neighbor versions, where for example the population of a grid cell but also the population of neighboring cells are identified as important predictor vari- ables. This might in part be driven by the spatial disaggregation employed when obtaining grid cell level values from polygons and might also be influenced by the non-random spatial distribution of both dependent and feature variables. For the model trained solely on SSA cities’ observations, the features of population, night lights and urban land cover are the most important. Interestingly, two specific features (intersections and public transport) are ranked as important for the SSA model but do not feature within the overall model. For the LAT cities’ model, urban land cover is also highly important in both its within grid and within neighboring grid cells. Across all three models, three main conclusions regarding feature importance can be drawn: (1) satellite data derived features are usually most important, (2) amenities (a combination of both categorized and uncategorized amenities) appear to be the most relevant employment predictor derived via OSM, and (3) some features are identified as influential across all three models, but there exists also heterogeneity across the samples. 20 Figure 7: Variable importance ranking of the 15 most important features for Spatial RF Overall our results reveal that a combination of satellite and mapped features data can predict employment to a high degree; even for the high levels of spatial resolution required for within-city analyses. Furthermore, despite the presence of noise in the data, exploiting the RF algorithm results in very high predictive performance quality. In addition, adding neighboring features delivers particularly high predictive performance, highlighting the im- portance of explicitly incorporating the geospatial dimension of data. Interestingly, we find that models trained on one region but tested on the other perform less well. This could be due to two distinct phenomena: first, there could be structural differences between the two sets. 
In other words, cities in LAT could share common relationships between employment and our features (but different to SSA). Second, cities could themselves exhibit unique rela- tionships, and hence leaving own-city cells out of the training data significantly impacts the ability of the model to predict that city. Another consideration that deserves some attention is that of heterogeneity in data qual- ity across cities. Both the underlying surveyed employment data that we are trying to predict and the OSM completeness could be a main cause of differences in relationships between employment distribution and mapping and satellite features. Such data quality issues could be consistent with both hypotheses: structurally different relationships in SSA and LAT if the OSM data is consistently less complete in SSA cities; and differing prediction quality if em- ployment data is of lower quality in particular urban areas. We will further investigate these issues below. 21 5.2 Alternative Algorithms In order to test whether algorithms based on different statistical foundations provide better predictive performance, we fit regularized linear algorithms on the training data and benchmark their performance. Within the training process of RF models the data is randomly split and a random subset of variables is tested for each data subset. Given this evalua- tion of a randomly drawn subset of features, the likelihood that each tested subset contains highly correlated variables is low. This set-up reduces the issue of multicollinearity within RF models (Genuer et al., 2010). Regularized linear models do not partition the data, but in- stead first reduce the feature space to only the most important variables before fitting the model in the second step. As penalized models represent a family of models differing in their penalty specifics, we allow the algorithm to determine the most suitable model through optimal penalty parameter choice. Results are provided in Table 3. Overall, the model selects the clear majority of features as relevant across both non- spatial and spatial versions. Comparing the predictive performance reveals that the models’ R2 generally remains substantially below (and RMSE above) those observed for both RF and GBM models. This holds true across all six versions of the fitted model (spatial vs non spatial, all cities, and regional specific). Similar to the GBM fit models, the algorithm fit exclusively on LAT cities tends to perform better than those fit to all cities’ or SSA cities’ data. However, the increased fit to the data is similarly accompanied by overfitting, i.e., the model is very closely aligned with the specific training data and thus not sufficiently flexible to predict any other data to a high degree. This results in worse predictive performance when the algorithm is used to predict the SSA or overall cities’ test data. As with the ensemble tree simulations, the models based on data including spatial effects outperform the non-spatial version although increases in predictive performance are very marginal. Hence, the predictive performance of these models is lower compared to the RF models discussed in the previous subsection. However, we can exploit the fact that regularized linear models select a subset of important features in order to delve deeper into understanding how different features interact with employment. For example, the model selects 81/84 features for the ’All Cities’ model. 
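A minimal sketch of this select-then-refit logic, an elastic-net stage that retains the non-zero coefficients followed by an unpenalized linear fit on the retained subset, is given below; scikit-learn and statsmodels are illustrative library choices and the file and column names are hypothetical.

```python
# Minimal sketch of the two-step procedure: an elastic-net fit selects the
# non-zero features, then an unpenalised linear model is refit on that subset.
# scikit-learn/statsmodels and all names are illustrative assumptions.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.linear_model import ElasticNetCV

cells = pd.read_parquet("grid_cell_features.parquet")
y = cells["log_employment_density"]
X = cells.drop(columns=["log_employment_density"]).select_dtypes("number")

# Step 1: penalised fit with the penalty strength chosen by cross-validation
enet = ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=10, random_state=42)
enet.fit(X, y)
selected = X.columns[np.abs(enet.coef_) > 0]
print(f"{len(selected)}/{X.shape[1]} features retained")

# Step 2: standard OLS on the selected subset only
ols = sm.OLS(y, sm.add_constant(X[selected])).fit()
print(ols.summary().tables[1])
```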
Taking this subset in each case, we run a spatial regression analysis which directly incorporates spatial dependence within the data. We examine the model coefficients, comparing the most important features to those previously identified by the RF feature importance score; Table 3 provides the results12. Across all three regression sets, the NDWI and NDVI features (water/vegetation) recur with high importance. Night lights are the only additional satellite-derived feature listed for the 'All Cities' model, while terrain roughness, population and urban land cover are also listed for the 'SSA' and 'LAT' spatial models.

12 All spatial regressions listed in Table 3 follow the Spatial Durbin Error model; given the large number of features, we only list the largest ten coefficients among statistically significant variables.

Table 3: Performance comparison across Regularized Generalized Linear Models (R2 (RMSE))

                         Regularized Generalized Linear Models (Reg GLM)          Spatial Reg GLM
Train data:              All Cities    SSA Cities    LAT Cities                   All Cities    SSA Cities    LAT Cities
                         (1)           (2)           (3)                          (4)           (5)           (6)
Test: All Cities         0.56 (0.66)   0.53 (0.73)   0.44 (0.75)                  0.58 (0.65)   0.52 (0.78)   0.46 (0.74)
Test: SSA Cities         0.49 (0.72)   0.51 (0.69)   0.43 (0.76)                  0.51 (0.70)   0.54 (0.68)   0.45 (0.75)
Test: LAT Cities         0.67 (0.58)   0.55 (0.73)   0.69 (0.56)                  0.68 (0.57)   0.54 (0.81)   0.70 (0.55)
Variables selected       81/84         75/84         69/84                        128/142       129/142       126/142
No. Obs.                 98,594        57,017        41,577                       98,594        57,017        41,577

Spatial regression analysis (ten largest statistically significant coefficients, by absolute magnitude):
All Cities: NDVI -0.88*** (0.05); NDWI 0.78*** (0.06); Offices (binary) -0.48*** (0.13); Major Roads (binary) -0.36*** (0.02); Airport (binary) 0.33*** (0.06); log(night lights) 0.28*** (0.01); Shops (binary) 0.27* (0.15); log(major roads) 0.27*** (0.02); Intersections (binary) -0.24*** (0.02); log(intersections) 0.20*** (0.02)
SSA Cities: Intersections (binary) -1.05*** (0.21); NDVI -0.94*** (0.08); log(shops) 0.79*** (0.29); NDWI 0.37*** (0.08); Shops (binary) 0.31*** (0.07); log(trams) -0.29* (0.17); Airport (binary) 0.26** (0.11); log(terrain roughness) 0.25*** (0.03); log(main roads) 0.11*** (0.04); log(population) 0.10*** (0.02)
LAT Cities: NDWI 0.96*** (0.08); NDVI -0.87*** (0.07); Major Roads (binary) -0.58*** (0.18); log(intersections) 0.32*** (0.02); log(night lights) 0.27*** (0.02); Offices (binary) 0.24*** (0.07); Amenities (binary) 0.24*** (0.05); log(urban land cover) -0.18*** (0.07); log(population) 0.16*** (0.02); log(intersections with major roads) -0.15*** (0.04)

Note: Test and train data sets are identical across the different algorithms; test data is sub-sampled to avoid repeated observations; No. Obs. refers to the full data sets of both train and test data for each sample; estimated coefficients are reported for the ten most important variables in each regression, as identified through the coefficients' absolute magnitude.

Among OSM features, information on major roads and intersections is relevant for all three models, albeit through different specific features; variables describing shops, airports, amenities and offices are also important, although with very different magnitudes. Surprisingly, only one variable reflecting public transport is listed. While this is surprising given the literature on the importance of public transport for urban employment (Pogonyi et al., 2021), our results may be driven by the absence of well-established public transportation networks in the majority of the cities included here.
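The footnote above states that these regressions follow a Spatial Durbin Error specification. The sketch below shows one way such a model could be estimated with the PySAL spreg package, by adding spatially lagged copies of the regressors (WX) to a spatial error model. The lattice weights, variable names and data are synthetic assumptions rather than the paper's estimation code.

```python
# Sketch of a Spatial Durbin Error specification with PySAL's spreg:
# spatially lagged regressors (WX) are added to a spatial error model.
# Lattice data and weights below are synthetic placeholders.
import numpy as np
from libpysal.weights import lat2W, lag_spatial
from spreg import ML_Error

rng = np.random.default_rng(0)
side = 20
w = lat2W(side, side)          # rook contiguity on a synthetic grid
w.transform = "r"              # row-standardise the weights

n = side * side
X = rng.normal(size=(n, 3))    # e.g. NDVI, NDWI, night lights (illustrative)
WX = np.column_stack([lag_spatial(w, X[:, j]) for j in range(X.shape[1])])
y = (X @ np.array([-0.9, 0.8, 0.3]) + WX @ np.array([-0.2, 0.1, 0.1])
     + rng.normal(scale=0.5, size=n)).reshape(-1, 1)

names = ["ndvi", "ndwi", "night_lights"]
model = ML_Error(y, np.hstack([X, WX]), w=w,
                 name_y="log_employment",
                 name_x=names + [f"W_{v}" for v in names])
print(model.summary)           # coefficients analogous to those reported in Table 3
```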
Most of the variables identified here as important are equivalent to those exhibiting strong pairwise correlations with employment13 and to those selected as important by the random forest models in the preceding subsection. However, large differences between the variables indicated as important by the linear and random forest models can also be observed. For example, information on air pollution is selected as important across all RF models, but it is not identified as such by the linear regressions. Additionally, the results of Table 3 identify binary variables as important, while the random forest computations predominantly list continuous variables as highly relevant. These differences with the RF models are most likely driven by the less strict assumptions of RF models, which allow for the detection of variables as important even if their relationship to employment is non-linear.

13 See Section 4; a direct comparison with the PCA results is not possible given that the PCA is an unsupervised approach that identifies variables' importance to the overall data set, whereas the analyses of this section identify variables specifically important for employment.

These results, similar to those of Figure 8, indicate a strong importance of satellite-derived features, although the RF models identify night lights and population data as most crucial while the regressions estimate the NDVI and NDWI variables to be most important. In contrast to the RF feature importance, amenities are estimated to be less important by the regression analyses, while features derived via the street network, i.e. major roads and street intersections, are indicated as important for employment. Overall, the high correspondence between the variables selected as important by the RF and linear models provides confidence that these variables are highly predictive of employment density.

5.3 Predicting Employment from Digital Maps

The results of the preceding section strongly suggest that urban employment can be predicted from a combination of mapped and satellite data. Variable importance analyses further highlighted the high relevance of satellite-derived features for predicting employment. In this section, we collapse the feature space (including spatial effects) to the variables derived via OSM only, and separately to those derived via satellite only, to test how well employment can be predicted from mapped or satellite information alone. We retrain the RF algorithms with spatial effects on these reduced feature spaces.
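As a minimal sketch of this source-specific retraining, the snippet below refits the same random forest once on OSM-style columns and once on satellite-style columns, and scores both on a common held-out set. The data and the column groupings are synthetic, illustrative assumptions rather than the paper's feature lists.

```python
# Sketch of the Section 5.3 exercise: refit the RF on OSM-derived columns
# only and on satellite-derived columns only, then score on a shared test set.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
n = 3000
df = pd.DataFrame({
    "amenities": rng.poisson(2, n), "intersections": rng.poisson(5, n),
    "major_roads": rng.integers(0, 2, n),                        # OSM-style features
    "population": rng.gamma(2, 500, n), "night_lights": rng.gamma(1.5, 3, n),
    "ndvi": rng.uniform(-0.2, 0.8, n),                           # satellite-style features
})
df["log_employment"] = (0.3 * df["amenities"] + 0.6 * np.log1p(df["population"])
                        + 0.3 * df["night_lights"] - 0.8 * df["ndvi"]
                        + rng.normal(0, 0.5, n))
train, test = train_test_split(df, test_size=0.2, random_state=0)

groups = {"OSM only": ["amenities", "intersections", "major_roads"],
          "Satellite only": ["population", "night_lights", "ndvi"]}
for label, cols in groups.items():
    rf = RandomForestRegressor(n_estimators=300, random_state=0, n_jobs=-1)
    rf.fit(train[cols], train["log_employment"])
    print(label, round(r2_score(test["log_employment"], rf.predict(test[cols])), 2))
```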
Results are provided in Table 4.

Table 4: Performance comparison across RF models with spatial effects, OSM vs. satellite features only (R2 (RMSE))

OSM data only
Train data:              All Cities    SSA Cities    LAT Cities
Test: All Cities         0.76 (0.49)   0.55 (0.68)   0.39 (0.80)
Test: SSA Cities         0.76 (0.49)   0.78 (0.47)   0.36 (0.82)
Test: LAT Cities         0.75 (0.50)   0.50 (0.73)   0.76 (0.49)
No. Obs.                 98,594        57,017        41,577

Satellite data only
Train data:              All Cities    SSA Cities    LAT Cities
Test: All Cities         0.79 (0.47)   0.54 (0.69)   0.43 (0.80)
Test: SSA Cities         0.78 (0.48)   0.80 (0.45)   0.41 (0.82)
Test: LAT Cities         0.80 (0.45)   0.48 (0.75)   0.81 (0.45)
No. Obs.                 98,594        57,017        41,577

Note: Test and train data sets used across the algorithms are identical; test data is subsampled to avoid repeated observations; No. Obs. refers to the full data sets of both train and test data for each sample.

Overall, the results indicate that a large part of the variation in employment can be predicted solely from OSM data, with an R2 of approximately 0.76, and solely from satellite data, with an R2 of approximately 0.79. In comparison to the previous computations relying on the combined data, omitting one or the other type of data leads to a significant reduction in predictive performance. In the case of OSM, we might expect LAT cities to be better predicted due to better quality data, but this is not the case; this may be due to the larger sample size in SSA. In the case of satellite data, we expect that similar data quality across LAT and SSA would yield similar predictive power, which is consistent with our findings.

Similar to the algorithms outlined in the previous sections, we investigate which variables are particularly important for predicting employment density. Results are provided in Figure 8.

Figure 8: Variable importance: 15 most important OSM features for Spatial RF

Similar to the previous results, amenities are indicated as the most important OSM-derived feature. This holds true across all three algorithms, and applies particularly to those amenities that exhibit a tag, i.e. amenities that can be identified as a restaurant, etc. Different features capturing street intersections are also indicated as particularly important, although in different variations across the three models. Features capturing the grid cell's distance to the coast and features indicating shops are also revealed to be of importance, the former more so for SSA cities and the latter for LAT cities. Features representing major roads and public transport appear to be important only to a smaller degree. The lower importance of public transport could be due to the relatively low public transport availability in the majority of cities included in this analysis.

5.4 Predicting Out-of-Sample Cities

Our ultimate aim is to be able to predict employment in entirely unseen or out-of-sample cities. Here, we use the training data for the 14 cities in order to (a) test the feasibility of this and (b) assess how well the algorithms can predict urban employment for cities not included in this sample, based on the data patterns identified. If we randomly split the data into training and test data sets, then grid cells from any given city appear in both sets; in this case we do not know how well the algorithm would perform on a city that was never included in the training sample. In the analyses above, algorithms fit on the SSA and LAT samples point to reduced accuracy in prediction when evaluated on cities located in the other region. Hence, city idiosyncrasy, with regard to both noise in the data and the relationship between employment and features, can substantially decrease the algorithm's predictive performance for cities fully excluded from the training data.

We can investigate the predictive power of the model for unseen cities by refitting the spatial RF model to the complete data set comprising all grid cells but excluding a specific city. Hence, for each city, we fit our algorithm to the grid cells of the other thirteen cities and evaluate the model performance on the grid cells of the city held back. The raw employment data was provided in polygons. Given that spatially disaggregating polygon information into grid cells might introduce noise to the data, we evaluate the predictive performance of the algorithms both at the grid cell level, i.e. the level that we use to train the algorithms, and also spatially aggregated to the original polygon boundaries.
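A minimal sketch of this leave-one-city-out evaluation, using scikit-learn's LeaveOneGroupOut, is shown below; the per-city data are synthetic and the city names and feature columns are placeholders rather than the actual 14-city sample.

```python
# Sketch of the out-of-sample evaluation: for each city, fit on all other
# cities' grid cells and score on the held-out city (leave-one-group-out).
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(2)
cities = ["Dakar", "Nairobi", "Kigali", "Bogota", "Lima"]   # placeholder labels
frames = []
for c in cities:
    m = 600
    x = pd.DataFrame({"population": rng.gamma(2, 500, m),
                      "night_lights": rng.gamma(1.5, 3, m),
                      "amenities": rng.poisson(2, m)})
    # each synthetic city gets a slightly different employment relationship
    x["log_employment"] = (rng.uniform(0.4, 0.8) * np.log1p(x["population"])
                           + rng.uniform(0.1, 0.5) * x["night_lights"]
                           + rng.normal(0, 0.5, m))
    x["city"] = c
    frames.append(x)
data = pd.concat(frames, ignore_index=True)

X_cols = ["population", "night_lights", "amenities"]
logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(data[X_cols], data["log_employment"], groups=data["city"]):
    rf = RandomForestRegressor(n_estimators=200, random_state=0, n_jobs=-1)
    rf.fit(data.loc[train_idx, X_cols], data.loc[train_idx, "log_employment"])
    held_out = data.loc[test_idx, "city"].iloc[0]
    r2 = r2_score(data.loc[test_idx, "log_employment"], rf.predict(data.loc[test_idx, X_cols]))
    print(f"{held_out}: R2 = {r2:.2f}")
```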
Results are provided in Table 5.

Table 5: Performance comparison across RF models with spatial effects for out-of-sample cities (R2 (RMSE))

Model (unit of evaluation):
City                      All Cities     All Cities     SSA Cities     SSA Cities     LAT Cities     LAT Cities
                          (grid cell)    (polygon)      (grid cell)    (polygon)      (grid cell)    (polygon)
SSA
Abidjan/CIV               0.35 (0.84)    0.70 (1.25)    0.30 (0.85)    0.71 (0.33)    -              -
Dakar/SEN                 0.62 (0.62)    0.70 (0.48)    0.65 (0.59)    0.68 (0.48)    -              -
Dar es Salaam/TZA         0.40 (0.78)    0.71 (0.75)    0.40 (0.77)    0.71 (0.74)    -              -
Douala/CMR                0.37 (0.81)    0.65 (0.56)    0.40 (0.79)    0.64 (0.56)    -              -
Harare/ZWE                0.20 (0.91)    0.30 (0.48)    0.17 (0.93)    0.28 (0.48)    -              -
Kampala/UGA               0.54 (0.68)    0.54 (0.68)    0.49 (0.71)    0.49 (0.71)    -              -
Kigali/RWA                0.52 (0.71)    0.81 (0.72)    0.49 (0.73)    0.79 (0.72)    -              -
Kinshasa/COD              0.49 (0.72)    0.55 (0.55)    0.39 (0.79)    0.49 (0.69)    -              -
Nairobi/KEN               0.67 (0.61)    0.77 (0.62)    0.67 (0.63)    0.80 (0.58)    -              -
LAT
Belo Horizonte/BRA        0.62 (0.82)    0.77 (0.90)    -              -              0.64 (0.61)    0.84 (0.40)
Bogotá/COL                0.58 (0.65)    0.52 (0.45)    -              -              0.58 (0.65)    0.53 (0.45)
Buenos Aires/ARG          0.71 (0.54)    0.78 (0.28)    -              -              0.71 (0.56)    0.79 (0.32)
Lima/PER                  0.50 (0.72)    0.42 (0.50)    -              -              0.48 (0.73)    0.42 (0.53)
Mexico City/MEX           0.58 (0.67)    0.58 (0.67)    -              -              0.56 (0.67)    0.56 (0.67)
Note: Original employment data for Kampala and Mexico City was provided as point locations, which were aggregated to grid cell level for the computations; for these cities all performance evaluations were conducted at grid cell level.

Overall, when evaluated at the grid cell level (column 1), our results exhibit large heterogeneity in predictive performance, from the lowest R2 of 0.20 for Harare/ZWE to the highest of 0.71 for Buenos Aires/ARG. Figure 9 visualizes observed and predicted employment for these two cities. While these results are overall noticeably lower than those of Sections 5.1 to 5.3, the predictive performances are in line with a similar analysis undertaken to predict consumption expenditure and household assets of spatial clusters across four SSA countries (Jean et al., 2016).

Our algorithms predict employment at the grid level based on detailed input data. When we aggregate up to the polygon level, we would expect our employment predictions to be well-aligned with the observed employment data (which was collected at polygon level). Indeed, we find that we obtain substantially higher values of R2 for almost all cities when evaluated at the polygon level (column 2).
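A minimal sketch of this polygon-level evaluation is given below, assuming each grid cell carries the identifier of its source polygon and that cell-level employment is simply summed by polygon before computing R2; the paper's exact aggregation step may differ.

```python
# Sketch: aggregate grid-cell predictions back to the original survey/census
# polygons before computing R2. The summation by polygon is an assumption.
import pandas as pd
from sklearn.metrics import r2_score

def polygon_level_r2(cells: pd.DataFrame) -> float:
    """cells needs columns: polygon_id, employment_obs, employment_pred."""
    by_poly = cells.groupby("polygon_id")[["employment_obs", "employment_pred"]].sum()
    return r2_score(by_poly["employment_obs"], by_poly["employment_pred"])

# tiny illustrative example: two polygons, four cells each
demo = pd.DataFrame({
    "polygon_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "employment_obs":  [10, 20, 30, 40, 5, 5, 10, 10],
    "employment_pred": [15, 15, 35, 35, 6, 4, 11, 9],
})
print(round(polygon_level_r2(demo), 3))
```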
In columns (4) to (6) we retrain an algorithm for each city, but train only on grid cells from other cities located in the same region. The similarity of the predictive performance of the models fitted on all cities and those fitted solely on cities of the same geographical region does not point to a clearly better fit from either approach. Although some cities are predicted more accurately when the regional models are used, no clear overall pattern emerges supporting an obvious choice for either. Similar to the results of column (2), evaluating the SSA and LAT models at the polygon level results in substantially higher predictive performance.

We hypothesize that the different sizes of polygons across cities, and the associated levels of aggregation, influence the performance of the algorithms. For example, in cases where the observed employment data is collected for large polygons, there is a large mismatch between the level of detail of the predicted employment at grid level and the observed employment transformed to grid level. Hence, we would expect that cities with larger polygons are predicted worse at the grid level, while cities with small polygons are predicted more accurately. In order to investigate this, we plot each city's mean polygon size against its model R2 (for grid cells) in Figure 10(a). We can see a clear negative relationship between these two metrics, supporting this hypothesis. When we plot this relationship with both predicted and observed data aggregated to the polygon level, we no longer observe a negative slope (see Figure 10(b)). Hence, we can deduce that, once predictions are aggregated to the polygon level, their quality is independent of polygon size.

Figure 9: Observed vs. predicted employment for Harare/ZWE in (a) and (b), and for Buenos Aires/ARG in (c) and (d) (predictions obtained via the models trained on all other cities)

Figure 10: Correlation between R2 and average polygon size across cities

To understand our models better, we analyze further patterns within our employment predictions. Across all cities, (a) the predictions exhibit smoothing towards the mean, (b) the upward bias in low-employment grid cells is on average larger than the downward bias in high-employment grid cells, and (c) predicted employment is on average within 0.71 (absolute deviation) of observed employment (which, when normalized, ranges from about -3 to +3). The latter allows us to conclude that, while there are differences between predicted and observed employment, on average a predicted value is within ±0.71 of the observed employment within that grid cell. Hence, on average we do not see very large differences between observed and predicted employment. However, given the noise in the data, it is not possible to conclude whether these differences are due to measurement noise or to algorithmic performance.

In order to probe the degree of city idiosyncrasy across the cities, we train a unique model for each individual city. If a city can be described by its 'own' model, then we can deduce that there is no inherent limit on an algorithm's ability to predict employment from the data on that city. Hence, we refit city-specific models to 80% of each city's grid cells and evaluate on the remaining 20%. The R2 across these models ranges from 0.68 for Lima to 0.99 for Harare, with an average of 0.84. These results suggest that cities' employment can overall be predicted accurately. However, the results also show a significant amount of heterogeneity in feature importance across the models (data not shown). Hence, while urban employment can be predicted, city idiosyncrasy, stemming from measurement noise in both employment and features and from heterogeneity in the relationship between employment and features, limits out-of-sample performance.

In summary, OSM and satellite data can predict employment to a very high degree. However, the accuracy of the predictions is influenced by both (a) structurally different relationships between employment and the included variables across the different cities and (b) differences in OSM and employment data quality across the cities. Since our guiding objective was to design an algorithm to predict employment in unseen cities, we apply it to various cities for which we have no employment data; Niamey/Niger, Khartoum/Sudan, Mumbai/India, Karachi/Pakistan, Port-au-Prince/Haiti and Guayaquil/Ecuador are provided as examples in Figure 11. As our results do not point to a clear preference for a model trained on all cities over one trained solely on cities of a given geographical region, we deploy the algorithm trained on all cities here, to allow for better cross-city comparison.

Figure 11: Employment predictions across various cities (predictions obtained via the model trained on all cities)
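A minimal sketch of such a deployment step is shown below: the fitted model is persisted once and then applied to the grid-cell feature table of a city without employment data. Everything in the snippet (synthetic training data, file name, column names) is a placeholder assumption rather than the actual pipeline.

```python
# Sketch of deploying a trained model on a city with no employment data:
# persist the fitted RF once, reload it, and predict for the unseen city's
# grid-cell feature table. All data, paths and columns below are synthetic.
import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
X_cols = ["population", "night_lights", "amenities"]
train = pd.DataFrame(rng.gamma(2.0, 2.0, (1000, 3)), columns=X_cols)
y = 0.5 * train["population"] + 0.3 * train["night_lights"] + rng.normal(0, 0.2, 1000)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(train, y)
joblib.dump(rf, "spatial_rf_all_cities.joblib")              # persist once after training

model = joblib.load("spatial_rf_all_cities.joblib")          # reload at deployment time
unseen_city = pd.DataFrame(rng.gamma(2.0, 2.0, (500, 3)), columns=X_cols)
unseen_city["employment_pred"] = model.predict(unseen_city[X_cols])
print(unseen_city.head())
```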
6 Discussion & Conclusion

We set out to tackle a key gap in the policy toolbox: the lack of highly granular data on employment within cities, particularly in less developed areas of the world. Using a spatial adaptation of the random forest algorithm, we show that we can predict within-city cells in our test cities with extremely high accuracy (>95% R2), and cells in out-of-sample cities with medium to high accuracy (21%-72% R2 at grid level and 31%-80% at polygon level). While we found that our model picked up significant city-specific relationships, the moderate to high R2 obtained for out-of-sample predictions, particularly for cities with expected higher data quality, gives us confidence that the algorithm can be deployed on unseen cities.

Our contribution to the literature and the practical toolbox spans multiple levels. First, we show that a combination of mapped features and pre-processed satellite data can be used to fill data gaps in developing countries. Importantly, we illustrate this for the challenging within-city case, generating very high resolution estimates. Distinct from most other related studies (Jean et al., 2016; Yeh et al., 2020), we focus on employment as our outcome variable, rather than less precise metrics such as poverty or GDP.

We find that our model captures a significant level of city heterogeneity. This likely comes from multiple sources, including both true heterogeneity in the relationship between employment and our features, and city-specific effects stemming from noise in both the employment data and the OSM data. One of the key motivations for this study was to develop an algorithm that could be deployed on completely unseen cities, i.e., cities for which no employment or training data is available. Despite the presence of city heterogeneity in our data, we find an acceptable level of R2 in the out-of-sample case, particularly for cities with expected higher data quality. This gives us confidence that, despite a few outliers with lower R2, the model performs sufficiently well. Given the model testing above, we recommend deploying the spatial RF trained on all 14 cities as the base model to give the best prediction.

References

Ahlfeldt, G.M., Albers, T.N.H., Behrens, K., 2020. Prime Locations. CEP Discussion Paper No. 1725. Centre for Economic Performance/CEP.

Ahlfeldt, G.M., Redding, S.J., Sturm, D.M., Wolf, N., 2015. The economics of density: Evidence from the Berlin wall. Econometrica 83, 2127–2189.

Anselin, L., 1996. The Moran Scatterplot as an ESDA Tool to Assess Local Instability in Spatial Association, in: Fischer, M., Scholten, H., Unwin, D. (Eds.), Spatial Analytical Perspectives on GIS in Environmental and Socio-Economic Sciences. Taylor & Francis, London, pp. 111–125.

Athey, S., Imbens, G.W., 2015. Machine Learning Methods for Estimating Heterogeneous Causal Effects. Working Paper No. 3350. Stanford Graduate School of Business, Stanford University.

Athey, S., Imbens, G.W., 2019. Machine Learning Methods That Economists Should Know About. Annual Review of Economics 11, 685–725.

Baruah, N.G., Henderson, J.V., Peng, C., 2021. Colonial legacies: Shaping African cities. Journal of Economic Geography 21, 29–65.

Belloni, A., Chernozhukov, V., 2013. Least squares after model selection in high-dimensional sparse models. Bernoulli 19.

Bengtsson, L., Gaudart, J., Lu, X., Moore, S., Wetter, E., Sallah, K., Rebaudet, S., Piarroux, R., 2015. Using Mobile Phone Data to Predict the Spatial Spread of Cholera. Scientific Reports 5. URL: http://www.nature.com/articles/srep08923, doi:10.1038/srep08923.
Bertaud, A., 2002. The spatial organization of cities: Deliberate outcome or unforeseen consequence? World Development Report 2003, background paper. URL: http://siteresources.worldbank.org/DEC/Resources/spatialorgcity.pdf.

Blumenstock, J., Cadamuro, G., On, R., 2015. Predicting poverty and wealth from mobile phone metadata. Science 350, 1073–1076.

Breiman, L., 2001. Random Forests. Machine Learning 45, 5–32.

Breiman, L., Friedman, J., Stone, C.J., Olshen, R.A., 1984. Classification and Regression Trees. Wadsworth & Brooks, Pacific Grove.

Bryan, G., Glaeser, E., Tsivanidis, N., 2020. Cities in the Developing World. Annual Review of Economics 12, 273–297.

Buchhorn, M., Lesiv, M., Tsendbazar, N.E., Herold, M., Bertels, L., Smets, B., 2020. Copernicus Global Land Cover Layers—Collection 2. Remote Sensing 12, 1044.

Burke, M., Driscoll, A., Lobell, D.B., Ermon, S., 2021. Using satellite imagery to understand and promote sustainable development. Science 371, eabe8628.

Ciccone, A., Hall, R.E., 1996. Productivity and the Density of Economic Activity. The American Economic Review 86, 54–70. URL: http://www.jstor.org/stable/2118255.

Dietterich, T., 2000. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning 40, 139–157.

Duranton, G., Puga, D., 2015. Urban Land Use, in: Duranton, G., Henderson, J.V., Strange, W.C. (Eds.), Handbook of Regional and Urban Economics. Elsevier. volume 5, pp. 467–560.

Elvidge, C.D., Baugh, K., Zhizhin, M., Hsu, F.C., Ghosh, T., 2017. VIIRS night-time lights. International Journal of Remote Sensing 38, 5860–5879.

Engstrom, R., Hersh, J., Newhouse, D., 2021. Poverty from Space: Using High Resolution Satellite Imagery for Estimating Economic Well-being. The World Bank Economic Review 0, 1–31.

Friedman, J.H., 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 367–378.

Genuer, R., Poggi, J.M., Tuleau-Malot, C., 2010. Variable selection using random forests. Pattern Recognition Letters 31, 2225–2236.

Glaeser, E.L., Xiong, W., 2017. Urban productivity in the developing world. Oxford Review of Economic Policy 33, 373–404.

Gobillon, L., Selod, H., 2014. Spatial Mismatch, Poverty, and Vulnerable Populations, in: Fischer, M.M., Nijkamp, P. (Eds.), Handbook of Regional Science. Springer Berlin Heidelberg, Berlin, Heidelberg, pp. 93–107. URL: http://dx.doi.org/10.1007/978-3-642-23430-9_7.

Goldblatt, R., Heilmann, K., Vaizman, Y., 2020. Can Medium-Resolution Satellite Imagery Measure Economic Activity at Small Geographies? Evidence from Landsat in Vietnam. The World Bank Economic Review 34, 635–653.

Gudmundsson, A., Mohajeri, N., 2013. Entropy and order in urban street networks. Scientific Reports 3, 3324.

Hallegatte, S., Rentschler, J., Rozenberg, J., 2019. Lifelines: The Resilient Infrastructure Opportunity. The World Bank. URL: http://elibrary.worldbank.org/doi/book/10.1596/978-1-4648-1430-3, doi:10.1596/978-1-4648-1430-3.

He, Y., Thies, S., Avner, P., Rentschler, J., 2021. Flood impacts on urban transit and accessibility—A case study of Kinshasa. Transportation Research Part D: Transport and Environment 96, 102889. URL: https://doi.org/10.1016/j.trd.2021.102889, doi:10.1016/j.trd.2021.102889.

Henderson, J., Regan, T., Venables, A., 2021a. Building the City: From Slums to a Modern Metropolis. The Review of Economic Studies 88, 1157–1192.

Henderson, J.V., Kriticos, S., 2018. The Development of the African System of Cities. Annual Review of Economics 10, 287–314.
Henderson, J.V., Nigmatulina, D., Kriticos, S., 2021b. Measuring urban economic density. Journal of Urban Economics 125, 103188. doi:10.1016/j.jue.2019.103188.

Henderson, J.V., Storeygard, A., Weil, D.N., 2012. Measuring Economic Growth from Outer Space. The American Economic Review 102, 994–1028. URL: http://www.jstor.org/stable/23245442.

Hengl, T., 2006. Finding the right pixel size. Computers & Geosciences 32, 1283–1298.

Hoerl, A.E., Kennard, R.W., 1970. Ridge Regression: Applications to Nonorthogonal Problems. Technometrics 12, 69.

Jarvis, A., Reuter, H., Nelson, A., Guevara, E., Guevara, A., 2008. Hole-filled SRTM for the globe Version 4.

Jean, N., Burke, M., Xie, M., Davis, W.M., Lobell, D.B., Ermon, S., 2016. Combining satellite imagery and machine learning to predict poverty. Science 353, 790–794.

Kreindler, G., Miyauchi, Y., 2021. Measuring Commuting and Economic Activity inside Cities with Cell Phone Records. Technical Report w28516. National Bureau of Economic Research. Cambridge, MA. URL: http://www.nber.org/papers/w28516.pdf, doi:10.3386/w28516.

Lall, S.V., Lebrand, M., Park, H., Sturm, D., Venables, A., 2021a. Pancakes to Pyramids: City Form to Promote Sustainable Growth. The World Bank, Washington, D.C. URL: http://documents.worldbank.org/curated/en/554671622446381555/City-Form-to-Promote-Sustainable-Growth.

Lall, S.V., Lebrand, M., Soppelsa, M.E., 2021b. The Evolution of City Form. Evidence from Satellite Data. Policy Research Working Paper 9618. The World Bank. Washington, D.C.

Li, P., Zhao, P., Schwanen, T., 2020. Effect of land use on shopping trips in station areas: Examining sensitivity to scale. Transportation Research Part A: Policy and Practice 132, 969–985.

Louail, T., Lenormand, M., Cantu Ros, O.G., Picornell, M., Herranz, R., Frias-Martinez, E., Ramasco, J.J., Barthelemy, M., 2015. From mobile phone data to the spatial structure of cities. Scientific Reports 4, 5276. URL: http://www.nature.com/articles/srep05276, doi:10.1038/srep05276.

Lu, X., Bengtsson, L., Holme, P., 2012. Predictability of population displacement after the 2010 Haiti earthquake. Proceedings of the National Academy of Sciences 109, 11576–11581. URL: http://www.pnas.org/cgi/doi/10.1073/pnas.1203882109, doi:10.1073/pnas.1203882109.

Lyapustin, A., Wang, Y., 2018. MCD19A2 MODIS/Terra+Aqua Land Aerosol Optical Depth Daily L2G Global 1km SIN Grid V006; distributed by NASA EOSDIS Land Processes DAAC.

Miyauchi, Y., Nakajima, K., Redding, S., 2021. Consumption Access and Agglomeration: Evidence from Smartphone Data. Technical Report w28497. National Bureau of Economic Research. Cambridge, MA. URL: http://www.nber.org/papers/w28497.pdf, doi:10.3386/w28497.

Moran, P.A.P., 1948. The Interpretation of Statistical Maps. Biometrika 35, 255–60.

Pearson, K., 1901. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2, 559–572.

Perlich, C., Provost, F., Simonoff, J., Cohen, W., 2003. Tree Induction vs. Logistic Regression: A Learning-Curve Analysis. Journal of Machine Learning Research 4, 211–255.

Pesaresi, M., Ehrlich, D., Florczyk, A., Freire, S., Julea, A., Kemper, T., Soille, P., Vasileios, S., 2015. GHS built-up grid, derived from Landsat, multitemporal (1975, 1990, 2000, 2014).

Pogonyi, C.G., Graham, D.J., Carbo, J.M., 2021. Metros, agglomeration and displacement. Evidence from London. Regional Science and Urban Economics 90, 103681.
Soman, S., Beukes, A., Nederhood, C., Marchio, N., Bettencourt, L., 2020. Worldwide Detection of Informal Settlements via Topological Analysis of Crowdsourced Digital Maps. ISPRS International Journal of Geo-Information 9, 685.

Straulino, D., Saldarriaga, J.C., Gómez, J.A., Duque, J.C., O'Clery, N., 2021. Uncovering commercial activity in informal cities. arXiv preprint arXiv:2104.04545.

Tibshirani, R., 1996. Regression Shrinkage and Selection Via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological) 58, 267–288.

U.S. Geological Survey. Landsat 8 Collection 1 Tier 1 Annual NDWI Composite.

Varian, H.R., 2014. Big Data: New Tricks for Econometrics. Journal of Economic Perspectives 28, 3–28.

Yabe, T., Jones, N.K.W., Lozano-Gracia, N., Khan, M.F., Ukkusuri, S.V., Fraiberger, S., Montfort, A., 2021. Location Data Reveals Disproportionate Disaster Impact Amongst the Poor: A Case Study of the 2017 Puebla Earthquake Using Mobilkit. arXiv:2107.13590 [physics]. URL: http://arxiv.org/abs/2107.13590.

Yabe, T., Tsubouchi, K., Fujiwara, N., Sekimoto, Y., Ukkusuri, S.V., 2020. Understanding post-disaster population recovery patterns. Journal of The Royal Society Interface 17, 20190532. URL: https://royalsocietypublishing.org/doi/10.1098/rsif.2019.0532, doi:10.1098/rsif.2019.0532.

Yeh, C., Perez, A., Driscoll, A., Azzari, G., Tang, Z., Lobell, D., Ermon, S., Burke, M., 2020. Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature Communications 11, 2583.

Zagatti, G.A., Gonzalez, M., Avner, P., Lozano-Gracia, N., Brooks, C.J., Albert, M., Gray, J., Antos, S.E., Burci, P., zu Erbach-Schoenberg, E., Tatem, A.J., Wetter, E., Bengtsson, L., 2018. A trip to work: Estimation of origin and destination of commuting patterns in the main metropolitan regions of Haiti using CDR. Development Engineering 3, 133–165. URL: https://linkinghub.elsevier.com/retrieve/pii/S2352728517300866, doi:10.1016/j.deveng.2018.03.002.

Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67, 301–320.

Appendix A Data Source Details

Employment Data

Cities in Sub-Saharan Africa

Abidjan/Republic of Côte d'Ivoire
Japan International Cooperation Agency (JICA), Ministry of Construction, Housing, Sanitation and Urban Development of the Republic of Côte d'Ivoire (MCLAU) and Schema Directeur d'Urbanisme du Grand Abidjan (2015). The Project for the Development of the Urban Master Plan in Greater Abidjan (SDUGA).

Dakar/Republic of Senegal
The World Bank Group, Conseil Executif des Transports Urbains de Dakar (CETUD) and Consortium Solidarité Internationale Sur les Transports et la Recherche en Afrique Sub-Saharienne (2015). Enquête Mobilité, Transports et Accès aux Services Urbains de Dakar (EMTASUD).

Dar es Salaam/United Republic of Tanzania
Not yet known

Douala/Republic of Cameroon
Agence Française de Développement (AFD), European Commission, Syndicat Mixte des Transports pour le Rhône et l'Agglomération (SYTRAL) and MobiliseYourCity Partnership (2018). Enquête Ménage-Déplacements as part of the Plan de Mobilité Urbaine Soutenable (PMUS)/Communaute Urbaine de Douala.
Harare/Republic of Zimbabwe
Zimbabwe National Statistics Agency (ZimStat), United Nations Population Fund (UNFPA), UK Department for International Development (DFID), Australian Agency for International Development, Danish International Development Agency, United Nations Children's Fund (UNICEF), European Union, The Swedish International Development Cooperation Agency (SIDA) and United Nations Development Program (UNDP) (2012). Zimbabwe Population Census 2012.

Kampala/Republic of Uganda
Travel Survey: Kampala Capital City Authority (KCCA), ROM Transportation Engineering Ltd., Shapira-Hellerman Planners, Larry Aberman & Associates, Tzamir Architects and Planners Ltd. and Ofek Aerial Photography (2012). Kampala Physical Development Plan (KPDP).
Firm Census: Uganda Bureau of Statistics (UBOS) (2011). Census of Business Establishments (COBE).

Kigali/Republic of Rwanda
National Institute of Statistics of Rwanda (NISR) (2011). Rwanda Establishment Census.

Kinshasa/Democratic Republic of Congo
Japan International Cooperation Agency (JICA), ALMEC Corporation and Oriental Consultants Global Co., Ltd (2018). Kinshasa Commuter Travel Survey (CTS).

Nairobi/Republic of Kenya
Japan International Cooperation Agency (JICA), Nairobi City Council (NCC), Nippon Koei Co., Ltd., IDCJ Inc. and EJEC Inc. (2013). The Project on Integrated Urban Development Master Plan for the City of Nairobi in the Republic of Kenya - Nairobi Personal Travel Survey.

Cities in Latin America

Belo Horizonte/Federative Republic of Brazil
Brazilian Ministry of the Economy (2019). Relação Anual de Informações Sociais (RAIS).

Bogotá/Republic of Colombia
Not yet known

(Autonomous City of) Buenos Aires/Argentine Republic
Secretariat of Transport of the Argentinian Ministry for Federal Planning, Public Investments, and Services (MINPLAN), Ingeniería en Relevamientos Viales S.A. (IRV), Ingeniería y Asistencia Técnica Argentina S.A. (IATSA) and LOGIT - Transportation Engineers (2012). 2004/5 Censo Nacional Económico/The Argentinian National Institute of Statistics and Census (INDEC) and 2011 Encuesta Permanente de Hogares/INDEC.

Lima/Republic of Peru
Peruvian National Institute of Statistics and Informatics (INEI) (2010). Census of Population and Housing Units.

Mexico City/United Mexican States
Mexican National Statistical Directory of Economic Units (DENUE) (2019). The Business Register of Mexico (RENEM).

Satellite Data

Night Lights: VIIRS Stray Light Corrected Nighttime Day/Night Band Composites Version 1; 15 arc seconds resolution (Elvidge et al., 2017)

Population: WorldPop Global Project Population Data: Estimated Residential Population per 100 x 100m Grid Square; 100m2 resolution

Air Pollution: Terra + Aqua MAIAC Land Aerosol Optical Depth Daily; 1km2 resolution (Lyapustin and Wang, 2018)

Geophysical Land Surface: SRTM Digital Elevation Data Version 4; 90m2 resolution (Jarvis et al., 2008)

Built-up Land Cover: GHSL: Global Human Settlement Layers, Built-Up Grid 1975-1990-2000-2015 (P2016); 38m2 resolution (Pesaresi et al., 2015)

Normalized Difference Vegetation Index/NDVI: Landsat 8 Collection 1 Tier 1 32-Day NDVI Composite; 30m2 resolution

Normalized Difference Water Index/NDWI: Landsat 8 Collection 1 Tier 1 Annual NDWI Composite; 30m2 resolution (U.S. Geological Survey)
Land Use: Copernicus Global Land Cover Layers: CGLS-LC100 Collection 3; 100m2 resolution (Buchhorn et al., 2020)

Appendix B Additional Tests

This section provides additional computations to investigate possible effects of the data collection method and of the choice of different spatial units.

B.1 Heterogeneity in Employment Data Collection

The employment data used throughout this work is collected either via travel surveys or via population and firm censuses16. This introduces noise to the data, both with respect to the heterogeneity of the polygon sizes within which the data was originally collected and in terms of which kind of employment is measured. Travel surveys typically include any type of employment, whereas firm censuses are generally limited to collecting information on formal employment and do not capture employment within the informal sector. As the informal sector represents a substantial part of jobs within developing countries (Bryan et al., 2020), omitting it might introduce a noticeable measurement issue into the training of the algorithms. In order to test this, we split the cities according to employment data collection type, as outlined in Table 1, and retrain the Spatial RF algorithm for each of the two types of employment data collection. Results are provided in Table 6.

16 See Table 1.

Table 6: Performance comparison across RF models with spatial effects, data collection subsamples (R2 (RMSE))

Train data:              Survey Data    Census Data
Test: Survey Data        0.88 (0.35)    0.47 (0.73)
Test: Census Data        0.40 (0.80)    0.83 (0.42)
No. Obs.                 75,543         19,336
Note: Cross-evaluation has been conducted on the complete survey and census data; No. Obs. refers to the full data sets of both train and test data for each sample.

Across both types of employment data collection, the predictive performance is relatively similar, with slightly better performance for employment data derived via surveys. When evaluating each model on data obtained through the alternative collection method, the model trained on census data predicts survey-based employment (R2 of 0.47) slightly better than the model trained on survey data predicts census-based employment (R2 of 0.40). This might indicate that the measurement error introduced by neglecting informal employment within census data weakens the relationship between the dependent variable and the features, and thus results in lower predictive performance. However, given the relatively high predictive performance for each type of data collection and the relatively low predictive performance when evaluated on data obtained via the alternative method, the results more likely point to the importance of similarity between the data used for the train and test sets. Furthermore, given that the employment data for each city is collected either through a census or through a survey, we cannot exclude that the results are driven by city idiosyncrasy.

In order to test this further, we evaluate both models specifically on data for Kampala/UGA, where we have access to separate employment data obtained via both options. The evaluation of the trained models reveals a predictive performance of R2 of 0.87 and 0.68 for survey and census data respectively, with an RMSE of 0.38 for the former and 0.57 for the latter. This further strengthens the notion that survey data might be a more suitable data source for employment prediction. However, given that observations from each of the Kampala datasets were included in the training data for each model, we retrain the equivalent algorithms on survey and census data cities, fully excluding grid cells located in Kampala. While both the model based on survey data and that based on census data lose substantial predictive performance, the model trained on census data outperforms that trained on survey data, with R2 of 0.53 and 0.39 respectively.
Hence, while survey data appears to explain and predict employment better when there are no structural differences between the train and test data sets, census data appears to suffer from a lower out-of-sample bias when evaluated on data that differs structurally from the data used to fit the algorithm.

B.2 Sensitivity of Results to Levels of Spatial (Dis)Aggregation

The choice of spatial boundaries generally impacts the results of quantitative spatial studies (Hengl, 2006; Li et al., 2020), driven both by the exact placement of the spatial unit's border and by the spatial (dis)aggregation underlying the spatial unit's value. While both can introduce bias, these issues might already be partly lessened by our incorporation of neighboring grid cells' values. Here we explicitly test the sensitivity of our trained algorithms by extracting grid cell values of the dependent variable and all features for alternative hexagonal grid dimensions: 250m x 250m, 1km x 1km and 2km x 2km17. We additionally retrain the algorithm using the original polygon boundaries as observations. Table 7 provides the results.

17 A hexagonal grid cell structure of 1.5km x 1.5km dimension has also been tested but results are omitted here.
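A minimal sketch of this re-aggregation step is shown below, using square row/column blocks for simplicity (the paper uses hexagonal grids): features are spatially averaged and employment summed within each coarser cell before retraining. The data and the choice of mean vs. sum per variable are illustrative assumptions.

```python
# Sketch of the resolution sensitivity test: re-aggregate fine grid cells to
# coarser cells before retraining. Square cells indexed by row/column are a
# simplification of the hexagonal grids used in the paper; data are synthetic.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
side = 40                                  # synthetic 40 x 40 grid of fine cells
fine = pd.DataFrame({
    "row": np.repeat(np.arange(side), side),
    "col": np.tile(np.arange(side), side),
    "population": rng.gamma(2, 500, side * side),
    "employment": rng.gamma(1.5, 50, side * side),
})

def regrid(df: pd.DataFrame, factor: int) -> pd.DataFrame:
    """Aggregate a fine grid to cells `factor` times larger on each side."""
    g = df.assign(R=df["row"] // factor, C=df["col"] // factor)
    return g.groupby(["R", "C"]).agg(
        population=("population", "mean"),     # features: spatial average
        employment=("employment", "sum"),      # outcome: total jobs per coarse cell
    ).reset_index()

for factor, label in [(4, "1km x 1km"), (8, "2km x 2km")]:
    coarse = regrid(fine, factor)
    print(label, "cells:", len(coarse))      # the coarser table would then be used to retrain
```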
Table 7: Performance comparison across RF models with spatial effects, different levels of spatial disaggregation (R2 (RMSE))

                         Raw Polygons                                        250m x 250m Grid Cells
Train data:              All Cities    SSA Cities    LAT Cities              All Cities    SSA Cities    LAT Cities
Test: All Cities         0.79 (0.46)   0.40 (0.81)   0.71 (0.55)             0.88 (0.35)   0.59 (0.67)   0.41 (0.79)
Test: SSA Cities         0.61 (0.62)   0.54 (0.70)   0.47 (0.78)             0.90 (0.33)   0.91 (0.31)   0.36 (0.83)
Test: LAT Cities         0.80 (0.45)   0.39 (0.80)   0.81 (0.44)             0.86 (0.38)   0.51 (0.73)   0.86 (0.38)
No. Obs.                 38,090        2,733         35,357                  385,458       221,890       163,568

                         1km x 1km Grid Cells                                2km x 2km Grid Cells
Train data:              All Cities    SSA Cities    LAT Cities              All Cities    SSA Cities    LAT Cities
Test: All Cities         0.83 (0.42)   0.63 (0.63)   0.50 (0.76)             0.76 (0.47)   0.63 (0.59)   0.51 (0.73)
Test: SSA Cities         0.84 (0.42)   0.82 (0.44)   0.43 (0.82)             0.74 (0.50)   0.76 (0.47)   0.46 (0.77)
Test: LAT Cities         0.83 (0.42)   0.59 (0.68)   0.82 (0.43)             0.79 (0.43)   0.55 (0.67)   0.77 (0.46)
No. Obs.                 25,578        14,959        10,619                  6,832         4,092         2,740

Note: Train and test data sets have been obtained via a random 80/20 split; No. Obs. refers to the full data sets of both train and test data for each sample.

Comparing the performance of the algorithms, as evaluated on structurally identical test and train data, across the three grid cell dimensions shows that the algorithm based on 250m x 250m grid cells delivers the best predictive performance, both across all observations and within each geographical region. However, this remains below the performance of the algorithm trained on 500m x 500m grid cells. In contrast, cross-evaluation points to better predictions when larger grid cells are used, although this benefit appears to become marginally smaller for grid cell structures larger than 1km x 1km. As discussed throughout Section 5, the algorithms are biased towards the mean and predict outliers less accurately. When we use larger grid cells, the underlying data is spatially averaged to compute grid cell values, which results in less extreme outlier values. This likely contributes to better predictive performance for larger grid cell units when evaluated on a test data set that is structurally different from the train data used to fit the model. This is in line with the previous results indicating that better predictions on structurally identical test data are generally associated with lower predictive performance on data that exhibits structural differences. Further, these results do not point to the introduction of a spatial disaggregation bias, which would add statistical noise to smaller grid cell dimensions and subsequently lead to reductions in predictive performance. However, it cannot be excluded that the higher predictive performance for smaller units is influenced by the larger number of observations underlying the training of the algorithm.

The algorithm trained on the polygon data overall delivers relatively strong predictive performance for observations in Latin America; however, it yields noticeably lower predictive performance across SSA. The employment data for LAT cities has been provided in, on average, smaller polygons, resulting in a substantially higher number of observations and thus benefiting the accuracy of the algorithm. Despite this, the quality of the predictions for LAT remains below that identified for the majority of grid cell dimensions. Overall, these results do not provide support for the hypothesis that spatial (dis)aggregation has introduced a substantial bias into the computations based on 500m x 500m grid cells. Furthermore, smaller units lead to better predictions in cases where test and train data are structurally identical, although this might be strongly influenced by the larger training data set.

Appendix C Additional Visuals

Figure 12: Relative Variable Importance across Spatial RF models

Figure 13: Illustration of Performance Metrics across Algorithms, panels (a) Spatial Reg GLM, (b) Spatial RF, (c) Spatial xgboost