Policy Research Working Paper 10257 Integrating Survey and Geospatial Data to Identify the Poor and Vulnerable Evidence from Malawi Melany Gualavisi David Newhouse Poverty and Equity Global Practice December 2022 Policy Research Working Paper 10257 Abstract Generating timely data to identify the poorest villages in census using gradient boosting. Incorporating the hypo- developing countries remains a fundamental challenge for thetical partial registry vastly improves the performance of existing data systems. This paper investigates the accuracy the predictions. When using the partial registry, the rank of four alternative methods for predicting a measure of correlation between the predicted and benchmark welfare village economic welfare for approximately 4,500 villages measures is 0.75, while those for the other three methods in 10 poor Malawian districts: (1) proxy means test scores range from −0.02 to 0.2, and similar results are seen when calculated from the 2017 social registry, (2) the Meta Rel- examining the area under the curve. Doubling the size of ative Wealth Index, (3) predictions derived from a standard the partial registry does little to improve predictive perfor- household survey and publicly available geospatial indica- mance. The results are robust to using a linear post–Least tors, and (4) predictions derived from a two-step approach Absolute Selection and Shrinkage Operator model instead that first predicts welfare into a hypothetical partial registry of gradient boosting for prediction. However, predictions of approximately 450 villages, and then predicts welfare using both methods are less accurate when the benchmark into the remaining villages using geospatial indicators. Geo- welfare measure is derived from a linear post–Least Abso- spatial indicators include land coverage indicators, weather lute Selection and Shrinkage Operator model. Overall, the data, night light data, building patterns, distance to major results strongly suggest that collecting partial registries of roads, and population density. Predictions are evaluated household-level poverty predictors in low-income contexts against a benchmark village welfare measure, constructed can vastly improve the performance of machine learning by imputing log per capita consumption from the 2016 models that combine survey and satellite imagery for the integrated household survey into the 2018 household purpose of village-level targeting. This paper is a product of the Poverty and Equity Global Practice and the Social Protection and Jobs Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank. org/prwp. The authors may be contacted at dnewhouse@worldbank.org. The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Produced by the Research Support Team Integrating Survey and Geospatial Data to Identify the Poor and Vulnerable: Evidence from Malawi* Melany Gualavisi§ David Newhouse† Keywords: poverty, geographic targeting, small area estimation, poverty mapping, satellite data, machine learning JEL codes: C51, I32, I38 * We are indebted to Lina Cardona, German Caruso, Chipo Msowoya, and Boban Paul for their support in providing data, financing, and support for this project. We thank Ifeanyi Edochie for assistance with obtaining geospatial indicators. We are grateful for Richard Akresh, Sarah Jansen, Nobuo Yoshida, and seminar participants at the World Bank and the University of Illinois for constructive comments. † Department of Economics, University of Illinois Urbana-Champaign § Development Economics Data Group, World Bank, and IZA 1. Introduction Identifying the poor in developing countries is crucial to inform development policies and programs, particularly those related to social assistance. However, governments and the development community are severely constrained by the high cost of collecting household surveys, censuses, or social registries that are typically used to inform targeting decisions. For example, between 2002 and 2011, 57 countries had conducted zero or one nationally representative household budget survey, preventing them from producing timely poverty estimates, and typically four years pass between nationally representative surveys on consumption or asset wealth in most African countries.1 Even when household data is collected, data collected by household surveys are typically too small to provide reliable estimates of welfare for small geographic areas such as villages. Estimating poverty at the small area level requires alternative sources of data, traditionally census data, and utilizing this type of auxiliary data for small area estimation can provide resources to the poor more efficiently.2 Due to this lack of timely and adequate information on measures of well-being indicators in small areas, satellite imagery and other types of non-traditional data have great potential to fill these data gaps and complement traditional household surveys to provide more timely and accurate estimates for local areas.3 This paper investigates the benefits of combining traditional data with publicly available remote sensing indicators to predict welfare across approximately 4,500 villages in 10 districts in Malawi. The 10 districts correspond to the ones selected for phase one of the Unified Beneficiary Registry (UBR), which was conducted in 2017, collecting information on living standards to determine the eligibility of households for social programs. We evaluate four alternative village targeting methods which are compared to a benchmark welfare measure derived from an extract of the 2018 household census: (1) PMT scores calculated in the 2017 UBR administrative data, (2) the Meta Relative Wealth Index,4 (3) predictions derived from a village-level model estimated using a 2016 household survey and publicly available geospatial indicators, and (4) a two-step procedure that utilizes a hypothetical partial registry of 450 randomly selected villages – 10 percent of the villages in the population -- in addition to the 2016 household survey and publicly available geospatial indicators. This hypothetical partial registry would collect selected proxy welfare indicators such as asset and demographic information from all households in the selected villages. The first step entails predicting per capita consumption from the household survey into the partial registry data, which we simulate by sampling from the census extract. The second step uses the partial registry predictions to train a model using publicly available geospatial data to generate estimates for the remaining 90 percent of non-registry villages. 1 Serajuddin et al. 2015, Yeh et al., 2020. 2 Van Der Weide et al. (2022). 3 Burke, 2021, World Bank, 2021. 4 The Meta relative wealth index is described in Chi et al (2021). 2 For the purposes of evaluation, a measure of benchmark welfare was constructed by predicting household per capita consumption as a function of household and demographic characteristics in the 2018 census extract using extreme gradient boosting, a popular machine learning algorithm. This model was then used to generate a prediction of per capita consumption for each household in the census extract, which we aggregate to construct a “ground truth” measure of village welfare. We refer to this model as the “census model”, because it is used to impute predicted per capita consumption into the census extract. This census model captures over half of the variation in log per capita consumption in the household survey, with an R2 of 0.537. It is used both to generate “ground truth” using the full census extract and to impute welfare into the simulated partial registry, derived from a subsample of the census extract. To construct a “ground truth” measure of village welfare, the values of predicted household per capita consumption in the 2018 census extract are aggregated to the village level. For the main set of results, the “ground truth” village welfare measure used is the average predicted per capita consumption of the poorest 50 percent of households in the census, when ranked according to their predicted per capita consumption. We select this as the main measure to generate a clean comparison with the UBR, which only collected information on the poorest 50% of households. Since the PMT scores in the UBR are based on the average predicted welfare of the bottom half of the distribution, evaluating it against a different benchmark could introduce an additional source of noise in the UBR comparison. However, since this is a non-standard measure of village welfare, we also show below that the key results are robust to using a more typical measure, namely the average predicted per capita expenditure across all households in each village. The 20 percent census extract contains village, traditional authority (TA) and district names but not household or village geocoordinates. Obtaining information on the physical location of census villages is crucial for this exercise. Therefore, we match census village names with UBR administrative data, which contains the names of the administrative areas as well as the geocoordinates of interviewed households. This enabled us to calculate centroids based on the minimum and maximum latitude and longitude of households living in that village in the UBR administrative data as an approximation of the village centroid in the census for about 4,500 villages. These centroids were then matched to a set of grids constructed to cover the country, in order to link remote sensing information to the census. The remote sensing indicators were obtained from Google Earth Engine, WorldPop, and Meta. For each village, we calculated the average of approximately 40 indicators: landcover indicators (e.g., percentage of vegetation, water, or build-up coverage), global precipitation measurement, soil moisture, nighttime data, and year of the transition from pervious to impervious areas. This was supplemented with gridded maps of building patterns (e.g., number, area, and length of buildings, among others) in 2017, population density indicators, build settlement growth, and distance to major roads taken from Worldpop. Grid-level averages of these satellite-derived features were then linked to the census data using the village centroids obtained from the matched UBR data. This paper considers three main research questions: (1) How much does a hypothetical partial registry improve predictions of village-level welfare in this context, as compared with the existing UBR and two other feasible alternatives? (2) How much does increasing the size of the partial 3 registry improve the accuracy of prediction models using geospatial indicators? (3) How do predictions generated using extreme gradient boosting, which are more robust to outliers, compare to those generated using post-LASSO linear models? The main result is that introducing the simulated partial registry yields predictions of village welfare that are vastly more accurate than the other methods considered. When using a partial registry, the rank correlation, Area Under the Curve (AUC) coefficients, and R2 are 0.75, 0.89, and 0.57 respectively. In contrast, the rank correlation for the other three methods ranges from - 0.02 to 0.2, the AUC scores range from 0.5 to 0.6, and the R2 of the predictions range from 0.001 to 0.04. These are huge differences in predictive accuracy. These results are based on a simulated partial registry of approximately 450 villages, about 10 percent of the total number of villages with available data. However, the accuracy of the predictions does not substantially improve when the size of the simulated partial registry increases to 675 or 900 villages, which is 15% or 20% of the census extract. In other words, a partial registry of 450 villages is sufficient to train a high-performing predictive model in this context. Finally, this exercise provides a useful opportunity to compare the predictive performance of extreme gradient boosting against a post-LASSO model, which has also been used in the literature but imposes a linear functional form on the model. We find that using post-LASSO for the geospatial model – the second step of the partial registry approach -- makes little difference to the accuracy of the predictions. However, using LASSO rather than gradient boosting for the census model used to construct the benchmark village welfare measure greatly reduces the explanatory power of the geospatial model, whether it is estimated using gradient boosting or LASSO. This is because extreme gradient boosting is a tree-based method that is more robust to outliers, and generates a benchmark measure of village welfare that is far easier to predict using geospatial indicators. This suggests that the census data in this case may be susceptible to outliers that introduce noise when using linear prediction models. Overall, the results demonstrate that investing in richer and context-specific training data, such as partial registries, can greatly improve the accuracy of predictions based on geospatial data. This paper contributes to a growing literature on using satellite imagery to predict welfare. Initially, the most common remote sensing indicator used for this kind of analysis was nighttime lights, which measure the intensity of light in specific areas. Previous studies show strong correlations between night-time lights and GDP.5 However, the association between night-time lights and other measures of household welfare is weak in most contexts, suggesting the limitation of this indicator for predictions of welfare in small areas.6 More recent literature has demonstrated that indicators derived from daytime imagery is better suited for predicting welfare. 7 Recent literature has also demonstrated that advances in machine learning and available non-traditional 5 Henderson et al., 2009, Pinkovskiy and Sala-I-Martin, 2016. 6 Mellander et al., 2013. 7 Jean, et al., 2016; Babenko et al., 2017; Engstrom et al., 2015; Engstrom et al., 2017; Head, A., 2017; Yeh, et al.,2020;, Chi et al (2021); Masaki et al., 2022. 4 data can improve the targeting of social programs.8 Less attention, however, has been paid to how predictive performance relates to the nature of the training data.9 There are a few caveats to consider. The first is that we focus on one context: 10 districts in Malawi. While there can be some common features to other developing countries, additional research would be useful to confirm that the results apply in other contexts. Second, with respect to data availability, imputing per capita consumption into the 20 percent census extract was the best available measure of village-level welfare we could obtain. Nonetheless, because the welfare measure is imputed on the basis of slow-changing characteristics such as household size, head’s education, child dependency ratio, and household assets, the benchmark welfare measure is a longer-term measure of welfare that will only partially include transitory welfare shocks. A third caveat is that the sample of villages is constructed by matching village names by hand, which raises the possibility that the sample of villages used in the study is not fully representative. However, as demonstrated in table 3, most of key observed characteristics in the census are similar on average between matched and unmatched villages. A final methodological issue is that, in villages included in the simulated partial registry, we utilize the same census model and data to construct both the benchmark measure and the partial registry predictions. As a result, the partial registry by construction perfectly predicts benchmark welfare in the 450 villages randomly selected for the registry. To address this issue, we show that replacing the perfectly accurate predictions from the partial registry with the imperfect predictions generated by the geospatial model leads to only a modest fall in predictive performance. This indicates that the vast majority of the improvement from utilizing the partial registry, relative to the other three methods considered, derives from its ability to training a much richer and more accurate predictive geospatial model, rather than the increased predictive accuracy in the villages it covers. In this context, our findings provide convincing evidence that partial registries that collect a limited set of indicators for all households in a sample of villages, if collected properly, can greatly enhance the accuracy of geospatial predictions of village welfare. Many surveys routinely undertake full listing exercises in sampled enumeration areas, which could be extended to collect information on welfare proxies. Combining household surveys, partial registries, and geospatial data has to our knowledge yet to be implemented. Yet the cost would be relatively modest; a rough estimate is that the marginal cost of interviewing all households in 450 villages would be between $24,300 and $72,900.10 Moreover, this strategy appears to offer a large improvement over existing feasible methods when targeting social assistance programs to poor villages in contexts where conventional data sources are incomplete or outdated. These partial registries could be integrated into current systems of data collection, to help existing surveys benefit more from the wealth of publicly available geospatial data. 8 Aiken et al, 2022 and Van der Weide et al. (2022). 9 Although Engstrom et al (2022) finds that the model performance is very sensitive to the size of the sample training data. 10 These are based on an estimated marginal cost of $3 to $9 per household to conduct face-to-face surveys in Malawi, and the average of 18 households per village in the census extract. 5 This paper proceeds as follows: Section 2 describes the data sets used to analyze and construct the benchmark welfare measure. Section 3 presents the statistical methodology used to generate alternative estimators of village level comparison to evaluate against the benchmark. Section 4 describes the main results. Section 5 includes some robustness checks, and finally, Section 6 consists of a discussion and main conclusions. 2. Data The analysis utilizes data from approximately 4,500 villages in 10 districts in Malawi. The 10 districts correspond to those selected for the first phase of the UBR data collection. These districts are poorer than the rest of the country, according to data from the 2016 integrated household survey. For instance, households in the UBR districts tend to have lower educational attainment, as measured by the share of households where the highest educated male or female completed secondary or tertiary education. Also, households in UBR districts are in rural areas and have lower quality houses in terms of roof, wall, and floor materials. The UBR households are also less likely to have access to piped water or flush toilets and own fewer assets (e.g., cellphone, fridge, computer, cars, radio, television) (see Annex 1 to see the full set of statistics). Finally, as expected, UBR districts have significantly lower per capita consumption (Figure 1). Figure 1. Distribution of the Log per capita consumption in UBR districts vs. the rest of the country 6 2.1 Description of the data sets The primary sources of information are the following: (1) the Unified Beneficiary Registry (UBR), collected in 2017; (2) a 20 percent extract of the 2018 census provided by the National Statistical Office of Malawi; (3) the Integrated Household Survey (HIS) collected in 2016; and (4) publicly available remote sensing indicators. Unified Beneficiary Registry (UBR) Malawi’s Unified Beneficiary Registry contains information on the households’ socio-economic characteristics to determine their eligibility for social programs.11 For the analysis, we use the data set collected during the first phase of the UBR. These data were collected in 2017 in 10 districts: Lilongwe, Ntchisi, Kasungu, Rumphi, Chiradzulu, Nkhota-Kota, Blantyre, Karonga, Ntcheu, and Dowa. During this phase, half of the households in these districts were registered based on Malawi’s average poverty rate. The UBR data set contains apporximately 595,000 households, spread across 14,986 villages in the 10 districts. The UBR is a crucial data set for this analysis for two reasons. First, it contains the PMT scores for the poorest 50 percent of households, which allows us to evaluate the PMT scores as a targeting mechanism. Secondly, these data include the geocoordinates of sample households, which allows us to merge the satellite data with the census data. Census data The analysis uses a 20 percent extract of the 2018 census data for 10 districts provided by the National Statistics Office of Malawi. The study utilizes data from the 4,500 villages that were matched, by name, with the UBR data. The census extract includes 235,600 households in 26,150 villages in the ten UBR districts. The census extract also serves two main purposes. First, it is used, along with the parameters of the census model estimated in the household survey, to generate the benchmark welfare used for evaluation. Second, it provides a randomly selected sample of villages that is used to simulate a partial social registry, as explained in the methodology section below. Survey data The analysis uses the fourth Integrated Household Survey (IHS) of 2016, which is made publicly available through the World Bank’s Living Standard Measurement Survey (LSMS) program. The survey includes a cross-sectional sample of 12,447 households surveyed in 779 enumeration areas (EAs). Thus, there are roughly 16 sample households per enumeration area. It is considered to be representative at the district level. Because it is an LSMS survey, jittered enumeration area coordinates are also publicly available.12 The IHS is used in the analysis for two primary purposes. The first is to estimate a model that predicts per capita consumption as a function of household variables common to the survey and the census extract, as a basis for constructing the benchmark measure of welfare in the census. Besides serving as a benchmark for evaluation, this predicted welfare measure is also used to simulate a partial registry in a subsample of villages. The second 11 See Lindert et al., 2018, for more information on the UBR. 12 Van der Weide et al (2022) finds that the jittering reduces the correlation between census and geospatial-based estimates for traditional authorities in Malawi by a modest amount. 7 main purpose of the IHS is to train a model that predicts welfare based on publicly available satellite data, which is one of the candidate prediction methods that is evaluated. Satellite data We obtained the satellite data from three different sources: Google Earth Engine, WorldPop, and Facebook. Table 1 contains a summary of the indicators used for the analysis.13 For indicators for which annual data are available, we collected information for 2017 and 2018 that correspond to the years of the UBR and census, respectively. This data is collected at the grid level, and then is merged with the village centroids obtained from the UBR. Table 1. Satellite indicators used in the analysis Source Indicators Land cover type, weather, vegetation, nightlights, year of change to Google Earth Engine impervious surface. 7 km by 7 km resolution Population density, build-settlement growth, OSM distance to roads Worldpop (2016), and building patterns data (2020). Resolution is 0.1 km Meta Relative Wealth Index. The resolution is 2.4 km. Note: Data is for 2017-2018 unless otherwise indicated. 2.2 Matching the census and the UBR data by village Matching villages between the UBR and the census data is a critical step that enables the linking of census villages with remote sensing indicators. The matching is based on names using an algorithm that matches two text variables and assigns a similarity score. The matching starts at the biggest administrative unit, Traditional Authorities (TA), followed by Group Village Names (GVN), and finally at the village level. This process resulted in 32% (4,72714) of the UBR villages being matched with the census villages. Six UBR TAs and 24% of the GVNs are not in the census. Among the non-merged villages (10,181), 38% (3,894) are in non-matched GVNs. The second panel of Table 2 shows the matching at the district level. The districts with the lowest percentage of matched villages are Rumphi, Nkhotakota, and Lilongwe. Table 2. Matching results between UBR and Census. Merged with census (%) Not merged (%) Total in UBR (%) Traditional authorities 107 95% 6 5% 113 100% Group village names 1,302 76% 401 24% 1,703 100% Villages 4,727 32% 10,181 68% 14,908 100% District 13 For more details about the specific names of the indicators, the bands collected, and years see Annex 2. 14 This number is larger than the villages used in the analysis since we lose some of them for having missing values in some of the features used in the models. For this reason, the analysis focused on approximately 4,500 villages. 8 Karonga 161 54% 138 46% 299 100% Rumphi 122 22% 423 78% 545 100% Kasungu 485 38% 780 62% 1,265 100% Nkhotakota 119 24% 382 76% 501 100% Ntchisi 499 51% 473 49% 972 100% Dowa 481 56% 377 44% 858 100% Lilongwe Rural 1,203 29% 2,996 71% 4,199 100% Ntcheu 404 57% 308 43% 712 100% Chiradzulu 528 77% 161 23% 689 100% Blantyre Rural 725 74% 249 26% 974 100% Total* 4,727 43% 6,287 57% 11,014 100% Notes: (*) the total in the second panel corresponds to the total number of villages in matched GVNs only. The analysis presented below is based entirely on the final sample of matched villages. Therefore, it is important to check whether there is any systematic selection bias in the sample. Table 3 shows the comparison of census means in the matched and unmatched villages. It shows that in most of the features, the difference in means between matched and unmatched villages is very small and not statistically significant. However, we observe that matched villages have a higher share of households with more educated adults. Also, households in matched villages have a lower child dependency ratio and slightly higher elderly dependency ratio. Finally, a higher share of households in matched villages have better roof quality in their houses. Table 3. Comparison of census means in matched and unmatched villages Difference P-value Mean (matched-unmatched) (Difference) Highest educated man: primary education 0.27 0.01 0.16 Highest educated man: secondary education 0.11 0.01 0.05 Highest educated man: tertiary education 0.02 0.00 0.22 Highest educated woman: primary education 0.26 0.02 0.09 Highest educated woman: secondary education 0.06 0.01 0.03 Highest educated woman: tertiary education 0.01 0.00 0.30 Share of households with literate house. head 0.73 0.02 0.09 Household size 4.87 (0.16) 0.30 Overcrowding 1.91 (0.10) 0.31 Elderly dependency ratio 0.08 0.01 0.00 Children dependency ratio 0.94 (0.03) 0.09 Firewood for cooking 0.90 (0.00) 0.66 Access to pipe water 0.06 0.00 0.92 Access to flush toilet 0.01 0.00 0.24 Share of HH that own a house 0.91 0.00 0.81 Share of HH with improved walls 0.86 0.04 0.21 Share of HH with improved roof 0.38 0.07 0.04 9 Share of HH with improved floor 0.18 0.02 0.16 Share of HH with cellphone 0.45 0.01 0.73 Share of HH with fridge 0.02 0.00 0.27 Share of HH with stove 0.02 0.00 0.32 Share of HH with computer 0.02 0.00 0.35 Share of HH with oxcart 0.03 (0.01) 0.24 Share of HH with bicycle 0.33 (0.02) 0.28 Share of HH with motorcycle 0.04 (0.00) 0.77 Share of HH with car 0.01 (0.00) 0.74 Share of HH with radio 0.28 0.01 0.12 Share of HH with television 0.06 0.00 0.50 Notes: the difference in means are calculated using a regression of each variable against the indicator variable equal to 1 if the village was matched and zero otherwise. They are weighted by the number of households in each village and the standard errors are clustered at district level. 3. Methodology We propose four different targeting methods using census and survey data combined with geospatial data to understand the most effective way to target the poor population: (1) the PMT scores calculated in the UBR data of 2017; (2) the Relative Wealth Index from Meta (3) combining survey data and geospatial indicators to predict average welfare; and (4) a census sample to simulate a partial registry data set used to train models using satellite data. We rely on rank correlations, Area Under the Curve coefficients, and R-squared coefficients to compare the accuracy of each targeting method. In each case, we compare the predictions with a measure of benchmark welfare constructed from the census model. 3.1 Construction of Benchmark Welfare using the Census model We define a benchmark welfare measure of “ground truth” in order to evaluate different targeting methods. As noted above, the primary welfare measure is the average predicted log per capita consumption of the poorest 50 percent of households in each village. This measure echoes the UBR structure limited to the bottom half of households in each village according to the average Malawi poverty rate. This measure takes advantage of the rich data in the census sample, and it is easier to predict than measured consumption due to reduced measurement error and its inability to capture temporary shocks.15 To construct the benchmark welfare in the census, we use the IHS 2016 to estimate a model and then impute welfare in the census. The variables included in the model are selected so that both data sets contain the same information. All the variables are measured at the household level. We include education variables such as the literacy of the household head, the maximum level of education achieved by men and women in the household, dependency ratios, household size and overcrowding, 15 Due to data availability, we use the per capita consumption as a welfare measure. However, we observe a high rank correlation with other potential measures such as absolute poverty rate (correlation of -0.86) or extreme poverty (correlation -0.796). 10 house characteristics, and household assets. The dependent variable is household per capita consumption. We train a machine learning model to predict household per capita consumption using an extreme gradient boosting model with optimal hyperparameters chosen via 5-fold cross- validation. The parameter used in the models corresponds to the average of the selected parameters in each fold. Details on the range of parameters considered are shown in Table 4. Additional technical information regarding these parameters and other aspects of the extreme gradient boosting procedure can be found in Annex 3. Table 4. Parameters for XGBoost models to estimate benchmark welfare. Parameter Range Maximum number of boosting iterations between 50 and 200 Maximum depth of a tree 2 or 4 Learning rate 0.1 or 0.3 Subsample ratio of the training instance 0.2,0.4, or 0.6 Subsample ratio of columns to construct each tree 0.2,0.5, or 0.7 We estimate four models using different samples in the IHS: (1) The full sample (All districts-all households), (2) All households in the UBR districts (UBR districts-all households), (3) the poorest 50% of the households in all districts (All districts-poorest 50%), and (4) the poorest 50% of households in UBR districts (UBR districts-poorest 50%). Estimating four models allows us to assess the trade-offs along two dimensions: (1) using all districts in the sample rather than restricting to UBR districts, and (2) using all households rather than just the lower half. Annex 4 shows the R-squared for each model and the main explanatory variables in terms of the gain measure.16 Including all households (model 1) leads to higher R-squared, while limiting to the bottom half of UBR districts (model 4) leads to by far the lowest R-squared. All models assign high importance to similar variables, mostly household assets, household size, and the urban or rural location (for the specific gain measures of each variable, see Annex 5). The model that best matches the UBR sample is model (4) since it uses only UBR districts and the poorest 50% of households; however, the sample size is small, and the R-squared is the lowest. Although Model (1) has the highest R-squared, it also has a wider range of variability in the dependent variable, making R-squared a potentially misleading metric. Models (2) and (3) have similar values of R-squared. We elect to use model (3) as the main census model because it resembles the sample structure of the UBR by using only the poorest 50% of households in each village. However, it also takes advantage of additional data on log per capita consumption by using the full set of sample survey enumeration areas nationwide. Table 5 shows the values for the R-squared and the importance of the top 15 features in the selected model. These features explain around 87% of the model being the “urban/rural” indicator the main contributor. We later present robustness analysis using the other three samples. 16 Gain reflects the improvement in accuracy brought by a feature to the branches it is on. This means that before adding a new split on a feature X to the branch there were some wrongly classified elements, once the split on this feature is added, there are two new branches, and each of them is more accurate. A higher value of gain indicates that the feature is more important for generating a prediction. 11 Table 5. Benchmark welfare census model in IHS All districts-50% poorest HH R-squared 53.71 Importance of variables Urban (1) or rural (0) 0.24 Child dependency ratio 0.09 Ownership of a cell phone 0.09 House with improved floor 0.09 Household size 0.08 Access to piped water 0.05 Ownership of a television 0.05 Household overcrowding 0.04 Fuel cooking: firewood 0.03 Household size (squared) 0.03 Ownership of a radio 0.02 Access to flush toilet 0.02 Ownership of a car 0.02 HH head literacy 0.01 Highest educated women attained primary 0.01 The selected model is used to predict the benchmark welfare variable in the census extract. This variable is generated at a household level first, then aggregated to a village welfare measure as the average of the bottom half of households in each village. 3.2 Criteria used for evaluating prediction accuracy We use three main criteria for evaluating the accuracy of the predictions against the benchmark: the Spearman Rank Correlation, the Area Under the Curve (AUC), and the R-squared , which is equal to the share of the variation explained by the prediction. While the latter is a standard measure of prediction accuracy, the first two warrant a brief explanation. The Spearman rank-order correlation coefficient ( ) is a statistical measure of the strength and direction of a monotonic relationship between two variables measured on a continuous scale. The rank correlation between two variables, X and Y, is calculated as follows: (()()) = ()() = () () Where ()() denotes the Pearson correlation coefficient applied to variable ranks; (()()) is the covariance between two ranked variables, and () () are the standard deviations of the ranked variables. 12 Finally, the Area under the Curve (AUC) is a measure of the efficacy of a targeting method in identifying the poor population at different targeting thresholds (Wodon, 1997, Olken and Hanna 2018). The curve in question is a receiver operator characteristic (ROC) curve, which indicates the trade-off between true and false positives at different poverty lines. We plot the ROC curve using each percentile of the benchmark village welfare predicted from the census. In other words, for poverty lines defined at each percentile of the benchmark distribution, we plot the true positive rates (TPR) on the Y axis and the false positive rate (FPR) on the X axis. The former is defined as the proportion of poor villages that are correctly predicted to be poor when using predicted village welfare from each candidate prediction method, while the latter is defined as the proportion of non-poor that are incorrectly predicted to be non-poor. Because the true positive rate is on the Y axis, a higher AUC score represents an improvement in true positives for a given level of false positives, or a better targeting method. The 45-degree line, which is what one would expect if villages were ranked randomly, corresponds to an AUC score of 0.5, while a perfectly accurate ranking that correctly identifies poor households under all poverty lines would receive an AUC score of 1. 3.3 Candidate Targeting Methods for Identifying Poor Villages Proxy Mean Test scores in the UBR The administrative data from Malawi’s UBR provides information on households’ characteristics to assess their prospective eligibility for social programs. The data set contains an extensive range of variables such as geographic location, households’ assets, food security questions, and economic characteristics. This information was used by the Malawian government and the World Bank to calculate Proxy- Means Test (PMT) scores to identify poor and vulnerable households. It is used to create a proxy score of weighted variables that are highly correlated with household consumption. The PMT, like our benchmark welfare measure, is a measure of chronic poverty. This is because it uses variables that are less responsive to economic shocks than household consumption, such as assets and household composition (Lindert et al., 2018). Because the PMT variable is available in the data and was used for the UBR, it is useful to evaluate it against welfare predicted into the census extract. Relative Wealth Index The Relative Wealth Index predicts the relative standard of living within countries using non- traditional data sources such as satellite imagery, cellular network data, topographic maps, and proprietary connectivity data from Meta. Using supervised machine learning models, the team predicts the relative wealth for grid cells of 2.4 km2. The estimates of wealth are relatively accurate. Depending on the method used to assess the model’s performance, the model explains 56 to 70 percent of the actual variation in household-level wealth in 56 low- and middle-income countries (Chi, et al, 2022). However, the model is trained on a wealth index, which may 13 perform less well when predicting income or consumption-based poverty measures. For example, the RWI only explains 32 percent of the variation in average per adult equivalent consumption across Cantons in Togo (Aiken et al, 2021). Furthermore, for a microcensus conducted in rural Kenya, the RWI explains 70 percent of the variation in wealth but only 17 percent of the variation in the predicted probability of being poor, defined using household consumption (Chi et al, 2022). Thus, the performance of the RWI varies greatly depending on the context, and particularly depends on whether it is evaluated against a wealth or consumption- based measure of welfare. This study therefore contributes additional information on the performance of the RWI in distinguishing among very poor villages by comparing it to the benchmark measure of village welfare in Malawi. IHS plus geospatial indicators. The third alternative method for targeting consists of using the IHS survey data to train a welfare model against satellite indicators. This model can then be used to generate out-of-sample predictions into villages for which matched census data is available, to compare against the benchmark. This method has the advantage of being free, but may suffer from limited training data. We estimate the model only for the poorest 50% of households in each village to resemble the structure of the UBR data set and train it using extreme gradient boosting techniques. Partial registry As a final alternative method to target the poor, we consider a hypothetical collection of a partial registry data set from a sample of villages. This exercise would consist of collecting the subset of household welfare proxies used in the census model from all households in a random sample of villages, similar to an expanded sample listing procedure of the type typically carried out for household sample surveys. In practice, for this analysis we simulate a partial registry by drawing a random sample of villages from the census extract. This sample, consisting of all households in the sampled villages, is used as the hypothetical partial registry in the two-step procedure. The first step involves utilizing the parameters from the census model used to predict benchmark welfare (described above) to predict per capita consumption into the simulated partial registry. The predictions are based on 39 independent variables common to both the census and survey. The resulting predictions for households are then aggregated into a measure of village welfare. The second step of the prediction process entails estimating a second model, the geospatial model. The dependent variable in this model is predicted village welfare from the simulated partial registry, and the independent variables are village geospatial indicators, also estimated using extreme gradient boosting. This “geospatial model” is used to predict out-of-sample welfare predictions for villages not included in the partial registry. The predictions from the two models are combined, by using the census model predictions from the simulated partial registry in the villages for which they are available, and the geospatial model predictions for the remaining villages not included in the registry. In other words, in villages covered by the simulated partial registry, we use predictions from the census model rather than those from the geospatial model. This is because the simulated partial registry provides more accurate predictions than geospatial data, which are in fact exactly equivalent to the benchmark welfare measure by construction. We show below, however, that predictions from this procedure become 14 only modestly less accurate when using predictions from the geospatial model for all villages. We therefore conclude that, although predictions from the census model predict the benchmark exactly in partial registry villages, this is not a major factor in explaining the improved performance of the partial registry predictions. Instead, the partial registry provides richer data with which to train a more accurate geospatial model. 4. Main Results This section shows the results of the four alternative targeting methods. We compare welfare predictions generated using each method to the benchmark predictions using rank correlations, AUC coefficients, and R-squared coefficients. Table 6 presents the metrics for each method. The partial registry is clearly the most accurate method for targeting the poor villages in the ten UBR districts in Malawi. The second best is using the RWI to rank villages. However, this method performs only moderately well, with an AUC coefficient close to 0.60 and a 0.2 rank correlation. On the other hand, using the PMT scores as a proxy for welfare does not show promising results; the PMT scores show zero correlation with our benchmark welfare and has an AUC coefficient equivalent to guessing poor villages at random. The IHS plus geospatial variable model shows the second-lowest coefficients, only slightly higher than the PMT scores in terms of AUCs. This may be due to the limited sample size. Because the sample size is restricted to the bottom half of households within UBR districts, there are only an average of about 8 households per village available to train the model. One indication of this is that performance improves noticeably when predicting average welfare across all households in the village, as noted below. Table 6. Rank correlations, AUC, and R2 of the targeting methods Rank correlations AUC R-squared Partial registry (10% of the census sample) 0.75 0.89 0.57 PMT scores (0.02) 0.50 0.00 IHS training sample 0.13 0.53 0.01 RWI 0.20 0.60 0.04 Figure 2. Rank correlations, AUC, and R2 of the targeting methods 15 1.00 0.89 0.90 0.80 0.75 0.70 0.60 0.57 0.60 0.50 0.53 0.50 0.40 0.30 0.20 0.20 0.13 0.10 0.04 0.00 0.01 - (0.10) (0.02) Rank correlations AUC R-squared Partial registry (10% of the census sample) PMT scores IHS training sample RWI The ROC curves of the four methods are presented in Figure 3. This is the graphical representation of the AUC results described in Table 6. The partial registry method leads to far superior targeting outcomes, especially at low poverty rates. The RWI is the second-best method but is still far from the partial registry results. The curves for PMT scores and IHS are very close to each other and to the 45-degree line that corresponds to randomly selected villages. Figure 3. ROC curves of the targeting methods 1 0.9 0.8 0.7 True Positive Rates (TPR) 0.6 0.5 0.4 0.3 0.2 0.1 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 False Positive Rates (FPR) 45 degree line Partial registry: census sample 10% PMT scores RWI IHS training sample 16 As noted above, two factors might contribute to the excellent predictive performance of the partial registry method relative to the direct predictions from the household survey. The first is that the geospatial model is trained to a measure of village welfare that is far more precisely estimated than the one used in the sample. The welfare measure is more precisely estimated both because it is based on data from a much large number households from the census extract, and because it is a predicted welfare measure that largely eliminates classical measurement error. More precise training data improves the ability of machine learning to construct a predictive model, by reducing the risk that particular predictive variables will be fit to random noise in the training data, and by improving the accuracy of the cross-validation procedure used to select models. A second potential reason for the better performance of the partial registry method is that it uses predictions derived from the census model, which is also used as the benchmark measure of welfare. While this is defensible on the grounds that it improves the accuracy of the predictions, using the same values for prediction and evaluation will overstate measured performance. To address this, as a robustness check we consider the accuracy of the geospatial model predictions for all villages, instead of using the partial registry predictions where available as well as out-of- sample predictions. In this case, the rank correlation falls from 0.75 to 0.73 and the AUC coefficient falls from 0.89 to 0.78. However, this still greatly exceeds the performance of the RWI, the second-best method, which has a rank correlation of 0.20 and an AUC of 0.60 (see Figure 4). Figure 4. Rank correlations, AUC, and R-squared coefficient for the partial registry method. 1.00 0.89 0.90 0.84 0.78 0.80 0.73 0.72 0.75 0.70 0.5375 0.5732 0.60 0.5207 0.50 0.40 0.30 0.20 0.10 - Rank correlation AUC R-squared Full sample predictions Out of sample predictions Out-of-sample plus training values The RWI, although the second-most accurate method for predicting the benchmark welfare, is far less accurate than the partial registry results, and only explains 4 percent of the variation in our measure of average village predicted welfare. This is probably because the RWI is based on a measure of household wealth instead of consumption or predictive consumption. Wealth may not be as accurate at distinguishing the welfare levels of villages within 10 poor districts. Moreover, the wealth index reflects the full distribution of households, whereas the benchmark welfare measure only pertains to the bottom half. 17 The RWI, however, performs better than efforts to integrate the household survey with publicly available geospatial data in the absence of the partial registry. This is because of the noise in the household survey data used to train the model. Given that only the bottom half of the household survey data are used to train the model, there are only roughly 8 households per EA with which to generate a measure of consumption. The resulting machine learning model is therefore not particularly accurate. Finally, the PMT scores from the UBR 2017 are the least accurate proxies for village level welfare, as defined by the benchmark welfare measures. These are not entirely due to outliers. Figure 5 shows the presence of some outliers in the raw scores; however, even after trimming the values, the correlation is low. Some of the relatively poor performance of the UBR PMT targeting might be attributable to problems with data collection, given that it was the initial effort to collect data for the UBR. Figure 5. PMT scores and benchmark welfare Figure 6 displays another way to present the main results. It plots the predicted welfare measures (on the Y axis) against the actual value of benchmark welfare (on the X axis). Villages are divided into four groups depending on whether their benchmark and predicted welfare fall into the bottom quartile. The bottom left and top right quadrants represent villages that are correctly predicted to be in or out of the bottom quartile. The upper left quadrant shows targeting errors of exclusion and the bottom right shows targeting errors of inclusion. Of the four methods, it is clear that the partial registry approach has by far the lowest prevalence of points in the top left and lower right quadrants, and that these errors are closer to the center. Meanwhile, the PMT has the highest prevalence of errors, especially errors of exclusion. The Meta RWI and the geospatial 18 household survey model have similar error rates, with the latter slightly less likely to suffer from large exclusion errors. Figure 6. Predicted vs. Benchmark welfare for the four alternative methods Notes: Graphs show benchmark village welfare plotted against predicted welfare, for the four predictions methods: The partial registry (top left), PMT (top right), Meta relative wealth index (bottom left) and IHS plus geospatial indicators (bottom right). Each plot is divided into four quadrants with boundaries defined at the 25 th percentile of predicted and benchmark welfare. When classifying villages in the bottom quartile as poor, the quadrants represent villages correctly predicted as non-poor (top-right), falsely included as poor (bottom right), correctly predicted as poor (bottom left) and falsely excluded as non-poor (top left) Finally, Annex 6 presents heat maps of the predicted per capita consumption using each method and the benchmark welfare. In the maps, the PMT scores fail to predict welfare mainly in the central region of Malawi which includes Lilongwe, Dowa, Ntchisi, Nkhotakota, and Kasungu districts. Most methods make accurate predictions in two districts: Rumphi and Chiradzulu. 5. Robustness checks This section presents the results of robustness checks along five dimensions: The size of the partial registry, the nature of the village welfare measure, the geographic composition of the 19 census model, the estimation method for the geospatial model, and the use of two proprietary geospatial indicators. A. The Size of the Partial Registry Given the impressive predictive performance of the partial registry predictions, one might wonder whether increasing its size would further improve performance. Figure 7 compares the performance of the partial registry predictions for partial registries of 444 villages (10 percent), 666 villages (15 percent) and 888 villages (20 percent). Overall, expanding the size of the hypothetical partial registry offers only limited improvements in predictive accuracy, and is not worth the added expense it would entail. Figure 7. Expanded sample for the partial registry method 1.00 0.89 0.88 0.89 0.90 0.80 0.80 0.75 0.78 0.70 0.63 0.64 0.57 0.60 0.50 0.40 0.30 0.20 0.10 0.00 Rank correlation AUC R-squared Sample 1: 10%-90% Sample 2: 15%-85% Sample 2: 20%-80% B. Village welfare measure Second, we consider how the results are affected by the choice of welfare measure. This is important because the results until now have used a non-standard welfare measure, namely the average predicted per capita consumption of the bottom half of households in the village. This was based on a conscious decision to match the UBR administrative data, which only contains PMT scores for the bottom 50 percent of households in each village. This section considers how the results change when we consider mean village consumption, taken across all households, as the main welfare measure. Changing the welfare measure has three main implications for the methodology. First, the census model must now be retrained using all households, not just the bottom half, in the survey. This of course also changes the predicted values of household per capita consumption in the simulated partial registry, which is equal to the benchmark welfare model for villages included in the registry, which entails re-estimating the geospatial model. Finally, we re-estimated the IHS plus geospatial model to train it against average per capita consumption across all households in the survey, rather than just the bottom half. 20 Table 7 displays the results when using mean village consumption instead of the mean of the bottom half as the village welfare measure. Three main findings are clearly apparent. First, the partial registry approach continues to perform vastly better than the other alternatives when attempting to predict mean village welfare. Second, both the partial registry approach and the RWI suffer moderately when predicting the mean over all households rather than the mean of the poorest half of households, particularly when it comes to rank correlations. This may be because of idiosyncratic positive outliers in the upper half of the household predicted welfare distribution which are more difficult to predict using both geospatial data and predictions trained on asset indices. Table 7. Metrics of all the methods when using all households in the villages All districts-all HH Rank correlations Partial registry (10% of the census sample) 0.61 PMT scores 0.02 IHS training sample 0.19 RWI 0.14 AUC Partial registry (10% of the census sample) 0.77 PMT scores 0.50 IHS training sample 0.59 RWI 0.55 R-squared Partial registry (10% of the census sample) 0.35 PMT scores 0.00 IHS training sample 0.01 RWI 0.02 Third, the method that combined survey and geospatial predictors without a partial registry (IHS plus geospatial predictors) performs much better when using the full sample of households than when only using the bottom half for each village. The rank correlation increases from 0.13 to 0.19 and the AUC increases from 0.53 to 0.59. This is because on average there are only approximately sixteen households interviewed in each village in the IHS, and average per capita consumption is much more accurately measured when all sample households in each EA are used to train the model rather than only the bottom half. The resulting predictions, when using the full IHS sample, also performs better than the Meta relative wealth index. In this context, when trying to predict the average predicted per capita consumption from a census extract, the fact that the RWI uses additional training data from many countries and proprietary indicators on connectivity does not fully compensate for the fact that it is trained to predict an asset index rather than a consumption-based welfare measure. Therefore, a model that predicts average village per capita consumption directly on the basis of publicly available geospatial characteristics is slightly superior for targeting in this context, though both are far worse than collecting additional partial registry data to train a better geospatial model. 21 C. The geographic composition of the sample The benchmark measure of welfare is crucial for evaluating different prediction methods. However, because the census extract is only available for ten districts, it is not immediately clear whether it would be best to use only the survey data from those ten districts, or the full set of survey data to train the census model. The latter takes advantage of a wider set of training data, but the former may better capture the specific relationships between welfare and household characteristics in those poor districts. Table 8 shows the results when varying the household survey sample used to estimate benchmark welfare. Specifically, we experiment with using only the UBR districts in the household survey data to train the census model, rather than the full sample. While the partial registry method remains the most accurate method by far, it doesn’t do nearly as well when the benchmark welfare measure is derived from a model trained on data from only the UBR districts. This is because the sample size used to train the models declines significantly from 6,000 to 2,000 poorest households when limiting the training sample to households in UBR districts, leading to a less informative benchmark welfare model and measure. The partial registry method is particularly sensitive to the weakening of explanatory power in the census model, due to limiting the training data to UBR districts. This is because the predictions from the census model are also used as the dependent variable to train the second stage geospatial model that generates estimates for non-registry villages. Interestingly, however, the predictive performance of the RWI also declines substantially, due to the increase in noise in the benchmark measure of welfare. Nonetheless, the predictive performance of the UBR improves, suggesting that the PMT may have picked up some of the heterogeneity in welfare patterns within the 10 districts. Table 8. Results when training models on sample data from UBR districts instead of all districts All districts-poorest 50% HH UBR districts-poorest 50% HH Rank correlations Partial registry (10% of the census sample) 0.753 0.399 PMT scores (0.02) 0.15 IHS training sample 0.13 0.04 RWI 0.20 0.11 AUC Partial registry (10% of the census sample) 0.89 0.64 PMT scores 0.50 0.53 IHS training sample 0.53 0.50 RWI 0.60 0.53 R-squared Partial registry (10% of the census sample) 0.57 0.17 PMT scores 0.00 0.01 IHS training sample 0.01 0.00 RWI 0.04 0.01 22 D. Using LASSO instead of XGboost for the geospatial and census model This exercise provides a useful opportunity to compare predictions across two different machine learning approaches: Extreme Gradient Boosting, and post-LASSO, which uses LASSO-selected variables in an OLS regression model. The main difference between these two approaches is that the former, as applied here, is a tree-based classification method that uses sequential random forests to predict the dependent variable. Extreme gradient boosting can therefore accommodate highly non-linear relationships. In contrast, post-lasso imposes a linear functional form. An open question in this context is how much this linearity assumption affects the accuracy of the predictions. The results when using post-LASSO to generate the geospatial predictions based on the partial registry – the second step of the partial registry procedure -- are displayed in Figure 8. These are applied to predictions from the baseline census model, which uses data from all districts but predicts the average welfare of the bottom half of households. Overall, the post-LASSO model performs slightly better in terms of rank correlation, while extreme gradient boosting performs a bit better when looking at AUC and R-squared. Thus, in this context, whether one uses post- LASSO or extreme gradient boosting in the second stage of the partial registry method makes little difference. However, the estimation method for the benchmark welfare seems to affect the results considerably. When the geospatial models are trained using a benchmark welfare that was predicted using LASSO, the metrics for the partial registry methods decrease significantly: rank correlations go from 0.75 to 0.27, AUC from 0.89 to 0.66 and R-squared from 0.57 to 0.08. This could indicate potential measurement errors in the census that affects the performance of LASSO models since they are more sensitive to outliers. Figure 8. Partial registry predictions using LASSO vs. XGBoost 1.00 0.89 0.90 0.85 0.80 0.76 0.75 0.70 0.66 0.63 0.60 0.56 0.57 0.50 0.40 0.28 0.27 0.30 0.20 0.08 0.10 0.03 - Rank correlations AUC R2 Rank correlations AUC R2 LASSO-benchmark XGBoost-benchmark LASSO-predictions XGBoost-predictions 23 6. Conclusions In this paper, we evaluate different alternative methods to identify poor villages in 10 Malawian districts. This is a challenging prediction exercise because villages are highly geographically disaggregated. The results show that a two-step approach utilizing a hypothetical partial registry from 450 villages performs vastly better than the PMT, geospatial prediction based solely on the household survey, or the Meta relative wealth index. The main measure used to identify poor villages is the mean predicted per capita consumption of the bottom half of households in each village, but key results hold when using the mean predicted per capita consumption of all village households as the village welfare measure. Implementing the partial registry method requires nationally representative survey data, publicly available geospatial indicators, and the collection of a partial registry containing a subset of household characteristics found in the survey data. Several similar household surveys that collect information on welfare proxies have been fielded with the support of the World Bank through the Survey of Well-Being with Instant and Frequent Tracking (SWIFT) program, including in Malawi. Although none have surveyed the full population of households in selected villages, it is quite standard for household surveys to list all surveys in sampled enumeration areas, and we estimate that the cost of collecting approximately 40 variables from all households in approximately 500 villages could be in the ballpark of $24,000 to $73,000. This is a worthwhile investment to greatly boost the accuracy of village welfare measures constructed using geospatial data. Some countries also field periodic community surveys, which could potentially be tweaked to collect partial registries. This paper also demonstrates the efficacy of applying gradient boosting models in settings with household-level predictors, when sufficient data are available. In particular, using training data from the full sample of households, rather than only the 10 districts of interest, substantially increases predictive performance. Using post-lasso models instead of gradient boosting in the geospatial model, the second step of the partial registry method, only slightly affects the accuracy of the predictions. In contrast, using post-lasso models instead of extreme gradient boosting in the census model, used to impute per capita consumption into the simulated partial registry, greatly reduces the predictive power of the geospatial model. This large reduction in predictive power occurs both when using gradient boosting and post-LASSO for the geospatial model. This suggests that the census data may contain outliers, which introduce more noise into the partial registry predictions when using a linear model of log per capita consumption than when using gradient boosting. The relatively poor performance of the PMT scores derived from the UBR data is a puzzle. The UBR PMT scores performed a bit better when the benchmark measure of welfare was constructed using data only from the 10 Malawian districts. Even so, the UBR PMT scores do not appear to be consistent with welfare predicted using the census data collected from these districts. Partly this may be due to the UBR data being taken from the initial phase of data collection, although it is also possible that the purpose of the partial registry may have led to measurement error. It would be useful to do these types of evaluations with further rounds of the UBR, even if compared against old census data, to see if later rounds of the UBR produce predictions that are more consistent with the census. 24 Two limitations of this study are that it only applies to 10 districts in Malawi and is based on a 20 percent extract of the census. Additional work could demonstrate that similar results hold in different contexts and when using a full census. A third limitation is that the hypothetical partial registry is taken from the census, and therefore assumed to match the census exactly. In reality, measurement error in data collection for the partial registry will reduce its performance relative to census-based predictions. Indeed, a partial registry could suffer from some of the same issues in data collection experienced when collecting data for the UBR. A project that pilots the collection of a partial registry for the purpose of training a geospatial model would provide a more realistic test of the partial registry approach and could shed new light on whether such a partial registry would be prone to systematic bias. Finally, future research could leverage household level information on geocoordinates if they can be obtained in census data. This would enable estimating models relating predicted welfare to geospatial indicators at the household level, which may perform better than the village-level models considered in this analysis. Despite these caveats, the results convincingly demonstrate both the limitations of existing methods, and the potential for partial registries to add massive value when using survey and geospatial data to identify the poorest villages in a very low-income setting. 25 References Aiken, E., Bellue, S., Karlan, D., Udry, C. R., & Blumenstock, J. (2021). Machine learning and mobile phone data can improve the targeting of humanitarian assistance (No. w29070). National Bureau of Economic Research. Babenko, B., Hersh, J., Newhouse, D., Ramakrishnan, A., & Swartz, T. (2017). Poverty mapping using convolutional neural networks trained on high and medium resolution satellite images, with an application in Mexico. arXiv preprint arXiv:1711.06323. Chi, G., Fang, H., Chatterjee, S., & Blumenstock, J. E. (2022). Microestimates of wealth for all low-and middle-income countries. Proceedings of the National Academy of Sciences , 119(3). Engstrom, R., Sandborn, A., Yu, Q., Burgdorfer, J., Stow, D., Weeks, J., & Graesser, J. (2015, March). Mapping slums using spatial features in Accra, Ghana. In 2015 Joint Urban Remote Sensing Event (JURSE) (pp. 1-4). IEEE. Engstrom, R., Hersh, J., & Newhouse, D. (2016). Poverty from space: using high resolution satellite imagery for estimating economic well-being and geographic targeting. unpublished paper. Engstrom, R., Newhouse, D., Haldavanekar, V., Copenhaver, A., & Hersh, J. (2017, March). Evaluating the relationship between spatial and spectral features derived from high spatial resolution satellite data and urban poverty in Colombo, Sri Lanka. In 2017 Joint Urban Remote Sensing Event (JURSE) (pp. 1-4). IEEE. Head, A., Manguin, M., Tran, N., & Blumenstock, J. E. (2017, November). Can human development be measured with satellite imagery?. In Ictd (pp. 8-1). Henderson, J. V., Storeygard, A., & Weil, D. N. (2012). Measuring economic growth from outer space. American economic review, 102(2), 994-1028. Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B., & Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353(6301), 790-794. Limarino, W. (2021) Augmented Proxy Mean Tests. Can Machine Learning Improve Targeting Effectiveness? (Preliminary Draft) Lindert, K., Andrews, C., Msowoya, C., Paul, B. V., Chirwa, E., & Mittal, A. (2018). Rapid Social Registry Assessment. Masaki, T., Newhouse, D., Silwal, A. R., Bedada, A., & Engstrom, R. (2022). Small area estimation of non-monetary poverty with geospatial data, Statistical Journal of the IAOS, v 37 no. 4 Mellander, C., Lobo, J., Stolarick, K., & Matheson, Z. (2015). Night-time light data: A good proxy measure for economic activity?. PloS one, 10(10), e0139779. Pinkovskiy, M., & Sala-i-Martin, X. (2016). Lights, camera… income! Illuminating the national accounts - household surveys debate. The Quarterly Journal of Economics, 131(2), 579-631. Serajuddin, U., Uematsu, H., Wieser, C., Yoshida, N., & Dabalen, A. (2015). Data deprivation: another deprivation to end. World Bank policy research working paper, (7252). 26 Smythe, I., & Blumenstock, J. E. (2021). Geographic micro-targeting of social assistance with high- resolution poverty maps. In Submission (KDD). Van Der Weide, Roy; Blankespoor, Brian; Elbers, Chris; Lanjouw, Peter. 2022. How Accurate Is a Poverty Map Based on Remote Sensing Data?: An Application to Malawi. Policy Research Working papers;10171. World Bank, Washington, DC. © World Bank. https://openknowledge.worldbank.org/handle/10986/38009 License: CC BY 3.0 IGO.” Yeh, C., Perez, A., Driscoll, A., Azzari, G., Tang, Z., Lobell, D., ... & Burke, M. (2020). Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature communications, 11(1), 1-11. Zar, J. H. (2014). Spearman rank correlation: overview. Wiley StatsRef: Statistics Reference Online. 27 Annexes Annex 1. Household characteristics in UBR districts vs. the rest of the country P-value Total No UBR UBR of the difference Highest educated male has primary education 0.19 0.18 0.20 0.056 (0.39) (0.40) Highest educated male has secondary education 0.09 0.10 0.07 0.000 (0.30) (0.25) Highest educated male has tertiary education 0.03 0.04 0.02 0.000 (0.19) (0.12) Highest educated female has primary education 0.18 0.17 0.18 0.069 (0.38) (0.39) Highest educated female has secondary education 0.05 0.05 0.03 0.000 (0.23) (0.17) Highest educated female has tertiary education 0.02 0.03 0.01 0.000 (0.16) (0.10) Household head is literate 0.72 0.73 0.71 0.162 (0.45) (0.45) Household size 4.33 4.33 4.32 0.686 (2.02) (1.96) Household overcrowding 2.04 2.08 1.96 0.000 (1.28) (1.23) Urban Household 0.18 0.25 0.05 0.000 (0.43) (0.22) Elderly dependency ratio 0.07 0.07 0.08 0.000 (0.20 (0.22) Child dependency ratio 0.39 0.39 0.38 0.032 (0.24 (0.24) Fuel cooking: firewood 0.81 0.76 0.91 0.000 (0.43 (0.29) Access to piped water 0.23 0.28 0.12 0.000 (0.45 (0.33) Access to flushtoilet 0.04 0.05 0.01 0.000 (0.22 (0.12) Household owns a house 0.74 0.70 0.81 0.000 (0.46 (0.39) Household has improved walls 0.91 0.94 0.83 0.000 (0.23 (0.38) Household has improved roof 0.50 0.54 0.42 0.000 (0.50 (0.49) Household has improved floor 0.29 0.32 0.21 0.000 (0.47 (0.41) 28 Hosehold has cellphone 0.50 0.51 0.46 0.000 (0.50 (0.50) Hosehold has fridge 0.06 0.08 0.02 0.000 (0.27 (0.13) Hosehold has stove 0.00 0.00 0.00 0.888 (0.06 (0.06) Hosehold has computer 0.03 0.04 0.01 0.000 (0.19 (0.07) Hosehold has oxcart 0.01 0.01 0.02 0.000 (0.09 (0.14) Hosehold has bicycle 0.37 0.36 0.37 0.660 (0.48 (0.48) Hosehold has motorcycle 0.02 0.02 0.02 0.515 (0.13 (0.13) Hosehold has car 0.02 0.03 0.01 0.000 (0.16 (0.08) Hosehold has radio 0.42 0.43 0.39 0.000 (0.50 (0.49) Hosehold has television 0.13 0.16 0.06 0.000 (0.36 (0.24) Source: Integrated Household Survey 2016. Note: Standard deviation in parenthesis. Annex 2. Satellite data 29 Annex 3. Xgboost models XGBoost is a gradient boosting algorithm that provides a parallel tree boosting that solves data science problems in a fast and accurate way. It is designed to work with large and complex data sets. This annex describes the use of Xgboost for regression. The algorithm fits a regression tree to the residuals as gradient boost but uses a unique regression tree. Each tree starts with a single leaf that is called a root, and all the residuals go to the leaf. The algorithm calculates similarity scores and gain to determine how to split the data. The similarity score for the residuals on each leaf equals 2 = + Where λ is a regularization parameter intended to reduce the prediction’s sensitivity to individual observations and prevent overfitting the training data. If the leaf has several different residuals, the similarity score will be relatively small since they will cancel each other out. In contrast, if the residuals are similar or the leaf has very few residuals, the similarity score will be relatively large. To quantify how much better the leaves cluster similar residuals than the root, we need to calculate the gain of splitting the residuals into groups. The gain is equal to = ℎ − ℎ Then the algorithm compares the gain calculated for each split and selects the one with the highest value since that would mean that a particular feature is better at splitting the residuals into clusters of similar values. Then it continues with another split. You can limit the tree depth or the splits to different levels, up to 6 levels is the default. To determine output values for the leaves, we calculate the following: = + The output value is like the similarity score, except that it does not square the sum of the residuals. After this, the tree can be used for making predictions. Like gradient boost, xgboost makes new predictions starting at the initial prediction and adding the output of the tree scaled by a learning rate ε. The new predictions will have smaller residual values. Then the algorithm builds new trees based on the new residuals until the residuals get very small or it reaches the maximum number of trees. 30 Annex 4. Benchmark welfare models using different training samples All districts-50% poorest UBR districts-50% poorest All districts-All HH UBR districts-All HH HH HH R-squared 65.48 55.41 53.71 25.66 Household size, Household size, Household size, Household size, households assets, child households assets, child households assets, child Top 10 variables households assets, child dependency, overcrowding, dependency, overcrowding, dependency, overcrowding, dependency, overcrowding urban/rural urban/rural HH head literacy Annex 5. Importance of variables in benchmark models 31 32 Annex 6. Per capita consumption maps using different prediction methods. 33