Policy Research Working Paper 10175

Small Area Estimation of Monetary Poverty in Mexico Using Satellite Imagery and Machine Learning

David Newhouse
Joshua D. Merfeld
Anusha Pudugramam Ramakrishnan
Tom Swartz
Partha Lahiri

Development Economics, Development Data Group
September 2022

Abstract

Estimates of poverty are an important input into policy formulation in developing countries. The accurate measurement of poverty rates is therefore a first-order problem for development policy. This paper shows that combining satellite imagery with household surveys can improve the precision and accuracy of estimated poverty rates in Mexican municipalities, a level at which the survey is not considered representative. It also shows that a household-level model outperforms other common small area estimation methods. However, poverty estimates in 2015 derived from geospatial data remain less accurate than 2010 estimates derived from household census data. These results indicate that the incorporation of household survey data and widely available satellite imagery can improve on existing poverty estimates in developing countries when census data are old or when patterns of poverty are changing rapidly, even for small subgroups.

This paper is a product of the Development Data Group, Development Economics. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at dnewhouse@worldbank.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished.
The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Small Area Estimation of Monetary Poverty in Mexico Using Satellite Imagery and Machine Learning*

David Newhouse§, Joshua D. Merfeld†, Anusha Pudugramam Ramakrishnan‡, Tom Swartz¶, Partha Lahiri+

Keywords: poverty, small area estimation, poverty mapping, satellite data, machine learning
JEL codes: C53, C83, I32

* We are indebted to Boris Babenko and Jonathon Hersh, who made substantial contributions to previous versions. We thank Shiva Makki and the World Bank’s Research Support Board for financial support and James Crawford and Orbital Insight, Inc. for support in providing AGEB-level estimates of land classification and poverty. We thank Keith Garrett and seminar participants at the Northeast University Development Conference, the World Bank, the KDI School, and Sungkyunkwan University for helpful comments. All remaining errors are ours.
§ Development Economics Data Group, World Bank, and IZA
† KDI School of Public Policy and Management, IZA
‡ Consultant, World Bank
¶ Change Research
+ Joint Program in Survey Methodology and Department of Mathematics, University of Maryland, College Park

1. Introduction

In 2015, all United Nations members adopted the 2030 Agenda for Sustainable Development, which ushered in a new set of sustainable development benchmarks – known as the Sustainable Development Goals (SDGs) – across the world.1 These SDGs supplanted the previous Millennium Development Goals (MDGs),2 adding many new goals, including several related to the environment.
However, one thing did not change with the adoption of the SDGs: Goal 1.1 was still the eradication of extreme monetary poverty. Key to assessing progress towards the new poverty goals, however, is the accurate measurement of poverty (e.g., Ravallion, 2015). To this end, many household surveys in developing countries now include consumption and/or expenditure modules, designed to measure a specific type of monetary poverty. For example, Malawi’s Fifth Integrated Household Survey (IHS5) was a major input into the 2020 Malawi Poverty Report (NSO Malawi, 2021),3 in large part due to its consumption module. While household surveys like Malawi’s IHS remain key to measuring poverty, difficulties remain. First, large household surveys are expensive and, as such, are generally fielded only every few years, at most. Second, surveys are designed to be representative only at certain levels of aggregation. For example, in Mexico, the Encuesta Nacional de Ingresos y Gastos de los Hogares (ENIGH) is designed to give representative estimates for rural and urban areas across states, but not for most municipalities.4 The inability to consistently estimate more disaggregated statistics of poverty can make targeting difficult, especially when poverty is found in pockets below the level of representativeness. Moreover, many SDGs disaggregate progress by groups (e.g., women and children), which can lead to additional difficulties with reliably estimating these statistics, since many of these subgroups may have relatively few observations. This is also true for non-poverty benchmarks, like access to health facilities. Improving the accuracy and precision of survey estimates in these contexts is therefore a first-order question for many development economists and practitioners.
In this paper, we employ a common statistical method – small area estimation (SAE) – and show that it can provide more disaggregated poverty estimates even when using only remotely sensed satellite data. This is an important finding, since traditional SAE methods require census data that are often unavailable in many developing countries or are available only sporadically across time. “Big data” from satellites, on the other hand, are widely available for most countries, often in near real time. Recent advances in the availability of satellite and mobile phone data, as well as increases in computing power, have sparked great interest in combining these data with survey data to generate more precise estimates of socio-economic characteristics (World Bank, 2021; Burke, 2021). Traditionally, small area estimation has been carried out by using survey data to estimate a model and simulating that model in household-level census data (e.g., Elbers, Lanjouw, and Lanjouw, 2003). In the last five years, however, a growing body of innovative research has used geospatial or other big data to generate purely synthetic predictions of poverty or welfare, often produced by convolutional neural networks (Jean et al., 2017; Yeh et al., 2020; Engstrom et al., 2022; McBride et al., 2021). These estimates are typically validated against survey data at the village level, using sample data withheld from the prediction model. The results demonstrate the ability of methods utilizing novel sources of data to extrapolate poverty estimates to countries with no sample data with a reasonable degree of accuracy.

1 https://sdgs.un.org/goals
2 https://www.un.org/millenniumgoals/
3 https://microdata.worldbank.org/index.php/catalog/3818/related-materials
4 https://www.inegi.org.mx/programas/enigh/nc/2018/
The existing literature is, however, less informative about the performance of methods that seek to combine survey and geospatial data to generate more granular estimates within a country, for two reasons. First, much of the existing literature stops short of combining synthetic predictions with prior estimates based on survey data, which is important for generating more accurate and precise estimates for well-defined administrative levels (Masaki et al., 2022). Second, it is challenging when utilizing neural networks or other machine learning models to properly estimate uncertainty, which is crucial when evaluating the reliability of small area estimates. This paper aims to address both of these issues. A few studies examine the extent to which combining survey data with geospatial data improves small area estimates of proxies related to household welfare, with generally encouraging results (Steele et al., 2017; Masaki et al., 2022). No studies to date, however, have examined the gain in both accuracy and efficiency from combining survey and remote sensing data to predict monetary poverty for small areas. Furthermore, we know of no study that has compared predictions derived from a combination of current survey and geospatial data with older, traditional census-based small area estimates. These are important knowledge gaps given that subnational monetary poverty estimates play a significant role in resource allocation in many countries. In addition, little is known about how the accuracy of predictions compares for target areas – the areas at which we wish to produce estimates – that are covered by the sample and those that lie outside the sample. This distinction is important for two reasons. First, in some contexts, the official household survey used for poverty measurement may not cover all target areas.
If predictions for non-sampled areas are inaccurate, this strengthens the case both for exercising caution when utilizing out-of-sample estimates for policy and for expanding survey data collection to cover all target areas. Second, comparing the accuracy of non-sampled and sampled areas sheds some light on the benefits of using Bayesian or empirical Bayesian methods, which utilize the sample data as a prior to inform the estimates, rather than relying only on purely synthetic predictions. Finally, the optimal statistical method to use when combining survey and satellite data remains unsettled. Bayesian methods effectively combine sample and big data in a well-established framework (Steele et al., 2017; Steele et al., 2021; Pokhriyal and Jacques, 2017). There are significant barriers, however, to their widespread use. These include computational complexity, a lack of familiarity among many practitioners, and required distributional assumptions on priors and hyperpriors that introduce additional complications. An appealing alternative is to use Empirical Best Predictor (EBP) models – a name coined by Jiang and Lahiri (2001) and Jiang et al. (2002) – in part because the EBP is a well-established method in the small area estimation literature.5

5 EBPs are also known as EBLUPs (Empirical Best Linear Unbiased Predictors) for linear mixed models. See Battese, Harter, and Fuller (1988), Jiang and Lahiri (2008), Molina and Rao (2010), and Van der Weide (2014) for more on their use in small area estimation.

This paper makes four main contributions. First, it assesses the gain in accuracy and precision, relative to direct survey estimates, from combining survey data with geospatial indicators to generate small area estimates of monetary poverty. Second, it compares the performance of different methods for generating estimates for both in-sample and out-of-sample municipalities –
the target areas – in Mexico.6 Third, it assesses whether the combination of survey and geospatial data generates estimates that improve on small area estimates generated using a census enumerated five years prior, as well as on a widely available wealth index created by Meta (Chi et al., 2022).7 Finally, it compares estimates from three different types of small area estimation methods: a household unit-context model that models household per capita income as a function of sub-area and area-level predictors, a sub-area model that utilizes a nested error model at the level of the sub-area, and a Fay-Herriot area-level model (Fay and Herriot, 1979). We consider these questions in the context of Mexico. Mexico is a useful case study for this analysis because of the richness of publicly available official data. In the Mexican context, sub-areas are AGEBs (Áreas Geoestadísticas Básicas) and areas are municipalities. The official MCS-ENIGH household survey provides representative statistics at the state level, motivating the use of small area methods to generate municipal estimates. INEGI, the Mexican statistical agency, fielded a large intercensus sample of 5.8 million households in 2015. CONEVAL, the agency that produces poverty statistics in Mexico, used these data to generate official municipal-level estimates of monetary poverty, which can be used as a benchmark for evaluation. CONEVAL employed an EBP using a model that included an extensive set of household and municipal characteristics. These types of small area estimates, generated using household-level auxiliary data, are rarely available between census rounds in low- and middle-income countries. They therefore provide a natural benchmark for evaluating whether combining survey and geospatial data produces municipal poverty estimates comparable to those produced by a best-practice poverty mapping exercise based on household data from a large intercensus.
We compare these 2015 benchmark estimates with five sets of municipal poverty estimates. The first is the set of direct sample estimates taken from the 2014 survey. We generate the second, third, and fourth sets by combining the 2014 sample with geospatial indicators and estimating a household-level “unit-context” model of welfare, a sub-area model of poverty rates, and an area-level model of poverty rates. The fifth and final set is the official poverty estimates for 2010, which were derived by CONEVAL from the 2010 survey and census.

6 In other words, municipalities are the level at which we wish to estimate poverty. This is below the level of representativeness of the survey.
7 https://dataforgood.facebook.com/dfg/tools/relative-wealth-index

Six main results emerge:

1. Combining satellite indicators with household survey data substantially improves the accuracy of municipal poverty estimates relative to estimates using the survey data alone. In the preferred specification, the correlation with the benchmark official estimates rises from 0.80 to 0.86 and the root mean squared deviation falls by 25 percent.

2. Incorporating geospatial predictors greatly improves the precision of the estimates compared with the direct survey estimates. The median coefficient of variation declines from 38.5 to 19.8 in the household-level model, rising to 25 if the mean squared error estimates are adjusted to maintain equal coverage rates across methods. Even after this adjustment, the additional precision achieved by the small area estimation procedure is equivalent to increasing the size of the sample by a factor of roughly 2.4.

3. The household-level model moderately underestimates uncertainty, yielding a coverage rate of 77 percent for in-sample municipalities and 83 percent for out-of-sample municipalities.
For in-sample municipalities, this is moderately lower than the 86 percent obtained using the direct survey estimates with the Horvitz-Thompson variance estimator, but much larger than the 39 percent coverage rate obtained with the standard variance estimator clustered on enumeration areas, which greatly underestimates uncertainty.

4. For all satellite-based estimates, in-sample predictions are substantially more accurate and much more precise than out-of-sample predictions. The correlation between the satellite-based estimates and the official benchmark estimates declines from 0.86 for in-sample municipalities to 0.70 for out-of-sample municipalities, and the median coefficient of variation rises from 19.8 to 33.9.

5. Estimates that combine survey and geospatial data using a household model are more accurate and much more precise than those produced by a sub-area model for in-sample municipalities (correlation of 0.86 vs. 0.83, median CV of 19.8 vs. 35.6). The sub-area model estimates, in turn, are more accurate but somewhat less precise than the area-level model estimates for in-sample areas (correlation of 0.83 vs. 0.80, median CV of 35.6 vs. 28.1). For out-of-sample municipalities, the household and sub-area models are equally accurate (correlation of 0.70), while the area-level model is moderately less accurate (correlation of 0.66).

6. All geospatial estimates are substantially less accurate than the official 2010 municipal poverty estimates when predicting 2015 municipal poverty rates, particularly for out-of-sample municipalities. The correlation between the 2010 estimates and the 2015 benchmark is 0.91 in sample and 0.90 out of sample, as compared with 0.86 and 0.70, respectively, for the preferred household model geospatial estimates. This suggests that poverty in Mexico was relatively stable, at least between 2010 and 2015.
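The factor of roughly 2.4 in result 2 follows from the inverse-square relationship between the coefficient of variation and sample size: sampling variance scales as 1/n when the estimate itself is held fixed, so the CV scales as 1/sqrt(n). A quick sanity check using the figures above:

```python
cv_direct = 38.5  # median CV of the direct survey estimates
cv_sae = 25.0     # median CV of the household model, after the MSE adjustment
# Matching the SAE precision with a larger direct sample requires
# multiplying n by (cv_direct / cv_sae) ** 2.
factor = (cv_direct / cv_sae) ** 2
print(round(factor, 1))  # roughly 2.4
```

This back-of-the-envelope calculation reproduces the reported equivalence under the simplifying assumption that the design effect is unchanged as the sample grows.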
Overall, the results confirm that combining survey data and geospatial data can greatly improve the accuracy and precision of monetary poverty estimates. However, accuracy and precision both suffer markedly when making synthetic predictions for out-of-sample municipalities. This underscores the importance of using a statistical method that conditions predictions on the sample data and of including as many target areas as possible in the sample. In this context, household-level models generate more accurate and precise predictions than the basic sub-area and area-level models considered here.8

8 More complex and sophisticated versions of the area and sub-area models, such as those that transform the dependent variable, may improve on the accuracy of the simple versions considered here.

The household-model results, however, are substantially less accurate for predicting 2015 municipal poverty than the official 2010 census-based estimates. This correlation may be overstated because data from the 2010 census are used to generate both the 2010 and 2015 estimates. Nonetheless, this finding underscores the value of regular census data collection in this context, as well as the possible importance of designing surveys to include all target areas if successful small area estimation is one goal of the survey. While predictions that combine survey and satellite data are not always more accurate than older census-based poverty maps for predicting current poverty rates, they greatly improve upon estimates based solely on survey data and are a viable second-best solution in settings where recent census-based estimates do not exist, a common situation in much of the developing world. The relative accuracy of estimates produced from survey and geospatial data, compared with older census-based estimates, depends on several context-specific factors. These include the pace of change in regional patterns of poverty since the previous census, the size of the sample, and the
predictive power of the auxiliary geospatial indicators. Additional research, drawing on data from different contexts, will help inform the choice between using more recent estimates obtained by combining survey and geospatial data and older small area estimates derived from a census.

2. Small area estimation

Small area estimation deals with methods to reliably estimate parameters of interest in areas without adequate sample sizes (Molina and Rao, 2015). Here, we discuss the basics of small area estimation, with a particular focus on the Mexican context we study in this paper. Interested readers can find more details in Rao (2003), Molina and Rao (2010, 2015), and Jiang and Rao (2020). Ghosh (2020) details the evolution of small area estimation over the last five decades. In Mexico, it is worth differentiating four separate levels in the data: households, AGEBs, municipalities, and states. Households are nested within AGEBs, AGEBs are nested within municipalities, and municipalities are nested within states. We are interested in estimating poverty rates for municipalities – the target areas – but the only national surveys available are representative for urban and rural areas within states. Therefore, although direct survey estimates are unbiased over repeated samples, a single municipality-level sample is generally too small to provide a reliable estimate of the poverty rate. To estimate these poverty rates, we need to build a model at the level of the municipality or lower. In theory, we could directly estimate poverty rates at the municipality level – that is, taking poverty rates at the municipality level and estimating the model using municipalities as the unit of analysis – which is referred to as an area-level model. We could also estimate poverty in AGEBs and then aggregate AGEB-level poverty rates up to the municipality based on AGEB populations, referred to as a sub-area model.
Finally, we could estimate poverty at the household level and then aggregate household-level poverty rates up to the municipality, using a household-level (or unit-level) model. These methods share some characteristics but also offer different advantages. One similarity across all three methods is that they combine survey data with auxiliary data. This is distinct from much of the recent literature on satellite estimation of poverty (see, for example, Blumenstock et al. (2015)), which tends to estimate poverty using a purely synthetic regression prediction. Small area estimation methods, on the other hand, “borrow strength” across areas through a regression framework that is quite flexible and can incorporate different types of data, including census data, remote sensing data, and administrative data, to name but a few. In traditional unit-level models like Elbers, Lanjouw, and Lanjouw (2003), the key predictors were household-level values derived from a census. However, this requires a (recent) census with variables that are also available in a survey with monetary poverty measures. This is a restrictive requirement, as such censuses are few and far between in many developing countries. In this paper, we instead use AGEB or municipality aggregate variables when we implement the household-level models. Implementing a household or unit-level model with only aggregate variables is sometimes referred to as a “unit-context” model. In all three methods discussed here, poverty or welfare is regressed on a selection of predictive features and random effects, which we discuss in more detail in the methods section and the appendices. Thus, each in-sample target area has two separate estimates: the direct survey estimate and the synthetic prediction created through the regression. The key to small area estimation is the combination of these two separate estimates, based on their relative variances.
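As an illustrative sketch of the household-level (unit-context) structure, in notation chosen here for exposition rather than taken from the paper's formal specification:

```latex
y_{mah} = \mathbf{x}_{ma}'\boldsymbol{\beta} + u_m + e_{mah},
\qquad u_m \sim N(0,\sigma_u^2), \qquad e_{mah} \sim N(0,\sigma_e^2),
```

where \(y_{mah}\) is the (possibly transformed) per capita income of household \(h\) in AGEB \(a\) of municipality \(m\); \(\mathbf{x}_{ma}\) collects the AGEB- and municipal-level geospatial predictors, which are constant across households within an AGEB; \(u_m\) is a municipality random effect; and \(e_{mah}\) is the idiosyncratic household error. The area-level and sub-area variants instead take the direct poverty-rate estimate for the municipality or AGEB as the dependent variable.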
Intuitively, when the sample estimate for an area is precise relative to the synthetic prediction, there is “less strength” to borrow from the prediction, and the direct sample estimate receives more weight. When the sample estimate is less precise relative to the synthetic prediction, on the other hand, the area can “borrow more strength” through the regression (Jiang and Rao, 2020). Note, however, that not all municipalities in Mexico appear in our sample, which means that some municipalities have no direct estimate. For these out-of-sample areas, the sample contains no information and therefore receives zero weight; the estimate is a purely synthetic prediction from the regression. The three methods also differ in sometimes important ways. Sub-area and area-level models can be estimated more quickly and easily than unit-context household models. This comes at a price, however. Area-level models require estimates of the sampling variance for each area, which can be unstable when derived from a survey, since the survey is not designed to provide adequate sample sizes at the area level (Bell, 2008). This instability can be alleviated by variance smoothing, but that adds an additional layer of complexity to the estimation procedure. The sub-area models considered in this paper, meanwhile, require a non-trivial reweighting exercise. This entails both correcting for heteroscedasticity when the number of sample households varies across AGEBs and normalizing the weights appropriately within each municipality. Area-level models also cannot produce estimates at lower levels of geography, such as sub-areas. In addition, area-level and sub-area models of the type implemented here model poverty rates rather than welfare, thereby discarding information on whether non-poor households in a sub-area are near or far from the poverty line.
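The variance-based weighting just described can be sketched as a stylized composite estimator in the spirit of Fay-Herriot shrinkage. The function, its arguments, and the numbers below are illustrative, not the paper's implementation:

```python
def composite_estimate(direct, var_direct, synthetic, sigma2_u):
    """Blend a direct survey estimate with a synthetic regression
    prediction, weighting by relative precision. sigma2_u is the
    variance of the area random effect; var_direct is the sampling
    variance of the direct estimate."""
    if direct is None:
        # Out-of-sample area: the survey carries no information, so
        # the result is the purely synthetic regression prediction.
        return synthetic
    # gamma approaches 1 when the direct estimate is precise relative
    # to the model, so precise areas borrow little strength.
    gamma = sigma2_u / (sigma2_u + var_direct)
    return gamma * direct + (1.0 - gamma) * synthetic

print(composite_estimate(0.50, 0.0004, 0.30, 0.01))  # ~0.49: direct dominates
print(composite_estimate(0.50, 0.0900, 0.30, 0.01))  # ~0.32: borrows strength
print(composite_estimate(None, None, 0.30, 0.01))    # 0.3: purely synthetic
```

The same zero-weight logic in the `direct is None` branch is what makes out-of-sample estimates purely synthetic, which is why their accuracy and precision suffer in the results reported above.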
Household-level models predict welfare instead of poverty rates, conditional on the sample, and therefore better preserve this information. In addition, currently available software for estimating area-level models cannot easily accommodate different weights for different areas. This can become a constraint when the analyst wishes to give more weight to more populous areas when estimating the model, which may be preferred in some policy settings. Finally, both the standard sub-area and area-level models considered here impose the assumption that poverty rates are a linear function of the predictors, which can also reduce prediction accuracy. Corral et al. (2021) note that the nested error regression model we use in the household-level model is subject to “omitted variable” model bias when household-level auxiliary variables are replaced by their respective area-level population averages and the true data generating process is based on a household-level model. This bias occurs when the population averages used as predictor variables on the right-hand side are correlated with the discrepancy between the area-level means of the population and those of the particular sample used. As a result, they recommend using area-level models when feasible rather than household-level models. However, as detailed in Appendix F, the exact same omitted variable bias also appears in area-level models, under the same assumptions that the data generating process occurs at the household level and that the predictors are drawn from the population.9 Thus, because the same omitted variable bias appears in both household and area-level models, this bias is not a sufficient justification for preferring area-level models over unit-context models. There are two other reasons why the omitted variable bias identified in Corral et al. (2021) may not be important in practice.
First, if sampling error is exogenous, as one would expect with a random sample, both the household and area-level models are consistent, implying that this bias will tend towards zero as the number of AGEBs included in each area becomes large. Furthermore, in both household and area models, when one defines bias with respect to both the sample design (in our case, simple random sampling within AGEBs) and the model parameters, the design-model bias vanishes. This occurs because the sample mean of the predictor variables is an unbiased estimator of their population means under a simple random sample design, and follows from the concept of design-model bias discussed in Valliant et al. (2000, p. 120). In many applications, it is difficult or even impossible to apply a nested error regression model with household-level auxiliary variables, either because household-level auxiliary variables are not available for the entire finite population or because of measurement error in the household-level auxiliary variables. In these situations, the unit-context model is a reasonable option. Like a nested error regression model with household-level auxiliary variables, a unit-context model is very flexible, producing estimates of a wide range of linear and nonlinear parameters (e.g., different FGT indices) using a single model. This is in sharp contrast with an area or sub-area level model, which requires a separate model for each estimated parameter. The unit-context model also permits incorporation of the uncertainty incurred in estimating different model parameters, such as the variance of the sampling error.

9 Auxiliary data in area-level models are typically drawn from administrative data or other sources that represent the population rather than the sample.

3. Data

We evaluate the ability of data derived entirely from satellite imagery to predict municipality-level poverty in Mexico.
In this section, we discuss the different data sources used to construct the requisite variables, as well as the data used to validate the performance of the models.

3.1. Household data

We conduct the main analysis using monetary poverty estimates at the household level from the 2014 MCS-ENIGH household survey. The survey is conducted every two years by the Mexican National Institute of Statistics and Geography (INEGI) to capture the income, spending, socio-demographic, and employment situation of the country. It is a nationally representative, cross-sectional household survey. The 2014 MCS-ENIGH survey covers 58,125 households, of which approximately 75% are urban and 25% are rural.10

10 Rural areas are defined as localities with fewer than 2,500 inhabitants.

The survey samples 896 municipalities (out of 2,457) and 6,849 AGEBs.11 Table A1 presents additional details on the geographic structure of Mexico. Importantly, the publicly available data contain an AGEB-level identifier, which can be linked with the official AGEB shapefile. For rural areas, the data and associated shapefile identify localities, which are points rather than polygons; these were grouped into rural AGEBs using the same AGEB shapefile. The survey collects income per capita, which is the welfare metric used to calculate the official poverty rate. With the AGEB-level identifiers and merged shapefile, we can link the survey data with predicted poverty rates and land classifications derived from satellite imagery. To predict poverty for the entire country, we need data on the location of all households. For this, we create a “synthetic census” based on the 2010 census. For each AGEB, we estimate the number of households by dividing the total AGEB population in the 2010 census by the average household size, at the state-urban/rural level, in the 2014 household survey.
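This household-count construction can be sketched as follows. The field names and inputs are hypothetical, chosen only to illustrate the logic; the production pipeline is not shown in this paper:

```python
def synthetic_households(agebs, avg_hh_size):
    """Expand AGEB-level census totals into one record per synthetic
    household, each carrying its AGEB's satellite indicators.

    agebs: iterable of dicts with hypothetical fields "id", "pop_2010",
      "cell" (state by urban/rural), and "indicators".
    avg_hh_size: mean household size per cell, from the 2014 survey.
    """
    records = []
    for a in agebs:
        # Household count: 2010 population divided by mean household
        # size, rounded to the nearest integer.
        n = round(a["pop_2010"] / avg_hh_size[a["cell"]])
        # Each synthetic household inherits the AGEB-level indicators,
        # which serve as auxiliary data for the household model.
        records.extend({"ageb": a["id"], **a["indicators"]} for _ in range(n))
    return records
```

Applied to all AGEBs, a procedure of this form yields the roughly 28 million synthetic records described below.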
We round to the nearest integer and use this to construct a synthetic census that links the “households” in an AGEB with the AGEB-level satellite indicators, which serve as the auxiliary data for the household model. The resulting synthetic census consists of just under 28 million records. In addition to the satellite imagery – which serves as the auxiliary data for the prediction exercises – we also include average household size at the state-urban/rural level, estimated with the 2014 MCS-ENIGH survey, which we use as population weights when aggregating across households to generate municipal poverty estimates.

3.2. Benchmark CONEVAL estimates

We compare the estimated poverty rates with a benchmark to evaluate the ability of the satellite imagery to improve poverty predictions. The benchmark used for evaluation is the set of municipal income poverty estimates generated by CONEVAL using a combination of the 2014 MCS-ENIGH household survey and the 2015 Intercensal survey. The Intercensal survey is conducted every five years, between two censuses, to update socio-demographic information at the national and sub-national levels. The 2015 Intercensal survey sampled 5.8 million households, 56 percent of whom lived in urban areas. It contains only household labor income and transfer income, and not total household income, necessitating the use of a small area estimation model to estimate municipal-level poverty rates. CONEVAL creates these estimates using a unit-level model with a large set of auxiliary variables, including demographic, labor, housing quality, and municipal characteristics at the individual, household, and municipal levels.

11 AGEBs (Áreas Geoestadísticas Básicas) are equivalent to sub-areas in this context, while municipalities are target areas, or the areas at which we are interested in predicting poverty. Each municipality is made up of AGEBs, with larger municipalities generally having more AGEBs.
The municipal characteristics include average per capita income and other poverty estimates. The municipal estimates and detailed documentation on the methodology are available on the CONEVAL website.12 The 32 Mexican states are divided into six groups13 and a model is estimated for each, with R2 values ranging from 0.52 to 0.57. CONEVAL used the 2014 MCS-ENIGH survey in an empirical best household model to construct the small area municipal estimates. This implies that both the estimates evaluated below, which are based on geospatial data, and the CONEVAL benchmark use the same 2014 survey. However, the R2 of the unit-level model used to generate the CONEVAL estimates is roughly three times as high as that of the model used to generate the geospatial-based estimates, since the auxiliary data used by CONEVAL vary at the household level. Therefore, the CONEVAL estimates give far more weight to the predicted values and less weight to the survey, relative to the estimates derived from the satellite data. This reduces the mechanical correlation between the estimates and the benchmark that arises from conditioning the municipal random effects on the same 2014 survey data. Furthermore, as a robustness check, we also compare the CONEVAL estimates with geospatial small-area estimates generated using the 2016 MCS-ENIGH survey, which eliminates all mechanical correlation due to the use of the same sample data.

12 https://www.coneval.org.mx/Medicion/Paginas/Pobreza-municipal.aspx
13 The six state groups we use in this paper match those used to develop the official municipal poverty estimates and are listed in Table C1 in appendix C.

3.3. CNN-Generated Auxiliary Geospatial Data

A key difference between this paper and Masaki et al. (2022) is the nature of the auxiliary data. We train convolutional neural network (CNN) models against high-resolution satellite imagery
from Planet (which features spatial pixel resolution of 3 to 5 meters) and use the resulting predictions as inputs into the small area estimation model. We make use of two different CNNs: one to estimate the poverty rate (defined as extreme or moderate headcount poverty according to the CONEVAL definition) and one to recognize the land classification type of an image. We train the former using the 2014 MCS-ENIGH. Land classification subdivides images into coverage by building, road, water, grassland, forest, and background (all other classes). Because CNNs require square tiles of imagery for prediction, we train the CNNs to estimate poverty or land type at the tile level; tiles cover roughly 750 square meters each. The CNN predicts poverty or land type for each tile, and we use an area-weighted sum to aggregate the predictions to the AGEB (or sub-area) level. We then aggregate these AGEB-level predictions to municipalities, weighting by the 2010 census population.14 The auxiliary information that augments the small area estimation models therefore consists of AGEB-level and municipal-level CNN predictions of poverty and land type. Summary statistics for the CNN-generated auxiliary spatial data are shown in Table 2, and more technical details of the CNNs are presented in Appendix A.

Intuitively, the poverty CNN uses all available visual characteristics in an image to predict extreme or moderate poverty as defined by the 2014 MCS-ENIGH. If these visual characteristics reliably predict poverty, then CNN model predictions can be extrapolated to areas not covered by the MCS-ENIGH and used to generate poverty predictions for those areas. Adding this additional information should allow a small area estimation model to have lower variance and higher predictive performance. While Babenko et al. (2017) demonstrate the performance of the CNN poverty predictions against a benchmark, incorporating this information into a small area estimation model improves the predictions in three ways.
First, it enables the estimates to also benefit from auxiliary data on land classification. Second, it conditions the predictions on sample information through the empirical best predictor model, making the predictions more accurate. Finally, it enables the appropriate estimation of uncertainty, in the form of mean squared error estimates, from which one can construct reasonably accurate confidence intervals.

14 An alternative when there is no census available is to use estimated population from WorldPop (https://www.worldpop.org/), which seems to perform well in other applications (e.g., Merfeld et al., 2022).

4. Methods

The key survey data we use is the 2014 MCS-ENIGH. Although we are interested in estimating poverty for all municipalities in Mexico, the survey is designed to estimate poverty not at the municipality level but for urban and rural areas within states. Therefore, the direct estimates from the survey will not consistently estimate poverty accurately for municipalities. The rest of this section describes how we construct the direct estimates and the household-model estimates, as well as two robustness checks using sub-area and area models.

4.1 Direct estimates

The direct survey estimate for each municipality is the weighted mean, across sample households, of a dummy variable indicating whether the household's per capita income falls below the official poverty line, which is defined separately for urban and rural areas. The mean is computed for each municipality using the product of the household sample weight and household size as the weight. Estimating uncertainty properly is more complicated, because standard methods assume that per capita income is independent across primary sampling units, an assumption that can lead to significant underestimation of the uncertainty associated with municipal poverty estimates.
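The direct point estimate just described can be sketched as follows, using hypothetical household records (the actual computation runs on the 2014 MCS-ENIGH microdata):

```python
def direct_poverty_estimate(households):
    """Direct municipal headcount rate: weighted mean of the poverty
    dummy, with weight = household sample weight x household size."""
    num = sum(w * size * poor for w, size, poor in households)
    den = sum(w * size for w, size, _ in households)
    return num / den

# Hypothetical records: (sample weight, household size, poverty dummy)
hh = [(100.0, 4, 1), (150.0, 3, 0), (50.0, 5, 1)]
rate = direct_poverty_estimate(hh)
```

Weighting by household size converts the household-level dummy into a person-level headcount rate.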
We therefore use two methods for estimating uncertainty: the Horvitz-Thompson approximation (Horvitz and Thompson, 1952) and the standard method of clustered standard errors, using AGEBs as the clustering unit (Huber, 1967). As shown below, the Horvitz-Thompson approach gives much more accurate estimates of uncertainty and is therefore preferred.

4.2 Household model estimates

The baseline method for the household model is similar to the method used to predict non-monetary poverty rates in Masaki et al. (2022), in that it relies on a conditional random effect model, often referred to as a mixed model. Many previous iterations of household-level models rely on detailed census data (e.g., Elbers, Lanjouw, and Lanjouw (2003) and Tarozzi and Deaton (2009)). However, such data is usually not available in many developing countries, while satellite imagery is. As such, instead of using household-level covariates to predict poverty, we use satellite-derived features defined at higher levels of aggregation, such as the AGEB or the municipality.

There are two notable differences between this paper and Masaki et al. (2022). First, the household-level model implements sample weights, both when estimating the mixed effect model and when calculating the empirical best predictors. Introducing sample weights decreases the precision of the sample data relative to the prediction and therefore gives the sample less weight when conditioning the mixed effect on the sample data, as described in Appendix C.15 Second, because urban and rural poverty lines differ in Mexico, the baseline specification in this paper inflates the sample welfare measure in rural areas by the ratio of the urban to rural poverty lines. This enables us to apply a constant poverty line threshold in the simulation stage. The estimated model is:

f(y_iams) = X_a'β1 + X_m'β2 + D_s'β3 + u_m + e_iams,    (1)

where f(y_iams) is transformed per capita income for household i in AGEB a, which is located in municipality m and state s.
X_a is a vector of AGEB aggregates, X_m is a vector of municipal aggregates, D_s is a vector of state dummies, u_m is a municipality-level random effect, and e_iams is a classical error term. The estimation process for the household model involves the following eight steps, with additional details in Appendix C:

1. Classify households in the MCS-ENIGH as poor if their per capita income, ictpc, falls below the monetary poverty line of 1,242.6 pesos per month for urban households or 868.25 pesos per month for rural households.

2. Multiply the ictpc of rural households by the ratio of the urban to rural poverty lines, approximately 1.43. This step allows us to compare the vector of deflated per capita incomes against a single line, the urban poverty line.16

15 As in Masaki et al. (2020), we normalize weights to sum to the number of sample observations in each municipality. Unlike in standard regressions, the results of mixed models depend on the scale of the weights, and normalizing to the sum of sample observations is a common approach.
16 Alternatives to this procedure, including estimating separate urban and rural models, are considered in section 6 below.

3. Normalize the resulting vector of welfare using Ordered Quantile Normalization, as described in Peterson and Cavanaugh (2020) and implemented in the R bestNormalize package.

4. Set the poverty line as the minimum value of normalized welfare for non-poor households, as defined in step 1. Since we are only interested in estimating headcount poverty, we do not need to back-transform to recover the original welfare measure. Instead, we can simply compare a household's predicted location in the normalized distribution to the location of the poverty line.

5. Select a model from the pool of candidate variables using the "plug-in LASSO" estimator implemented in Stata 16, allowing for heteroscedasticity, with the dependent variable equal to normalized welfare (Belloni and Chernozhukov, 2013).
We use the "plug-in" lasso to guard against overfitting, since it selects more parsimonious models than those selected using cross-validation (StataCorp, 2021). The weights in the LASSO regression are the sample population weights (household weight times household size), normalized to sum to the number of observations in each municipality. The candidate set of variables includes the rates of moderate and extreme poverty predicted by the CNN at both the AGEB and municipality level, predicted land classifications at both the AGEB and municipal level, a dummy for whether the household's AGEB is urban or rural, the share of the population in the municipality that is rural, and state dummies.

6. Using the model selected in step 5 and the poverty line calculated in step 4, estimate headcount poverty rates for each municipality using a modified version of the EMDI software package in R. The modification allows for the use of sample and population weights, as described in Appendix C. The "population" file is based on the synthetic census, constructed as described in the data section.

7. Benchmark the estimates by multiplying the estimated headcount rates and the estimated root mean squared error by a scaling factor calculated for each state. The scaling factor is calculated as a ratio: the numerator is the official state headcount estimate taken from the survey data, while the denominator is the average, within each state, of the small area estimates for municipalities contained in the MCS-ENIGH survey, weighted by the sum of the original sample weights for each municipality. These scaling factors are based only on small area estimates from municipalities included in the survey sample, to preserve legitimate differences in the estimated state poverty rates due to including auxiliary data from non-sampled municipalities.

8. Compare the benchmarked estimator to the official estimates produced by CONEVAL, as described below.
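Steps 2 through 4 can be illustrated with a rank-based inverse normal transform, a simple stand-in for the Ordered Quantile Normalization of the bestNormalize package. This is an illustrative Python fragment with hypothetical incomes; the actual analysis uses R, the full survey, and the package's tie-handling.

```python
from statistics import NormalDist

def rank_inverse_normal(y):
    """Rank-based inverse normal transform: a simple stand-in for the
    Ordered Quantile Normalization used in step 3 (assumes no ties)."""
    n = len(y)
    order = sorted(range(n), key=lambda i: y[i])
    ranks = [0] * n
    for r, i in enumerate(order, start=1):   # 1-based ranks
        ranks[i] = r
    nd = NormalDist()
    # rank/(n+1) keeps the plotting positions strictly inside (0, 1)
    return [nd.inv_cdf(ranks[i] / (n + 1)) for i in range(n)]

# Hypothetical per capita incomes (pesos per month) and rural flags
URBAN_LINE, RURAL_LINE = 1242.6, 868.25
incomes = [500.0, 900.0, 1100.0, 1300.0, 2500.0, 4000.0]
is_rural = [True, True, False, False, False, False]

# Step 2: inflate rural incomes by the urban/rural poverty-line ratio
deflated = [y * URBAN_LINE / RURAL_LINE if r else y
            for y, r in zip(incomes, is_rural)]
poor = [y < URBAN_LINE for y in deflated]  # single-line comparison
z = rank_inverse_normal(deflated)          # step 3
# Step 4: normalized poverty line = minimum normalized welfare
# among non-poor households
z_line = min(zi for zi, p in zip(z, poor) if not p)
```

Because the transform is monotone, comparing a simulated household's position in the normalized distribution against z_line is equivalent to comparing its income against the poverty line, so no back-transformation is needed.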
Table 1 displays a set of basic diagnostics for the household-level model. Of the 50 candidate variables, LASSO selects 23 predictors. Eight of these are geospatial predictors of some kind, while the remaining 15 are state dummies. The first column of Table A3 presents the coefficients from a post-LASSO OLS regression using the sample data at the household level. The predictive power of these variables is weak compared with other studies, as the marginal R2 is only about 0.13. This is partly because the welfare variable was spatially deflated in rural areas, meaning there is far less variation between urban and rural areas for the geospatial data to explain. In addition, the welfare measure is income per capita, making the results not comparable to studies that have used per capita consumption as the welfare measure. The estimated area effect has skewness close to zero but kurtosis close to 3.5, meaning that the tails are slightly heavier than normal. The area effect is estimated very precisely and accounts for only 1.4 percent of the total variance. The complement of the estimated shrinkage factor is about 0.7, suggesting substantial efficiency gains in the estimation of the mixed effect from incorporating geospatial indicators.

Figure 1 shows the quantile-quantile plots for the estimated residuals of the household and area components of the error term. The estimated distribution of the municipality random effects is quite close to normal. However, there is substantial deviation from normality among the household residuals, at both the bottom and the top. This could potentially bias the estimated poverty rates, since the simulation procedure will incorrectly assume that the household residuals follow a normal-tailed distribution. To investigate that possibility further, we compare the predictions against the direct estimates and the CONEVAL benchmark below.
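The skewness and kurtosis diagnostics reported for the estimated area effect follow the standard moment definitions, sketched below; a normal distribution has skewness 0 and kurtosis 3, so a kurtosis of 3.5 indicates slightly heavier tails.

```python
def skewness_kurtosis(x):
    """Moment-based skewness and (non-excess) kurtosis of a sample;
    kurtosis above 3 indicates heavier-than-normal tails."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2
```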
4.3 Criteria for evaluation

We consider six criteria for evaluating the estimators. The first is the simple unweighted mean of the poverty rate estimates across municipalities, prior to benchmarking. This gives an indication of overall bias in the pre-benchmarking estimators. The second and third criteria are measures of uncertainty: the (unweighted) average estimated mean squared error (MSE) and the median coefficient of variation (CV) across municipalities, which is essentially the median relative standard error. In the household model, the estimated MSE is determined by a parametric bootstrap procedure as described in Kreutzmann et al. (2018) and González-Manteiga et al. (2008).17 Meanwhile, the coefficient of variation for each municipality is defined as the square root of the mean squared error, divided by the estimated poverty rate.

The fourth and fifth criteria are measures of accuracy. The first of these is the correlation with the official CONEVAL poverty estimates. The second is the root mean squared deviation, defined as:

RMSD = sqrt( (1/M) Σ_m (Θ̂_m − Θ*_m)² ),

where M is the number of municipalities, Θ̂_m is the estimated poverty rate for municipality m, and Θ*_m is the official estimate calculated by CONEVAL using the intercensus. Finally, we examine the coverage rate, defined as the share of target areas for which the official CONEVAL estimate lies within the estimated confidence interval. The upper and lower bounds of the confidence interval are obtained by multiplying the square root of the estimated mean squared error by 1.96 and adding it to and subtracting it from the point estimate.

4.4 Alternative models

As mentioned above, the relative merit of household-level versus area-level models is a controversial topic in the literature.
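The accuracy and coverage criteria of section 4.3 can be sketched as follows, with hypothetical estimates, benchmarks, and MSEs:

```python
import math

def evaluation_metrics(est, bench, mse):
    """RMSD, Pearson correlation, and 95% coverage of small area
    estimates against a benchmark."""
    M = len(est)
    rmsd = math.sqrt(sum((e - b) ** 2 for e, b in zip(est, bench)) / M)
    me, mb = sum(est) / M, sum(bench) / M
    sxy = sum((e - me) * (b - mb) for e, b in zip(est, bench))
    sxx = sum((e - me) ** 2 for e in est)
    syy = sum((b - mb) ** 2 for b in bench)
    corr = sxy / math.sqrt(sxx * syy)
    # Coverage: share of areas whose benchmark lies within the interval
    # estimate +/- 1.96 * sqrt(estimated MSE)
    coverage = sum(abs(e - b) <= 1.96 * math.sqrt(m)
                   for e, b, m in zip(est, bench, mse)) / M
    return rmsd, corr, coverage

# Hypothetical municipal estimates, CONEVAL benchmarks, and MSE estimates
rmsd, corr, coverage = evaluation_metrics(
    [0.2, 0.4, 0.6], [0.25, 0.35, 0.6], [0.001, 0.0001, 0.01])
```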
Corral et al. (2021) conclude that area-level models should be preferred to household-level models due to the potential for omitted variable bias, even though the same omitted variable bias also appears in area-level models when assuming the same underlying data generating process. Furthermore, that paper presents no empirical evidence comparing the relative performance of household and area-level models. To the best of our knowledge, the only source of empirical evidence on this issue is Masaki et al. (2022), which found that a unit-context model produced moderately more accurate estimates of non-monetary poverty than an area-level model in Sri Lanka and slightly less accurate estimates in Tanzania. Generating additional empirical evidence on the relative performance of different types of statistical models is important because their performance depends on the true data generating process, which in practical applications is unknown. Therefore, we compare, in the Mexican context, the household model estimates with two alternative small area estimation models: a sub-area model and an area-level model.

17 Two different parametric bootstrap methods to estimate the uncertainty of EBP were proposed earlier by Arora et al. (1997) and Butar and Lahiri (2003).

The sub-area model uses the same type of empirical best prediction model as the household model. However, the unit of analysis is the AGEB and the dependent variable is the estimated poverty rate for each AGEB from the 2014 MCS-ENIGH survey data. These poverty rates are then modeled as a linear function of covariates, with random effects specified at the area level. The inclusion of the area-level random effect allows the model to "borrow strength" from sample data in sampled municipalities, which contain relevant information for non-sampled AGEBs. Covariates are again selected using the plug-in (rigorous) LASSO procedure.
The selected AGEB-level covariates include the predicted shares of extreme poverty and non-poverty from the CNN and the share of pixels classified as buildings at the AGEB level. At the municipal level, the selected covariates include predicted extreme poverty; the shares of pixels classified as roads, forest, and grassland; the rural population share; and ten state dummies. These variables explain 28 percent of the variation in AGEB-level poverty rates. The weights adjust both for sample weights and for heteroscedasticity in the dependent variable, due to the differing number of sample observations used to estimate AGEB-level poverty. As in the household model described above, the results are benchmarked to the state-level poverty estimates. The second column of Table A3 presents the coefficients from a post-LASSO OLS regression using the sample data, collapsed to the AGEB (sub-area) level.

In addition, for comparison, we also report results for a simple area-level model, as initially described in Fay and Herriot (1979). Candidate predictor variables are aggregated to the municipality level, using 2010 population values from the census as weights. The prediction model is again selected using the "plug-in" rigorous LASSO, allowing for heteroscedasticity. The selected model contains seven predictors: the predicted municipal extreme and moderate poverty rates, the percentages of pixels classified as roads and background, and three state dummies. These variables explain 36 percent of the variation in estimated municipal poverty rates. The variance component of the model is estimated using the adjusted maximum likelihood approach for the linking model proposed by Li and Lahiri (2010), which guarantees positive variance estimates of the modeled area effect for each small area. The model was estimated using the fayherriot package in Stata (Halbmeier et al., 2019). The third column of Table A3 presents the coefficients from a post-LASSO OLS regression using the sample data, collapsed to the municipal (area) level.
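The shrinkage at the heart of the area-level model can be illustrated with the textbook Fay-Herriot best predictor. This stylized sketch is not the adjusted maximum likelihood implementation of the fayherriot Stata package, and all inputs are hypothetical:

```python
def fay_herriot_blup(direct, synthetic, sigma_u2, psi):
    """Best linear unbiased predictor for one area in a Fay-Herriot model:
    a weighted average of the direct survey estimate and the synthetic
    regression prediction, with more weight on the survey when its
    sampling variance psi is small relative to the area-effect
    variance sigma_u2."""
    gamma = sigma_u2 / (sigma_u2 + psi)  # shrinkage weight on the survey
    return gamma * direct + (1.0 - gamma) * synthetic

# A precisely measured area keeps its direct estimate; a noisy one is
# pulled toward the regression prediction (hypothetical numbers).
precise = fay_herriot_blup(0.4, 0.2, sigma_u2=0.01, psi=0.0)
noisy = fay_herriot_blup(0.4, 0.2, sigma_u2=0.01, psi=0.01)
```

This dependence on the estimated sampling variance psi is one reason imprecise survey-based variance estimates can degrade the area-level model, as discussed in the results below.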
We provide more details on these alternative models in Appendixes C and D.

5. Main Results

5.1 Bias and uncertainty

Table 2 displays evaluation metrics for overall bias and uncertainty for the two direct survey estimates (the Horvitz-Thompson approximation and AGEB-clustered variance estimates), the household model, the sub-area model, the area-level model, and the 2010 official poverty estimates produced by CONEVAL. For reference, the bottom row shows the mean poverty estimates for the official 2015 poverty estimates used as the benchmark. The first column shows the simple average, across municipalities, of the estimated poverty rate prior to benchmarking, while the second and third columns report two measures of the uncertainty of the point estimates: the mean estimated mean squared error and the median coefficient of variation.

Remarkably, for sampled municipalities, the mean estimated poverty rate from the household model, prior to benchmarking, is only 0.1 percentage points lower than the direct estimate (28.1% vs. 28.2%). This reflects the success of the ordered quantile normalization at virtually eliminating overall bias in the household model, despite the slight deviations of the household error term from normality shown in Figure 1. It also suggests that any bias due to omitting household characteristics from the model does not, at least on average, systematically bias the municipality estimates. The mean poverty rate for the sub-area model matches the direct estimates to one decimal place, while the area-level model is substantially more downward biased.

There are four other striking findings in the table. The first is the difference in estimated uncertainty between the Horvitz-Thompson and standard clustered estimates of variance. This reflects the high degree of correlation in per capita income across AGEBs within municipalities.
The standard clustered estimates do not take this correlation into account and therefore severely underestimate uncertainty; estimates of uncertainty based on clustered standard errors cannot be trusted in this setting.

A second striking finding is the large reduction of uncertainty for in-sample municipalities when using the household-level model. The mean MSE of the household model is substantially lower, at just 55 percent and 35 percent of the MSE of the sub-area and area-level models, respectively. The median coefficient of variation of the household model estimates is 19.8, which is well below that of the sub-area and area-level models (32.1 and 28.1, respectively). As noted below, the mean squared error and coefficient of variation of the household model are moderately underestimated. If the mean squared error is inflated by a constant to achieve an 86 percent coverage rate, however, the median CV of the household model rises to 25, remaining well below those of the sub-area and area-level models.

A third striking finding from Table 2 is the difference in precision between the estimates for sampled and non-sampled municipalities. For example, in the household-level model, the median CV rises from 19.7 in the sampled municipalities to 33.4 in the non-sampled municipalities, though this is still below the 38.5 median CV for (in-sample) direct survey estimates using the Horvitz-Thompson approximation. Since the sampling strategy for the MCS-ENIGH is partly based on population, one possible explanation is that the smallest municipalities are less represented in the survey but are also systematically different from larger municipalities. The fact that the true poverty rate for out-of-sample municipalities is substantially higher than for in-sample municipalities is consistent with this.
A final notable finding is that the gain in precision from utilizing a household model, rather than a sub-area or area-level model, decreases substantially when considering out-of-sample municipalities. The greater precision of the household model for in-sample estimates may reflect its ability to better distinguish between near-poor and wealthy households, because it models welfare levels. This advantage does not apply when predicting out of sample, because those estimates are purely synthetic predictions that are not conditioned on the sample data, which may explain why the household model's advantage in precision is smaller for out-of-sample municipalities.

5.2 Accuracy and coverage rates

Figure 2 presents predicted-true plots of the poverty predictions for in-sample municipalities (left) and out-of-sample municipalities (right) for the household model, the sub-area model, and the area-level model. For each model, it is visually clear that the in-sample predictions are more accurate than the out-of-sample predictions. Across models, the household-level model is discernibly more accurate in sample, as the estimates are generally closer to the dotted line than in the sub-area model. Out of sample, the differences are more difficult to distinguish.

Table 3 displays summary measures of the accuracy and coverage rates of the estimates from the different models. The second-to-last row shows that the 2010 census-based estimates are the most accurate, especially out of sample, with estimated correlations above 0.9.18 The household model improves on the accuracy of the direct estimates for in-sample municipalities, with the correlation between the estimates and the benchmark rising from 0.80 to 0.86. Consistent with that, the root mean squared deviation falls 25 percent, from 0.126 to 0.094.
The sub-area model fares slightly worse than the household-level model with respect to both accuracy measures but is an improvement over the direct estimates. The area-level model estimates are the least accurate; they do not improve on the direct estimates in terms of correlation, though they do in terms of RMSD. This may reflect imprecision in the estimated variances of municipal poverty derived from the sample, which are an input into the area-level model.

Out of sample, the predictive performance of all estimators suffers from the lack of sample data on which to condition the estimated random area effects.19 The out-of-sample correlation is only 0.70 for the household-level and sub-area models, and 0.66 for the area-level model. Similarly, the root mean squared deviation of the household-level model is nearly twice as high out of sample, at 0.181, compared with the in-sample value of 0.094. In terms of root mean squared deviation, the household and sub-area models are more accurate than the area-level estimates, which have a root mean squared deviation of 0.198.

18 However, derivations of the 2010 poverty estimates are used as predictors in the official 2015 poverty estimates. As such, part of the higher correlation might be mechanical.
19 The lower predictive performance out of sample is not due to overfitting the model, since the plug-in lasso selects more parsimonious models than cross-validation.

Finally, the bottom row shows the correlation of municipality-level poverty with the relative wealth index (RWI) created by Chi et al. (2022) as part of Meta's Data for Good project.20 This wealth index is available for 56 different low- and middle-income countries across the globe. The correlation between the RWI and poverty rates is worse than for any of the other estimates.
However, this is not an indication that the RWI is incorrect; instead, it is an important reminder that wealth does not always correlate well with poverty. Interestingly, however, the correlation between the RWI and poverty is more than 12 percent lower in out-of-sample municipalities, even though the RWI was created with completely different data. In other words, the out-of-sample municipalities may simply be systematically different from in-sample municipalities.

Turning to coverage rates, the household model moderately underestimates mean squared error, which is reflected in estimated coverage rates of 76.9% in sample and 82.5% out of sample. This underestimate of uncertainty stems from the household model failing to account for uncertainty in the estimated variance of the sample. The sub-area model performs substantially better on coverage, due to its higher MSE estimates. Meanwhile, the coverage rate of the area-level model falls between those of the household and sub-area models for in-sample municipalities and is the lowest for out-of-sample municipalities. While the in-sample coverage rate is lowest for the household model, it is only 9 percentage points lower than that of the Horvitz-Thompson direct survey estimates, and nearly twice as large as the very low coverage rate of the direct estimates when using standard clustered variance estimates.

Overall, the results indicate that the household-level model is the most accurate and the most precise of the three estimators for in-sample municipalities and is as accurate as the other models for out-of-sample municipalities. While coverage rates are moderately underestimated, they remain respectable.

Finally, Figure 3 presents boxplots of the relative bias of the estimators, a statistic not reported in the tables. The relative bias is defined as the difference between the predicted poverty rate and the official CONEVAL benchmark, divided by the latter.

20 https://dataforgood.facebook.com/
Results are reported both for in-sample (left) and out-of-sample (right) municipalities. For out-of-sample municipalities, the relative bias estimates were top-coded at 25. Both in and out of sample, the official 2010 CONEVAL estimates exhibit markedly less relative bias than the other estimators. For in-sample estimates, the boxplot shows that the 2010 CONEVAL estimates are the least biased, followed by the household model estimates, the sub-area model estimates, and the area-level estimates. For out-of-sample municipalities, the 2010 estimates do far better than any of the geospatial estimates, among which differences are hard to discern. The out-of-sample estimates contain several large positive outliers among non-sampled municipalities, including cases where the predicted poverty rate is more than ten times the official CONEVAL estimate. This highlights the value of incorporating survey data in an empirical best framework, rather than simply relying on synthetic predictions, and the value of designing surveys to guarantee inclusion of all target areas in the sample.

6. Robustness checks

6.1 Benchmarking

The results presented in Table 2 and Table 3 used estimates that were benchmarked to match the estimated poverty rates for each state. Benchmarking is appealing because it ensures consistency between small area estimates and published survey results and removes any systematic bias at the state level. It is appropriate in cases where the small areas of interest define a hierarchical structure (Pfeffermann et al., 2014). Nonetheless, it is useful to test that the main results hold when benchmarking is not applied, in part because existing software does not allow for benchmarking within the parametric bootstrap procedure when estimating mean squared error in unit-level models. Table 4 shows the results for the household-level model with and without benchmarking.
The mean poverty rate is defined as the unweighted mean estimated poverty rate across municipalities prior to benchmarking, and is therefore unaffected by benchmarking. Benchmarking has minor effects for in-sample municipalities because of the presence of many state-level dummy variables in the model. Conversely, state dummy variables were not included for many out-of-sample municipalities, and as seen in the mean poverty column of Table 2, the non-benchmarked predictions significantly underestimate poverty in these cases. As in Masaki et al. (2022), the mean squared error was multiplied by the same scaling factor used for benchmarking in each municipality, which increases mean squared error and improves coverage rates, particularly in out-of-sample municipalities.

For in-sample municipalities, benchmarking improves the correlation with the CONEVAL estimates by about a percentage point, from 0.853 to 0.862, and slightly reduces root mean squared deviation. For out-of-sample municipalities, meanwhile, benchmarking has a more substantial beneficial effect on accuracy, as the correlation with the CONEVAL estimates improves by about 3 percentage points, from 0.669 to 0.701, and root mean squared deviation falls from 0.196 to 0.181. Overall, benchmarking improves the accuracy of the estimates, particularly for out-of-sample municipalities, and the benchmarked estimates are therefore preferred.

6.2 Weights

Table 5 considers the impact of using sample and population weights when estimating the household-level model. The population weights in this case are the estimated average household size at the state level, obtained from the MCS-ENIGH survey. As noted in section 3, this is assigned to each household in the synthetic census. Sample population weights, defined as the provided household weight times household size, are provided with the MCS-ENIGH data and are used in the preferred specification when estimating the mixed effect model.
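The weight normalization used when estimating the mixed model, described in footnote 15 (weights rescaled to sum to the number of sample observations in each municipality), can be sketched as follows with hypothetical municipalities:

```python
def normalize_weights(weights_by_muni):
    """Rescale sample population weights so that, within each
    municipality, they sum to the number of sample observations
    (the normalization described in footnote 15)."""
    out = {}
    for muni, w in weights_by_muni.items():
        scale = len(w) / sum(w)
        out[muni] = [wi * scale for wi in w]
    return out

# Hypothetical municipalities with raw weights
normalized = normalize_weights({"m1": [100.0, 300.0], "m2": [2.0]})
```

Because mixed model results depend on the scale of the weights, this rescaling pins down that scale while preserving relative weights within each municipality.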
Table 5 shows that failing to use sample weights decreases the average poverty rate prior to benchmarking from about 28.1%, which nearly matches the direct estimates, to about 26.4%. This suggests that failing to incorporate weights can moderately increase bias. Interestingly, the use of sample weights makes the estimates slightly less accurate for in-sample municipalities, but slightly improves accuracy for out-of-sample municipalities. This suggests that the method that was used to apply the weights to the sample may not be optimal for generating in-sample predictions in empirical best models. We nonetheless prefer the weighted estimates because they are more accurate for out-of-sample predictions and because they generate larger and more accurate estimates of uncertainty. Because the cost to accuracy for in-sample municipalities is minor, the larger estimates of uncertainty increase the coverage rate for in-sample municipalities. Failing to weight by state household size when aggregating the simulation results to municipalities has virtually no effect on the estimates, because the population weights only vary at the state level and the benchmarking procedure aligns the estimates with the survey-based estimates at that level.

6.3 Model heterogeneity

So far, the results shown have come from a single estimated national model. However, estimating separate models might improve performance by better capturing the heterogeneity across states or across urban and rural areas. Table 6 shows how the results change under several variants of the estimation and simulation procedure. The baseline model considers estimates from one national model. Mexico has separate poverty lines for urban and rural areas, however, which must be taken into account when estimating headcount poverty rates during the simulation stage.
Because the software package used allows for only one poverty line, the baseline model inflates the per capita income of rural areas in the sample by the ratio of the urban to the rural poverty line. After making this adjustment, the urban poverty line is applied to all households in the synthetic census, in both urban and rural areas. The second model considers a similar approach, but instead estimates six models, each corresponding to a different group of states.21 This may better capture the heterogeneity across these state groupings. The third model takes a different approach to handling the separate urban and rural poverty lines: it simulates urban and rural poverty rates separately, based on a single national model, and then takes the population-weighted average of the urban and rural estimates for each municipality.22 When taking this population-weighted average, we assume zero covariance between the urban and rural poverty estimates of common municipalities, which further underestimates the mean squared error. The fourth model is identical to the third, except that it estimates six different models, one for each group of states.

21 These are the same six groups used in official government statistics.
22 It is less straightforward to estimate the mean squared error of the population-weighted urban and rural average. Partly this is because the random component of the municipal mixed effect will be drawn independently for rural and urban areas, underestimating the mean squared error. We therefore estimate the covariance of urban and rural poverty rates across municipalities, assuming the poverty estimates are unbiased and that the MSE estimates can be used to estimate variance. We then estimate the covariance between urban and rural municipalities and calculate the variance of the population-weighted average.
The fifth and final model also simulates urban and rural poverty rates separately, but differs from the third by estimating two prediction models – one for urban households and one for rural households – instead of a single national model. This allows the estimated relationship between per capita income and the geospatial predictors to vary between urban and rural areas. In terms of accuracy, as measured by correlation with the official CONEVAL estimates, the results are robust to all alternative specifications for in-sample municipalities. The grouped-states model is slightly more accurate than the national model for in-sample municipalities and slightly less accurate for out-of-sample municipalities. The variant that estimates nominal welfare with six group models is the most accurate both in and out of sample, but the difference in accuracy relative to the baseline estimator is modest. In terms of bias (mean poverty), all estimates are virtually unbiased within sample, but all are substantially downwardly biased when predicting out of sample. The average poverty rate of the official estimates for the out-of-sample municipalities reported in Table 2 is 0.426, while the analogous figure for mean poverty in Table 6 ranges from 0.355 to 0.379. In terms of efficiency, the estimated mean squared error and coefficient of variation are far lower for the nominal welfare model than for the deflated welfare model, likely because the former does not account for the positive correlation between the rural and urban areas of each municipality and therefore underestimates uncertainty. The coverage rates for the nominal models are therefore below 70 percent for in-sample municipalities. Estimating separate urban and rural models, however, makes the estimates substantially less precise, which brings the in-sample coverage rate back up to the level of the baseline estimates. Overall, the results of the household model are generally robust to alternative specifications.
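Two mechanics from the variants above can be sketched compactly: the rescaling of rural incomes so that a single (urban) poverty line applies to every household, and the variance of a population-weighted urban/rural average once a covariance term is added back in. This is an illustrative sketch under the paper's stated assumptions (the MSE estimates stand in for variances); all names and numbers are ours, not the paper's.

```python
import numpy as np

def poor_under_common_line(pc_income, is_rural, line_urban, line_rural):
    """Inflate rural incomes by the urban/rural poverty-line ratio,
    then apply the urban line to all households."""
    adjusted = np.where(is_rural, pc_income * line_urban / line_rural, pc_income)
    return adjusted < line_urban

def combined_estimate_and_mse(p_urban, p_rural, w_urban,
                              mse_urban, mse_rural, cov):
    """Population-weighted urban/rural poverty rate for a municipality,
    plus the variance of that weighted average including the covariance
    term: w_u^2 V_u + w_r^2 V_r + 2 w_u w_r Cov."""
    w_rural = 1.0 - w_urban
    est = w_urban * p_urban + w_rural * p_rural
    var = (w_urban**2 * mse_urban + w_rural**2 * mse_rural
           + 2 * w_urban * w_rural * cov)
    return est, var

# A rural household with income 900 is above its own line (700), so it is
# correctly classified as non-poor under the common urban line (1000),
# while an urban household with the same income is poor.
poor = poor_under_common_line(np.array([900.0, 900.0]),
                              np.array([False, True]),
                              line_urban=1000.0, line_rural=700.0)

est, var = combined_estimate_and_mse(p_urban=0.20, p_rural=0.40, w_urban=0.6,
                                     mse_urban=0.0009, mse_rural=0.0016,
                                     cov=0.0005)
```

Setting `cov=0`, as in the zero-covariance assumption described above, mechanically shrinks the variance, which is why that assumption understates the mean squared error.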
In particular, the range of in-sample correlations across the five methods is narrow, from 0.862 to 0.873. The range of out-of-sample correlations is slightly larger, since the sample does not contribute information, but is still only about 1.3 percentage points, from 0.697 to 0.710. Simulating separate urban and rural models is associated with far greater uncertainty, as reflected in the estimated CV and MSE, but also with substantially higher coverage rates. This is because when urban and rural areas are simulated separately, whether based on a national model or on separate urban and rural models, the simulated random component is no longer common to urban and rural areas in the same municipality. This understates uncertainty in the model by incorrectly assuming that the incomes of urban and rural households in the same municipality are independent. To address this, we estimate the covariance of the estimated urban and rural poverty rates across municipalities and account for it when estimating the variance of the municipal poverty estimates. This in turn overestimates the uncertainty associated with the estimates, which counteracts the underestimation that arises from assuming the variance is known rather than estimated. However, the MSE and coverage rate are still slightly below those of the baseline estimates, which use spatially deflated welfare and a national model, because the latter allow for positive covariance between the urban and rural areas of a municipality.

6.4 Using 2016 sample data

The analysis up to this point has used a single household survey, the 2014 MCS-ENIGH, to estimate municipal-level poverty. While this sample was drawn to generate official measures of poverty, it is also useful to check that the results are robust to the use of an alternative sample. We therefore repeat the analysis using the 2016 MCS-ENIGH instead of the 2014 round, which contains a different set of selected AGEBs.
This also eliminates any possible mechanical correlation between the small area estimates and the benchmark CONEVAL estimates, which arises because both use the 2014 MCS-ENIGH survey to estimate the empirical best prediction model. Table 7 reports the results when using the 2016 sample for the baseline specification. The main difference is that the estimates are moderately less accurate, due to the use of survey data that differs from that used to generate the benchmark. The correlation with the benchmark is now only 0.81, as opposed to 0.86 when using the 2014 survey. The same pattern of results holds, however, when ranking across methods. For in-sample areas, the household model gives moderately more accurate estimates than the sub-area model (correlation of 0.78), while the area-level model and direct estimates give less accurate estimates (correlation of 0.75 each). Furthermore, the household model is far more precisely estimated for in-sample municipalities (median CV of 21 vs. 31 for the area-level model). Uncertainty is significantly underestimated in the household model, however, as the coverage rate falls to 59 percent, as opposed to 82 percent for the sub-area model and 67 percent for the area-level model. Figure 4 shows the boxplot for relative bias. All estimates are slightly biased downward. This is because they are estimated using 2016 data, which is also used for the benchmarking procedure, and then compared against 2015 estimates generated using the 2014 sample data. It is therefore not surprising that estimates based on 2016 data fall systematically below the benchmark estimates, reflecting the overall trend of poverty decline. Nonetheless, the results confirm that the household-level model generates less biased results than the sub-area and area-level models in sample, which in turn each give less biased results than the direct estimates.
Meanwhile, the household-level model does at least as well as – if not slightly better than – the sub-area and area-level models out of sample. When using the 2016 sample data to estimate the model, the estimates produced by the household model are less biased than those produced by the sub-area and area-level models and represent a considerable improvement in accuracy and efficiency over the direct estimates. However, uncertainty appears to be substantially underestimated, which is less true for the sub-area estimates. Thus, while accuracy is best for the household-level model, the sub-area model outperforms it in terms of correctly estimating uncertainty.

6.5 Including predictors at various levels

The baseline specification includes AGEB-level variables, municipal-level variables, and state-level dummies as candidate predictors. To better understand the sensitivity of the household model to including variables at different levels, we estimate five additional models, representing different combinations of including or excluding AGEB variables, municipal variables, and state dummies. Two models use only AGEB or only municipal candidate predictors without state dummies, two more use only AGEB or only municipal candidate predictors with state dummies, and one uses AGEB and municipal predictors without state dummies. In all cases, we use plug-in LASSO to select a model from the set of candidate predictor variables. Table 8 reports the results from this exercise. The in-sample results are very robust to different model specifications, as correlations with the benchmark vary by only about half a point. On the other hand, the accuracy of out-of-sample predictions is much more sensitive to the set of candidate predictors. The baseline specification, which includes state dummies and AGEB and municipal variables, performs the best, by between 1.5 and 23 points.
The second-most accurate model is the one that includes state dummies and municipal variables, suggesting that the AGEB-level variables, conditional on the municipal-level variables, add only a modest amount of accuracy in this case. In this context, including dummies at the level at which the sample is representative, as well as area- and sub-area-level variables, notably improves out-of-sample predictive performance.

6.6 Design-based simulation

Finally, we also present results from a design-based simulation in Appendix E. Since we use a single sample from 2014, one possible concern is that the results presented here may not hold in other samples. While the results when using the 2016 sample suggest that the main results are robust, we nonetheless present additional results using repeated sampling from the intercensus. For this exercise, because AGEB-level identifiers are not available in the intercensus, we use auxiliary data aggregated to the municipal level. The overall results presented in Appendix E are consistent with the findings and conclusions from the 2014 and 2016 surveys. In particular, they indicate that the household-level model provides equally accurate estimates in sample and significantly more accurate estimates out of sample: for out-of-sample municipalities, the average correlation with truth in the simulations was 0.80 for the unit-context model, as opposed to 0.75 for the area-level model.

7. Conclusion

This paper examined the ability of methods combining survey and satellite data to generate more accurate estimates of monetary poverty in Mexican municipalities. Mexico is a valuable case study for evaluating small area estimation methodology because official estimates, generated using a best-practice unit-level model with household-level auxiliary data from the 2015 intercensus, provide a solid benchmark for comparison.
Unfortunately, these types of comprehensive intercensal surveys are rarely available in developing countries, which often face challenges fielding censuses every ten years. In addition, CONEVAL produces and makes publicly available estimates generated using the 2010 census, which provide a useful point of comparison. Incorporating even a small number of satellite predictors with moderate predictive performance markedly improves the accuracy of the sample estimates. In particular, the correlation with the benchmark official estimates improves from 0.8 to 0.86 when using a household-level “unit-context” model instead of estimates derived from the sample survey. Precision improves dramatically as well: the coefficient of variation drops by roughly a factor of two, roughly equivalent to quadrupling the size of the survey. The gain in precision is moderately overestimated, but after adjusting for that to maintain the same coverage rate as the direct estimates, incorporating the small area estimates effectively increases the sample size by a factor of about 2.4. When using satellite data, there is a large difference in accuracy between estimates for in-sample and out-of-sample municipalities. While the preferred household model generated a correlation of 0.86 with the intercensus benchmark in sample, this fell to 0.7 out of sample. This difference in performance arises for two reasons. First, empirical best predictor models improve accuracy by efficiently combining the survey data with model predictions for each municipality. Second, out-of-sample municipalities are systematically smaller and poorer than sampled municipalities, which reduces the accuracy of out-of-sample predictions in this context.
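The "quadrupling the survey" arithmetic follows from the fact that, holding the design fixed, the coefficient of variation of an estimate scales roughly with 1/sqrt(n). A small sketch of that back-of-the-envelope calculation, using the median CVs reported in Table 2 purely for illustration:

```python
def effective_sample_multiplier(cv_direct, cv_model):
    """Implied effective sample-size multiplier when CV scales as
    1/sqrt(n): n_eff / n = (cv_direct / cv_model) ** 2."""
    return (cv_direct / cv_model) ** 2

# Median CVs from Table 2: 38.5 (direct, Horvitz-Thompson) vs 19.8
# (household-level model). Roughly halving the CV implies close to
# quadrupling the effective sample size.
multiplier = effective_sample_multiplier(38.5, 19.8)
```

As the text notes, this raw multiplier overstates the gain; the paper's adjusted figure, after holding the coverage rate fixed, is about 2.4.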
These results suggest additional caution when relying on purely synthetic predictions generated from satellite data.23 An important benefit of this type of evaluation exercise against a credible benchmark is the opportunity to compare different statistical methods for incorporating geospatial data into small area estimation. In the main set of results, using the 2014 survey data, a household-level model gives more accurate predictions than a sub-area model for sampled areas (correlation of 0.86 vs 0.83) and equally accurate predictions for non-sampled municipalities (correlation of 0.7). When using the 2016 survey, the predictions from the household-level model are more accurate than those from the sub-area model both in and out of sample. The area-level model is generally less accurate than the sub-area model, with in-sample correlations of 0.8 when using the 2014 data and 0.75 when using the 2016 data. The exception is out-of-sample predictions using the 2016 data, where area-level estimates are more accurate than sub-area estimates but remain less accurate than the household model. Both the household-level and area-level models underestimate uncertainty, relative to the sub-area model, in the 2016 data. In artificial simulations, when compared with predictions from an area-level model, predictions from a household-level “unit-context” model specified with area-level predictors were equally accurate in sample, but markedly more accurate out of sample. In sum, there is a clear ranking in accuracy across the different methods for incorporating the geospatial auxiliary data: the household model was usually more accurate, and never noticeably less accurate, than the two other methods.

23 It is possible that more sophisticated models or methods that allow for non-linearities and interactions may better predict out of sample, especially if the correlations of predictors vary by factors such as municipality size. This is a topic for further research.
This was generally followed by the sub-area model estimates, and finally the area-level model estimates. There is no evidence in any of the settings considered that area-level models give estimates that are more accurate and efficient, or less biased, than household-level unit-context models. All geospatial small area estimates, however, remain substantially less accurate than those generated using survey and household census data from 2010. This suggests that municipal poverty patterns changed slowly in Mexico during this time, although the methodology used to generate the 2015 benchmark estimates may also not fully capture transient changes in welfare.24 The correlation between the 2010 estimates and the 2015 estimates is very high: 0.91 in sample and 0.90 out of sample. Thus, if forced to use municipal poverty estimates to inform a pro-poor budget allocation in 2015, the most accurate source would be the 2010 census-based estimates. Analysts should therefore exercise healthy caution before jettisoning moderately older census-based estimates in favor of newer small area estimates based on geospatial data. The high accuracy of older census-based estimates, however, may not generalize to other contexts, and more research is needed to better understand under what circumstances older census-based estimates give more accurate predictions than recent geospatial poverty estimates. This will depend on the extent to which regional poverty patterns have changed in the country since the last census, as well as on the size of the sample survey and the predictive power of the satellite-based auxiliary data, among other factors. It is also likely that the predictive performance of geospatial and other big data will continue to improve as more and better indicators become available. In addition, surveys can be fielded in ways that better facilitate integration with geospatial data.
For example, increasing the number of households interviewed in each enumeration area would likely further improve the predictive performance of the model and the accuracy of the estimates. Likewise, given the better performance of all small area estimates when predicting in-sample versus out-of-sample municipalities, expanding sampling to all target areas may yield marked improvements in predictive power. The best survey design for small area estimation with big data thus remains a fruitful area for future research. Extending small area estimation models also remains an important area of research. For example, sub-area models could better incorporate sampling variability, while unit-context models might benefit from the incorporation of sub-area random effects. Overall, these results demonstrate, in the Mexican context, that augmenting household survey data with geospatial data substantially improves the accuracy and precision of survey-based estimates of monetary poverty. In countries where census-based estimates are heavily outdated or non-existent, we believe the results of this evaluation exercise are sufficiently encouraging to justify combining survey and satellite data to produce small area estimates of monetary poverty.

24 This is partly because the model used to generate the official 2015 estimates of municipal poverty rates utilized data from the 2010 census as predictor variables.

References

Arora, V., Lahiri, P., & Mukherjee, K. (1997). Empirical Bayes estimation of finite population means from complex surveys. Journal of the American Statistical Association, 92, 1555-1562.

Babenko, B., Hersh, J., Newhouse, D., Ramakrishnan, A., & Swartz, T. (2017). Poverty mapping using convolutional neural networks trained on high and medium resolution satellite images, with an application in Mexico. arXiv preprint arXiv:1711.06323.

Battese, G. E., Harter, R. M., & Fuller, W. A. (1988).
An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83(401), 28-36.

Bell, W. R. (2008). Examining sensitivity of small area inferences to uncertainty about sampling error variances. Proceedings of the American Statistical Association, Survey Research Methods Section.

Belloni, A., & Chernozhukov, V. (2011). High dimensional sparse econometric models: An introduction. In Inverse Problems and High-Dimensional Estimation (pp. 121-156). Springer, Berlin, Heidelberg.

Blumenstock, J., Cadamuro, G., & On, R. (2015). Predicting poverty and wealth from mobile phone metadata. Science, 350(6264), 1073-1076.

Burke, M., Driscoll, A., Lobell, D. B., & Ermon, S. (2021). Using satellite imagery to understand and promote sustainable development. Science, 371(6535).

Butar, F., & Lahiri, P. (2002). On the measures of uncertainty of empirical Bayes small-area estimators. Journal of Statistical Planning and Inference, 112, 63-76.

Chi, G., Fang, H., Chatterjee, S., & Blumenstock, J. E. (2022). Microestimates of wealth for all low- and middle-income countries. Proceedings of the National Academy of Sciences, 119(3), e2113658119.

Corral, P., Himelein, K., McGee, K., & Molina, I. (2021). A map of the poor or a poor map? Mathematics, 9(21), 2780.

Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2003). Micro-level estimation of poverty and inequality. Econometrica, 71(1), 355-364.

Engstrom, R., Hersh, J., & Newhouse, D. (2022). Poverty from space: Using high resolution satellite imagery for estimating economic well-being. The World Bank Economic Review, 36(2), 382-412.

Fay, R. E., & Herriot, R. A. (1979). Estimates of income for small places: An application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a), 269-277.

González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D., & Santamaría, L. (2008).
Bootstrap mean squared error of a small-area EBLUP. Journal of Statistical Computation and Simulation, 78(5), 443-462.

Ghosh, M. (2020). Small area estimation: Its evolution in five decades. Statistics in Transition. New Series, 21(4), 1-22.

Halbmeier, C., Kreutzmann, A. K., Schmid, T., & Schröder, C. (2019). The fayherriot command for estimating small-area indicators. The Stata Journal, 19(3), 626-644.

Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260), 663-685.

Jiang, J., & Lahiri, P. (2001). Empirical best prediction for small area inference with binary data. Annals of the Institute of Statistical Mathematics, 53(2), 217-243.

Jiang, J., Lahiri, P., & Wan, S. (2002). Jackknifing the mean squared error of empirical best predictor. Annals of Statistics, 30, 1782-1810.

Jiang, J., & Lahiri, P. (2006). Mixed model prediction and small area estimation. Test, 15(1), 1-96.

Jiang, J., & Rao, J. S. (2020). Robust small area estimation: An overview. Annual Review of Statistics and Its Application, 7, 337-360.

Jean, N., Burke, M., Xie, M., Davis, W. M., Lobell, D. B., & Ermon, S. (2016). Combining satellite imagery and machine learning to predict poverty. Science, 353(6301), 790-794.

Kreutzmann, A. K., Pannier, S., Rojas-Perilla, N., Schmid, T., Templ, M., & Tzavidis, N. (2019). The R package emdi for estimating and mapping regionally disaggregated indicators. Journal of Statistical Software, 91.

Lahiri, P. (1987). Robust empirical Bayes estimation in finite population sampling.

Li, H., & Lahiri, P. (2010). An adjusted maximum likelihood method for solving small area estimation problems. Journal of Multivariate Analysis, 101(4), 882-892.

Masaki, T., Newhouse, D., Silwal, A. R., Bedada, A., & Engstrom, R. (2022). Small area estimation of non-monetary poverty with geospatial data. Statistical Journal of the IAOS, pre-press.

Merfeld, J. D., Newhouse, D.
L., Weber, M., & Lahiri, P. (2022). Combining survey and geospatial data can significantly improve gender-disaggregated estimates of labor market outcomes. World Bank Policy Research Working Paper 10077.

Molina, I., & Rao, J. N. K. (2010). Small area estimation of poverty indicators. Canadian Journal of Statistics, 38(3), 369-385.

Rao, J. N. K., & Molina, I. (2015). Small Area Estimation. John Wiley & Sons.

McBride, L., Barrett, C. B., Browne, C., Hu, L., Liu, Y., Matteson, D. S., ... & Wen, J. (2021). Predicting poverty and malnutrition for targeting, mapping, monitoring, and early warning (No. 2388-2021-383).

National Statistical Office (NSO) of Malawi. (2021). Malawi Poverty Report 2020.

Peterson, R. A., & Cavanaugh, J. E. (2020). Ordered quantile normalization: A semiparametric transformation built for the cross-validation era. Journal of Applied Statistics, 47(13-15), 2312-2327.

Pfeffermann, D., Sikov, A., & Tiller, R. (2014). Single- and two-stage cross-sectional and time series benchmarking procedures for small area estimation. Test, 23(4), 631-666.

Pokhriyal, N., & Jacques, D. C. (2017). Combining disparate data sources for improved poverty prediction and mapping. Proceedings of the National Academy of Sciences, 114(46), E9783-E9792.

Rao, J. N. K. (2003). Small Area Estimation. Wiley, London.

Rao, J. N. K. (2014). Small-area estimation. Wiley StatsRef: Statistics Reference Online, 1-8.

Ravallion, M. (2015). The Economics of Poverty: History, Measurement, and Policy. Oxford University Press.

StataCorp. (2021). Stata: Release 17. Statistical Software. College Station, TX: StataCorp LLC.

Steele, J. E., Sundsøy, P. R., Pezzulo, C., Alegana, V. A., Bird, T. J., Blumenstock, J., ... & Bengtsson, L. (2017). Mapping poverty using mobile phone and satellite data. Journal of The Royal Society Interface, 14(127), 20160690.

Steele, J. E., Pezzulo, C., Albert, M., Brooks, C. J., Erbach-Schoenberg, E., O’Connor, S. B., ... & Tatem, A. J. (2021).
Mobility and phone call behavior explain patterns in poverty at high-resolution across multiple settings. Humanities and Social Sciences Communications, 8(1), 1-12.

Tarozzi, A., & Deaton, A. (2009). Using census and survey data to estimate poverty and inequality for small areas. The Review of Economics and Statistics, 91(4), 773-792.

Torabi, M., & Rao, J. N. K. (2014). On small area estimation under a sub-area level model. Journal of Multivariate Analysis, 127, 36-55.

Valliant, R., Dorfman, A. H., & Royall, R. M. (2000). Finite Population Sampling and Inference. John Wiley & Sons, Inc.

Van der Weide, R. (2014). GLS estimation and empirical Bayes prediction for linear mixed models with heteroskedasticity and sampling weights: A background study for the povmap project. The World Bank.

World Bank. (2021). World Development Report 2021: Data for Better Lives. The World Bank.

Yeh, C., Perez, A., Driscoll, A., Azzari, G., Tang, Z., Lobell, D., ... & Burke, M. (2020). Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature Communications, 11(1), 1-11.

Tables

Table 1: Household model diagnostics

Number of included predictor variables: 23

Sample
  Number of households: 57,661
  Number of AGEBs: 6,794
  Number of municipalities: 892

Synthetic census
  Number of households: 27,973,210
  Number of municipalities: 2,433

Household model diagnostics
  Marginal R2: 0.134
  Conditional R2: 0.225
  Skewness of area effect: 0.041
  Kurtosis of area effect: 3.447
  Shapiro-Wilk P-value: 0.043
  Skewness of household error: -0.212
  Kurtosis of household error: 3.749
  Shapiro-Wilk P-value: N/A
  Estimated variance of area effect: 0.009
  Estimated variance of household error term: 0.659
  Ratio of estimated variance of area effect to total variance: 0.014
  Average complement of shrinkage factor: 0.695

Notes: Number of included predictor variables excludes constant.
Sample figures reflect the number of households, AGEBs, and municipalities contained in the MCS-ENIGH sample. Synthetic census figures reflect the number of households and municipalities in the population used to generate municipal poverty estimates, based on the 2010 census. Marginal R2 reflects the variation explained by model predictors, while conditional R2 includes the variation explained by conditioning the mixed effect on sample welfare. Skewness and kurtosis of the area effect and household error term describe the distribution of these components of the error term. The average complement of shrinkage is equal to one minus the average of the estimated shrinkage factors. It is a measure of the average percentage by which conditioning the area effect on the sample data shrinks the variance of the mixed effect.

Figure 1 - Quantile-quantile plots of household model residuals

Table 2 - Poverty predictions for municipalities: Mean poverty and precision

                                        Sampled municipalities       Non-sampled municipalities
                                        Mean     Mean    Median      Mean     Mean    Median
                                        poverty  MSE     CV          poverty  MSE     CV
Direct survey estimates
  Horvitz-Thompson approximation        0.282    155.8   38.5        N/A      N/A     N/A
  AGEB-clustered variance estimates     0.282    25.2    11.3        N/A      N/A     N/A
CNN-based model estimates
  Household-level EBP model             0.281    35.8    19.8        0.355    150.3   33.9
  Bias-corrected household EBP model
  Sub-area model                        0.282    101.5   35.6        0.365    306.2   47.3
  Area-level model                      0.227    64.7    28.1        0.271    158.1   37.5
Official 2010 estimates                 0.266    N/A     N/A         0.459    N/A     N/A
Official 2015 estimates                 0.298    N/A     N/A         0.426    N/A     N/A

Notes: Table shows key results for direct estimates, household model estimates, sub-area model estimates, and area-level model estimates. In-sample refers to the 893 municipalities contained in the 2014 MCS-ENIGH survey, while out of sample refers to the 1,418 municipalities not covered by the survey. Direct estimates are reported separately when variance is estimated using the Horvitz-Thompson approximation and when clustering on AGEBs.
The mean poverty represents the simple average poverty rate across municipalities prior to benchmarking, which is a measure of the extent of bias in the model-based estimates relative to the direct and official estimates. The mean MSE is the mean estimated mean squared error (times 10,000) across municipalities. Model-based estimates of MSE reflect benchmarking estimated poverty rates to direct estimates at the state level, as described in section 4. The median CV is the median across municipalities of the coefficient of variation, defined for each municipality as the square root of the estimated mean squared error divided by the estimated poverty rate. All results are unweighted means or medians across municipalities.

Figure 2 - Predicted-true plots. Panel A: Household model; Panel B: Sub-area model; Panel C: Area-level model

Table 3 - Accuracy and coverage against official 2015 poverty estimates

                                 Sampled municipalities      Non-sampled municipalities
                                 Corr    RMSD    CR          Corr    RMSD    CR
Direct survey estimates
  H-T                            0.856   0.126   0.800       N/A     N/A     N/A
  EA-Clustering                  0.856   0.126   0.393       N/A     N/A     N/A
Household-level model            0.862   0.094   0.769       0.701   0.181   0.825
Sub-area model                   0.834   0.103   0.910       0.696   0.183   0.941
Area-level model                 0.796   0.110   0.824       0.662   0.198   0.801
Official 2010 estimates          0.912   0.083   N/A         0.904   0.109   N/A
Meta relative wealth index       -0.592  N/A     N/A         -0.516  N/A     N/A

Notes: Table shows evaluation results for direct estimates, household model estimates, sub-area model estimates, and area-level model estimates against official municipality poverty rates derived from a standard household-level small area estimation exercise undertaken by CONEVAL. In-sample refers to the 893 municipalities contained in the MCS-ENIGH survey, while out of sample refers to the 1,418 municipalities not covered by the survey. Direct survey estimates are reported separately when variance is estimated using the Horvitz-Thompson approximation and when clustering on AGEBs.
Model-based estimates are benchmarked at the state level as described in section 4. Corr represents the simple correlation between the estimated poverty rates and the official CONEVAL benchmark. RMSD represents the root mean squared deviation of the predictions relative to the official estimates, and is an alternative measure of prediction accuracy. CR represents the coverage rate, which is the share of municipalities for which the official estimate lies within the 95 percent confidence interval of the model-based or survey-based estimates. It is a measure of the accuracy of the estimated mean squared error.

Figure 3 - Box plots of relative bias, 2014 MCS-ENIGH estimates (panels: in-sample municipalities and out-of-sample municipalities; horizontal axis: relative bias; series: 2010 official estimates, household-level model, sub-area model, area-level model, and, for in-sample municipalities, direct survey estimates)

Note: The figure shows the distribution of relative bias of municipal poverty estimates by estimation method. Relative bias is computed as the ratio of the difference between the estimate and the CONEVAL benchmark to the CONEVAL benchmark. The 2010 official estimates are produced by CONEVAL based on the 2010 census. The household-level, sub-area, and area-level model estimates are produced by combining the 2014 MCS-ENIGH with satellite indicators. The direct estimates are generated using only the 2014 MCS-ENIGH survey.
Table 4 - Robustness of household-level model estimates to benchmarking

Household model             Mean      Mean    Median
estimates                   poverty   MSE     CV       Corr     RMSD    CR
Sampled municipalities
  State-level benchmark     0.281     35.8    19.8     0.862    0.094   0.769
  No benchmark              0.281     34.3    19.8     0.853    0.095   0.753
Non-sampled municipalities
  State-level benchmark     0.355     150.3   33.9     0.701    0.181   0.825
  No benchmark              0.355     138.5   33.9     0.669    0.196   0.779

Notes: Table shows evaluation statistics for benchmarked and non-benchmarked municipal poverty estimates. In-sample refers to the 893 municipalities contained in the MCS-ENIGH survey, while out of sample refers to the 1,418 municipalities not covered by the survey. Columns are defined as in Tables 2 and 3. Benchmarking entails multiplying each municipal estimate by the ratio of the weighted average state direct estimate to the population-weighted average of the municipal poverty estimates for each state. This ensures that the state average poverty rate for the small area estimates is consistent with the direct survey estimates. The weights in the numerator are the sample weights provided in the survey, and the weights in the denominator are taken from the 2010 census. MSE estimates for each municipality are also multiplied by the same ratio that is applied to the poverty estimates.

Table 5 - Results with and without sample and population weights

Household model                   Mean      Mean    Median
                                  poverty   MSE     CV       Correlation  RMSD    CR
Sampled municipalities
  Baseline estimates              0.281     35.8    19.8     0.862        0.094   0.769
  Aggregation weights but no
    sample weights                0.264     32.4    18.2     0.864        0.093   0.748
  Neither sample nor
    aggregation weights           0.264     32.4    18.2     0.865        0.093   0.748
Non-sampled municipalities
  Baseline estimates              0.355     150.3   33.9     0.701        0.181   0.825
  Aggregation weights but no
    sample weights                0.336     145.4   32.9     0.700        0.180   0.824
  Neither sample nor
    aggregation weights           0.336     145.6   32.9     0.700        0.180   0.825

Notes: Table shows evaluation statistics for estimates with different weights applied.
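The state-level benchmarking step described in the notes can be sketched as follows; the function and the toy state with two municipalities are hypothetical, not from the paper:

```python
def benchmark_to_state(muni_est, muni_pop_weight, state_direct_estimate):
    """Scale municipal estimates within one state so their population-weighted
    average matches the state-level direct survey estimate."""
    weighted_avg = (sum(e * w for e, w in zip(muni_est, muni_pop_weight))
                    / sum(muni_pop_weight))
    ratio = state_direct_estimate / weighted_avg
    return [e * ratio for e in muni_est]

# Toy state with two municipalities and census population weights
bench = benchmark_to_state([0.20, 0.40], [100, 300], state_direct_estimate=0.38)
# weighted average before scaling = (0.20*100 + 0.40*300)/400 = 0.35
```

Multiplying every estimate in a state by the same ratio preserves relative differences across municipalities while forcing the state aggregate to match the direct estimate.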
The baseline estimates apply both sample weights when estimating the model and estimated household size in the population when aggregating the simulation results across households. Including population weights makes the estimated poverty rates representative of individuals rather than households. Sample weights are taken from the survey, and the estimated household size for each AGEB is obtained from the 2010 census. Columns are defined as in Tables 2 and 3.

Table 6 - Robustness of household model results to different specifications

Household model                      Mean      Mean    Median
                                     poverty   MSE     CV       Correlation  RMSD    CR
Sampled municipalities
  Deflated welfare, national model   0.281     35.8    19.8     0.862        0.094   0.769
  Deflated welfare, grouped-state
    models                           0.280     35.2    19.6     0.870        0.092   0.777
  Nominal welfare, national model    0.283     23.4    15.5     0.865        0.093   0.679
  Nominal welfare, grouped-state
    models                           0.282     23.2    15.3     0.873        0.091   0.671
  Nominal welfare, urban/rural
    models                           0.280     33.6    19.1     0.866        0.093   0.767
Non-sampled municipalities
  Deflated welfare, national model   0.355     150.3   33.9     0.701        0.181   0.825
  Deflated welfare, grouped-state
    models                           0.365     170.6   33.4     0.697        0.173   0.869
  Nominal welfare, national model    0.366     143.7   26.3     0.705        0.178   0.751
  Nominal welfare, grouped-state
    models                           0.379     160.1   25.8     0.710        0.167   0.803
  Nominal welfare, urban/rural
    models                           0.356     97.0    26.9     0.707        0.181   0.726

Notes: Deflated welfare refers to welfare after multiplying the welfare of rural households by the ratio of the urban to rural poverty lines and using the urban poverty line as the poverty threshold. National model refers to a single welfare model estimated on the full MCS-ENIGH sample. Grouped states refers to estimating six separate models, one for each group of states as defined in Table C1.
Nominal welfare refers to the use of nominal welfare, with poverty rates simulated separately for urban and rural areas using urban- and rural-specific poverty lines. In the three nominal welfare models, the final municipal estimates reflect a weighted average of the urban and rural poverty estimates for each model, with weights equal to the urban and rural populations in the 2010 census. The estimated MSE in these cases accounts for the estimated correlation between urban and rural poverty rates in the same municipality, as described in section 6. Urban/rural models refers to separate models for urban and rural areas. Columns are defined as in Tables 2 and 3.

Table 7 - Robustness of model results to use of 2016 sample data

                                    Mean      Mean    Median                        Coverage
                                    poverty   MSE     CV       Correlation  RMSD    rate
Sampled municipalities
  Direct survey estimates
    (Horvitz-Thompson)              0.209     84.2    38.2     0.751        0.136   0.699
  Household model                   0.216     21.1    21.1     0.808        0.109   0.588
  Sub-area model                    0.210     62.2    40.3     0.777        0.120   0.818
  Area-level model                  0.166     38.0    30.7     0.749        0.126   0.672
Non-sampled municipalities
  Household model                   0.315     116.9   35.8     0.669        0.217   0.667
  Sub-area model                    0.307     227.5   50.7     0.643        0.226   0.792
  Area-level model                  0.228     100.2   36.5     0.655        0.233   0.617

Notes: Table shows key results using the 2016 MCS-ENIGH household survey to estimate the model instead of the 2014 survey. In-sample refers to the 944 municipalities contained in the 2016 MCS-ENIGH survey, while out of sample refers to the 1,512 municipalities not covered by the survey. Direct estimates are reported using the Horvitz-Thompson approximation to generate variance estimates. Model-based estimates of MSE reflect benchmarking estimated poverty rates to direct estimates from the MCS 2016 survey at the state level. Columns are defined as in Tables 2 and 3.
Table 8 - Robustness of household model results to different sets of predictors

                                         Mean      Mean    Median                        Coverage
                                         poverty   MSE     CV       Correlation  RMSD    rate
Sampled municipalities
  AGEB and municipal predictors with
    state dummies (baseline)             0.281     35.6    19.6     0.863        0.094   0.767
  Municipal predictors with state
    dummies                              0.281     36.0    19.6     0.859        0.095   0.768
  AGEB predictors with state dummies     0.281     36.1    19.6     0.856        0.096   0.756
  AGEB and municipal predictors, no
    state dummies                        0.280     37.5    20.3     0.862        0.094   0.779
  Municipal predictors only, no state
    dummies                              0.280     37.3    20.3     0.860        0.095   0.778
  AGEB predictors only, no state
    dummies                              0.280     37.8    20.3     0.857        0.096   0.771
Non-sampled municipalities
  AGEB and municipal predictors with
    state dummies (baseline)             0.355     150.1   33.8     0.701        0.181   0.824
  Municipal predictors with state
    dummies                              0.352     153.6   34.2     0.687        0.184   0.831
  AGEB predictors with state dummies     0.341     154.1   35.4     0.631        0.198   0.801
  AGEB and municipal predictors, no
    state dummies                        0.341     173.3   37.0     0.602        0.204   0.817
  Municipal predictors only, no state
    dummies                              0.343     175.2   37.4     0.607        0.201   0.824
  AGEB predictors only, no state
    dummies                              0.322     175.3   39.6     0.469        0.225   0.788

Notes: Table shows key results when using different sets of candidate variables to select the model. In-sample refers to the 944 municipalities contained in the 2016 MCS-ENIGH survey, while out of sample refers to the 1,512 municipalities not covered by the survey. Estimates of MSE reflect scaling by benchmarking factors. Columns are defined as in Tables 2 and 3.

Figure 4 - Box plots of relative bias, 2016 MCS-ENIGH estimates (panels: in-sample municipalities and out-of-sample municipalities; horizontal axis: relative bias; series: household-level model, sub-area model, area-level model, and, for in-sample municipalities, direct survey estimates)

Note: See notes to Figure 3.
The household-level, sub-area, and area-level model estimates are produced by combining the 2016 MCS-ENIGH with satellite indicators. The direct estimates are generated using the 2016 survey alone.

Appendix A

Table A1 - Geographic structure of Mexico

                      Sub-areas (AGEBs)         Areas (Municipalities)
Name                  Sample    Population      Sample    Population
All states
  National            6,794     57,661          892       2,433
  Urban               5,949     51,961          617       2,311
  Rural               851       14,545          494       2,380
State groupings
  Group 1             938       7,586           57        105
  Group 2             1,521     13,952          100       244
  Group 3             1,845     16,810          188       363
  Group 4             777       5,716           97        211
  Group 5             1,221     14,358          283       757
  Group 6             492       8,084           167       753

Notes: Figures give the number of AGEBs and municipalities in the sample and population of Mexico, broken out by urban and rural areas in the top panel. In the bottom panel, the six state groups match those used to develop the official municipal poverty estimates; more information is in Table C1 in appendix C. Population figures are taken from the 2010 census.

Table A2 - Descriptive statistics for geospatial auxiliary data

                                       Household            AGEB mean   Municipal mean
Variable                               Mean       SD        SD          SD
Sample
  Per capita income                    3,513.9    8,100.3   4,250.0     1,680.3
  Normalized per capita income         0.0        1.0       0.6         0.4
  Poverty status                       0.195      0.396     0.205       0.155
  Predicted percent extremely poor     0.185      0.082     0.082       0.074
  Predicted percent moderately poor    0.319      0.043     0.043       0.034
  Predicted percent not poor           0.495      0.112     0.112       0.098
  Percentage of pixels classified as:
    Building                           0.110      0.103     0.103       0.079
    Forest                             0.068      0.134     0.134       0.116
    Grassland                          0.129      0.109     0.109       0.093
    Road                               0.126      0.086     0.086       0.069
    Rural                              0.244      0.430     0.429       0.334
  Number                               57,661               6,794       892
Population
  Predicted percent extremely poor     0.181      0.082     0.082       0.071
  Predicted percent moderately poor    0.319      0.043     0.043       0.030
  Predicted percent not poor           0.497      0.111     0.111       0.095
  Percentage of pixels classified as:
    Building                           0.124      0.113     0.113       0.088
    Forest                             0.063      0.130     0.130       0.106
    Grassland                          0.128      0.110     0.110       0.088
    Road                               0.135      0.090     0.090       0.072
    Rural                              0.221      0.415     0.415       0.266
  Number                               27,966,071           66,496      2,433

Notes: Mean and standard deviation reflect the population-weighted mean and standard deviation across households. The third and fourth columns report standard deviations of AGEB and municipal means, across AGEBs and municipalities respectively. Means of AGEB and municipal means are identical to the means of the household variables reported in the second column and are therefore not shown separately. Population statistics are weighted by estimated average household size per AGEB, taken from the 2010 census. Sample statistics are weighted using sample weights.

Table A3 - Household, sub-area, and area-level models, post-LASSO OLS coefficients

Dependent variables: normalized household welfare (household model), AGEB poverty rate (sub-area model), and municipal poverty rate (area-level model). Coefficients for each variable are listed in that order for the models in which the variable was selected.

Auxiliary variables - AGEB average
  CNN predicted percent extremely poor:  -0.34, 0.06
  CNN predicted percent not poor:        0.79***, -0.26**
  Percent building:                      0.66***, -0.06
  Percent forest:                        -0.25***

Auxiliary variables - Municipal average
  CNN predicted extreme poverty rate:    -0.97***, 0.24, 1.41***
  CNN predicted moderate poverty rate:   -1.21***
  Percent building:                      0.03
  Percent road:                          0.02, -0.37*
  Percent forest:                        0.15**
  Percent grass:                         -0.55***, 0.20**
  Percent background:                    -0.19**
  Percent rural:                         -0.24***, 0.08*

State dummy variables
  Baja California Sur:                   0.18***, -0.06***
  Chihuahua:                             -0.39***, 0.18***, 0.16***
  Mexico City:                           -0.12***
  Coahuila:                              0.15***, -0.03*
  Colima:                                -0.09***
  Durango:                               -0.18***, 0.09***
  Guerrero:                              0.10***, -0.04***, -0.08**
  Jalisco:                               -0.20***
  México:                                0.17***
  Nayarit:                               0.12***
  Nuevo León:                            -0.21***, 0.12***, 0.10***
  Oaxaca:                                -0.23***, 0.07*
  Puebla:                                0.17***, -0.06***
  Querétaro:                             0.09***
  Sonora:                                0.20***, -0.07***
  Tabasco:                               -0.02*

Number of observations:                  57,660, 6,792, 893

Notes: Results are post-LASSO OLS regression results. Stars indicate statistical significance at the 10, 5, and 1 percent levels. In the left panel, the dependent variable is normalized household per capita consumption, as described in section 4.
In the middle and right panels, the dependent variable is the estimated AGEB and municipal poverty rate from the 2014 MCS-ENIGH survey, respectively.

Appendix B: Construction of CNN-Based Land Type Estimation and Poverty Predictions

One goal of this project was to predict poverty rates directly from imagery. Previous work in Sri Lanka predicted poverty rates using satellite-derived features such as the number of cars and building density (Engstrom et al., 2022). Similarly, Jean et al. (2016) trained a CNN to predict the density of nighttime lights in several African countries, then used features produced by that network to predict household income using a linear regression. This project, originally written up in Babenko et al. (2017), was an early attempt to train a CNN end-to-end to directly predict poverty rates. End-to-end training refers to methods where the entire training procedure is performed directly by the CNN's optimizer. Alternatives include applying post-processing to the results of the CNN or using results from intermediate layers of the CNN as features in another model.

We used the administrative unit codes present in both the survey data and the shapefiles of the administrative units to match poverty rates to imagery. The images that we received from the satellite providers can cover hundreds of square kilometers and be irregularly shaped. Traditional computer vision algorithms require rectangular images that can fit onto computational hardware. We therefore split the world into rectangular tiles of a size that can be fed to the computer vision algorithms. These tiles are designed to have a similar number of pixels per tile; as such, their geographic extent varies with pixel size. A single image captured by a satellite may encompass thousands of these tiles. Figure B1 below shows a Planet tile in blue, a Digital Globe tile in green, and a sample urban AGEB in purple for comparison.
Figure B1 - Digital Globe and Planet Tile Sizes, Michoacán

For every tile that intersected with a surveyed administrative unit, we assigned that administrative unit's poverty distribution to the tile as the ground truth. In cases where more than one administrative unit intersected a tile, we assigned an average of those distributions, weighted by geographic overlap and the total sampling weights of the administrative units. Tiles had to contain a minimum amount of spatial intersection with the surveyed administrative units to be included in training or evaluation. We allowed this threshold to vary by imagery provider to ensure sufficient tiles for training.

B.1 CNN Training

The CNNs were then trained using largely standard deep learning techniques. For each image, a loss quantifying the difference between the predicted distribution and the true distribution was calculated. We used stochastic gradient descent, a popular optimization algorithm for neural networks, to minimize the loss over the training data set. This effectively means iteratively adjusting the weights (coefficients) of the model to reduce the loss on small subsets of the training data. The computer vision algorithm was trained at the tile level. Upon the completion of training, we aggregated the tile-level predictions to the administrative unit level. As administrative units commonly intersected more than one tile, we took an average of those tiles' predictions as the predicted distribution for the administrative unit, weighting each tile's prediction by how much of the administrative unit spatially intersects with the tile. In Figure B2, below, tile 5 would receive a high weight in the averaging process, while tiles 7 and 9 would have little effect on the AGEB's final predicted distribution.
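The overlap-weighted aggregation of tile predictions described above can be sketched as follows; the function name and toy distributions are illustrative, not the project's code:

```python
def aggregate_tiles_to_unit(tile_predictions, overlap_fractions):
    """Average tile-level predicted distributions, weighting each tile by the
    fraction of the administrative unit's area it covers."""
    total = sum(overlap_fractions)
    n_classes = len(tile_predictions[0])
    unit_pred = [0.0] * n_classes
    for dist, w in zip(tile_predictions, overlap_fractions):
        for k in range(n_classes):
            unit_pred[k] += w * dist[k] / total
    return unit_pred

# Toy AGEB intersecting three tiles; each tile predicts a distribution over
# (extreme poor, moderate poor, not poor). Weights echo Figure B2, where one
# tile dominates the unit's area.
pred = aggregate_tiles_to_unit(
    [[0.2, 0.3, 0.5], [0.1, 0.3, 0.6], [0.4, 0.2, 0.4]],
    [0.7, 0.2, 0.1],
)
```

Because the weights are normalized and each tile prediction sums to one, the aggregated prediction is itself a valid probability distribution.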
Figure B2 - Example Tile Aggregation to AGEB

B.2 Land Use Prediction

In addition to the end-to-end poverty prediction described above, we provided land-use predictions for all areas in Mexico that had cloud-free imagery. We trained a land-use classifier that, for each pixel in an image, produces the likelihood that it belongs to one of several classes. This classifier was trained with Planet imagery from North America and Europe. For each pixel, the classifier assigns a probability distribution across six classes: building, road, water, grassland, forest, and background (all other classes). An example visualization of those results can be seen below in Figure B3. The first panel shows the RGB bands of a Planet image (the classifier also uses the near-infrared band when making classifications). The second panel shows the assigned likelihood that each pixel is a road. The third panel shows the same analysis for buildings.

Figure B3 - Example Pixel-level Land-use Results, Mexico DF (Satellite image (c) 2017, Planet)

For each 3-5 m² pixel in Mexico, the classifier produces six scores that sum to one. As such, the total volume of data in this process is substantial, and large-scale analysis at this level is computationally expensive. We took the following approach to make large-scale analysis more feasible for policy and visualization purposes. We split each Planet tile into a 5x5 grid. For each grid cell, we took the distribution of pixels assigned to the six classes. Each grid cell thus contains six values, summing to one, that represent the percentage of the spatial area of that cell predicted to belong to each class. Each grid cell was approximately 750 m² in size. An example of these aggregated results is shown below in Figure B4, overlaid on high-resolution imagery. The colors of the grid cells correspond to the proportion of the pixels in the cell classified as likely to be a road.
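The 5x5 grid aggregation can be sketched as follows, assuming a pixel-level map of hard class assignments; in practice the classifier outputs per-pixel probabilities, so this is a simplified illustration with hypothetical inputs:

```python
def grid_cell_class_fractions(class_map, n_classes, grid=5):
    """Split a pixel-level class map into a grid x grid set of cells and
    return, for each cell, the fraction of its pixels in each class."""
    rows, cols = len(class_map), len(class_map[0])
    cell_h, cell_w = rows // grid, cols // grid
    fractions = []
    for gi in range(grid):
        for gj in range(grid):
            counts = [0] * n_classes
            for i in range(gi * cell_h, (gi + 1) * cell_h):
                for j in range(gj * cell_w, (gj + 1) * cell_w):
                    counts[class_map[i][j]] += 1
            n_pixels = cell_h * cell_w
            fractions.append([c / n_pixels for c in counts])
    return fractions

# Toy 10x10 map with two classes (0 = background, 1 = road): the two leftmost
# pixel columns are road, so the first cell column is entirely road.
toy = [[1 if j < 2 else 0 for j in range(10)] for i in range(10)]
cells = grid_cell_class_fractions(toy, n_classes=2)
```

Each cell's fractions sum to one by construction, matching the six-value-per-cell representation described above.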
Additionally, these results can be used to visualize the land use of Mexico and of smaller areas of interest. This visualization is simplest and most effective for each class individually. We can make a heatmap that, for the class of interest, shows the percentage of each grid cell made up by that class. Areas with a high proportion of that class show up as bright, and areas where the class is absent show up as dark. An example is shown below for the roads class, covering all areas of Mexico that had a cloud-free image in 2016. The second panel shows a zoomed-in area around Mexico City along with a base map of the area.

Figure B4 - Example of aggregated results, Mexico DF (Satellite image (c) 2017, Google)

Figure B5 - Fractional Road Coverage, Mexico

Figure B6 - Fractional Road Coverage, Close-Up of Mexico City Area and Base map

Appendix C: Implementation of the household model

As noted in the text, the unit-level model estimates presented in the paper are estimated using a modified version of the EBP function contained in the R EMDI package. There are two main modifications to the standard version of the package available on CRAN. The first incorporates census weights when aggregating simulation results across households to the municipality level. The weights used in this application are the estimated average number of household members in each AGEB, obtained by dividing the total population by the number of households in the 2010 census. In rural areas, the census data only contain total population and number of households by locality rather than by AGEB. The locality-level population and number of households were therefore aggregated up to the AGEB level using publicly available AGEB-level shapefiles, under the assumption that no rural households lived outside of the localities provided in the census.
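The rural locality-to-AGEB aggregation, and the resulting average household size used as an aggregation weight, can be sketched as follows; the identifiers and counts are hypothetical:

```python
def rural_ageb_weights(localities, locality_to_ageb):
    """Aggregate locality-level population and household counts up to AGEBs,
    assuming every rural household lives in a listed locality, and return the
    average household size per AGEB (the census aggregation weight)."""
    totals = {}
    for loc_id, (pop, households) in localities.items():
        ageb = locality_to_ageb[loc_id]
        p, h = totals.get(ageb, (0, 0))
        totals[ageb] = (p + pop, h + households)
    return {ageb: p / h for ageb, (p, h) in totals.items()}

# Two hypothetical localities mapped to one AGEB via its shapefile
weights = rural_ageb_weights(
    {"loc1": (400, 100), "loc2": (350, 75)},
    {"loc1": "ageb_A", "loc2": "ageb_A"},
)
# ageb_A: (400 + 350) / (100 + 75) members per household
```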
The second main modification to the standard EMDI package allows for household weights in the model estimation process. This was implemented in the following way:

1. Population weights were created as the product of household weights (factor_hog) and the number of household members (tot_integ), and normalized to sum to the number of sample observations in each municipality.

2. These weights (denoted "smp_weight") were incorporated into the estimation of the mixed effects model by adding the following option to the lme function called by the point_estim function in the EMDI package:

weights = varComb(varFixed(as.formula(paste0("~1/", smp_weight))), varIdent(~1))

3. The calculation of the shrinkage factor \gamma_m for each municipality in the point_estim function was adjusted according to the following formula, following You and Rao (2002) and Van der Weide (2014):

\hat{\gamma}_m = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_e^2 \sum_{i=1}^{n_m} w_i^2 / \left( \sum_{i=1}^{n_m} w_i \right)^2}   (C.1)

where w_i is equal to smp_weight for household i in municipality m, and n_m is the number of sample households in municipality m. When w_i is constant, this is equal to the formula given in Molina and Rao (2010) and implemented by default in the EMDI package:

\hat{\gamma}_m = \frac{\hat{\sigma}_u^2}{\hat{\sigma}_u^2 + \hat{\sigma}_e^2 / n_m}   (C.2)

As described in the text, we created a synthetic census based on the estimated number of households in each AGEB. The synthetic census consists of one observation per household and contains the AGEB-level satellite indicators as well as the average AGEB household size, used as a weight when aggregating simulation results across AGEBs. The number of households in each AGEB was set equal to the 2010 AGEB population divided by the average household size for the associated state, as estimated in the MCS-ENIGH survey, rounded to the nearest integer. This was done separately for urban and rural areas.
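A minimal Python sketch of the weighted shrinkage factor in equation (C.1); the function name and variance values are illustrative, and the package itself implements this in R. With constant weights it reproduces the unweighted formula (C.2):

```python
def shrinkage_factor(sigma_u2, sigma_e2, weights):
    """Weighted shrinkage factor gamma_m following You and Rao (2002):
    gamma = sigma_u^2 / (sigma_u^2 + sigma_e^2 * sum(w_i^2) / (sum(w_i))^2)."""
    sw = sum(weights)
    sw2 = sum(w * w for w in weights)
    return sigma_u2 / (sigma_u2 + sigma_e2 * sw2 / sw ** 2)

# With constant weights, sum(w^2)/(sum(w))^2 = 1/n_m, so (C.1) equals (C.2)
n_m = 5
gamma_weighted = shrinkage_factor(0.4, 1.0, [2.0] * n_m)
gamma_unweighted = 0.4 / (0.4 + 1.0 / n_m)
```

Unequal weights increase sum(w²)/(sum w)², which lowers the shrinkage factor and pulls the area effect further toward zero, reflecting the reduced effective sample size.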
Because of the large size of the synthetic census, it was much more computationally efficient to divide it into smaller groups when performing the estimation. Unfortunately, unlike the R sae package, the EMDI package does not allow for estimation on a subset of domains. As a workaround, we divided households into six groups based on their state code, as listed in Table C1 below. The six groups are the same as those used by CONEVAL when generating the official municipal poverty estimates. For each group, the population was constructed by taking all households from the group's states in the synthetic census and appending a single randomly selected household from each municipality outside those states. This was done because the EBP function of the EMDI package requires the population data to contain at least one household from every area contained in the sample. This procedure therefore produced six sets of estimates, one for each group. These were then appended together to create the final estimates, after discarding all municipal estimates based on a single household outside each group. Each set of estimates used the same national model based on the full MCS-ENIGH sample. As a robustness check, we also selected and estimated six separate models, one for each group, as shown in Table 6. Estimating a separate model for each group yielded qualitatively similar results, although the mean squared error is slightly higher when estimating six separate models.

The baseline models "inflated" welfare in rural areas by multiplying it by the ratio of the urban to rural poverty lines. This was necessary because the EBP function in the EMDI package only allows for a single poverty line threshold; after inflating rural welfare, the urban poverty line could be used for both urban and rural areas. As a robustness check, however, we divided the urban and rural population for each group into separate subpopulations and estimated poverty rates separately for urban and rural areas.
The results are shown in Table 6. The estimated poverty rate for each municipality was calculated as the population-weighted average of the urban and rural poverty rates. However, this procedure introduces an additional complication when estimating the uncertainty of the municipal poverty estimate: urban and rural poverty rates within a municipality are positively correlated, so assuming that they are independent would substantially underestimate uncertainty. We therefore estimated the covariance of the urban and rural estimates using the sample of municipalities that contain both urban and rural areas, by calculating the estimated covariance of the urban and rural headcount poverty rates across municipalities. To estimate the mean squared error of the municipal poverty rates, we took the squared-weight average of the estimated MSEs and added twice the weighted estimated covariance. For each municipality, the poverty estimate and its MSE are therefore constructed as:

\hat{\theta}_m = w_u \hat{\theta}_u + (1 - w_u) \hat{\theta}_r   (C.3)

\widehat{MSE}_m = w_u^2 \widehat{MSE}_u + (1 - w_u)^2 \widehat{MSE}_r + 2 w_u (1 - w_u) \widehat{Cov}(\hat{\theta}_u, \hat{\theta}_r)   (C.4)

where w_u is the share of the population in urban areas, and \hat{\theta}_u and \hat{\theta}_r are the estimated poverty rates in urban and rural areas, respectively. As seen in Table 6, this procedure gives similar point estimates but significantly higher estimates of uncertainty, and higher coverage rates, than the baseline procedure. This is partly because the baseline procedure underestimates uncertainty by failing to fully account for correlation across households in the same AGEBs. But it is also because this procedure applies the same covariance, estimated across all municipalities, when combining the estimated urban and rural poverty rates. Furthermore, this procedure assumes that bias is zero and that the MSE is therefore a variance estimate.
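Equations (C.3) and (C.4) can be implemented directly; the function name and toy inputs below are illustrative:

```python
def combine_urban_rural(theta_u, theta_r, mse_u, mse_r, cov_ur, w_u):
    """Population-weighted municipal poverty rate and its MSE, accounting for
    the covariance between the urban and rural estimates (C.3 and C.4)."""
    theta = w_u * theta_u + (1 - w_u) * theta_r
    mse = (w_u ** 2 * mse_u + (1 - w_u) ** 2 * mse_r
           + 2 * w_u * (1 - w_u) * cov_ur)
    return theta, mse

# Toy municipality that is 60 percent urban (all values hypothetical)
theta, mse = combine_urban_rural(0.25, 0.45, 0.004, 0.009, 0.002, w_u=0.6)
```

Setting cov_ur to zero in this sketch shows how much the independence assumption would understate the MSE when the covariance is positive.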
These additional complications and assumptions led us to prefer the "inflated" welfare models, based on spatially deflated welfare applied to a single poverty line, as the baseline estimates.

Table C1 - State groupings

Group 1: Coahuila, Distrito Federal, Nuevo León
Group 2: Baja California, Baja California Sur, Colima, Jalisco, Quintana Roo, Sinaloa, Sonora
Group 3: Aguascalientes, Chihuahua, Guanajuato, México, Morelos, Nayarit, Querétaro, Tamaulipas
Group 4: Campeche, Durango, San Luis Potosí, Yucatán
Group 5: Hidalgo, Michoacán, Puebla, Tabasco, Tlaxcala, Veracruz, Zacatecas
Group 6: Chiapas, Guerrero, Oaxaca

Appendix D: Implementation of the sub-area model

The sub-area model is implemented as a standard nested error regression model, where the poverty rate estimated at the sub-area level is modeled as a function of geospatial indicators specified at the sub-area level. The model is specified as follows:

\hat{p}_{ij} = X_{ij} \beta + u_i + e_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, J_i   (D.1)

where \hat{p}_{ij} denotes the estimated poverty rate of sub-area j within area i, derived from the sample; X_{ij} is a vector of auxiliary variables specified at either the sub-area or area level; \beta is a vector of regression coefficients; u_i \sim N(0, \sigma_u^2) is an area random effect; and e_{ij} \sim N(0, k_{ij} \sigma_e^2), with k_{ij} assumed known, is a sub-area residual error term. In this paper, we choose k_{ij} = n_{ij}^{-1}, where n_{ij} is the number of sample households in AGEB j. Note that the sub-area model (D.1) is different from the one proposed by Torabi and Rao (2014), which incorporates sampling variability of the estimated proportions but is more complex to implement than (D.1). Auxiliary variables are selected using the rigorous LASSO procedure from the full set of candidate variables.
The candidate variables include predicted extreme and moderate poverty and land classification estimates at both the AGEB and municipal levels, and state dummy variables.25 The variables selected were extreme poverty and the building share at the AGEB level; the building, forest, grass, and rural shares at the municipal level; and nine state dummies. A version of the EMDI package in R, modified to account for heteroscedasticity, was used to estimate the model.

25 More details on the predicted poverty estimates and land classification variables can be found in Appendix A. The state dummies selected in the model were Colima, Chiapas, Federal District of Mexico City, Guerrero, Jalisco, Nuevo León, Oaxaca, Quintana Roo, and Tamaulipas.

Appendix E: Design-based simulations

In this appendix, we report the results of a design-based simulation using the intercensus survey. Because AGEB-level identifiers are not publicly available in the intercensus data, we cannot simulate either a household model with sub-area auxiliary data or a sub-area model. We therefore simulate a household-level model with municipal-level auxiliary data and compare its performance against a standard Fay-Herriot area-level model. The results are useful for two reasons. First, they verify whether the household-level model, even when using area-level covariates, generates predictions that are as accurate as an area-level model in repeated simulations. Second, they give an indication of whether the area-level results reported in Table 2, Table 3, and Table 7 hold in repeated simulations.

The welfare variable used is household per capita labor income, which differs from the total household income measure used in the previous sections. We use the value of per capita labor income at the 28.2nd percentile, when weighting by household weights, as a poverty line to match the 2014 monetary poverty rate. Households with per capita income below this threshold are classified as poor, while those above it are not.
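The weighted-percentile poverty line can be sketched as follows; the function and toy incomes are illustrative, and a real application would use the survey's household weights:

```python
def weighted_percentile(values, weights, pct):
    """Smallest value at or below which at least pct percent of the total
    weight falls; used here to set a relative poverty line."""
    pairs = sorted(zip(values, weights))
    threshold = pct / 100.0 * sum(weights)
    cum = 0.0
    for v, w in pairs:
        cum += w
        if cum >= threshold:
            return v
    return pairs[-1][0]

# Toy incomes with equal household weights; line at the 28.2nd percentile
incomes = [100, 200, 300, 400, 500]
hh_weights = [1, 1, 1, 1, 1]
line = weighted_percentile(incomes, hh_weights, 28.2)
poor = [y < line for y in incomes]  # households strictly below the line
```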
These classifications can then be used to compute the "true" poverty rates for each municipality in the intercensus, weighting by the product of the household weights and the number of household members for each household. For each simulation, we draw a three-stage sample of households from the 5.8 million households in the intercensus as follows:

1. Draw 896 municipalities sequentially, without replacement, with probability proportional to size. The measure of size is municipal population taken from the 2010 census.

2. Draw up to nine PSUs (UPM) using simple random sampling. Sample all available PSUs if fewer than nine in the municipality are included in the 2015 intercensus. We select nine to match the average number of PSUs per municipality in the MCS-ENIGH 2014 survey.

3. Draw up to seven households in each PSU using simple random sampling. Sample all available households if fewer than seven in the PSU are included in the 2015 intercensus. We select seven to match the average number of households per PSU in the MCS-ENIGH 2014 survey.

4. Calculate weights as the reciprocal of the approximate probability that a household was selected. We approximate the probability that household h was selected as:

p_h = \left[ 1 - \left( 1 - \frac{Pop_m}{\sum_{m'} Pop_{m'}} \right)^{896} \right] \times \frac{\min(9, N_{psu})}{N_{psu}} \times \frac{\min(7, N_{hh})}{N_{hh}}

where Pop_m is the population of the municipality in which household h resides, N_{psu} is the number of PSUs in household h's municipality, and N_{hh} is the number of households in household h's PSU, according to the 2010 census. The first term approximates the probability that a municipality is selected.26 The second and third terms are the probabilities that the PSU and the household are selected, respectively. To estimate the household model, these weights were normalized by dividing by their mean so that they sum to the total number of observations, as is common when estimating mixed models.

5. Normalize household per capita labor income using the ordered quantile normalization.

6.
Estimate the poverty line as the 28.2nd percentile of the transformed household per capita labor income in the sample.

7. Estimate municipal poverty rates using a household unit-context model, using the sample data and the intercensus population as the population data. The sample weight is the product of the original household weight from the intercensus and the additional sampling weight calculated in step 4. The population aggregation weight is the household weight times the number of household members provided in the 2015 intercensus data.

8. Calculate direct estimates using the sample, with the same weights as in the previous step, and variances calculated using the Horvitz-Thompson approximation.

9. Merge the direct estimates with a version of the intercensus collapsed to the municipality level.

10. Estimate a Fay-Herriot model to generate predicted poverty rates using Stata's fayherriot command, utilizing the Li and Lahiri (2010) adjusted maximum likelihood method.

11. Repeat steps 1 through 10 ninety-nine times, storing the predictions each time, for a total of 100 simulations.

26 This formula assumes sampling with replacement, which makes it an approximation.

The final step is to merge in the actual municipal "poverty rates" from the intercensus data (based on low per capita labor income) and generate evaluation statistics.

Table E1 - Mean poverty and accuracy in simulation study

                                      Average over 100 simulations
                                      Mean poverty   RMSD     Correlation
Sampled municipalities
  Direct survey estimates (H-T)       0.368          0.294    0.926
  Household model                     0.358          0.272    0.941
  Area-level model                    0.336          0.274    0.937
  Intercensus benchmark               0.379          0.000    1.000
Non-sampled municipalities
  Household model                     0.456          0.380    0.803
  Area-level model                    0.434          0.405    0.749
  Intercensus benchmark               0.522          0.000    1.000

Notes: Table shows results from one hundred simulations based on repeated samples from the intercensus survey, as described in section 7.
Mean poverty refers to the simple average of poverty rates prior to benchmarking, RMSD refers to the root mean squared deviation, and Correlation refers to the correlation, both relative to the intercensus benchmark.

The evaluation results are displayed in Table E1. For sampled municipalities, the household model, direct estimates, and Fay-Herriot estimates all perform very well: the correlations with the actual full intercensus are 0.941 for the household model, 0.937 for the area-level model, and 0.926 for the direct estimates. These results are substantially better than the correlations for the two MCS-ENIGH samples shown in Table 2, Table 3, and Table 7, particularly for the direct estimates. This is partly because the sample is drawn directly from the intercensus survey, the same data source used for evaluation. Out of sample, however, the area-level model performs worse than the household-level model, with a correlation of 0.75 compared to 0.80 for the household-level model. The household-level model also produces somewhat less biased estimates of poverty rates, for both in-sample and out-of-sample municipalities. In sample, the average poverty rate for the household-level model, prior to benchmarking, is 35.8%, which is closer to the true value of 37.9% than that given by the area-level model (33.6%). The out-of-sample estimates are more biased than the in-sample estimates for both sets of estimates, but once again the household-level model estimates are less biased than the area-level estimates. Finally, Figure E1 shows box plots of relative bias from the household and area-level simulations. These boxplots confirm that the relative bias is very similar for the household and area-level models for in-sample municipalities, each of which noticeably improves on the direct estimates.
For out-of-sample municipalities, the household-level model exhibits markedly less relative bias than the area-level results, perhaps because it better incorporates sample weights when estimating model parameters.

[Figure E1: box plots of relative bias for in-sample municipalities (2016 survey) and out-of-sample municipalities (2016 survey), comparing the household-level, sub-area, and area-level models and the direct survey estimates.]

Note: See notes to Figure 3. The household-level, sub-area, and area-level model estimates are produced by combining the 2016 MCS-ENIGH with satellite indicators. The direct estimates are generated using the 2016 survey alone.

Figure E1 - Box plots of relative bias, simulation results

Appendix F: Potential Omitted Variable Bias under Area-Level Models

Assume that data are generated by the following household-level data generating process:

y_{ah} = \beta_0 + \beta_1 x^1_{ah} + \cdots + \beta_P x^P_{ah} + u_a + e_{ah}; \quad h = 1, \ldots, N_a; \; a = 1, \ldots, A \quad (F1)

Let \bar{y}^s_a = \frac{1}{n_a} \sum_{h \in s_a} y_{ah} and \bar{x}^s_a = \frac{1}{n_a} \sum_{h \in s_a} x_{ah} be the sample means of y_{ah} and x_{ah} for area a, where s_a is the set of survey sample households and n_a is the number of sample households in area a. Let \bar{x}_a = \frac{1}{N_a} \sum_{h=1}^{N_a} x_{ah} and \bar{y}_a = \frac{1}{N_a} \sum_{h=1}^{N_a} y_{ah} equal the population means of x and y for area a, where N_a is the number of population households in area a. u_a and e_{ah} are area effects and household error terms, assumed to be normally distributed. Taking the average of (F1) across sample households in each area yields the data generating process for sample means:

\bar{y}^s_a = \beta_0 + \beta_1 \bar{x}^{s,1}_a + \cdots + \beta_P \bar{x}^{s,P}_a + u_a + \bar{e}^s_a \quad (F2)

where u_a \sim N(0, \sigma^2_u) and \bar{e}^s_a \sim N(0, \sigma^2_e / n_a). \sigma^2_u and \sigma^2_e are assumed to be fixed, and u_a and \bar{e}^s_a are assumed to be independent of all predictor variables x. An area-level model fits:

\bar{y}^s_a = \beta_0 + \beta_1 \bar{x}^1_a + \cdots + \beta_P \bar{x}^P_a + u_a + \bar{e}^s_a \quad (F3)

where, as noted above, \bar{y}^s_a is the direct estimate for area a from the sample and \bar{x}_a is the population average, taken for example from administrative or geospatial data. This omits the variables \tilde{x}^p_a = \bar{x}^{s,p}_a - \bar{x}^p_a from the model.
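This conditional bias can be illustrated numerically. The numpy sketch below is a toy example with made-up dimensions and parameter values: it simulates the household-level process once to obtain fixed sample and population means, then holds the discrepancy \tilde{x} fixed while repeatedly drawing the area effects and household errors, and compares the Monte Carlo bias of the area-level coefficients to the analytic expression (F10). Equal sample sizes are assumed so that V is proportional to the identity and weighted least squares reduces to ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: A areas, N population households per area, n sampled per area.
A, N, n = 200, 100, 5
beta = np.array([1.0, 2.0])                 # (beta_0, beta_1)
sigma_u, sigma_e = 0.5, 1.0

# Household-level predictor, held fixed: the model-based bias conditions on the
# realized discrepancy x-tilde between sample and population means.
x = rng.normal(size=(A, N))
xbar_s = x[:, :n].mean(axis=1)              # sample means (first n households sampled)
xbar = x.mean(axis=1)                       # population means

X = np.column_stack([np.ones(A), xbar])     # area-level design with population means
Xt = np.column_stack([np.zeros(A), xbar_s - xbar])  # omitted X-tilde from (F5)

# Monte Carlo over area effects and averaged household errors only
reps = 2000
est = np.zeros(2)
for _ in range(reps):
    u = rng.normal(0.0, sigma_u, size=A)
    ebar = rng.normal(0.0, sigma_e / np.sqrt(n), size=A)  # mean of n household errors
    ybar_s = X @ beta + Xt @ beta + u + ebar              # sample means, as in (F5)
    est += np.linalg.lstsq(X, ybar_s, rcond=None)[0]      # fit the area-level model (F3)
bias_mc = est / reps - beta

# Analytic bias (F10); with equal n_a, GLS = OLS, so it reduces to (X'X)^-1 X' Xt beta
bias_analytic = np.linalg.solve(X.T @ X, X.T @ (Xt @ beta))

print("Monte Carlo bias:", np.round(bias_mc, 3))
print("Analytic bias:   ", np.round(bias_analytic, 3))
```

The two bias vectors agree up to simulation noise; averaging additionally over fresh draws of x would drive the bias to zero, which is the design-model sense in which both model types are unbiased.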
Model (F2), the assumed true area-level model, can be written in matrix notation as follows:

\bar{y}^s = X^s \beta + u + \bar{e}^s, \quad (F4)

where

\bar{y}^s = \begin{bmatrix} \bar{y}^s_1 \\ \vdots \\ \bar{y}^s_A \end{bmatrix}, \quad X^s = \begin{bmatrix} 1 & \bar{x}^{s,1}_1 & \cdots & \bar{x}^{s,P}_1 \\ \vdots & \vdots & & \vdots \\ 1 & \bar{x}^{s,1}_A & \cdots & \bar{x}^{s,P}_A \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_P \end{bmatrix}, \quad u = \begin{bmatrix} u_1 \\ \vdots \\ u_A \end{bmatrix}, \text{ and } \bar{e}^s = \begin{bmatrix} \bar{e}^s_1 \\ \vdots \\ \bar{e}^s_A \end{bmatrix}.

Adding and subtracting X\beta from the right-hand side of (F4) yields the equivalent expression:

\bar{y}^s = X\beta + \tilde{X}\beta + u + \bar{e}^s, \quad (F5)

where

\tilde{X} = X^s - X = \begin{bmatrix} 0 & \bar{x}^{s,1}_1 - \bar{x}^1_1 & \cdots & \bar{x}^{s,P}_1 - \bar{x}^P_1 \\ \vdots & \vdots & & \vdots \\ 0 & \bar{x}^{s,1}_A - \bar{x}^1_A & \cdots & \bar{x}^{s,P}_A - \bar{x}^P_A \end{bmatrix}.

On the other hand, model (F3), the estimated model, can be rewritten as follows:

\bar{y}^s = X\beta + u + \bar{e}^s, \quad (F6)

where

X = \begin{bmatrix} 1 & \bar{x}^1_1 & \cdots & \bar{x}^P_1 \\ \vdots & \vdots & & \vdots \\ 1 & \bar{x}^1_A & \cdots & \bar{x}^P_A \end{bmatrix} \quad \text{and} \quad \beta = \begin{bmatrix} \beta_0 \\ \vdots \\ \beta_P \end{bmatrix}.

The Fay-Herriot model estimates the model coefficients \hat{\beta} using weighted least squares:

\hat{\beta} = (X' V^{-1} X)^{-1} X' V^{-1} \bar{y}^s, \quad (F7)

where V = \mathrm{diag}(V_1, \ldots, V_A) with V_a = \sigma^2_u + \sigma^2_e / n_a, and where \sigma^2_e is assumed to be constant for all areas while \sigma^2_e / n_a varies across areas. Substituting (F5) into (F7) yields

\hat{\beta} = \beta + (X' V^{-1} X)^{-1} X' V^{-1} \tilde{X} \beta + (X' V^{-1} X)^{-1} X' V^{-1} (u + \bar{e}^s), \quad (F8)

and taking expectations gives

E[\hat{\beta}] = \beta + (X' V^{-1} X)^{-1} X' V^{-1} \tilde{X} \beta, \quad (F9)

since E[u] = E[\bar{e}^s] = 0. This means that the bias is equal to

\mathrm{Bias}(\hat{\beta}) = E[\hat{\beta}] - \beta = (X' V^{-1} X)^{-1} X' V^{-1} \tilde{X} \beta. \quad (F10)

In matrix format, X' V^{-1} X can be rewritten as

X' V^{-1} X = \frac{1}{\sigma^2_u} \begin{bmatrix} \sum_a \gamma_a & \sum_a \gamma_a \bar{x}^1_a & \cdots & \sum_a \gamma_a \bar{x}^P_a \\ \sum_a \gamma_a \bar{x}^1_a & \sum_a \gamma_a (\bar{x}^1_a)^2 & \cdots & \sum_a \gamma_a \bar{x}^1_a \bar{x}^P_a \\ \vdots & \vdots & & \vdots \\ \sum_a \gamma_a \bar{x}^P_a & \sum_a \gamma_a \bar{x}^P_a \bar{x}^1_a & \cdots & \sum_a \gamma_a (\bar{x}^P_a)^2 \end{bmatrix}, \quad (F11)

where \gamma_a is the shrinkage factor for area a: \gamma_a = \frac{\sigma^2_u}{\sigma^2_u + \sigma^2_e / n_a}.

The second term of the bias, X' V^{-1} \tilde{X} \beta, can be expressed as a (P+1) \times 1 matrix:

X' V^{-1} \tilde{X} \beta = \frac{1}{\sigma^2_u} \begin{bmatrix} \sum_a \gamma_a \sum_p \tilde{x}^p_a \beta_p \\ \sum_a \gamma_a \bar{x}^1_a \sum_p \tilde{x}^p_a \beta_p \\ \vdots \\ \sum_a \gamma_a \bar{x}^P_a \sum_p \tilde{x}^p_a \beta_p \end{bmatrix}, \quad (F12)

meaning that the bias in total is:

\mathrm{Bias}(\hat{\beta}) = E[\hat{\beta} - \beta] = \begin{bmatrix} \sum_a \gamma_a & \cdots & \sum_a \gamma_a \bar{x}^P_a \\ \vdots & & \vdots \\ \sum_a \gamma_a \bar{x}^P_a & \cdots & \sum_a \gamma_a (\bar{x}^P_a)^2 \end{bmatrix}^{-1} \begin{bmatrix} \sum_a \gamma_a \sum_p \tilde{x}^p_a \beta_p \\ \vdots \\ \sum_a \gamma_a \bar{x}^P_a \sum_p \tilde{x}^p_a \beta_p \end{bmatrix}. \quad (F13)

The resulting area-level predictions \hat{y}_a are a weighted average of the direct estimate and the synthetic prediction:

\hat{y}_a = \gamma_a \bar{y}^s_a + (1 - \gamma_a) \bar{x}'_a \hat{\beta}. \quad (F14)

Because \sigma^2_e / n_a > 0 for all areas a, \gamma_a < 1, and any bias in the estimated coefficients \hat{\beta} will also be present in the small area estimate \hat{y}_a.
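The weighted least squares estimator (F7) and the shrinkage predictor (F14) can be sketched directly in numpy. This is an illustrative toy example with made-up values, not the paper's fayherriot implementation: the variance components are treated as known, whereas in practice they are estimated, for example by the Li and Lahiri (2010) adjusted maximum likelihood method used in the simulations.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical area-level data: A areas with varying sample sizes n_a.
A = 50
n_a = rng.integers(5, 30, size=A)
sigma_u2, sigma_e2 = 0.25, 1.0              # variance components, treated as known
beta = np.array([1.0, 2.0])

xbar = rng.normal(size=A)                   # population means of the predictor
X = np.column_stack([np.ones(A), xbar])
V_diag = sigma_u2 + sigma_e2 / n_a          # V_a = sigma_u^2 + sigma_e^2 / n_a
ybar_s = X @ beta + rng.normal(0.0, np.sqrt(V_diag))  # simulated direct estimates

# GLS coefficients (F7): beta_hat = (X' V^-1 X)^-1 X' V^-1 ybar_s
XtVinv = X.T / V_diag                       # X' V^-1, exploiting diagonal V
beta_hat = np.linalg.solve(XtVinv @ X, XtVinv @ ybar_s)

# Shrinkage factors and Fay-Herriot predictions (F14)
gamma = sigma_u2 / (sigma_u2 + sigma_e2 / n_a)
synthetic = X @ beta_hat
y_fh = gamma * ybar_s + (1.0 - gamma) * synthetic
```

Each prediction is a convex combination of the direct and synthetic estimates: areas with larger samples have larger gamma_a and lean on the direct estimate, while thinly sampled areas shrink toward the synthetic prediction and therefore inherit more of any bias in beta_hat.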
It is straightforward to show that the bias is equal to:

\mathrm{Bias}(\hat{y}_a) = E[\hat{y}_a - \bar{y}_a] = (1 - \gamma_a) \bar{x}'_a \mathrm{Bias}(\hat{\beta}), \quad (F15)

which is exactly equal to the bias identified in Corral et al. (2021, p. 39) for unit-context models. Thus, area-level models that use population predictors, when the data generating process is derived from a household-level model, are susceptible to the same omitted variable bias. This model-based bias results from the discrepancy between sample and population means of the predictors, coupled with the assumption of a household-level data generating process, rather than from the choice of a household versus area-level model. For both household and area-level models, the design-model bias, which considers bias prior to drawing the random sample, is zero.
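Expression (F15) implies that the coefficient bias flows through to each area's prediction in proportion to (1 - gamma_a): heavily shrunk areas, typically those with few sampled households, inherit more of it. A small sketch with made-up numbers for three illustrative areas:

```python
import numpy as np

# Hypothetical inputs: a coefficient bias vector as in (F10), shrinkage factors,
# and population-mean design rows [1, xbar_a] for three illustrative areas.
bias_beta = np.array([0.05, -0.12])
gamma = np.array([0.9, 0.5, 0.2])
xbar = np.array([[1.0, 0.3],
                 [1.0, -1.1],
                 [1.0, 2.4]])

# Prediction bias (F15): Bias(yhat_a) = (1 - gamma_a) * xbar_a' Bias(beta_hat)
bias_pred = (1.0 - gamma) * (xbar @ bias_beta)
print(bias_pred)  # the area with gamma = 0.9 inherits only 10% of its x'bias
```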