Policy Research Working Paper 9620

A Map of the Poor or a Poor Map?

Paul Corral, Kristen Himelein, Kevin McGee, and Isabel Molina

Poverty and Equity Global Practice
April 2021

Abstract

This paper evaluates the performance of different small area estimation methods using model- and design-based simulation experiments. Design-based simulation experiments are carried out using the Mexican Intra Censal survey as a census of roughly 3.9 million households from which 500 samples are drawn using a two-stage selection procedure similar to that of Living Standards Measurement Study surveys. Several unit-level methods are considered, as well as a method that combines unit- and area-level information, which has been proposed as an alternative when the available census data is outdated. The findings show the importance of selecting a proper model and data transformation so that the model assumptions hold. A proper data transformation can lead to a considerable improvement in mean squared errors. The results from design-based validation show that all small area estimation methods represent an improvement, in terms of mean squared errors, over direct estimates. However, methods that model unit-level welfare using only area-level information suffer from considerable bias. Because the magnitude and direction of the bias are unknown ex ante, methods that rely only on aggregated covariates should be used with caution, but they may be an alternative to traditional area-level models when these are not applicable.

This paper is a product of the Poverty and Equity Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at pcorralrodas@worldbank.org.

A Map of the Poor or a Poor Map?

Paul Corral,∗ Kristen Himelein,† Kevin McGee,‡ and Isabel Molina§¶

Key words: Small area estimation, ELL, Poverty mapping, Poverty map, Empirical best, Parametric bootstrap, nested error model, twofold nested error model
JEL classification: C55, C87, C15

∗ The World Bank Group - Human Development Group Chief Economist's Office (pcorralrodas@worldbank.org)
† The World Bank Group - Poverty and Equity Global Practice
‡ The World Bank Group - Development Economics Data Group
§ Universidad Carlos III de Madrid - Department of Statistics (imolina@est-econ.uc3m.es)
¶ The authors acknowledge financial support from the World Bank. We thank Sandra Carolina Segovia for research assistance. We thank J.N.K. Rao for comments and suggestions. We also thank Roy van der Weide, Alexandru Cojocaru, David Newhouse, and Samuel Freije-Rodriguez for comments.
Additionally, we thank Carolina Sanchez and Roberta Gatti for providing support and space to pursue this work. Finally, we thank the Global Solutions Group on Data for Policy, as well as all attendants of the Summer University courses on Small Area Estimation. Any error or omission is the authors' responsibility alone.

1 Introduction

The eradication of poverty was the first of the Millennium Development Goals (MDGs) established by the United Nations in 2000 and continues as Sustainable Development Goal (SDG) indicator 1.1.1, but governments can only properly target poverty if they know where it is. Traditionally, for a given country, the best source of information on the living standards of its population is household surveys. These surveys are a powerful tool for defining and addressing the needs of people. They do, however, usually offer reliable information only at highly aggregated levels of the population. In other words, direct survey estimates tend to be adequate for very large populations but inadequate for smaller ones.

Small area estimation (SAE) is a branch of statistics focused on improving the reliability of estimates and the associated measures of uncertainty for populations where samples cannot produce sufficiently reliable estimates (Rao and Molina, 2015). Small areas can be any population subgroup and are not necessarily tied to geographical areas. According to Ghosh and Rao (1994), the use of small area statistics increased towards the end of the 20th century due to improved computing power and the advent of theoretically sound statistical methods. The main principle behind small area statistical methods is to use modeling to "borrow strength" from auxiliary data sources (e.g., census or administrative data) to produce more efficient estimators than direct survey data alone. Model-based techniques for small area estimation commonly fall within two groups: i) area-level models, for examples see Fay and Herriot (1979) and Torabi and Rao (2014); ii) unit-level models, for examples see Molina and Rao (2010) and Elbers, Lanjouw and Lanjouw (2003).

The focus of this paper is to conduct model- and design-based evaluations of different unit-level SAE methods. Model-based methods assume a data generating process for the population that follows the model's assumptions, and the target parameters are assumed to be random (Tzavidis et al., 2018). In this paper, model-based evaluation is done considering two-stage sampling strategies and thus assumes a model that includes variation at the domain level as well as at the cluster level, where clusters are the primary sampling units, nested within domains (see Marhuenda et al., 2017). Design-based simulation entails applying the chosen methods under realistic conditions (Tzavidis et al., 2018). Here, a design-based evaluation of several estimators is performed using the 2015 Mexican Intra Censal Survey, which is used here as a census and from which we draw repeated samples. The estimation methods considered are those of Elbers, Lanjouw and Lanjouw (2003 - ELL henceforth), the empirical best predictor of Molina and Rao (2010 - MR henceforth),1 the twofold nested error extension of MR (2010) presented by Marhuenda et al. (2017), and finally an adaptation, presented by Nguyen (2012), that combines unit- and area-level information, and which has been proposed as an alternative when the available census data is too outdated to be considered for use.
Footnote 1: In reality, we apply the Census EB variant introduced by Guadarrama, Molina and Rao (2016).

The paper's contribution is twofold: i) it presents a thorough evaluation of the considered methods as well as their pros and cons, and ii) it employs the sampling strategies most often encountered in real world applications to conduct design-based validations. The results are expected to provide a guideline to others undertaking small area estimation. The results from design-based validation show that all small area estimation methods represent an improvement, in terms of MSE, over direct estimates. However, given the inherently high level of bias of methods relying solely on aggregate information (see Nguyen, 2012; Lange et al., 2018; Masaki et al., 2020), these should be used with caution and only under certain circumstances.

The paper first presents an overview of the small area estimation methods considered, directing readers who are interested in a more in-depth description to the appropriate sources. Then the model-based simulation design is detailed and results of the simulation experiments are presented. This section is followed by a description of the 2015 Mexican Intra Censal Survey, how it is adapted to represent a census, and how samples are drawn from the created census. Results from the design-based simulation experiments are then presented. Finally, conclusions and lessons learned are provided.

2 Unit-level models

The nested error model used for small area estimation by ELL (2003) and Molina and Rao (2010) was originally proposed by Battese, Harter and Fuller (1988) to produce county-level corn and soybean crop area estimates for the U.S. state of Iowa. For the estimation of poverty and welfare, the ELL and MR methods assume that the transformed welfare y_ch for each household h within each location c in the population is linearly related to a 1 × K vector of characteristics (or correlates) x_ch for that household, according to the nested error model:

$$y_{ch} = x_{ch}\beta + \eta_c + e_{ch}, \quad h = 1, \ldots, N_c, \; c = 1, \ldots, C, \qquad (1)$$

where η_c and e_ch are respectively location and household-specific idiosyncratic errors, assumed to be independent from each other, following

$$\eta_c \overset{iid}{\sim} N(0, \sigma^2_\eta), \qquad e_{ch} \overset{iid}{\sim} N(0, \sigma^2_e),$$

where the variances σ²_η and σ²_e are unknown. Here, C is the number of locations into which the population is divided and N_c is the number of households in location c, for c = 1, ..., C. Finally, β is the K × 1 vector of coefficients.

Under the original ELL methodology, the locations indexed with c are supposed to be the clusters, or primary sampling units (PSUs), of the sampling design and do not necessarily correspond to the level at which the estimates will ultimately be produced. In fact, clusters are typically nested within the areas of interest (e.g., census enumeration areas within large administrative areas). Presenting estimates at a higher aggregation level than the clusters (for which random effects are included in the model) may not be appropriate in cases of considerable between-area variability, and may underestimate the estimator's standard errors (Das and Chambers, 2017). The recommended approach to mitigate this issue is to include covariates that sufficiently explain the between-area heterogeneity in the model (ibid.). In this regard, ELL (2002) suggests the inclusion of cluster-level covariates as a way to explain location effects. Nevertheless, this approach is context specific and may not always suffice to ameliorate the issues with between-area heterogeneity.
In this regard, Marhuenda et al. (2017) recommend and show that location effects should be at the same aggregation level at which estimation is desired. If the location effect is specified at the same level where estimation is desired, then the difference between ELL (2003) and MR (2010) reduces to differences in how estimates are obtained and the addition of Empirical Best (EB) prediction by MR (2010). The EB method from MR (2010) conditions on the survey sample data and thus makes more efficient use of the information at hand, whereas ELL does not include this component. In essence, under ELL, for any given area present in the sample, the ELL estimator of the census area mean ȳ_c is obtained by averaging across M simulated censuses, each giving

$$\bar{y}_c^{*(m)} = \bar{X}_c \hat{\beta}^{*(m)} + \eta_c^{*(m)} + \bar{e}_c^{*(m)}, \quad m = 1, \ldots, M,$$

where E[η_c^*] = 0 and E[e_ch^*] = 0. Thus, the ELL estimator M⁻¹ Σ_{m=1}^M ȳ_c^{*(m)}, which approximates E(ȳ_c), reduces to the regression-synthetic estimator X̄_c β̂ (MR, 2010). On the other hand, under MR (2010), conditioning on the survey sample ensures the estimator includes the random location effect, since E[ȳ_c | η_c] ≈ X̄_c β + η_c. Mechanically, however, conditioning on survey data requires the linking of areas across surveys and census, something that is not always straightforward.2

Footnote 2: Under ELL, when including area-level covariates, linking the survey and the census areas is also required. Note that it is not necessarily the case that enumeration areas for a census and survey will match.

Other differences between ELL (2003) and MR (2010) are the computational algorithms used to obtain point and noise estimates; see Corral, Molina and Nguyen (2020 - CMN henceforth) for further discussion. ELL's (2003) approach to obtaining estimates builds upon the multiple imputation (MI) literature in that it uses a single algorithm that produces point and noise estimates by varying model parameters across simulations (see Tarozzi and Deaton, 2009, as well as CMN, 2020). The use of MI methods for obtaining point and noise estimates has shortcomings, however. Under multiple imputation, the method that yields the lowest MSE does not necessarily yield valid statistical inference (Van Buuren, 2018), which is contradictory to the goal of small area estimation of improving precision. Accordingly, CMN (2020) show using simulated populations that the noise estimate (referred to as variance by ELL) is not an appropriate estimate of the MSE. In contrast, MR's (2010) point and noise estimates are obtained through two separate procedures: point estimates are obtained by a Monte Carlo simulation, and noise estimates are obtained by a parametric bootstrap procedure originally proposed by González-Manteiga et al. (2008).3

Footnote 3: For more details, see Corral, Molina and Nguyen (2020).

In light of the emerging literature, including MR (2010), the World Bank expanded the original ELL design by adding EB predictors and revising its fitting methodology to incorporate heteroskedasticity and survey weights to account for complex sampling designs (Van der Weide, 2014). The survey weights are incorporated in the estimates of the regression coefficients and in the variances of the components following Huang and Hidiroglou (2003), based on Henderson method III (Henderson, 1953), as well as in the predicted area effects, as in the pseudo empirical best linear unbiased predictor (EBLUP) proposed by You and Rao (2002).
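To make the contrast between the synthetic ELL-type prediction and the conditional (EB-type) prediction concrete, the following minimal sketch simulates a population under model (1), fits the regression on a sample, and compares the two predictions of the census area mean. It is an illustration only: the parameter values are arbitrary assumptions, the variance components are estimated with a simple method-of-moments step rather than the fitting algorithms discussed above, and the code is not the implementation used in the paper or in any of the cited packages.

```python
import numpy as np

rng = np.random.default_rng(1)
C, N_c, n_c = 50, 200, 20            # locations, population and sample size per location
beta0, beta1 = 3.0, 0.5              # illustrative coefficients (assumed values)
sigma_eta, sigma_e = 0.15, 0.5       # illustrative standard deviations (assumed values)

# --- generate a census under the nested error model (1) ---
x = rng.normal(0, 1, size=(C, N_c))
eta = rng.normal(0, sigma_eta, size=C)            # location effects
e = rng.normal(0, sigma_e, size=(C, N_c))         # household errors
y = beta0 + beta1 * x + eta[:, None] + e          # transformed welfare

# --- draw a simple random sample of n_c households per location ---
idx = np.array([rng.choice(N_c, n_c, replace=False) for _ in range(C)])
xs = np.take_along_axis(x, idx, axis=1)
ys = np.take_along_axis(y, idx, axis=1)

# --- OLS fit of the fixed part on the pooled sample ---
X = np.column_stack([np.ones(xs.size), xs.ravel()])
bhat = np.linalg.lstsq(X, ys.ravel(), rcond=None)[0]

# --- crude variance-component estimates (method of moments) ---
res = ys - (bhat[0] + bhat[1] * xs)
s2_within = res.var(ddof=1, axis=1).mean()                 # approximates sigma_e^2
s2_between = max(res.mean(axis=1).var(ddof=1) - s2_within / n_c, 0.0)  # approx. sigma_eta^2

# --- two predictors of the census area mean of y ---
xbar_census = x.mean(axis=1)
synthetic = bhat[0] + bhat[1] * xbar_census                # ELL-type: synthetic part only
gamma = s2_between / (s2_between + s2_within / n_c)        # shrinkage factor
eta_hat = gamma * res.mean(axis=1)                         # predicted location effect
conditional = synthetic + eta_hat                          # EB-type: adds predicted effect

truth = y.mean(axis=1)
print("RMSE synthetic  :", np.sqrt(np.mean((synthetic - truth) ** 2)))
print("RMSE conditional:", np.sqrt(np.mean((conditional - truth) ** 2)))
```

In repeated runs the conditional prediction typically has the smaller RMSE, which is the intuition behind the efficiency gains of EB discussed above.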
In the absence of survey weights and heteroskedasticity, the Van der Weide (2014) fitting method yields parameter estimates that are quite similar to those of the restricted maximum likelihood (REML) fitting method used by MR (2010). The difference remains, however, in how point and noise estimates of the target indicators are obtained.

As further detailed in CMN (2020), however, even following these revisions, the bootstrap procedure as implemented by the PovMap software (Zhao, 2006) for EB, and later detailed in Nguyen et al. (2018), differed from the original EB procedure from MR (2010). CMN (2020) note the remaining issues rested in the continued reliance on MI methods as the basis of the methodology. Because, under the updated fitting methodology, Van der Weide (2014) does not offer an estimate of var(σ̂²_η), bootstrap samples of the data must be taken for each simulation to obtain the fitting parameters β̂*, σ̂²*_η, σ̂²*_e.4 This approach was taken to allow for an algorithm similar to the procedure used in the original implementation of ELL. However, if clusters are equal to areas, then it is unlikely that all areas are included in each bootstrap sample, and thus an area could benefit from EB only in a subset of the simulations, introducing bias into the resulting point and noise estimates. CMN (2020) also present evidence that, even if the location effect is modeled at the domain level and bootstrap samples of clusters are taken, some bias would still likely remain in the resulting estimates of the original implementation of EB in PovMap and the sae Stata package.

Footnote 4: ELL (2002) provide the derivation of the estimate for var(σ̂²_η) in their appendix.

The Stata package for small area estimation developed by Nguyen et al. (2018) has been updated to integrate the fitting methods from Van der Weide (2014) with the prediction and bootstrap methods from MR (2010). The new method is referred to as H3-CensusEB (Molina, 2019), and the previous method (i.e., the original bootstrap using EB from Van der Weide (2014)) is called H3-CBEB.5

Footnote 5: The method is called CensusEB because it does not link survey and census households. The old method is called the clustered bootstrap EB (CBEB). For the updated Stata package, see https://github.com/pcorralrodas/SAE-Stata-Package.

CMN (2020) perform a model-based validation of the different methods: i) CensusEB, ii) EB, iii) H3-CBEB and iv) ELL. CMN (2020) extend the simulations done by MR (2010) by i) including the area means of the covariates as additional variables in the model; ii) considering a model that has considerably higher explanatory power by adding more covariates; iii) considering larger population sizes and smaller sampling fractions; and iv) generating errors from a Student's t distribution with 5 degrees of freedom instead of a normal distribution. However, in all these simulations, population data are generated under model (1).

Currently, the software available for small area estimation of non-linear parameters, such as the sae Stata package by Nguyen et al. (2018) as well as the R package sae by Molina and Marhuenda (2015), only allows for estimation under the nested error model specified in (1). However, since household surveys often use two-stage sampling, it seems appropriate to consider a twofold nested error model. Marhuenda et al. (2017) extend the EB method from MR (2010) to a twofold nested error model, given by:6

Footnote 6: For simplicity, we omit the heteroskedasticity weights considered by Marhuenda et al. (2017).

$$y_{ach} = x_{ach}\beta + \eta_a + \eta_{ac} + e_{ach}, \quad h = 1, \ldots, N_{ac}, \; c = 1, \ldots, C_a, \; a = 1, \ldots, A, \qquad (2)$$
where η_a is the random effect for area a and η_ac is the random effect of cluster c within area a. These effects, along with the individual model errors e_ach, represent the unexplained variation of the transformed welfare y_ach across areas, clusters, and households (Marhuenda et al., 2017).7 All three components are assumed to be mutually independent, following:8

$$\eta_a \overset{iid}{\sim} N(0, \sigma^2_a), \qquad \eta_{ac} \overset{iid}{\sim} N(0, \sigma^2_{ac}), \qquad e_{ach} \overset{iid}{\sim} N(0, \sigma^2_e).$$

Footnote 7: Marhuenda et al. (2017) present these effects as domains and sub-domains.
Footnote 8: For the full derivation of EB predictors under the twofold model, see Marhuenda et al. (2017).

Using the assumed twofold nested error model, Marhuenda et al. (2017) study the effect of a misspecified model (i.e., assuming a cluster effect but presenting results at the area level) on the MSE estimator. The argument posited by Das and Chambers (2017) is that auxiliary variables should explain the between-area variation of the response variable. If this fails, there may be model misspecification, which can lead to an underestimation of the true MSE of the ELL estimator (Marhuenda et al., 2017). Through model-based simulation experiments under the assumed model (2), Marhuenda et al. (2017) reach three important conclusions under the model-based setup:

1. The relative values of σ²_ac and σ²_a are of key importance; the larger the value of σ²_a relative to σ²_ac, the more problematic it is to apply models where effects are specified at the cluster level, including the original ELL and EB with locations specified at the cluster level. Additionally, EB with location effects specified at the cluster level, while performing better than ELL,9 will also perform worse the larger the value of σ²_a is relative to σ²_ac.

2. Even if the true model contains random effects only at a single level, the assumption of a twofold model practically does not entail loss in efficiency.

3. EB estimates under model (1) with random effects specified at the area level will have similar performance to the twofold EB estimates. The recommendation is that, if the estimation is done using model (1) because of software availability or simplicity, then the model's random effects should be specified at the level at which results are desired.

Footnote 9: The performance is better than ELL because it includes the average of the cluster effects, as opposed to ELL, which only includes the linear fit.

As noted by Marhuenda et al. (2017), small area estimators based on unit-level models often achieve very large reductions in MSE compared to direct estimators. EB estimators based on unit-level models are also likely to achieve considerable gains in terms of MSE over Fay-Herriot (FH) area-level models.10 For example, Molina and Morales (2009) obtained mild gains over direct poverty and poverty gap estimates when using a FH model. Molina and Rao (2010), though, using the same data sources as Molina and Morales (2009) but applying unit-level models and EB prediction, obtain considerably larger gains. However, when the available census and survey are not from the same year, small area estimates based on unit-level models may be biased. In such scenarios, FH models offer an alternative because these do not require census microdata. Alternatively, twofold models such as that of Torabi and Rao (2014) could also be considered to achieve larger gains, as noted by Molina (2019).

Footnote 10: Fay-Herriot models were introduced by Fay and Herriot (1979) and are a popular area-level small area estimation model.
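The role of the relative sizes of σ²_a and σ²_ac in the conclusions above can be seen with a small simulation. The sketch below generates data under the twofold model (2) with an intercept only and recovers the three variance components with a balanced nested ANOVA (method-of-moments) decomposition. The two parameter settings mirror the scenarios used later in the model-based experiment, but the code itself is only an illustration and not the estimation machinery used in the paper.

```python
import numpy as np

def simulate_twofold(sigma_a, sigma_ac, sigma_e=0.5, A=40, Ca=10, Nac=50, seed=0):
    """Draw y_ach = mu + eta_a + eta_ac + e_ach under the twofold nested error model."""
    rng = np.random.default_rng(seed)
    eta_a = rng.normal(0, sigma_a, size=(A, 1, 1))
    eta_ac = rng.normal(0, sigma_ac, size=(A, Ca, 1))
    e = rng.normal(0, sigma_e, size=(A, Ca, Nac))
    return 3.0 + eta_a + eta_ac + e            # arbitrary intercept, no covariates

def nested_anova(y):
    """Method-of-moments variance components for balanced areas/clusters/households."""
    A, Ca, Nac = y.shape
    ybar_ac = y.mean(axis=2)                   # cluster means
    ybar_a = y.mean(axis=(1, 2))               # area means
    ybar = y.mean()
    msw = ((y - ybar_ac[:, :, None]) ** 2).sum() / (A * Ca * (Nac - 1))
    msc = Nac * ((ybar_ac - ybar_a[:, None]) ** 2).sum() / (A * (Ca - 1))
    msa = Nac * Ca * ((ybar_a - ybar) ** 2).sum() / (A - 1)
    s2_e = msw
    s2_ac = max((msc - msw) / Nac, 0.0)
    s2_a = max((msa - msc) / (Nac * Ca), 0.0)
    return s2_a, s2_ac, s2_e

for sigma_a, sigma_ac in [(0.05, 0.10), (0.10, 0.05)]:   # scenario 1 and scenario 2
    s2_a, s2_ac, s2_e = nested_anova(simulate_twofold(sigma_a, sigma_ac))
    print(f"true sa2={sigma_a**2:.4f}, sac2={sigma_ac**2:.4f} -> "
          f"est sa2={s2_a:.4f}, sac2={s2_ac:.4f}, se2={s2_e:.4f}")
```

When σ²_a dominates (the second setting), far more of the unexplained variation sits at the area level, which is exactly the situation in which cluster-level location effects become problematic.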
Another potential solution is to use only aggregated covariates in the model for the household-level welfare. This alternative is presented by Nguyen (2012) in an application for Vietnam. The author proposes a model where the dependent variable is the household-level logarithm of per capita expenditure from a recent survey, in this case the Vietnam Household Living Standard Survey from 2006, whereas all covariates are commune-level means. These means are obtained from a dated (1999) census, although the author notes that geographic information system (GIS) data could also be included in the set of covariates. Nguyen (2012) obtains ELL estimates for small areas under that model and compares the performance with that of typical ELL estimates obtained using unit-level covariates from the Vietnam Household Living Standard Survey from 2006 and the 2006 Rural Agriculture and Fishery Census. The author finds that provinces and districts hovering around the middle of the distribution suffer from considerable re-rankings across methods, but those at the top and the bottom are relatively stable.

Lange et al. (2018) present an approach similar to Nguyen's (2012), which the authors suggest as an alternative in cases when census and survey data are not from similar periods, though the same issues noted above for the ELL method would likely persist in a model using only area-level covariates. Masaki et al. (2020) use a modeling approach similar to Nguyen's (2012), but take measures to address some of the shortcomings of a standard ELL approach and obtain EB estimators of Molina and Rao (2010), which appear to be more precise. The authors conduct a design-based validation study using a wealth index constructed with principal component analysis using census data for Sri Lanka and Tanzania. Their results show the approach may hold promise. In the sections that follow, the different procedures discussed here are tested under model-based and design-based simulation experiments.

3 Model-based simulation experiments

The simulation experiment described here is based on those conducted by Marhuenda et al. (2017), in which the true data generating process is a twofold nested error model. Such a model will better accommodate the usual applications of poverty mapping, where household surveys use two-stage sampling. In fact, the traditional ELL method includes random effects only at the cluster level, but estimates are given for a higher level.

In this simulation experiment a census data set of N = 20,000 observations is created, where observations are allocated among 40 areas (a = 1, ..., A). Within each area, observations are uniformly spread over 10 clusters (c = 1, ..., C_a). Each cluster c consists of N_ac = 50 observations, and each cluster is labeled from 1 to 10. The assumed model contains both cluster and area effects. Cluster effects are simulated as η_ac ~ iid N(0, 0.1²), area effects as η_a ~ iid N(0, 0.05²), and household-specific residuals as e_ach ~ iid N(0, 0.5²), where h = 1, ..., N_ac; c = 1, ..., C_a; a = 1, ..., A. Covariates are simulated as follows:11

1. x1 is a binary variable, taking value 1 when a random uniform number between 0 and 1, drawn at the household level, is less than or equal to 0.3 + 0.5(a/40) + 0.2(c/10).

2. x2 is a binary variable, taking value 1 when a random uniform number between 0 and 1, drawn at the household level, is less than or equal to 0.2.
3. x3 is a binary variable, taking value 1 when a random uniform number between 0 and 1, drawn at the household level, is less than or equal to 0.1 + 0.2(a/40).

4. x4 is a binary variable, taking value 1 when a random uniform number between 0 and 1, drawn at the household level, is less than or equal to 0.5 + 0.3(a/40) + 0.1(c/10).

5. x5 is a discrete variable, simulated as the rounded integer value of the maximum between 1 and a random Poisson variable with mean λ = 3(1 − 0.1(a/80)).

6. x6 is a binary variable, taking value 1 when a random uniform value between 0 and 1 is less than or equal to 0.4. Note that the values of x6 are not related to the area's label.

7. x7 is a binary variable, taking value 1 when a random uniform number between 0 and 1 is greater than or equal to 0.2 + 0.4(a/40) + 0.1(c/10).

Footnote 11: Covariates are simulated following the approach from Molina and Rao (2010) and Marhuenda et al. (2017), with slight modifications.

The welfare vector for each cluster within an area is created from the model with these covariates, as follows:

$$y_{ach} = 3 + 0.09x_{1ach} - 0.04x_{2ach} - 0.09x_{3ach} + 0.4x_{4ach} - 0.25x_{5ach} + 0.1x_{6ach} + 0.33x_{7ach} + \eta_a + \eta_{ac} + e_{ach}. \qquad (3)$$

The dependent variable, y_ach, is on the log scale. The poverty line in this scenario is fixed at z = 12. From the created "census," 20% of the observations are sampled in each of the clusters using simple random sampling without replacement;12 this yields our "survey" data. The generation and sampling processes are repeated L = 10,000 times. In each simulation replicate, the following quantities are computed for the poverty rates and gaps in each area:

Footnote 12: The same units are sampled in every simulation and the values of x1 to x7 for all the census units are also kept fixed; this implies that the values of these covariates for the sample units are always the same across simulations.

1. True poverty indicators τ_a, using the "census".

2. Direct estimators τ̂_a^DIR using the "survey", defined as the sample versions of τ_a.

3. Census EB estimators τ̂_a^{CEB_ac} presented in Marhuenda et al. (2017), based on a twofold nested error regression model and obtained using a Monte Carlo approximation with M = 50 replicates. Note that, for this estimator, the fitted model agrees with the true data generating process (3).

4. Census EB estimators τ̂_a^{CEB_a} presented in CMN (2020), based on a nested error model with only area random effects, obtained using a Monte Carlo approximation with M = 50 replicates.

5. Census EB estimators τ̂_a^{CEB_c} presented in CMN (2020), based on a nested error model with only cluster random effects and including the aggregate cluster and area means of the considered auxiliary variables, where M = 50.

6. Traditional ELL estimators τ̂_a^{ELL_c}, based on a nested error model with only cluster random effects and including the aggregate cluster and area means of the considered auxiliary variables, where M = 50.

7. Unit-context Census EB estimators τ̂_a^{UC-CEB_a}, based on a nested error model with random effects at the area level. This estimator follows the approach from Masaki et al. (2020), which is a modified version of that of Nguyen (2012) and uses only area means for some of the right-hand-side variables.13 Nguyen (2012) proposes this solution for the case when only a dated census and a recent survey are available.

8. Unit-context twofold nested error Census EB estimators τ̂_a^{UC-CEB_ac}, based on a twofold nested error model with random location effects at the area and cluster levels. This estimator follows the approach from Masaki et al.
(2020) and Nguyen (2012), where only area means for some of the right-hand-side variables are used.14

Footnote 13: The covariates used in this model are x̄_1ac, x̄_4ac, x̄_3a, x̄_5a, and x̄_7ac.
Footnote 14: The covariates used in this model are x̄_1ac, x̄_4ac, x̄_3a, x̄_5a, and x̄_7ac.

Model bias and MSE are approximated empirically as in MR (2010), as the averages across the L = 10,000 simulations of the prediction errors in each simulation (l), τ̂_a^{j(l)} − τ_a^{(l)}, and of the squared prediction errors, respectively, where j stands for one of the methods: DIR, CEB_ac, CEB_a, CEB_c, ELL_c, UC−CEB_a, UC−CEB_ac.15 Model bias and root MSE for a given area's estimate are computed at the area level as follows:

$$\mathrm{Bias}\left(\hat{\tau}_a^{j}\right) = \frac{1}{L}\sum_{l=1}^{L}\left(\hat{\tau}_a^{j(l)} - \tau_a^{(l)}\right)$$

$$\mathrm{RMSE}\left(\hat{\tau}_a^{j}\right) = \sqrt{\frac{1}{L}\sum_{l=1}^{L}\left(\hat{\tau}_a^{j(l)} - \tau_a^{(l)}\right)^2}$$

Footnote 15: These approximate E(τ̂_a^j − τ_a) for the bias and E[(τ̂_a^j − τ_a)²] for the MSE, where E(·) denotes expectation under model (2).

3.1 Results

This section presents the results from the model-based simulation experiments, where the goal is to compare the performance of the different methods. Marhuenda et al. (2017) consider multiple scenarios in which they simulate different values for σ²_a and σ²_ac. The authors note that what matters are the relative values. In this instance the interest is to assess how results differ when the random cluster effect is considerably smaller than the random area effect and when the random cluster effect is considerably larger than the random area effect. ELL would commonly specify its random location effect at the cluster level and then aggregate results to the area level. Consequently, we expect ELL to perform better when the random cluster effect is larger than the area random effect. We consider two scenarios:

1. η_ac ~ iid N(0, 0.1²) and η_a ~ iid N(0, 0.05²)

2. η_ac ~ iid N(0, 0.05²) and η_a ~ iid N(0, 0.1²)

Simulation results under the two considered scenarios are presented respectively in Figures 1 and 3 for bias, and Figures 2 and 4 for MSE. The results for bias and MSE are presented at the area level. With the exception of a few areas in which the ELL and unit-context methods demonstrate a slightly higher bias, all examined methods perform better than the direct estimators, in most cases by a substantial margin. For MSE, the result varies between the two simulation scenarios. Under scenario 1, with results shown in Figure 2, where the standard deviation of the cluster effects is double that of the area effects, the methods considered, including ELL and the unit-context methods, perform relatively well and in nearly all cases do as well as or better than direct estimates, though again ELL and the unit-context methods perform worse than the other options.16 However, in Figure 4, where the variance of the area effects is now much larger, ELL in particular performs poorly in terms of MSE, likely due to the error misspecification and the contextual variables not sufficiently explaining the variability of the area effects. To a lesser extent, a similar effect can be observed for the CensusEB estimator based on a model with only cluster effects and contextual variables (CEB_c with context), which performs well in terms of MSE under scenario 1 but under-performs in scenario 2. The twofold model results are aligned with the results presented by Marhuenda et al. (2017);
the bias and MSE of estimates obtained under twofold fitting and onefold CensusEB fitting at the area level are largely indistinguishable. This result is interesting in that it resonates with the findings from Marhuenda et al. (2017): in the absence of a software solution for a twofold nested error regression, it is preferable to specify the random effects at the level at which results are desired. This will ensure that MSEs are minimized despite mistaken model assumptions.

Footnote 16: Despite the result, other issues from the implementation of ELL (2003) noted by CMN (2020), like the underestimation of the MSE, still remain, unless the method to estimate the MSE is adjusted.

Figure 1: Empirical model bias of FGT0 estimators for σ_ac = 0.1 and σ_a = 0.05

Surprisingly, the two unit-context models used to obtain CensusEB estimators, one with random effects only at the area level and another with random effects at the cluster and area levels, show more bias than ELL within any given area.17 The results shown here are not evident under the model-based simulation conducted in Masaki et al. (2020)18 because, under the simulation presented here, true welfare is generated from household-level covariates, as is likely the case in real world scenarios. In Masaki et al. (2020) the authors chose to model the dependent variable using only 2 subdomain-level covariates, which are constant for all households in the subdomain.

The bias observed in the simulations conducted here for unit-context models is in part due to omitted variable bias.19 If the true model includes x_5ach, then when we only consider x̄_5ac we are omitting z_5ach = x_5ach − x̄_5ac from the model. Note that x̄_5ac is obtained in the "census"; because the survey is a sample and not the entire population, x̄_5ac will be correlated with z_5ach in the survey. We also know that z_5ach is correlated with the dependent variable, thus leading to the omitted variable bias.20

Footnote 17: The covariates used in this model are x̄_3a, x̄_1ac, x̄_5a, x̄_4ac, and x̄_7ac. In other simulations run, not shown, all the covariates' aggregates at the cluster level are used and similar results are obtained.
Footnote 18: See page 36 of Masaki et al. (2020).
Footnote 19: The unit-context models also display an upward bias in simulations where the population (20,000) is used to fit the model and to obtain the FGT0 predictors.
Footnote 20: The correlation is also present between x̄_5ac and other omitted deviations, z_ach.
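The omitted variable mechanism described above can be reproduced with a small simulation. The sketch below generates welfare from a household-level binary covariate, then predicts area-level FGT0 from a regression that uses only the area mean of that covariate, and compares the resulting bias with that of a regression using the household-level covariate. It is a deliberately stripped-down illustration: OLS instead of mixed-model fitting, arbitrary parameter values and poverty line, and a single simulated census per fitted model; it is not the unit-context estimators evaluated in the paper.

```python
import numpy as np

rng = np.random.default_rng(7)
A, N_a, n_a, L = 40, 1000, 50, 200       # areas, census/sample size per area, replicates
z = 10.0                                  # poverty line on the welfare (not log) scale

# Fixed "census" covariate: a binary household trait whose prevalence rises across areas
p_a = 0.2 + 0.6 * np.arange(A) / (A - 1)
x = (rng.uniform(size=(A, N_a)) < p_a[:, None]).astype(float)
xbar = x.mean(axis=1, keepdims=True)      # area means, as used by a unit-context model

bias_hh = np.zeros(A)
bias_uc = np.zeros(A)
for _ in range(L):
    # True log welfare: household-level covariate plus area effect and household error
    y = 2.2 + 0.8 * x + rng.normal(0, 0.05, (A, 1)) + rng.normal(0, 0.25, (A, N_a))
    fgt0_true = (np.exp(y) < z).mean(axis=1)

    idx = rng.choice(N_a, n_a, replace=False)          # SRS of households within each area
    ys = y[:, idx]

    def simulated_fgt0(sample_design, census_design):
        """OLS fit on the sample, then FGT0 from one simulated census under the fitted model."""
        Xs = np.column_stack([np.ones(sample_design.size), sample_design.ravel()])
        b, *_ = np.linalg.lstsq(Xs, ys.ravel(), rcond=None)
        sd = (ys.ravel() - Xs @ b).std()
        ysim = b[0] + b[1] * census_design + rng.normal(0, sd, (A, N_a))
        return (np.exp(ysim) < z).mean(axis=1)

    bias_hh += simulated_fgt0(x[:, idx], x) - fgt0_true                             # household x
    bias_uc += simulated_fgt0(np.broadcast_to(xbar, (A, n_a)), xbar) - fgt0_true    # area-mean x

print("average |bias| with household-level covariate:", np.abs(bias_hh / L).mean().round(4))
print("average |bias| with area-mean covariate only :", np.abs(bias_uc / L).mean().round(4))
```

Because the unit-context fit replaces the household-level covariate with its area mean, the simulated within-area welfare distribution has the wrong shape, and the head-count ratio computed from it is systematically off, echoing the pattern summarized in Figure 5 and Table 1.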
Figure 2: Empirical model MSE of FGT0 estimators; σ_ac = 0.1 and σ_a = 0.05
Figure 3: Empirical model bias of FGT0 estimators for σ_ac = 0.05 and σ_a = 0.1
Figure 4: Empirical model MSE of FGT0 estimators; σ_ac = 0.05 and σ_a = 0.1
Figure 5: Omitted variable bias under unit-context models

Table 1: Aggregate results across areas (FGT0)

                                   (1)                  (2)                  (3)                  (4)
Model R²                           ~0.42                ~0.42                ~0.42                ~0.42
σ_a                                0.05                 0.1                  0.05                 0.1
σ_ac                               0.1                  0.05                 0.1                  0.05
σ_e                                0.5                  0.5                  0.5                  0.5
Sample per cluster                 10 per all clusters  10 per all clusters  10 per 20% clusters  10 per 20% clusters
Population                         20,000               20,000               100,000              100,000
Direct      AAB (×100)             1.780                1.783                1.802                1.795
            ARMSE (×100)           4.039                4.041                4.396                4.563
CEB_ac      AAB (×100)             0.020                0.023                0.018                0.023
            ARMSE (×100)           2.630                2.859                2.493                2.191
CEB_a       AAB (×100)             0.020                0.024                0.021                0.017
            ARMSE (×100)           2.642                2.876                2.515                2.237
CEB_c       AAB (×100)             0.024                0.029                0.045                0.020
            ARMSE (×100)           2.793                3.885                4.731                2.614
ELL_c       AAB (×100)             0.238                0.236                0.144                0.139
            ARMSE (×100)           3.433                5.269                5.048                2.768
UC−CEB_a    AAB (×100)             1.115                1.362                1.296                0.986
            ARMSE (×100)           3.074                3.404                3.068                2.584
UC−CEB_ac   AAB (×100)             1.106                1.370                1.272                0.929
            ARMSE (×100)           3.059                3.383                3.046                2.591
AAB: Average absolute bias; ARMSE: Average root MSE

This misspecification is manifested as a failure of the linearity assumption, and the magnitude of this bias depends on the form of the covariates (see Figure 5 and appendix). Additionally, it will vary for each type of indicator. For example, when estimating mean income in a model without log transformation, linearity is preserved when aggregating and thus the approach is expected to perform well.21 However, the empirical MSEs for the unit-context models in most areas outperform those from ELL (see Table 1 for average results). In fact, under scenario 1 (Figure 2), in some areas the MSEs for the unit-context models are only slightly worse than those of CEB_a and CEB_ac. Under scenario 2, where the area effects have a higher variance relative to that of the cluster effect, which is likely the case in real world scenarios, the empirical MSEs for the unit-context models show considerable variability for both fitting models. The results suggest the method could outperform direct estimates as well as misspecified models with random effects for the area level only, in terms of MSE, and may be an alternative under scenarios where census and survey data are not aligned. However, bias in these unit-context models may be a considerable concern. As one can see in the results from Table 1, the unit-context method yields FGT0 estimators with an average absolute bias that is 56 times larger than the bias of the CensusEB estimators, and almost 5 times larger than ELL's average absolute bias.22 In the sections that follow this is further explored.

Footnote 21: The empirical average relative root MSE of the mean for the UC approach is slightly higher than that of the CensusEB.
Footnote 22: The unit-context method also yields mean welfare estimators that are 38 times more biased than the Census EB variants (not shown).

4 Design-based simulation

In this section we present a design-based simulation experiment based on the Mexican Intra Censal Survey of 2015 (Encuesta Intracensal). The purpose of the simulation is to observe the performance of the different methods under more realistic scenarios. The survey is carried out by the Mexican National Institute of Statistics and Geography (Instituto Nacional de Estadistica y Geografia - INEGI). The survey has a sample of 5.9 million households and is representative at the national, state (32 states), and municipal or delegation level (2,457 municipalities), as well as for localities with a population of 50,000 or more inhabitants.
The 2015 Intra Censal Survey questionnaire includes the following topics related to the housing unit: dwelling features, size and use of the dwelling, conditions for cooking, ownership and access conditions, access to water, sanitation facilities and sanitation, electric power, solid waste, equipment, appliances and automobile, and information and communication technologies (ICT). It also includes the following demographic information: total population and structure, birth registration, marital status, health services, ethnicity, education, economic characteristics, nonpaid work, migration, daily mobility, fertility and mortality, household composition, non-labor household income, food security, agricultural land use, relationship to the household head, indigenous language, occupation, economic activity, and accumulated education. The survey also includes indicators for states, municipalities, and counties.

One of the key features of the survey is its size and the fact that it includes a measure of income at the household level.23 The inclusion of an income measure allows for a design-based validation of the different methods presented above. The next section describes how the survey is modified to create a census data set and how samples are then drawn from the created census data, with the goal of obtaining small area estimates of poverty at the municipal level.

Footnote 23: Income is defined as money received from work performed during the course of the reference week by individuals of age 12 or older within the household.

4.1 Creating a census and survey

Because the goal of this exercise is to test how well the different methods perform under a real world scenario, the Intra Censal Survey is modified to mimic a census in order to allow for a design-based simulation. The first step consists of randomly removing 90 percent of households that reported an income of 0. This is done to ensure that some households with an income of 0 are included in the population, but not as many as in the original data, to make it more realistic.24 In the second step, all municipalities with fewer than 500 households are removed; the number of observations by municipality thus ranges from 501 to 23,513.25 The final "Census" consists of 3.9 million households and 1,865 municipalities.

Footnote 24: Welfare values of 0 and/or missing are a common feature of household surveys.
Footnote 25: The median municipality has 1,613 households.

To draw survey samples, primary sampling units (PSUs) need to be created.26 Within each municipality, the original data's PSUs are sorted27 and joined until each created PSU has close to 300 households. Under the proposed approach, original PSUs are never split, just joined to others. Additionally, all original PSUs that were larger than 300 households are designated as a created PSU. The entire process yields 16,297 PSUs.

Footnote 26: The Intra Censal Survey has its own PSUs; however, many of these have too few observations to properly work as PSUs in the created "Census" data.
Footnote 27: This assumes that the clusters' numbering is tied to how proximate each cluster is to one another.
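A sketch of this PSU-creation step, under the stated rule that original PSUs are sorted within each municipality and joined until roughly 300 households are accumulated (and never split), could look as follows. The column names, the toy data, and the exact closing rule for a created PSU are illustrative assumptions; the paper does not publish its grouping code.

```python
import pandas as pd

def build_psus(census: pd.DataFrame, target: int = 300) -> pd.Series:
    """Assign each household a created-PSU id by joining consecutive original PSUs
    within a municipality until roughly `target` households are accumulated.
    Expects columns 'municipality' and 'orig_psu' (assumed names)."""
    created = pd.Series(index=census.index, dtype="object")
    for muni, grp in census.groupby("municipality"):
        counts = grp.groupby("orig_psu").size().sort_index()   # original PSUs, sorted
        mapping, new_id, cum = {}, 0, 0
        for psu, n in counts.items():
            mapping[psu] = new_id
            cum += n
            if cum >= target:          # close the created PSU once ~target households reached;
                new_id, cum = new_id + 1, 0   # large original PSUs become their own created PSU
        created.loc[grp.index] = f"{muni}-" + grp["orig_psu"].map(mapping).astype(str)
    return created

# Toy example: three original PSUs joined into two created PSUs
toy = pd.DataFrame({
    "municipality": ["M1"] * 8,
    "orig_psu": [1, 1, 1, 2, 2, 3, 3, 3],
})
toy["created_psu"] = build_psus(toy, target=5)
print(toy)
```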
The resulting Census data is used to draw 500 survey samples to conduct a design-based simulation to establish how a method will perform over repeated sampling from a finite population (Tzavidis et al., 2018). The sampling approach reflects standard designs used in most face-to-face household surveys conducted in the developing world, such as those conducted by the Living Standards Measurement Study (LSMS) program of the World Bank (Grosh and Muñoz, 1996), with some simplification for convenience.

First, the 32 states that comprise Mexico are treated as the intended domains of the sample. The main indicator of interest for the survey is the welfare measure: household per capita income. The desired level of precision to be achieved in the sample is assumed to be a relative standard error (RSE) of 7.5 percent for mean per capita income in each state.28 A two-stage clustered design is assumed, with clusters (defined above) serving as primary sampling units (PSUs) selected in the first stage within each domain, and a sample of 10 households within each cluster selected in the second stage. With these design features established, the trimmed census data was analyzed to identify the parameter estimates for per capita income (mean and standard deviation) for each state and the target standard error implied by the parameter estimates that corresponds to an RSE of 7.5 percent. The minimum sample size required for state s, given these parameters, under a simple random sampling (SRS) design is then obtained as:

$$n_s^{SRS} = \left(\frac{\sigma_s}{\bar{y}_s \times RSE^{tgt}}\right)^2 = \left(\frac{\sigma_s}{\bar{y}_s \times 0.075}\right)^2$$

where σ_s is the standard deviation of per capita income in state s, RSE^tgt is the target relative standard error of 7.5 percent, and ȳ_s is the mean per capita income in state s. The minimum sample size under SRS must then be adjusted to account for the clustered design. This design effect due to clustering is accounted for by estimating the intra-cluster correlation (ICC) for per capita income within each state using the trimmed census data. The ICC estimates can then be applied to the SRS sample size obtained above to arrive at the minimum sample size for state s under the clustered design employed here, given by:

$$n_s = n_s^{SRS} \times DEFF_s = n_s^{SRS} \times \left(1 + (n_{psu} - 1)\rho_s\right)$$

where DEFF_s is the design effect in state s, n_psu is the number of households selected within each cluster (10 in this case), and ρ_s is the ICC for per capita income in state s. The minimum number of clusters needed to achieve n_s (assuming 10 households per cluster) was calculated, and the final (household) sample size was established by multiplying the number of clusters by 10.29

Footnote 28: The desired precision of an RSE of 7.5 percent is somewhat arbitrary but corresponds to precision targets in similar surveys and yields a sample of reasonable size.
Footnote 29: Full estimates and results for the sample size determination are available upon request.

Taking the sample size requirements from above as fixed, the sample in each simulation is selected in accordance with the two-stage design. PSUs within each state, referred to here as clusters, are selected with probability proportional to size (PPS) without replacement, where the measure of size is the number of households within the cluster. Then 10 households are selected within each cluster via simple random sampling. According to this design, the inclusion probability for household h in cluster c and in state s is approximated30 as:

$$p_{sch} = \frac{C_s N_{sc}}{N_s} \times \frac{n_{sc}}{N_{sc}}$$

where N_s is the total number of households in the census for state s, N_sc is the number of households in cluster c from state s, C_s is the number of clusters selected in state s, and n_sc is the number of households selected in cluster c within state s, which is fixed at 10.31

Footnote 30: The equation used to calculate inclusion probabilities assumes sampling with replacement but is used here as an approximation of inclusion probabilities under PPS selection without replacement. This should provide a reasonable approximation in this case since there is a relatively large number of PSUs present in the frame.
Footnote 31: The design weight for each household is simply the inverse of the inclusion probability. In a typical survey, the design weights would be further adjusted for nonresponse and calibrated to known population characteristics. However, since the sampling is only a simulation exercise, there is no nonresponse and thus no nonresponse adjustment is required. Calibration or post-stratification could be performed but was not implemented, to simplify the process.
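The sample size calculation described above is easy to reproduce. The sketch below computes the SRS sample size implied by a 7.5 percent RSE target, inflates it by the design effect, and converts it into a number of clusters of 10 households. The state-level mean, standard deviation, and ICC values used in the example are made-up inputs, not the figures estimated from the trimmed census.

```python
import math

def state_sample_size(mean_pc_income: float, sd_pc_income: float, icc: float,
                      rse_target: float = 0.075, hh_per_cluster: int = 10) -> dict:
    """Minimum sample size for one state: SRS size, design effect, clusters, households."""
    n_srs = (sd_pc_income / (mean_pc_income * rse_target)) ** 2
    deff = 1 + (hh_per_cluster - 1) * icc
    clusters = math.ceil(n_srs * deff / hh_per_cluster)
    return {"n_srs": math.ceil(n_srs), "deff": round(deff, 2),
            "clusters": clusters, "households": clusters * hh_per_cluster}

# Illustrative (made-up) state parameters: mean and sd of per capita income, and the ICC
print(state_sample_size(mean_pc_income=3500, sd_pc_income=5200, icc=0.08))
```

Summing the resulting household counts over all 32 states gives the total sample size per replicate.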
The sample size across the 500 samples is roughly 23,540. Under the proposed sampling scenario, not all municipalities are included, and the number of municipalities included varies from sample to sample, ranging between 951 and 1,020 municipalities. The median municipality included in a given sample is represented by a sole PSU, and thus its sample size is 10 households.

4.2 Model selection

Model selection is conducted using the first sample drawn from the scenario detailed in the previous section. The target variable is household per capita income. However, this variable is highly skewed, and to achieve an approximately normal distribution we test three transformations: i) natural logarithm,32 ii) log-shift transformation, and iii) Box-Cox transformation of the natural logarithm.33 As one can see in Figures 6, 7 and 8, the Box-Cox transformation as well as the log-shift fix the skewness in the distribution of model residuals that appears after taking the natural logarithm of per capita income.34

Footnote 32: In any given sample, roughly 11 observations have an income of 0; these are assigned an income of 1 prior to transformation.
Footnote 33: For more details on transformations, see Tzavidis et al. (2018).
Footnote 34: Figures make use of a sample from the two-stage clustered design.

Figure 6: Histogram of residuals from unit level onefold nested error model fitted to Nat. log. of per capita income (municipal random effects)
Figure 7: Histogram of residuals from unit level onefold nested error model fitted to log-shift transformation of per capita income (municipal random effects)
Figure 8: Histogram of residuals from unit level onefold nested error model fitted to Box-Cox of Nat. log. of per capita income (municipal random effects)

The goal of the model selection process is to arrive at a model that only includes stable covariates. Under each transformation, model selection is done using the least absolute shrinkage and selection operator, commonly known as lasso,35 where the candidate covariates include household characteristics and characteristics at the PSU, municipal, and state levels. Two models are selected: i) a model that includes household characteristics and characteristics at the PSU, municipal, and state levels, and ii) another model that only includes characteristics at the PSU, municipal, and state levels. The second model is used for the unit-context approach. All household-level characteristics that are also included as aggregates at the PSU, municipality, and state levels have been previously standardized to ensure that these have mean 0 and standard deviation 1 for each PSU, municipality, and state, respectively. Note that aggregated covariates have been obtained from the "census"; thus, the aggregated covariates will be the same within the sample and the census. The lasso model selection process applied here ignores the error structure presented in (1) and (2) and does not ensure that selected covariates will be significant under the assumed models. Consequently, after the initial lasso model selection, model (1) is fit using Henderson's Method III (with sampling weights) and random effects specified at the municipality level.

Footnote 35: The model is selected using 20-fold cross validation and a shrinkage parameter (λ) that is within 1 standard error of the one that minimizes the cross-validated MSE.
Because the resulting model may still include covariates that are not significant under the fitted model, all non-significant (at the 1% level) covariates are removed sequentially.36 Finally, we check for multicollinearity and remove covariates with a variance inflation factor (VIF) above 3. Fitted models using the first sample and the described model selection are presented in Tables 3 and 4 in the appendix.

Footnote 36: The removal is done sequentially, starting with the least significant covariate; then the model is fit again without that covariate. The process is repeated until all covariates in the model are significant at the 1% level.

Since in small area estimation models are used for prediction, assessing the predictive power of the model is important. This may be measured using the coefficient of determination, R² (Tzavidis et al., 2018). For the household-level model used in this exercise the R² is close to 0.45, while that of the unit-context model centers around 0.25. Figures 9, 10, and 11 present checks on the normality assumptions using normal Q-Q plots of household-level residuals and estimated area effects for the onefold unit-level model with random municipality location effects. The resulting plots for the unit-context models can be seen in appendix figures 26, 27, and 28. The natural logarithm transformation (Fig. 9) presents evidence of deviation from normality. Marhuenda et al. (2017) note that in applications that use real data the exact fit to a distribution is barely met; however, the Box-Cox and log-shift transformations provide considerably better normal approximations (see Figures 10 and 11).37

Footnote 37: Box-Cox and log-shift transformations are available in R's sae package as well as in Stata's sae package. In the World Bank's PovMap software, the only available transformation was the natural logarithm. However, PovMap does allow for drawing residuals from the empirical distribution as well as from a Student's t distribution.

Figure 9: Normal Q-Q plot of unit level onefold nested error model - household level and predicted subdomain effects (Nat. log transformation)
Figure 10: Normal Q-Q plot of unit level onefold nested error model - household level and predicted subdomain effects (log-shift transformation)
Figure 11: Normal Q-Q plot of unit level onefold nested error model - household level and predicted subdomain effects (Box-Cox of nat. log transformation)

4.3 Results

Once the models to be used have been selected and L = 500 samples under the two-stage sampling described in section 4.1 have been taken from the "census" created using the Mexican Intra Censal Survey of 2015, we obtain estimates using the different considered models. The target parameters for this simulation are mean welfare and the Foster et al. (1984) FGT class of decomposable poverty measures for the municipalities present in the census. The true values of the target parameters at the municipal level are based on the census data. As a first step, we compare results using the three transformations discussed in the previous section.
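The three candidate transformations can be illustrated with a small helper that maps per capita income to an approximately symmetric scale. The shift and λ values below are placeholders chosen for illustration; in the paper they are estimated from the data (see Tzavidis et al., 2018), and the skewness check simply mimics, in a crude way, the residual diagnostics in Figures 6-8.

```python
import numpy as np

def natural_log(y, zero_value=1.0):
    """Natural log, with zero incomes recoded to a small positive value first."""
    return np.log(np.where(y <= 0, zero_value, y))

def log_shift(y, shift):
    """Log-shift transformation: log(y + shift), with shift > 0 chosen to symmetrize residuals."""
    return np.log(y + shift)

def boxcox_of_log(y, lam, zero_value=1.0):
    """Box-Cox transformation applied to the natural log of income (lam is a placeholder)."""
    ly = natural_log(y, zero_value)
    return (ly ** lam - 1) / lam

def skewness(v):
    v = v - v.mean()
    return (v ** 3).mean() / (v ** 2).mean() ** 1.5

# Skewed, income-like toy data (log-normal with a few zero incomes)
rng = np.random.default_rng(3)
income = np.exp(rng.normal(7.5, 1.1, 50_000))
income[:10] = 0.0

for name, vals in [("nat. log", natural_log(income)),
                   ("log-shift (shift=500)", log_shift(income, 500.0)),
                   ("Box-Cox of log (lam=1.4)", boxcox_of_log(income, 1.4))]:
    print(f"{name:28s} skewness = {skewness(vals):6.3f}")
```

In the design-based exercise the choice among these options is made by inspecting the residual histograms and Q-Q plots shown above, not by a skewness statistic alone.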
Figures 12 and 13 show box plots of the empirical absolute relative bias and MSE of CensusEB estimates of poverty rates based on model (1) under each transformation. The Box-Cox and log-shift transformations yield estimates that are not only less biased, but also have a lower empirical design MSE. As can be seen in the results from the first three columns of Table 2, the log-shift transformation yields aggregate results somewhat preferable to the Box-Cox. Additionally, in Table 2 it is quite clear that the natural logarithm may actually result in aggregate results that present considerably smaller gains over direct ones; thus, model and residual checks should always be done to ensure adequate transformations are used and to avoid such an outcome. The rest of the discussion will focus on estimates obtained from the log-shift transformation applied to the two-stage samples.38

Footnote 38: Figures under other transformations are available upon request.

Figure 12: Box plots of Census EB FGT0 design bias for unit level onefold nested error model estimates for fitted model 1 and two-stage samples (Municipal level)
Figure 13: Box plots of empirical MSE of unit level onefold nested error model CensusEB FGT0 estimates for fitted model 1 and two-stage samples (Municipal level)

We consider estimators based on onefold and twofold nested error models, including unit-context (UC) models with only aggregated covariates, and direct estimators. Concretely, we consider the following:39

Footnote 39: Every model includes a constant term.

Direct estimates:

1. Direct estimates from the survey samples for each municipality. These are calculated with weights from the considered design. Specifically, under the two-stage design detailed in subsection 4.1, the FGT indicators are estimated using the inclusion probabilities as

$$\hat{\tau}_a^{direct} = \frac{\sum_{c}\sum_{h} p_{ach}^{-1} F_{ach}}{\sum_{c}\sum_{h} p_{ach}^{-1}},$$

where p_ach is the inclusion probability for household h in cluster c and in municipality a, and F_ach is the FGT or welfare measure of interest for household h in cluster c and in municipality a.40

Footnote 40: Note that, depending on the sampling strategy, it is likely that we do not have direct estimates for all areas.

Unit-level models:

1. CEB_a: Model fit is done using Henderson's Method III (with sampling weights) and estimates are obtained using CensusEB as noted in CMN (2020). The fitted model reads: y_ach = x_ach β + z_ac α + t_a ω + g_s λ + η_a + ε_ach, where x_ach is a vector of household-specific characteristics, z_ac contains cluster-level characteristics, t_a includes municipality-level characteristics, and g_s is composed of state-level characteristics. The random effects, η_a, are specified at the municipal level.

2. CEB_c: Model fit is done using Henderson's Method III (with sampling weights) and estimates are obtained using CensusEB as noted in CMN (2020). The difference with CEB_a is that random effects are specified at the PSU level. That is, the fitted model is: y_ach = x_ach β + z_ac α + t_a ω + g_s λ + η_ac + ε_ach, where η_ac is a random effect for cluster c within municipality a.
3. CEB_ac: Model fit is done using REML and estimates are obtained using CensusEB as noted in Marhuenda et al. (2017).41 The fitted model is: y_ach = x_ach β + z_ac α + t_a ω + g_s λ + η_a + η_ac + ε_ach; random effects are specified at the municipality and PSU levels.

4. CEB_sa: Similar to CEB_ac, however random effects are specified at the municipal and state levels. The goal here is to borrow strength from the state for municipalities that are not in the sample, which takes a cue from Marhuenda et al. (2017).

5. ELL_c: ELL estimates under the same model as CEB_c. In cases where we use a transformation different from the natural logarithm, random location effects and household residuals are drawn from their empirical distribution.

Footnote 41: All twofold nested error methods are fit without the use of probability sampling weights and are thus not comparable to those that use survey weights.

Unit-context models:

1. UC−CEB_a: Unit-context model originally proposed by Nguyen (2012) but with EB, similar to Masaki et al. (2020). Model fit is done using Henderson's Method III (with sampling weights) and estimates are obtained using CensusEB as noted in CMN (2020). The fitted model is: y_ach = z_ac α + t_a ω + g_s λ + η_a + ε_ach.

2. UC−CEB_ac: Unit-context model fit is done using REML and estimates are obtained using CensusEB as noted in Marhuenda et al. (2017). The fitted model is: y_ach = z_ac α + t_a ω + g_s λ + η_a + η_ac + ε_ach.

3. UC−CEB_sa: Similar to UC−CEB_ac, however random effects are specified at the municipal and state levels. Just like CEB_sa, the goal here is to borrow strength from the state for municipalities that are not in the sample.

4. UC−ELL_c: ELL estimates under the model in UC−CEB_a but with random effects specified at the PSU level, as was originally proposed by Nguyen (2012) and then by Lange et al. (2018). In cases where we use a transformation different from the natural logarithm, random location effects and household residuals are drawn from their empirical distribution. The fitted model is: y_ach = z_ac α + t_a ω + g_s λ + η_ac + ε_ach.

The chosen measures to evaluate the performance of the considered predictors are the bias, MSE, and root MSE, obtained as:

$$\mathrm{Bias}\left(\hat{\tau}_a^{j}\right) = \frac{1}{L}\sum_{l=1}^{L}\left(\hat{\tau}_a^{j(l)} - \tau_a\right)$$

$$\mathrm{MSE}\left(\hat{\tau}_a^{j}\right) = \frac{1}{L}\sum_{l=1}^{L}\left(\hat{\tau}_a^{j(l)} - \tau_a\right)^2$$

$$\mathrm{RMSE}\left(\hat{\tau}_a^{j}\right) = \sqrt{\frac{1}{L}\sum_{l=1}^{L}\left(\hat{\tau}_a^{j(l)} - \tau_a\right)^2}$$

where j stands for one of the methods: CEB_a, CEB_c, direct, CEB_ac, CEB_sa, ELL_c, UC−CEB_a, UC−CEB_ac, UC−CEB_sa, and UC−ELL_c; and τ_a denotes the true population parameter for municipality a.42 Formal evaluation of the MSE estimators is not undertaken here since it is beyond the scope of the exercise and computationally intensive, as it would require obtaining a bootstrap MSE for each of the 500 samples.

Footnote 42: Note that here there is only one true population parameter because our census is fixed. In the model-based simulation we assume the target parameters are random.

In section 3.1, like Marhuenda et al. (2017), we noticed how the relative size of the random effects affected the precision of the different methods. The first sample from the simulation experiment under two-stage sampling is used to assess the magnitude of the random effects under the unit-level model. The values of σ̂²_ac and σ̂²_a under the twofold nested error model with a log-shift transformation for this sample are equal to 0.021 and 0.013, respectively. The value of σ̂²_ac when specifying random effects only at the cluster level is equal to 0.072, and σ̂²_ac/σ̂²_e is equal to 0.054.
Figure 14: Box plots of empirical design MSE for FGT0 under two-stage sampling (log-shift transformation)

In Figure 14 we can clearly see that, in general, all SAE methods outperform the direct estimates in terms of design MSE.43 In terms of design bias (Fig. 15), direct estimators are very close to being unbiased under the design and would likely converge to the true value as the number of simulations increases. Model-based estimators, on the other hand, are all biased under the design. The result is similar to the one presented by Marhuenda et al. (2017), where gains in MSE for model-based estimators are achieved at the expense of design bias.

42 Note that here there is only one true population parameter, because our census is fixed. In the model-based simulation we assume the target parameters are random.
43 See appendix Figure 31 for the untrimmed version of the figure.

Table 2: Aggregate results for 1,865 municipalities in the "census" (FGT0), from 500 samples

                              (1)        (2)         (3)         (4)
Transformation             Box-Cox    Nat. log.  Log-shift   Log-shift
Direct      AAB (×100)      11.314     11.314      11.314       9.722
            ARMSE (×100)    14.051     14.051      14.051      11.997
CEBac       AAB (×100)       6.277      8.642       6.273       5.393
            ARMSE (×100)     6.580      9.169       6.574       5.964
CEBsa       AAB (×100)       6.382      8.953       6.380       5.740
            ARMSE (×100)     6.695      9.414       6.687       5.997
CEBa        AAB (×100)       6.080      8.800       6.092       5.395
            ARMSE (×100)     6.584      9.525       6.589       6.034
CEBc        AAB (×100)       6.253      8.690       6.277       6.054
            ARMSE (×100)     6.363      8.847       6.384       6.181
ELLc        AAB (×100)       6.685      8.781       6.685       6.961
            ARMSE (×100)     6.820      8.854       6.820       7.087
UC-CEBa     AAB (×100)       6.171      7.531       6.016       5.982
            ARMSE (×100)     6.607      8.274       6.461       6.672
UC-CEBac    AAB (×100)       6.121      7.596       5.998       5.926
            ARMSE (×100)     6.446      8.186       6.332       6.652
UC-ELLc     AAB (×100)       6.250      8.019       6.250       7.581
            ARMSE (×100)     6.421      8.100       6.421       7.758
UC-CEBsa    AAB (×100)       6.002      7.539       5.875       5.937
            ARMSE (×100)     6.414      8.284       6.294       6.570
AAB: average absolute bias; ARMSE: average root MSE. For ELL under Box-Cox and log-shift, the model is still fit on the natural log but errors are drawn from the empirical distribution. Columns 1-3 use the two-stage samples (section 4.1); column 4 uses a 1% SRS by PSU, with all PSUs included.

As an additional check, we add simulations where L = 500 samples are taken, each consisting of a 1% SRS without replacement in every PSU within the fixed census population. Aggregate results for this scenario are presented in column 4 of Table 2, and box plots for MSE and bias are presented in appendix Figures 29 and 30, respectively. The first thing to note in these figures is that there are fewer extreme outliers than in the two-stage sample scenario shown in appendix Figure 31. Despite the fewer outliers, the results under this additional sampling scenario mimic those from the two-stage samples. However, under the two-stage sampling used here, direct estimates for most municipalities are not available across all 500 samples (see subsection 4.1). Consequently, direct estimates are not included in the remaining figures discussing results from the two-stage samples. Under the simulation experiment with two-stage sampling, the empirical MSE of ELLc estimates, where the location effect is specified at the PSU level, has a tighter spread than that of the direct estimates (Fig. 14).
Consequently, though ELL performs relatively well compared with direct estimates, the number of ELL outliers with a high MSE is considerable, and ELLc performs worse than all other small area approaches. This result was not expected: the FGT0 results for UC-ELLc appear better than those of the traditional ELLc, even though the latter has a considerably better model fit.44 Nevertheless, UC-ELLc does very poorly for mean welfare, where its MSE is as large as that of direct estimates and it is also considerably biased (see appendix Table 7). These results also hold under the simulation experiment with the 1% SRS by PSU samples.

As expected, CEBa estimates show a considerably tighter MSE spread than direct estimates, but still with outliers. Another interesting finding is that under CensusEB with random effects specified at the PSU level (CEBc), the empirical MSE is tighter than that of the direct estimates and also displays better properties than ELLc, as seen in Table 2. However, given the discussion in the model-based validation and the results shown here, the specification with random effects at the municipality level is preferred over the one with the effect at the PSU level.

Perhaps the most surprising result is the low MSE of the UC-CEBa method, which uses only contextual variables, and of the other UC-CEB variants, as proposed by Masaki et al. (2020). The estimates display a tight spread, and the results in Table 2 corroborate the finding. Beyond UC models being an alternative when contemporaneous census data are not available, the method presents two advantages over traditional Fay-Herriot (FH) area-level models under sampling scenarios that follow a two-stage design like the one considered here. First, as noted toward the end of section 4.1, the majority of municipalities are represented by one PSU and thus have quite small sample sizes, which makes direct FGT0 estimates of exactly 0 or 1 much more likely. In these cases the method of Nguyen (2012) is a valid alternative to FH, because FH is not applicable in municipalities with a sampling variance of 0. An additional advantage is that the same model can be used for multiple indicators, whereas FH requires a separate model for each indicator considered.

44 Note that under the MI-inspired bootstrap of ELL, a very poor model fit will be heavily penalized and will likely yield noisy estimates for UC-ELLc that fare worse than direct estimates in most applications.

A possible explanation for the good performance of unit-context models under the considered experiments could be the estimated random effects, where $\hat{\sigma}^2_{ac} > \hat{\sigma}^2_{a}$, which coincides with the better-performing scenario in subsection 3.1. This observation is likely a coincidence, however, since the performance of unit-context models depends on the particular covariates included in the model and on the shape of the target indicator. With the Mexican data and the covariates used, the unit-context model performs rather well. However, the method considerably lags all others for the estimation of mean welfare by area; a similar result is observed in subsection 3.1.45 Under the simulations of column 3 of Table 2, the gains in MSE for the unit-context methods appear to come in municipalities with larger populations.46 This is expected because, in our sampling scenarios, municipalities with a larger population are more likely to be included in all the samples.
For example, in the census there are 16,293 PSUs spread over 1,865 municipalities. The median municipality in the created census has 7 PSUs, while the median municipality in a given two-stage sample is represented by only one PSU. Consequently, it is not surprising that the unit-context variants under-perform relative to unit models in terms of bias and MSE in municipalities with smaller populations, as can be observed in Figure 16. For municipalities with larger populations, and hence those likely to consist of more PSUs and to have more of those PSUs in the sample, direct estimates for FGT0 are more precise and unit-context models begin to catch up to unit models (see Fig. 32 in the appendix).

Under the simulations from a 1% SRS by PSU, shown in column 4 of Table 2 and in appendix Figures 29 and 30, the unit models (CEBa) present a slight upward bias that becomes more pronounced in more populous municipalities (Fig. 17). Some bias is acceptable, since gains in MSE are achieved at the expense of bias; in this case, however, the presence of outliers appears to produce an increasing upward bias that affects more municipalities as we move to more populous deciles (box plots for CEBa in Fig. 17). Unit-context (UC-CEBa) models, on the other hand, appear to have a downward bias. In the box plots for UC-CEBa in Figure 17, the bias in the lower deciles is downward, and the downward bias is considerably reduced as we move to the upper deciles.

As noted in section 3.1, the problem faced by unit-context models is omitted variable bias, manifested as a lack of linearity (see Fig. 18).47 Figure 19 further illustrates the issue. Under CEBa the residuals appear to follow a random pattern, whereas under UC-CEBa all households in municipalities represented by only one PSU in the sample have the same linear fit. This manifests itself in the figures for UC-CEBa as columns of vertical dots. The problem can also be observed in Figure 20, where households within municipalities with just one PSU in the sample have the same linear fit. More specifically, in the right panel of Figure 20 all the points corresponding to different households from the same PSU are superposed, and for municipalities with just one PSU there is a single point representing the same predicted value for all households in that municipality.

45 See Table 7 in the appendix.
46 Similar findings are obtained under the simulations for column 1.
47 The true model includes household size at the household level, so it is a determinant of the dependent variable. Household size $x_{ach}$ can be decomposed as $z_{ach} + \bar{x}_{ac}$, where $z_{ach} = x_{ach} - \bar{x}_{ac}$; the omitted component $z_{ach}$ is thus also a determinant of the dependent variable. The unit-context model only includes the PSU average household size $\bar{x}_{ac}$, obtained from the census, as a covariate. When a given survey sample is taken, the non-included component $z_{ach}$ is correlated with the PSU average household size and with the dependent variable.

Figure 15: Box plots of empirical design bias for FGT0 under two-stage sampling (log-shift transformation)
Figure 16: Average empirical MSE for FGT0 under two-stage sampling, by municipality population deciles (log-shift transformation)
Figure 17: Box plots of design bias under 1% SRS by PSU sampling, by municipality population deciles (log-shift transformation)
Figure 18: Omitted variable bias under unit-context models
Figure 19: Household residuals plotted against linear fit under two-stage sampling (log-shift transformation)
Figure 20: Linear fit plotted against municipalities under two-stage sampling (log-shift transformation)
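The mechanism described in footnote 47 can be reproduced with a small simulation: generate welfare from a household-level covariate and area effects, then fit a unit-context specification that only includes the area mean of that covariate. The sketch below is purely illustrative; the parameter values and object names are invented, and it is only meant to show the vertical columns of fitted values and the omitted within-area component left in the residuals, not the size or direction of the bias found with the Mexican data.

```r
# Illustrative simulation of the omitted-variable problem in unit-context models.
# All parameter values and object names are hypothetical.
set.seed(1)
A <- 100; Na <- 200; na <- 8                       # areas, census households, sample size

pop <- data.frame(area = rep(1:A, each = Na),
                  x    = rnorm(A * Na, mean = 2, sd = 1))
eta      <- rnorm(A, sd = 0.3)                     # area random effects
pop$y    <- 1 + 0.5 * pop$x + eta[pop$area] + rnorm(A * Na, sd = 0.5)
pop$Xbar <- ave(pop$x, pop$area)                   # census (population) area mean of x

## small survey sample per area
smp <- do.call(rbind, lapply(split(pop, pop$area),
                             function(d) d[sample(nrow(d), na), ]))

unit_fit <- lm(y ~ x,    data = smp)               # unit-level model
uc_fit   <- lm(y ~ Xbar, data = smp)               # unit-context model: census mean only

## the unit-context fit is constant within each area, so its residual plot shows
## vertical columns of dots (cf. Figures 19 and 20), and the omitted within-area
## component x - Xbar is pushed into the residuals
par(mfrow = c(1, 2))
plot(fitted(unit_fit), residuals(unit_fit), main = "Unit model",   xlab = "fit", ylab = "residual")
plot(fitted(uc_fit),   residuals(uc_fit),   main = "Unit-context", xlab = "fit", ylab = "residual")
cor(residuals(uc_fit), smp$x - smp$Xbar)           # omitted component left in the residuals
```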
Figure 21: Box plots of design bias under ordered quantile normalization, by municipality population deciles (two-stage sampling)
Figure 22: Box plots of design bias under 1% SRS by PSU sampling (hybrid simulation)
Figure 23: Box plots of empirical MSE under 1% SRS by PSU sampling, by municipality population deciles (hybrid simulation)

To check whether the upward bias of estimators based on unit-level models is due to deviations from the model assumptions, specifically deviations from normality, we apply a normalization transformation called ordered quantile normalization (Peterson and Cavanaugh, 2019).48 In the absence of tied values, the transformation is guaranteed to produce normally distributed transformed data. The transformation is of use only for FGT0, because it cannot be fully reversed one-to-one without the original data and thus has to extrapolate values not observed in the original data (ibid.).49 The result for this transformation can be seen in Figure 21, and it shows that the deviation from normality may indeed be an issue in our models. The previously observed upward bias of the CEBa models is less evident in these results. However, now that the deviation from normality is less of an issue, the UC-CEBa models show a clear downward bias. The figure adds further evidence that a bias inherent in the UC-CEBa model is offsetting the bias due to the deviation from the model's assumptions, in this case normality. Such offsetting of biases is not guaranteed to occur.

48 The transformation is the same one used by Masaki et al. (2020) in their design-based simulations.
49 The original functional transformation is only defined when a given value is in the observed original data (Peterson and Cavanaugh, 2019).

As an additional check, we also performed a hybrid experiment that consists of using the census data created in section 4.1 and a 5% SRS from each PSU to construct a synthetic census based on a twofold model. The 5% SRS sample is used to select a model with all eligible covariates (household and aggregate), following the same process described in section 4.2. Using the resulting parameter estimates from a twofold model as in (2), we create a new welfare vector in the census for all households. Then a unit-context model and a new unit model are selected, once again following the process described in section 4.2, using the first of the 500 samples where a 1% SRS by PSU is selected. This is done to remove the issue of outliers from the data and to ensure that the data generating process follows the one assumed in Eq. (2). The simulation removes the potential misspecification due to deviations from normality in the data and allows us to isolate the problem present in the unit-context model (UC-CEBa). Results for the new hybrid simulations are presented in Figures 22 and 23.

Figure 24: Box plots of design bias under 1% SRS by PSU sampling, by municipality population deciles (hybrid simulation)
Figure 25: Box plots of empirical MSE for FGT0 under 1% SRS by PSU sampling, by municipality population deciles (hybrid simulation)
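For readers wishing to replicate the spirit of the hybrid check described above, the sketch below generates a synthetic welfare vector for a census from twofold nested error model parameters. It is a sketch under stated assumptions, not the paper's procedure: the coefficient values, variance components, covariates, the data frame `census`, and the shift constant `shift` are placeholders, and PSU identifiers are assumed to be unique across municipalities.

```r
# Illustrative sketch: synthetic census welfare generated from a twofold nested
# error model, in the spirit of the hybrid simulation. All values are placeholders.
set.seed(123)

X        <- model.matrix(~ hh_size + educ_tertiary, data = census)  # assumed covariates
beta_hat <- c(7.0, -0.12, 0.40)                                      # placeholder coefficients
sig2_a   <- 0.02; sig2_ac <- 0.02; sig2_e <- 0.35                    # placeholder variances

mun <- unique(census$municipality)
psu <- unique(census$psu)                       # assumed unique across municipalities

eta_a  <- setNames(rnorm(length(mun), sd = sqrt(sig2_a)),  mun)      # municipality effects
eta_ac <- setNames(rnorm(length(psu), sd = sqrt(sig2_ac)), psu)      # PSU effects

## transformed welfare under the twofold model; reverse the log-shift afterwards
census$y_new       <- as.vector(X %*% beta_hat) +
                      eta_a[as.character(census$municipality)] +
                      eta_ac[as.character(census$psu)] +
                      rnorm(nrow(census), sd = sqrt(sig2_e))
census$welfare_new <- exp(census$y_new) - shift   # 'shift' from the chosen log-shift
```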
Note that in this simulation, where the normality issue has been removed, the upward bias that was previously present in the unit-level model (CEBa) is no longer evident. On the other hand, the previously suspected downward bias of the unit-context models (UC-CEBa) is salient, as can be seen in Figure 22 and, by municipality deciles sorted by population, in Figure 24. Under the UC-CEBa model, more than 75% of the municipalities present a downward bias (Fig. 22). This finding is aligned with what we observed in Figure 17. However, because there is no deviation from normality in the hybrid simulation, the downward bias of the unit-context models (UC-CEBa) is never offset; it is quite considerable and leads to substantially larger empirical MSEs for the unit-context models (Fig. 25).

The simulations were repeated with the model selection step removed: the model for the CEB estimators contains exactly the same covariates used to generate the welfare vector, and the UC models contain only the corresponding area aggregates. This was done to check whether the observed biases could be due to model misspecification, in the sense of the selected covariates differing from those in the true model. Results were very similar to those of the previous hybrid simulation with a model selection step. Hence, the results suggest that deviations from the assumed model are the issue, and that the countering of biases is what drives the seemingly good results for unit-context models in the two-stage sampling scenario. This highlights the importance of proper data transformations and model selection to ensure that model assumptions hold when using CensusEB methods.

Given that the direction of the bias of unit-context models is not known a priori,50 and that this bias may be large, unit-context models are unlikely to be preferred over traditional FH methods when the census auxiliary data are not aligned to the survey microdata, unless the calculation of variances is not possible for various locations, as noted before.51 In this case, benchmarking is not a recommended procedure and may not help in reducing the bias. EB estimators are approximately model-unbiased and optimal in terms of minimizing the MSE for a given area; when they are adjusted afterwards for benchmarking, so that they match the usual estimates at higher aggregation levels, the optimal properties are lost and the estimators usually become worse in terms of bias and MSE under the model. When the benchmarking adjustments are large, as those likely required by the UC variants, it is an indication that the model does not really hold for the data; in the case of UC models we have shown that the model will not hold because of omitted variable bias.

Also note that bias can lead to considerable re-ranking of locations, so a limit on the acceptable bias should usually be determined according to need. This is of particular importance when determining priorities across areas based on small area estimates. If an area's true poverty rate is 50% and the method yields an estimate of 10% due to a biased model, there is a real risk that this area may not be given assistance when needed. Molina (2019) suggests 5 or 10 percent absolute relative bias as an acceptable threshold.

50 Notice how under the simulations presented in Figures 1 and 3 the method appears to be upward biased.
51 This holds for other measures of welfare, and particularly for the ELL variants of the unit-context models.
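A practical check along the lines of the threshold just mentioned is sketched below: compute the absolute relative bias of a set of small area estimates against reference values, flag areas above a chosen cut-off, and summarize how much the poverty ranking moves. The 10 percent cut-off and the object names (`est`, `ref`, assumed to be named vectors over the same areas) are illustrative choices, not a prescription from the paper.

```r
# Illustrative check: absolute relative bias and re-ranking of areas.
# 'est' and 'ref' are named vectors of area estimates and reference values.
arb <- abs(est - ref) / ref * 100        # absolute relative bias, in percent

cutoff  <- 10                            # e.g. the 10% threshold discussed above
flagged <- names(arb)[arb > cutoff]      # areas exceeding the acceptable bias

rank_shift <- rank(est) - rank(ref)      # change in each area's poverty ranking
summary(abs(rank_shift))                 # how much the ranking moves overall
```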
An additional problem for unit-context models in many applications is that it is not possible to match census and survey PSUs: in some cases for confidentiality reasons, and in others because different sampling frames were used for the survey. The latter is likely to affect applications where the census and the survey correspond to different years. Under these scenarios, unit-context models are unlikely to be superior to FH and alternative area-level models.

5 Conclusions

In this paper we have illustrated that one of the most important aspects of SAE applications with CensusEB methods under unit-level models is the selection of a proper model, and in particular of the data transformation. Such a finding is not new and has often been echoed by others in the field (see Marhuenda et al. 2017; Molina and Marhuenda 2015; Tzavidis et al. 2018). Here we show how data transformations can lead to improved results: for example, under onefold nested error models, the aggregate gains from moving to a log-shift transformation, as opposed to just taking the natural logarithm, are close to 30 percent in terms of MSE. Nevertheless, as Marhuenda et al. (2017) note, finding an appropriate transformation is not always straightforward, and the resulting data may still stray from normality, which leads to biased estimates, as we also find here. Consequently, every SAE application should include model checks and residual analysis to verify that the model's assumptions are not clearly violated. When the data deviate from the model assumptions, the model should be changed accordingly. For instance, in the case of area outliers, fixed effects could be included for the outlying areas to reduce the design bias; if their sample sizes are not too small, the efficiency of the resulting model-based estimates may still be acceptable, even though some model parameters then apply only to those areas.

Second, we have validated SAE applications under both model-based and design-based simulation. Under model-based validation, where the data generating process follows a twofold nested error model, we note that the ELL method still performs poorly in terms of MSE even with contextual-level variables. The result is most evident in scenarios where area random effects are larger than cluster random effects, since contextual variables do not explain a sufficient share of the area-level variability. Issues regarding the underestimation of this noise under ELL are not evaluated here, but should be considered by practitioners, particularly when the noise is estimated under the MI-inspired bootstrap method (see Corral et al. 2020; Das and Chambers 2017; Marhuenda et al. 2017). On the other hand, the model-based simulations conducted here provide further evidence for the finding of Marhuenda et al. (2017) that misspecifying the model under the onefold CensusEB approach, i.e. modeling random effects at the area level only when the true model has cluster and area level random effects (with clusters nested within areas), entails virtually no loss of efficiency when estimating area-level welfare-based indicators.

Under design-based validation, where the sampling strategy mimics real-world scenarios such as those implemented in LSMS surveys, SAE methods present improvements over direct estimators.
We have also investigated estimators based on unit-context models, originally proposed by Nguyen (2012), which can be applied when census auxiliary data at the household level cannot be used. Given that under the two-stage samples used here many municipalities, which are the target areas, are represented by a small number of observations (or even none), these data are not always suitable for FH area-level models. Despite the model-based simulations yielding poor results in terms of bias for unit-context models, the CensusEB variant does considerably better than the ELL variant under the design-based simulations. The reason for the positive results is shown to be a bias inherent in the unit-context model that is offset by the bias due to deviations from normality. This offsetting is not guaranteed to occur in all scenarios, because the direction of the bias of unit-context models is not known a priori. In simulations where deviations from normality are not an issue, the bias of the unit-context method becomes quite clear. Additionally, the method's performance is contingent on the number of subdomains present in the domain. Moreover, PSUs or subdomains must be matched across census and survey for unit-context models, which is not always feasible. Finally, since the direction and size of the bias of the unit-context model are not known beforehand, these models may be considered an alternative to FH and other area-level models only when area-level models are not applicable, as in our design-based experiment with the more realistic two-stage sampling.

References

Battese, G. E., Harter, R. M., & Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83(401), 28-36.
Corral, P., Molina, I., & Nguyen, M. C. (2020). Pull your small area estimates up by the bootstraps. World Bank Policy Research Working Paper, (9256).
Das, S., & Chambers, R. (2017). Robust mean-squared error estimation for poverty estimates based on the method of Elbers, Lanjouw and Lanjouw. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(4), 1137-1161.
Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2003). Micro-level estimation of poverty and inequality. Econometrica, 71(1), 355-364.
Elbers, C., Lanjouw, J. O., & Lanjouw, P. (2002). Micro-level estimation of welfare. World Bank Policy Research Working Paper, (2911).
Fay III, R. E., & Herriot, R. A. (1979). Estimates of income for small places: An application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a), 269-277.
Foster, J., Greer, J., & Thorbecke, E. (1984). A class of decomposable poverty measures. Econometrica, 52(3), 761-766.
Ghosh, M., & Rao, J. (1994). Small area estimation: An appraisal. Statistical Science, 9(1), 55-76.
González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D., & Santamaría, L. (2008). Bootstrap mean squared error of a small-area EBLUP. Journal of Statistical Computation and Simulation, 78(5), 443-462.
Grosh, M. E., & Muñoz, J. (1996). A manual for planning and implementing the Living Standards Measurement Study survey. The World Bank.
Guadarrama, M., Molina, I., & Rao, J. (2016). A comparison of small area estimation methods for poverty mapping. Statistics in Transition New Series, 17(1), 41-66.
Henderson, C. R. (1953).
Estimation of variance and covariance components. Biometrics, 9(2), 226-252.
Huang, R., & Hidiroglou, M. (2003). Design consistent estimators for a mixed linear model on survey data. Proceedings of the Survey Research Methods Section, American Statistical Association (2003), 1897-1904.
Lange, S., Pape, U. J., & Pütz, P. (2018). Small area estimation of poverty under structural change. World Bank Policy Research Working Paper, (9383).
Marhuenda, Y., Molina, I., Morales, D., & Rao, J. (2017). Poverty mapping in small areas under a twofold nested error regression model. Journal of the Royal Statistical Society: Series A (Statistics in Society), 180(4), 1111-1136.
Masaki, T., Newhouse, D., Silwal, A. R., Bedada, A., & Engstrom, R. (2020). Small area estimation of non-monetary poverty with geospatial data. World Bank Policy Research Working Paper, (9383).
Molina, I. (2019). Desagregación de datos en encuestas de hogares: Metodologías de estimación en áreas pequeñas.
Molina, I., & Marhuenda, Y. (2015). sae: An R package for small area estimation. The R Journal, 7(1), 81-98.
Molina, I., & Morales, D. (2009). Small area estimation of poverty indicators. Estadística e Investigación Operativa, 25(3).
Molina, I., & Rao, J. (2010). Small area estimation of poverty indicators. Canadian Journal of Statistics, 38(3), 369-385.
Nguyen, M. C., Corral, P., Azevedo, J. P., & Zhao, Q. (2018). sae: A Stata package for unit level small area estimation. World Bank Policy Research Working Paper, (8630).
Nguyen, V. C. (2012). A method to update poverty maps. The Journal of Development Studies, 48(12), 1844-1863. https://doi.org/10.1080/00220388.2012.682983
Peterson, R. A., & Cavanaugh, J. E. (2019). Ordered quantile normalization: A semiparametric transformation built for the cross-validation era. Journal of Applied Statistics.
Rao, J., & Molina, I. (2015). Small area estimation (2nd ed.). John Wiley & Sons.
Tarozzi, A., & Deaton, A. (2009). Using census and survey data to estimate poverty and inequality for small areas. The Review of Economics and Statistics, 91(4), 773-792.
Torabi, M., & Rao, J. (2014). On small area estimation under a sub-area level model. Journal of Multivariate Analysis, 127, 36-55.
Tzavidis, N., Zhang, L.-C., Luna, A., Schmid, T., & Rojas-Perilla, N. (2018). From start to finish: A framework for the production of small area official statistics. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(4), 927-979.
Van Buuren, S. (2018). Flexible imputation of missing data. Chapman & Hall/CRC.
Van der Weide, R. (2014). GLS estimation and empirical Bayes prediction for linear mixed models with heteroskedasticity and sampling weights: A background study for the PovMap project. World Bank Policy Research Working Paper, (7028).
You, Y., & Rao, J. (2002). A pseudo-empirical best linear unbiased prediction approach to small area estimation using survey weights. Canadian Journal of Statistics, 30(3), 431-439.
Zhao, Q. (2006). User manual for PovMap. World Bank. http://siteresources.worldbank.org/INTPGI/Resources/341092157888460/Zhao_ManualPovMap.pdf

Appendix

Figure 26: Normal Q-Q plot of onefold unit-context model and predicted subdomain effects (nat. log transformation)
Figure 27: Normal Q-Q plot of onefold unit-context model and predicted subdomain effects (log-shift transformation)
Figure 28: Normal Q-Q plot of onefold unit-context model and predicted subdomain effects (Box-Cox of nat. log transformation)
Figure 29: Box plots of empirical design MSE for FGT0 under 1% SRS by PSU sampling (log-shift transformation)
Figure 30: Box plots of empirical design bias for FGT0 under 1% SRS by PSU sampling (log-shift transformation)
Figure 31: Box plots of empirical design MSE for FGT0 under two-stage sampling (log-shift transformation)
Figure 32: Average empirical MSE for FGT0 under two-stage sampling, by municipality population deciles (log-shift transformation)

Table 3: Unit-level model, under log-shift transformation

Variable                               Coefficient    Std. Error
PSU HH head age                        -0.0726***     (0.00700)
PSU % hh with internet                  0.133***      (0.00937)
PSU % of hh with television             0.141***      (0.00864)
Head's age                             -0.00127***    (0.000406)
HH owns a cellphone                     0.214***      (0.0155)
HH owns a computer                      0.190***      (0.0178)
Number of hh members                   -0.118***      (0.00315)
HH has access to internet               0.105***      (0.0181)
Male HH head                            0.0640***     (0.0136)
Max. education is tertiary              0.393***      (0.0156)
Share of adult hh members               0.688***      (0.0247)
State % of male head hh                -0.0683***     (0.00870)
State share of elderly population      -0.0514***     (0.0120)
HH owns a washing machine               0.119***      (0.0136)
Constant                                6.984***      (0.0351)
Observations                            23,516
Adj. R2                                 0.445
$\sigma^2_a$                            0.0220
$\sigma^2_a/\sigma^2_e$                 0.0452
Standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Table 4: Unit-context model, under log-shift transformation

Variable                                                    Coefficient    Std. Error
PSU avg. hh number of members                               -0.127***      (0.0105)
PSU share of male headed households                         -0.0595***     (0.00976)
PSU share of households with max. tertiary education         0.219***      (0.0105)
PSU avg. share of elderly                                   -0.0760***     (0.00783)
PSU avg. share of female members                            -0.0283***     (0.00913)
Municipal share of hh owning a television                    0.168***      (0.0116)
State share of hh owning a computer                          0.123***      (0.0110)
Constant                                                     7.353***      (0.0104)
Observations                                                 23,516
Adj. R2                                                      0.253
$\sigma^2_a$                                                 0.0234
$\sigma^2_a/\sigma^2_e$                                      0.0357
Standard errors in parentheses. *** p<0.01, ** p<0.05, * p<0.1

Table 5: Aggregate results for 1,865 municipalities of Mexico (FGT1), from 500 samples

                              (1)        (2)         (3)         (4)
Transformation             Box-Cox    Nat. log.  Log-shift   Log-shift
Direct      AAB (×100)       5.549      5.549       5.549       4.773
            ARMSE (×100)     6.975      6.975       6.975       5.943
CEBac       AAB (×100)       3.583      5.843       3.699       3.210
            ARMSE (×100)     3.747      6.217       3.867       3.523
CEBsa       AAB (×100)       3.665      6.068       3.773       3.417
            ARMSE (×100)     3.830      6.386       3.941       3.553
CEBa        AAB (×100)       3.427      5.889       3.533       3.205
            ARMSE (×100)     3.693      6.406       3.807       3.553
CEBc        AAB (×100)       3.548      5.779       3.680       3.556
            ARMSE (×100)     3.614      5.904       3.746       3.631
ELLc        AAB (×100)       4.209      5.783       4.209       4.516
            ARMSE (×100)     4.300      5.844       4.300       4.600
UC-CEBa     AAB (×100)       3.329      4.589       3.166       3.319
            ARMSE (×100)     3.543      5.080       3.400       3.668
UC-CEBac    AAB (×100)       3.287      4.606       3.165       3.284
            ARMSE (×100)     3.451      4.998       3.344       3.653
UC-ELLc     AAB (×100)       3.463      4.730       3.463       4.522
            ARMSE (×100)     3.574      4.797       3.574       4.637
UC-CEBsa    AAB (×100)       3.265      4.628       3.136       3.319
            ARMSE (×100)     3.470      5.122       3.357       3.638
AAB: average absolute bias; ARMSE: average root MSE. For ELL under Box-Cox and log-shift, the model is still fit on the natural log but errors are drawn from the empirical distribution. Columns 1-3 use the two-stage samples (section 4.1); column 4 uses a 1% SRS by PSU, with all PSUs included.
Table 6: Aggregate results for 1,865 municipalities of Mexico (FGT2), from 500 samples

                              (1)        (2)         (3)         (4)
Transformation             Box-Cox    Nat. log.  Log-shift   Log-shift
Direct      AAB (×100)       3.901      3.901       3.901       3.441
            ARMSE (×100)     5.017      5.017       5.017       4.361
CEBac       AAB (×100)       2.432      4.135       2.566       2.267
            ARMSE (×100)     2.539      4.421       2.682       2.476
CEBsa       AAB (×100)       2.498      4.289       2.617       2.411
            ARMSE (×100)     2.605      4.535       2.731       2.501
CEBa        AAB (×100)       2.327      4.144       2.434       2.259
            ARMSE (×100)     2.498      4.540       2.619       2.492
CEBc        AAB (×100)       2.415      4.073       2.555       2.514
            ARMSE (×100)     2.462      4.173       2.602       2.564
ELLc        AAB (×100)       3.053      4.085       3.053       3.364
            ARMSE (×100)     3.124      4.137       3.124       3.435
UC-CEBa     AAB (×100)       2.354      3.200       2.185       2.329
            ARMSE (×100)     2.482      3.558       2.335       2.547
UC-CEBac    AAB (×100)       2.306      3.204       2.182       2.303
            ARMSE (×100)     2.406      3.492       2.299       2.535
UC-ELLc     AAB (×100)       2.478      3.297       2.478       3.312
            ARMSE (×100)     2.560      3.352       2.560       3.405
UC-CEBsa    AAB (×100)       2.307      3.228       2.177       2.334
            ARMSE (×100)     2.431      3.591       2.319       2.532
AAB: average absolute bias; ARMSE: average root MSE. For ELL under Box-Cox and log-shift, the model is still fit on the natural log but errors are drawn from the empirical distribution. Columns 1-3 use the two-stage samples (section 4.1); column 4 uses a 1% SRS by PSU, with all PSUs included.

Table 7: Aggregate results for 1,865 municipalities of Mexico (Mean income), from 500 samples

                                 (1)          (2)          (3)          (4)
Transformation                Box-Cox     Nat. log.   Log-shift    Log-shift
Direct      AAB (×100)      38,838.44    38,838.44    38,838.44    30,028.16
            ARMSE (×100)    54,351.71    54,351.71    54,351.71    41,841.21
CEBac       AAB (×100)      18,374.61    45,904.53    18,276.26    16,908.87
            ARMSE (×100)    19,998.56    48,531.76    19,899.96    19,569.77
CEBsa       AAB (×100)      19,574.38    45,440.07    19,452.75    18,426.15
            ARMSE (×100)    21,207.41    47,861.59    21,069.74    19,745.09
CEBa        AAB (×100)      18,146.66    43,445.38    18,158.09    17,016.17
            ARMSE (×100)    20,596.54    47,108.70    20,591.85    19,967.33
CEBc        AAB (×100)      19,048.90    46,845.91    19,093.24    21,112.72
            ARMSE (×100)    19,534.06    47,576.32    19,570.16    21,653.29
ELLc        AAB (×100)      32,576.30    48,594.36    32,576.30    44,808.60
            ARMSE (×100)    33,650.59    49,018.62    33,650.59    45,504.33
UC-CEBa     AAB (×100)      31,330.70    67,999.59    30,715.13    29,647.72
            ARMSE (×100)    33,900.14    72,555.97    33,309.27    33,513.86
UC-CEBac    AAB (×100)      32,718.31    69,364.83    31,893.15    29,700.46
            ARMSE (×100)    34,599.54    72,841.40    33,823.68    33,765.53
UC-ELLc     AAB (×100)      53,184.14    73,673.70    53,184.14    71,189.32
            ARMSE (×100)    54,672.16    74,068.42    54,672.16    72,282.49
UC-CEBsa    AAB (×100)      33,148.54    69,355.82    32,296.15    29,642.77
            ARMSE (×100)    35,502.18    73,879.73    34,677.96    33,217.28
AAB: average absolute bias; ARMSE: average root MSE. For ELL under Box-Cox and log-shift, the model is still fit on the natural log but errors are drawn from the empirical distribution. Columns 1-3 use the two-stage samples (section 4.1); column 4 uses a 1% SRS by PSU, with all PSUs included.

Omitted variable bias for unit-context models

Consider that the data are generated as
$$y_{ah} = \beta_0 + \beta_1 x^1_{ah} + \cdots + \beta_p x^p_{ah} + \eta_a + e_{ah}, \quad h = 1, \ldots, N_a, \; a = 1, \ldots, A. \qquad (4)$$
Let us decompose $x^k_{ah}$ as $x^k_{ah} = \big(x^k_{ah} - \bar{X}^k_a\big) + \bar{X}^k_a$, where $\bar{X}^k_a = N_a^{-1}\sum_{h=1}^{N_a} x^k_{ah}$ is the population mean of $x^k_{ah}$ in area $a$. Under the unit-context model, however, we fit
$$y_{ah} = \alpha_0 + \alpha_1 \bar{X}^1_a + \cdots + \alpha_p \bar{X}^p_a + \eta_a + e_{ah}, \quad h = 1, \ldots, N_a, \; a = 1, \ldots, A. \qquad (5)$$
Note that here we are omitting the variables
$$\tilde{x}^k_{ah} = x^k_{ah} - \bar{X}^k_a, \quad k = 1, \ldots, p.$$
Let us write model (4) in matrix notation for the sample data. For this we define the vectors and matrices
$$y_a = \begin{pmatrix} y_{a1} \\ \vdots \\ y_{an_a} \end{pmatrix}, \quad
X_a = \begin{pmatrix} 1 & x^1_{a1} & \cdots & x^p_{a1} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x^1_{an_a} & \cdots & x^p_{an_a} \end{pmatrix}, \quad
\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
e_a = \begin{pmatrix} e_{a1} \\ \vdots \\ e_{an_a} \end{pmatrix}.$$
Then the model is given by
$$y_a = X_a\beta + 1_{n_a}\eta_a + e_a, \quad a = 1, \ldots, A.$$
Finally, define
$$y = \begin{pmatrix} y_1 \\ \vdots \\ y_A \end{pmatrix}, \quad
X = \begin{pmatrix} X_1 \\ \vdots \\ X_A \end{pmatrix}, \quad
Z = \mathrm{diag}\big(1_{n_1}, \ldots, 1_{n_A}\big), \quad
\eta = \begin{pmatrix} \eta_1 \\ \vdots \\ \eta_A \end{pmatrix}, \quad
e = \begin{pmatrix} e_1 \\ \vdots \\ e_A \end{pmatrix},$$
where $1_{n_a}$ is an $n_a \times 1$ column of ones. Then the model in (4) may be written as
$$y = X\beta + Z\eta + e.$$
On the other hand, for model (5) we define
$$\bar{X} = \begin{pmatrix} 1_{n_1} & \bar{X}^1_1 1_{n_1} & \cdots & \bar{X}^p_1 1_{n_1} \\ \vdots & \vdots & \ddots & \vdots \\ 1_{n_A} & \bar{X}^1_A 1_{n_A} & \cdots & \bar{X}^p_A 1_{n_A} \end{pmatrix}.$$
Then, for $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)'$, model (5) can be written as
$$y = \bar{X}\alpha + Z\eta + e,$$
and the WLS estimator of $\alpha$ in model (5) is
$$\hat{\alpha}_{UC} = \big(\bar{X}'V^{-1}\bar{X}\big)^{-1}\bar{X}'V^{-1}y. \qquad (6)$$
However, $y$ actually follows model (4), that is,
$$y = \tilde{X}\beta_{(0)} + \bar{X}\beta + Z\eta + e, \qquad (7)$$
where
$$\beta_{(0)} = \begin{pmatrix} \beta_1 \\ \vdots \\ \beta_p \end{pmatrix}, \quad
\tilde{X} = \begin{pmatrix} \tilde{X}_1 \\ \vdots \\ \tilde{X}_A \end{pmatrix}, \quad
\tilde{X}_a = \begin{pmatrix} x^1_{a1}-\bar{X}^1_a & \cdots & x^p_{a1}-\bar{X}^p_a \\ \vdots & \ddots & \vdots \\ x^1_{an_a}-\bar{X}^1_a & \cdots & x^p_{an_a}-\bar{X}^p_a \end{pmatrix}.$$
Replacing (7) into (6), we get
$$\hat{\alpha}_{UC} = \big(\bar{X}'V^{-1}\bar{X}\big)^{-1}\bar{X}'V^{-1}\bar{X}\beta
+ \big(\bar{X}'V^{-1}\bar{X}\big)^{-1}\bar{X}'V^{-1}\tilde{X}\beta_{(0)}
+ \big(\bar{X}'V^{-1}\bar{X}\big)^{-1}\bar{X}'V^{-1}\big(Z\eta + e\big).$$
Taking expectations, and since $E[\eta] = 0$ and $E[e] = 0$, we get
$$E\big[\hat{\alpha}_{UC}\big] = \beta + B\big(\hat{\alpha}_{UC}\big),$$
where the bias is equal to
$$B\big(\hat{\alpha}_{UC}\big) = \big(\bar{X}'V^{-1}\bar{X}\big)^{-1}\bar{X}'V^{-1}\tilde{X}\beta_{(0)},$$
with
$$\bar{X}'V^{-1}\bar{X} = \frac{1}{\sigma^2_\eta}
\begin{pmatrix}
\sum_a \gamma_a & \sum_a \gamma_a \bar{X}^1_a & \cdots & \sum_a \gamma_a \bar{X}^p_a \\
\sum_a \gamma_a \bar{X}^1_a & \sum_a \gamma_a \big(\bar{X}^1_a\big)^2 & \cdots & \sum_a \gamma_a \bar{X}^1_a \bar{X}^p_a \\
\vdots & \vdots & \ddots & \vdots \\
\sum_a \gamma_a \bar{X}^p_a & \sum_a \gamma_a \bar{X}^1_a \bar{X}^p_a & \cdots & \sum_a \gamma_a \big(\bar{X}^p_a\big)^2
\end{pmatrix}
\quad\text{and}\quad
\gamma_a = \frac{\sigma^2_\eta}{\sigma^2_\eta + \sigma^2_e/n_a}.$$
Additionally, we obtain
$$\bar{X}'V^{-1}\tilde{X}\beta_{(0)} =
\begin{pmatrix}
\sum_a 1_{n_a}'V_a^{-1}\tilde{X}_a\beta_{(0)} \\
\sum_a \bar{X}^1_a\, 1_{n_a}'V_a^{-1}\tilde{X}_a\beta_{(0)} \\
\vdots \\
\sum_a \bar{X}^p_a\, 1_{n_a}'V_a^{-1}\tilde{X}_a\beta_{(0)}
\end{pmatrix},$$
noting that
$$1_{n_a}'V_a^{-1}\tilde{X}_a = \frac{\gamma_a}{\sigma^2_\eta}\,\frac{1}{n_a}\,1_{n_a}'\tilde{X}_a = \frac{\gamma_a}{\sigma^2_\eta}\,\tilde{x}_a',$$
where $\tilde{x}_a = \big(\tilde{x}^1_a, \ldots, \tilde{x}^p_a\big)'$, with
$$\tilde{x}^k_a = \frac{1}{n_a}\sum_{h\in S_a}\big(x^k_{ah} - \bar{X}^k_a\big) = \bar{x}^k_a - \bar{X}^k_a,$$
and where $S_a$ denotes the set of survey sample households in area $a$. Therefore,
$$\bar{X}'V^{-1}\tilde{X}\beta_{(0)} = \frac{1}{\sigma^2_\eta}
\begin{pmatrix}
\sum_a \gamma_a \sum_k \tilde{x}^k_a\beta_k \\
\sum_a \gamma_a \bar{X}^1_a \sum_k \tilde{x}^k_a\beta_k \\
\vdots \\
\sum_a \gamma_a \bar{X}^p_a \sum_k \tilde{x}^k_a\beta_k
\end{pmatrix}.$$
Finally, the bias is given by
$$B\big(\hat{\alpha}_{UC}\big) =
\begin{pmatrix}
\sum_a \gamma_a & \sum_a \gamma_a \bar{X}^1_a & \cdots & \sum_a \gamma_a \bar{X}^p_a \\
\sum_a \gamma_a \bar{X}^1_a & \sum_a \gamma_a \big(\bar{X}^1_a\big)^2 & \cdots & \sum_a \gamma_a \bar{X}^1_a\bar{X}^p_a \\
\vdots & \vdots & \ddots & \vdots \\
\sum_a \gamma_a \bar{X}^p_a & \sum_a \gamma_a \bar{X}^1_a\bar{X}^p_a & \cdots & \sum_a \gamma_a \big(\bar{X}^p_a\big)^2
\end{pmatrix}^{-1}
\begin{pmatrix}
\sum_a \gamma_a \sum_k \tilde{x}^k_a\beta_k \\
\sum_a \gamma_a \bar{X}^1_a \sum_k \tilde{x}^k_a\beta_k \\
\vdots \\
\sum_a \gamma_a \bar{X}^p_a \sum_k \tilde{x}^k_a\beta_k
\end{pmatrix}.$$
Consequently, the bias of $\hat{\alpha}_{UC}$ is due to the discrepancies between the sample means of the covariates and their population means, $\tilde{x}^k_a = \bar{x}^k_a - \bar{X}^k_a$.

Bias of individual predictors under unit-context models

Under model (5),
$$y_{ah} = \bar{X}_a\alpha + \eta_a + e_{ah},$$
and the conditional expectation under model (5) for $h \in S_a$ is
$$\mu^{UC}_{ah|s} = E[y_{ah}\mid y_s] = \bar{X}_a\alpha + \tilde{\eta}_a, \qquad \tilde{\eta}_a = \gamma_a\big(\bar{y}_a - \bar{X}_a\alpha\big).$$
Then, replacing $\alpha$ with its estimator $\hat{\alpha}_{UC}$,
$$\hat{\mu}^{UC}_{ah|s} = \bar{X}_a\hat{\alpha}_{UC} + \gamma_a\big(\bar{y}_a - \bar{X}_a\hat{\alpha}_{UC}\big) = \gamma_a\bar{y}_a + (1-\gamma_a)\bar{X}_a\hat{\alpha}_{UC},$$
and taking its expectation we get
$$E\big[\hat{\mu}^{UC}_{ah|s}\big] = \gamma_a E[\bar{y}_a] + (1-\gamma_a)\bar{X}_a E\big[\hat{\alpha}_{UC}\big],$$
where $E[\bar{y}_a] = \bar{x}_a\beta$ and $E[\hat{\alpha}_{UC}] = \beta + B(\hat{\alpha}_{UC})$. Consequently,
$$E\big[\hat{\mu}^{UC}_{ah|s}\big] = \bar{X}_a\beta + \gamma_a\big(\bar{x}_a - \bar{X}_a\big)\beta + (1-\gamma_a)\bar{X}_a B\big(\hat{\alpha}_{UC}\big).$$
Note that, once again, the discrepancy between the sample and population means of the covariates plays a role in the bias.
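As a reading aid (a reformulation we add here for intuition, not a result stated explicitly above), note that the common factor $1/\sigma^2_\eta$ cancels between $\bar{X}'V^{-1}\bar{X}$ and $\bar{X}'V^{-1}\tilde{X}\beta_{(0)}$, so the bias can be written as the coefficient vector of a $\gamma_a$-weighted least squares regression carried out at the area level:
$$B\big(\hat{\alpha}_{UC}\big) = \Big(\sum_a \gamma_a\,\bar{\mathbf{x}}_a\bar{\mathbf{x}}_a'\Big)^{-1}\sum_a \gamma_a\,\bar{\mathbf{x}}_a\Big(\sum_{k=1}^p \tilde{x}^k_a\beta_k\Big), \qquad \bar{\mathbf{x}}_a = \big(1,\bar{X}^1_a,\ldots,\bar{X}^p_a\big)'.$$
In words, $B(\hat{\alpha}_{UC})$ regresses the area-level quantity $\sum_k\big(\bar{x}^k_a-\bar{X}^k_a\big)\beta_k$, the part of welfare driven by the sample-versus-census discrepancy in covariate means, on the census means, with weights $\gamma_a$. The bias is zero only when these weighted cross-products vanish, which cannot be guaranteed in any particular sample.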