Global Poverty Monitoring Technical Note 35

Changes to the Extrapolation Method for Global Poverty Estimation

Daniel Gerszon Mahler and David Newhouse

March 2024

Keywords: Extrapolation, nowcasting, missing data, national accounts

Development Data Group
Development Research Group
Poverty and Equity Global Practice Group

Abstract

This technical note summarizes changes to how the Poverty and Inequality Platform (PIP) lines up survey-based estimates of poverty to a common reference year. The prior line-up method assumed that welfare vectors grow in accordance with growth in real Household Final Consumption Expenditure (HFCE) per capita for all countries except for countries in Sub-Saharan Africa, where growth in real GDP per capita was used instead. This note leverages PIP data to test various alternative line-up rules and evaluates their performance out of sample. It proposes an equally simple rule that can reduce the error as measured by the mean absolute deviation by 7.5%, which is estimated to amount to 60% of the potential error that can be reduced given available information. This rule is that upper middle-income and high-income countries use growth in real HFCE per capita, while low and lower middle-income countries use growth in real GDP per capita, and that only 70% of growth in either HFCE or GDP passes through to growth in consumption vectors.

Both authors are with the World Bank Development Economics Data Group. Corresponding author: Daniel Gerszon Mahler (dmahler@worldbank.org). The authors are thankful for comments and guidance received from the Global Poverty Working Group under the leadership of Deon Filmer, Haishan Fu, Luis-Felipe Lopez-Calva, and Carolina Sanchez-Paramo. The authors are also grateful for comments from Christoph Lakner, Espen Beer Prydz and Sergio Olivieri.
The authors gratefully acknowledge financial support from the UK government through the Data and Evidence for Tackling Extreme Poverty (DEEP) Research Programme.

The Global Poverty Monitoring Technical Note Series publishes short papers that document methodological aspects of the World Bank’s global poverty estimates. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Global Poverty Monitoring Technical Notes are available at https://pip.worldbank.org/publication.

1. Introduction

This note summarizes changes to the method used by the Poverty and Inequality Platform (PIP) for extrapolating, or “lining up”, country poverty rates to a reference year. Poverty rates are calculated using household surveys that collect household per capita consumption or income (referred to henceforth as ‘welfare’) depending on the country. Lining up these survey-based estimates is necessary because the surveys measuring poverty are not fielded each year, but global and regional poverty estimates need to be reported for a common year.

The prior method shifted the welfare measures obtained from a household survey by a common scale factor, which is equal to a measure of growth taken from national accounts data: growth in real Household Final Consumption Expenditure per capita (henceforth ‘growth in HFCE’) for countries outside of Sub-Saharan Africa, and growth in real GDP per capita (henceforth ‘growth in GDP’) for countries in Sub-Saharan Africa (Prydz et al. 2019). This rests on four key assumptions:

1. The shape of the welfare distribution changes little over the extrapolation period.
2. Growth in welfare can be proxied reasonably well by growth from national accounts data.
3. All growth from national accounts data is passed through to growth in welfare.
4. In Sub-Saharan Africa, growth in GDP is a better proxy for growth in welfare than growth in HFCE, while the opposite is the case everywhere else.

In recent research, we have found that the first two assumptions are difficult to improve upon without using microsimulation tools (Mahler et al. 2022). In particular, efforts to model inequality add little to nowcasting accuracy, and more sophisticated models using a thousand variables to predict growth in the mean perform about as well as scaling up the mean by a fraction of GDP growth. In contrast, the third assumption is not supported by the data, as there is a large and growing gap between means from national accounts and those from survey data (Prydz et al. 2022).

Little systematic research has been carried out on the final assumption regarding the relevance of HFCE in Sub-Saharan Africa for extrapolations. The decision to use growth in GDP instead of HFCE dates to the early 2000s and was justified partly by the difficulty of obtaining HFCE estimates in Sub-Saharan Africa. The other justification is the weak empirical correlation between growth in HFCE and survey consumption in that region using data from the 1990s, although the reported correlation was equally weak in several other regions (Ravallion 2003).

To address concerns about the third and fourth assumptions, we propose two changes summarized in Table 1. These changes are proposed because they offer a combination of accuracy and simplicity. The first major change introduces a “passthrough rate”, a factor that determines what share of growth from national accounts “passes through” to the survey welfare measure. We adopt a passthrough rate of 0.7 for countries that use consumption to measure welfare.
In other words, if a country using consumption grew 2 percent the year following a survey according to national accounts, consumption measured in the survey would be increased by 1.4 percent. The selection of a 0.7 passthrough rate, and the decision to apply it only to countries that use a consumption aggregate, is based on empirical evidence presented in what follows, which shows a systematic tendency for growth in mean consumption measured in surveys to fall short of growth in national accounts. There is no similar pattern for income.

Table 1: Summary of proposed changes

                             Prior method                        New method
Passthrough rate             1 (full passthrough) for all        0.7 for consumption aggregates;
                             countries                           1 for income aggregates
Measure of economic growth   HFCE outside of Sub-Saharan         HFCE for upper middle-income and
                             Africa; GDP in Sub-Saharan          high-income countries; GDP for low
                             Africa                              and lower middle-income countries

The World Bank has used passthrough rates for nowcasting and forecasting since at least 2016 (World Bank 2016). This has generated an inconsistency between the method used for extrapolating backwards and forwards for historical estimates, and the method used for generating more recent estimates. The approach developed here removes this inconsistency.

The second major change alters the measure of economic growth that is used to scale up survey-based estimates of welfare. In particular, we suggest using GDP growth for all countries classified as low-income or lower middle-income, and HFCE growth for countries classified as upper middle-income and high-income.

Myriad methods exist to extrapolate estimates of poverty backwards and forwards in time (World Bank 2023). In this paper, we restrict ourselves to a very narrow set of methods: those that extrapolate entire welfare vectors using a fraction of growth rates in national accounts.
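The proposed rule in Table 1 can be illustrated with a minimal Python sketch. The function names, the income-group labels and the list-based welfare vector are assumptions made for illustration and are not part of PIP; only the passthrough rates (0.7 and 1) and the choice of GDP or HFCE by income group are taken from this note. The sketch assumes the passthrough rate is applied to the cumulative national accounts growth between the survey year and the reference year.

```python
# Illustrative sketch of the proposed line-up rule summarized in Table 1.
# All names are hypothetical; only the rates and groupings come from the note.

GDP_GROUPS = {"LIC", "LMIC"}     # low and lower middle-income: real GDP per capita
HFCE_GROUPS = {"UMIC", "HIC"}    # upper middle and high-income: real HFCE per capita

PASSTHROUGH = {"consumption": 0.7, "income": 1.0}


def national_accounts_measure(income_group):
    """Return which national accounts series drives the extrapolation."""
    if income_group in GDP_GROUPS:
        return "GDP"
    if income_group in HFCE_GROUPS:
        return "HFCE"
    raise ValueError(f"unknown income group: {income_group!r}")


def scale_factor(na_growth, welfare_type):
    """Common factor 1 + beta * g applied to every household's welfare."""
    if welfare_type not in PASSTHROUGH:
        raise ValueError(f"unknown welfare type: {welfare_type!r}")
    return 1.0 + PASSTHROUGH[welfare_type] * na_growth


def extrapolate(welfare_vector, na_growth, welfare_type):
    """Rescale a survey welfare vector to the reference year."""
    factor = scale_factor(na_growth, welfare_type)
    return [y * factor for y in welfare_vector]
```

With 2 percent growth and a consumption aggregate, `scale_factor(0.02, "consumption")` gives a factor of roughly 1.014, in line with the 1.4 percent example above. The sketch rescales every household's welfare by one common factor, as do all the methods considered in this paper.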
These are referred to as “distribution-neutral” methods because they assume that a common scale factor is applied to all households’ welfare in the survey. It is possible that microsimulation tools, which allow different households to experience different growth rates, could outperform the methods used here, although we are not yet aware of systematic global evidence that demonstrates this. For a discussion of how to evaluate microsimulation models, see Walsh and De Menezes (2023).

The rest of the paper is organized as follows. The next section outlines the prior method used for lining up in PIP while section 3 details how we test alternative approaches. Section 4 presents the main results and section 5 contains robustness checks. Section 6 shows the implications for global and regional poverty while section 7 discusses the results and concludes.

2. Prior approach

Few countries have survey estimates of poverty available every year. To estimate poverty at the regional and global level, the survey estimates need to be aligned to a reference year and aggregated. Such alignment and aggregation require assumptions about how to interpolate and extrapolate data. Below we explain how the extrapolation was done prior to the changes implemented with this paper.

For countries that do not have welfare aggregates at or since a specific reference year, but for which earlier welfare aggregates are available, PIP extrapolates their most recent aggregate forward using growth rates from national accounts. This is done by first finding the growth in national accounts that occurred between the survey and the reference year, $g_{NA,s,r}$, and scaling the survey welfare distribution, $f(y_s)$, by this growth factor.
This yields an estimate of the welfare distribution at the reference year, $f(y_r)$, as summarized below:

$$f(y_r) = (1 + \beta \cdot g_{NA,s,r}) \cdot f(y_s) \qquad [1]$$

Prior to the changes proposed in this paper, real GDP per capita was used for countries in Sub-Saharan Africa, while real HFCE per capita was used for all other countries. $\beta$ indicates the share of growth from national accounts passed through to growth in welfare. In the prior approach, this was assumed to be 1.

Poverty for the reference year is then estimated using this extrapolated distribution. The extrapolation method assumes distribution-neutral growth, i.e. that everyone’s welfare grows at the same rate. This implies that inequality is assumed to stay constant. A similar approach is used to extrapolate backwards, when the earliest survey estimate available is more recent than the desired reference year.

Figure 1 shows the consumption aggregate from the 2015.5 Ethiopian survey expressed in 2017 PPP-adjusted dollars, and the 2018 Ethiopian distribution one would get from extrapolating this welfare vector to 2018 using the method described above. The 2015.5 daily consumption mean is $3.12 per capita and the growth rate in national accounts between 2015.5 and 2018 is 14.5%. The positive growth pushes the distribution to the right, changing the share of extreme poor from 30.8% in 2015.5 to 22.3% in 2018.

The extrapolation only applies after the latest survey or before the first survey. In cases where the reference year falls between two surveys, poverty is interpolated for the reference year using the nearest survey on each side of the reference year, and the resulting estimates are averaged. In general, such interpolations are less dependent on whether a passthrough rate less than 1 is applied and which measure from national accounts is used. For more information on interpolations, see Prydz et al. (2019). This paper only deals with extrapolations.

Figure 1: Illustration of extrapolation

3. Data and Methodology

3.1 Sample selection

We conduct an empirical analysis of 1,940 surveys from PovcalNet, the precursor to PIP, covering 158 countries between 1981 and 2019. We rely on the spring 2021 vintage of the data. We merge these data with national accounts data from the World Development Indicators, supplemented when necessary with data from the World Economic Outlook and Maddison Project Database. We retain information on real GDP per capita, real HFCE per capita, real FCE per capita (the sum of HFCE and government expenditure) and real GNI per capita. We chose these variables based on what has been used in past research to extrapolate welfare vectors forward and what was found to be most predictive of growth in mean welfare in a companion paper (Mahler et al. 2022).

We convert these data into a dataset of spells. A spell refers to two adjacent household surveys from a country measured using comparable welfare aggregates, using PIP’s information on survey comparability within countries over time. We retain 1,501 spells spanning 141 countries. For each spell, we calculate the annualized growth in mean welfare and the annualized growth in the four national accounts variables mentioned above.

Instead of computing spells using adjacent household survey data, we could have calculated spells using any two comparable estimates, regardless of whether other surveys were conducted in between. Suppose for example a country has comparable welfare aggregates in 2010, 2015, and 2020. Our method consists of creating two spells from 2010-15 and 2015-20, but we could likewise have created a spell from 2010-2020. Including these longer spells does not change our qualitative results but lowers the errors of all models we test, likely because idiosyncratic errors decrease in importance over longer spells (Appendix Figure A.1).
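The construction of adjacent spells can be sketched as follows. This is an illustration under stated assumptions rather than the code used for the note: the `Survey` record, its `comparability_id` field (standing in for PIP's comparability information) and the geometric formula for annualized growth are all hypothetical choices, since the note does not spell out how growth is annualized.

```python
from dataclasses import dataclass


@dataclass
class Survey:
    country: str
    year: float              # decimal years such as 2015.5 are allowed
    mean_welfare: float
    comparability_id: int    # hypothetical: equal ids mean comparable aggregates


def annualized_growth(start_value, end_value, years):
    """Assumed geometric annualization of growth over a spell."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


def adjacent_spells(surveys):
    """Pair each survey with the next one in the same country if comparable."""
    ordered = sorted(surveys, key=lambda s: (s.country, s.year))
    spells = []
    for first, second in zip(ordered, ordered[1:]):
        years = second.year - first.year
        if (first.country == second.country
                and first.comparability_id == second.comparability_id
                and years > 0):
            spells.append({
                "country": first.country,
                "start": first.year,
                "end": second.year,
                "g_welf": annualized_growth(
                    first.mean_welfare, second.mean_welfare, years),
            })
    return spells
```

For a country with comparable surveys in 2010, 2015 and 2020, this yields the two spells 2010-15 and 2015-20 and never the longer 2010-2020 spell.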
We chose only to work with adjacent spells in the main specification because these spell lengths reflect the approximate length for which actual extrapolations take place. In other words, the performance over 20 years for a country with annual data is less relevant for this exercise because we rarely need to extrapolate 20 years for such a country.

Another choice would have been not to restrict our sample to comparable spells but to use any spells, even if the welfare aggregates are not comparable. This could be relevant under the following rationale: Suppose that incomparabilities often happen because the survey instrument changes in such a way that more consumption gets captured in the newer survey. This would imply that passthrough rates would be higher (as welfare growth between non-comparable surveys would be higher than between comparable surveys, all else equal). If we believe that there will be similar noncomparabilities in the future, then accounting for them in the passthrough rates would give nowcasts closer to what eventually will be captured from survey data. We opt only to use comparable spells because we are interested in extrapolating the current welfare vector forward and backwards in time in a manner that retains comparability with the survey-based welfare aggregate. Because of this, the extrapolations should be interpreted as changes in poverty due to economic factors, excluding changes in how consumption or income is measured in surveys.

Survey data from before 2000 are rarely used, and we are arguably more interested in extrapolating poverty from 2000 onwards. As another robustness check, we try only using spells that start in 2000 or later to see if this subsample yields different results. This does not appear to be the case, though it again lowers the errors of all models, likely because welfare aggregates were measured with more noise in earlier decades.
Using post-2000 data, Final Consumption Expenditure performs relatively better as the national accounts measure (Appendix Figure A.2).

3.2 Model estimation

To estimate passthrough rates and evaluate different types of national accounts data, we run regressions of the following form:

$$g_{welf,c,s} = \beta \cdot g_{NA,c,s} + \gamma \cdot g_{NA,c,s} \cdot X_{c,s} + \varepsilon_{c,s} \qquad [2]$$

Here $g_{welf,c,s}$ represents the annualized growth rate in mean welfare (consumption or income) per capita in country $c$ for spell $s$. $g_{NA,c,s}$ represents the annualized growth rate taken from national accounts data between the two surveys of a particular spell. $X_{c,s}$ represents a candidate characteristic for calculating separate passthrough rates for different countries or spells. We examine seven different characteristics: region, welfare type (income vs. consumption), fragility status, income group, the decade of the spell, the length of the spell, and the sign of the growth rate. For income group and fragility status, we use the classification at the end of the spell to maximize the number of spells for which we can assign an income group or fragility status.[1] $\beta + \gamma \cdot X_{c,s}$ equals the estimated passthrough rate for country $c$ and spell $s$, that is, the percentage point change in the welfare growth rate associated with a one percentage point increase in the national accounts growth rate.

[1] Spells ending before 1989 have no income classification and spells ending before 2000 have no fragility classification, using World Bank definitions. When running regressions, we assign these spells to a separate group so the regressions all have the same sample of spells. In practical implementations of our preferred approach, which relies on income group status, we suggest forward-extrapolating the last welfare vector year-by-year continually using the income classification of the year in question. For backward-extrapolating the welfare vector to the 1980s, we suggest using the first income classification available.

The specification largely follows Ravallion (2003), with the exception that we do not include an intercept term. Including an intercept would make the passthrough rates harder to communicate. Rather than the passthrough rate equaling $\beta + \gamma \cdot X_{c,s}$, it would equal $\beta + \gamma \cdot X_{c,s} + \alpha / g_{NA,c,s}$, and every single spell would have its own passthrough rate based on its national accounts growth rate. Including an intercept would also imply that even if a country has no growth in its national accounts, its mean welfare would grow by $\alpha$, which seems counterintuitive. For these reasons, we prefer the specification without intercepts. As a robustness check, we ran the regressions with an intercept. The intercepts were rarely significantly different from zero (a finding similar to Ravallion 2003) and excluding them gave better out-of-sample accuracy (Appendix Figure A.3). This was also true when we included group-specific intercepts defined by $X_c$.

When running the regressions, we weight each country equally, meaning that for a country with 10 spells, each spell gets a weight of 0.1, while for a country with two spells, each spell gets a weight of 0.5.

HFCE, GNI, and FCE estimates are missing for some spells. To avoid the results being driven by the use of different samples across regressions, we replace missing predictions when HFCE, GNI, or FCE is missing with the predictions from the same regression using GDP data.

3.3 Model evaluation

To evaluate the predictions, we use ten-fold cross-validation to ensure that we do not overfit the model. We repeat ten-fold cross-validation ten times to average out any randomness associated with a particular fold partition. We also tried using temporal-block cross-validation (Roberts et al. 2017), which instead of dividing the spells into random folds, sequentially holds out data from a specific time range. This does not change our results (Appendix Figure A.4).

To evaluate the fit, we calculate the mean absolute deviation (MAD) between the predicted and the withheld survey growth rate in mean welfare:

$$MAD = \frac{1}{N_C} \sum_{c \in C} \frac{1}{N_{S_c}} \sum_{s \in S_c} \left| \hat{g}_{welf,c,s} - g^{*}_{welf,c,s} \right| \qquad [3]$$

Here $C$ is the set of countries, $S_c$ is the set of spells of country $c$, $N_C$ is the number of countries in the data, $N_{S_c}$ is the number of spells of country $c$, $g^{*}_{welf,c,s}$ is the (true) growth rate in welfare calculated from the withheld survey data, and $\hat{g}_{welf,c,s}$ is the predicted growth rate in welfare estimated from the regression. Each country is given equal weight in the calculation of the mean absolute deviation, so the accuracy measure is not disproportionately weighted towards countries with more survey data.

We prefer the mean absolute deviation over the mean squared error because the latter gives disproportionate weight to extreme outcomes. In Appendix Figure A.5, we show that using the mean squared error gives similar results.

When using the MAD, we create an inconsistency between how we estimate passthrough rates (which relies on an OLS regression, and hence minimizes the MSE) and how we assess the performance of this model fit, which relies on absolute deviations. As a robustness check, we instead run quantile regressions estimating the conditional median, as quantile regressions minimize mean absolute deviations.
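The estimation and evaluation steps of sections 3.2 and 3.3 can be sketched in plain Python for the simplest case of a single passthrough rate with no covariate. This is a hedged illustration, not the authors' code: the function names are hypothetical, the closed-form slope is ordinary weighted least squares through the origin, and computing the country weights within each training sample (rather than on the full sample) is an assumption.

```python
import random
from collections import Counter, defaultdict


def country_weights(countries):
    """Equal weight per country: each spell gets 1 / (spells of its country)."""
    counts = Counter(countries)
    return [1.0 / counts[c] for c in countries]


def fit_passthrough(g_na, g_welf, weights):
    """Weighted least squares slope with no intercept (single passthrough rate)."""
    numerator = sum(w * x * y for w, x, y in zip(weights, g_na, g_welf))
    denominator = sum(w * x * x for w, x in zip(weights, g_na))
    return numerator / denominator


def country_weighted_mad(countries, predicted, actual):
    """Equation [3]: average absolute error within country, then across countries."""
    abs_errors = defaultdict(list)
    for c, p, a in zip(countries, predicted, actual):
        abs_errors[c].append(abs(p - a))
    country_means = [sum(v) / len(v) for v in abs_errors.values()]
    return sum(country_means) / len(country_means)


def repeated_kfold_mad(countries, g_na, g_welf, k=10, repeats=10, seed=0):
    """Average out-of-sample MAD over repeated random k-fold partitions."""
    rng = random.Random(seed)
    n = len(countries)
    scores = []
    for _ in range(repeats):
        order = list(range(n))
        rng.shuffle(order)
        predicted = [0.0] * n
        for fold in range(k):
            held_out = set(order[fold::k])          # every k-th shuffled spell
            train = [i for i in range(n) if i not in held_out]
            beta = fit_passthrough(
                [g_na[i] for i in train],
                [g_welf[i] for i in train],
                country_weights([countries[i] for i in train]),
            )
            for i in held_out:
                predicted[i] = beta * g_na[i]       # predict withheld spells
        scores.append(country_weighted_mad(countries, predicted, g_welf))
    return sum(scores) / len(scores)
```

The same loop could wrap a median (quantile) regression in place of `fit_passthrough`, which corresponds to the robustness check of running quantile regressions instead of OLS.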
Doing so gives rise to the same model selection (Appendix Figure A.6), with one exception: whereas our main results suggest using GDP for low-income and lower middle-income countries, quantile regressions give a marginally lower error when only using GDP for low-income countries. Yet, as the error of that model is greater than that of our preferred model using OLS, we stick to OLS.

We do not estimate the uncertainty of the mean absolute deviation of the different models. This is largely because our accuracy measures are obtained using cross-validation. Estimating proper measures of uncertainty from statistics obtained using cross-validation is not straightforward due to correlations between the measured accuracies of each fold (Bates et al. 2022). Although nested cross-validation methods show some promise, their ability to produce accurate uncertainty estimates has not been established in all settings.

4. Results

4.1 Finding the best performing model

We first calculate the mean absolute deviation from running various iterations of regression [2] with different national accounts measures and different covariates to estimate group-specific passthrough rates. We consider nine candidate covariates for which passthrough rates are allowed to differ, the prior practice of setting the passthrough rate equal to one, and a constant passthrough rate estimated for all observations (without using any covariate, referred to as “no covariate” in the figures that follow).

The results are shown in Figure 2. Each dot is the result of a different regression with different estimated passthrough rates (save for when the passthrough rate is set equal to one). Each color shows the regression results using a different national accounts measure in equation [2].

Figure 2: Error by national accounts measure and covariate
Note: The legend refers to the national accounts measure used. X-axis categories are sorted by the average error across national accounts measures.
The error with FCE and passthrough=1 is larger than the maximum value of the y-axis. “No covariate” refers to running the regression without a covariate and estimating one unique passthrough rate for all observations.

All methods yield errors that are quite high. The prior method, which used GDP for Sub-Saharan Africa and HFCE for the remaining countries along with a passthrough equal to one, has an average error of 3.58. This means that on average across all spells, the annualized predicted growth rate in mean welfare is 3.58 percentage points off the survey-based estimates. By comparison, the standard deviation of the distribution of annualized growth in mean welfare is 6.

The error is reduced when passthrough rates are applied. Most notably, a passthrough rate by income and consumption type provides the lowest average error. Using a passthrough rate by income and consumption together with using GDP for Sub-Saharan Africa and HFCE for all else would reduce the error from 3.58 to 3.37, a reduction of 6%.

Somewhat surprisingly, using passthrough rates that vary by covariates can yield higher average errors than using a constant passthrough rate for all observations. This indicates that it is very easy to overfit the data. To check this more thoroughly, we look at the average errors if an income-consumption specific passthrough rate is used in conjunction with region and/or income group specific passthrough rates, the two candidate covariates that provided the best fit with the mixed national accounts measure. Figure 3 shows the results. Evidently, combining income-consumption passthrough rates with region and/or income group specific passthrough rates increases the error. Hence, more complex models easily increase the error.

Figure 3: Error by national accounts measure and multiple covariates
Note: The legend refers to the national accounts measure used.
The evidence thus far supports the national accounts measure that uses GDP for Sub-Saharan Africa and HFCE for the rest, while having a separate passthrough rate for income and consumption aggregates. This mixed national accounts measure may work well because the calculation of GDP in developing countries gets relatively more attention than HFCE. Contrary to common beliefs, there is no compelling evidence that developing countries are worse at measuring GDP growth than developed countries (Angrist et al. 2021). HFCE may be measured more poorly in developing countries because countries with low statistical capacity measure HFCE as the residual portion of GDP. If this is indeed the case, it would be better to use HFCE for countries with high statistical capacity and GDP for countries with low statistical capacity, rather than using GDP for countries in Sub-Saharan Africa.

Checking this requires a measure of statistical capacity. This runs into the challenge that, as far as we know, there is no cross-country series of statistical capacity dating from 1981 to the present. Two of the World Bank’s attempts at measuring statistical capacity (the statistical capacity index and its successor, the statistical performance indicators) do not have sufficiently long temporal coverage. We therefore proxy statistical capacity with country income groups, which are based on thresholds for per capita GNI determined by the World Bank. We see if using GDP for countries below a certain income level can improve the fit. The results are shown in Figure 4. In Appendix Figure A.7, we proxy statistical capacity with the share of years in which an international poverty estimate exists within a five-year window on each side of a reference year. This does not give rise to a lower error than what is shown below.

Figure 4: Error by use of GDP for various subgroups
Note: GDP shows the error if only using GDP data, and similarly for HFCE.
LIC = Low-income country, LMIC = Lower middle-income country, SSA = Sub-Saharan Africa.

Using GDP for low-income and lower middle-income countries (LICs and LMICs) instead of for Sub-Saharan Africa reduces the error from 3.37 to 3.32. One could question whether it is precisely at the LMIC/UMIC cut-off that HFCE starts being more accurate at predicting growth in welfare. We explore this by looking at the errors of only using GDP or HFCE at any given welfare level. Though average welfare does not map exactly to income groups, for each average welfare level, we overlay the likeliest income group. The results are shown in Figure 5.

Figure 5: Error using GDP and HFCE by welfare level
Note: Average error using only GDP or only HFCE data for spells around a given mean welfare level. Income-group overlays show the likeliest income group at each welfare level.

It turns out that it is exactly at the UMIC threshold that HFCE tends to produce a lower error than GDP. For that reason, our preferred model is to use GDP for LICs and LMICs and HFCE for UMICs and HICs. Doing so leaves us with a reduction in the error of 7.5% vis-à-vis the prior method. Although this may seem like a small reduction in the error, there is a lot of irreducible error in the estimates, so getting to zero, or even close to zero, is impossible. One useful benchmark is the error from using 1000+ variables and machine learning to predict growth in mean consumption. Such an exercise from Mahler et al. (2022) gave an error of 3.16 percentage points. If we treat this as a lower bound, the suggested model would eliminate more than 60% of the reducible error.

4.2 Estimating the passthrough rates

The best-performing model from the prior analysis separates spells by whether an income or consumption aggregate is used and by whether the country is a LIC/LMIC or UMIC/HIC. This means that there will be four passthrough rates. Estimating these passthrough rates on all past data gives the results shown in Figure 6.
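A minimal sketch of how four group-specific passthrough rates could be estimated is given below. It assumes that a fully interacted version of regression [2] without an intercept is equivalent to fitting a separate slope through the origin in each group, and that country weights are computed on the full sample. The dictionary keys (`g_gdp`, `g_hfce` and so on) are hypothetical names, and the handling of missing HFCE data described in section 3.2 is left out.

```python
from collections import Counter, defaultdict


def estimate_group_passthrough(spells):
    """Weighted no-intercept slope per (welfare type, income block) group.

    Each spell is a dict with hypothetical keys: country, welfare_type,
    income_group, g_welf, g_gdp and g_hfce (annualized growth rates).
    """
    spells_per_country = Counter(s["country"] for s in spells)
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for s in spells:
        low_income_block = s["income_group"] in ("LIC", "LMIC")
        # GDP growth for LICs and LMICs, HFCE growth for UMICs and HICs.
        g_na = s["g_gdp"] if low_income_block else s["g_hfce"]
        group = (s["welfare_type"],
                 "LIC/LMIC" if low_income_block else "UMIC/HIC")
        weight = 1.0 / spells_per_country[s["country"]]  # equal country weight
        numerator[group] += weight * g_na * s["g_welf"]
        denominator[group] += weight * g_na ** 2
    return {g: numerator[g] / denominator[g]
            for g in numerator if denominator[g] > 0}
```

The function returns at most four rates, keyed by pairs such as `("consumption", "LIC/LMIC")`.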
The underlying regression output is shown in Appendix Table 1.

Figure 6: Estimated passthrough rates
Note: Estimated passthrough rates for consumption aggregates (left part) and income aggregates (right part). GDP data is used for LICs and LMICs while HFCE data is used for UMICs and HICs. The error bars represent 95 percent confidence intervals.

The two consumption passthrough rates are both close to and statistically indistinguishable from 0.7. For that reason, we suggest proxying consumption passthrough rates by 0.7, denoted by the red horizontal line in the left panel of Figure 6. The income passthrough rate for UMICs and HICs using HFCE is 1.02, and not statistically different from 1. For that reason, we suggest proxying it by 1. The estimated passthrough rate for income aggregates for LICs and LMICs using GDP is 1.34, but this estimate is associated with large uncertainty. This is because few LICs and LMICs use income aggregates. There are only 169 such cases, most of which are spells from countries in Latin America from the 1990s and 2000s. Because this combination of welfare and national accounts measure is rare and 1 lies within the confidence interval, we propose also to use a passthrough rate of 1 in these cases.

Note that these passthrough rates are consistent with evidence from Lakner et al. (2022), which finds a passthrough rate of 1.01 for income aggregates and 0.72 for consumption aggregates, and with Wollburg et al. (2023), which finds a passthrough rate of 0.70 for consumption aggregates (while not estimating passthrough rates for income aggregates). It is largely consistent with evidence from Prydz et al. (2022), which finds a passthrough rate of 0.75 (0.73) for consumption aggregates with GDP (HFCE) and 0.94 (0.89) for income aggregates with GDP (HFCE), none of which is significantly different from 0.7 and 1, respectively.

5. Robustness checks

5.1 Resource rents

There are reasons to believe that relying on GDP data is less appropriate in certain circumstances. For example, Mahler et al. (2022) found GDP extrapolations to be less accurate when specific components of GDP make up an unusually large share of overall GDP growth, or during periods of large price increases. Of particular concern may be when GDP growth is driven by natural resource rents, as they may not trickle down to households’ consumption. In this case using HFCE could be preferable to GDP, even for low- and lower middle-income countries.

Figure 7 tests this. It shows the errors using either GDP or HFCE, both with a separate passthrough rate for income and consumption aggregates, as a function of natural resource rents (i.e. for spells within a particular range of natural resource rents). It uses data only for low-income and lower middle-income countries.

Figure 7 shows that using GDP growth does not lead to less accurate results than using HFCE growth for countries with high natural resource rents. On the contrary, GDP performs slightly better, although the difference is far from statistically significant. As such, the plot suggests that switching to HFCE when GDP growth is driven by natural resources will not improve accuracy.

Figure 7: Errors by natural resource rents
Note: Errors of using separate passthrough rates for income and consumption aggregates while relying only on GDP or HFCE data, as a function of total natural resource rents. Only spells from low- and lower middle-income countries are included.

5.2 Bias

It is possible for a method to predict growth in the mean well while being biased in doing so. Given that a bias would systematically over- or underestimate poverty, it is desirable that the bias is small. We calculate the bias as

$$Bias = \frac{1}{N_C} \sum_{c \in C} \frac{1}{N_{S_c}} \sum_{s \in S_c} \left( \hat{g}_{welf,c,s} - g^{*}_{welf,c,s} \right)$$

This is similar to our primary loss function, but it does not take the absolute value before taking the mean.
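The difference between the bias and the mean absolute deviation can be illustrated with a short sketch. The helper names and the toy numbers are hypothetical; the only element taken from the note is the two-step averaging, first within each country and then across countries.

```python
from collections import defaultdict


def country_weighted_mean(countries, values):
    """Average within each country first, then across countries."""
    by_country = defaultdict(list)
    for c, v in zip(countries, values):
        by_country[c].append(v)
    country_means = [sum(v) / len(v) for v in by_country.values()]
    return sum(country_means) / len(country_means)


def bias(countries, predicted, actual):
    """Signed errors: positive values mean growth is overstated on average."""
    errors = [p - a for p, a in zip(predicted, actual)]
    return country_weighted_mean(countries, errors)


def mad(countries, predicted, actual):
    """Absolute errors, so over- and underpredictions cannot cancel out."""
    errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return country_weighted_mean(countries, errors)


# Toy data: country A is overpredicted once and underpredicted once.
countries = ["A", "A", "B"]
predicted = [3.0, 1.0, 2.0]
actual = [2.0, 2.0, 2.0]
print(bias(countries, predicted, actual))  # offsetting errors cancel
print(mad(countries, predicted, actual))   # absolute errors do not
```

In this toy case the signed errors in country A offset each other, so the bias is zero even though the mean absolute deviation is positive.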
Biases by national accounts measure and passthrough covariate are shown in Figure 8. The prior method had a bias of 0.28, meaning that it overstated growth in mean consumption by 0.28 percentage points on average and therefore underestimated poverty rates. The new method (the black dot at the “Income/consumption” x-axis category) has a bias of -0.19, so it systematically underpredicts growth in mean consumption. Though this represents a 32% decline in the magnitude of the bias, it is worthwhile considering if other methods could decrease the bias further.

Nearly all methods have equal or larger biases. The only exception is the method that estimates separate passthrough rates by whether the growth rate in national accounts is positive or negative. The bias in such a model with the proposed national accounts variable is -0.14. One could speculate that separate passthrough rates by positive or negative GDP growth arise because individuals are loss averse and would like to smooth consumption in times of crisis. This would make passthrough rates from GDP to consumption aggregates lower for negative growth spells. Yet we find that the passthrough rate is 0.7 from GDP to consumption both for negative and positive growth spells. In negative growth spells, the passthrough rate is lower from GDP to income aggregates and from HFCE to consumption and income aggregates, which cannot be attributed to consumption smoothing. Due to our inability to rationalize these differential passthrough rates, and because this model increases the mean absolute deviation from 3.32 to 3.36, we prefer the simpler model.

Figure 8: Bias by national accounts variable and passthrough covariate
Note: The legend refers to the national accounts measure used. X-axis categories are sorted by the average absolute value of the bias across national accounts measures. “No covariate” refers to running the regression without a covariate and estimating one unique passthrough for all observations.
In general, the bias seems to appear because the relationship between passthrough rates and growth rates is not fully linear. A model which includes a piece-wise second-order polynomial, such that

$$ g_{welf,c,s} = \beta \, g_{NA,c,s} + \gamma_1 \, g_{NA,c,s} \, I[g_{NA,c,s} < 0] + \gamma_2 \, g_{NA,c,s}^{2} \, I[g_{NA,c,s} > 0] + \gamma_3 \, g_{NA,c,s}^{2} \, I[g_{NA,c,s} < 0] + \varepsilon_{c,s}, $$

reduces the bias to nearly zero, but again with increases in the mean absolute deviation. This model would also have the disadvantage of implying a different passthrough rate for each spell, which is more challenging to explain. Another way to reduce the bias would be to include the constant term, but again this comes at the expense of increasing the overall error and complexity of the model.

5.3 Distribution of errors

Another possible concern is that the new method might perform better on average but may be worse at some points of the distribution of errors. Figure 9 plots the cumulative distribution of the error with the prior method and the new method. It shows that the new method nearly first-order dominates the prior method in terms of error magnitudes.

Figure 9: Distribution of errors
Note: The figure shows the inverse cumulative distribution function of errors with the prior method and the new method.

5.4 Stability of passthrough rates

Yet another concern may be that the passthrough rates are not stable over time. We have already shown that calculating separate passthrough rates by decade does not reduce the error, yet Figure 10 shows there is a minor downwards trend in passthrough rates over time (or at least until 2013). Though this trend is hardly statistically significant, it could suggest revisiting the estimation of passthrough rates periodically.

Figure 10: Passthrough rates estimated at different years
Note: Estimated passthrough rates using only spells in a 10-year window (five years on both sides).
GDP data is used for low- and lower-middle-income countries while HFCE data is used for upper middle-income and high-income countries.

5.5 Using the CPI deflator

Another concern is that the GDP and HFCE deflators used do not reflect the price changes consumers experience. If this is the case, one could instead use the nominal growth in GDP or HFCE deflated by the inflation measured from the consumer price index. Doing so gives notably larger errors throughout, as shown in Figure 11.

Figure 11: Errors by national accounts measure and deflator
Note: X-axis categories are sorted by average error across national accounts measures. “No covariate” refers to running the regression without a covariate and estimating one unique passthrough for all observations. ‘GDP’ uses the GDP deflator and ‘HFCE’ the HFCE deflator.

5.6 Using different passthrough rates at various parts of the distribution

It is possible that growth, in expectation, trickles down to different parts of the distribution at different rates. If, for example, the elite disproportionately capture the benefits of economic growth, the passthrough rate at the top of the distribution would be higher than at the bottom of the distribution. This would imply that using one passthrough rate together with distribution-neutrality would be a poor choice. We test for this by calculating the passthrough rate from GDP growth to each percentile of the distribution. To that end, we have derived the welfare level at 100 points of the distribution of each survey. We take the points on the distribution where the CDF equals 0.005, 0.015, …, 0.995. For the purpose of this exercise, we pool the comparable income and consumption spells. Figure 12 shows that the passthrough rate is relatively stable over the distribution. If anything, it is slightly higher at the bottom, though this is not statistically significant.
The fact that it is slightly higher at the bottom is equivalent to saying that over the entire sample of spells we consider, inequality on average decreased a little.

Figure 12: Passthrough rate from GDP growth to each percentile of welfare
Note: The figure shows the $\beta$’s from 100 regressions of the following form, $g_{welf,p,c,s} = \beta \, g_{NA,c,s} + \varepsilon_{c,s}$, where p reflects the welfare at a percentile.

6. Implications for global and regional poverty

Figure 13a shows the implications of implementing our proposed method vis-à-vis the prior method on global poverty. We continue to rely on the spring 2021 vintage of data. Evidently, the changes to global poverty are hardly noticeable, with the exception of the 1990s where the proposal would lower the global poverty rate by about 1 percentage point.

This global picture masks important changes at the regional and country level. At the regional level, particularly the Middle East & North Africa and Europe & Central Asia experience visible changes (Figure 13b). In the Middle East & North Africa, the trend is also impacted somewhat, with our proposed method removing a one-off hike in poverty in the early 1990s and mitigating the increase in poverty projected in recent years. As a result of the latter, the Middle East & North Africa would not be poorer than Latin America & the Caribbean in recent years. Notably, the regional trend for the Middle East & North Africa is also smoother with the proposal, as will be explained in more detail below.

These patterns can be explained by looking at country-level differences. Figure 14 shows the 15 countries with the largest change in poverty at any given year if our proposal were to be implemented. The changes in the Middle East & North Africa in the 1990s are driven by Iraq, while the change in recent years is driven by Syria and Yemen. In both cases, the passthrough rate of 0.7 adopted mitigates the projected changes.
A similar picture is found for countries such as Myanmar and Cabo Verde. For other countries, such as Tajikistan and Bhutan, the changes are driven by a switch from HFCE to GDP. In both cases, the trends with the new method appear more stable because the HFCE data for those countries were very volatile.

Figure 13: Implications for global and regional poverty
(a) Global poverty
(b) Regional poverty
Note: The figure compares the prior line-up rule with the new rule. The figure uses the spring 2021 vintage of data. The $1.90 poverty line with 2011 PPPs is used given that the $2.15 line was not yet adopted at this point.

Figure 14: Implications for country-level poverty
Note: The figure compares the prior line-up rule with the new rule. The figure uses the spring 2021 vintage of data. It includes the 15 countries with the largest changes to poverty rates.

7. Discussion and conclusion

This paper considers how best to extrapolate welfare vectors using only national accounts measures and assuming distributional neutrality. It addresses (i) which national accounts variable should be used to extrapolate forward in time, (ii) what fraction of growth in national accounts should be passed through to growth in welfare, and (iii) how this varies for different cases. We argue that under these conditions, a well-performing method is to use growth in HFCE for upper middle-income and high-income countries, and growth in GDP for low- and lower-middle income countries. For income aggregates, all growth in GDP and HFCE is passed through, while for consumption aggregates only 70% is passed through.

We show that these results have few implications for poverty estimates at the global level but do entail large differences in estimated poverty in some countries. Notably, the switch to using GDP for low- and lower-middle income countries makes the poverty trends less erratic over time for some countries.
This is likely due to large levels of uncertainty in measured HFCE for some poor countries.

The proposed changes can be rationalized by both economic theory and measurement issues. A challenge with our largely empirically driven approach is that it is difficult to distinguish these two concerns. We speculate that three factors can explain a passthrough rate for consumption of less than one: First, consumption aggregates may exclude or underestimate some items that make up an increasingly large share of household budgets as countries develop, such as health expenditures and food eaten away from home. Second, the share of consumption devoted to non-food items tends to increase as economies grow, and longer recall periods for non-food items may lead to greater underreporting of non-food expenditures than food expenditures. Finally, a passthrough rate less than one from GDP to consumption is consistent with an increasing savings rate as countries develop, which in turn is consistent with empirical evidence (Gross et al. 2020, Drescher et al. 2020, and Crozier and Zavaleta 2022).²

We have not discussed what happens to the prediction error the longer one extrapolates. For very long spells, it is plausible that the old survey data carry little relevant information, and it may be better to in part or fully predict without using the survey data. For example, the latter could be done by assigning the country a poverty rate based on a cross-country regression of poverty rates on country characteristics such as GDP/capita, demographic variables, or indicators such as built-up area derived from satellite imagery. How best to combine estimates obtained through intertemporal extrapolation of the type considered here with cross-country predictions is an interesting question for future research.

² This explanation, though, cannot explain why the passthrough rate is less than one when using HFCE instead of GDP, because HFCE excludes savings by construction.
Perhaps HFCE is measured with more noise than GDP.

References

Angrist, Noam, Pinelopi Koujianou Goldberg, and Dean Jolliffe. 2021. "Why Is Growth in Developing Countries So Hard to Measure?" Journal of Economic Perspectives 35(3): 215-42.
Bates, S., T. Hastie, and R. Tibshirani. 2023. "Cross-Validation: What Does It Estimate and How Well Does It Do It?" Journal of the American Statistical Association: 1-12.
Crozier, S. L., and F. B. Zavaleta. 2022. "The Marginal Propensity to Consume of 2020 COVID-19 Stimulus Payments in Peru." International Journal of Economics and Finance 14(3): 115–15.
Deaton, Angus. 2005. "Measuring Poverty in a Growing World (or Measuring Growth in a Poor World)." Review of Economics and Statistics 87(1): 1-19.
Drescher, K., P. Fessler, and P. Lindner. 2020. "Helicopter Money in Europe: New Evidence on the Marginal Propensity to Consume across European Households." Economics Letters 195(October): 109416.
Gross, T., M. J. Notowidigdo, and J. Wang. 2020. "The Marginal Propensity to Consume over the Business Cycle." American Economic Journal: Macroeconomics 12(2): 351–84.
Lakner, Christoph, Daniel Gerszon Mahler, Mario Negre, and Espen Beer Prydz. 2022. "How Much Does Reducing Inequality Matter for Global Poverty?" The Journal of Economic Inequality 20(3): 559-585.
Mahler, Daniel Gerszon, R. Andrés Castañeda Aguilar, and David Newhouse. 2022. "Nowcasting Global Poverty." The World Bank Economic Review 36(4): 835-856.
Prydz, Espen Beer, Dean M. Jolliffe, Christoph Lakner, Daniel Gerszon Mahler, and Prem Sangraula. 2019. "National Accounts Data Used in Global Poverty Measurement." Global Poverty Monitoring Technical Note 8.
Prydz, Espen Beer, Dean Jolliffe, and Umar Serajuddin. 2022. "Disparities in Assessments of Living Standards Using National Accounts and Household Surveys." Review of Income and Wealth 68: S385-S420.
Ravallion, Martin. 2003. "Measuring Aggregate Welfare in Developing Countries: How Well Do National Accounts and Surveys Agree?"
Review of Economics and Statistics 85(3): 645-652.
Roberts, David R., Volker Bahn, Simone Ciuti, Mark S. Boyce, Jane Elith, Gurutzeta Guillera-Arroita, Severin Hauenstein et al. 2017. "Cross-Validation Strategies for Data with Temporal, Spatial, Hierarchical, or Phylogenetic Structure." Ecography 40(8): 913-929.
Wollburg, Philip, Stephane Hallegatte, and Daniel Gerszon Mahler. 2023. "Ending Extreme Poverty Has a Negligible Impact on Global Greenhouse Gas Emissions." Nature 623: 982-986.
Walsh, Brian, and Ana De Menezes. 2023. "Historical Validation for Macrosocial Metrics." Mimeo.
World Bank. 2016. Poverty and Shared Prosperity 2016: Taking on Inequality. The World Bank.
World Bank. 2023. Macro Poverty Outlook. https://www.worldbank.org/en/publication/macro-poverty-outlook. The World Bank.

Appendix A: Additional figures and tables

Figure A.1: Errors using all possible spells rather than only adjacent spells
a. By national accounts measure and covariate
b. By use of GDP for other subgroups
Note (panel a): The legend refers to the national accounts measure used. X-axis categories are sorted by average error across national accounts measures. “No covariate” refers to running the regression without a covariate and estimating one unique passthrough for all observations. The error with FCE and passthrough=1 is larger than the maximum value of the y-axis.
Note (panel b): GDP shows the error if only using GDP data, and similarly for HFCE. LIC = Low-income, LMIC = Lower middle-income, SSA = Sub-Saharan Africa.

Figure A.2: Errors using spells from 2000 onwards
a. By national accounts measure and covariate
b. By use of GDP for other subgroups
Note (panel a): The legend refers to the national accounts measure used. X-axis categories are sorted by average error across national accounts measures. “No covariate” refers to running the regression without a covariate and estimating one unique passthrough for all observations. The error with FCE and passthrough=1 is larger than the maximum value of the y-axis.
Note (panel b): GDP shows the error if only using GDP data, and similarly for HFCE. LIC = Low-income, LMIC = Lower middle-income, SSA = Sub-Saharan Africa.

Figure A.3: Comparison of errors with and without intercept(s)
a. With overall intercept
b. With group-based intercept
Note: The left figure compares errors without an intercept to errors with an intercept. The right figure adds group-specific intercepts as well (for example, intercepts by region, datatype, income group, whichever group is used to calculate passthrough rates). Each dot is the error from a particular model, that is, a particular national accounts measure combined with a particular group to distinguish passthrough rates by.

Figure A.4: Errors using temporal cross-validation
a. By national accounts measure and covariate
b. By use of GDP for other subgroups
Note (panel a): The legend refers to the national accounts measure used. X-axis categories are sorted by average error across national accounts measures. “No covariate” refers to running the regression without a covariate and estimating one unique passthrough for all observations. The error with FCE and passthrough=1 is larger than the maximum value of the y-axis.
Note (panel b): GDP shows the error if only using GDP data, and similarly for HFCE. LIC = Low-income, LMIC = Lower middle-income, SSA = Sub-Saharan Africa.

Figure A.5: Errors using the mean squared error
a. By national accounts measure and covariate
b. By use of GDP for other subgroups
Note (panel a): The legend refers to the national accounts measure used. X-axis categories are sorted by average error across national accounts measures. “No covariate” refers to running the regression without a covariate and estimating one unique passthrough for all observations.
Note (panel b): GDP shows the error if only using GDP data, and similarly for HFCE. LIC = Low-income, LMIC = Lower middle-income, SSA = Sub-Saharan Africa.

Figure A.6: Errors using quantile regression
a. By national accounts measure and covariate
b. By use of GDP for other subgroups
Note (panel a): The legend refers to the national accounts measure used. The error with FCE and passthrough=1 is larger than the maximum value of the y-axis. X-axis categories are sorted by average error across national accounts measures. “No covariate” refers to running the regression without a covariate and estimating one unique passthrough for all observations.
Note (panel b): GDP shows the error if only using GDP data, and similarly for HFCE. LIC = Low-income, LMIC = Lower middle-income, SSA = Sub-Saharan Africa.

Figure A.7: Errors by use of GDP or HFCE based on statistical capacity proxy
Note: The blue line shows the error if spells associated with a sufficiently low statistical capacity use GDP while spells with a high statistical capacity use HFCE. We proxy statistical capacity with the share of years in which an international poverty estimate exists within a five-year window on each side of a reference year. If a country conducted a household survey in 2000 and 2007 and no other years, then in 2000, it would have a statistical capacity score of 1/11, and in 2005 of 2/11. We do this to check if there is a level of statistical capacity below which it makes sense to use GDP growth, in the sense that it would give a lower error than our main specification. We estimate errors if any spell with a statistical capacity score above x uses HFCE and any score below x uses GDP. Regardless of which cut-off we use, we do not reduce the error beyond the LIC/LMIC cut-off (dashed black horizontal line).

Table A.1: Regression output

                   Consumption              Income
                   LIC/LMIC    UMIC/HIC     LIC/LMIC    UMIC/HIC
                   GDP         HFCE         GDP         HFCE
Coefficient        0.68        0.73         1.34        1.02
Standard error     (0.053)     (0.084)      (0.195)     (0.085)
R2                 0.50        0.46         0.25        0.47
Observations       373         160          169         739

Note: Results from regressing annualized growth in mean income or consumption per capita on annualized growth in GDP or HFCE per capita.
Spells for LICs and LMICs use GDP while spells for UMICs and HICs use HFCE. Regressions are run without an intercept. Only comparable spells using adjacent surveys are used. Standard errors are clustered at the country level.
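The kind of estimate reported in Table A.1, a regression without an intercept with standard errors clustered at the country level, can be sketched as follows. This is an illustrative sketch only: it uses the basic cluster-robust sandwich formula for a single regressor and applies no finite-sample correction, which may differ from the software defaults used to produce the table. The function name and the data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def origin_ols_clustered(x, y, clusters):
    """No-intercept OLS slope with a cluster-robust standard error.

    x: national accounts growth, y: welfare growth, clusters: country labels.
    Assumption: plain sandwich estimator without small-sample adjustment.
    """
    sum_xx = sum(xi * xi for xi in x)
    beta = sum(xi * yi for xi, yi in zip(x, y)) / sum_xx

    # Sum the score x * residual within each cluster (country).
    cluster_scores = defaultdict(float)
    for xi, yi, country in zip(x, y, clusters):
        cluster_scores[country] += xi * (yi - beta * xi)

    variance = sum(s * s for s in cluster_scores.values()) / (sum_xx * sum_xx)
    return beta, sqrt(variance)


# Hypothetical spells: two countries with two spells each.
x = [1.0, 1.0, 1.0, 1.0]
y = [0.0, 2.0, 3.0, 3.0]
countries = ["A", "A", "B", "B"]

beta, se = origin_ols_clustered(x, y, countries)
print(beta, se)
```

Summing the scores within a country before squaring is what allows errors from spells in the same country to be correlated.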