Policy Research Working Paper 10301

Estimating House Prices in Emerging Markets and Developing Economies
A Big Data Approach

Daniela M. Behr, Lixue Chen, Ankita Goel, Khondoker Tanveer Haider, Sandeep Singh, Asad Zaman

International Finance Corporation
February 2023

Abstract

Despite the relevance of house prices for a variety of stakeholders as well as for macroeconomic and monetary policy making, reliable, publicly available house price data are largely absent in emerging markets and developing economies. Filling this void, this paper presents a systematic approach to collecting, analyzing, and assessing private property prices in emerging markets and developing economies. The paper uses data scraped from five countries’ largest real estate websites where private properties are listed for sale, to obtain price data and property attributes to establish a comprehensive data set that allows for both intra- and inter-country comparison of residential property prices. It then outlines the usability of these data by employing random forest estimation to predict the price of a standard housing unit, the basic house price, that is comparable across countries. While this approach is also applicable to filling wide data gaps in the provision of private property prices in developed economies, the paper focuses on how this approach can be applied to emerging markets and developing economies, where private property price data are particularly scarce.

This paper is a product of the Development Impact Measurement Department, International Finance Corporation. It is part of a larger effort by the World Bank Group to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at dbehr@worldbank.org; khaider@ifc.org; ssingh13@ifc.org; and azaman@ifc.org.
The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Estimating House Prices in Emerging Markets and Developing Economies: A Big Data Approach

Behr, Daniela M.; Chen, Lixue; Goel, Ankita; Haider, Khondoker Tanveer; Singh, Sandeep*; Zaman, Asad1

Keywords: house price, web scraping, random forest, machine learning, residential real estate
JEL classification: R31 E30 C80

1 The paper was written while the authors were at the International Finance Corporation (IFC). This paper is part of a larger research project on affordable housing at IFC’s Development Impact Measurement Department. Sandeep Singh, Khondoker Haider and Asad Zaman are Economists at IFC. Daniela Behr is currently an Economist at the World Bank. Lixue Chen and Ankita Goel are currently Research Assistants at the IMF. *Corresponding author: ssingh13@ifc.org. The paper has benefitted from thoughtful guidance and encouragement by Issa Faye, Dan Goldblum, Camilo Mondragon-Velez, Zekebweliwai Geh, and participants of the IFC Sector Economics research sounding board. We want to thank our peer reviewers Adrian Alter, Nina Biljanovska, Jean-Charles Bricongne, Rui Costa, Niall O’Hanlon, Jane Jenkinson, Dunstan Matekenya, Nisachol Mekharat, Maurice Nsabimana, Edie Purdie, Marko Olavi Rissanen, Inyoung Song, Nouhoum Traore, and Simon Walley.
We also received very helpful feedback and comments throughout the process from Jeffrey David Anderson, Richard Chamboko, Friedemann Roy, Ronald Kai Him Leung, Justice Mensah, and Philip Schellekens. Views expressed here are those of the authors, and all errors remain our own.

1. Introduction

Private property price levels and their movement over time have critical implications for the economy of most countries, but they also play a fundamental role in household (HH) wealth. Buying property is undoubtedly the most significant investment for many families around the world, and reliable information on private property prices and their determinants is essential to HHs’ decision making. The relationship between property prices and HHs’ disposable incomes determines what type of property HHs can afford. Therefore, from a policy standpoint, reliable property price data are key to understanding the scale of affordability issues and identifying market failures leading to supply-demand mismatches in the housing sector.

House price data also bear key insights on countries’ financial and macroeconomic stability. Disproportionately rising private property prices, or high price-to-income ratios, can be an early indication of imbalances and risks in the financial system (e.g., Anundsen et al. 2016; Drehmann & Juselius 2014). Decreasing house prices, in turn, may be associated with a decrease in HH wealth and a decline in consumption (e.g., Campbell & Cocco 2007; Mian et al. 2017). If monitored over time, property prices can function as an early warning system of systemic banking stress or economic downturn. In addition, urban planners, private developers, economists, and policy makers depend on reliable property price data to update zoning regulations, formulate housing policies, and decide how to efficiently allocate scarce resources to support housing solutions where they are needed most.
Despite the importance of property price data for various stakeholders, comprehensive house price data are scarce in developed economies and virtually non-existent for emerging markets and developing economies (EMDEs). If available at all, residential property prices are mostly published in indexed format, which allows changes to be tracked over time; however, indices are uninformative about distributional aspects of prices, affordability issues, or the degree to which price segments are underserved in a market. When housing markets mature, properties' characteristics and attributes may be more stable over time. Hence, changes captured by price indices reflect price movements over a relatively constant set of properties in the market. However, in many EMDEs, there are dynamic changes to the type of properties being built by formal developers. Due to the nascency of markets, formal developers and builders in EMDEs are expanding their portfolios to also offer housing solutions to HHs with lower or even informal incomes. In such circumstances, aggregate price indices may not be fit for purpose and may suffer from biases due to rapidly changing underlying property types.

The wide gap in publicly available property price data in EMDEs is a key impediment to extending the understanding of housing markets, widely available for developed economies, to these markets. The determinants of property prices and their drivers over time are heavily researched subjects, but this research relies exclusively on developed economies. If comparable property price data were available in EMDEs, they could provide crucial insights on potential inefficiencies along the housing value chain. Analyzing property price data in emerging economies can also help, for instance, to point towards constraints in developers’ ability to access finance, prohibitively high construction costs, or regulatory bottlenecks in land acquisition and titling.
In addition, property price data in EMDEs can also help point towards affordability issues and potential ramifications on HH consumption and spending. The breadth of these challenges is often difficult to grasp as the housing value chain is highly interconnected and complex, varying across contexts and countries. Despite the immense importance of understanding these challenges and identifying how to mitigate housing market failures, the empirical literature in this area is highly underdeveloped for EMDEs. This can be primarily attributed to a dearth of robust property price data and associated analysis in these markets.

This paper pursues a two-fold approach to address the gap: First, addressing the information vacuum in EMDEs regarding property price data, this paper presents a novel approach to collecting house price data in data-scarce environments common in EMDEs. Instead of relying on survey data or obtaining property price data from various official and unofficial sources, we scrape available price information from EMDEs’ listing websites where private properties are listed for sale. In doing so, we largely constrain our analysis to the formal housing market and disregard informal housing not transacted in online markets, such as the incremental self-building of houses typical in EMDEs. Our approach offers a first foray into providing the distribution of property price data otherwise nonexistent in EMDEs.

In the second part of this paper, we demonstrate the usability of distributional private property price data and present an approach to estimate the price of a “standard house”. We refer to this price as the Basic House Price (BHP). This BHP allows for comparability of prices across EMDEs, standardizing the significant variation in the type of property across markets. Further, the BHP presents a standard measure that can be used in future research to monitor aspects of inclusiveness.
When linked to other data, such as household income or housing finance data, the BHP may be used to monitor progress on the Sustainable Development Goals (SDGs). The BHP also offers a standard measure to assess and compare housing affordability across population segments if paired with income data.

We present the web scraping approach, descriptive statistics, and the methodology for estimating the BHP using five countries as examples: Albania, Costa Rica, Morocco, Pakistan, and South Africa.2 Overall, we collect residential property price data from over 200,000 online listings and present an approach to clean, analyze, and visualize these data. In this paper, we only demonstrate the applicability of the collected house price data for the price estimation of a defined housing unit. We leave for future research how the price of a standard house (the BHP) may contribute to understanding affordability in EMDEs.

This paper proceeds as follows: After this introduction, Section 2 discusses previous efforts in collecting and estimating house prices, which are mainly focused on developed economies. This section outlines how – if collected at all – private property prices are mostly available in an indexed format which only allows for monitoring of changes over time. Section 3 describes the data collection process for the five countries and discusses the prospects and pitfalls of a web scraping approach to obtain data for EMDEs. Section 4 defines a standard housing unit, which is the underlying concept for estimating the BHP. This section also describes how we apply a Random Forest (RF) model to estimate the BHP. Section 5 presents BHP results for the five markets, including estimations for the largest cities in these countries. Section 6 concludes and outlines how the methodology we propose can be expanded to a more extensive set of EMDEs.

2 Our ongoing research extends the collection and estimation of private property prices to over 60 EMDEs and will be presented in future papers.
2. Literature: Collecting and Estimating House Prices

Despite their importance, private property price data are not readily available for most countries and are particularly scarce for EMDEs. Further, the lack of standardization of property prices and the heavy reliance on indices to monitor changes over time make comparisons across countries cumbersome. Addressing this gap, the paper places itself at the intersection of three streams of literature: existing efforts to collect or collate data on residential property prices and to construct residential property price indices (Section 2.1); studies on determinants of house prices (Section 2.2); and a relatively new area of big data and machine learning approaches to estimate (determinants of) house prices (Section 2.3). Finally, we summarize the existing gap within this line of research and outline how we address it (Section 2.4).

2.1 Current Efforts to Collect House Price Data and Existing House Price Indices

Since housing plays a key role in the growth of many aspects of a country’s economy, including the development of the construction industry, job creation, and improving the living conditions of many HHs, governments have a great interest in understanding property prices and their developments over time. Most advanced economies’ statistical offices or central banks, therefore, started collecting data on residential property prices in the 1970s (Knoll et al. 2017). Also, tax authorities, land registries, or real estate associations collect, hold, and sometimes even publish data on residential property prices in many advanced economies.3 With the 2008-2009 global recession, which many scholars attributed to misalignments in housing and housing-related asset prices, the interest in dynamics in housing markets rose significantly (e.g., Goodhart & Hofmann 2008; Del Negro & Otrok 2007).
Since then, several international organizations and central banks have increased efforts to develop global property price indices and have collated real property price data for various predominantly developed economies to monitor macro-financial stability and price developments. One early methodology to track property prices is the Case-Shiller National Home Price Index, which measures the value of residential real estate in major US metropolitan areas and serves as a blueprint for subsequent price indices in other developed countries.4 Extending the collection of residential property prices to EMDEs has been slow, in either indexed or other forms. In the following, we briefly discuss the most important sources.

First, a primary source for residential property prices, covering a relatively large number of countries, is provided by the Bank for International Settlements (BIS). BIS collects quarterly data on residential property prices for 60 countries, predominantly focusing on advanced economies. Property prices are harmonized as much as possible by BIS according to the recommendations outlined in the Handbook on Residential Property Price Indices (RPPIs), which summarizes best practices in how to calculate property price indices (European Union [EU] et al. 2013). As BIS compiles data from various sources, the data series differ from country to country, varying in frequency, type of property, covered area, priced unit, compilation method, or seasonal adjustment. In addition, while BIS publishes some actual prices, most data are only available in an indexed format, allowing for tracking aggregate price movements over time. BIS does not provide insights on the distribution of property prices.5 Despite these shortcomings, this database currently offers the most comprehensive data series on house prices.

3 The UK Land Registry, for instance, makes house price data publicly available: https://www.gov.uk/guidance/about-the-price-paid-data.
4 These price indices, including twenty cities, low-, medium-, and high-tier home price indices, condominium indices, and a U.S. national index, are now published as the S&P/CoreLogic/Case-Shiller Home Price Indices by Standard & Poor’s.

Second, the International Comparison Program (ICP 2011), which collects prices for a range of goods and services that make up final consumption expenditure and gross capital formation, also captures housing expenditures. The ICP survey collects annual rental prices and dwelling stock data. Rents are either captured as actual or imputed rents (World Bank 2020). In the most recent ICP cycle (2017), participating economies collected rental data for 21 different dwelling types, ranging from one-bedroom apartments to single-family homes.

Third, a more regional-focused data source on property prices is provided by the Organization for Economic Co-operation and Development (OECD), which publishes nominal residential property price indices for OECD countries, as well as price-to-rent and price-to-income ratios.6 The database particularly focuses on house price developments across regions and cities within countries to capture spatial price variation. For select countries, OECD also offers the number and value of housing transactions. While insightful for advanced economies, this database does not cover any emerging economies and mostly publishes indexed data to track price changes over time.

Fourth, institutions such as the International Monetary Fund (IMF) or the United States Federal Reserve Bank collate property price data from various national sources. IMF’s Global Housing Watch platform, for instance, tracks developments in housing markets across the world on a quarterly basis.7 The database collates property price data from different sources (e.g., BIS, European Central Bank, Federal Reserve, and national sources) for 63 countries – mostly advanced economies – to assess valuation in housing markets.
Further, it provides metrics such as price-to-rent and price-to-income ratios. Similarly, the Dallas Federal Reserve Bank’s International House Price Database publishes quarterly house prices for 25 mostly developed economies by drawing on national public sources primarily from central banks, statistical offices, or other non-government organizations (Mack & Martínez-García 2011).8 The data collected by these institutions mostly cover developed economies, are collated as secondary data from a plethora of different sources, and are primarily made available in an indexed format.

Fifth, in collecting and analyzing actual house prices across emerging economies, the Center for Affordable Housing Finance in Africa (CAHF) is unique in its efforts. It systematically collects house prices for African countries by surveying local housing experts on the cost and size of the cheapest house built by a private developer. Figures are published in CAHF’s annual housing finance yearbook, covering the last decade. CAHF’s approach also turns the conversation on house prices away from mean or aggregate measures and to the lower tail of the formal market. From the perspective of policy dialog on affordable housing, this approach may be more appropriate. However, the usability of the data for policy purposes suffers partly because the price point provided i) represents only the extreme lower end of the formally developed new housing units and ii) is not paired with information regarding the quantity supplied at or near this price range.
5 https://www.bis.org/statistics/pp_detailed.htm
6 https://data.oecd.org/price/housing-prices.htm
7 https://www.imf.org/external/research/housing/index.htm
8 https://www.dallasfed.org/institute/houseprice

Finally, in recent years, crowd-sourced platforms such as Numbeo,9 which rely on user inputs on property prices in various locations around the world, have added their own house price index along with publicly available per square foot price ranges for properties within the city center and outside the city center. These platforms add more distributional aspects to the average house prices and point predictions published by other indices, but suffer from the limited reliability of the self-reported data.

Despite the apparent issues in comparability across countries, the listed sources are the most comprehensive databases on property prices currently available for a larger set of countries. Therefore, many papers draw on these indices to conduct country or region-specific analyses on property price developments over time (e.g., Girouard et al. 2006; Igan & Loungani 2012; Yoshino & Helble 2016). One of the earliest systematic presentations of house prices is a historical time series data set of nominal residential property prices in 13 advanced economies by Borio et al. (1994). Some studies that provide comparative assessments combine the data sources outlined above or enhance them with primary data collection on additional countries that are not yet covered by the indices (e.g., Deghi et al. 2020).

2.2 Determinants of House Prices

The volume of research on the housing market, particularly on estimating its impact on real economic activity, has grown steeply since the global financial crisis in 2008–2009. Most studies in this realm investigate the various channels through which housing and house prices affect macroeconomic and financial outcomes, particularly as housing bubbles are associated with significant output losses (e.g., Catte et al.
2004; IMF 2008; Jordà et al. 2015). Single-country studies on house prices and house price developments mainly focus on developed economies, particularly OECD countries, EU countries, and the United States or Canada (e.g., Alter & Mahoney 2021; Davis & Heathcote 2005; Knoll et al. 2017; Philiponnet & Turrini 2017). Most studies investigate determinants of house prices over time. Jordà et al. (2016), for instance, have gathered time series data on disaggregated bank credit for 17 advanced economies since 1870. With this historical data for the total value of the residential housing stock (structures and land), the authors relate household mortgage debt to asset values, showing that the rise in mortgage credit has financed a substantial expansion of home ownership from about 40 percent in 1950 to 60 percent in the 2000s. Similarly, Knoll et al. (2017) assess how house prices have evolved over time for 14 advanced economies, gathering historical house price data to estimate what drives changes in house prices. The authors show that changes in house prices are largely attributed to changes in land prices. This finding is corroborated by others who also attribute rising property prices to sharp increases in residential land prices, while construction costs have remained relatively stable over time (e.g., Glaeser & Ward 2009; Gyourko et al. 2013). In major metropolitan areas, it is not uncommon for the cost of land to exceed 40 percent of total property price; in extreme cases, like San Francisco, the share can stretch to as much as 80 percent (McKinsey Global Institute 2014). Gao et al. (2019) dissect property features into two groups when predicting house prices: non-geographical features, such as the number of bedrooms and floor space area, and geographical features, such as the distance to the city center and the quality of nearby schools. 
This is also documented by Gröbel and Thomschke (2018), who show that housing prices are largely determined by the physical location of the property. In addition, the number of bedrooms and the size of a private property are consistently found to be positively related to the property price (e.g., Fletcher et al. 2000; Garrod and Willis, 1992; Rodriguez and Sirmans, 1994). Other attributes studied include crime rates (e.g., Ceccato & Wilhelmsson 2020), or proximity to transportation (e.g., Zhang et al. 2021; Zong & Li 2016). Other authors estimate that house prices have particularly increased since the financial crisis in 2008 due to the rise of economic activity paired with unusually low mortgage interest rates in most advanced economies (Claessens & Schanz 2019). Also, price changes in major cities are hypothesized to be driven by institutional investors trying to find high yields or safe assets in a low-interest rate environment (IMF 2018; Gauder et al. 2014).

9 https://www.numbeo.com/property-investment/

While there is a plethora of literature on property price determinants in developed economies, studies on EMDEs are scarce. Some single-country studies focusing on EMDEs analyze existing house price data that are published by commercial banks such as Absa, Standard Bank, and First National Bank for South Africa (e.g., Balcilar et al. 2011; Luüs 2005). Other authors collect their own data by either surveying real estate agencies to estimate the relative importance of housing attributes to house prices (Owusu-Manu et al. (2019) for Ghana), by surveying developers (Libertun de Duren (2018) for peri-urban areas in Brazil and Mexico), or by conducting a household survey to collect data on housing costs (Uwayezu & de Vries (2020) for Kigali city in Rwanda).
High property prices in EMDEs are often attributed to prohibitively high building costs due to the need to import materials, the shortage of local skills, and the absence of financial mechanisms that allow for materials to be bought in bulk (e.g., Gardner & Pienaar 2019). While unique in their efforts to shed some light on the housing market in emerging economies, these studies provide only a snapshot of the housing market of one country (or a handful of countries) – often with a regional focus or a focus on the biggest cities.

2.3 Big Data Approaches and Machine Learning for Private Property Price Estimation

In addition to the traditional approaches to collecting private property price data discussed in the previous section, in recent years, with the growing momentum of big data and machine learning in economics, more studies have started to gather property price data from online listing websites. While less than a decade ago most private properties were listed for sale in local newspapers or with private realtors, today much of the listing activity has moved to websites concentrating on housing advertisements. Analyses that draw on property price data collected from listing websites allow for fine-grained spatial and temporal assessments of the entire housing market. Further, big data approaches to private property prices make it possible to investigate a particular housing market in more detail or to add distributional aspects to the mostly averaged house prices made available by the indices discussed in Section 2.1.

A predecessor of web scraping approaches to collect property price data is Kim’s (2007) study on Vietnam. The author manually collated over 5,000 observations on property prices and property attributes drawing on classified advertisements in Vietnam’s most prominent newspaper. Applying a hedonic price model, Kim assesses the price differences between Hanoi and Ho Chi Minh City to investigate the impact of social norms on property prices.
Over time and with the increased penetration of property listing websites, private property price collection efforts have transitioned to online listings where data collection can be automated. Anenberg & Laufer (2017), for instance, use listing information to construct a new house price index to monitor house price developments in the US. Using property listings, the authors construct a new repeat-sales house price index that describes house values at the contract date when the price is determined rather than the closing date when the property is transferred. Other big data price collection efforts include, for instance, scraping of online listings in Great Britain (e.g., Rae 2015), the US (e.g., Boeing et al. 2021), the Netherlands (ten Bosch & Windmeijer 2014), Türkiye (Keskin & Watkins 2017), Japan (Sadayuki 2018), or China (Hu et al. 2019; He et al. 2019; Wang et al. 2020).

In their data collection efforts, most authors focus exclusively on a localized housing market (i.e., a particular region, city, or neighborhood) in developed countries for which well-structured property listing websites with a plethora of private properties listed for sale are available. Additionally, while very comprehensive in scope, most efforts of web scraping of private property prices are centered on developed markets. Similar approaches in EMDEs, particularly in low-income economies, are scarce. Notable exceptions are, for instance, Gnagey and Tans (2018), who collate a data set of over 64,000 properties in 2016 from listing websites to estimate house prices in Indonesia. The authors find that desirable housing attributes, structural quality, advantageous location on major thoroughfares, and secure land tenure increase property asking prices.

Almost all studies that collect price data from online listing websites focus on only one or a few markets within a particular region.
One notable exception is a recent HouseLev database project that assembled house prices for 40 countries, mainly European and advanced economies, including some emerging economies such as Türkiye or the Russian Federation (Bricongne et al. 2019). The authors do not solely rely on web scraping for all 40 countries. They instead relate to national accounts data and implement web scraping as a “fallback methodology” in case of missing data. As the authors use both methods, national accounts as well as web scraping, for a sub-sample of European countries, they can compute the median level of estimated upward bias arising from the use of listed rather than transaction prices, which is then applied as a correction factor to improve comparability of price level data obtained with the two methods (Bricongne et al. 2019: 6). HouseLev, to the best of our knowledge, is the most comprehensive web scraping project of private property prices, primarily focused on developed economies. Advanced price estimation techniques have also evolved with the increased usage of big data approaches to collecting property price data from listing websites. Traditionally, the hedonic price model, which draws on Lancaster’s consumer theory, has long been the predominant model to estimate property prices (Lancaster 1966; Rosen 1974). Property prices are modeled in multiple regression analysis, assessing the association between property price and several hedonic attributes through parametric estimation (Oladunni & Sharma 2016). Attributes frequently applied in hedonic price models include, for instance, number of bathrooms, number of bedrooms, area size, neighborhood, or accessibility of the property (e.g., Borba & Dentinho 2016; Can 1992; Krol 2013). 
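
To make the hedonic approach concrete, the following is a minimal sketch of a log-linear hedonic regression, log(price) = b0 + b1·bedrooms + b2·bathrooms + b3·log(area), fitted by ordinary least squares. It is an illustration only and not the specification used in any of the cited studies or in this paper: the listings are synthetic and noise-free, and the coefficient values, field names (`bedrooms`, `bathrooms`, `area_m2`, `price`), and helper functions are all hypothetical.

```python
import math

# Hypothetical "true" coefficients used only to generate synthetic listings:
# intercept, bedrooms, bathrooms, log(area).
TRUE_COEFS = [9.0, 0.10, 0.05, 0.80]


def make_synthetic_listings():
    """Build a small, noise-free synthetic data set (not real listing data)."""
    attributes = [(1, 1, 40), (2, 1, 60), (2, 2, 75), (3, 2, 90),
                  (3, 1, 110), (4, 3, 150), (4, 2, 130), (5, 3, 200)]
    listings = []
    for bedrooms, bathrooms, area_m2 in attributes:
        log_price = (TRUE_COEFS[0] + TRUE_COEFS[1] * bedrooms
                     + TRUE_COEFS[2] * bathrooms
                     + TRUE_COEFS[3] * math.log(area_m2))
        listings.append({"bedrooms": bedrooms, "bathrooms": bathrooms,
                         "area_m2": area_m2, "price": math.exp(log_price)})
    return listings


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_hedonic(listings):
    """OLS via the normal equations: solve (X'X) beta = X'y for beta."""
    rows = [[1.0, item["bedrooms"], item["bathrooms"], math.log(item["area_m2"])]
            for item in listings]
    y = [math.log(item["price"]) for item in listings]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def predict_price(coefs, bedrooms, bathrooms, area_m2):
    """Predicted price of a unit with the given attributes."""
    log_price = (coefs[0] + coefs[1] * bedrooms + coefs[2] * bathrooms
                 + coefs[3] * math.log(area_m2))
    return math.exp(log_price)


listings = make_synthetic_listings()
coefs = fit_hedonic(listings)
```

Because the synthetic data contain no noise, `fit_hedonic` recovers the generating coefficients up to floating-point error. With real listings, the fitted coefficients would be estimates whose interpretation depends on the parametric assumptions of the hedonic model.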
While very simple in their interpretation, hedonic price models rest on strong model assumptions, including a correctly specified functional form, homoscedasticity, independence, and the absence of multicollinearity (e.g., Anderson 2000; Pérez-Rave et al. 2019).10

10 For a critical overview of the different prediction algorithms commonly used for house price predictions, see Montero & Fernández-Aviles (2018).

In recent years, the use of alternatives to hedonic price estimation has expanded, and machine learning (ML) has emerged as an alternative for predicting house prices (Borde et al. 2017; Čeh et al. 2018; Fan et al. 2006; Mullainathan & Spiess, 2017; Pérez-Rave et al. 2019; Truong et al. 2020; Yan & Zong 2020). Within that realm, Fan et al. (2006) constitute one of the earliest contributions that move beyond hedonic price models to predict property prices. Applying a decision tree technique, the authors explore the relationship between house prices and housing characteristics, which aided the determination of the most important variables for price predictions. While ML techniques are comparatively weak in inference, they have strong predictive power, manage to fit complex data, are very flexible in assumptions on functional form without overfitting, and work well in out-of-sample estimations (e.g., Athey 2018; Mullainathan & Spiess 2017).

ML estimations such as random forest have become a suitable and frequently applied alternative to hedonic price estimates, particularly for property price estimation. While RF and other decision tree-based models also rely on model assumptions, they are better at modeling non-linear relationships than simple multiple linear regression. Authors applying ML to price estimations mostly focus on narrowly defined housing markets in developed economies such as Ljubljana, Slovenia (e.g., Čeh et al. 2018), Gangnam, Republic of Korea (e.g., Hong et al.
2020), London, Great Britain (e.g., Levantesi & Piscopo 2020), Arlington County, USA (e.g., Wang & Wu 2018) or housing markets in upper-middle income economies such as Mamak District, Ankara, Türkiye (Yilmazer & Kocaman 2020), Petaling Jaya, Selangor, Malaysia (Mohd et al. 2019), or St. Petersburg, Russian Federation (Antipov & Pokryshevskaya 2012). In assessing the housing sector, many of these authors contrast the predictive performance of ML algorithms with standard regression techniques. Across the board, the authors find that RF (significantly) outperforms parametric estimation techniques in terms of accuracy and predictive power.

2.4 Data Gap: EMDEs Are Largely Absent in Property Price Analyses

This overview of existing studies and data sources within the realm of house prices points to five major gaps that we try to address with this paper:

First, there is a striking data gap in the availability of house prices, particularly for EMDEs. Most existing property price compilation efforts concentrate on developed economies, publish data only in indexed format, and do not report underlying actual house prices. This may be attributed to the fact that national sources, such as central banks or statistical offices on which these indices base their data, do not collect, report, or publish property price data. Further, as underlying data to these indices are very country- and context-specific, they fit the purpose of monitoring price changes over time within a specific country but do not facilitate cross-country assessments. Some countries, for instance, only consider prices for family homes in the capital while other countries use flats in urban areas for the index. The same applies to prices: some report transaction prices, others draw on listed prices, and yet others average prices (cf. BIS database; Mack & Martínez-García 2011).
Mack & Martínez-García (2011), who collate publicly available national sources to build a database of (nominal and real) house prices for developed economies, acknowledge this flaw, noting that the main contribution of their database is “sorting out the existing data by country, selecting the most similar series and documenting the differences across countries to clarify the extent to which international sources can be made comparable for empirical analysis purposes” (Mack & Martínez-García 2011: 3). Achieving comparability across countries with the existing data sources is almost impossible.

Second, while price indices present equilibrium outcomes of housing markets, they do not cover details about, broadly, the quantity of housing. They often only include a particular type of housing for which prices are tracked. Whereas in high-income economies, the latter may remain relatively stable in the short term, in EMDEs, with rapidly expanding formal housing markets, quantity and type of housing are important elements to capture. They provide context to changes in prices as the sample over which prices are indexed changes, and as price and quantity and type of housing supplied are highly interrelated. Also, they have important policy implications in the context of markets’ ability to supply homes for different market segments, and formal developers’ ability to reach lower income groups. The measurement of these dimensions of the housing markets is absent across EMDEs.

Third, primary data collection for actual, non-indexed house prices is still somewhat limited and, if available, almost exclusively covers advanced economies. With a few notable exceptions, there is a severe lack of contributions in the literature on property price estimations in EMDEs. Property price data for EMDEs are virtually non-existent – both for within-country assessments, and even more so for cross-country comparison.
Most studies on house prices obtain data from readily available sources such as land registries, real estate agencies, or commercial banks, or tap into established indices. Given the significant effort required to collect original data on house prices, such collection efforts remain very limited. Since the scope and focus of these studies differ or as they purely rely on price indices, comparing property prices across studies is not feasible.

Fourth, efforts to investigate property prices in EMDEs mainly converge to analyze the determinants of mean house prices. Distributional efforts in property price collection for different income segments within emerging economies are largely absent. CAHF is unique in its effort to approach the house price estimation from the perspective of low-cost developers. Yet, CAHF takes it to the other extreme. It only collects the cheapest price of a house built by a formal developer in African countries and does not factor in otherwise transacted housing units in the formal housing market. While insightful, this approach does not allow pricing the entire housing market, offering an understanding of the quantity of “affordable” houses available to different income segments.

Lastly, studies assessing the historical developments of private property prices are concerned with measuring financial (in)stability, which they attribute to distorted ratios of household mortgage debt to asset values. A myriad of studies estimating house prices in developed economies were published after the collapse of the housing bubble and the resulting financial crisis in 2008/2009, mainly concerned with how to identify housing bubbles in the first place. An examination of property prices from the perspective of affordability and demand-supply mismatches for different income segments is absent. In addition, studies assessing house price data usually draw on different methodologies to estimate property prices and rely on varying data sources.
Hence, property price estimations are not comparable across studies, and scholars have only recently started to use big data approaches to collect actual property price data for a larger number of economies. Yet, their focus mostly remains on developed countries.

To fill these gaps, our paper extends the novel approaches in collecting house prices through a web scraping approach to emerging economies, thus addressing the substantial data gap in EMDEs. Further, this paper offers a methodology contributing to a more distributional understanding of private property prices in emerging economies. It also provides a comprehensive methodology to estimate a standard house price that allows for consistent price comparison across countries. These data can then facilitate the extension of the scope of the analysis to affordability assessments of property price data and the segmentation of the housing market – particularly focusing on EMDEs.

3. Data Collection and Processing: House Price Data in Emerging Economies

In this section, we outline how we collect house price data through a web scraping approach for five markets: Albania, Costa Rica, Morocco, Pakistan, and South Africa. We demonstrate how a big data approach, hitherto employed mainly in developed economies with good data quality, can also be extended to EMDEs to collect price data efficiently. We collected 200,000 unique property transactions for these five countries in an otherwise data-scarce environment. The web scraped data reflect the entire housing market and complement the available indexed data that (mostly) report average property prices only. We selected these five economies to cover different regions and to factor in varying country contexts to highlight specificities of web scraping and data processing in EMDEs. These include, for instance, types of properties listed, unique forms of data entry specific to EMDEs, or cultural aspects.
We do not strive for the representativeness of these five countries for all EMDEs but seek to exemplify the unique challenges of applying a web scraping approach in EMDEs. Nevertheless, transferring this approach to other EMDEs, especially those with lower data quality, will come with additional unique challenges (as discussed in more detail in Section 3.3).

3.1 Web Scraping House Prices in EMDEs

The transaction price would be the ideal source to obtain comparable property price data. Typically, these data can be found in land registries or tax authorities, collated from real estate agencies, collected through online surveys, or obtained through appraisals or valuations as part of the mortgage process. However, none of these sources are feasible for automated data collection in EMDEs as the various institutions holding these price data do not yet have a standardized way of collecting, publishing, or even digitizing them. In markets characterized by lax regulation or enforcement, transacting parties may under-declare property prices to avoid negative ramifications with respect to paying additional registration costs or taxes.

We opted to collect property price data for EMDEs through a web scraping approach of real estate websites. While not yielding transaction prices, obtaining listing prices of formal properties is a viable alternative to gathering price data in EMDEs, where data are otherwise non-existent. At least in the context of developed markets, strong evidence exists that listing prices are correlated with, and a good leading indicator of, transaction prices (e.g., Ardila et al. 2021; Anenberg & Laufer 2017; Lyons 2019). For each economy, we scrape property prices and additional data points for all available listings, capturing, to the extent possible, location aspects.
Scraping unique property transactions has several advantages: first, they allow us to provide actual property price data of an entire housing market in EMDEs, facilitating analysis beyond aggregated or indexed data. Particularly in EMDEs, we expect significant differences in prices between the biggest business city and rural areas, and more considerable skewness in data even within cities. Second, by web scraping property prices of the entire formal housing market, we can analyze sub-markets in greater detail. Third, collecting house price data from the entire formal, online housing market allows us to also capture the quantity of housing available at different price segments and can, therefore, provide an overview of the distributional aspects of the housing market within a given economy. These distributional price data are very relevant for additional analysis as they can, for instance, be paired with household-level income data for affordability assessments. Finally, an overview of the entire housing market can reveal supply-demand mismatches, particularly regarding the price segments in which formal housing market activity is generally low or absent altogether.

We start the web scraping process by identifying the most up-to-date and complete websites that list private property prices for sale in the five EMDEs. We identify up to three relevant listing websites per economy. Websites were selected based on the following aspects: (i) websites with the most comprehensive number of up-to-date listings; (ii) websites that offer broad ranges of properties and do not only cater to the luxury segment (i.e., avoiding websites exclusively targeting expats etc.); (iii) websites that offer structured data entries on housing attributes including price and size. We limit ourselves to up to three websites since we notice considerable cross-postings in additional, usually less comprehensive, websites.
We then scrape all residential properties that are listed at one point in time for sale on these websites along with all available housing features, including price, size, type (i.e., whether the property is an apartment or a house), location, number of bedrooms, number of bathrooms, and sometimes amenities such as garage, time of construction, number of floors, etc.

We extracted online listing data for the entire formal housing market of five EMDEs at one point in time, between April 2020 and August 2020. While this falls within the onset of the COVID-19 pandemic, insights on how house prices were impacted in EMDEs are qualitative and largely anecdotal. Commentary on the matter focuses on the affordability challenges for households rather than specifically on changes in property prices.11 Beyond qualitative insights, comprehensive analyses on price changes due to the COVID-19 crisis are preliminary and focused mostly on developed economies (e.g., Pfeifer & Steurer 2020 for Vienna and London; or Bricongne et al. 2021 for the UK). While the results are not transferable to EMDEs, they still offer some context into a largely under-researched area. On the impact of COVID-19 on housing markets in the UK, Bricongne et al. (2021) show that while the number of offers per week dropped during the first lockdown period, house prices did not change significantly (maximum of 2.6 percent increase). Pfeifer & Steurer (2020) make a similar observation for the housing market in London, while showing that the housing market in Vienna follows an upward trend following the COVID-19 crisis. Despite the timing overlap, we cannot draw on existing literature to determine if bias may exist in our data, or, more importantly, the direction of the bias.
As a first effort to scrape private property prices in several EMDEs, we faced unique challenges compared to similar efforts in developed markets, such as issues pertaining to the number of observations available per website, the organization and reliability of data, measurement units provided, and the overall reliability of the websites.

In many EMDEs, property listing websites are often not the primary source for transactions. Often, buyers and sellers revert to real estate agents and personal interactions. Yet, online platforms are becoming increasingly popular for transacting goods and services, including properties. In Africa, for instance, Jumia.com, which is an online platform combining an e-commerce marketplace, classified websites, and applications, is widely used across the continent. In Nigeria, Property Pro (formerly Jumia) is the number one property transaction website, covering about 65 percent of the Nigerian online real estate market (Nairametrics 2018).

11 Some regional analyses in Latin America qualitatively point to an economic slowdown and increased investor uncertainty dampening growth in the short term, but also to a shift in consumer preferences during the COVID-19 pandemic toward larger properties with more outdoor space. Another analysis for India, for instance, reported that house prices stagnated in 2020/2021, a trend attributed, among other factors, to receding demand due to the COVID-19 pandemic (Reuters 2021).

Additionally, EMDEs’ websites, particularly those that do not have a regional spread like Jumia.com in Africa, have limited formal standards regarding data entry. Many websites in emerging economies do not have consistent data on an array of property characteristics and provide somewhat limited information on the listed properties. Often, they only include some pictures of the property, the listed price, and a phone number through which the seller can be reached.
Formal developers, who have started to also list newly built properties on online platforms (in addition to their own online or offline platforms), provide slightly more structured information on the transacted property. Yet, they are bound by the format of the online platform, which often only requires submission of property price and size. Few developers provide exhaustive property descriptions in free text format, which, if extracted, needs to be processed for data analysis through text mining. Hence, property data obtained through web scraping in emerging economies, in our experience, will not be as exhaustive in terms of obtaining different property features as found in publications on developed markets (cf. Section 2.2).

Furthermore, across websites in EMDEs, there is no standardized way to record the size of the property. Sometimes, the website does not provide the option to insert size information at all. In addition, user-provided information on size might not necessarily comply with the unit required by the platform (e.g., users insert square feet even though the platform requires square meters). The matter is even more complicated in economies where local measurement units for properties are used alongside more “standardized” measurements. South Asian websites (Nepal; Pakistan) allow for the insertion of different size units including Biga, Kattha, Dhur, Ropani, Aana, Paise, and Daam, alongside square feet and square meters. However, not all users consistently specify the measurement unit, making data cleaning cumbersome. In addition, particularly for houses, size data can be somewhat muddled as it is unclear whether the plot size or the usable property size is indicated. To account for this difference, we distinguish apartments and houses (cf. Section 4.1) and, where available, use plot size for houses to also account for the value of the land, which – in some EMDEs – can be a significant portion of the property price (cf. Section 2.2).
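The unit problem described above can be illustrated with a small normalization routine. This is a minimal sketch, not the procedure used in the paper: the function name `parse_size`, the status labels, and the accepted unit spellings are hypothetical, and commas are assumed to be thousands separators. Only square feet and square meters are converted; local units are deliberately flagged for manual review because their conversion factors are not given in the text.

```python
import re

SQFT_TO_SQM = 0.09290304  # exact, since 1 ft = 0.3048 m

# Only units with a fixed, unambiguous conversion are listed (assumption).
# Local units such as Ropani or Kattha are left unresolved on purpose.
TO_SQM = {
    "sqm": 1.0, "m2": 1.0, "square meters": 1.0,
    "sqft": SQFT_TO_SQM, "sq ft": SQFT_TO_SQM, "square feet": SQFT_TO_SQM,
}


def parse_size(raw):
    """Return (size_in_sqm, status) for a free-text size entry."""
    match = re.match(r"\s*([\d.,]+)\s*(.*)$", raw.lower())
    if not match:
        return None, "unparsed"
    try:
        # Commas are treated as thousands separators (assumption).
        number = float(match.group(1).replace(",", ""))
    except ValueError:
        return None, "unparsed"
    unit = match.group(2).strip().rstrip(".")
    if unit == "":
        return None, "missing_unit"   # user gave no unit at all
    if unit not in TO_SQM:
        return None, "unknown_unit"   # e.g. a local unit; review by hand
    return round(number * TO_SQM[unit], 2), "ok"


print(parse_size("1,200 sq ft"))
print(parse_size("5 Ropani"))
```

Returning a status alongside the value keeps ambiguous entries visible in the data set instead of silently dropping or mis-converting them.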
Finally, in some EMDEs, listing websites do not specify whether a property is for rent or for sale. Usually, rental properties can easily be distinguished by relatively low prices. However, in some EMDEs, it is common to pay one year’s rent upfront. In these instances, it is challenging to discern low sales prices from annual rents in cases where listings do not distinctly indicate sale versus rent. In addition, price data do not always include a currency marker, which is mainly problematic in countries where both euros and US dollars are used to transact properties in addition to local currencies.

When the aim is to estimate representative property prices of the housing stock in a country or region, scraping at least 0.5-1 percent of the number of households in that area is considered to be a large enough sample (cf. Bricongne et al. 2019). The same principle applies when the statistical population being analyzed is the universe of transacted properties in the market over a given period: in developed economies where residential property markets are formalized, it is possible to obtain the total number of transactions (the statistical “universe”) and thus sample appropriately to achieve representativeness. In EMDEs, by contrast, we expect most transactions to occur informally and outside of what is observable publicly. Through web scraping, we constrain our analysis to the formal housing market and to what is transacted online. With this approach, we are able to obtain house prices for the formal housing market but are unable to infer the degree to which the estimations apply to – in some EMDEs admittedly large – informal housing markets. We aim for representativeness of the formal housing market only, and to achieve this, we scrape entire websites to cover all available listings. Despite these efforts, we encounter smaller sample sizes in some countries, which are likely to stretch margins of error (cf. Section 3.3).
Annex 1 provides an overview of the sample coverage and percentage of formal households scraped.

3.2 Data Processing: Data Cleaning and Outlier Removal

As with any data set that is obtained from user-inserted data, the scraped data are prone to incorrect, inconsistent, or missing information. Most online listing platforms do not run quality checks on the listed properties or require fully populated identification of property features. Preparing the data for analysis, we diligently cleaned the web scraped data, removing data entry errors, duplicates, and outliers. Given the issues outlined in the previous section, which are inherent to EMDEs, data cleaning is more tedious and time-consuming than for more structured data likely to be obtained in developed economies. We describe the data processing steps in detail below, illustrating descriptive statistics of the various stages of the process (Table 1).

Duplicates

The first step in processing the data is identifying and flagging repeat data and duplicates. These mostly arise for two reasons: first, many EMDEs’ property websites allow for the re-submission of the same property within some days’ interval. Realtors mainly use this option to restore the property at the top of the search results list to improve visibility on the website. Second, duplicates may also arise because of cross-listing of properties across different platforms. Ideally, we want to create a data set that removes both occurrences. Hence, we deduplicate the data set to obtain what we call the original data set, dropping all exact duplicates that either have the same listing ID or that include the same title and description. Typically, properties with the same title and the same description are a clear indication for a repeat entry of a property on the same website. The title and description of properties, however, might bear similarities in instances where multiple, newly built apartments are advertised within the same complex.
Also, in these cases, property features such as price, size, address, or number of bedrooms might be identical while referring to unique listings. Retaining these observations in the data set, we only remove exact duplicates with identical values on price, size, bedroom, title, and description. A downside of this approach is that we run the risk of keeping observations in the data set where the title or description has been slightly altered during the re-submission of the same property listing to the website.

As we assume that we will retain some duplicates in the data set, we also perform a more rigorous de-duplication where we remove all data that could potentially constitute a duplicate to understand how this alters our estimations. In this stringent de-duplication process, we remove all observations that have the same value on available property features only (price, type, bedroom, bathroom, and city) and disregard the title and the description of the property. While we note that this procedure is highly likely to also remove observations that are in fact unique but share the same property features, we perform this robustness test to ensure that repeat data do not drive estimated property prices. More sophisticated duplicate removal would include the use of text analysis techniques to understand the extent of similarity of the title or description of the property to remove those observations that have only been slightly changed during re-submission. Given the significant time effort of this technique, we opted for the more stringent data removal as a robustness test for price estimations (Annex 2).

Data Filtering

Next, we filter the data by excluding scraped data that are clearly not residential properties. These include storage units, garages, parking lots, undeveloped or agricultural land, or commercial properties. In addition, we truncate the data on price and size to exclude data entry errors and rental data.
These include, for instance, rental data likely erroneously captured as sales price, particularly for those properties that include yearly rentals, spam or negotiable listings often detectable by “1” entered as the sales price, and random data entries on square meter data. We apply a direct data filter to remove obvious errors and undesired data to obtain the truncated data set. We assume that all observations of below 9 square meters (sqm) and above 3,000 sqm are either data entry errors or properties that cannot be considered residential properties (e.g., storage units; large farmland). Also, we assume that properties of less than 9 sqm are not habitable for one person, aligning with the definition of the UN (UN-Habitat 2007). Regarding the truncation on price data, we remove any properties below 5,000 US dollars and above 50 million US dollars to account for data entry errors, rental prices that are accidentally listed as sales, as well as – a very typical feature in EMDEs – entire apartment complexes that are sold in bulk as an investment project. The major issue with apartment complexes or several apartment units being sold in bulk is the mismatch between the size and price data. Often, the price reflects the price of the overall apartment complex while the size reflects that of a single unit. Since it is often impossible to infer the actual per unit price and size, we exclude these properties to avoid distortion.

While we apply a context-driven data filter to maintain as many observations as feasible in the truncated data set, we also apply a more rigorous winsorization to the data, common in large data sets such as ours (e.g., Bricongne et al. 2021 for HouseLev Data). We remove observations below the 1st and above the 99th percentile of price and size and outline how this winsorization alters summary statistics (cf. Annex 3).

Outlier Removal

Having obtained the truncated data set, we perform additional outlier removal to ensure that skewed data do not drive estimations.
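The steps that lead up to the truncated data set can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: all field names and toy records are hypothetical, the percentile uses the nearest-rank rule (the text names no method), and the upper price bound is a parameter whose default follows the running text. Since the text describes removing tail observations, the sketch drops them rather than capping them.

```python
def dedupe(listings, keys):
    """Keep the first listing for each distinct combination of `keys`."""
    seen, kept = set(), []
    for row in listings:
        signature = tuple(row.get(k) for k in keys)
        if signature not in seen:
            seen.add(signature)
            kept.append(row)
    return kept


def truncate(listings, min_sqm=9, max_sqm=3_000,
             min_usd=5_000, max_usd=50_000_000):
    """Drop listings outside the size and price bounds."""
    return [r for r in listings
            if min_sqm <= r["sqm"] <= max_sqm
            and min_usd <= r["price_usd"] <= max_usd]


def percentile(values, q):
    """Nearest-rank percentile for integer q (method is an assumption)."""
    ordered = sorted(values)
    rank = max((q * len(ordered) + 99) // 100, 1)  # integer ceil(q*n/100)
    return ordered[rank - 1]


def trim_tails(listings, field, lower=1, upper=99):
    """Remove listings below the `lower` or above the `upper` percentile."""
    values = [r[field] for r in listings]
    lo, hi = percentile(values, lower), percentile(values, upper)
    return [r for r in listings if lo <= r[field] <= hi]


# Hypothetical field names and toy records, for illustration only.
EXACT_KEYS = ("price_usd", "sqm", "bedrooms", "title", "description")
STRICT_KEYS = ("price_usd", "type", "bedrooms", "bathrooms", "city")

base = {"price_usd": 90_000, "sqm": 80, "type": "apartment", "bedrooms": 2,
        "bathrooms": 1, "city": "Tirana", "title": "Flat near centre",
        "description": "Second floor"}
scraped = [
    {**base, "id": "a1"},
    {**base, "id": "a1"},                                # re-submitted listing
    {**base, "id": "b7"},                                # cross-listed copy
    {**base, "id": "c3", "title": "Flat near centre!"},  # title slightly edited
    {**base, "id": "d9", "price_usd": 1},                # placeholder price
    {**base, "id": "e2", "sqm": 4_500},                  # farmland-sized entry
]

original = dedupe(dedupe(scraped, ("id",)), EXACT_KEYS)  # default pass
strict = dedupe(original, STRICT_KEYS)                   # robustness pass
truncated = truncate(original)
print([r["id"] for r in original])
print([r["id"] for r in strict])
print([r["id"] for r in truncated])
```

In this toy example the strict pass removes the edited re-submission `c3`, but it also removes `e2`, which differs only in size. That mirrors the trade-off noted in the text: the stringent rule catches altered duplicates at the cost of dropping some listings that may be unique.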
Heavily positively skewed property price data seem to be a particularly acute problem in EMDEs where very luxurious properties catering to expatriates or foreign investors are transacted. To avoid analyzing severely skewed data, we employed two different approaches:

First, we right censored the data to remove luxurious residential properties that are not targeted at the local housing market. In doing so, we use Numbeo, a crowd-sourced global data platform that reports consumer prices, including private property prices in most countries’ largest cities. As Numbeo data are likely to be dominated by a bias towards data entry from higher-income individuals (with internet access), we consider Numbeo’s data maximum as the “true” maximum. Hence, we consider properties within the truncated data set that exceed the maximum per square meter price reported in Numbeo as outliers. Removing them yields the right censored data set.

Second, to prevent outliers at both tails of the distribution from distorting our estimations, we perform multivariate outlier removal on the truncated data set based on the robust Mahalanobis distance of each observation in the sample.12 With this outlier detection technique, we remove outliers throughout the entire distribution, but mostly concentrated on the left- and right-hand tail.

Given the numerous data quality issues outlined above, we consider the second avenue of outlier removal, the more restrictive technique, most appropriate for our purposes and hence use the multivariate outlier removal technique to obtain the final data set. All other data sets are retained for robustness check purposes and to illustrate the data processing only (cf. Table 1). Price estimations and predictions are only performed on the final data set.

Data Sets Illustrating Data Processing

Table 1 summarizes the property price data of the five economies for the different stages of the data processing.
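The multivariate rule that produces the final data set summarized in Table 1 can be stated compactly. The notation is an illustrative assumption: $x_i$ collects the variables screened for listing $i$ (for instance price, size, and per square meter price), and $\hat{\mu}$ and $\hat{\Sigma}$ are robust estimates of location and scatter. The chi-square cutoff is a common convention for such distances; the level $\alpha$ is not specified in the text.

```latex
d_i = \sqrt{(x_i - \hat{\mu})^{\top}\,\hat{\Sigma}^{-1}\,(x_i - \hat{\mu})},
\qquad
\text{flag listing } i \text{ as an outlier if } d_i^{2} > \chi^{2}_{p,\,1-\alpha}
```

Here $p$ is the number of variables in $x_i$. Because $\hat{\mu}$ and $\hat{\Sigma}$ are estimated robustly, extreme listings have little influence on the yardstick against which they are judged, which is what separates this rule from a distance based on the ordinary sample mean and covariance.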
In the original and truncated data set, the means of price, size, and per square meter price are (much) greater than their medians as the distribution is positively (and in some cases strongly) skewed by outliers. While more robust statistics such as median and the interquartile range (IQR) stay relatively consistent across data sets, the standard deviation and mean drop significantly from the original data set to the final data set. This pattern remains constant throughout the five countries and provides some suggestion that the outlier removal process, while comprehensive on distance metrics, does not significantly alter the balance of the right and left tail and the order of the distribution.

In Albania, the difference in standard deviation between the original data set and the final data set is stark despite the relatively low number of outliers being removed. In Morocco, the right censored data set is the same as the truncated data set as the maximum price listed in Numbeo is larger than the maximum price in the truncated data set; hence, no right censoring is applied here. In South Africa, the truncated data set – particularly if compared to the other four countries – excludes a relatively larger share of data entry errors and potential rental data. This might be attributed to large farms being included for sale on the website we used for South Africa (property24.com). In Costa Rica, the multivariate outlier removal technique detected particularly high-end, luxurious properties.

12 We are applying the smultiv command in Stata, which has consistently proved to outperform mcd, an alternative robust estimator for outlier detection (e.g., Verardi & McCathie 2012).

Table 1.
Illustration of Successive Stages of Data Processing

Albania

| Descriptive Statistic | Original Data Set | Truncated Data Set | Right Censored Data Set | Final Data Set |
|---|---|---|---|---|
| Number of observations | 3,389 | 3,376 | 3,303 | 2,709 |
| Price (USD) Median | 97,619.05 | 97,619.05 | 97,619.05 | 89,285.71 |
| Price (USD) Mean | 137,498.50 | 137,483.8 | 137,155.6 | 99,599.84 |
| Square meter Median | 97.00 | 97.00 | 98.00 | 92.00 |
| Square meter Mean | 116.73 | 115.84 | 116.98 | 92.40 |
| Price per square meter Median | 1,046.57 | 1,047.62 | 1,047.62 | 1,035.87 |
| Price per square meter Mean | 1,227.50 | 1,219.54 | 1,219.54 | 1,075.40 |
| Price per square meter IQR | 508.32 | 505.95 | 505.95 | 423.66 |
| Price per square meter SD | 2,886.30 | 2,804.12 | 2,804.12 | 326.74 |
| Number of bedrooms Mean | 1.98 | 1.98 | 2.00 | 1.78 |

Costa Rica

| Descriptive Statistic | Original Data Set | Truncated Data Set | Right Censored Data Set | Final Data Set |
|---|---|---|---|---|
| Number of observations | 10,376 | 10,202 | 10,159 | 7,551 |
| Price (USD) Median | 200,000.00 | 200,000.00 | 200,000.00 | 175,000 |
| Price (USD) Mean | 293,751.60 | 295,211.70 | 295,126.70 | 185,570.10 |
| Square meter Median | 181.00 | 181.00 | 181.00 | 150.00 |
| Square meter Mean | 347.11 | 234.17 | 234.17 | 160.71 |
| Price per square meter Median | 1,156.72 | 1,162.28 | 1,162.28 | 1,273.90 |
| Price per square meter Mean | 2,744.14 | 1,393.17 | 1,393.17 | 1,138 |
| Price per square meter IQR | 673.14 | 665.61 | 665.61 | 569.76 |
| Price per square meter SD | 80,189.03 | 3,869.17 | 3,869.17 | 543.29 |
| Number of bedrooms Mean13 | . | . | . | . |

13 The number of bedrooms in Costa Rica is missing because it was not available consistently from scraped websites.

Morocco

| Descriptive Statistic | Original Data Set | Truncated Data Set | Right Censored Data Set | Final Data Set |
|---|---|---|---|---|
| Number of observations | 10,734 | 10,479 | 10,479 | 8,673 |
| Price (USD) Median | 104,395.60 | 105,494.50 | 105,494.50 | 93,406.59 |
| Price (USD) Mean | 271,361.30 | 234,233.80 | 234,233.80 | 117,836.70 |
| Square meter Median | 95.00 | 95.00 | 95.00 | 88.00 |
| Square meter Mean | 188.59 | 133.68 | 133.68 | 95.47 |
| Price per square meter Median | 1,119.25 | 1,117.21 | 1,117.21 | 1,039.27 |
| Price per square meter Mean | 3,071.55 | 1,677.39 | 1,677.39 | 1,178.07 |
| Price per square meter IQR | 932.10 | 903.46 | 903.46 | 776.44 |
| Price per square meter SD | 33,458.45 | 7,570.71 | 7,570.71 | 557.07 |
| Number of bedrooms Mean | 2.65 | 2.64 | 2.64 | 2.44 |

Pakistan

| Descriptive Statistic | Original Data Set | Truncated Data Set | Right Censored Data Set | Final Data Set |
|---|---|---|---|---|
| Number of observations | 107,652 | 107,524 | 107,446 | 75,233 |
| Price (USD) Median | 83,934.34 | 83,934.34 | 83,934.34 | 60,930.12 |
| Price (USD) Mean | 156,756.10 | 154,420.40 | 154,487.40 | 68,011.17 |
| Square meter Median | 151.76 | 151.76 | 151.76 | 126.47 |
| Square meter Mean | 955.86 | 220.93 | 220.93 | 134.53 |
| Price per square meter Median | 535.38 | 535.38 | 535.38 | 491.63 |
| Price per square meter Mean | 634.36 | 633.16 | 633.16 | 507.60 |
| Price per square meter IQR | 344.14 | 344.14 | 344.14 | 245.81 |
| Price per square meter SD | 652.22 | 480.86 | 480.86 | 205.29 |
| Number of bedrooms Mean | 3.86 | 3.86 | 3.86 | 3.33 |

South Africa

| Descriptive Statistic | Original Data Set | Truncated Data Set | Right Censored Data Set | Final Data Set |
|---|---|---|---|---|
| Number of observations | 91,904 | 88,374 | 70,155 | 42,369 |
| Price (USD) Median | 95,588.23 | 93,286.45 | 102,301.8 | 68,734.02 |
| Price (USD) Mean | 155,699.4 | 147,055 | 157,654.9 | 87,676.75 |
| Square meter Median | 350 | 313 | 312 | 110 |
| Square meter Mean | 2,081.1 | 555.01 | 554.09 | 193.98 |
| Price per square meter Median | 360.38 | 387.17 | 387.17 | 580.06 |
| Price per square meter Mean | 887.42 | 629.44 | 629.44 | 734.70 |
| Price per square meter IQR | 614.62 | 624.15 | 624.15 | 620.48 |
| Price per square meter SD | 9,302.60 | 853.02 | 853.02 | 654.53 |
| Number of bedrooms Mean | 3.04 | 2.99 | 3.0 | 2.4 |

Note: The original data set contains the original set of all listings. The truncated data set retains listings that contain sale prices, size data, and whether the property is an apartment or house and truncates the data based on sqm < 9 or sqm > 3,000 and Price < $US 5,000 or Price > $US 5,000,000. The right censored data set retains unique listings that contain sale prices, size data, and whether the property is an apartment or house, truncates the data based on sqm < 9 or sqm > 3,000 and Price < $US 5,000 or Price > $US 5,000,000, and additionally removes the most extreme price data points on the upper end of the spectrum. The final data set retains thorough listings with reasonable values for price, size, and per square meter price, cleaned through multivariate outlier removal. SD = Standard Deviation; IQR = Interquartile range.

3.3 Limitations

While applying a web scraping approach to obtain price data in EMDEs is a very cost-effective and efficient way to collect data, particularly compared to conducting expensive surveys, there are several limitations that apply particularly in the context of EMDEs. To start with, the web scraping approach does not necessarily yield data that are representative of all properties in the market as we are only able to capture properties of sellers with access to the internet and who are able and willing to post their property online. By the same token, accessing house prices on real estate listing websites in emerging economies requires buyers to have access to these online listings. This might not always be the case, especially in lower-income segments of a given market. Particularly in developing areas, information density is low and might lead to data blind zones (Li et al. 2019).
Recent research shows that online platforms used for home sales, even in developed markets, may reproduce and even intensify existing forms of inequality within cities (Boeing et al. 2021; Angelo & Vormann 2018). While internet access is less of a concern in the five countries we outline here (cf. Annex 1), expanding the methodology to other countries might become problematic. In Burundi, for instance, only 5 percent of the overall population uses the internet, whether via computer, mobile phone, or other digital devices (World Bank 2022). In comparison, in Brazil, close to 74 percent of the population uses the internet (World Bank 2022). In countries with relatively low internet penetration rates, households might rely on alternative channels to buy properties: real estate agents, classified ads in newspapers, or informal personal contacts. Hence, the web scraping approach might not be suitable for capturing local property markets where online advertisements are not frequently used, which might weaken the generalization of the results to localized markets.

Second, the price data collected are concentrated in countries’ biggest business cities and urban centers. This is not surprising, since urban centers are where most new housing units are being built, responding to the accelerated urbanization rates currently observed in EMDEs. Further, urban dwellers are more likely to formally transact their property and to use online sources to sell or buy properties. Given the diversity of urbanization across EMDEs, the level of geographical disaggregation differs significantly. Disaggregated data for geographical areas beyond the major business city might not be sufficiently large to provide price estimations beyond the largest urban area.
Due to data limitations beyond the biggest business cities and the absence of location markers on many housing listing websites in EMDEs, highly complex and spatially heterogeneous housing markets cannot be fully delineated.

Third, the data listed on real estate websites include newly developed properties and the secondary housing market, which might bias house price estimations. In addition, some new housing developments are sold as investment projects, often tailored towards foreign investors. These properties are usually sold in bulk, i.e., as entire apartment complexes containing several apartment units. To account for this potential bias, we conduct careful, multivariate outlier removal. In addition, we differentiate between property types (apartments and houses) and provide distinct price estimates for both property types.

Fourth, the final transaction price is likely to be different from the listed price, which often appears to be the price ceiling that precludes the possibility of sales at higher prices (Horowitz 1992). Furthermore, the listed prices advertised online represent the user-inserted price, which could reflect either the appraised value from a third party such as a tax assessor or the self-appraised property value of the homeowner. Regarding the latter, several studies have pointed towards a large variance of self-appraised values, although in large enough samples like ours positive and negative errors tend to cancel each other out (e.g., Follain & Malpezzi 1981; Goodman & Itter 1992). The difference between the transaction and the listed property price depends on multiple factors, including the overall state of the housing market, the demand for housing, the time the property remains on the market before willing and able buyers come forward, as well as cultural aspects pertaining to, e.g., negotiation.
Despite these issues, in the absence of transaction price data sets, listing prices offer a good proxy to estimate the state of the housing market, as researchers have consistently found rather low deviations between listed price and transaction price (Arnott 2009; McGreal & Taltavull de La Paz 2013; Haurin et al. 2010).

3.4 Descriptive Statistics

Across the final data set of the five EMDEs in our sample, we observe different patterns in terms of availability of apartments versus single-family houses (Figure 1). While in Costa Rica, Pakistan, and South Africa apartments and houses are fairly evenly represented, Albania and Morocco have many more apartments than houses available within the data. In South Africa and Costa Rica, the right-hand side of the distribution is dominated by rather expensive single-family houses and only a few expensive apartments.

Similar patterns are observable in the price-size relationship (Figure 2). While in Pakistan the price and size differences between houses and apartments are marginal, South Africa, and to a lesser extent Costa Rica, show noticeable price differences between houses and apartments, which could potentially be attributed to composition effects, as houses and apartments are not equally located in all places. The strength of the observed correlation between house price and size also varies across countries (Figure 3). In South Africa, this correlation is weakest among apartments, among houses, and overall. This suggests that size may not be the primary driver of price, and that other attributes collected (e.g., location) may have more explanatory power.

Equally insightful are frequency distributions of smaller-sized apartments and houses within the data (Annex 4). Across countries, units smaller than 200 sqm are usually apartments. In Costa Rica and Pakistan, however, there are a significant number of smaller-sized houses, compared to the other emerging economies in our sample.
In Pakistan, houses are dominated by 5-Marla14 houses (equal to about 126 sqm), which are considered a typical house for a small family. 5-Marla houses are particularly prominent in Lahore, Rawalpindi, Islamabad, and Peshawar.

14 The Marla is a traditional unit of area that is used in India, Pakistan, and Bangladesh, with one Marla being equal to 25.29 square meters.

Figure 1. Frequency Distribution of Property Type, by Country [panels: Albania, Costa Rica, Morocco, South Africa, Pakistan]

Figure 2. Relationship of Price and Size, by Country [panels: Albania, Costa Rica, Morocco, South Africa, Pakistan]

4. Application: Estimation of the Basic House Price

This section demonstrates how the large volumes of property price data collected for select EMDEs can be used beyond descriptive statistics. To do so, we introduce the notion of the Basic House Price (BHP), the price of a standard house that is defined identically across all markets. By holding the type of house constant, the BHP aims to provide a data point on price that is independent of the distribution of type and quality of housing, which varies widely across markets. The BHP is a key concept that allows for comparison across and within EMDEs, and for assessing critical drivers of price and the performance of housing markets at the lower end of the price spectrum. We apply a machine learning technique, Random Forest (RF), to estimate the BHP from the collected web-scraped data. Compared to Ordinary Least Squares (OLS) regression, Random Forest has consistently been found to perform better and provide more accurate price predictions (cf. Section 2.3). While the results are presented for five countries by way of application in the next section, the methodology introduced can be rolled out for all emerging economies.

4.1 Defining a Basic House

Housing costs reflect the value of the land, the price of the house, the age, condition, and location of the property, as well as the local market.
Private property prices also depend on macroeconomic and demographic conditions, including migration, urbanization rates, population growth, and income growth, as well as on a country’s housing finance system, including current interest rates and the availability of mortgage lending for all segments of the population. Other aspects relevant to house prices include challenges on the supply side, such as a restrictive regulatory environment with lengthy permit granting processes, a shortage of labor and low mobility, as well as high construction costs.

In estimating the Basic House Price, we focus on formal housing built by private developers, through public-private partnerships between developers and governments, or by private individuals. We disregard projects that are purely government-sponsored, or housing that is self-built and probably not transacted in the housing market. Formal housing combines specializations in the housing value chain to deliver titled properties that can be pledged as collateral for a mortgage, that are structurally sound, and that comply with local planning standards and building codes (World Bank 2015). As formal housing often remains unaffordable to low-income households in many emerging economies, many families in these economies resort to incremental self-building. Self-building is particularly common at the outskirts of larger cities or in smaller towns. A recent study in India, for instance, found that 62 percent of newly financed houses are self-built (Das et al. 2018). In self-built environments, the initial house serves as an anchor for a multi-room home that accommodates multiple unrelated people or households (World Bank 2015).
While these self-builders provide shelter to many families where the alternative is often homelessness, we disregard these houses for this project, as self-built houses are often highly insecure in terms of tenure, do not necessarily comply with quality housing standards, building codes, or zoning regulations, and are not transacted in formal housing markets.15 Also, we do not consider factors such as household preferences over a set of amenities or locations, which might differ across households and countries. Given the limited information available and to maintain comparability across countries, we exclude these preferences from our model.16 Finally, we are only concerned with home ownership and defer scraping of rental data to further research. The relationship between property prices and rental prices has been discussed in depth elsewhere (e.g., Campbell et al. 2009; Engsted & Pederson 2015; Gallin 2008).

15 In Europe, in contrast, self-building is actively promoted as a means of addressing issues related to housing quality, affordability and sustainability (e.g., Bossuyt et al. 2018; Mullins & Moore 2018).

We define a basic house as a formally supplied 50 square meter (sqm), one-bedroom, one-bathroom apartment located in an urban area within a given country, assumed to provide basic municipal or on-site services including water, sanitation, road access, and an energy source. While we presented summary statistics for both houses and apartments, the BHP deliberately only includes apartments. We constrain the BHP to apartments for comparability purposes and to avoid potential distortions that can be attributed to the different reporting of size (plot size versus usable surface size) in houses.
In addition, by reporting the BHP exclusively for apartments, we also account for the ongoing debate regarding the need to increase housing density in emerging economies, particularly in cities that experience an influx of migration and growth, through densification of existing settlements or the building of multi-story, complex buildings. This is particularly relevant for Africa, where cities are 20 percent more fragmented than cities in Asia, more expensive, and less accessible for most (Lall et al. 2017). In some African countries, the densification, which is not served by the market, takes place in the informal realm. In many countries, single-family homes built on a plot of land are turned into mini-compounds where a main house is surrounded by ‘backyard shacks’ that are rented. This phenomenon of backyarding is particularly well documented for South Africa, where backyarding increased from 1.1 million in 2011 to about 1.8 million in 2016, providing many families an informal way of overcoming the limitation of housing supply in urban areas (e.g., Brueckner et al. 2018).17 Densification of houses has many beneficial effects, including a reduction of land use costs as well as of the cost of connecting to utility infrastructure and services, particularly in areas of accelerated urbanization (e.g., Kurvinen & Saari 2020).

While we deliberately apply a narrow definition of the BHP, the methodology presented in this paper can easily be transferred to other comparative units similar to the BHP that might better suit other researchers’ focus.
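The basic house definition above can be written down as a small, fixed record, which also makes the point about transferability concrete: another comparative unit is simply another record. The sketch below is only an illustration. The names `ReferenceUnit`, `BASIC_HOUSE`, and `query_row` are hypothetical and do not come from the paper, and the 80 sqm two-bedroom variant is a made-up example of an alternative unit.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class ReferenceUnit:
    """A standardized housing unit whose price is to be predicted (hypothetical helper)."""
    size_sqm: float = 50.0            # 50 square meters
    bedrooms: int = 1                 # one bedroom
    bathrooms: int = 1                # one bathroom
    property_type: str = "apartment"  # the BHP is restricted to apartments
    urban: bool = True                # located in an urban area


# The basic house as defined in Section 4.1.
BASIC_HOUSE = ReferenceUnit()


def query_row(unit: ReferenceUnit, location: str) -> dict:
    """Combine a reference unit with a location into one feature row for a price model."""
    row = asdict(unit)
    row["location"] = location
    return row


# An alternative comparative unit (assumed values, for illustration only).
family_unit = replace(BASIC_HOUSE, size_sqm=80.0, bedrooms=2)

tirana_query = query_row(BASIC_HOUSE, "Tirana")
```

Keeping the unit immutable (`frozen=True`) reflects the idea that the definition is held constant across markets while only the location varies.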
16 Recent research has applied deep learning models, especially convolutional neural networks (CNN), which allow the incorporation of the surroundings and other preferences into price estimation models (e.g., Law et al. 2019; Poursaeed et al. 2018).

17 Backyarding is not just a phenomenon in emerging economies. In Los Angeles or Sydney, for instance, so-called granny flats or casitas, which are essentially small backyard houses, are encouraged by the city administration to combat the growing number of homeless people (e.g., Durst & Wegman 2017; Gurran et al. 2020).

4.2 Estimating the Basic House Price: Random Forest Estimation

To estimate the BHP for each country, we run the following predictive regression specification using house-level data that we obtain by web scraping online listings:

$$\mathit{Price}_i = f(\mathit{Size}_i, \mathit{Type}_i, \mathit{Char}_i, \mathit{Location}_i)$$

where $\mathit{Price}_i$ is the listed price of property $i$; $\mathit{Size}_i$ is the size in square meters; $\mathit{Type}_i$ is the property type (apartment or house); $\mathit{Char}_i$ is a vector of characteristics of property $i$, including the number of bedrooms and bathrooms; and $\mathit{Location}_i$ is a vector denoting the location of property $i$, which includes, where available, the municipality, county, and/or city.

The predictive framework above is estimated in its linear form using OLS and through non-parametric estimation of the RF model. When it comes to ML approaches to predicting house prices, there is an expanding list of different approaches such as Random Forests, Quantile Regression, LASSO Regression, Adaptive Regression Splines, and Neural Nets (cf. Steurer et al. 2021), gradient boosting machine (GBM) or support vector machine (SVM) (e.g., Ho et al. 2021; Truong et al. 2020). Since previous research has demonstrated that Random Forest algorithms present the most accurate predictions, we opted for this non-parametric estimation technique and present other estimations as robustness checks (Mohd et al. 2019; Mullainathan & Spiess 2017; Pérez-Rave et al. 2019).
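To make the linear form of the specification concrete, the sketch below fits it by OLS on synthetic, noise-free listings and then predicts the price of a basic house. It is an illustration under stated assumptions, not the paper's estimation code: the coefficients in `TRUE_BETA` are made up, location is reduced to a single hypothetical "capital" dummy, and the helper names (`encode`, `fit_ols`, `predict`, `solve`) are invented here. The RF variant replaces the linear function with an ensemble of regression trees over the same inputs.

```python
from itertools import product


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def encode(size_sqm, is_apartment, bedrooms, in_capital):
    """Hypothetical encoding: intercept, Size, Type dummy, Char (bedrooms), Location dummy."""
    return [1.0, float(size_sqm), float(is_apartment), float(bedrooms), float(in_capital)]


def fit_ols(rows, prices):
    """Ordinary least squares via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * p for r, p in zip(rows, prices)) for i in range(k)]
    return solve(xtx, xty)


def predict(beta, row):
    """Linear prediction: inner product of coefficients and encoded features."""
    return sum(b * v for b, v in zip(beta, row))


# Made-up coefficients used only to generate synthetic, noise-free listings.
TRUE_BETA = [10_000.0, 800.0, -5_000.0, 3_000.0, 20_000.0]
rows = [encode(s, a, b, c)
        for s, a, b, c in product([40, 50, 80, 120, 200], [0, 1], [1, 2, 3], [0, 1])]
prices = [predict(TRUE_BETA, r) for r in rows]

beta = fit_ols(rows, prices)

# Predicted price of a 50 sqm, one-bedroom apartment in the hypothetical capital.
basic_house = predict(beta, encode(50, 1, 1, 1))
```

Because the synthetic prices are exactly linear in the features, OLS recovers the generating coefficients up to rounding error. Real listings would add noise and non-linearities, which is where the two estimators start to differ.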
RF (Breiman 2001) has recently gained popularity in property price predictions. RF models are based on classification and regression trees, which follow binary rule-based decisions that indicate how an input is related to its predictor variable (cf. Yoo et al. 2012). The RF is random in two ways: (1) each tree is based on a random subset of the observations, and (2) each split within each tree is created based on a random subset of variables (Grömping 2009: 311). In RF models, node splitting is not accomplished using all predictors, as is conventionally done in regression trees. Instead, RF node splitting is achieved using a random subset of predictors chosen at each node (e.g., Breiman 2001; Liaw & Wiener 2002). Hence, RF models are an ensemble tree-based learning algorithm that averages predictions over many individual trees using bootstrap aggregation (also known as bagging) (Breiman 2001). Applied to the real estate sector, RF maps each vector of house characteristics to a predicted value. The prediction function takes the form of a tree that splits at every node given the value of a particular housing characteristic (e.g., sqm; number of rooms) (Mullainathan & Spiess 2017). Given its very flexible functional form, RF is well suited for out-of-sample predictions and for varied structures of data. Unlike other econometric estimation techniques, RF models do not require training data to be normally distributed, which is particularly beneficial for property price research in EMDEs, as data might be heavily skewed.

While relatively new to property price estimations, RF models have a variety of advantages over traditional estimation techniques, particularly in EMDEs. First, RF models outperform other price estimation algorithms, offering more precise price estimations (cf. Section 2.3). Second, housing markets in EMDEs often have a series of sub-markets clustered around housing size, type of housing, or income group.
Traditional estimations like hedonic price models would often fail to capture these sub-markets. Hence, if the data set sufficiently covers the characteristics of the property, the RF model is expected to replicate the complex structure of the property price determination process more sensitively (cf. Hong 2020: 142). Third, RF models do not require a detailed model specification and are hence more suitable for EMDEs with potentially more skewed distributions. Finally, while hedonic price models have been more geared towards inference, RF models focus more on prediction (Yoo et al. 2012).

4.3 Parameter Optimization

The model-training process starts by randomly splitting each country’s data set into training and testing data, ensuring a random sort order. We split each country’s data set into two subsets: 50 percent of the data are used for training, and 50 percent are used for testing (validation) (cf. Schonlau & Zou 2020). The 50-50 split is the most common split in RF applications. Results on alternate splits are also tested and presented in Annex 5. More in-depth discussions on the effect of alternate splitting options are offered elsewhere (cf. Biau 2012; Ishwaran 2014).

Having decided on splitting the data set into training and testing data, we tune the hyperparameters to determine the model with the highest testing accuracy, focusing on the number of sub-trees and the number of variables randomly investigated at each split. The benefit of RF is that there are few hyperparameters with the potential to strongly influence the model’s performance (cf. Hong 2020). RF does not require an external cross-validation procedure to estimate the model's accuracy. Model selection and parameter tuning are driven by parameters that would produce the lowest out-of-bag (OOB) errors.18

First, we fix the number of sub-trees (number of iterations).
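The 50-50 split and the OOB-based check on the number of sub-trees can be sketched in a few lines. What follows is a minimal pure-Python illustration on synthetic listings, not the estimation code behind the paper: the data-generating coefficients, the tree settings (`max_depth`, `min_leaf`), the use of 30 trees, and the choice of root mean squared error as the OOB error measure are all assumptions made for the example, and the function names are hypothetical.

```python
import math
import random


def _sse(values):
    """Sum of squared deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, y, n_try, rng, depth=0, max_depth=6, min_leaf=3):
    """Regression tree; every split looks at a random subset of n_try features."""
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return sum(y) / len(y)
    best = None
    for f in rng.sample(range(len(rows[0])), n_try):
        for thr in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, y) if r[f] <= thr]
            right = [t for r, t in zip(rows, y) if r[f] > thr]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            score = _sse(left) + _sse(right)
            if best is None or score < best[0]:
                best = (score, f, thr)
    if best is None:
        return sum(y) / len(y)
    _, f, thr = best
    li = [i for i, r in enumerate(rows) if r[f] <= thr]
    ri = [i for i, r in enumerate(rows) if r[f] > thr]
    return (f, thr,
            grow_tree([rows[i] for i in li], [y[i] for i in li],
                      n_try, rng, depth + 1, max_depth, min_leaf),
            grow_tree([rows[i] for i in ri], [y[i] for i in ri],
                      n_try, rng, depth + 1, max_depth, min_leaf))


def predict_tree(tree, row):
    """Walk from the root to a leaf; leaves are plain floats."""
    while isinstance(tree, tuple):
        f, thr, left, right = tree
        tree = left if row[f] <= thr else right
    return tree


def oob_error_path(rows, y, n_trees, n_try, seed=0):
    """OOB root mean squared error after 1, 2, ..., n_trees bootstrap trees."""
    rng = random.Random(seed)
    n = len(rows)
    sums, counts, path = [0.0] * n, [0] * n, []
    for _ in range(n_trees):
        sample = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        tree = grow_tree([rows[i] for i in sample], [y[i] for i in sample], n_try, rng)
        for i in set(range(n)) - set(sample):          # observations left out of this tree
            sums[i] += predict_tree(tree, rows[i])
            counts[i] += 1
        sq = [(sums[i] / counts[i] - y[i]) ** 2 for i in range(n) if counts[i]]
        path.append(math.sqrt(sum(sq) / len(sq)))
    return path


# Synthetic listings: [size in sqm, bedrooms, capital dummy] -> price (made-up coefficients).
data_rng = random.Random(42)
listings = []
for _ in range(200):
    size, beds, capital = data_rng.uniform(30, 150), data_rng.randint(1, 4), data_rng.randint(0, 1)
    price = 900 * size + 2_000 * beds + 15_000 * capital + data_rng.gauss(0, 5_000)
    listings.append(([size, beds, capital], price))

data_rng.shuffle(listings)                        # random sort order
half = len(listings) // 2
train, test = listings[:half], listings[half:]    # 50 percent training, 50 percent testing
train_x, train_y = [x for x, _ in train], [p for _, p in train]

oob_path = oob_error_path(train_x, train_y, n_trees=30, n_try=2)
```

Plotting `oob_path` against the tree count gives the kind of curve used to judge when adding further sub-trees stops changing the OOB error. The held-back `test` half is not touched during this step.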
As RF OOB error rates converge once the number of iterations gets large enough, we set the iterations to 500 for all models instead of tuning the number of iterations for each country’s data set individually (cf. Breiman 2001; Schonlau & Zou 2020). While some scholars applying RF spend a fair amount of time tuning towards the optimal number of sub-trees, recent research has shown that increasing the number of trees does not harm the model and that the biggest performance gain is achieved within the first 250 trees (Probst & Boulesteix 2018). To check the robustness of this assumption for our data, we iteratively run the model for two countries, testing incrementally how increasing iterations from 10 to 500 alters the OOB error rate. As error rates stabilize with increasing iterations for both countries, we chose the highest number of sub-trees (500) for all our models (Figure 3).

Figure 3. Out-of-Bag Error Rate for Varying Number of Iterations [panels: Albania, Morocco]

Second, we select the number of variables to randomly investigate at each split – the depth of the decision trees. RF models applied to property price estimations in developed markets often have many property attributes to choose from (e.g., square meters, bedroom, bathroom, garage, age of property, location, distance to markets etc.). In these scenarios, to select the best RF model, authors often remove less important property attributes one at a time to estimate the relative performance of the model (e.g., Hong 2020) or “only” use the ten most important predicting variables in the final model (e.g., Čeh et al. 2018). Selecting the number of attributes where the OOB error rate is lowest is another common decision factor in RF model selection (Schonlau & Zou 2020).

18 The error of the Random Forest is approximated by the OOB error during the training process. Each tree is built on a different bootstrap sample which, by random chance, leaves out about one-third of the observations. These left-out observations for a given tree are referred to as the OOB sample. Finding parameters that would produce low OOB error is often a key consideration in model selection and parameter tuning (cf. Schonlau & Zou 2020: 6).

Since the number of property attributes is rather limited in most EMDEs that we cover, we include all available attributes in the RF model. For most of the five EMDEs presented in this paper, this includes at least the type of the property (apartment or house), size of the property, location (city, region, or district, depending on availability), number of bedrooms, and number of bathrooms. Costa Rica, unfortunately, does not provide the number of bedrooms and bathrooms, and hence only Size, Type of Property, and City are included as predictor variables. The exact variables used for each country are summarized in Table 2.

Table 2. Variables used in RF Model

Country | Variables used in RF Model
Albania | Size, Type of Property, Number of Bedrooms, Number of Bathrooms, County, City
Costa Rica | Size, Type of Property, City
Morocco | Size, Type of Property, Number of Bedrooms, Number of Bathrooms, City
Pakistan | Size, Type of Property, Number of Bedrooms, Number of Bathrooms, City
South Africa | Size, Type of Property, Number of Bedrooms, Number of Bathrooms, Province, Municipality, City

Since all property attributes that we are using for property price estimations have consistently been found to be relevant for price predictions (cf. Section 2.2) and since the number of property attributes is overall limited, we abstain from successively identifying the optimal number of features in the RF model and include all available attributes in our final model.19

5. Results: Private Property Prices in Five EMDEs

In the following section, we discuss the results of private property prices in five emerging economies across different regions: Albania, Costa Rica, Morocco, Pakistan, and South Africa.
We selected these economies as a way of demonstrating how a big data approach can be applied to notoriously data-scarce environments such as EMDEs.

19 To check for the robustness of this approach, we pooled all countries’ data into a global data set and trained the RF model on the overall data applying the same parameter optimization. Allowing for transfer learning of the model across countries, we then provide country-specific estimates derived from the global data set. Results of this approach are broadly in line for all countries except Albania, where we attribute the deviation of results to the smaller number of observations compared to other countries. Deviation in the number of observations across countries, as we are observing in our model, may contribute to bias when a global RF model is applied (cf. Annex 6).

5.1 Basic House Prices in Five Economies and Their Largest Cities

Having provided an overview of the availability of houses and apartments in the market and having discussed descriptive statistics on price and size of all available private properties within the available data, we now present the estimation of the BHP, which solely focuses on apartments. Estimating the property price of a Basic House as defined in Section 4.1, Table 3 summarizes the results of the estimation model. To compare the performance, we use the same explanatory variables available for every country across models (as outlined in Table 2). All results are robust to more rigorous removal of potential duplicates in the data (Annex 2) as well as the application of alternate splits (Annex 3).

Local property markets have their own characteristics stemming from the market itself and the products offered. The national averages in Table 3 mask the differences within the country. Particularly within capitals or the biggest business city, house prices are expected to be higher than in less urbanized areas.
To capture the different price dynamics, we provide price estimations for the BHP for the countries’ largest cities in Table 3.

Table 3. Basic House Prices, by City

Random Forest-Based Prediction

Country | City | Basic House Price, Apartment (Current USD) | Basic House Price, Apartment (Current PPP$) | Ratio predicted and obs. price (median) | Local ratio predicted and obs. price (median) | MAPE | L_MAPE
Albania | National | 71,205 | 189,200 | 1.01 | 1.01 | 0.28 | 0.19
Albania | Durres | 45,422 | 120,697 | 1.05 | 0.94 | 0.35 | 0.20
Albania | Sarande | 73,794 | 196,088 | 1.01 | . | 0.38 | .
Albania | Tirana | 74,337 | 197,531 | 1.00 | 1.02 | 0.26 | 0.19
Albania | Vlore | 54,856 | 145,765 | 1.27 | 0.85 | 0.39 | 0.16
Costa Rica | National | 113,065 | 194,476 | 0.98 | 0.97 | 0.27 | 0.40
Costa Rica | Alajuela | 110,284 | 189,694 | 1.23 | . | 0.38 | .
Costa Rica | Heredia | 104,378 | 179,535 | 1.08 | 1.16 | 0.22 | 0.14
Costa Rica | Puntarenas | 130,370 | 224,242 | 1.00 | . | 0.01 | .
Costa Rica | San José | 106,457 | 183,111 | 0.85 | 0.82 | 0.23 | 0.20
Morocco | National | 53,282 | 129,554 | 1.00 | 1.04 | 0.30 | 0.29
Morocco | Agadir | 40,081 | 97,456 | 1.07 | 1.12 | 0.37 | 0.33
Morocco | Casablanca | 79,739 | 193,886 | 0.93 | 1.07 | 0.26 | 0.30
Morocco | Fez | 42,155 | 102,500 | 1.05 | . | 0.31 | .
Morocco | Tangier | 40,828 | 99,272 | 0.99 | 1.04 | 0.31 | 0.28
Morocco | Marrakesh | 48,856 | 118,793 | 1.00 | 1.09 | 0.26 | 0.25
Pakistan | National | 21,849 | 91,269 | 0.99 | 1.16 | 0.31 | 0.30
Pakistan | Islamabad | 21,711 | 90,692 | 0.99 | 0.94 | 0.30 | 0.29
Pakistan | Karachi | 21,646 | 90,422 | 0.98 | 1.18 | 0.34 | 0.33
Pakistan | Lahore | 26,651 | 111,328 | 1.00 | 1.15 | 0.28 | 0.32
Pakistan | Rawalpindi | 25,837 | 107,927 | 1.02 | 1.10 | 0.25 | 0.24
South Africa | National | 63,745 | 137,501 | 1.02 | 1.07 | 0.30 | 0.30
South Africa | Cape Town | 106,536 | 229,801 | 0.89 | 1.02 | 0.36 | 0.33
South Africa | Durban | 43,917 | 94,731 | 1.03 | 1.05 | 0.32 | 0.28
South Africa | Johannesburg | 60,826 | 131,204 | 1.05 | 1.18 | 0.64 | 0.76

OLS-Based Prediction

Country | City | Basic House Price, Apartment (Current USD) | Basic House Price, Apartment (Current PPP$) | Ratio predicted and obs. price (median) | Local ratio predicted and obs. price (median) | MAPE | L_MAPE
Albania | National | 55,167 | 146,592 | 1.02 | 1.07 | 0.28 | 0.26
Albania | Durres | 41,359 | 109,900 | 1.16 | 1.05 | 0.41 | 0.26
Albania | Sarande | 49,601 | 131,801 | 1.04 | . | 0.15 | .
Albania | Tirana | 56,649 | 150,529 | 1.00 | 1.00 | 0.26 | 0.25
Albania | Vlore | 53,627 | 142,499 | 1.38 | 0.90 | 0.46 | 0.17
Costa Rica | National | 115,139 | 198,044 | 0.99 | 1.06 | 0.30 | 0.43
Costa Rica | Alajuela | 107,774 | 185,376 | 1.32 | . | 0.49 | .
Costa Rica | Heredia | 111,794 | 192,290 | 1.11 | 1.19 | 0.22 | 0.22
Costa Rica | Puntarenas | 115,813 | 199,204 | 0.87 | . | 0.13 | .
Costa Rica | San José | 116,961 | 201,179 | 0.90 | 0.91 | 0.22 | 0.18
Morocco | National | 47,201 | 114,769 | 1.01 | 0.96 | 0.42 | 0.41
Morocco | Agadir | 53,670 | 130,570 | 1.27 | 1.61 | 0.54 | 0.66
Morocco | Casablanca | 64,356 | 156,482 | 0.86 | 0.60 | 0.24 | 0.38
Morocco | Fez | 41,762 | 101,544 | 2.05 | . | 1.18 | .
Morocco | Tangier | 28,052 | 68,209 | 0.93 | 0.73 | 0.31 | 0.28
Morocco | Marrakesh | 43,852 | 106,626 | 0.94 | 0.82 | 0.26 | 0.28
Pakistan | National | 24,360 | 101,758 | 1.00 | 1.27 | 0.44 | 0.52
Pakistan | Islamabad | 24,607 | 102,789 | 1.05 | 1.01 | 0.38 | 0.33
Pakistan | Karachi | 24,380 | 101,842 | 0.91 | 1.35 | 0.41 | 0.48
Pakistan | Lahore | 24,153 | 100,895 | 1.05 | 1.06 | 0.50 | 0.31
Pakistan | Rawalpindi | 23,519 | 98,244 | 1.06 | 1.01 | 0.34 | 0.22
South Africa | National | 85,769 | 185,006 | 1.15 | 1.27 | 0.55 | 0.56
South Africa | Cape Town | 132,455 | 285,709 | 0.75 | 0.96 | 0.42 | 0.37
South Africa | Durban | 81,276 | 175,314 | 1.28 | 1.49 | 0.53 | 0.63
South Africa | Johannesburg | 86,642 | 186,889 | 1.21 | 1.50 | 0.91 | 0.85

Note: MAPE refers to mean absolute percentage error; L_MAPE refers to the mean percentage error based on prediction accuracy for apartments sized between 50 and 60 square meters; local ratio refers to the ratio between observed and predicted values for apartments sized between 50 and 60 square meters. Exchange rates are based on 2019 conversions.

The estimations show that among the five countries, at the national level and in US$ terms, Pakistan has the cheapest BHP (US$ 21,849), followed by Morocco (US$ 53,282), South Africa (US$ 63,745), Albania (US$ 71,205), and Costa Rica (US$ 113,065). In PPP$, however, the price levels for a basic apartment across the five countries are more equal. Visualization of the raw data in Figures 1 and 2 offers some intuition behind the cross-country differences. Comparing Pakistan and Costa Rica at opposite ends of the spectrum, we observe that the availability of apartments in Pakistan is concentrated at the lower end of the price spectrum, while the distribution in Costa Rica is much more even (Figure 1).
Moreover, while the price-size relationship (Figure 2) clearly points to apartments being listed cheaper than similar-size houses in Pakistan, the opposite is true in Costa Rica. As the BHP represents the typical estimated market price for a standard 50 sqm apartment, the prevalence of more luxurious and expensive apartments in Costa Rica is expected to drive up the benchmark price.

Within countries, there are also significant differences across regions and cities. In South Africa, for instance, price differences of the BHP between cities are stark. House prices in Cape Town are at the higher end of the price spectrum, where even a basic house is priced at US$ 106,536. Cape Town is one of the most popular tourist destinations in Africa and its property market is known to be tailored to more affluent retirees and foreign property buyers. In Morocco, Casablanca is the most expensive city, followed by Marrakesh. In addition to a consistently high prevalence of European buyers, many wealthy Moroccan families live in the suburbs of these cities, including Palmeraie in Marrakesh and Bouskoura in Casablanca, where prices usually start around US$ 700,000. In Pakistan, while nationally at the lowest end of the price spectrum of the five economies covered in this study, there are also significant intra-country differences. In this fast-growing emerging economy, the capital, Islamabad, has a thriving real estate market. While there are many houses at the lower end of the price spectrum, with the cheapest house advertised at roughly US$ 9,000, property prices in Islamabad can go as high as US$ 6 million.

5.2 Cross-Validation

The predicted BHP of the apartments obtained by both models was compared with the observed apartment prices in order to determine the predictive power of the different models. One standard measure often used in price estimation models is the quotient between the predicted price and the observed price for the property.
The acceptable median ratio between predicted and observed price is 0.9-1.1 (cf. International Association of Assessing Officers 2014; Čeh et al. 2018). Additionally, we evaluate the performance of the different models with the mean absolute percentage error (MAPE), which measures the average percentage deviation of predicted prices from actual property prices, expressed as

$$\mathrm{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{\hat{p}_i - p_i}{p_i} \right|,$$

where $\hat{p}_i$ is the predicted property price, $p_i$ the actual property price of property $i$, and $n$ the number of properties. To understand differences in predictions for the BHP, we estimate a localized MAPE for apartments between 50 and 60 square meters.

Comparing the predictive performances of the models based on the performance measures (ratio of predicted vs. actual value, MAPE, localized MAPE), we obtain more precise results in the RF models for all countries. All RF models are within the suggested range of the predicted/observed price ratio of 0.9-1.1, including the estimations where all potential duplicates are rigorously removed (Annex 2) and where different splits are applied (Annex 5). In the main OLS estimation, South Africa exceeds the acceptable range for the median ratio between predicted and observed price. The MAPEs indicate that the percent deviation of the RF prediction from the actual property price ranges between 27 percent for Costa Rica and 31 percent for Pakistan. Across the board, while MAPE is relatively high in both RF-based and OLS-based predictions, the RF estimation consistently performs better than the OLS prediction. While the quality and quantity of information used in both estimation techniques were identical, since the predictor in the RF model explores the hierarchical structure of features, it can more sensitively track the possibility that the effect of each attribute on price varies by context (Hong 2020). The limited coverage of observable property features within EMDEs could be one explanation for the acceptable accuracy of the estimation.
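The three performance measures used above (median predicted/observed ratio, MAPE, and the localized versions for 50 to 60 sqm apartments) are simple to compute once predictions are available. The sketch below uses four made-up listings purely for illustration. The function names are hypothetical, and `mape` returns a percentage, following the formula with the factor 100, whereas Table 3 appears to report the same quantity as a fraction.

```python
from statistics import median


def mape(predicted, observed):
    """Mean absolute percentage error, in percent."""
    n = len(observed)
    return 100 / n * sum(abs((p_hat - p) / p) for p_hat, p in zip(predicted, observed))


def median_ratio(predicted, observed):
    """Median of predicted price divided by observed price."""
    return median(p_hat / p for p_hat, p in zip(predicted, observed))


def localized(predicted, observed, sizes, low=50, high=60):
    """Keep only listings whose size lies between low and high square meters."""
    keep = [i for i, s in enumerate(sizes) if low <= s <= high]
    return [predicted[i] for i in keep], [observed[i] for i in keep]


# Made-up example listings (prices in USD, sizes in sqm).
observed = [100_000, 50_000, 80_000, 60_000]
predicted = [110_000, 45_000, 80_000, 66_000]
sizes = [55, 120, 52, 58]

overall_mape = mape(predicted, observed)
overall_ratio = median_ratio(predicted, observed)

local_pred, local_obs = localized(predicted, observed, sizes)
local_mape = mape(local_pred, local_obs)
local_ratio = median_ratio(local_pred, local_obs)

# Check against the 0.9-1.1 range cited for the median ratio.
within_range = 0.9 <= overall_ratio <= 1.1
```

In this toy example the overall median ratio falls inside the 0.9-1.1 band, while the localized ratio sits right at its upper edge, which shows how the local measures can diverge from the overall ones.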
In addition, while most studies applying the RF model focus on a very narrow housing market (e.g., Čeh et al. 2018; Levantesi & Piscopo 2020), our data extend to the entire housing market of the five emerging economies, covering heterogeneous properties with varying amenities including interior decorations, building age, or other features that are not captured in the model, as these property features were not consistently available on listing websites. Equally, we are jointly estimating property prices for a large swathe of locations within an economy, going beyond different neighborhoods within a city and aggregating both rural and urban areas. This deviates from the use of RF models in the literature to predict property prices for a well-defined, narrow set of locations, typically a city or a province/state.

The limited property features and available explanatory variables pose limitations to the use of the model to predict individual property prices across the spectrum. However, the purpose of the prediction in our case is to arrive at a typical price for a standard property that can be compared across countries and contexts. The RF and OLS-based approaches, notwithstanding the relatively high MAPE, can be considered improvements over the alternative of only considering one-dimensional summary statistics of price.

6. Conclusion

Given the difficulty of obtaining reliable private property price data in emerging economies, most analyses of house prices or affordability assessments are constrained to developed economies and limited in scope. Most cross-country analyses that assess trends in house prices are based on available indices, which often aggregate to the national level, masking important within-country dynamics and regional differences. To overcome this flaw and provide more in-depth insights into housing markets in emerging economies, we demonstrate how to collect a large range of localized data through web scraping of property listing websites.
Further, to compare property prices across countries, we introduce the concept of the Basic House Price (BHP), the average price of a basic one-bedroom apartment of 50 square meters in an urban area, which allows for comparability across countries. By demonstrating the methodology and data processing for five EMDEs, we show that web scraping offers a cost-effective way to obtain a large amount of price data for countries where official data are absent and where alternate data sources on prices are not available. The main constraint to this approach remains the unorganized structure of listing websites and the limited information available on property features and attributes. The approach should improve over time as the capabilities of listing websites improve and as they become the preferred method of listing. There is also room to improve the web scraping on several fronts. For instance, image recognition software can extract information that is not supplied systematically on listing websites and could improve model precision. In addition, addresses could be geotagged to incorporate crucial details about location and to capture within-city variation.

The paper aims to outline one efficient way to address the wide data gap on property prices in emerging economies. In addition, it outlines how, once collected, these data can be used to estimate the price of a standard house consistently. With the consistent methodology proposed in our paper, the BHP can be applied in several avenues for further research. First, available data and analysis can feed into further research on determinants of house prices and drivers of changes through time in emerging economies. While determinants of house prices are a well-researched subject in the literature, gaining increasing attention post-2008/2009, analyses mainly rely on data from developed economies.
If available at all, price estimations in emerging economies mostly cover very localized markets. To what extent these findings extend to a larger sample of emerging economies may be an area of research triggered by the data proposed in this paper. In addition, research areas more relevant for emerging economies, such as those related to empirically assessing inefficiencies in the housing value chain, would be possible with the data and analysis proposed by the paper. From an affordability perspective, this paper provides an important variable that may be combined with other data sources, for example, households’ disposable income. Bringing these various elements together in the analysis of a country’s housing market affordability is fundamental for more fully understanding housing needs and challenges faced by households in emerging economies.

References

Ardila, D., Ahmed, A., & Sornette, D. (2021). Comparing Ask and Transaction Prices in the Swiss Housing Market. Quantitative Finance and Economics 5(1), 67-93.
Alter, A., & Mahoney, E. M. (2021). Local House-price Vulnerability: Evidence from the U.S. and Canada. Journal of Housing Economics 54, 1-17.
Anenberg, E. & Laufer, S. (2017). A More Timely House Price Index. The Review of Economics and Statistics 99(4), 722-734.
Anderson, D. E. (2000). Hypothesis Testing in Hedonic Price Estimation. On the Selection of Independent Variables. The Annals of Regional Science 34(2), 293-304.
Anundsen, A. K., Gerdrup, K. & Hansen, F. (2016). Bubbles and Crises: The Role for House Prices and Credit. Journal of Applied Econometrics 31(7), 1291-1311.
Angelo, H., & Vormann, B. (2018). Long Waves of Urban Reform: Putting the Smart City in its Place. City 22(5–6), 782-800.
Antipov, E.A. & Pokryshevskaya, E.B. (2012). Mass Appraisal of Residential Apartments: An Application of Random Forest for Valuation and a CART-based Approach for Model Diagnostics. Expert Systems with Applications 39(2), 1772-1778.
Arnott, R. (2009). Housing Policy in Developing Countries: The Importance of the Informal Sector, in Spence, M., P.C. Annez & R.M. Buckley (eds.). Urbanization and Growth. Washington, DC: Commission on Growth and Development, pp. 167–97.
Athey, S. (2018). The Impact of Machine Learning on Economics. In A. K. Agrawal, J. Gans, & A. Goldfarb (eds.). The Economics of Artificial Intelligence: An Agenda. University of Chicago Press, pp. 507-547.
Balcilar, M., Gupta, R., & Shah, Z. B. (2011). An In-sample and Out-of-sample Empirical Investigation of the Nonlinearity in House Prices of South Africa. Economic Modelling 28(3), 891-899.
Biau, G. (2012). Analysis of a Random Forests Model. Journal of Machine Learning Research 13, 1063–1095.
Boeing, G., Besbris, M., Schachter, A., & Kuk, J. (2021). Housing Search in the Age of Big Data: Smarter Cities or the Same Old Blind Spots? Housing Policy Debate 31(1), 112-126.
Borba, J. O. & Dentinho, T. P. (2016). Evaluation of Urban Scenarios Using Bid-rents of Spatial Interaction Models as Hedonic Price Estimators: An Application to the Terceira Island, Azores. The Annals of Regional Science 56(3), 671-685.
Borde, S., Rane, A., Shende, G., & Shetty, S. (2017). Real Estate Investment Advising Using Machine Learning. International Research Journal of Engineering and Technology 4(3), 1821-1825.
Borio, C., Kennedy, N. & Prowse, S. (1994). Exploring Aggregate Asset Price Fluctuations across Countries: Measurement, Determinants and Monetary Policy Implications. BIS Economic Papers, Basle, Switzerland: BIS.
Bossuyt, D., Salet, W. & Majoor, S. (2018). Commissioning as Cornerstone of self-build Housing. Assessing the Constraints and Opportunities of self-build in The Netherlands. Land Use Policy 77, 524-533.
Breiman, L. (2001). Random Forests. Machine Learning 45(1), 5-32.
Bricongne, J-C., Turrini, A., & Pontuch, P. (2019). Assessing House Prices: Insights from ‘Houselev’, A Dataset of Price Level Estimates. European Economy Discussion Papers 101.
Bricongne, J-C., Meunier, B., & Pouget, S. (2021). Web Scraping Housing Prices in Real-time: the Covid-19 Crisis in the UK, Banque de France Working Paper, No. 827.
Brueckner, J. K., Rabe, C., & Harris, S. (2018). Backyarding: Theory and Evidence for South Africa. Policy Research Working Paper No. 8636. Washington, D.C.: World Bank.
Campbell, D., Morris, D., Gallin, J., & Martin, R. (2009). What Moves Housing Markets: A Variance Decomposition of the Rent-Price Ratio. Journal of Urban Economics 66, 90-102.
Campbell, J. Y. & Cocco, J. F. (2007). How do House Prices affect Consumption? Evidence from Micro Data. Journal of Monetary Economics 54(3), 591–621.
Can, A. (1992). Specification and Estimation of Hedonic Housing Price Models. Regional Science and Urban Economics 22(3), 453-474.
Catte, P., Price, R. W. R., Girouard, N., & André, C. (2004). Housing Markets, Wealth and the Business Cycle, OECD Economics Department Working Paper, No. 394. Paris: OECD.
Ceccato, V., & Wilhelmsson, M. (2020). Do Crime Hot Spots Affect Housing Prices? Nordic Journal of Criminology 21(1), 84-102.
Čeh, M., Kilibarda, M., Lisec, A., & Bajat, B. (2018). Estimating the Performance of Random Forest versus Multiple Regression for Predicting Prices of the Apartments. International Journal of Geo-Information 7(5).
Claessens, S. & J. Schantz (2019). Regional House Price Differences: Drivers and Risks. In Nijskens, R., Lohuis, M., Hilbers, P., Heeringa, W. (Eds.) Hot Property. The Housing Market in Major Cities. Cham, CH: Springer, pp. 39-49.
Das, C., A. Karamchandani, & Thuard, J. (2018). State of the Low-Income Housing Finance Market. Boston: FSG.
Davis, M. A. & Heathcote, J. (2005). Housing and the Business Cycle. International Economic Review 46(3), 751-784.
Deghi, A., Katagiri, M., Shahid, S., Valckx, N. (2020). Predicting Downside Risks to House Prices and Macro-Financial Stability. International Monetary Fund WP/20/11.
Del Negro, M. & Otrok, C. (2007). 99 Luftballons: Monetary Policy and the House Price Boom across U.S. States. Journal of Monetary Economics 54(7), 1962-1985.
Durst, N. J. & Wegmann, J. (2017). Informal Housing in the United States. International Journal of Urban and Regional Research 41(2), 282-297.
Drehmann, M. & Juselius, M. (2014). Evaluating Early Warning Indicators of Banking Crises: Satisfying Policy Requirements. International Journal of Forecasting 30, 759-780.
Engsted, T. & Pedersen, T.Q. (2015). Predicting Returns and Rent Growth in the Housing Market Using the Rent-Price Ratio: Evidence from the OECD Countries. Journal of International Money and Finance 53, 257-275.
European Union, International Labour Organization, International Monetary Fund, Organisation for Economic Co-operation and Development, United Nations Economic Commission for Europe, The World Bank (2013). Handbook on Residential Property Prices Indices (RPPIs). Luxembourg: European Union.
Fan, G. Z., Ong, S. E., & Koh, H. C. (2006). Determinants of House Price: A Decision Tree Approach. Urban Studies 43(12), 2301-2315.
Fletcher, M., Gallimore, P., & Mangan, J. (2000). Heteroscedasticity in Hedonic House Price Models. Journal of Property Research 17(2), 93-108.
Follain, J. R., Jr. & Malpezzi, S. (1981). Are Occupants Accurate Appraisers? Review of Public Data Use 9(1), 47-55.
Gardner, D. & Pienaar, J. (2019). Benchmarking Housing Construction Costs in Africa. Centre for Affordable Housing Africa. http://housingfinanceafrica.org/app/uploads/Benchmarking-Housing-Construction-Costs-Across-Africa-FINAL-19-May-2019.pdf
Gallin, J. (2008). The long-run Relationship Between House Prices and Rents. Real Estate Economics 36, 635-658.
Gauder, M., Houssard, C., & Orsmond, D. (2014). Foreign Investment in Residential Real Estate. Reserve Bank of Australia Bulletin. https://www.rba.gov.au/publications/bulletin/2014/jun/pdf/bu-0614-2.pdf
Gao, G., Bao, Z., Cao, J., Quin, A.K., Sellis, T. & Wu, Z. (2019). Location Centered House Price Prediction: A Multi-Task Learning Approach. arXiv:1901.01774.
Garrod, G. D. & Willis, K. G. (1992). Valuing Goods’ Characteristics: An Application of the Hedonic Price Method to Environmental Attributes. Journal of Environmental Management 34(1), 59-76.
Girouard, N., Kennedy, M., van den Noord, P., & André, C. (2006). Recent House Price Developments: The Role of Fundamentals, OECD Economics Department Working Papers No. 475.
Glaeser, E. L., & Ward, B. A. (2009). The Causes and Consequences of Land Use Regulation: Evidence from Greater Boston. Journal of Urban Economics 65(3), 265-278.
Gnagey, M. & Tans, R. (2018). Property-Price Determinants in Indonesia. Bulletin of Indonesian Economic Studies 54(1), 61-84.
Goodhart, C. & Hofmann, B. (2008). House Prices, Money, Credit, and the Macroeconomy. Oxford Review of Economic Policy 24(1), 180-205.
Goodman, J. L. & Ittner, J. B. (1992). The Accuracy of Home Owners’ Estimates of House Value. Journal of Housing Economics 2(4), 339-57.
Gröbel, S. & Thomschke, L. (2018). Hedonic Pricing and the Spatial Structure of Housing Data - An Application to Berlin. Journal of Property Research 35(3), 185-208.
Grömping, U. (2009). Variable Importance Assessment in Regression: Linear Regression versus Random Forest. The American Statistician 63(4), 308-319.
Gurran, N., Maalsen, S., & Shrestha, P. (2020). Is Informal Housing an Affordability Solution for Expensive Cities? Evidence from Sydney, Australia. International Journal of Housing Policy.
Gyourko, J., Mayer, C. & Sinai, T. (2013). Superstar Cities. American Economic Journal 5(4), 167-199.
Haurin, D. R., Haurin, J. L., Nadauld, T. & Sanders, A. (2010). List Prices, Sale Prices and Marketing Time: An Application to U.S. Housing Markets, Real Estate Economics 38(4), 659–85.
He, S., Wang, D., Webster, C., & Chau, K. (2019). Property Rights with Price Tags? Pricing Uncertainties in the Production, Transaction and Consumption of China’s Small Property Right Housing. Land Use Policy 81, 424-434.
Ho, W. K. O., Tang, B.-S., Wong, S. W. (2021). Predicting Property Prices with Machine Learning Algorithms. Journal of Property Research 38(1), 48-70.
Hong, J., Choi, H., Kim, W. (2020). A House Price Valuation Based on the Random Forest Approach: The Mass Appraisal of Residential Property in South Korea. International Journal of Strategic Property Management 24(3), 140-152.
Horowitz, J. L. (1992). The Role of the List Price in Housing Markets: Theory and Econometric Model, Journal of Applied Econometrics 7, 115-129.
Hu, L., He, S., Han, Z., Xiao, H., Su, S., Weng, M., & Cai, Z. (2019). Monitoring Housing Rental Prices Based on Social Media: An Integrated Approach of Machine-Learning Algorithms and Hedonic Modeling to Inform Equitable Housing Policies. Land Use Policy 82, 657–673.
ICP (2011). A New Approach to International Construction Price Comparison. Available at: http://siteresources.worldbank.org/ICPINT/Resources/270056-1255977254560/6483625-1273849421891/110622_ICP-OM_Construction.pdf
Igan, D. & Loungani, P. (2012). Global Housing Cycles, IMF Working Paper, No. 12/217. Washington D.C.: International Monetary Fund.
International Association of Assessing Officers (2014). Guidance on International Mass Appraisal and Related Tax Policy. Available at: http://www.iaao.org/media/Standards/International_Guidance.pdf
International Monetary Fund (2018). House Price Synchronization: What Role for Financial Factors? In IMF Global Financial Stability Report, April 2018. A Bumpy Road Ahead (pp. 93-133).
International Monetary Fund (2008). World Economic Outlook April 2008. Housing and the Business Cycle. Washington, D.C.: International Monetary Fund.
Ishwaran, H. (2014). The Effect of Splitting on Random Forests. Machine Learning 99, 75-118.
Jordà, O., Schularick, M., & Taylor, A.M. (2016). The Great Mortgaging: Housing Finance, Crises and Business Cycles. Economic Policy 31(85), 107-152.
Jordà, O., Schularick, M., & Taylor, A.M. (2015). Betting the House. Journal of International Economics 96(S2), 2-18.
Keskin, B. & Watkins, C. (2017). Defining Spatial Housing Submarkets: Exploring the Case for Expert Delineated Boundaries. Urban Studies 54(6), 1446-1462.
Kim, A. M. (2007). North versus South: The Impact of Social Norms in the Market Pricing of Private Property Rights in Vietnam. World Development 35(12), 2079-2095.
Knoll, K., Schularick, M. & Steger, T. (2017). No Price Like Home: Global House Prices, 1870-2012. American Economic Review 107(2), 331-353.
Krol, A. (2013). Application of Hedonic Methods in Modelling Real Estate Prices in Poland. Data Science, Learning by Latent Structures, and Knowledge Discovery, 501-511.
Kurvinen, A. & Saari, A. (2020). Urban Housing Density and Infrastructure Costs. Sustainability 12(2).
Law, S., Paige, B., & Russell, C. (2019). Take a Look Around: Using Street View and Satellite Images to Estimate House Prices. ACM Transactions on Intelligent Systems and Technology (TIST) 10(5), 1-19.
Lall, S. V., Henderson, J. V., & Venables A. J. (2017). Africa’s Cities: Opening Doors to the World. Washington, D.C.: The World Bank.
Lancaster, K. J. (1966). A New Approach to Consumer Theory. Journal of Political Economy 74(2), 132-157.
Levantesi, S. & Piscopo, G. (2020). The Importance of Economic Variables on London Real Estate Market: A Random Forest Approach. Risks 8(4), 1-17.
Li, M., Zhang, G., Chen, Y., Zhou, C. (2019). Evaluation of Residential Housing Prices on the Internet: Data Pitfalls. Complexity 1-15.
Liaw, A. & Wiener, M. (2002). Classification and Regression by Random Forest. R News 2(3), 18-22.
Libertun de Duren, N. R. (2018). Why There? Developers' Rationale for Building Social Housing in the Urban Periphery in Latin America. Cities 72, 411-420.
Luüs, C. (2005). The Absa Residential Property Market Database for South Africa: Key Data Trends and Implication. BIS Papers 21. Available at: https://www.bis.org/publ/bppdf/bispap21l.pdf.
Lyons, R. C. (2019). Can List Prices Accurately Capture Housing Price Trends? Insights from Extreme Market Conditions. Finance Research Letters 30, 228-323.
Mack, A. & Martínez-García, E. (2011). A Cross-Country Quarterly Database of Real House Prices: A Methodological Note. Federal Reserve Bank of Dallas Globalization and Monetary Policy Institute Working Paper No. 99. https://www.dallasfed.org/~/media/documents/institute/wpapers/2011/0099.pdf
McGreal, S. & Taltavull de La Paz, P. (2013). Implicit House Prices: Variation over Time and Space in Spain, Urban Studies 50(10), 2024-43.
McKinsey Global Institute (2014). A Blueprint for Global Affordable Housing Challenge. McKinsey Global Institute.
Mian, A., Sufi, A., & Verner, E. (2017). Household Debt and Business Cycles Worldwide. The Quarterly Journal of Economics 132(4), 1755-1817.
Montero, J.-M., Mínguez, R., & Fernández-Avilés, G. (2018). Housing Price Prediction: Parametric Versus Semi-parametric Spatial Hedonic Models. Journal of Geographical Systems 20, 27-55.
Mohd, T., Masrom, S., & Johari, N. (2019). Machine Learning Housing Price Prediction in Petaling Jaya, Selangor, Malaysia. International Journal of Recent Technology and Engineering 8(2S11), 542–546.
Mullainathan, S. & Spiess, J. (2017). Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives 31(2), 87-106.
Mullins, D. & Moore, T. (2018). Self-organised and Civil Society Participation in Housing Provision. International Journal of Housing Policy 18, 1-14.
Nairametrics (2018). Deal: ToLet.com.ng acquires Jumia House Nigeria, now Property Pro. Available at: https://nairametrics.com/2017/11/12/deal-tolet-com-ng-acquires-jumia-house-nigeria-now-property-pro/
Oladunni, T. & Sharma, S. (2016). Hedonic Housing Theory. A Machine Learning Investigation. International Conference on Machine Learning and Applications (pp. 522–527). Anaheim, United States, 18-20 December 2016.
Owusu-Manu, D., Edwards, D. J., Donkor-Hyiaman, K. A., Asiedu, R. O., Hosseini, M. R., Obiri-Yeboah, E. (2019). Housing Attributes and Relative House Prices in Ghana. International Journal of Building Pathology and Adaptation 37(5), 733-746.
Pérez-Rave, J., Correa-Morales, J. C. & González-Echavarría, F. (2019). A Machine Learning Approach to Big Data Regression Analysis of Real Estate Prices for Inferential and Predictive Purposes. Journal of Property Research 36(1), 59-96.
Pfeifer, N. & Steurer, M. (2020). Early Real Estate Indicators During the COVID-19 Crisis: A Tale of Two Cities. Graz Economic Papers, http://www100.uni-graz.at/vwlwww/forschung/RePEc/wpaper/2020-17.pdf
Philiponnet, N. & A. Turrini (2017). Assessing House Price Developments in the EU. Discussion Paper 048. https://ec.europa.eu/info/sites/info/files/dp048_en.pdf
Poursaeed, O., Matera, T., & Belongie, S. (2018). Vision-based Real Estate Price Estimation. Machine Vision and Applications 29(4), 667–676.
Probst, P. & Boulesteix, A.-L. (2018). To Tune or Not to Tune the Number of Trees in Random Forest. Journal of Machine Learning Research 18, 1-18.
Rae, A. (2015). Online Housing Search and the Geography of Submarkets. Housing Studies 30(3), 453-472.
Reuters (2021). Coronavirus Wave flattens Indian Housing Market Views: Reuters poll. Available at: https://www.reuters.com/article/us-india-property-poll/coronavirus-wave-flattens-indian-housing-market-views-reuters-poll-idUSKCN2D20A4.
Rodriguez, M., & Sirmans, C. F. (1994). Quantifying the Value of a View in Single-Family Housing Markets. Appraisal Journal 62, 600-603.
Rosen, S. (1974). Hedonic Prices and Implicit Markets: Product Differentiation in Pure Competition. Journal of Political Economy 82(1), 34–55.
Sadayuki, T. (2018). Measuring the Spatial Effect of Multiple Sites: An Application to Housing Rent and Public Transportation in Tokyo, Japan. Regional Science and Urban Economics 70, 155-173.
Schonlau, M. & Zou, R. Y. (2020). The Random Forest Algorithm for Statistical Learning. The Stata Journal 20(1), 3-29.
Steurer, M., Hill, R. J., Pfeifer, N. (2021). Metrics for Evaluating the Performance of Machine Learning Based Automated Valuation Models. Journal of Property Research 38(2), 99-129.
ten Bosch, O. & Windmeijer, D. (2014). On the Use of Internet Robots for Official Statistics, UNECE MSIS conference, Dublin, Ireland 2014.
Truong, Q., Nguyen, M., Dang, H. & Mei, B. (2020). Housing Price Prediction via Improved Machine Learning Techniques. Procedia Computer Science 174, 433-442.
UN-Habitat. (2007). Principles and Recommendations for Population and Housing Censuses (revision 2). New York: United Nations.
Uwayezu, E. & de Vries, W. T. (2020). Access to Affordable Houses for the Low-Income Urban Dwellers in Kigali: Analysis Based on Sale Prices. Land 9(3), 2-32.
Verardi, V. & McCathie, A. (2012). The S-estimator of Multivariate Location and Scatter in Stata. The Stata Journal 12(2), 299-307.
Wang, C. C. & Wu, H. (2018). A New Machine Learning Approach to House Price Estimation. New Trends in Mathematical Sciences 6(4), 165-171.
Wang, X., Li, K. & Wu, J. (2020). House Price Index Based on Online Listing Information: The Case of China. Journal of Housing Economics 50, 1-12.
World Bank (2015). Stocktaking of the Housing Sector in Sub-Saharan Africa. Challenges and Opportunities. Washington, D.C.: The World Bank.
World Bank (2020). Purchasing Power Parities and the Size of World Economies. Results from the 2017 International Comparison Program. Washington, D.C.: The World Bank.
World Bank (2021). World Development Indicators. Washington D.C.: The World Bank.
Yan, Z. & Zong, L. (2020). Spatial Prediction of House Prices in Beijing Using Machine Learning Algorithm. Proceedings of the 2020 4th High Performance Computing and Cluster Technologies Conference & 2020 3rd International Conference on Big Data and Artificial Intelligence, pp. 64-71.
Yoshino, N. & Helble, M. (2016). The Housing Challenge in Emerging Asia: Options and Solutions. Tokyo: Asian Development Bank Institute.
Yilmazer, S. & Kocaman, S. (2020). A Mass Appraisal Assessment Study Using Machine Learning Based on Multiple Regression and Random Forest. Land Use Policy 99, 104889.
Yoo, S., Im, J., & Wagner, J. E. (2012). Variable Selection for Hedonic Model Using Machine Learning Approaches: A case study in Onondaga County, NY. Landscape and Urban Planning 107(3), 293-306.
Zhang, B., Li, W., Lownes, N., & Zhang, C. (2021). Estimating the Impacts of Proximity to Public Transportation on Residential Property Values: An Empirical Analysis for Hartford and Stamford Areas, Connecticut. International Journal of Geo Information 10(44), 1-11.
Zhong, H. & Li, W. (2016). Rail Transit Investment and Property Values: An Old Tale Retold. Transportation Policy 51, 33-48.

Annex 1: Representativeness of the scraped data on the housing market

| Country | Population (2019) | Average HH size | Number of households | Number of scraped observations | Individuals using the internet (% of the population) | Percentage of all households covered in scraping | Slum population, percent of urban population | Ratio of mortgages to GDP, percent |
|---|---|---|---|---|---|---|---|---|
| Albania | 2,854,191 | 3.66 | 779,688 | 3,389 | 72.24% | 0.43 | n.a. | 12.2 |
| Costa Rica | 5,047,561 | 3.20 | 1,579,421 | 10,376 | 80.53% | 0.66 | 4 | 15.93 |
| Morocco | 36,471,769 | 4.77 | 7,646,128 | 10,737 | 84.12% | 0.14 | 9 | 23 |
| South Africa | 216,565,318 | 6.28 | 34,500,045 | 107,652 | 68.20% | 0.31 | 26 | 16.15 |
| Pakistan | 58,558,270 | 3.86 | 15,155,778 | 91,904 | 17.00% | 0.61 | 40 | 0.23 |

Note: Overview of the representativeness of the scraped house prices of the formal market. Numbers on average household size are drawn from countries’ latest household survey. Data for total population are drawn from World Bank (2022).
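As a small worked check of Annex 1, the coverage column appears to equal the number of scraped observations divided by the number of households, expressed in percent. The sketch below reproduces it for three rows; the formula is an inference from the reported figures rather than something the note states, and the function name is hypothetical.

```python
def coverage_percent(scraped_observations, households):
    """Share of households represented by a scraped listing, in percent.

    Assumption: this is how the Annex 1 coverage column is derived.
    """
    return 100.0 * scraped_observations / households


# (scraped observations, number of households) as reported in Annex 1.
rows = {
    "Albania": (3_389, 779_688),
    "Costa Rica": (10_376, 1_579_421),
    "Morocco": (10_737, 7_646_128),
}

coverage = {country: round(coverage_percent(obs, hh), 2) for country, (obs, hh) in rows.items()}
print(coverage)
```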
Annex 2: Robustness: Results of Rigorous De-Duplication

Random Forest-based Prediction

| Country | Basic House Price Apartment (Current USD) | Ratio predicted and observed price (median) | Local ratio predicted and observed price (median) | MAPE | Local MAPE |
|---|---|---|---|---|---|
| Albania | 64,747.53 | 1.05 | 0.99 | 0.30 | 0.21 |
| Costa Rica | 111,008.21 | 1.00 | 1.02 | 0.32 | 0.26 |
| Morocco | 57,186.95 | 0.99 | 1.05 | 0.31 | 0.33 |
| Pakistan | 22,990.29 | 0.99 | 1.04 | 0.35 | 0.29 |
| South Africa | 63,728.53 | 1.01 | 1.05 | 0.30 | 0.30 |

OLS-based Prediction

| Country | Basic House Price Apartment (Current USD) | Ratio predicted and observed price (median) | Local ratio predicted and observed price (median) | MAPE | Local MAPE |
|---|---|---|---|---|---|
| Albania | 57,899.49 | 1.05 | 0.99 | 0.29 | 0.27 |
| Costa Rica | 117,282.53 | 1.03 | 1.12 | 0.38 | 0.31 |
| Morocco | 47,841.95 | 1.01 | 0.88 | 0.43 | 0.42 |
| Pakistan | 25,225.80 | 1.02 | 1.17 | 0.44 | 0.40 |
| South Africa | 85,469.41 | 1.12 | 1.28 | 0.54 | 0.55 |

Note: MAPE refers to mean absolute percentage error; local MAPE refers to the mean percentage error based on prediction accuracy for apartments sized between 50 and 60 square meters; local ratio refers to the ratio between observed and predicted values for apartments sized between 50 and 60 square meters.
Annex 3: Alternate Outlier Removal

Albania

| Descriptive statistic | Original data set | Truncated data set | Winsorized data set |
|---|---|---|---|
| Number of observations | 3,389 | 3,376 | 3,064 |
| Price (USD), median | 97,619.05 | 97,619.05 | 98,809.52 |
| Price (USD), mean | 137,498.50 | 137,483.8 | 126,646.6 |
| Square meter, median | 97.00 | 97.00 | 98.00 |
| Square meter, mean | 116.73 | 115.84 | 109.37 |
| Price per square meter, median | 1,046.57 | 1,047.62 | 1,066.68 |
| Price per square meter, mean | 1,227.50 | 1,219.54 | 1,152.60 |
| Price per square meter, IQR | 508.32 | 505.95 | 493.19 |
| Price per square meter, SD | 2,886.30 | 2,804.12 | 526.90 |
| Number of bedrooms, mean | 1.98 | 1.98 | 1.99 |

Costa Rica

| Descriptive statistic | Original data set | Truncated data set | Winsorized data set |
|---|---|---|---|
| Number of observations | 10,376 | 10,202 | 9,879 |
| Price (USD), median | 200,000.00 | 200,000.00 | 200,000.00 |
| Price (USD), mean | 293,751.60 | 295,211.70 | 267,586.4 |
| Square meter, median | 181.00 | 181.00 | 180.00 |
| Square meter, mean | 347.11 | 234.17 | 220.94 |
| Price per square meter, median | 1,156.72 | 1,162.28 | 1,156.25 |
| Price per square meter, mean | 2,744.14 | 1,393.17 | 1,314.42 |
| Price per square meter, IQR | 673.14 | 665.61 | 639.94 |
| Price per square meter, SD | 80,189.03 | 3,869.17 | 730.45 |
| Number of bedrooms, mean | . | . | . |

Morocco

| Descriptive statistic | Original data set | Truncated data set | Winsorized data set |
|---|---|---|---|
| Number of observations | 10,737 | 10,734 | 10,131 |
| Price (USD), median | 104,395.60 | 105,494.50 | 104,395.60 |
| Price (USD), mean | 271,361.30 | 234,233.80 | 168,450 |
| Square meter, median | 95.00 | 95.00 | 94.00 |
| Square meter, mean | 188.59 | 133.68 | 117.90 |
| Price per square meter, median | 1,119.25 | 1,117.21 | 1,130.10 |
| Price per square meter, mean | 3,071.55 | 1,677.39 | 2,250.78 |
| Price per square meter, IQR | 932.10 | 903.46 | 925.03 |
| Price per square meter, SD | 33,458.45 | 7,570.71 | 14,499.67 |
| Number of bedrooms, mean | 2.65 | 2.64 | 2.61 |

Pakistan

| Descriptive statistic | Original data set | Truncated data set | Winsorized data set |
|---|---|---|---|
| Number of observations | 107,652 | 107,524 | 103,847 |
| Price (USD), median | 83,934.34 | 83,934.34 | 83,934.34 |
| Price (USD), mean | 156,756.10 | 154,420.40 | 140,740.9 |
| Square meter, median | 151.76 | 151.76 | 151.76 |
| Square meter, mean | 955.86 | 220.93 | 210.80 |
| Price per square meter, median | 535.38 | 535.38 | 535.38 |
| Price per square meter, mean | 634.36 | 633.16 | 627.51 |
| Price per square meter, IQR | 344.14 | 344.14 | 339.65 |
| Price per square meter, SD | 652.22 | 480.86 | 395.14 |
| Number of bedrooms, mean | 3.86 | 3.86 | 3.87 |

South Africa

| Descriptive statistic | Original data set | Truncated data set | Winsorized data set |
|---|---|---|---|
| Number of observations | 91,904 | 88,374 | 70,608 |
| Price (USD), median | 95,588.23 | 93,286.45 | 105,498.7 |
| Price (USD), mean | 155,699.4 | 147,055 | 154,418 |
| Square meter, median | 350 | 313 | 350 |
| Square meter, mean | 2,081.1 | 555.01 | 803.73 |
| Price per square meter, median | 360.38 | 387.17 | 357.06 |
| Price per square meter, mean | 887.42 | 629.44 | 584.53 |
| Price per square meter, IQR | 614.62 | 624.15 | 599.32 |
| Price per square meter, SD | 9,302.60 | 853.02 | 700.54 |
| Number of bedrooms, mean | 3.04 | 2.99 | 3.01 |

Note: The original data set contains the original set of all listings.
The truncated data set retains listings that report the sale price, the size, and whether the property is an apartment or a house, and drops listings with sqm < 9 or sqm > 3,000 or with price < US$5,000 or price > US$5,000,000. The winsorized data set removes the first and 99th percentiles of price and size. SD = standard deviation; IQR = interquartile range.

Annex 4: Frequency Distribution of Smaller Apartments and Houses

[Figure: one panel each for Albania, Costa Rica, Morocco, South Africa, and Pakistan]

Annex 5: Robustness: Applying Different Splits

A) 75:25 Split

Random Forest-based Prediction

| Country | Basic House Price Apartment (Current USD) | Ratio predicted and observed price (median) | Local ratio predicted and observed price (median) | MAPE | Local MAPE |
|---|---|---|---|---|---|
| Albania | 64,151.96 | 1.00 | 1.05 | 0.29 | 0.24 |
| Costa Rica | 112,948.01 | 0.99 | 1.00 | 0.25 | 0.24 |
| Morocco | 50,540.15 | 1.00 | 1.04 | 0.24 | 0.27 |
| Pakistan | 21,953.53 | 0.99 | 1.13 | 0.31 | 0.27 |
| South Africa | 63,120.99 | 1.01 | 1.06 | 0.30 | 0.30 |

OLS-based Prediction

| Country | Basic House Price Apartment (Current USD) | Ratio predicted and observed price (median) | Local ratio predicted and observed price (median) | MAPE | Local MAPE |
|---|---|---|---|---|---|
| Albania | 64,128.05 | 1.00 | 1.03 | 0.29 | 0.34 |
| Costa Rica | 117,076.60 | 1.02 | 0.99 | 0.29 | 0.27 |
| Morocco | 47,497.29 | 1.02 | 0.99 | 0.41 | 0.43 |
| Pakistan | 24,590.69 | 1.01 | 1.26 | 0.44 | 0.51 |
| South Africa | 85,368.35 | 1.15 | 1.29 | 0.56 | 0.58 |

B) 90:10 Split

Random Forest-based Prediction

| Country | Basic House Price Apartment (Current USD) | Ratio predicted and observed price (median) | Local ratio predicted and observed price (median) | MAPE | Local MAPE |
|---|---|---|---|---|---|
| Albania | 64,225.80 | 1.01 | 1.14 | 0.26 | 0.25 |
| Costa Rica | 118,500.52 | 1.00 | 1.09 | 0.23 | 0.32 |
| Morocco | 51,580.91 | 1.00 | 1.04 | 0.24 | 0.26 |
| Pakistan | 22,309.16 | 1.01 | 1.12 | 0.30 | 0.27 |
| South Africa | 62,322.08 | 1.02 | 1.04 | 0.29 | 0.28 |

OLS-based Prediction

| Country | Basic House Price Apartment (Current USD) | Ratio predicted and observed price (median) | Local ratio predicted and observed price (median) | MAPE | Local MAPE |
|---|---|---|---|---|---|
| Albania | 56,673.70 | 1.00 | 1.10 | 0.28 | 0.26 |
| Costa Rica | 117,152.50 | 1.01 | 1.11 | 0.28 | 0.33 |
| Morocco | 47,689.99 | 1.03 | 1.21 | 0.45 | 0.52 |
| Pakistan | 24,613.56 | 1.03 | 1.27 | 0.44 | 0.53 |
| South Africa | 85,833.20 | 1.14 | 1.26 | 0.56 | 0.57 |

Note: MAPE refers to mean absolute percentage error; local MAPE refers to the mean percentage error based on prediction accuracy for apartments sized between 50 and 60 square meters; local ratio refers to the ratio between observed and predicted values for apartments sized between 50 and 60 square meters.

Annex 6: Robustness: Training of the RF Model in a Global Model

Random Forest-based Prediction

| Country | Basic House Price Apartment (Current USD), Global Model | Basic House Price Apartment (PPP$), Global Model | Ratio predicted and observed price (median) | Local ratio predicted and observed price (median) | MAPE | Local MAPE |
|---|---|---|---|---|---|---|
| Albania | 55,923 | 148,600 | 1.07 | 1.10 | 0.28 | 0.24 |
| Costa Rica | 112,739 | 193,916 | 1.03 | 1.01 | 0.25 | 0.20 |
| Morocco | 56,405 | 137,148 | 1.07 | 1.10 | 0.29 | 0.27 |
| Pakistan | 24,330 | 101,633 | 1.09 | 1.29 | 0.36 | 0.45 |
| South Africa | 60,665 | 130,856 | 1.06 | 1.15 | 0.30 | 0.30 |

Note: MAPE refers to mean absolute percentage error; local MAPE refers to the mean percentage error based on prediction accuracy for apartments sized between 50 and 60 square meters; local ratio refers to the ratio between observed and predicted values for apartments sized between 50 and 60 square meters.
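The two outlier rules described in the note to Annex 3 can be sketched as follows. This is an illustrative reading, not the authors' procedure: the function names and toy listings are hypothetical, the percentile is computed with a nearest-rank rule (the paper does not name a method), and the percentile cut is applied here to the truncated data, which is also an assumption. Following the note's wording, the "winsorized" set is built by removing observations outside the 1st and 99th percentiles rather than by capping them.

```python
def truncate(listings):
    """Keep listings inside the size and price bounds given in the Annex 3 note."""
    return [
        row for row in listings
        if 9 <= row["sqm"] <= 3_000 and 5_000 <= row["price"] <= 5_000_000
    ]


def nearest_rank_percentile(values, q):
    """q-th percentile for integer q in 1..100, nearest-rank rule (an assumption)."""
    ordered = sorted(values)
    rank = max(1, (q * len(ordered) + 99) // 100)  # integer ceiling of q * n / 100
    return ordered[rank - 1]


def trim_tails(listings, lower_q=1, upper_q=99):
    """Drop listings whose price or size lies outside the given percentiles."""
    if not listings:
        return []
    bounds = {}
    for key in ("price", "sqm"):
        values = [row[key] for row in listings]
        bounds[key] = (
            nearest_rank_percentile(values, lower_q),
            nearest_rank_percentile(values, upper_q),
        )
    return [
        row for row in listings
        if all(bounds[k][0] <= row[k] <= bounds[k][1] for k in bounds)
    ]


# Hypothetical listings: 100 plausible records plus three implausible ones.
listings = [{"price": 10_000 * i, "sqm": 10 + i} for i in range(1, 101)]
listings += [
    {"price": 1_000, "sqm": 50},       # price below US$5,000
    {"price": 100_000, "sqm": 5},      # smaller than 9 sqm
    {"price": 9_000_000, "sqm": 100},  # price above US$5,000,000
]

truncated = truncate(listings)
trimmed = trim_tails(truncated)
print(len(listings), len(truncated), len(trimmed))
```

Fixed bounds remove only clearly implausible records, while the percentile cut scales with each country's distribution, which may explain why the two rules drop very different numbers of observations in Annex 3.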