Policy Research Working Paper 10932

Spatial Inequality and Informality in Kenya’s Firm Network

Verena Wiedemann
Benard K. Kirui
Vatsal Khandelwal
Peter W. Chacha

International Finance Corporation
September 2024

Abstract

The spatial configuration of domestic supply chains plays a crucial role in the transmission of shocks. This paper investigates the representativeness of formal sector firm-to-firm trade data in capturing domestic trade patterns in Kenya, a context with high informality. It documents stylized facts showing that formal sector trade exhibits a distinct spatial concentration relative to overall economic activity. Linking transaction-level data with data on informal economic activity, the paper estimates a structural model and predicts a revised network incorporating informal firms. Findings show that formal sector trade flows underaccount for trade within regions and across regions with stronger social ties. The higher the incidence of informality, the more one underestimates vulnerability to domestic shocks and overestimates exposure to import shocks.

This paper is a product of the International Finance Corporation. It is part of a larger effort by the World Bank Group to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at vwiedemann@ifc.org.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors.
They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Produced by the Research Support Team

Spatial Inequality and Informality in Firm Networks∗

Verena Wiedemann† Benard Kipyegon Kirui‡ Vatsal Khandelwal§ Peter Wankuru Chacha¶

Originally published in the Policy Research Working Paper Series in September 2024. This version was updated in August 2025. To obtain the originally published version, please email prwp@worldbank.org.

Keywords: firm networks, informality, spatial inequality, economic development.
JEL classification: O11, O17, R12, D22, D85, E26.

∗ We thank the Kenya Revenue Authority (KRA) for the outstanding collaboration. Romeo Ekirapa, Simon Mwangi, and Benard Sang provided excellent technical support and advice. We thank Elizabeth Gatwiri and Daniela Villacreces Villacis for excellent research assistance. We thank Andrea Bacilieri, Raphael Bradenbrink, Banu Demir Pakel, Kevin Donovan, Douglas Gollin, Justice Tei Mensah, Luke Heath Milsom, Sanghamitra Warrier Mukherjee, Solomon Owusu, Piyush Panigrahi, Nina Pavcnik, Simon Quinn, Alexander Teytelboym, Christopher Woodruff, Hannah Zillessen and participants of seminars and conferences at Oxford, CSAE, STEG, the World Bank’s DaTax group, IFC, KIPPRA, Montreal, the Bank of Italy/OECD, and Jeune Street for comments and feedback. We gratefully acknowledge financial support from the Private Enterprise Development in Low Income Countries (PEDL) and the Structural Transformation and Economic Growth (STEG) programmes, both of which are joint initiatives by the Centre for Economic Policy Research (CEPR) and the UK Foreign, Commonwealth & Development Office (FCDO). Verena Wiedemann further acknowledges funding from the Oxford Economic Papers Research Fund (OEP) and the German National Academic Scholarship Foundation.
This study has been approved by the Department of Economics Research Ethics Committee at Oxford (protocol no. ECONCIA20-21-23), and the Kenyan National Commission for Science, Technology and Innovation (protocol no. NACOSTI/P/20/5923). The views in this paper are those of the authors, and do not necessarily represent those of the KRA or any other institution the authors are affiliated with.
† Economic Research Unit, International Finance Corporation (World Bank Group)
‡ Privatization Commission Kenya
§ Department of Economics, University of Exeter
¶ International Monetary Fund

1 Introduction

Limited opportunities for export-led growth and concerns over the unequal allocation of gains from trade have led policymakers and researchers to shift attention towards domestic supply chains and market integration to enhance economic development (Topalova, 2010; Atkin and Donaldson, 2015; Bustos et al., 2020; Grant and Startz, 2022; Goldberg and Reed, 2023). Progress in understanding the structure of domestic supply chains has been facilitated by the increasing availability of granular transaction-level firm network data, which are often sourced from tax records (see e.g. Panigrahi, 2022; Adão et al., 2022; Alfaro-Ureña et al., 2022; Boken et al., 2023).1 These developments reflect a broader trend in the literature on development and structural transformation, where non-traditional novel micro-data such as credit registries, smartphone data, and matched employer-employee data enable new insights into classic questions (Bustos et al., 2020; Blanchard et al., 2021; Dix-Carneiro et al., 2024; Kreindler and Miyauchi, 2023). However, a potential drawback is that the data-generating process underlying such administrative records is often skewed toward particular segments of the economy, e.g. taxpayers in the case of firm-to-firm transaction data.
This can leave us with a potentially incomplete view of economic activity due to the size and significance of the informal sector in many low- and middle-income economies.

In this paper, we ask how observing only a selected segment of the economy, in our case, formal firms, might bias the inferences drawn from data on firm-to-firm trade networks constructed using tax records. Accounting for this bias is important because the structure of these networks can shape the aggregate and distributional impacts of infrastructure investments, industrial policies, and financial shocks (see e.g. Acemoglu et al., 2012; Liu, 2019; Demir et al., 2022; Balboni et al., 2023; Castro-Vincenzi et al., 2024; Demir et al., 2024; Huremovic et al., 2024; Donaldson, 2025) and inform tax policy and the design of social protection programs.2 Formal sector data may not be representative of the overall spatial distribution of trade, especially if formal firms are concentrated in specific sectors and regions and exhibit distinct linking patterns. Given the scarcity of granular data on the informal sector, we lack empirical evidence on the nature and degree of the bias caused by neglecting informality.

We address this question by combining transaction-level administrative tax records of over 76,000 formal firms and 5.8 million firm-to-firm relationships in Kenya, with micro and aggregate data on informal sector activity obtained from labor force surveys and national accounts.3 We first establish a series of stylized facts on both the formal and the informal sector which suggest that formal sector data is not representative of overall economic activity. We then estimate a structural model to construct a revised firm network that accounts for informality. Overall, we find that formal sector data overstates the importance of urban hubs and underestimates intra-regional trade. As a result, firm-to-firm data on the formal sector overestimates the degree of spatial inequality in trade flows, leading to an underestimation of the impact of domestic output shocks and an overestimation of the impact of import shocks in sectors and regions with high informality. These results are robust to varying assumptions about how informal firms interact with the rest of the economy. Our findings highlight the value of complementing administrative firm-to-firm data with auxiliary data on informality to improve estimates of aggregate effects of economic shocks.

1 Transaction-level survey data (see e.g. Startz, 2021) or administrative industry-specific data (see e.g. Hansman et al., 2020), which often cover a wider range of firm and buyer-supplier characteristics, are a popular, albeit costly, complement to data relying on tax records.
2 An advantage of transaction-level firm-to-firm trade data over traditional sources like input-output tables is the possibility to track spatial heterogeneity in economic activity, rather than being limited to national aggregates.
3 The labor force survey module was integrated in the 2019 population census. The public micro data covers 10% of the overall population and is hence representative at a very granular level.

The question of how well formal sector trade patterns represent overall economic activity is relevant for many countries with high informality, several of which have accessible VAT data (see Figure 1). The Kenyan context is particularly well-suited to answering this question due to the relevance of the informal sector and the availability of unusually granular data on the sectoral and geographic composition of informal economic activity. VAT-paying firms account for only 36% of Kenya’s GDP.4 At the same time, high-quality subnational data allow us to observe both formal and informal activity across sectors and regions.

Figure 1: Country-level informality rates and income levels
Panels: Output informality (% of GDP); Self-employment (% of total employment).
Each marker represents a country. Data on output informality and self-employment are taken from Elgin et al. (2021). We highlight countries for which, to our knowledge, at least one working paper or peer-reviewed article using VAT data is available.

4 As we will show later on, the public sector contributes another third to GDP. Any residual output can be attributed to the informal sector.

We begin by assessing whether patterns of domestic trade observed for the formal sector arise due to the systematic selection of firms into the administrative data or reflect the underlying structure of the economy. First, we show that formal firms exhibit distinct spatial patterns compared to overall economic activity. Trade among formal firms is more spatially concentrated than other indicators, including population, GDP, employment in micro and small enterprises, the distribution of registered firms, and patterns of internal migration. This spatial concentration is driven by inequality along the extensive margins of the firm network, i.e. the location of firms and trading relationships, rather than transaction volumes. Notably, formal firms maintain twice as many links to buyers and suppliers in urban hubs as they do to formal firms within their own counties, indicating that within-county trade is less common than inter-county trade with firms in Nairobi.

Following this, we show that informality is not evenly distributed across space; its incidence varies systematically across sectors, regions, and firm position in the supply chain. Informal firms are more likely to be located downstream of large formal firms, and informality negatively correlates with regional economic size and income. As a result, we expect that accounting for the informal sector can systematically alter the structure of the observed firm network rather than introducing random measurement error.
As we do not directly observe trade flows between informal firms, we rely on a structural model to recover a network that accounts for informal activity. We introduce and estimate a network formation model with heterogeneous firm types following Bramoullé et al. (2012) to predict a revised network. In our adaptation of the model, we classify firms based on their sector, location, and size, due to the substantial heterogeneity along these three dimensions documented in the empirical section. The key advantage of the model is that it allows for a flexible network formation process that accounts for heterogeneous linking preferences along the above dimensions. Moreover, it allows firms to form links both independently (undirected search) and through the existing connections of their suppliers (directed search). The model then provides unique steady-state predictions for the number of links between firms of different sector-location-size types. We first estimate the model to predict the Kenyan firm network as observed. We find that new firms choose 45% of their suppliers through undirected search, conditional on their bias, and the remaining 55% of suppliers are found via existing suppliers.5 The model-predicted network not only fits the empirical distribution of outlinks (i.e. the number of buyers) well, but also performs well with respect to untargeted moments, such as the share of links accounted for by Nairobi- and Mombasa-based firms.

Using the estimated model, we then predict a revised network that accounts for informal firms by combining the model with granular real-world data on the sectoral and regional composition of the informal sector. Updating the entry probabilities of firms by sector, location, and size requires data on the spatial and sectoral dispersion of economic activity, for which we draw on microdata from the labour force module of the 2019 population census. We initially assume that informal firms, conditional on sector and geography, exhibit linking patterns similar to those of the smallest quartile of formal firms in the administrative data. This assumption addresses concerns that informal firms might have different linking patterns compared to larger formal firms operating in the same sector and location (e.g. due to internal economies of scale (Grant and Startz, 2022)).6 Nevertheless, informal firms might encounter additional obstacles specific to informality.7 To account for this, we implement additional sensitivity checks that consider alternative scenarios with lower linking probabilities between the formal and informal sector.

5 In comparison, Chaney (2014) finds that only 40% of all relationships of French exporters with international trade partners are formed via directed search. Our estimate of 55% of links being formed as a result of directed search could suggest that linking frictions are potentially even more binding for firms in Kenya’s domestic firm network.

We use the revised network to answer the question of interest: How do spatial patterns of trade change when informal firms are accounted for? First, we find that sectors and regions with the highest levels of informality have more outlinks in the revised network relative to the baseline network. The spatial inequality in outlinks declines by 7% and the prominence of urban hubs like Nairobi and Mombasa decreases by 5 percentage points. We show that while this decrease in inequality of outlinks is driven by an increase in both inter-county and intra-county trade, intra-county trade rises by a larger margin. Moreover, we find that once informal firms are accounted for, the number of trade relationships between counties is more sensitive to the strength of social ties between them, measured using both online friendship networks and migration inflows.
Next, we simulate the pass-through of domestic and import shocks using both the baseline network estimated from the structural model and the revised network that incorporates informal firms. We find that ignoring informality leads us to underestimate the average impact of domestic shocks and overestimate the average impact of import shocks. In particular, when using the revised network, we find that domestic output shocks have a more pronounced negative effect on sector-regions with higher levels of informality, compared to results based on formal sector data alone. Our results suggest that a 1 percentage point decrease in the formal sector share corresponds to an underestimation of the reduction in output by 5 percentage points. By contrast, for import shocks, the bias goes in the opposite direction. The economy appears less exposed to import shocks once the informal sector is taken into account. This discrepancy arises because import shocks primarily affect larger formal firms, which carry less weight in the overall firm network once informality is incorporated.

Finally, we consider the sensitivity of our results with respect to alternative assumptions about the linking patterns of informal firms. We conduct bounding exercises where we further restrict the degree of integration of informal firms with the formal sector relative to small formal firms. Drawing on survey evidence and the existing literature (Böhme and Thiele, 2014; Gadenne et al., 2022), we assume informal firms sell no output to the formal sector and source a smaller proportion of their inputs from formal firms compared to small formal firms.

6 We provide evidence showing that the linking patterns of small formal firms are similar to what we would expect from informal firms: they link more locally and buy more from intermediaries relative to their larger peers.
7 This can include wedges introduced by the VAT system itself (De Paula and Scheinkman, 2010; Gadenne et al., 2022).
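The logic of comparing shock pass-through across two networks can be pictured with a stylized linear calculation. The sketch below is not the paper's simulation, whose details are not given in this section. It assumes a fixed matrix of input shares between sector-regions and passes a direct output shock downstream round by round. The name `propagate_shock`, the layout `input_shares[buyer][supplier]`, and the two-node example are all hypothetical.

```python
def propagate_shock(input_shares, direct_shock, rounds=50):
    """Total effect = direct shock plus shocks inherited from suppliers.

    input_shares[b][s] is the (assumed) share of buyer b's inputs that
    comes from supplier s; each row should sum to less than one so that
    the rounds die out.
    """
    n = len(direct_shock)
    total = list(direct_shock)
    current = list(direct_shock)
    for _ in range(rounds):
        # Each buyer inherits its suppliers' latest shock, weighted by input shares.
        current = [
            sum(input_shares[b][s] * current[s] for s in range(n))
            for b in range(n)
        ]
        total = [t + c for t, c in zip(total, current)]
    return total


# Hypothetical example: node 1 sources half of its inputs from node 0.
shares = [[0.0, 0.0],
          [0.5, 0.0]]
effects = propagate_shock(shares, [-1.0, 0.0])
```

Running the same function on a baseline and a revised share matrix, with the same direct shock, shows how added links change the total effect on each node. In this toy setting, a node that gains domestic suppliers inherits more of a domestic shock, which mirrors the direction of the bias described above without reproducing its magnitude.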
We find that this further reduces spatial inequality in outlinks and the prominence of urban hubs relative to the formal network. Moreover, we continue to underestimate the impact of domestic shocks while overestimating the effects of trade shocks.

Our paper contributes to the literature on macroeconomic development, informality, firm networks, and spatial inequality.

First, we contribute to a growing body of research at the intersection of trade and macroeconomic development that integrates granular administrative data, such as employer-employee records and data from credit registries, with broader data sources like population censuses to achieve a more accurate assessment of aggregate economic outcomes. To date, this literature has primarily focused on employment outcomes, sector shares (see e.g. Albert et al., 2021), and consumption (see e.g. Fan et al., 2023), where informal activity is somewhat more observable. However, informal activity along supply chains remains particularly elusive (Böhme and Thiele, 2014; Atkin and Khandelwal, 2020).8 Our results highlight the implications of the non-random selection of firms into administrative records. This is particularly important, given the growing reliance on such data in the literature (Donaldson, 2025).

Our approach of employing a structural model to bridge gaps in our understanding of informal firm dynamics also aligns with the recent literature in this field (see e.g. Ulyssea, 2018; Dix-Carneiro et al., 2024). Unlike related studies that focus on firm and worker-level dynamics, we do not model the endogenous response of firms and workers to simulated shocks. Crucially, however, our research design allows us to examine the role of informality for Kenya’s region-level input-output matrix.
This is particularly relevant for research that seeks to complement predictions about aggregate national welfare with welfare estimates at the regional level to study geographic heterogeneity in the impact of international trade (Topalova, 2010; Arkolakis et al., 2023), infrastructure investments (Demir et al., 2024) or climate and weather shocks (Albert et al., 2021; Castro-Vincenzi et al., 2024).

Second, we contribute to the literature on spatial production networks (Bernard et al., 2019; Panigrahi, 2022; Miyauchi, 2024), shock propagation in firm networks (see e.g. Baqaee and Farhi, 2019; Carvalho et al., 2021; Arkolakis et al., 2023; Chacha et al., 2024), and urban primacy (Jefferson, 1939, as published in Jefferson (1989); Memon, 1976; Ades and Glaeser, 1995; Soo, 2005). We analyze the spatial distribution of formal firms in an economy with a large informal sector and demonstrate that ignoring informality can lead to overestimating spatial inequality in firm-to-firm trade and the extent of urban primacy. This oversight may cause researchers to underestimate the economic connectedness and vulnerability of smaller regions. Our finding that formal sector activity is disproportionately concentrated in urban hubs is also consistent with spatial patterns in other contexts that exhibit a similar formal-core, informal-periphery structure (Zárate, 2022). In complementary work, Bacilieri et al. (2023) examine how varying reporting thresholds for firm-to-firm transactions affect the comparability of aggregate network statistics.

8 A related literature in public finance studies why informality arises along supply chains, and how tax policy can alter its incidence (De Paula and Scheinkman, 2010; Zhou, 2022; Gadenne et al., 2022; Almunia et al., 2023). Relative to this literature, we focus on reconstructing a more complete network that includes informal firms rather than studying the marginal firm’s decision to formalize.
While they demonstrate that missing data can bias network statistics, our focus differs in two key ways. First, transaction reporting thresholds are not a concern in our setting; we instead address other sources of network incompleteness in tax data. Second, while they analyze aggregate statistics, we examine distributional implications at the regional level, a margin where we find informality plays a substantial role.

Finally, we contribute to the growing literature on estimating network statistics and reconstructing networks with missing data (e.g., Chandrasekhar, 2016; Mungo et al., 2023).9 We contribute to this literature by proposing the use of a structural approach. The structural route seeks to tackle two challenges. First, nodes are missing in a non-random manner. Second, we do not observe network characteristics for missing nodes, but linking preferences of missing firms can systematically differ from those of observed firms. To address these challenges, we combine multiple data sources and estimate a network formation model that recovers both the prevalence of missing firms and their linking behavior, accounting for sectoral and geographic preferences. In comparison, Mungo et al. (2023), for example, use a link prediction algorithm to predict the existence of links among missing nodes using data on the characteristics and linking behavior of non-missing nodes. This approach is challenging to apply in the context of informality, as informal firms are missing in a non-random manner and information on their linkages is not available at a sufficient granularity to recover a network with sectoral and spatial heterogeneity.10

The paper is organized as follows: Section 2 describes our data. Section 3 examines how representative the trade patterns captured in the administrative data are of overall economic activity and discusses the role of the informal sector. We present and estimate a network formation model in Section 4.
Section 5 describes the patterns we observe in our revised network that now incorporates informal firms, while Section 6 analyzes the sensitivity of these results to alternative assumptions.

9 See Lü and Zhou (2011) for a survey of algorithms used for link prediction, a technique commonly applied in network reconstruction.
10 Rare instances where researchers have used survey data to collect information on firm-to-firm links of informal firms focus on specific sub-sectors. Moreover, explicitly prompting for the tax status of trade partners adds an additional challenge due to the sensitive nature of this information.

2 Data Description

2.1 Administrative Data

Our analysis of the formal sector draws on micro data from value-added tax (VAT) returns, pay-as-you-earn (PAYE) tax returns, and tax registration forms collected by the Kenya Revenue Authority. The tax registration forms provide self-reported information on each firm’s 4-digit sector classification and headquarter location. The VAT returns include details on firm-to-firm transactions between VAT-registered firms. Sales and purchases with non-registered parties (e.g. tax exempt parties, non-VAT-registered businesses, and final consumers) are recorded as an aggregate monthly figure. VAT applies to firms with an annual turnover of KShs five million and above ($38,400 as of May 2024). Once a firm is VAT-registered and has crossed the threshold of KShs five million, it is required to continue filing VAT returns in years with lower turnover.

We solely focus on entities that identify as private companies or partnerships in their tax registration form.
We also restrict our analysis to firms with annual purchases greater than zero and annual sales of KShs five million or more in at least one year between 2015 and 2022.11

Figure 2: Location of formal firms and informal sector employment shares
Left panel: Number of formal firms per km2 (based on tax records). Right panel: Informal sector share of overall private sector (self-)employment (based on population census).
The left map shows the density of firm headquarter locations at the subcounty level, i.e. the number of firms per km2. The right map shows the share of informally employed people as a share of the local labour force, at the subcounty level. Subcounties represent the second administrative layer. The borders of Kenya’s 47 counties are outlined in grey.

11 This ensures that firms that registered for VAT to bid for tenders but were never operational are not included in the analysis.

Figure A1 plots the sector composition and the respective sales and purchase channels of firms covered in these administrative records. Manufacturing and wholesale and retail firms together account for almost half of the sales in the tax records. The left map in Figure 2 shows the geographical dispersion of formal firm headquarters; many firms concentrate in Nairobi and Mombasa.

Figure A2 plots the share of total observed firm-to-firm links attributed to different groups of firms, distinguishing by size, sector, and location. Panel A shows that aside from retail, wholesale, and manufacturing, business services account for a substantive proportion of supplier linkages. Mining firms account for the fewest supplier and buyer links. Panel B shows that firms in Nairobi and Mombasa account for more than three quarters of all buyer and supplier links. Panel C shows that firms in the top sales quartile account for a disproportionately large share of total links, especially as suppliers.
Aside from the statistics plotted in Figure A2, 61% of total links are formed among firms located in the same county, 17% among firms within the same sector, and 45% among firms in the same sales quartile.

2.2 Data on Informal Activity

Informality in firm-to-firm transaction data can either arise because firms themselves are not registered (extensive margin) or because transactions of registered firms are not recorded (intensive margin). Not being registered as a firm or not recording a transaction in turn can either be the result of non-compliance or because of exemptions (e.g. for small firms).12 To obtain an updated distribution of overall (formal and informal) economic activity that we can compare to the administrative data, we draw on a series of data sources listed in Table 1. Importantly, we not only measure informality as the gap between overall economic activity and what is captured in the administrative data, but additionally rely on measures generated independently of the administrative records. This helps us to rule out the possibility that our measures of informality capture idiosyncratic patterns that are specific to the VAT system.

Throughout this paper, we draw on three types of measures for economic activity: employment figures, the number of firms, and value added (i.e. the difference between sales and purchases). Our preferred measure of informality uses formal sector employment as a share of total employment, drawing on a comprehensive labor force module in the 2019 population census (KNBS, 2019). Alternative measures of informality based on the number of firms rely on estimates of the universe of businesses in KNBS (2016),13 while a value-added based measure utilizes estimates of the regional economic size captured by the Gross County Product (KNBS, 2022).14

12 See Appendix A.2 for a detailed discussion of these margins and why administrative data may not fully capture economic activity.
13 KNBS (2016) obtains information on the number of licensed businesses from county governments and estimates the number of unlicensed businesses based on household survey data.

Table 1: Overview of data sources

Source                                                   Year   Aggregation         Key indicators
Population & housing census (census)                     2019   Sector and county   Formal & informal employment
Gross County Product (GCP)                               2019   Sector and county   Gross County Product
Census of establishments (CoE)                           2017   Sector or county    Number of formal sector establishments
Micro, small & medium sized enterprises survey (MSMEs)   2016   Firm-level          Main input source and buyer
Census of industrial production                          2010   Sector and county   Sales of multi-establishment firms

All data are collected and published by the Kenya National Bureau of Statistics. Sources: 2019 Kenya Population & Housing Census (KNBS, 2019); Gross County Product (KNBS, 2022); Census of Establishments (KNBS, 2017); Small & Medium-Sized Enterprises Survey (KNBS, 2016); Census of Industrial Production 2010 (KNBS, 2010).

The employment-based measure, which later serves as a key input for predicting the revised network with informal firms, offers two distinct advantages. First, it enables joint disaggregation of informal activity by sector and region. Second, it allows us to distinguish between private and public sector employment, a distinction that is unavailable in alternative measures and that lets us rule out that this measure of informality captures a proportion of public sector activity. The right map in Figure 2 shows the geographical dispersion of informal activity as per the employment-based measure derived from the labor force module of the 2019 census. The measure correlates strongly (ρ = 0.83, Table A3) with a measure of regional formal sector shares that relies on the administrative data to capture the size of the formal sector (see Figure A9).
Finally, we exclude agricultural and non-market service sectors from our informality measures where possible, as their tax records only cover a small and very specific sub-population of firms and employees. In the case of agricultural firms, the administrative data only capture large-scale commercial agriculture, which is often primarily export-oriented (see Figure A1 and Chacha et al. (2024)). Non-market services are dominated by non-profit organizations and the government with only a few for-profit VAT firms.

Data on linking patterns of informal firms: In addition to the above, we use data from a survey of small and medium-sized enterprises (KNBS, 2016) to derive some insights into the sales and purchase patterns of the informal sector. The survey data only record the main type of buyer and supplier of a firm and hence cannot be directly used to reconstruct a network with informal firms. However, we will use these data to inform the assumptions of our model.

2.3 Size of the Informal Sector

To assess how much of economic activity is generated by VAT-reporting firms, we first compare their value added with Kenya’s national accounts. We find that VAT-reporting firms account for 36% of Kenya’s GDP on average between 2015 and 2019 (Appendix Table A4). Excluding VAT-exempt sectors such as agriculture, non-market services and finance, the share increases to 67% of residual economic activity, implying an informal sector share of 33%. The share of informal firms and people employed in the informal sector is even higher. Of the 7.4 million businesses identified in the 2016 MSME report, only one fifth were licensed, and only 2.5% appeared in VAT data (KNBS, 2016).

14 To estimate the Gross County Product, KNBS (2022) relies on a series of data sets including the 2016 MSME survey and the 2019 population census. Most data sets covering informal activity are only collected intermittently.
Similarly, VAT-registered private sector firms employed 5% of Kenya’s workforce in 2019. With data on the formal sector only capturing a proportion of overall economic activity, we now turn to the question of whether the inability to observe informal firms in the VAT data might result in distinct trade patterns that deviate from overall firm-to-firm trade.

3 Representativeness of the Formal Firm Network

In this section, we establish two key stylized facts about the extent to which trade among formal firms might be representative of overall economic activity. First, we show that trade among formal firms is highly concentrated around urban centers and places greater emphasis on inter-county rather than within-county trade. This contrasts with other measures of economic activity that include less formalised activities and which we find to be more geographically dispersed. Second, we document that the incidence of informality varies systematically across geography, sectors, and positions in the supply chain, suggesting that informality is indeed not evenly distributed across the economy. As a result, we expect overall trade patterns to diverge from the ones we document for the formal sector.

Fact 1: Formal sector trade is more spatially concentrated than overall economic activity and distinctly centered around urban hubs.

Formal sector data reveal a high degree of spatial concentration compared to overall economic activity. Our first observation is that formal sector trade flows are strongly concentrated around Kenya’s largest metropolitan areas, Nairobi and Mombasa. In 2019, as much as 68% of the total sales within the network of formal firms was generated by Nairobi-headquartered firms (see Table 2).
However, in the same year, as little as 9% of Kenya’s population lived in Nairobi County and the city contributed only 33% of Kenya’s GDP outside the agricultural sector (also reported in Table 2).15 These comparisons suggest that the role of urban centers in the Kenyan firm network is disproportionate relative to their population and their contribution to aggregate GDP.

15 While Nairobi’s metropolitan area extends into neighboring counties, for simplicity, references to Nairobi in this paper exclude these areas.

As a proxy for the spatial dispersion of economic activity beyond simple shares attributed to major cities, we compute Pareto exponents (α) (Gabaix, 2009; Soo, 2005). The third column in Table 2 shows that while county-level GDP and population show relatively even distributions (α close to one), measures of economic activity derived from the VAT data exhibit much higher inequality. The Pareto exponents for employment, value added, and trade flows are 57-76% lower than for overall economic activity, indicating higher spatial concentration.

Is this spatial concentration driven by firm locations and number of firm-to-firm relationships (i.e. the extensive margin of trade) or trade volumes (i.e. the intensive margin of trade)? In Appendix A.5, we show that the spatial concentration of both trade out- and inflows is primarily driven by firm locations and the number of firm-to-firm relationships. The number of transactions and average trade volumes per transaction only play a minor role in explaining spatial variation in trade patterns. This insight later informs our choice to focus on a model that allows us to predict the structure of the production network along the extensive margins.

To assess whether the observed spatial concentration and urban bias are idiosyncratic features of VAT data or reflect patterns innate to formal activity, we also compute the spatial concentration for a range of other indicators for economic activity.
These range from the least formal to increasingly formalized economic activities and are included in Table 2. We observe a clear pattern: indicators that capture very little formal activity, such as the number of micro-, small-, and medium-sized enterprises (MSMEs) and employment in MSMEs, exhibit relatively little spatial concentration (α = 0.86 and 0.78, respectively). As we focus on more and more formalized activity, e.g. licensed businesses from micro to medium sized, all the way to firms in the census of industrial production, the degree of concentration increases. This suggests that the spatial concentration observed in the VAT data represents a feature of the formal sector more broadly.

Unlike formal sector trade, other observable types of domestic flows, such as migration flows, do not exhibit a similar spatial concentration (see Table 2).16 For instance, while Nairobi is the destination for 29% of all long-term inter-county migration, it accounts for 69% of total outlinks and 65% of total inlinks in the formal sector firm network.

A potential concern is that the observed spatial concentration in urban centres is mainly driven by firm headquarters being more likely to be located in Nairobi or Mombasa. While this concern is mitigated by the fact that we observe a similar spatial concentration in other measures of formal economic activity, we also use micro-data from the 2010 Census of Industrial Production to address the question about the role of multi-establishment firms more explicitly. In Appendix A.6, we compare the spatial concentration of sales and firm locations with and without multi-establishments. We find that the excess spatial concentration introduced by multi-establishments does not explain the aggregate concentration patterns of formal private sector activity.

16 We compute long-term migration flows using data on birthplace and current residence from the 2019 population census.

Table 2: Geographic concentration of economic activity by degree of formalisation

| Measure | Nairobi (in %) | Mombasa (in %) | Rank regression α | Rank regression SE |
| --- | --- | --- | --- | --- |
| Population overall | 9 | 3 | 1.29 | 0.18 |
| Population of cities & towns | 31 | 9 | 0.85 | 0.01 |
| Migration inflows | 29 | 6 | 0.59 | 0.04 |
| Migration outflows | 5 | 2 | 0.77 | 0.07 |
| GDP | 25 | 5 | 1.00 | 0.07 |
| GDP w/o agriculture | 33 | 7 | 0.97 | 0.05 |
| GDP w/o non-market services | 25 | 5 | 0.91 | 0.08 |
| No. MSMEs | 14 | 3 | 0.86 | 0.17 |
| Employment in MSMEs | 19 | 3 | 0.78 | 0.13 |
| No. licensed MSMEs | 18 | 3 | 0.73 | 0.09 |
| Employment in licensed MSMEs | 28 | 3 | 0.67 | 0.07 |
| No. SMEs | 37 | 3 | 0.58 | 0.06 |
| Employment in SMEs | 36 | 3 | 0.60 | 0.05 |
| No. census establishments | 36 | 4 | 1.10 | 0.12 |
| No. firms census of industrial production | 48 | 6 | 0.54 | 0.02 |
| Sales census of industrial production | 61 | 7 | 0.32 | 0.03 |
| No. VAT firms | 64 | 9 | 0.63 | 0.03 |
| Employment in VAT firms | 62 | 9 | 0.36 | 0.03 |
| Value added of VAT firms | 72 | 10 | 0.38 | 0.03 |
| VAT network sales | 68 | 13 | 0.35 | 0.02 |
| VAT network outlinks | 69 | 11 | 0.35 | 0.02 |
| VAT network purchases | 60 | 9 | 0.43 | 0.02 |
| VAT network inlinks | 65 | 10 | 0.48 | 0.02 |

The columns for Nairobi and Mombasa report their share of the respective national aggregate figures (e.g., Nairobi’s contribution to Kenya’s GDP). The Pareto exponent α is the estimated coefficient from a county-level regression of each county’s rank (log) on the respective measure x (log): $\log \text{rank} = \log A - \alpha \log x$. All measures reported in the final section of the table (the rows from "No. VAT firms" onwards) are derived from the VAT data. All other measures are based on data sources summarised in Table 1.

Formal firms predominantly source from and sell to firms in urban hubs. Consistent with the previous section, a plot of county-by-county trade flows in Figure 3 highlights the primacy of Nairobi and Mombasa.
The width of each segment on the left reflects the county’s total sales within the network, while segments on the right are proportional to total purchases. The color of the trade flows aligns with the county of origin. Importantly, visualizing trade flows also reveals that Nairobi not only has more outlinks to other counties but is also the most important buyer of goods and services from firms in other counties.

Figure 3: County-level trade flows between formal firms
The figure shows inter-firm trade flows aggregated at the county level. The size of each node (segment) is proportional to the county’s share of purchases and sales relative to the aggregate volume of firm-to-firm trade between formal firms in Kenya. The colour of the edges (links between segments) indicates the direction of the trade flow. They take the colour of the supplying county (e.g., goods and services provided by firms in Nakuru to firms in Nairobi take the colour of the segment for Nakuru). The width of each edge (links between segments) is proportional to the share of the trade flow with respect to the aggregate volume of trade flows in the transaction-level VAT data. To improve readability, we only separate the trade flows for eight counties (prioritizing those with the largest aggregate amount of transactions and those that act as regional hubs). We bundle the trade flows for the remaining 39 counties.

Given the high concentration of manufacturing firms and their upstream position in the network, we expect that Nairobi-based firms supply inputs to a wide range of other sectors and firms across the country. Indeed, as seen in Table 2, firm outlinks (α = 0.35) are more spatially concentrated than inlinks (α = 0.48), indicating that the supply of inputs to the broader network is more spatially concentrated.
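The Pareto exponents compared here come from the rank regression stated in the note to Table 2, log rank = log A − α log x. A minimal sketch of that regression follows, assuming plain OLS without any small-sample correction (the paper's exact estimator and standard-error computation are not specified in this section). The function name `pareto_exponent` and the county sizes are hypothetical and purely illustrative.

```python
import math

def pareto_exponent(values):
    """Estimate alpha from log(rank) = log(A) - alpha * log(x) by plain OLS.

    Assumption: no small-sample correction is applied to the ranks.
    """
    # Keep strictly positive values and rank them from largest (rank 1) down.
    xs = sorted((v for v in values if v > 0), reverse=True)
    log_x = [math.log(v) for v in xs]
    log_rank = [math.log(i + 1) for i in range(len(xs))]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_r = sum(log_rank) / n
    cov = sum((a - mean_x) * (b - mean_r) for a, b in zip(log_x, log_rank))
    var = sum((a - mean_x) ** 2 for a in log_x)
    # The regression slope equals -alpha.
    return -cov / var

# Hypothetical sizes for 47 counties following an exact power law, alpha = 0.5.
county_sizes = [1000.0 / rank ** 2 for rank in range(1, 48)]
alpha_hat = pareto_exponent(county_sizes)
```

A lower estimated α means the measure falls off more steeply with rank, i.e. activity is more concentrated in the top-ranked counties.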
Although firms’ sales destinations are relatively less concentrated, it is perhaps more unexpected that Nairobi is also the top destination for firm-to-firm sales of formal firms located outside the city. This is well-illustrated by the left matrix in Figure 4, which plots the share of each county’s total sales across all 47 counties. The column corresponding to sales to Nairobi stands out with more intense shading.

Figure 4: County-by-county trading relationships (panels: Sales shares; Purchase shares)
Each column and row in the above graph corresponds to one of Kenya’s 47 counties. The graphs plot how much each county accounts for any other county’s sales (purchase) shares, i.e. the row- (column-) normalised county-by-county matrix derived from the administrative firm-to-firm transaction-level data. The rows and columns are ordered alphabetically based on county names, which are omitted to improve readability.

Formal firms trade significantly more across regions than within. Lastly, we find surprisingly limited evidence of home bias in the formal sector firm-to-firm data. Put differently, local county-level markets are not as important for formal firms as one might expect. This pattern is salient in the heatmaps in Figure 4. In both heatmaps, the squares along the diagonal represent the share of sales and purchases to other firms within the county. While the diagonal is clearly visible, particularly in the heatmap highlighting the destinations of firm sales, the lighter shades indicate that local markets are less important than trade with Nairobi. Indeed, for the median county, the number of supplier-to-buyer links with Nairobi is twice as large as the number of intra-county links. Overall, we find that the number of inter-county supplier-buyer links is four times larger than the number of intra-county links for the median county. However, as expected, selling locally is more common for firms than sourcing inputs locally.
This is reflected in the heatmap, where the diagonal, representing local trade, stands out more clearly for sales than for purchases.

The concentration of trade flows around firms based in metropolitan areas alongside the relatively less prominent role for intra-county trade are two striking features of formal firm-to-firm trade. These patterns may not be reflective of overall economic activity. For the remainder of the paper we will focus on getting a better understanding of whether we expect these to be features of overall trade, including the informal sector, or whether these patterns are innate to the formal sector, similar to the documented urban concentration of formal economic activity.

Fact 2: Incidence of informality varies by sector, geography, and position along the supply chain.

We now investigate whether informality is randomly distributed across the economy or systematically varies by sector, geography, and position in the supply chain. If informality is not randomly distributed, we expect that accounting for the informal sector will systematically alter the structure of the observed network. A first piece of suggestive evidence is that regions with higher levels of informality have fewer observed firm-to-firm links in the VAT data, even after controlling for population and travel time to metropolitan areas (Table A1).

Informal-sector shares correlate negatively with regional economic size and income. First, we explore the spatial distribution of informal firms, which we find predominantly reside in smaller markets. Figure A4 plots the distribution of formal sector shares across counties, measured using both value-added and employment metrics. The graph shows that in most counties, the formal sector accounts for less than 20% of economic activity. We find a strong correlation between a county’s formal sector share and both its economic size (measured by Gross County Product) and income level (measured by Gross County Product per capita).
As shown in Figure A5, economic size alone explains between 35% and 52% of the variation in formal sector shares across counties. This pattern is consistent across all three measures of economic activity: employment, value added and the number of firms. To validate that this positive correlation between market size and formal sector share is not merely an artifact of the administrative data, Figure 5 presents correlations between Gross County Product and three additional employment-based formality measures that do not rely on the administrative data. Notably, while more stringent definitions of informality yield flatter slopes, the R² remains stable. This consistency suggests that economic size explains similar proportions of county-level informality variation regardless of the measurement approach.

Figure 5: Share of formal sector employment and regional market size
The first measure uses the formal sector employment share according to the 2019 population census, the second measure considers the number of employees in licensed businesses, the third uses the same measure but disregards micro-enterprises, and the fourth measure considers employment in the tax records. Each measure represents a share, i.e. captures the proportion of economic activity that can be attributed to the formal sector.

The incidence of informality systematically varies across sectors. Beyond geographic patterns, informality also varies systematically across sectors. Figure 6 compares a sector’s value added (from administrative data) with its contribution to Kenya’s GDP (from national accounts). Manufacturing and business services show the closest alignment between these measures, which suggests that the bulk of economic activity in those two sectors takes place in the formal economy. This pattern is consistent with the fact that both sectors rely predominantly on inputs from other firms and tend to sell to other businesses (Figure A1). In contrast, downstream sectors closer to final consumers exhibit larger disparities between value added and GDP contributions (Figure A8). This pattern aligns with weaker self-enforcement in consumer-facing sectors inherent to most VAT systems (Pomeranz, 2015; Naritomi, 2019). We observe similar patterns in both the extensive margin of informality (comparing firm counts across data sources, Figure A6) and the intensive margin (comparing formal and informal employment from the 2019 population census, Figure A7). Both measures indicate higher informality in downstream sectors such as wholesale, retail, and other services.

Informal firms are located downstream of larger firms. While the previous section indicates that informal firms predominantly operate in downstream sectors, we now utilize survey data to further document their location along the supply chain. We find that informal firms predominantly operate downstream of large formal firms and in consumer-facing roles. When large firms interact with smaller, often informal firms, the interaction typically follows this pattern: large firms serve as input providers to informal businesses, whilst informal firms primarily act as distributors, serving end consumers.17 We draw on survey data on trading partners of micro, small and mid-sized enterprises (MSMEs) by KNBS (2016), which asks about a firm’s main source of input and main type of customer. Only 2.3% of all MSMEs state that a large firm is their main customer, while 14.5% rely on large firms as their main source of inputs.18 Figure 7 shows that the pattern holds across sectors.19

17 Cordaro et al. (2022) document this pattern in Kenya, showing how microenterprises distribute fast-moving consumer goods for multinationals.
18 KNBS (2016) defines large firms as those with more than 99 employees, which is larger than the average VAT-paying firm.

19 The survey likely provides a lower bound estimate of the interaction between the VAT-registered and non-VAT-registered firms. While MSMEs primarily trade with each other, the survey does not distinguish micro, small, and medium firms. Small enterprises, defined as those with up to 50 employees and KShs five million annual turnover (KNBS, 2016), may fall below the VAT threshold, but medium-sized firms often exceed it.

Figure 6: Value added by VAT firms vs GDP
This graph compares the sector-level contribution to national GDP to the value added (sales - purchases) of firms covered in the administrative tax records for 2019.

These results align with findings by Böhme and Thiele (2014) and Zhou (2022), who document similar linking patterns between formal and informal firms in Benin, Burkina Faso, Côte d’Ivoire, Mali, Sénégal, and Togo (Böhme and Thiele, 2014) and in India (Zhou, 2022), respectively.20 Gadenne et al. (2022) use granular data covering a wide range of sectors to document that non-VAT paying firms that participate in a simplified tax scheme in West Bengal, India, also sell little to VAT-paying firms, but purchase 50-75% of their inputs from VAT-paying businesses.

The higher incidence of informality in downstream sectors, as well as informal firms being more likely to purchase from larger firms rather than vice versa, are consistent with the underlying enforcement structure of VAT systems. The VAT system incentivises downstream firms to request receipts from suppliers to claim input VAT deductions against their output VAT obligations. However, end consumers and VAT-exempt entities lack incentives to demand receipts as they cannot claim VAT refunds (Naritomi, 2019). Consequently, we indeed expect downstream sectors to exhibit higher shares of economic activity outside the VAT system.
20 Böhme and Thiele (2014) look at multiple sectors, while Zhou (2022) focuses on manufacturing. For a review on links between the formal and informal economy, see Meagher (2013).

Figure 7: Links of small and medium sized enterprises to large firms (panels: Sales; Purchases)
The figure draws on data from the 2016 Small and Medium Enterprises (MSME) Survey by the Kenya National Bureau of Statistics (KNBS, 2016). We restrict the sample to participating firms with an annual revenue below the VAT registration cut-off. The survey asks each firm for their main input sources and their main customer type. Note that the customer and supplier category “MSME” also contains medium sized firms which can include formal tax-registered firms. The percentage captured in the “Large firm” category thus represents a lower bound on linkages between small non-VAT registered businesses and large VAT-registered private sector firms. KNBS (2016) defines non-MSMEs/large firms as entities with more than 99 employees.

4 A Network Formation Model with Heterogeneity in Sectors, Regions, and Firm Size

We now present and estimate a network formation model to (i) predict the formal firm network as observed in the data and (ii) estimate a revised network that accounts for informal firms. We will use the revised network to measure the extent to which ignoring informality has implications for the spatial patterns of domestic trade and the geographic variation in the pass-through of domestic and international trade shocks.

4.1 Model Motivation

We rely on the network formation model outlined in Bramoullé et al. (2012). This model is particularly well-suited for our purposes for three reasons.

First, it focuses on the entry of nodes into the network and the formation of links among them. In other words, the model captures the extensive margin of trade, i.e. firm location and firm-to-firm links.
As discussed earlier and documented in Appendix A.5, these two components account for 70-90% of the variation in trade flows.

Second, it allows us to easily incorporate three key dimensions of firm heterogeneity that can affect network formation: sectors, geography, and size. The sectoral dimension captures the underlying input-output structure, while the geographic dimension also allows us to study the question of spatial inequality. The size dimension captures both the well-documented positive relationship between firm sales and firm-to-firm connections (Bernard et al., 2022; Bacilieri et al., 2023) as well as potential differences in how small firms organize their supply chains across space and sectors.21 Table B1 shows that small firms within the same sector and county are less likely to directly source from manufacturing firms (upstream), but instead are more likely to source from retailers or wholesalers (downstream). Further, they are less likely to source from Nairobi-based suppliers and more likely to source locally.

Third, the model incorporates a flexible network formation process such that the emergent degree distribution can follow a power law. The underlying dynamic network formation process gives rise to the substantial inequality in outdegrees across firms that has been widely documented in the literature (Bernard and Moxnes, 2018; Panigrahi, 2022; Bernard et al., 2022; Bacilieri et al., 2023). Figure B1 plots the outdegree and indegree distribution of the formal firms in our VAT data, revealing a very unequal degree distribution that resembles a power law.

Our framework is flexible and allows us to estimate the share of firm-to-firm links formed via directed search (searching among the trade partners of existing suppliers) versus undirected search (often referred to as random search in the networks literature (Jackson and Rogers, 2007)).

4.2 Model Setup

Consider an economy with a set of firms denoted by N.
Each firm i ∈ N is of a given type θ_i ∈ Θ, where Θ is the set of all possible types. In our application, we specify firm types as unique sector-county-size combinations, i.e. all firms in the same sector, county, and size group are classified as the same type. Our aim is to predict a matrix π that captures the number of links that exist between every possible pair of firm types.

21 Economies of scale in trade cost at the firm level give rise to supply chain structures with several intermediaries (Grant and Startz, 2022). Hence, we expect the linking patterns of small firms to diverge from large firms. Relevant for our case, economies of scale can result in firms of different sizes, but operating within the same geography and sector, exhibiting different sourcing patterns.

The network formation process is as follows. In every period t, one new buyer firm enters; it is of type θ with probability p(θ). Hence, the number of firms in the network in any given time period is equal to the number of time periods that have passed since t = 0. In order for its operations to be viable, the new firm needs to source inputs from a fixed number of suppliers m. To do this, it first chooses a supplier type θ′ (i.e. a sector-county-size combination) with probability p(θ, θ′), for each θ′ in Θ. The probabilities p(θ, θ′) represent the firm’s bias in terms of the sectors, regions, and firm size types it wants to link with. In other words, the probability that a buyer of type θ finds a supplier of type θ′ may not necessarily be equal to the probability of θ′ in the firm population. These biases can reflect production technologies or homophilous preferences arising out of search costs and information frictions. Firms in the location of type θ might find it easier to link to firms in a nearby location (type θ′) than to firms in a distant location (type θ′′). Likewise, firms in sectors that
supply services like electricity or telecommunication, which almost every firm requires as inputs, might find themselves with linking probabilities p(θ, θ′) that exceed their entry probability p(θ).

Having chosen the sector-region-size type it wants to link with, the firm now relies on two different search technologies to form its m links. First, undirected search (a.k.a. random search): the new firm ‘randomly’ links to other firms of the chosen type. It forms a fraction r of its total m links in this manner. Second, preferential attachment: the new firm forms the remaining fraction 1 − r of its m links to suppliers by searching among the existing suppliers it acquired via undirected search. In other words, once the buyer firm forms links to the first set of suppliers, it then ‘randomly’ links with the suppliers of its suppliers. The second step of this process is preferential in that suppliers that are more connected are more likely to be chosen. This process continues for several time periods and the network evolves accordingly.22

Ultimately, we are interested in the number of links between each sector-county-size type and their outdegrees. To this end, consider a matrix B where each row and column represents a type θ ∈ Θ. Its θθ′’th entry is then equal to

$$B_{\theta\theta'} = p(\theta)\,\frac{p(\theta, \theta')}{p(\theta')}.$$

Bramoullé et al. (2012) rely on B to derive the matrix π whose ij’th entry shows the number of directed links at time t between buyers of type i and suppliers of type j which are born in $t_0$:

$$\pi_{t_0}^{t} = m\,\frac{r}{1-r}\,\bigl(f(t, B) - I\bigr) \tag{1}$$

Here, t refers to the time period, I is the identity matrix, and f is a power series in the matrix B defined as follows:

$$f(t, B) = \sum_{\mu=0}^{\infty} \frac{\bigl((1-r)\log(t)\,B\bigr)^{\mu}}{\mu!}$$

Newly entered buyers form m inlinks in every period. As a result, the outdegree of existing firms, i.e. the suppliers of the newly entered firms, evolves over time. Thus, the matrix $\pi_{t_0}^{t}$ gives the expected outdegree (i.e.
number of buyers) that each column-type node born at time $t_0$ has among each row type, computed at time t. The purpose of the dynamic network formation process is to rationalise the heterogeneity in outdegree. Our focus on predicting outdegrees while keeping indegrees fixed is further discussed in Appendix B.3. We motivate this assumption using stylized facts observed in the data. While this assumption helps us characterize the steady state of the model, we also discuss how relaxing it would likely reinforce our results.

22 The model takes the distribution of firm types as given, abstracting from firms’ endogenous entry decisions across sectors, regions, and size categories (and subsequently, formality status). Instead, we capture these entry patterns through exogenous probabilities p(θ) for each firm type, which correspond to the observed spatial, sectoral, and size distributions in our data. While one could extend the model to microfound these entry choices (for instance, to explain the concentration of economic activity in Nairobi), we prioritise matching the observed proportions of different firm types rather than explaining their underlying determinants.

4.3 Estimation Strategy

Given the granular data on the empirical formal sector firm-to-firm network, we are able to obtain the majority of the model parameters directly from the data (see Table 3 for an overview). These include all entry probabilities p(θ) ∀ θ ∈ Θ that a firm enters in a given sector, county, and size group as well as all interaction probabilities p(θ, θ′) between all possible sector-county-size types. We use the cross-section from 2019, the last pre-COVID year of our panel, to obtain the p(θ)s and p(θ, θ′)s.23 The parameter we need to estimate is r, the fraction of input links a firm obtains via undirected search independent of the network environment.
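To make equation (1) concrete, the following minimal sketch builds B from entry and linking probabilities and evaluates the type-by-type link matrix for a given r. It is an illustration under stated assumptions rather than the authors' implementation: the function names (`build_b`, `f_series`, `predict_pi`) are hypothetical, the infinite series is truncated at a fixed number of terms, and the two-type inputs are made up. It requires 0 < r < 1 and t > 0.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def build_b(p_entry, p_link):
    """B[i][j] = p(i) * p(i, j) / p(j), with i the buyer type, j the supplier type."""
    n = len(p_entry)
    return [[p_entry[i] * p_link[i][j] / p_entry[j] for j in range(n)]
            for i in range(n)]

def f_series(t, b, r, terms=60):
    """Truncated series sum over mu of ((1 - r) * log(t) * B)^mu / mu!.

    Assumption: 60 terms suffice; larger log(t) or B may need more.
    """
    n = len(b)
    scale = (1.0 - r) * math.log(t)
    scaled = [[scale * b[i][j] for j in range(n)] for i in range(n)]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # mu = 0
    total = [row[:] for row in term]
    for mu in range(1, terms):
        # term_mu = term_(mu-1) * scaled / mu
        term = [[v / mu for v in row] for row in mat_mul(term, scaled)]
        total = [[total[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return total

def predict_pi(p_entry, p_link, m, r, t):
    """Evaluate m * r / (1 - r) * (f(t, B) - I) for 0 < r < 1."""
    f = f_series(t, build_b(p_entry, p_link), r)
    n = len(p_entry)
    factor = m * r / (1.0 - r)
    return [[factor * (f[i][j] - (1.0 if i == j else 0.0)) for j in range(n)]
            for i in range(n)]

# Hypothetical two-type economy; each row of p_link sums to one.
pi_two_types = predict_pi([0.5, 0.5], [[0.8, 0.2], [0.3, 0.7]], m=30, r=0.45, t=100.0)
```

With a single type, B reduces to the scalar 1 and the series collapses to t^(1−r), which gives a convenient sanity check on the truncation.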
Table 3: Model parameters

| Parameters | Description | Source | Proxy | Value |
| --- | --- | --- | --- | --- |
| r | Share of suppliers via random search | Estimated | - | 0.45 |
| p(θ) | Entry probability of type θ | Data | Share of firms observed as θ | (0, 0.12] |
| p(θ, θ′) | Linking probability of θ and θ′ | Data | Share of links between θ and θ′ | (0, 1] |
| m | Indegree | Data | Avg. number of suppliers | 30 |
| t | Number of entry periods | Data | Number of firms in admin data | 56822 |

First, we classify firms into types defined by unique sector-location-size combinations. Sectors refer to 13 aggregate sectors, namely, agriculture, mining, manufacturing, utilities, construction, transportation and logistics, wholesale, hospitality, retail, business services, non-market services, other services, and miscellaneous (incl. international organisations and non-classified firms). Locations are given by the county in which the firm is located. Within each sector and county we further group firms into large and small firms. We define small firms as firms in the bottom sales quartile within a sector-county group. By restricting ourselves to two size bins only, we avoid having too few observations in each firm-type bin and the matrix of linking probabilities becoming too sparse. For example, all firms in the top three sales quartiles of Nairobi’s manufacturing sector are classified as the same type.

Next, we compute the probability that a type exists for all types in Θ. We do so by dividing the number of formal firms of a sector-county-size type by the total number of formal firms in the economy. The interaction probabilities p(θ, θ′) then represent the fraction of a sector-county-size type θ’s supplier relationships that it forms with type θ′. We compute the above probabilities for all possible combinations of types and use them to construct the matrix B. Moreover, we follow Jackson and Rogers (2007) and define m as the average indegree (i.e. average number of suppliers) in the network.
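The counting steps described above can be sketched as follows. This is an illustrative sketch only: the function name `type_probabilities`, the firm identifiers, and the toy links are hypothetical, firm types are assumed to be pre-assigned (the bottom-sales-quartile rule is not implemented here), and every firm in the toy data has at least one supplier.

```python
from collections import Counter

def type_probabilities(firm_types, links):
    """Compute p(theta), p(theta, theta') and m from firm-level data.

    firm_types: dict mapping firm id -> (sector, county, size) tuple.
    links: list of (buyer_id, supplier_id) pairs.
    """
    n_firms = len(firm_types)
    # Entry probability: share of firms observed as each type.
    type_counts = Counter(firm_types.values())
    p_entry = {theta: count / n_firms for theta, count in type_counts.items()}

    # Linking probability: share of a buyer type's supplier links per supplier type.
    pair_counts = Counter((firm_types[b], firm_types[s]) for b, s in links)
    buyer_totals = Counter()
    for (buyer_type, _), count in pair_counts.items():
        buyer_totals[buyer_type] += count
    p_link = {pair: count / buyer_totals[pair[0]]
              for pair, count in pair_counts.items()}

    # Average indegree: average number of suppliers per firm.
    m = len(links) / n_firms
    return p_entry, p_link, m

# Hypothetical toy data with two types.
MANUF = ("manufacturing", "Nairobi", "large")
RETAIL = ("retail", "Nakuru", "small")
toy_firms = {"a": MANUF, "b": MANUF, "c": RETAIL, "d": RETAIL}
toy_links = [("c", "a"), ("c", "b"), ("d", "a"), ("a", "c"), ("a", "b"), ("b", "d")]
p_entry, p_link, m_avg = type_probabilities(toy_firms, toy_links)
```

The resulting dictionaries can be arranged into a vector and a matrix over a fixed ordering of types to construct B.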
The variable t, by definition, is equal to the number of firms observed in the data, 56,822.

23 We exclude a small proportion of firms with zero suppliers in the data, as the model requires all entrants to form m buying links.

Using the parameters from the empirical data, we are able to predict the matrix of type-to-type network links π(r) for different choices of r ∈ [0, 1]. Appendix B.2 discusses the practical steps needed to construct the matrix during the estimation. In addition to the predicted version of the matrix π, we also observe the actual π in the data, where the ij’th entry of π is just the number of links between types i and j. We match the model-predicted matrix and the matrix in the data using a method of moments procedure to obtain r∗. Each moment is weighted by the probability with which we observe a specific sector-region-size type in the data. In doing so, we assign greater weight to more common sector-region-size types whose probabilities tend to be more stable over time. r∗ is defined as follows:

$$r^{*} = \arg\min_{r} \sum_{\theta} \sum_{\theta'} p(\theta)\,\bigl(\pi_{\text{model}}(\theta, \theta'; r) - \pi_{\text{actual}}(\theta, \theta')\bigr)^{2} \tag{2}$$

That is, r∗ minimises the weighted distance between the model-predicted matrix of type-by-type interactions and the corresponding matrix obtained from the data. We estimate r using simulated annealing. With only one parameter to estimate, we can plot the objective function for various values of r to ensure that our estimated value is indeed the global minimum (see Figure B4).

4.4 Estimation Results

Our estimation strategy yields a result of r∗ = 0.45. It suggests that a newly entered firm chooses 45% of its m suppliers at random, and the remaining 55% among the suppliers of its existing suppliers. A network with 55% of all links being formed via directed search suggests a prominent role for information frictions as firms rely on their suppliers to form new links.
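Because r is a single parameter, the objective in equation (2) can also be evaluated on a grid. The sketch below swaps the simulated annealing used in the paper for a plain grid search, purely for illustration. The names `weighted_distance`, `estimate_r`, and `toy_predict` are hypothetical, and `toy_predict` is a made-up one-type stand-in for the model's predicted matrix, not the paper's model.

```python
def weighted_distance(pi_model, pi_actual, weights):
    """Sum over buyer types i and supplier types j of w_i * (model - actual)^2."""
    return sum(
        weights[i] * (pi_model[i][j] - pi_actual[i][j]) ** 2
        for i in range(len(pi_actual))
        for j in range(len(pi_actual[i]))
    )

def estimate_r(predict, pi_actual, weights, grid_size=999):
    """Grid search over interior values of r (assumption: not simulated annealing)."""
    best_r, best_loss = None, float("inf")
    for step in range(1, grid_size + 1):
        r = step / (grid_size + 1)  # 0.001, 0.002, ..., 0.999
        loss = weighted_distance(predict(r), pi_actual, weights)
        if loss < best_loss:
            best_r, best_loss = r, loss
    return best_r

def toy_predict(r, m=30, t=1000.0):
    """Hypothetical one-type prediction: m * r / (1 - r) * (t^(1 - r) - 1)."""
    return [[m * r / (1.0 - r) * (t ** (1.0 - r) - 1.0)]]

# Generate "observed" links at r = 0.45 and recover r from them.
pi_observed = toy_predict(0.45)
r_hat = estimate_r(toy_predict, pi_observed, weights=[1.0])
```

Evaluating the loss on the full grid also makes it easy to inspect the objective for multiple local minima, in the spirit of the plot referenced in Figure B4.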
This estimate aligns with previous research documenting the importance of relational contracts in Kenya and neighbouring economies (Fafchamps, 2003). In a variant of this model, Chaney (2014) estimates r = 0.6 for French exporters forming links with trade partners abroad, which also suggests a substantial, but not quite as prominent, role of information asymmetries.

4.5 Model Fit

To assess how well our model fits the targeted outdegree distribution (i.e. the distribution of the number of buyers), we plot the degree distribution (i.e. total number of outlinks) of each sector-county-size type as observed in the data and as predicted by the model. Figure B5 shows that the key properties of the outdegree distribution are replicated by the model’s predictions. The model and data match particularly well in the right tail of the distribution, i.e. the part that is specifically targeted by allowing for directed search in the network formation process.

We also estimate the Pareto exponent α, which was not explicitly targeted by the model, for both degree distributions. We obtain an α of 0.36 from the model and 0.37 from the data. In addition, since we have previously shown that the formal firm network is spatially concentrated, we also assess how well the model predicts the share of buyer relationships in the economy that originate from firms in Nairobi, Mombasa, and the remaining counties. We find that the model performs well on this dimension too (Table 5). In the administrative data, 69% (11%) of all outlinks in the economy are captured by Nairobi (Mombasa) based firms and the model predicts a share of 70% (11%).

5 Predicting a Revised Network

With the estimated model at hand, we are now able to tackle the question of how including informal firms might affect the spatial patterns of domestic trade. Our proposed thought experiment is the following: suppose we were to observe informal firms. What would happen to the outdegree distribution of various types θ?
To obtain a ‘revised network’ that accounts for informal firms, we rely on updated information on the spatial and sectoral dispersion of economic activity in Kenya, now including the informal sector. In model terms, our exercise shifts the probabilities p(θ) with which we observe nodes of certain sector-region-size types θ to be born. Knowing r∗ and our updated p(θ)s, we can then once again predict the type-by-type matrix of firm-to-firm links π, keeping everything else constant. We will also make additional assumptions about the linking probabilities of informal firms.

5.1 Predicting the Sector-County Profile of Non-VAT Firms

To incorporate informal firms into the network, we first update the firm-type probabilities p(θ) for each sector-county-size cell, this time accounting for the entire firm size distribution. To update p(θ), we ideally would want to observe the number of firms $N_{cs}$ in each sector s, county c, and size cell, irrespective of their formality status. However, none of the KNBS records available to us feature a breakdown of the firm count along both the sector s and the county c dimension, let alone the size dimension. Therefore, instead of the firm count, we rely on granular sector and region level information on formal and informal employment in the 2019 census labor force module to compute our alternative entry probabilities p(θa):

$$p(\theta_{a,\,\text{large formal}}) = \frac{e^{\text{formal}}_{sc}}{\sum_{s}^{13} \sum_{c}^{47} \left( o_{sc} + e_{sc} \right)} \times \frac{1}{x^{\text{formal}}_{sc}} \tag{3}$$

$$p(\theta_{a,\,\text{informal/small formal}}) = \frac{o^{\text{formal}}_{sc} + o^{\text{informal}}_{sc} + e^{\text{informal}}_{sc}}{\sum_{s}^{13} \sum_{c}^{47} \left( o_{sc} + e_{sc} \right)} \times \frac{1}{x^{\text{informal}}_{sc}} \tag{4}$$

where $o_{sc}$ is the number of self-employed people, and $e_{sc}$ the number of (wage) employed people. The denominator sums total private sector employment (both wage and self-employed, formal and informal) across all 13 sectors and 47 counties. The updated sector-region-size probabilities p(θa) will again sum to one and hence capture a relative change in the number of firms rather than an absolute change.
Using simple employment shares to compute p(θa) relies on the assumption that the mapping of employees to firms is the same across all sectors and regions. However, empirically, manufacturing firms, for example, tend to be larger than businesses in the hospitality sector. Nairobi hosts larger firms than Mandera County in Kenya’s north. We therefore re-scale the number of employees by the average firm size in each sector-county-size cell, $x^{\text{formal}}_{sc}$ and $x^{\text{informal}}_{sc}$. For small formal and informal firms, we rely on the KNBS (2016) survey to compute the average number of employees, while we use the administrative data for large formal sector firms.24

For agriculture and non-market services, we estimate their p(θa) drawing only on formal private sector employment. Formal VAT-paying firms occupy a very specific niche in both cases (e.g. formal firms in these sectors are disproportionately export-oriented or the sector is dominated by non-profit entities, see discussion in Section 2.2) and informal employment takes vastly different forms (e.g. mainly reflects subsistence farming).

How does the probability p(θ) that a formal firm enters in a given sector-county-size cell shift to p(θa) once we account for informal firms? Figure C1 suggests that a 10 percentage point increase in formality corresponds with a 0.5 percentage point increase in p(θ) − p(θa) (0.35 standard deviations). As expected, p(θa) is lower than p(θ) for sectors and counties with a high degree of formality, indicating that their importance for the overall economy is overstated in the administrative data.

To recap, our proposed revised network accounts for informal firms being born into the network based on their sector-region profile. Rather than thinking of the exercise as adding new firms, we adjust the weights of each sector-region-size type.
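The re-weighting in Equations 3 and 4 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the employment counts and average firm sizes are invented, the field names are our own, and the final renormalisation step (so that the probabilities sum to one after dividing by firm size) is our assumption about how the stated sum-to-one property is obtained.

```python
# Hypothetical employment data per (sector, county) cell:
# o_* = self-employed, e_* = wage employed, x_* = average firm size.
cells = {
    ("retail", "Mandera"): {
        "o_formal": 10, "o_informal": 200, "e_formal": 20, "e_informal": 70,
        "x_formal": 10.0, "x_informal": 2.0,
    },
    ("manufacturing", "Nairobi"): {
        "o_formal": 20, "o_informal": 50, "e_formal": 500, "e_informal": 130,
        "x_formal": 50.0, "x_informal": 2.0,
    },
}

def alternative_entry_probabilities(cells):
    """Employment shares divided by average firm size, then renormalised (assumption)."""
    total_employment = sum(
        v["o_formal"] + v["o_informal"] + v["e_formal"] + v["e_informal"]
        for v in cells.values()
    )
    raw = {}
    for (sector, county), v in cells.items():
        # Equation 3: large formal firms, proxied by formal wage employment.
        raw[(sector, county, "large formal")] = (
            v["e_formal"] / total_employment / v["x_formal"]
        )
        # Equation 4: informal and small formal firms.
        raw[(sector, county, "informal/small formal")] = (
            (v["o_formal"] + v["o_informal"] + v["e_informal"])
            / total_employment / v["x_informal"]
        )
    norm = sum(raw.values())
    return {theta: value / norm for theta, value in raw.items()}

p_theta_a = alternative_entry_probabilities(cells)
```

In this toy example the informal-heavy retail cell in Mandera receives more weight than Nairobi manufacturing among small and informal types, which is the kind of shift the revised network is meant to capture.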
24 If big formal firms employ informal casual workers not captured in the administrative data, we understate their size and hence overestimate the probability of big formal firms in the network. As a result, our revised network becomes biased towards the baseline network that only covers the formal sector. This is also illustrated in Figure 8 where we compare the spatial inequality in county-level outdegrees in the baseline network to several scenarios that account for informal firms. Comparing the two scenarios where in one case we use the simple employment shares to compute p(θa) and in the other case further adjust for differences in firm size across sectors and counties, we find that in line with the intuition on the implications of informal workers in formal firms, the former scenario is closer to the original network with only formal firms.

5.2 Assumptions about Linking Probabilities of Informal Firms

Another challenge for integrating informal firms into the network of formal firms arises due to the lack of granular data on the sectoral and geographic composition of how informal firms link with both each other and with the formal sector. An ideal data set would provide details on sourcing and selling patterns by sector, geography, and formality status. In the absence of such data and given the strong correlation of size and formality status, our default approach will be to assume that informal firms exhibit linking preferences p(θ, θ′) similar to those of small formal firms, conditional on sector and geography.

This approach is particularly attractive given its straightforward implications for the sectoral and geographic composition of linking patterns. It further allows for informal and small formal firms to experience different wedges in link formation, provided these wedges generate aggregate linking patterns that are still comparable to those of small formal firms at the sector-county-size level.
Consider, for instance, a small formal retailer and an informal retailer both seeking to purchase soap. While neither can source directly from manufacturers in Nairobi, the small formal retailer might purchase from a large formal wholesaler locally, whereas the informal retailer—potentially excluded due to their tax status—might source from an alternative local wholesaler. Despite sourcing from different individual suppliers, both retailers exhibit the same patterns once their sourcing is aggregated at the sectoral and geographic levels. The similarity assumption is also motivated by findings in the administrative data, where we document that small formal firms tend to source more locally and from intermediaries (see Table B1). Nevertheless, this assumption may not fully capture the implications of challenges specific to informal firms for their linking patterns. Therefore, we subsequently introduce modifications to our approach, drawing on stylized facts from the existing literature, to assess the sensitivity of our results to alternative assumptions about the linking patterns of informal firms. 5.3 Characteristics of the Revised Network How does the revised network that accounts for informal firms compare to the baseline network? First, we find that firm types in sectors and counties with a high incidence of informality are predicted to have a relatively larger increase in outdegrees (Figure C2). Second, the share of total outlinks attributed to Nairobi- and Mombasa-based firms declines from 80% to 75% (see Table 5). While the two cities maintain their prominent role in the network, this shift is meaningful for the remaining counties in relative terms as it represents a 25% increase in their outdegrees. Third, accounting for informal firms reduces the variation in outdegrees across counties by 7.5% (Table 4). We visualize this reduction in inequality by plotting the Lorenz curve for county-level outlinks in Figure 8. 
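The two inequality measures used in this section, the coefficient of variation of county-level outdegrees (Table 4) and the Lorenz curve (Figure 8), can be computed as in the following sketch. The outlink counts are hypothetical, and the use of the population standard deviation is our assumption, since the text only specifies sd/mean.

```python
from statistics import mean, pstdev

def coefficient_of_variation(values):
    """sd/mean, using the population standard deviation (an assumption)."""
    return pstdev(values) / mean(values)

def lorenz_curve(values):
    """Points (cumulative share of counties, cumulative share of outlinks)."""
    ordered = sorted(values)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0
    for i, v in enumerate(ordered, start=1):
        running += v
        points.append((i / len(ordered), running / total))
    return points

# Hypothetical county-level outlink counts (same total in both networks).
baseline_outlinks = [700, 110, 60, 50, 40, 40]
revised_outlinks = [650, 100, 80, 65, 55, 50]

cv_baseline = coefficient_of_variation(baseline_outlinks)
cv_revised = coefficient_of_variation(revised_outlinks)
# Percentage change in sd/mean between the baseline and the revised network.
pct_change_cv = 100 * (cv_revised - cv_baseline) / cv_baseline

lorenz_baseline = lorenz_curve(baseline_outlinks)
lorenz_revised = lorenz_curve(revised_outlinks)
```

Dividing by the mean makes the comparison scale-free, which matters here because the model predicts relative rather than absolute changes in the number of outlinks.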
What drives the reduction of outdegree inequality? First, Nairobi and Mombasa become less important as a destination for products and services from other counties. In Figure 9 we plot the row-normalised adjacency matrices, before and after accounting for informal firms, at the county and sector level, respectively. The matrix is normalised so that each row sums to one. After accounting for informal firms, a smaller proportion of a county’s total outlinks now connects with firms in Nairobi (i.e., the column with the lightest color in the baseline matrix).

Table 4: County-level changes in the dispersion of outdegrees

County outdegree             ∆ sd/mean (in %)
All counties                 -7.5
Without Nairobi & Mombasa    -18.0

The above table reports the difference in outdegrees between the original and the revised network, aggregated at the county level. We look at the coefficient of variation as the key metric. Adjusting for the mean accounts for the fact that the change in the number of outlinks predicted by the model needs to be interpreted in relative rather than absolute terms. We exclude the outdegrees of Nairobi and Mombasa when we compute the coefficient of variation in the second row.

Figure 8: Inequality in county-level outlinks in the baseline and revised network

To visualise the change in inequality between the baseline and the revised network, we plot the Lorenz curve for the number of outlinks at the county level. The default scenario uses the entry probabilities p(θ) specified in Equations 3 and 4 and assumes similar linking patterns for informal firms and small formal firms conditional on their sector and county of operation.

Second, downstream relationships with firms in the same county now become relatively more prominent. The values of the diagonal entries of the adjacency matrix increase between baseline and revised network. This is consistent with the earlier stylized fact that smaller firms are more likely to source locally (Table B1).
With the exception of five counties, most notably Nairobi and Mombasa, trade within the county gains in importance for the other 42 counties. This pattern is illustrated even more clearly in Figure C4, which compares changes in both inter- and intra-county links between the baseline and revised networks. After accounting for informal firms, both inter- and intra-county outdegree increase for the median county as well as on average. However, for 83% of counties, the increase in intra-county outdegrees exceeds the increase in inter-county links.25 If informal firms source an even higher share of their inputs locally, the predicted shift towards intra-county trade represents a lower bound. We will discuss this in Section 6.1.

Table 5: The importance of Kenya’s primary cities in a revised network

in %                                                               Nairobi   Mombasa   Other counties
Population                                                         9         3         88
GDP                                                                25        5         70
GDP w/o agriculture                                                33        7         60
Number of outlinks
  Model fit
    Administrative data                                            69        11        20
    Model predicted network                                        70        11        19
  Revised network: informal firms ≈ small firms
    Default scenario                                               65        10        25
    p(θ) using employ. share only                                  66        10        23
  Revised network: alternative linking patterns for informal firms
    0% sales, 50% input to/from formal                             59        9         32
    0% sales, 75% input to/from formal                             63        10        28
    25% sales, 50% input to/from formal                            59        9         32

The above table documents the share of overall firm-to-firm links which have a supplier based in Nairobi, Mombasa or the remaining 45 counties. We present the spatial dispersion of outlinks in the data, the predicted (formal sector) network as well as our default scenario for the revised network and four alternative scenarios to assess sensitivity.

If counties are selling less to Nairobi and Mombasa, where do their inter-county trade links shift? We find that the number of bilateral trade links is now more sensitive to social ties between counties.
In Table C2, we regress the number of links between two counties on both travel distance and social connectedness (Bailey et al., 2021), and compare the results for the baseline and the revised network.26 The findings indicate that, in the revised network, inter-county trade links are more strongly associated with the strength of social ties between counties.

The increased importance of within-county trade and trade between socially connected counties gives rise to a network with a more pronounced group structure. We quantify the extent to which the network is partitioned by measuring the network’s modularity (Newman, 2006). The modularity of a network is higher when groups of nodes have more links among each other than what we would expect in a random network. We compute the modularity of the weighted adjacency matrix at the sector-county level. We find that modularity in the revised network with informal firms increases by about 46%, suggesting that the revised network exhibits a more pronounced group structure.27 To further characterize this group structure, we apply a community detection algorithm to the trade flows between counties as per the revised network. As illustrated by Figure C3, the group structure now correlates strongly with Kenya’s geography, i.e., geographically proximate counties are now more likely to be clustered in the same group.

25 While inter-county trade rises for the median county, 18 out of 47 counties actually have fewer links with other counties in the revised network.

26 Social connectedness is measured using friendship network data provided by a popular social media platform, see Bailey et al. (2021).

Next, we explore for which group of counties we underestimate the overall number of links the most. To do so, we regress the change in county-level outlinks between baseline and revised network on various county characteristics, including the aggregate level of formality, population, Gross County Product, and market access (Table C3).
The results reveal that the baseline network, which relies exclusively on formal sector data, most significantly underestimates connectivity in smaller counties and those with high market access. Notably, after controlling for these other characteristics, the relationship between a county’s aggregate formality share and the change in outlinks is no longer statistically significant.

Finally, we find that incorporating informal firms also reshapes inter-sector trade patterns (Figure 9). Sectors with substantial informal activity like other services, retail, and wholesale, now gain prominence as buyers in the network. Manufacturing, wholesale, and mining experience the largest relative gains in new links.

5.4 Simulating the Effect of Economic Shocks

As a next step, we ask how the newly predicted network that accounts for informal firms compares to the previous network in terms of its role in propagating domestic and international shocks. How does the predicted impact of the shock depend on whether we account for informality? Are sectors and regions with more informality more or less vulnerable to shocks than the administrative data would suggest? To answer these questions, we first simulate a series of domestic output shocks that reduce each firm type’s output and then analyse how it affects the output of all other types, both directly and indirectly, by propagating through the network over multiple time periods. Then, we simulate international supply shocks that affect firm types depending on their exposure to international markets. We discuss the results for both domestic and international shocks below.

27 We compute this by running the Leiden algorithm multiple times and averaging the resulting modularity scores. See Traag et al. (2019) for details on the algorithm.
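The domestic shock exercise outlined above, and specified in Section 5.4.1, can be sketched in a few lines. This is a minimal illustration, not the authors' code. It assumes the recursion in Equation 5 with a row-normalised link matrix, value added drawn from U[−10, 10], initial output drawn from U[0, 100], and 100 periods, as stated in the text. The two 3×3 link matrices, the shock size of 10 and the random seed are hypothetical.

```python
import random

def row_normalise(pi):
    """Divide each entry by its row sum so that every row sums to one."""
    return [[v / sum(row) for v in row] for row in pi]

def simulate(g, y0, value_added):
    """Iterate y_jt = sum_i g_ij * y_i,t-1 + v_jt and return the output path."""
    n = len(g)
    path = [y0]
    for v_t in value_added:
        prev = path[-1]
        path.append(
            [sum(g[i][j] * prev[i] for i in range(n)) + v_t[j] for j in range(n)]
        )
    return path

def shock_impact(pi, shocked, size=10.0, periods=100, seed=0):
    """Average output reduction of each type after a first-period shock to one type."""
    rng = random.Random(seed)  # same seed => identical draws across scenarios
    n = len(pi)
    g = row_normalise(pi)
    y0 = [rng.uniform(0, 100) for _ in range(n)]
    value_added = [[rng.uniform(-10, 10) for _ in range(n)] for _ in range(periods)]
    no_shock = simulate(g, y0, value_added)
    shocked_va = [row[:] for row in value_added]
    shocked_va[0][shocked] -= size  # hypothetical shock size
    with_shock = simulate(g, y0, shocked_va)
    return [
        sum(no_shock[t][j] - with_shock[t][j] for t in range(1, periods + 1)) / periods
        for j in range(n)
    ]

# Hypothetical type-by-type link counts (row sells to column).
pi_baseline = [[6, 3, 1], [8, 1, 1], [8, 1, 1]]
pi_revised = [[5, 3, 2], [4, 5, 1], [4, 1, 5]]

impact_baseline = shock_impact(pi_baseline, shocked=1)
impact_revised = shock_impact(pi_revised, shocked=1)
# Ratio above one: the baseline network understates the impact on that buyer type.
ratio = [r / b for r, b in zip(impact_revised, impact_baseline)]
```

Because the recursion is linear and the rows of the normalised matrix sum to one, the total output reduction across types equals the shock size in every period; the two networks differ only in how that reduction is distributed across types.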
Figure 9: Baseline versus revised network. Panels: county-by-county trading relationships (top) and sector-by-sector trading relationships (bottom), each shown for the baseline network (left) and the revised network (right).

The above figures show heatmaps of the predicted row-normalised adjacency matrix of the network (where row sells to column) as per the baseline p(θ) on the left and augmented p(θa) on the right at the county level (top) and sector level (bottom).

5.4.1 Domestic Shocks

Following the supply-side version of classic input-output models (Sargent and Stachurski, 2022), we define firm type j’s output $y_{jt}$ in period t as the sum of inputs it purchases from other types i plus payments to other factors of production (value added) $\upsilon_{jt}$:28

$$y_{jt} = \sum_{i} g_{ij}\, y_{it-1} + \upsilon_{jt} \tag{5}$$

The intermediate inputs purchased from other firm types are the product of each supplier i’s total output in the previous period $y_{it-1}$ and the fraction it sells to type j, $g_{ij}$. The $g_{ij}$s represent the normalized cells of our type-by-type matrix π that captures the total number of links between all types. We normalise the rows of π by dividing each entry in a row by the sum of that row. We abstract from any endogenous network adjustments (see e.g. Panigrahi, 2022; Eaton et al., 2022; Arkolakis et al., 2023).

28 Alternatively, $\upsilon_{jt}$ can also be interpreted as a type-specific and period-specific shock to output.

We assume that $\upsilon_{jt}$ is an independent draw from a uniform distribution U[−10, 10] for every type j in every time period t. Each type starts with a randomly chosen output drawn from the distribution U[0, 100] in t = 0. Using this set-up, we first simulate the output process without any shock. Then, we simulate the output process following a negative output shock to sector-region-size type j’s value added $\upsilon_{jt}$ in the first time period.29 We repeat this exercise for all types j. To study the relevance of unobserved informal firms, we conduct our simulation exercise twice.
In the first scenario, we use the matrix π derived from administrative records. In the second scenario, we use an alternative version of π using our revised network that accounts for the presence of informal firms.30

Our primary question is: how do domestic shocks impact each type when informal firms are considered versus when they are not? For each simulated shock, we compute: (i) the absolute reduction in output of each type using the original adjacency matrix (excluding informal firms) and (ii) the absolute reduction in output of each type using the new adjacency matrix (including informal firms), both averaged across all time periods. This yields a matrix of shock impacts where each row corresponds to a supplier who is shocked and each column corresponds to a buyer who faces the impact of the shock. We then aggregate across rows to compute the average impact of the shocks on buyers.

We find that the higher the incidence of informality in a sector and region, the more we underestimate the adverse impact of a domestic output shock. Figure 10 shows that our established employment-based measure of informality negatively correlates with the impact of the shock under the two scenarios. To align with our most granular measure of informality, we aggregate the response to shocks at the sector-region level.31 As shown in Table C4, a one percentage point decrease in the formal sector share corresponds with a 4.9 percentage point larger output drop following a domestic shock.

Figure C5 presents the distribution of the ratio of the shock impact at the sector-county level (revised network/baseline network). This ratio exceeds one for 48% of the sector-county pairs and 42% of the sector-county-size types, indicating that the baseline network on average underestimates the impact of domestic shocks for these types. Among the types for which we underestimate the impact, 73% are types with small firms, an indicator that the omission of informal firms is the primary driver of this result.

29 We compute the impact of the shock on each type’s output over 100 periods of time by comparing the two output processes. All of the outputs reported below are averages across the 100 time periods.

30 We ensure that the random component of output, $\upsilon_{jt}$, is identical across these two scenarios for each type j in every time period t to ensure it does not affect our results.

31 Put differently, we average the impact across each buyer’s suppliers and then compute a weighted sum for each sector-region cell. The weights are determined by a type’s entry probabilities.

Figure 10: How do output shocks pass-through in a revised network that takes into account informal firms? - % change in output drops and the level of formality

The above graph plots the percentage change of the output reduction in response to domestic output shocks for two scenarios: the baseline network using only administrative data and the revised network including informal firms. We aggregate the output reduction across buyer types at the sector and county level, weighted by the entry probability p(θ) for each size and formality type. The x-axis shows the formal sector share for each sector-region pair.

It is important to note that by considering the changes in output reduction between the two scenarios, we focus on relative shifts. These shifts are more notable for sector-county pairs that face a relatively smaller impact initially due to their peripheral role in the formal sector network. However, even after accounting for informal firms, the aggregate impact of the shock remains largest for Nairobi- and Mombasa-based sectors. This mirrors the finding that the two cities still account for a sizable share of outlinks, as discussed in the previous section.

5.4.2 Import Shocks

In addition to a domestic shock, we consider the impact of a reduction in output in response to an adverse shock to international suppliers whom Kenyan firms source from.
As before, firm j’s output can be written as follows:

$$y_{jt} = \sum_{i} g_{ij}\, y_{it-1} + m_{jw}\, y_{w} + \upsilon_{jt} \tag{6}$$

Firm j’s output now additionally depends on world output $y_{w}$ in line with its import share $m_{jw}$, which we obtain from the administrative data. We re-normalise the rows of the adjacency matrix such that $\sum_{j} g_{ij} + m_{jw} = 1$. Next, we simulate a series of negative shocks to $y_{w}$ and analyse how it affects total output in the economy and the heterogeneous effects on various firm types. The bottom graph of Figure C5 plots the impact of the shock relying on the revised network as a proportion of the impact based on the baseline network.

Unlike domestic shocks, our findings for import shocks indicate that extrapolating from data on the formal sector network to the overall economy leads to an overestimation of the reduction in output. The impact is consistently less negative when using the revised network. This effect is particularly pronounced in sectors and regions with a higher incidence of informal activity (see Figure 11). Specifically, Table C4 shows that a 10 percentage point increase in the informal sector share corresponds to a 1 percentage point overestimation of the reduction in output. This pattern emerges because formal firms in predominantly informal markets have more unobserved connections than captured in administrative data, reducing their effective exposure to import shocks.

Why do the predictions differ for domestic and import shocks? When accounting for informality, sectors and counties with a high share of informal activity become more prominent in the network, making them more susceptible to economic shocks. Conversely, this adjustment reduces the relative importance of formal-dominated sectors, which typically have higher import shares and international exposure.
By adjusting their prominence (i.e., modifying their entry probabilities and considering informality), we find that the economy seems more resilient to trade shocks but more vulnerable to domestic shocks. The intuition behind our result aligns with the mechanism discussed in Di Giovanni and Levchenko (2012), who show that smaller economies tend to experience more volatility due to having fewer firms and less diversification. Applied to our setting, focusing only on formal sector firms leads to overstating the importance of internationally-linked formal firms and underestimating the diversification of the regional economy.

Figure 11: How does a shock to import markets pass-through in a revised network that takes into account informal firms? - Output drops and the level of formality

The above graph plots the percentage change of the output reduction in response to international trade shocks for two scenarios: the baseline network using only administrative data and the revised network including informal firms. We aggregate the output reduction across buyer types at the sector and county level, weighted by the entry probability p(θ) for each size and formality type. The x-axis shows the formal sector share for each sector-region pair.

6 How Sensitive are Results to Alternative Linking Patterns?

What happens if we relax the assumption that informal firms have similar linking patterns to small formal firms? Given our lack of disaggregated data on how informal firms link with the formal sector, we stress-test our results relying on alternative assumptions motivated by evidence provided in the literature and consistent with the stylized facts presented in Section 3. First, informal firms almost exclusively sell to final consumers. Second, informal firms purchase from formal firms, but not as much as formal firms purchase from each other (Böhme and Thiele, 2014; Gadenne et al., 2022). Finally, informal firms predominantly source locally (Amodio et al., 2024). These assumptions allow us to construct extreme bounds for linking patterns that we might observe if we were to have granular network data with informal firms.

6.1 Alternative Assumptions about Linking Patterns of Informal Firms

Consider now the three groups of firms: small formal firms s_i ∈ F_s, large formal firms l_i ∈ F_l, and informal firms n_i ∈ N, with their types (i.e. sector and county composition) denoted by θ_{s_i}, θ_{l_i}, and θ_{n_i} respectively. We will drop the index i in what follows as the conditions do not vary across firms, once we take their sector, county, and size (informal, small, or large) into account.

We start by considering the sales patterns of informal firms. We consider two alternative linking probabilities that capture the differential sales patterns of informal firms. In one scenario, we set the probability that an informal firm sells to formal firms to zero. In a second scenario, we assume that informal firms form one-fourth of their outlinks with formal firms, where the sector and geographic composition of these links follows those of small formal firms. These assumptions are motivated by stylized facts documented earlier which show that micro, small, and medium enterprises with sales below the VAT threshold rarely sell to large firms in Kenya. This is also in line with findings in Böhme and Thiele (2014) for informal firms in six urban centers in West Africa, and in Gadenne et al. (2022) for firms in West Bengal, India, that participate in a simplified tax scheme and are similar to informal firms. Finally, it is consistent with the set up of VAT systems requiring firms to ask for a receipt from their supplier in order to claim input VAT (Pomeranz, 2015).
In summary, for all formal and informal types we make the following extreme assumptions regarding the sales patterns of informal firms:

$$\begin{aligned}
p(\theta_n, \theta_l) &= 0 &\quad \text{sensitivity: } p(\theta_n, \theta_l) &= 0.25 \times p(\theta_s, \theta_l) \\
p(\theta_n, \theta_s) &= 0 &\quad \text{sensitivity: } p(\theta_n, \theta_s) &= 0.25 \times p(\theta_s, \theta_s)
\end{aligned} \tag{7}$$

We now turn to informal firms and their sourcing patterns. In this case, we assume that informal firms source only 50% of their inputs from formal firms (large or small) in one scenario and 75% in an additional sensitivity check. Conditional on sourcing from the formal sector, the sectoral and geographic preferences of informal firms will again follow those of small formal firms. As documented in our empirical section, MSMEs do source some of their inputs from large firms. Böhme and Thiele (2014) find that informal firms buy about half as much from formal firms as formal firms source from each other. Gadenne et al. (2022) find that the smallest group of firms under the simplified tax scheme in West Bengal source a similar share from VAT firms, while the largest non-VAT firms source as much as three-quarters of their inputs from VAT firms. We rely on these point estimates to make the following assumption:

$$\begin{aligned}
p(\theta_l, \theta_n) &= 0.5 \times p(\theta_l, \theta_s) &\quad \text{sensitivity: } p(\theta_l, \theta_n) &= 0.75 \times p(\theta_l, \theta_s) \\
p(\theta_s, \theta_n) &= 0.5 \times p(\theta_s, \theta_s) &\quad \text{sensitivity: } p(\theta_s, \theta_n) &= 0.75 \times p(\theta_s, \theta_s)
\end{aligned} \tag{8}$$

Informal firms will then source the remainder of their inputs from other informal firms. This requires assumptions about their preferences on which sectors and counties to source from. Motivated by evidence from the literature where Amodio et al. (2024) find that the vast majority of small, largely informal firms in Ethiopia obtain their inputs from local sources, we allow inputs obtained from other informal firms to only be sourced locally. Consider any counties a and b in the set of counties C and let θ_{f,a} be the type of small or large firm f in F_s ∪ F_l in county a.
The assumption implies the following:

$$p(\theta_{n,a}, \theta_{n,b}) = \begin{cases} \left( 1 - \sum_{f \in F_s \cup F_l} p(\theta_f, \theta_n) \right) \times p(\theta_{n,b}) & \text{if } a = b \\ 0 & \text{if } a \neq b \end{cases} \tag{9}$$

Finally, we now compute separate p(θ)s for informal and small formal firms, splitting the main term of Equation 4. For additional details see Section C.2.

6.2 Implications for the Revised Network

Figure 8 and Table C1 summarize how alternative assumptions about informal firms affect the distribution of outlinks across counties. If we assume informal firms source half of their inputs from the formal sector but have no formal buyers, inequality in county-level outlinks declines by 16% compared to our model-predicted network. This represents an 8.5 percentage point larger reduction than our default scenario where small formal and informal firms share similar linking patterns. The share of total outlinks accounted for by Nairobi- and Mombasa-based firms now drops to 69%, down from 80% in the baseline network and 75% in the first scenario with informal firms (see Table 5). Allowing informal firms to sell 25% of their output to formal firms, instead of none, has little implication for both the overall decline in inequality and Nairobi’s share of outlinks. Allowing for a larger share of informal firms’ inputs to be sourced from the formal sector results in a smaller reduction in inequality (11%) and a higher share of Nairobi- and Mombasa-based links (73%), more closely matching the initial network with informal firms. This pattern is driven by the greater reliance of informal firms on formal suppliers in this scenario, which in turn tend to locate in counties with larger formal economies.

Our simulations of domestic and international trade shocks under these alternative assumptions reinforce our earlier findings, showing even stronger relative effects in sectors and counties with high informality (Figure C6). This more pronounced impact stems from increased intra-county linkages among informal firms.
While more prominent local linkages reduce spatial inequality and urban concentration, they amplify the impact of domestic shocks through stronger within-county multiplier effects.

7 Conclusion

Firm-to-firm transaction-level data sourced from tax records have become a valuable tool for mapping domestic firm networks and trade flows. Aside from studying the micro dynamics of firm-to-firm relationships, they allow researchers to compute often hard-to-observe flows of goods and services across regions within national economies, which can be leveraged to estimate the welfare implications of policy interventions.

This paper shows that applying such data in settings with high informality can pose important challenges. We find evidence that formal sector trade patterns provide a skewed representation of overall economic activity: they are more spatially concentrated and underestimate intra-regional trade in favor of trade with urban hubs. We incorporate informal firms into a structural model of network formation, leading to a revised network that is more locally connected and spatially less concentrated.

We further provide evidence on the implications of extrapolating from formal sector data to the aggregate impact of economic shocks, i.e. of ignoring the presence of the informal sector. Simulations of the pass-through of output shocks using the revised network reveal that formal sector data leads us to underestimate the impact of domestic output shocks in regions and sectors with high informality. Conversely, we may overstate the local output effects of international trade shocks in sectors and regions with a high incidence of informality. This is because formal sector data places more weight on formal firms with stronger links to international markets, when in fact the overall economy has weaker ties to import markets.

Our findings are applicable to a variety of contexts with high informality across the world.
Informality matters because it occurs systematically: it is not randomly distributed across sectors and geographies. Our proposed approach for incorporating informal firms is useful for settings where data on the links of informal firms are not available with sufficient information on heterogeneity across key dimensions (e.g. sector and geography) and researchers therefore have to rely on secondary sources to recover the structure of the broader network. Our findings highlight the value of complementing administrative firm-to-firm data with auxiliary data on informality to draw inferences about the aggregate implications of economic policies.

An important question for future research, beyond the scope of this paper, is whether the observed spatial concentration of formal sector firm networks is a result of market frictions or a feature of structural transformation (Gollin, 2008). Understanding its drivers can inform policy recommendations about the optimal distribution of formal economic activity across space.

References

Acemoglu, D., Carvalho, V. M., Ozdaglar, A. and Tahbaz-Salehi, A. (2012), ‘The network origins of aggregate fluctuations’, Econometrica 80(5), 1977–2016.
Adão, R., Carrillo, P., Costinot, A., Donaldson, D. and Pomeranz, D. (2022), ‘Imports, exports, and earnings inequality: Measures of exposure and estimates of incidence’, The Quarterly Journal of Economics 137(3), 1553–1614.
Ades, A. F. and Glaeser, E. L. (1995), ‘Trade and circuses: explaining urban giants’, The Quarterly Journal of Economics 110(1), 195–227.
Albert, C., Bustos, P. and Ponticelli, J. (2021), The effects of climate change on labor and capital reallocation, Technical report, National Bureau of Economic Research.
Alfaro-Ureña, A., Manelici, I. and Vasquez, J. P. (2022), ‘The effects of joining multinational supply chains: New evidence from firm-to-firm linkages’, The Quarterly Journal of Economics 137(3), 1495–1552.
Almunia, M., Henning, D. J., Knebelmann, J., Nakyambadde, D.
and Tian, L. (2023), Leveraging Trading Networks to Improve Tax Compliance: Experimental Evidence from Uganda, Centre for Economic Policy Research.
Amodio, F., Benveniste, E., Pham, H. and Sanfilippo, M. (2024), ‘The local (informal) multiplier of industrial jobs’. Mimeo.
Antràs, P., Chor, D., Fally, T. and Hillberry, R. (2012), ‘Measuring the upstreamness of production and trade flows’, American Economic Review 102(3), 412–16.
Arkolakis, C., Huneeus, F. and Miyauchi, Y. (2023), Spatial production networks, Technical report, National Bureau of Economic Research.
Atkin, D. and Donaldson, D. (2015), Who’s getting globalized? the size and implications of intra-national trade costs, Working Paper 21439, National Bureau of Economic Research.
Atkin, D. and Khandelwal, A. K. (2020), ‘How distortions alter the impacts of international trade in developing countries’, Annual Review of Economics 12(1).
Bacilieri, A., Borsos, A., Astudillo-Estevez, P. and Lafond, F. (2023), ‘Firm-level production networks: what do we (really) know’, INET Oxford Working Paper 2023.
Bailey, M., Gupta, A., Hillenbrand, S., Kuchler, T., Richmond, R. and Stroebel, J. (2021), ‘International trade and social connectedness’, Journal of International Economics 129, 103418.
Balboni, C., Boehm, J. and Waseem, M. (2023), ‘Firm adaptation in production networks: evidence from extreme weather events in pakistan’. Mimeo.
Baqaee, D. R. and Farhi, E. (2019), ‘The macroeconomic impact of microeconomic shocks: Beyond hulten’s theorem’, Econometrica 87(4), 1155–1203.
Bernard, A. B., Dhyne, E., Magerman, G., Manova, K. and Moxnes, A. (2022), ‘The origins of firm heterogeneity: A production network approach’, Journal of Political Economy 130(7), 1765–1804.
Bernard, A. B. and Moxnes, A. (2018), ‘Networks and trade’, Annual Review of Economics 10, 65–85.
Bernard, A. B., Moxnes, A. and Saito, Y. U.
(2019), ‘Production networks, geography, and firm performance’, Journal of Political Economy 127(2), 639–688.
Blanchard, P., Gollin, D. and Kirchberger, M. (2021), ‘Perpetual motion: Human mobility and spatial frictions in three african countries’, CEPR Discussion Papers No. 16661.
Böhme, M. H. and Thiele, R. (2014), ‘Informal–formal linkages and informal enterprise performance in urban west africa’, The European Journal of Development Research 26, 473–489.
Boken, J., Gadenne, L., Nandi, T. and Santamaria, M. (2023), ‘Community networks and trade’, CEPR Working Paper DP17787.
Bramoullé, Y., Currarini, S., Jackson, M. O., Pin, P. and Rogers, B. W. (2012), ‘Homophily and long-run integration in social networks’, Journal of Economic Theory 147(5), 1754–1786.
Bustos, P., Garber, G. and Ponticelli, J. (2020), ‘Capital accumulation and structural transformation’, The Quarterly Journal of Economics 135(2), 1037–1094.
Carvalho, V. M., Nirei, M., Saito, Y. U. and Tahbaz-Salehi, A. (2021), ‘Supply chain disruptions: Evidence from the great east japan earthquake’, The Quarterly Journal of Economics 136(2), 1255–1321.
Castro-Vincenzi, J., Khanna, G., Morales, N. and Pandalai-Nayar, N. (2024), Weathering the storm: Supply chains and climate risk, Technical report, National Bureau of Economic Research.
Chacha, P. W., Kirui, B. K. and Wiedemann, V. (2024), ‘Supply chains in times of crisis: Evidence from kenya’s production network’, World Development 173, 106363.
Chandrasekhar, A. (2016), ‘Econometrics of network formation’, The Oxford Handbook of the Economics of Networks pp. 303–357.
Chaney, T. (2014), ‘The network structure of international trade’, American Economic Review 104(11), 3600–3634.
Cordaro, F., Fafchamps, M., Mayer, C., Meki, M., Quinn, S. and Roll, K. (2022), Microequity and mutuality: Experimental evidence on credit with performance-contingent repayment, Technical report, National Bureau of Economic Research.
De Paula, A. and Scheinkman, J. A.
(2010), ‘Value-added taxes, chain effects, and informality’, American Economic Journal: Macroeconomics 2(4), 195–221.
Demir, B., Javorcik, B., Michalski, T. K. and Ors, E. (2022), ‘Financial Constraints and Propagation of Shocks in Production Networks’, The Review of Economics and Statistics pp. 1–46.
Demir, B., Javorcik, B. and Panigrahi, P. (2024), ‘Breaking invisible barriers: Does fast internet improve access to input markets?’, CESifo Working Paper 11567.
Di Giovanni, J. and Levchenko, A. A. (2012), ‘Country size, international trade, and aggregate fluctuations in granular economies’, Journal of Political Economy 120(6), 1083–1132.
Dix-Carneiro, R., Goldberg, P. K., Meghir, C. and Ulyssea, G. (2024), Trade and domestic distortions: The case of informality, Technical report.
Donaldson, D. (2025), Transport infrastructure and policy evaluation, in ‘Handbook of Regional and Urban Economics, vol. 6’, Elsevier North Holland Amsterdam.
Eaton, J., Kortum, S. and Kramarz, F. (2011), ‘An anatomy of international trade: Evidence from French firms’, Econometrica 79(5), 1453–1498.
Eaton, J., Kortum, S. S. and Kramarz, F. (2022), Firm-to-firm trade: Imports, exports, and the labor market, Technical report, National Bureau of Economic Research.
Elgin, C., Kose, M. A., Ohnsorge, F. and Yu, S. (2021), ‘Understanding informality’, CEPR Discussion Paper 16497.
Fafchamps, M. (2003), Market institutions in sub-Saharan Africa: Theory and evidence, MIT press.
Fan, T., Peters, M. and Zilibotti, F. (2023), ‘Growing like india—the unequal effects of service-led growth’, Econometrica 91(4), 1457–1494.
Gabaix, X. (2009), ‘Power laws in economics and finance’, Annu. Rev. Econ. 1(1), 255–294.
Gadenne, L., Nandi, T. K. and Rathelot, R. (2022), ‘Taxation and supplier networks: Evidence from india’, Working Paper.
Goldberg, P. K. and Reed, T. (2023), ‘Demand-side constraints in development: The role of market size, trade, and (in)equality’, Econometrica.
Gollin, D.
(2008), ‘Nobody’s business but my own: Self-employment and small enterprise in economic development’, Journal of Monetary Economics 55(2), 219–233.
Grant, M. and Startz, M. (2022), Cutting out the middleman: The structure of chains of intermediation, Technical report, National Bureau of Economic Research.
Hansman, C., Hjort, J., León-Ciliotta, G. and Teachout, M. (2020), ‘Vertical integration, supplier behavior, and quality upgrading among exporters’, Journal of Political Economy 128(9), 3570–3625.
Herrendorf, B., Rogerson, R. and Valentinyi, A. (2022), New evidence on sectoral labor productivity: Implications for industrialization and development, Technical report, National Bureau of Economic Research.
Huremovic, K., Jiménez, G., Moral-Benito, E., Peydró, J.-L. and Vega-Redondo, F. (2024), ‘Production and financial networks in interplay: Crisis evidence from supplier-customer and credit registers’, Available at SSRN 4657236.
Jackson, M. O. and Rogers, B. W. (2007), ‘Meeting strangers and friends of friends: How random are social networks?’, American Economic Review 97(3), 890–915.
Jefferson, M. (1939), ‘The law of the primate city’, Geographical Review 29(2), 226–232.
Jefferson, M. (1989), ‘Why geography? The law of the primate city’, Geographical Review 79(2), 226–232.
Klenow, P. J. and Rodriguez-Clare, A. (1997), ‘The neoclassical revival in growth economics: Has it gone too far?’, NBER macroeconomics annual 12, 73–103.
KNBS (2010), ‘Basic Report on the 2010 Census of Industrial Production’.
KNBS (2016), Micro, Small and Medium Establishment (MSME) Survey: Basic Report 2016, Technical report, Kenya National Bureau of Statistics.
KNBS (2017), Report on the 2017 Kenya Census of Establishments (CoE), Technical report, Kenya National Bureau of Statistics.
KNBS (2019), 2019 Kenya Population and Housing Census: Volume I, Technical report, Kenya National Bureau of Statistics.
KNBS (2022), Gross county product 2021 (gcp), Technical report, KNBS.
URL: https://www.knbs.or.ke/reports/kenya-gross-county-product-2021/
Kreindler, G. E. and Miyauchi, Y. (2023), ‘Measuring commuting and economic activity inside cities with cell phone records’, Review of Economics and Statistics 105(4), 899–909.
Liu, E. (2019), ‘Industrial policies in production networks’, The Quarterly Journal of Economics 134(4), 1883–1948.
Lü, L. and Zhou, T. (2011), ‘Link prediction in complex networks: A survey’, Physica A: statistical mechanics and its applications 390(6), 1150–1170.
Meagher, K. (2013), ‘Unlocking the informal economy: A literature review on linkages between formal and informal economies in developing countries’, Work. ePap 27, 1755–1315.
Memon, P. A. (1976), ‘Urban primacy in kenya’, IDS Working Paper Series, University of Nairobi 282.
Miyauchi, Y. (2024), ‘Matching and agglomeration: Theory and evidence from japanese firm-to-firm trade’, Econometrica 92(6), 1869–1905.
Mungo, L., Lafond, F., Astudillo-Estévez, P. and Farmer, J. D. (2023), ‘Reconstructing production networks using machine learning’, Journal of Economic Dynamics and Control 148, 104607.
Naritomi, J. (2019), ‘Consumers as tax auditors’, American Economic Review 109(9), 3031–72.
Newman, M. E. (2006), ‘Modularity and community structure in networks’, Proceedings of the national academy of sciences 103(23), 8577–8582.
Panigrahi, P. (2022), ‘Endogenous spatial production networks: Quantitative implications for trade and productivity’, Working Paper.
Pomeranz, D. (2015), ‘No taxation without information: Deterrence and self-enforcement in the value added tax’, American Economic Review 105(8), 2539–69.
Sargent, T. J. and Stachurski, J. (2022), ‘Economic networks: Theory and computation’, arXiv preprint arXiv:2203.11972.
Soo, K. T. (2005), ‘Zipf’s law for cities: a cross-country investigation’, Regional Science and Urban Economics 35(3), 239–263.
Startz, M. (2021), ‘The value of face-to-face: Search and contracting problems in nigerian trade’, Working Paper.
Storeygard, A. (2016), ‘Farther on down the road: transport costs, trade and urban growth in sub-saharan africa’, The Review of Economic Studies 83(3), 1263–1295.
Topalova, P. (2010), ‘Factor immobility and regional impacts of trade liberalization: Evidence on poverty from india’, American Economic Journal: Applied Economics 2(4), 1–41.
Traag, V. A., Waltman, L. and Van Eck, N. J. (2019), ‘From louvain to leiden: guaranteeing well-connected communities’, Scientific reports 9(1), 1–12.
Ulyssea, G. (2018), ‘Firms, informality, and development: Theory and evidence from brazil’, American Economic Review 108(8), 2015–47.
Zárate, R. D. (2022), Spatial misallocation, informality, and transit improvements: Evidence from mexico city, The World Bank.
Zhou, Y. (2022), The value added tax, cascading sales tax, and informality, in M. Bussolo and S. Sharma, eds, ‘Hidden Potential: Rethinking Informality in South Asia’, World Bank Publications, chapter The Value Added Tax, Cascading Sales Tax, and Informality, pp. 61–90.

Appendix

A Material for Data and Empirical Section

A.1 Supplementary Graphs and Tables

Figure A1: Composition of sales and purchases by sector
[Panels: Sales; Purchases]
The figures in the first row show sector-level aggregate sales (domestic + exports) and purchases (domestic + imports) for 2019. In the second row, we plot the share of each buyer and supplier type as a percentage of total sector-level sales and purchases.

Figure A2: Proportion of total network links by firm attributes
The above figure plots the share of total firm-to-firm links accounted for by different groups of suppliers and buyers. Panels are organized by sector (top), location (middle), and size (bottom). Bars represent the proportion of total links observed in the administrative data. Q1–Q4 refer to size quartiles, with Q1 being the smallest firms and Q4 the largest firms based on firms’ average sales volumes for the years 2016-2019.

Figure A3: Firm headquarter locations and population density
[Panels: Geographic density of firms; Population density]
The left map shows the density of firm headquarter locations at the sub-county level, i.e. the number of firms per km². The right map shows the population density, also at the sub-county level. The borders of Kenya’s 47 counties, the first administrative layer, are outlined in grey.

Table A1: Firms in counties with a higher informal sector share have fewer links in the administrative data

                                         total                 mean                  median                90th percentile       final demand
                                         buyers     suppliers  buyers     suppliers  buyers     suppliers  buyers     suppliers  %
Formal sector share (sector-county, %)   0.043***   0.037***   0.016**    0.009*     0.011**    0.007*     0.019**    0.011*     -0.166
                                         (0.014)    (0.012)    (0.007)    (0.005)    (0.005)    (0.004)    (0.008)    (0.006)    (0.181)
Population                               1.559***   1.415***   0.441***   0.282***   -0.180     0.072      0.543***   0.455***   -5.515*
                                         (0.294)    (0.266)    (0.138)    (0.090)    (0.132)    (0.088)    (0.140)    (0.117)    (2.763)
Travel time to Nairobi                   -0.699***  -0.591***  -0.301**   -0.170**   0.007      -0.139**   -0.318**   -0.225**   0.424
                                         (0.243)    (0.178)    (0.128)    (0.078)    (0.083)    (0.067)    (0.132)    (0.098)    (2.200)
Travel time to Mombasa                   -0.552**   -0.449**   -0.326**   -0.246***  -0.164     -0.282***  -0.378***  -0.262***  6.256***
                                         (0.244)    (0.176)    (0.132)    (0.083)    (0.127)    (0.084)    (0.137)    (0.095)    (1.998)
No. observations                         450        472        450        472        379        471        450        472        470
R2                                       0.469      0.540      0.400      0.326      0.266      0.315      0.408      0.307      0.242
Sector FE                                ✓          ✓          ✓          ✓          ✓          ✓          ✓          ✓          ✓

In this table we regress the number of firm-to-firm links aggregated at the sector and county level on informal sector employment shares from the population census, which we observe at the same level of disaggregation. In the last column we regress the share of sales to non-registered entities (consumers or firms outside the VAT system) on the formal sector share. Standard errors are clustered at the county level.

Figure A4: County-level formal sector shares
The above graphs plot the probability density function (pdf) for the dispersion of formal sector shares across Kenya’s 47 counties. The value added-based measure relies on the difference between county-sector-level national accounts and the administrative data to obtain formal sector shares. For the employment-based measure we rely on information about formal and informal employment at the sector and county level from the 2019 census.

Figure A6: The extensive margins of informality: in which sectors do informal firms operate?
The graph compares the number of firms covered in the administrative data and with an annual revenue of over KShs 5 million in 2016 to the number of firms with annual revenues above KShs 5 million in the 2016 Census of Establishments (CoE) (KNBS, 2017) as well as any firm captured in either data set. Further, it plots the number of licensed and unlicensed businesses reported by KNBS in KNBS (2016).

Figure A5: Informality, market size, and income levels
[Panels: Correlation of the formal sector share and Gross County Product; Correlation of the formal sector share and Gross County Product per capita]
The two graphs plot the correlation of the formal sector share with the Gross County Product in absolute and per capita terms respectively. Each marker represents one of Kenya’s 47 counties.

Figure A7: Formal and informal employment in private enterprises
The graph compares the number of formal employees employed in VAT-paying firms to the number of people who stated they are formally or informally employed in a private sector entity. To improve readability we omit the bar for informal employment in the agricultural sector. As per the 2019 population census, over 11 million people are informally employed in the agricultural sector.

Figure A8: The GDP/value-added gap and upstreamness
We plot the gap between value-added in the VAT and national accounts figures at the sub-sector level for the most granular sector classification reported in national accounts. We correlate it with a measure of upstreamness (Antràs et al., 2012), which captures how removed a sector is from final consumers (it takes a value of one if the sector sells everything directly to final consumers).

A.2 Margins of informality

Table A2 summarises four different margins of informality that can occur in firm networks: an extensive margin at the firm level and an intensive margin at the transaction level. Within each category, informality can occur due to either non-compliance or simply because a firm/transaction is too small to be taxed. A formally-registered wholesaler we interviewed in Nairobi’s Central Business District explained how the notion of an extensive and intensive margin of informality, well-established in the literature on labour markets (Ulyssea, 2018), extends to firm-to-firm transactions:

“All firms purchase from manufacturers and importers paying input VAT. They even have an interest in getting purchases that have VAT on it to inflate the input VAT. What they do to mitigate the VAT levy, [is that] they downplay their output VAT (i.e. sales). Some customers will purchase with receipt and output VAT on it. Some customers will purchase without a receipt.”

Moreover, VAT exemptions can be a legal reason why firms or transactions above the VAT threshold are not captured in administrative tax records.

Table A2: Margins of informality in firm networks

                       Extensive        Intensive
Below tax threshold    Small firms      Small transactions
Above tax threshold    Non-compliance   Non-compliance

Depending on the tax code, not all four margins of informality arise in every setting. The only margin of informality that is not relevant in our setting is the potential neglect of small VAT transactions.
This is because the Kenya Revenue Authority requires firms to record transactions of any size, conditional on both parties being VAT-registered. Moreover, we address some issues around non-compliance by relying on information from a firm’s trade partner to recover some omitted transactions and under-reported trade volumes.

A.3 Measures of Informality

As documented in Table A3, the two employment-based KNBS measures correlate well with all measures based on the administrative data. The measure capturing licensed businesses as a share of the universe of businesses in Kenya (including micro-enterprises) in contrast only correlates weakly with them. This likely reflects the fact that many of the licensed firms are very small themselves and their geographic dispersion does not correlate as strongly with the tax records. Any employment in licensed businesses (second row) is likely concentrated in larger licensed firms, which is why the employment-based measure aligns more strongly with the administrative data relative to the simple firm count.

Table A3: Correlation of formality measures

                               Formality measures based on admin data
KNBS measures                  No. firms   Employment   Value added
Employment (census)            0.78        0.83         0.78
Employment (licensed MSMEs)    0.58        0.69         0.62
No. firms (licensed)           0.20        0.16         0.11

The above table shows the correlation coefficients of different measures of the formal sector share. Each measure represents a share, i.e. captures the proportion of economic activity that can be attributed to the formal sector. The labels indicate the underlying unit of measurement and the source of the data. All measures are aggregated at the county level.

Figure A9: Comparison of formal sector shares based on census versus administrative records
The above graph correlates the share of the formal sector computed using employment figures from the administrative records with the share of formal private sector employment as per the 2019 population census (KNBS, 2019).
Each marker represents a county. The size of each marker is proportional to the economic size of the county, i.e. its Gross County Product. To avoid mechanical correlation between the two measures we use total employment in licensed firms as the denominator for the administrative data. The KNBS estimate for employment in licensed firms is based on micro data that is distinct from the population census.

A.4 The VAT-Paying Sector as a Share of GDP

The most relevant sector that is not well captured in the VAT data is agriculture, which generates 21%-23% of Kenya’s GDP. While part of the sector receives special tax treatment due to exemptions of mainly unprocessed agricultural commodities, some of the GDP gap can also be attributed to informality in the classic sense due to the prevalence of small holders in the sector. Non-market services include education, health, public administration, and real estate (Herrendorf et al., 2022). They contribute 22% to Kenya’s GDP, but are barely represented in the VAT data as most of the entities operating in these sectors are VAT exempt, not-for-profit, or the underlying sector’s size in the national accounts is estimated using non-market prices (see penultimate column of Table A4). Figure 6 highlights another sizeable gap for “others”, which includes international organisations, unclassified firms, and financial services.

Table A4 illustrates that the value added generated by the VAT sector has been declining over time as a proportion of GDP. This downward trend in value added can be attributed to two factors. First, the introduction in September 2018 of a tax on fuel, which was previously VAT exempt, has led to a reduction in value added. The impact of this tax is particularly relevant for the utilities sector. However, this sector alone cannot fully explain the overall downward trend and kink in the data.
Second, sectors that have significantly contributed to Kenya’s growth over the years, such as agriculture, real estate, financial services, and public administration, are not well captured in the VAT data.

Table A4: Share of GDP covered in the administrative records

        Share of GDP (%)
Year    All    ex Fin.    ex NMS+Fin.    ex Agri.    ex NMS+Fin.+Agri.    NMS    Agri.
2015    36     39         50             42          66                   22     21
2016    40     43         56             46          73                   22     21
2017    37     40         52             45          71                   22     22
2018    37     40         52             45          70                   22     22
2019    28     30         39             34          53                   22     23

The mid-section of the above table reports the share of GDP captured by the VAT data sequentially excluding (ex) specific sectors. Fin. refers to financial services. NMS refers to non-market services, i.e. education, health, public administration, and real estate (Herrendorf et al., 2022). Agri. refers to the agricultural sector. The first five data columns report the proportion of GDP captured by value added of the VAT-paying firms. The final two columns report the GDP share of non-market services and agriculture respectively. GDP figures are based on national accounts data (in current prices) published by the Kenya National Bureau of Statistics.

A.5 Firm Location and Relationships Drive Spatial Concentration in Trade Flows

The extensive margins of the firm network, namely firm location and firm-to-firm relationships, account for 70%-90% of the variation in aggregate trade volumes. Using transaction-level data, we are able to distinguish between four different sales margins: the number of firms, the number of relationships with buyers per firm, the number of transactions per relationship, and the average trade volume per transaction. The same is true for purchases. Table A5 summarises the share of the variance attributed to each term in both upstream (purchases) and downstream (sales) trade flows.[32] The number of firms operating in each county accounts for as much as 67% of the variance in purchases across counties.
This includes purchases the firms make within their own county and what they buy outside the county. The number of supplier relationships other counties have with the county accounts for yet another 22%. This leaves a little over 10% of the variance to be picked up by the intensive margins for trade, i.e. the number of transactions between firm pairs and the average transaction volume. Turning to downstream trade flows, i.e. the decomposition of the variance in sales across (sub-)counties, the location of firms plays a slightly less important role. Instead the number of firm-to-firm relationships now accounts for one third of the variance in network sales.

Table A5: Geographic concentration of economic activity in Kenya

Purchases
Aggregation   No. firms   No. relationships/firm   No. transactions/relation   Avg. volume/transaction
County        0.67        0.22                     0.14                        -0.04
Subcounty     0.53        0.29                     0.16                        0.06

Sales
Aggregation   No. firms   No. relationships/firm   No. transactions/relation   Avg. volume/transaction
County        0.60        0.31                     0.12                        -0.00
Subcounty     0.39        0.34                     0.15                        0.16

A.6 Spatial Concentration of Economic Activity and Multi-Establishment Firms

A potential concern with the VAT data is that it may overstate spatial concentration because firms are only required to report their headquarters’ locations, which are often situated in major cities like Nairobi or Mombasa. To assess the sensitivity of measures of spatial concentration to multi-establishment firms, we use micro-data from the 2010 Census of Industrial Production (KNBS, 2010), which includes the mining, manufacturing, and utilities sectors. We compare the spatial distribution of sales and firm locations for all firms, including those with multiple branches, to that of single-establishment firms in Table A6. Firms covered in the Census of Industrial Production overlap closely with the group of VAT-paying firms we observe in the tax records. A 1:1 mapping is not possible due to the anonymous nature of the data sets.
However, the overall number of industrial firms observed in each of the two data sources aligns closely. In 2015, we observed 4,064 VAT-paying firms[33] in mining, manufacturing, and utilities, while KNBS (2010) covered 2,252 firms five years earlier.

Of all firms involved in industrial production, 48% are located in Nairobi County, generating as much as 61% of total sales in 2010.[34] When we limit the data from the Census of Industrial Production to single establishments, the overall concentration of firm locations does not change. The concentration becomes even slightly more unequal once we consider sales instead of purely counting the number of firms. We, however, overstate the concentration of sales in Nairobi by six percentage points if multi-establishment firms are in the sample and their sales are aggregated geographically based on headquarter information only (i.e. the measure we obtain from the VAT data by default). Despite this, the discrepancy is not large enough to fully account for the higher spatial concentration observed among VAT-registered firms compared to overall economic activity.

Table A6: Geographic concentration of industrial activity

                                                    All firms             Single est. firms
                                                    Nairobi (%)   α       Nairobi (%)   α
Census of Industrial Production (2010), N = 2252
  No. firms                                         48            0.54    48            0.54
  Sales                                             61            0.32    55            0.30
Industrial firms in admin data (2015), N = 4064
  No. firms                                         64            0.50    -             -
  Sales                                             69            0.21    -             -

The columns for Nairobi report its share of the respective national aggregate figures (e.g., the share of industrial establishments located in Nairobi). α is the estimated coefficient from a county-level rank regression of each county’s rank (log) on the respective measure x (log): log rank = log A − α log x. The Census of Industrial Production was carried out by KNBS (2010).

[32] Our decomposition follows Klenow and Rodriguez-Clare (1997); Eaton et al. (2011); Panigrahi (2022).
[33] The earliest year for which the VAT records have been fully digitised is 2015.
A later Census of Industrial Production is available for 2018. However, the data set published by KNBS does not include any information on firm locations. Further, information on sales is missing for over half of the firms.
[34] The figures for Kenya are similar to the concentration of formal manufacturing firms reported by Storeygard (2016) for Tanzania. Dar es Salaam, Tanzania’s primate city, accounts for 8% of the country’s population (Storeygard, 2016), a very similar figure to Nairobi’s population share in Kenya (KNBS, 2019).

B Material for Model Setup and Estimation

B.1 Supplementary Graphs and Tables

Table B1: Linking patterns of small buyers

                    Manufacturing   Wholesale   Retail      Nairobi     Mombasa   Same county   Bigger supplier   Final demand
Small buyer         -0.023***       0.011       0.038***    -0.046***   -0.003    0.037***      0.002             0.030*
                    (0.01)          (0.01)      (0.01)      (0.01)      (0.01)    (0.01)        (0.00)            (0.02)
No. observations    892             892         892         892         892       892           892               850
R2                  0.585           0.593       0.568       0.721       0.860     0.872         0.477             0.637
Sector-county FE    ✓               ✓           ✓           ✓           ✓         ✓             ✓                 ✓

We group firms by sector, county, and size. Small firms represent the bottom sales quartile of a sector and county. We then compute the share of overall links the firm has with another sector-county-size group. We then aggregate the share for suppliers with specific characteristics (e.g. any wholesaler, irrespective of location) for each type of buyer (sector-county-size). The column titles list the characteristics of the suppliers. Finally, we regress the respective sum of shares on whether or not the buyer type is a small buyer type.

Figure B1: Degree distributions
The figure shows log-log plots of the probability density function (pdf) of firm outdegree and indegree respectively. The coefficients α shown at the bottom of the plot correspond to the power law exponent, indicating the existence of a thicker tail for the outdegree distribution.
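
The α coefficients reported in Table A6 come from a rank regression of the form log rank = log A − α log x. The sketch below illustrates that kind of regression in plain Python with an ordinary least squares fit. The function name `rank_size_regression` and the handling of ties (tied values simply receive consecutive ranks) are assumptions made for illustration; how the exponents in Figure B1 were estimated is not stated in the text, so this should not be read as the estimator behind that figure.

```python
import math
from typing import List, Tuple


def rank_size_regression(values: List[float]) -> Tuple[float, float]:
    """Fit log(rank) = log(A) - alpha * log(x) by OLS and return (alpha, log_A).

    Illustrative sketch only: values are ranked from largest (rank 1) to
    smallest, non-positive values are dropped, and ties get consecutive ranks.
    """
    positive = sorted((v for v in values if v > 0), reverse=True)
    if len(positive) < 2:
        raise ValueError("need at least two positive observations")

    xs = [math.log(v) for v in positive]
    ys = [math.log(rank) for rank in range(1, len(positive) + 1)]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all observations are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The regression is written with a minus sign, so alpha = -slope.
    return -slope, intercept


# Toy data that follow an exact rank-size rule x = 100 / rank.
alpha, log_a = rank_size_regression([100.0 / r for r in range(1, 11)])
```

On this toy input the fitted α is one up to floating-point error, since the data were constructed to satisfy the rank-size rule exactly. A smaller α, as in the sales rows of Table A6, corresponds to rank rising more slowly as the measure falls, i.e. to activity being concentrated in fewer counties.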
Figure B2: Average in- and outdegrees across space
[Panels: Average outdegree; Average indegree]
The above map plots the average in- and outdegree of firms for each sub-county. The borders of Kenya’s 47 counties, the first administrative layer, are outlined in grey.

Figure B3: County-level average in- and outdegree
The histogram plots the average in- and outdegree across firms in each county.

Figure B4: Objective function for various values r
The graph plots the sum of the squared difference between each element of the model-predicted interaction matrix π and the matrix π directly observed in the data, for various values of the parameter r ∈ [0, 1]. The figure shows that r∗ = 0.45 obtained via simulated annealing minimises the objective function.

Figure B5: Model Fit - actual and predicted outdegree distribution
The figure plots the inverse CDF for the actual and model-predicted total outdegree for each type (i.e. sector-county-size cell). The number of outdegrees is standardised. Note the log scale on both the x- and the y-axis.

B.2 Constructing Type-by-Type Matrix πmodel(θ, θ′; r)

Each iteration of the estimation requires two additional steps to construct $\pi_t$, which captures type-by-type network linkages at time t.

First, based on Bramoullé et al. (2012)’s formula, predicting $\pi_t$ requires us to compute a geometric series of matrix B. For ease of computation, we restrict this to the first five entries of the geometric series as subsequent matrix entries become negligible.

Second, note from Equation 1 that $\pi^{t}_{t_0}$ represents only the expected outdegree of types born in $t_0$ evaluated at time t. Since new firms are born in every period up until period t, we need to aggregate these matrices across all time periods leading up to t to obtain the type-by-type adjacency matrix of the entire network. The matrix of type-by-type links at time t is given by

$$\pi_t = \sum_{t_0} \left(p \cdot \pi^{t\,\prime}_{t_0}\right)'$$

where p is a column vector containing the probabilities for each type to be born.
We compute the probability that a node of a certain type is born in time $t_0$ and its expected links in time t with every other type to get $p \cdot \pi^{t}_{t_0}$. We then repeat this process to compute the probability that a node of a certain type is born in time $t_0+1$ and its expected degree in time t to get $p \cdot \pi^{t}_{t_0+1}$. We have to undertake this exercise for all time periods leading up to t. In other words, we need to compute t such matrices and add them up to obtain the type-by-type degree distribution at time t.

Computing $\pi_t$ for t = 56822 in each iteration while looping through different candidate values of r is computationally intensive. Therefore, in every iteration, we compute $\pi^{t}_{t_0}$ for 500 ‘representative’ time periods that we then aggregate to obtain $\pi_t$. We space these 500 periods equally between our first period $t_0 = 1$ and final period $t_0 = 56822$. As a result, we compute

$$\pi_t = \sum_{t_0 = 1:100:56822} \left(p \cdot \pi^{t\,\prime}_{t_0}\right)'.$$

This implies that the network is scaled down in terms of firm count. However, this approach ensures that we do not disproportionately sample from either older or younger nodes and thereby bias our results. For example, sampling from nodes born in the first 500 periods would lead us to predict the type-by-type outdegree distribution only for firms in the right tail of the firm degree distribution if the observed network happens to exhibit preferential attachment, i.e. a high degree of directed search. This stems from the fact that directed search results in older nodes having a higher chance of being more connected. This can bias our estimation of r as we will match the predicted distribution of such ‘older’ firms with all firms observed in the data. Our sampling strategy prevents this by ensuring a balanced representation of firms across different time periods and ensures that the essential features of the network formation process and network structure remain intact.
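The two computational steps of this subsection can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the series is taken to start at the identity matrix, the product of p with a cohort matrix is read as an element-wise scaling by birth probabilities, and the cohort matrix itself (which comes from Equation 1) is replaced by a hypothetical callable `cohort_matrix(t0, t)`. The helper takes the number of sampled periods as a parameter rather than hard-coding a step size.

```python
import numpy as np

def truncated_geometric_series(B, n_terms=5):
    """Return I + B + B^2 + ... + B^(n_terms - 1) (assumed starting point: I)."""
    n = B.shape[0]
    total = np.zeros((n, n))
    power = np.eye(n)
    for _ in range(n_terms):
        total += power
        power = power @ B
    return total

def representative_periods(first, last, n_periods):
    """Equally spaced birth periods between first and last (both included)."""
    return np.unique(np.linspace(first, last, n_periods).round().astype(int))

def aggregate_cohorts(p, cohort_matrix, periods, t):
    """Sum (p * pi_{t0}^t')' over the sampled birth periods t0.

    p             : 1-D array of type-level birth probabilities
    cohort_matrix : hypothetical callable (t0, t) -> n x n array
    """
    n = p.shape[0]
    total = np.zeros((n, n))
    for t0 in periods:
        pi_t0 = cohort_matrix(t0, t)
        total += (p[:, None] * pi_t0.T).T      # element-wise reading of p . pi'
    return total

# Toy inputs (hypothetical): two types and a constant cohort matrix.
def toy_cohort(t0, t):
    return np.ones((2, 2))

series = truncated_geometric_series(0.5 * np.eye(2))
periods = representative_periods(1, 56822, 500)
pi_t = aggregate_cohorts(np.array([0.25, 0.75]), toy_cohort, periods[:3], t=56822)
```

Spacing the sampled birth periods evenly over the whole range, rather than taking the first 500, is what keeps old and young cohorts equally represented in the aggregate.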
B.3 Fixed Indegree in the Network Formation Model

Recall that in our model, while a firm’s number of buyers (outdegree) evolves over time, the number of suppliers (indegree) that a firm has is fixed to m and does not change as new firms are born. This assumption can be interpreted as a reflection of a fixed production technology that the firm needs to operate. Our focus on endogenizing the outdegree distribution (i.e. the number of buyers of a firm) is also motivated by four key reasons.

First, the outdegree distribution in firm networks has been widely documented to exhibit a substantially higher degree of heterogeneity than the indegree distribution (Bacilieri et al., 2023), a fact that also replicates when we consider the spatial distribution of firm links in Figure B2 and Figure B3.

Second, supplier relationships can often be established early and remain stable over time. Consistent with this, we find empirically that firm age is more strongly correlated with outdegree (number of buyers) than with indegree (number of suppliers), suggesting that customer bases grow with time while suppliers remain relatively fixed (see Figure B6).

Third, given that informal firms are more likely to operate downstream in the supply chain, not observing them is more likely to affect the outdegree rather than the indegree of formal firms.

Finally, this assumption is also made for analytical convenience as it simplifies the characterization of the network’s steady state. This allows us to estimate the model.

We do not think that this assumption would change our key results. To see this, consider the scenario where firm indegrees were also allowed to grow over time. In that case, informal firms, which are more likely to be buyers to begin with, would accumulate more supplier links as they age. This would further increase their integration in the production network relative to formal firms that are more likely to be suppliers.
As a result, by holding indegree fixed, we likely understate the extent to which informal firms participate in the broader economy.

Figure B6: In- and outdegree by firm age
The figure plots the in- and outdegree against firm age. Firm age is truncated at 50 years as only a few firms are older, but the age distribution exhibits a long tail. 46 years corresponds to the 99th percentile.

C Material on the Revised Network

C.1 Supplementary Material for Revised Network

Figure C1: Sector-county-size probabilities and formal sector shares
The graph plots each sector-region’s formality share against the normalised difference between the baseline p(θ) and the augmented version p(θa) that takes into account informal firms. p(θ) − p(θa) is reported in terms of standard deviations. A 10 percentage point increase in formality leads to an increase of p(θ) − p(θa) by half a percentage point (0.35 standard deviations). To estimate the slope, we exclude eleven sector-county-size types which are adjusted by more than two standard deviations. All of the eleven types are Nairobi-based. Nine are large types, plus small firms in business services and construction. The slope becomes a little more than twice as steep if the five sector-county pairs are included.

Figure C2: Predicted change in type-level outdegree and formal sector shares
The figure plots sector-county formal sector shares against changes in type-level outdegree. The change in outdegree is measured as the difference between the revised network (including informal firms) and the baseline network (formal firms only), expressed relative to the baseline.

Figure C3: Model versus revised county-level network
The left and right panels show the baseline and revised county-level networks, respectively. We use the row-normalised county-level adjacency matrix to construct the plots, with arrows indicating links from suppliers to buyers. Colours indicate county groupings identified by a community detection algorithm.
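Two of the transformations behind Figures C2 and C3 are simple enough to sketch: the change in outdegree expressed relative to the baseline, and the row-normalisation of a county-level adjacency matrix. The snippet below is a minimal illustration with hypothetical function names and toy numbers. How rows without any links are treated is an assumption, since the captions do not say.

```python
import numpy as np

def relative_change(revised, baseline):
    """(revised - baseline) / baseline, element by element."""
    revised = np.asarray(revised, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    return (revised - baseline) / baseline

def row_normalise(adjacency):
    """Divide each row by its sum so that rows with links sum to one.

    Assumption: rows with no links are left as zeros.
    """
    adjacency = np.asarray(adjacency, dtype=float)
    row_sums = adjacency.sum(axis=1, keepdims=True)
    return np.divide(adjacency, row_sums,
                     out=np.zeros_like(adjacency), where=row_sums > 0)

# Toy inputs (hypothetical): two types and a 2 x 2 county matrix.
change = relative_change([12.0, 8.0], [10.0, 10.0])
shares = row_normalise([[1.0, 3.0], [0.0, 0.0]])
```

The functions do not depend on whether rows index suppliers or buyers. That orientation has to be fixed by whoever builds the matrix.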
Figure C4: Inter- and intra-county trade patterns in a revised network
[Panels: Inter-county outlinks; Intra-county outlinks]
The figure plots the ratio of the supplier-to-buyer links for the revised network relative to the baseline for each county, distinguishing between trade links between counties (inter) and within counties (intra). To the left of the dotted line, at a value of one, are counties that experienced a decline in outlinks in the respective type of trade linkages.

Table C1: County-level changes in the dispersion of outdegrees - alternative scenarios

Scenario                                                   ∆ sd/mean (in %)
Informal firms ≈ small firms
  Default                                all counties       -7.5
                                         w/o NBO & MSA     -18.0
  p(θ) using employ. share only          all counties       -5.6
                                         w/o NBO & MSA     -11.8
Alternative linking patterns for informal firms
  0% sales, 50% input to/from formal     all counties      -16.0
                                         w/o NBO & MSA     -33.5
  0% sales, 75% input to/from formal     all counties      -11.1
                                         w/o NBO & MSA     -25.7
  25% sales, 50% input to/from formal    all counties      -16.1
                                         w/o NBO & MSA     -33.7

The above table reports the difference in outdegrees between the original and the revised network, aggregated at the county level. We look at the coefficient of variation as the key metric. Adjusting for the mean accounts for the fact that the change in the number of outlinks predicted by the model needs to be interpreted in relative rather than absolute terms. We exclude the outdegrees of Nairobi and Mombasa when we compute the coefficient of variation in every other row. The first two scenarios assume similar linking patterns for informal and small formal firms, conditional on their sector and county. In the second scenario, we use a simple version of the updated entry probabilities p(θ) that does not account for differences in firm size across sectors and locations. Scenarios three to five rely on the default assumptions on how to update p(θ) to incorporate informal firms.
Instead, assumptions about p(θ, θ′) are modified: Scenarios three and four assume that informal firms do not sell to formal firms at all, but buy 50% or 75% of their inputs from the formal sector, respectively. Scenario five maintains the assumption that 50% of inputs are sourced from the formal sector and further allows 25% of the informal firms’ sales to go to formal firms.

Table C2: Social connectedness, travel time, migration and county-by-county-links

Dependent variable: Outlinks_ij
                                  Baseline    Revised     Baseline    Revised
Travel time_ij (log)              -0.010***   -0.009***   -0.011***   -0.009***
                                  (0.004)     (0.002)     (0.004)     (0.002)
Social connectedness_ij (log)     0.006***    0.014***    0.007***    0.014***
                                  (0.002)     (0.001)     (0.002)     (0.001)
Migration_ij                      0.530***    0.176
                                  (0.187)     (0.181)
Migration_ji                      0.114***    0.158*
                                  (0.029)     (0.096)
No. observations                  2,124       2,124       2,124       2,124
R2                                0.904       0.351       0.901       0.351
Origin FE                         ✓           ✓           ✓           ✓
Destination FE                    ✓           ✓           ✓           ✓

We regress the matrix of county-by-county outlinks, more precisely the share of inputs a given county purchases from another county, on social connectedness, travel time (in hours), and the number of migrants (in millions) between the two counties. Standard errors are clustered at the level of the origin-destination dyad. Social connectedness captures the probability of two random individuals being friends on a popular social media platform (Bailey et al., 2021), conditional on their present location.

Table C3: County-level changes in outdegree and county characteristics

Dependent variable: Outlinks counterfactual / outlinks baseline
Formal sector share               -3.525***   -1.569      0.561       0.839
                                  (1.303)     (1.711)     (2.293)     (2.244)
Population (log)                              -0.365*     0.323       0.871
                                              (0.213)     (0.542)     (0.613)
Gross County Product (log)                                -0.545      -1.063**
                                                          (0.395)     (0.485)
Market access (distance, log)                                         0.330*
                                                                      (0.187)
No. observations                  47          47          47          47
R2                                0.140       0.194       0.228       0.281

We regress the county-level change in outdegrees on various county characteristics including the formal sector share, the Gross County Product, and market access.
We weight observations by Gross County Product.

Table C4: Differences in simulated output reduction for revised network with informal firms versus baseline network with formal firms only

                                      Domestic output shocks              Import shocks
                                      (1)         (2)         (3)         (4)        (5)        (6)
Buyer sector-county formal            -4.948***   -4.803***   -4.066***   0.115***   0.238***   0.053
  employment share                    (0.931)     (1.458)     (1.001)     (0.032)    (0.052)    (0.034)
No. observations                      431         431         431         431        431        431
Sector FE                             -           ✓           -           -          ✓          -
County FE                             -           -           ✓           -          -          ✓

The outcome of interest measures the ratio of the impact response to an adverse shock if we account for informal firms versus relying only on the administrative data. The ratio is larger than one if we underestimate the impact of the shock and smaller than one if we overestimate it by not accounting for informality. The above table shows the results from regressing this change in output reduction at the sector-county level on the sector-county formal sector share.

Figure C5: The ratio of ∆ output (revised network) and ∆ output (baseline network)
[Panels: Domestic output shocks; Import shocks]
We plot the ratio of the impact response to an adverse shock if we account for informal firms versus relying only on the administrative data, for the domestic and trade shocks respectively. The ratio is larger than one if we underestimate the impact of the shock and smaller than one if we overestimate it, if we do not account for informality. We aggregate the impact of the shock at the sector-county level where we weight the impact for each size and formality type using its entry probability p(θ).
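The aggregation described in the caption of Figure C5 can be illustrated with a short sketch. It assumes that type-level impacts are combined into a weighted average per sector-county cell, with entry probabilities p(θ) as weights, and that the plotted ratio is the revised-network average divided by the baseline-network average. The function names, the record layout, and the numbers are hypothetical.

```python
from collections import defaultdict

def weighted_cell_average(records):
    """records: iterable of (cell, weight, value) -> {cell: weighted mean}."""
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for cell, weight, value in records:
        numerator[cell] += weight * value
        denominator[cell] += weight
    return {cell: numerator[cell] / denominator[cell]
            for cell in numerator if denominator[cell] > 0}

def impact_ratio(revised_records, baseline_records):
    """Ratio of revised to baseline impact for cells present in both."""
    revised = weighted_cell_average(revised_records)
    baseline = weighted_cell_average(baseline_records)
    return {cell: revised[cell] / baseline[cell]
            for cell in revised
            if cell in baseline and baseline[cell] != 0}

# Toy data (hypothetical): one sector-county cell, weights are p(theta).
cell = ("retail", "Kisumu")
baseline_records = [(cell, 0.2, -1.0), (cell, 0.1, -2.0)]
revised_records = [(cell, 0.5, -3.0), (cell, 0.2, -1.5), (cell, 0.1, -2.0)]
ratios = impact_ratio(revised_records, baseline_records)
```

In this toy case the ratio is above one, which in the caption's terms means the baseline network would understate the impact of the shock for that cell.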
Figure C6: % change in output drops
Scenario: no sales of informal firms to and 50% inputs sourced from the formal sector
[Panels: Domestic output shocks; Import shocks]
The above graphs plot the percentage change of the output reduction in response to domestic and international output shocks for two scenarios: the baseline network using only administrative data and the revised network assuming that informal firms do not sell to the formal sector and source 50% of their inputs from formal firms. We aggregate the output reduction across buyer types at the sector and county level, weighted by the entry probability p(θ) for each size and formality type. The x-axis shows the formal sector share for each sector-region pair.

C.2 Updating Entry Probabilities for Scenarios that Distinguish Between Small Formal and Informal Firms

To differentiate informal firms from small formal firms, we simply split the first term in Equation 4 into two components: self-employment in the formal sector (small formal firms) and employment in the informal sector (informal firms). As outlined in Section 6.1, our proxy for informal firms’ linking patterns still relies to some extent on small formal firms’ connections to capture the sector-region composition of formal-informal linkages. Consequently, we can only introduce informal types where corresponding small formal types exist in the administrative data. This approach excludes 53 potential types, representing sector-county cells that account for 9% of private sector employment (excluding agriculture and non-market services). We do not introduce informal types for the agricultural and non-market service sectors. The resulting classification yields 1,376 distinct firm types: 419 informal, 493 small formal, and 464 large formal.
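The selection rule for informal types described above amounts to a set intersection. The sketch below is illustrative only: it assumes that each cell is identified by a (sector, county) pair, and the function name, sector labels, and toy cells are hypothetical.

```python
def select_informal_types(candidate_cells, small_formal_cells, excluded_sectors):
    """Keep candidate informal cells that have a small formal counterpart
    and do not belong to an excluded sector.

    Cells are (sector, county) tuples. Returns (kept, dropped) as sorted lists.
    """
    small_formal = set(small_formal_cells)
    kept, dropped = [], []
    for cell in sorted(set(candidate_cells)):
        sector, _county = cell
        if sector in excluded_sectors or cell not in small_formal:
            dropped.append(cell)
        else:
            kept.append(cell)
    return kept, dropped

# Toy data (hypothetical cells).
candidates = [("retail", "Nairobi"), ("retail", "Turkana"),
              ("agriculture", "Kisumu"), ("construction", "Kisumu")]
small_formal = [("retail", "Nairobi"), ("construction", "Kisumu"),
                ("agriculture", "Kisumu")]
kept, dropped = select_informal_types(candidates, small_formal,
                                      excluded_sectors={"agriculture"})
```

The type counts reported in the text are internally consistent: 419 + 493 + 464 = 1,376.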