Policy Research Working Paper                       10932




  Spatial Inequality and Informality in Kenya’s
                 Firm Network
                                Verena Wiedemann
                                 Benard K. Kirui
                                Vatsal Khandelwal
                                 Peter W. Chacha




International Finance Corporation
September 2024


  Abstract
 The spatial configuration of domestic supply chains plays a crucial role in the transmission of shocks.
 This paper investigates the representativeness of formal sector firm-to-firm trade data in capturing
 domestic trade patterns in Kenya, a context with high informality. It documents stylized facts showing
 that formal sector trade exhibits a distinct spatial concentration relative to overall economic activity.
 Linking transaction-level data with data on informal economic activity, the paper estimates a structural
 model and predicts a revised network incorporating informal firms. Findings show that formal sector trade
 flows underaccount for trade within regions and across regions with stronger social ties. The higher the
 incidence of informality, the more one underestimates vulnerability to domestic shocks and overestimates
 exposure to import shocks.




 This paper is a product of the International Finance Corporation. It is part of a larger effort by the World Bank Group to
 provide open access to its research and make a contribution to development policy discussions around the world. Policy
 Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted
 at vwiedemann@ifc.org.




         The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development
         issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the
         names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those
         of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and
         its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.


                                                       Produced by the Research Support Team
          Spatial Inequality and Informality in Firm Networks∗

         Verena Wiedemann†               Benard Kipyegon Kirui‡               Vatsal Khandelwal§
                                         Peter Wankuru Chacha¶




  Originally published in the Policy Research Working Paper Series in September 2024. This version
  was updated in August 2025.
  To obtain the originally published version, please email prwp@worldbank.org.




       Keywords: firm networks, informality, spatial inequality, economic development.
       JEL classification: O11, O17, R12, D22, D85, E26.




   ∗
     We thank the Kenya Revenue Authority (KRA) for the outstanding collaboration. Romeo Ekirapa, Simon
Mwangi, and Benard Sang provided excellent technical support and advice. We thank Elizabeth Gatwiri and
Daniela Villacreces Villacis for excellent research assistance. We thank Andrea Bacilieri, Raphael Bradenbrink,
Banu Demir Pakel, Kevin Donovan, Douglas Gollin, Justice Tei Mensah, Luke Heath Milsom, Sanghamitra Warrier
Mukherjee, Solomon Owusu, Piyush Panigrahi, Nina Pavcnik, Simon Quinn, Alexander Teytelboym, Christopher
Woodruff, Hannah Zillessen and participants of seminars and conferences at Oxford, CSAE, STEG, the World
Bank’s DaTax group, IFC, KIPPRA, Montreal, the Bank of Italy/OECD, and Jeune Street for comments and
feedback. We gratefully acknowledge financial support from the Private Enterprise Development in Low Income
Countries (PEDL) and the Structural Transformation and Economic Growth (STEG) programmes, both of which
are joint initiatives by the Centre for Economic Policy Research (CEPR) and the UK Foreign, Commonwealth &
Development Office (FCDO). Verena Wiedemann further acknowledges funding from the Oxford Economic Papers
Research Fund (OEP) and the German National Academic Scholarship Foundation. This study has been approved
by the Department of Economics Research Ethics Committee at Oxford (protocol no. ECONCIA20-21-23), and
the Kenyan National Commission for Science, Technology and Innovation (protocol no. NACOSTI/P/20/5923).
The views in this paper are those of the authors, and do not necessarily represent those of the KRA or any other
institution the authors are affiliated with.
   †
     Economic Research Unit, International Finance Corporation (World Bank Group)
   ‡
     Privatization Commission Kenya
   §
     Department of Economics, University of Exeter
   ¶
     International Monetary Fund
1       Introduction

Limited opportunities for export-led growth and concerns over the unequal allocation of gains
from trade have led policymakers and researchers to shift attention towards domestic supply
chains and market integration to enhance economic development (Topalova, 2010; Atkin and
Donaldson, 2015; Bustos et al., 2020; Grant and Startz, 2022; Goldberg and Reed, 2023). Pro-
gress in understanding the structure of domestic supply chains has been facilitated by the in-
creasing availability of granular transaction-level firm network data, which are often sourced
from tax records (see e.g. Panigrahi, 2022; Adão et al., 2022; Alfaro-Ureña et al., 2022; Boken
et al., 2023).1 These developments reflect a broader trend in the literature on development
and structural transformation, where non-traditional novel micro-data such as credit registries,
smartphone data, and matched employer-employee data enable new insights into classic questions
(Bustos et al., 2020; Blanchard et al., 2021; Dix-Carneiro et al., 2024; Kreindler and Miyauchi,
2023). However, a potential drawback is that the data-generating process underlying such ad-
ministrative records is often skewed toward particular segments of the economy, e.g. taxpayers
in the case of firm-to-firm transaction data. This can leave us with a potentially incomplete
view of economic activity due to the size and significance of the informal sector in many
low- and middle-income economies.

In this paper, we ask how observing only a selected segment of the economy, in our case, formal
firms, might bias the inferences drawn from data on firm-to-firm trade networks constructed
using tax records. Accounting for this bias is important because the structure of these networks
can shape the aggregate and distributional impacts of infrastructure investments, industrial
policies, and financial shocks (see e.g. Acemoglu et al., 2012; Liu, 2019; Demir et al., 2022;
Balboni et al., 2023; Castro-Vincenzi et al., 2024; Demir et al., 2024; Huremovic et al., 2024;
Donaldson, 2025) and inform tax policy and the design of social protection programs.2 Formal-
sector data may not be representative of the overall spatial distribution of trade, especially if
formal firms are concentrated in specific sectors and regions and exhibit distinct linking patterns.
Given the scarcity of granular data on the informal sector, we lack empirical evidence on the
nature and degree of the bias caused by neglecting informality.

We address this question by combining transaction-level administrative tax records of over 76,000
formal firms and 5.8 million firm-to-firm relationships in Kenya, with micro and aggregate data
on informal sector activity obtained from labor force surveys and national accounts.3 We first
    1
      Transaction-level survey data (see e.g. Startz, 2021) or administrative industry-specific data (see e.g. Hansman
et al., 2020), which often cover a wider range of firm and buyer-supplier characteristics, are a popular, albeit costly,
complement to data relying on tax records.
    2
      An advantage of transaction-level firm-to-firm trade data over traditional sources like input-output tables is
the possibility to track spatial heterogeneity in economic activity, rather than being limited to national aggregates.
    3
      The labor force survey module was integrated in the 2019 population census. The public micro data covers



establish a series of stylized facts on both the formal and the informal sector which suggest
that formal sector data is not representative of overall economic activity. We then estimate a
structural model to construct a revised firm network that accounts for informality. Overall, we
find that formal sector data overstates the importance of urban hubs and underestimates intra-
regional trade. As a result, firm-to-firm data on the formal sector overestimates the degree of
spatial inequality in trade flows, leading to an underestimation of the impact of domestic output
shocks and an overestimation of the impact of import shocks in sectors and regions with high
informality. These results are robust to varying assumptions about how informal firms interact
with the rest of the economy. Our findings highlight the value of complementing administrative
firm-to-firm data with auxiliary data on informality to improve estimates of aggregate effects of
economic shocks.

The question of how well formal sector trade patterns represent overall economic activity is
relevant for many countries with high informality, several of which have accessible VAT data
(see Figure 1). The Kenyan context is particularly well-suited to answering this question due
to the relevance of the informal sector and the availability of unusually granular data on the
sectoral and geographic composition of informal economic activity. VAT-paying firms account
for only 36% of Kenya’s GDP.4 At the same time, high-quality subnational data allow us to
observe both formal and informal activity across sectors and regions.

                    Figure 1: Country-level informality rates and income levels
Left panel: Output informality (% of GDP). Right panel: Self-employment (% of total employment).




Each marker represents a country. Data on output informality and self-employment are taken from Elgin et al.
(2021). We highlight countries for which, to our knowledge, at least one working paper or peer-reviewed article
using VAT data is available.


We begin by assessing whether patterns of domestic trade observed for the formal sector arise due
to the systematic selection of firms into the administrative data or reflect the underlying structure
of the economy. First, we show that formal firms exhibit distinct spatial patterns compared to
overall economic activity. Trade among formal firms is more spatially concentrated compared
10% of the overall population and is hence representative at a very granular level.
   4
     As we will show later on, the public sector contributes another third to GDP. Any residual output can be
attributed to the informal sector.


to other indicators, including population, GDP, employment in micro and small enterprises, the
distribution of registered firms, and patterns of internal migration. This spatial concentration
is driven by inequality along the extensive margins of the firm network, i.e. the location of firms
and trading relationships, rather than transaction volumes. Notably, formal firms maintain twice
as many links to buyers and suppliers in urban hubs as they do to formal firms within their
own counties, indicating that within-county trade is less common than inter-county trade with
firms in Nairobi. Following this, we show that informality is not evenly distributed across space;
rather, its incidence varies systematically across sectors, regions, and firm position in the supply
chain. Informal firms are more likely to be located downstream of large formal firms, and
informality negatively correlates with regional economic size and income. As a result, we expect
that accounting for the informal sector can systematically alter the structure of the observed
firm network rather than introducing random measurement error. As we do not directly observe
trade flows between informal firms, we rely on a structural model to recover a network that
accounts for informal activity.

We introduce and estimate a network formation model with heterogeneous firm types following
Bramoullé et al. (2012) to predict a revised network. In our adaptation of the model, we classify
firms based on their sector, location, and size, due to the substantial heterogeneity along these
three dimensions documented in the empirical section. The key advantage of the model is that
it allows for a flexible network formation process that accounts for heterogeneous linking pref-
erences along the above dimensions. Moreover, it allows firms to form links both independently
(undirected search) and through the existing connections of their suppliers (directed search).
The model then provides a unique steady-state prediction for the number of links between
firms of different sector-location-size types. We first estimate the model to predict the Kenyan
firm network as observed. We find that new firms choose 45% of their suppliers through undir-
ected search, conditional on their bias, and the remaining 55% of suppliers are found via existing
suppliers.5 The model-predicted network not only fits the empirical distribution of outlinks (i.e.
the number of buyers) well, but also performs well with respect to untargeted moments, such as
the share of links accounted for by Nairobi- and Mombasa-based firms.
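The two search channels can be illustrated with a deliberately simplified simulation. This sketch is not the estimated model: it omits firm types (sector, location, size) and the type-specific linking biases, and the function names, the seed network, and the default parameters are assumptions made purely for illustration. It only shows how an entrant might split its supplier choices between undirected search (random draws among incumbents) and directed search (suppliers of the suppliers it has already found).

```python
import random


def simulate_network(n_firms, n_suppliers, share_undirected, seed=0):
    """Grow a supplier network one entrant at a time (illustrative only).

    Each entrant targets `n_suppliers` suppliers. A share of them is found
    through undirected search (uniform draws among incumbents); the rest is
    found through directed search among the suppliers of those suppliers.
    """
    rng = random.Random(seed)
    # firm id -> set of supplier ids; a two-firm seed network is assumed.
    suppliers = {0: set(), 1: {0}}
    for entrant in range(2, n_firms):
        incumbents = list(range(entrant))
        k = min(n_suppliers, len(incumbents))
        n_undirected = max(1, round(share_undirected * k))
        # Undirected search: random incumbents.
        chosen = set(rng.sample(incumbents, n_undirected))
        # Directed search: suppliers of the suppliers found so far.
        candidates = sorted({s for c in chosen for s in suppliers[c]} - chosen)
        rng.shuffle(candidates)
        chosen.update(candidates[: k - len(chosen)])
        suppliers[entrant] = chosen
    return suppliers


def outlinks(suppliers):
    """Count the number of buyers (outlinks) of every firm."""
    counts = {firm: 0 for firm in suppliers}
    for sups in suppliers.values():
        for s in sups:
            counts[s] += 1
    return counts


if __name__ == "__main__":
    network = simulate_network(n_firms=1000, n_suppliers=5, share_undirected=0.45)
    degree = outlinks(network)
    print("max outlinks:", max(degree.values()))
```

Because directed search draws from the suppliers of existing suppliers, firms that already have many buyers are more likely to be found again, which is one reason such models can generate a skewed distribution of outlinks.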

Using the estimated model, we then predict a revised network that accounts for informal firms
by combining the model with granular real-world data on the sectoral and regional composition
of the informal sector. Updating the entry probabilities of firms by sector, location, and size
requires data on the spatial and sectoral dispersion of economic activity, for which we draw on
microdata from the labour force module of the 2019 population census. We initially assume that
   5
    In comparison, Chaney (2014) finds that only 40% of all relationships of French exporters with international
trade partners are formed via directed search. Our estimate of 55% of links being formed as a result of directed
search could suggest that linking frictions are potentially even more binding for firms in Kenya’s domestic firm
network.



informal firms, conditional on sector and geography, exhibit linking patterns similar to those
of the smallest quartile of formal firms in the administrative data. This assumption addresses
concerns that informal firms might have different linking patterns compared to larger formal
firms operating in the same sector and location (e.g. due to internal economies of scale (Grant
and Startz, 2022)).6 Nevertheless, informal firms might encounter additional obstacles specific
to informality.7 To account for this, we implement additional sensitivity checks that account for
alternative scenarios with lower linking probabilities between the formal and informal sector.

We use the revised network to answer the question of interest: How do spatial patterns of trade
change when informal firms are accounted for? First, we find that sectors and regions with the
highest levels of informality have more outlinks in the revised network relative to the baseline
network. The spatial inequality in outlinks declines by 7% and the prominence of urban hubs
like Nairobi and Mombasa decreases by 5 percentage points. We show that while this decrease in
inequality of outlinks is driven by an increase in both inter-county and intra-county trade, intra-
county trade rises by a larger margin. Moreover, we find that once informal firms are accounted
for, the number of trade relationships between counties is more sensitive to the strength of social
ties between them, measured using both online friendship networks and migration inflows.

Next, we simulate the pass-through of domestic and import shocks using both the baseline
network estimated from the structural model and the revised network that incorporates informal
firms. We find that ignoring informality leads us to underestimate the average impact of domestic
shocks and overestimate the average impact of import shocks. In particular, when using the
revised network, we find that domestic output shocks have a more pronounced negative effect
on sector-regions with higher levels of informality, compared to results based on formal sector
data alone. Our results suggest that a 1 percentage point decrease in the formal sector share
corresponds to an underestimation of the reduction in output by 5 percentage points. By
contrast, for import shocks, the bias goes in the opposite direction. The economy appears less
exposed to import shocks once the informal sector is taken into account. This discrepancy arises
because import shocks primarily affect larger formal firms, which carry less weight in the overall
firm network once informality is incorporated.

Finally, we consider the sensitivity of our results with respect to alternative assumptions about
the linking patterns of informal firms. We conduct bounding exercises where we further restrict
the degree of integration of informal firms with the formal sector relative to small formal firms.
Drawing on survey evidence and the existing literature (Böhme and Thiele, 2014; Gadenne
  6
     We provide evidence showing that the linking patterns of small formal firms are similar to what we would
expect from informal firms – they link more locally and buy more from intermediaries relative to their larger
peers.
   7
     This can include wedges introduced by the VAT system itself (De Paula and Scheinkman, 2010; Gadenne
et al., 2022).


et al., 2022), we assume informal firms sell no output to the formal sector and source a smaller
proportion of their inputs from formal firms compared to small formal firms. We find that this
further reduces spatial inequality in outlinks and the prominence of urban hubs relative to the
formal network. Moreover, we continue to underestimate the impact of domestic shocks while
overestimating the effects of import shocks.

Our paper contributes to the literature on macroeconomic development, informality, firm net-
works, and spatial inequality.

First, we contribute to a growing body of research at the intersection of trade and macroe-
conomic development that integrates granular administrative data such as employer-employee
records and data from credit registries, with broader data sources like population censuses to
achieve a more accurate assessment of aggregate economic outcomes. To date, this literature
has primarily focused on employment outcomes, sector shares (see e.g. Albert et al., 2021), and
consumption (see e.g. Fan et al., 2023), where informal activity is somewhat more observable.
However, informal activity along supply chains remains particularly elusive (Böhme and Thiele,
2014; Atkin and Khandelwal, 2020).8 Our results highlight the implications of the non-random
selection of firms into administrative records. This is particularly important, given the growing
reliance on such data in the literature (Donaldson, 2025).

Our approach of employing a structural model to bridge gaps in our understanding of informal
firm dynamics also aligns with the recent literature in this field (see e.g. Ulyssea, 2018; Dix-
Carneiro et al., 2024). Unlike related studies that focus on firm and worker-level dynamics,
we do not model the endogenous response of firms and workers to simulated shocks. Crucially,
however, our research design allows us to examine the role of informality for Kenya’s region-
level input-output matrix. This is particularly relevant for research that seeks to complement
predictions about aggregate national welfare with welfare estimates at the regional level to study
geographic heterogeneity in the impact of international trade (Topalova, 2010; Arkolakis et al.,
2023), infrastructure investments (Demir et al., 2024) or climate and weather shocks (Albert
et al., 2021; Castro-Vincenzi et al., 2024).

Second, we contribute to the literature on spatial production networks (Bernard et al., 2019;
Panigrahi, 2022; Miyauchi, 2024), shock propagation in firm networks (see e.g. Baqaee and Farhi,
2019; Carvalho et al., 2021; Arkolakis et al., 2023; Chacha et al., 2024), and urban primacy
(Jefferson, 1939, as published in Jefferson, 1989; Memon, 1976; Ades and Glaeser, 1995; Soo,
2005). We analyze the spatial distribution of formal firms in an economy with a large informal
   8
    A related literature in public finance studies why informality arises along supply chains, and how tax policy
can alter its incidence (De Paula and Scheinkman, 2010; Zhou, 2022; Gadenne et al., 2022; Almunia et al., 2023).
Relative to this literature, we focus on reconstructing a more complete network that includes informal firms rather
than studying the marginal firm’s decision to formalize.


sector and demonstrate that ignoring informality can lead to overestimating spatial inequality
in firm-to-firm trade and the extent of urban primacy. This oversight may cause researchers
to underestimate the economic connectedness and vulnerability of smaller regions. Our finding
that formal sector activity is disproportionately concentrated in urban hubs is also consistent
with spatial patterns in other contexts that exhibit a similar formal-core, informal-periphery
structure (Zárate, 2022). In complementary work, Bacilieri et al. (2023) examine how varying
reporting thresholds for firm-to-firm transactions affect the comparability of aggregate network
statistics. While they demonstrate that missing data can bias network statistics, our focus
differs in two key ways. First, transaction reporting thresholds are not a concern in our setting;
we instead address other sources of network incompleteness in tax data. Second, while they
analyze aggregate statistics, we examine distributional implications at the regional level – a
margin where we find informality plays a substantial role.

Finally, we contribute to the growing literature on estimating network statistics and
reconstructing networks with missing data (e.g. Chandrasekhar, 2016; Mungo et al., 2023).9 We
contribute to this literature by proposing the use of a structural approach. The structural route
seeks to tackle two challenges: first, nodes are missing in a non-random manner. Second, we do
not observe network characteristics for missing nodes, but linking preferences of missing firms
can systematically differ from those of observed firms. To address these challenges, we combine
multiple data sources and estimate a network formation model that recovers both the prevalence
of missing firms and their linking behavior, accounting for sectoral and geographic preferences.
In comparison, Mungo et al. (2023), for example, use a link prediction algorithm to predict the
existence of links among missing nodes using data on the characteristics and linking behavior
of non-missing nodes. This approach is challenging to apply in the context of informality, as
informal firms are missing in a non-random manner and information on their linkages is not
available at a sufficient granularity to recover a network with sectoral and spatial heterogeneity.10

The paper is organized as follows: Section 2 describes our data. Section 3 examines how repres-
entative the trade patterns captured in the administrative data are of overall economic activity
and discusses the role of the informal sector. We present and estimate a network formation
model in Section 4. Section 5 describes the patterns we observe in our revised network that now
incorporates informal firms, while Section 6 analyzes the sensitivity of these results to alternative
assumptions.
    9
      See Lü and Zhou (2011) for a survey of algorithms used for link prediction, a technique commonly applied in
network reconstruction.
   10
      Rare instances where researchers have used survey data to collect information on firm-to-firm links of informal
firms focus on specific sub-sectors. Moreover, explicitly prompting for the tax status of trade partners adds an
additional challenge due to the sensitive nature of this information.




2        Data Description

2.1       Administrative Data

Our analysis of the formal sector draws on micro data from value-added tax (VAT) and pay-as-you-earn
(PAYE) tax returns, and on tax registration forms collected by the Kenya Revenue Authority. The
tax registration forms provide self-reported information on each firm’s 4-digit sector classific-
ation and headquarter location. The VAT returns include details on firm-to-firm transactions
between VAT-registered firms. Sales and purchases with non-registered parties (e.g. tax ex-
empt parties, non-VAT-registered businesses, and final consumers) are recorded as an aggregate
monthly figure. VAT applies to firms with an annual turnover of KShs five million and above
($38,400 as of May 2024). Once a firm is VAT-registered and has crossed the threshold of KShs
five million, it is required to continue filing VAT returns in years with lower turnover.

We solely focus on entities that identify as private companies or partnerships in their tax-
registration form. We also restrict our analysis to firms with annual purchases greater than zero
and annual sales of KShs five million or more in at least one year between 2015 and 2022.11
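The sample restriction can be expressed as a simple filter over firm-year records. The sketch below is only an illustration: the record layout and the field names (`firm_id`, `year`, `sales`, `purchases`) are hypothetical, and it assumes that both conditions must hold in the same year, which the text does not state explicitly.

```python
def restrict_sample(records, threshold=5_000_000, first_year=2015, last_year=2022):
    """Return the ids of firms that pass the sample restriction.

    A firm is kept if, in at least one year of the window, it reports
    purchases above zero and sales of at least `threshold` (in KShs).
    `records` is an iterable of dicts with hypothetical keys
    'firm_id', 'year', 'sales', and 'purchases'.
    """
    keep = set()
    for r in records:
        in_window = first_year <= r["year"] <= last_year
        if in_window and r["purchases"] > 0 and r["sales"] >= threshold:
            keep.add(r["firm_id"])
    return keep


if __name__ == "__main__":
    example = [
        {"firm_id": "A", "year": 2016, "sales": 6_000_000, "purchases": 1_000_000},
        {"firm_id": "B", "year": 2018, "sales": 9_000_000, "purchases": 0},
        {"firm_id": "C", "year": 2014, "sales": 8_000_000, "purchases": 500_000},
    ]
    print(restrict_sample(example))
```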

              Figure 2: Location of formal firms and informal sector employment shares
Left map: Number of formal firms per km² (based on tax records).
Right map: Informal sector share of overall private sector (self-)employment (based on population census).




The left map shows the density of firm headquarter locations at the subcounty level, i.e. the number of firms per
km². The right map shows informally employed people as a share of the local labour force, at the
subcounty level. Subcounties represent the second administrative layer. The borders of Kenya’s 47 counties are
outlined in grey.


    11
     This ensures that firms that registered for VAT to bid for tenders but were never operational are not included
in the analysis.



Figure A1 plots the sector composition and the respective sales and purchase channels of firms
covered in these administrative records. Manufacturing and wholesale and retail firms together
account for almost half of the sales in the tax records. The left map in Figure 2 shows the
geographical dispersion of formal firm headquarters, many of which are concentrated
in Nairobi and Mombasa.

Figure A2 plots the share of total observed firm-to-firm links attributed to different groups
of firms, distinguishing by size, sector, and location. Panel A shows that aside from retail,
wholesale, and manufacturing, business services account for a substantive proportion of supplier
linkages. Mining firms account for the smallest number of both supplier and buyer links. Panel B
shows that firms in Nairobi and Mombasa account for more than three quarters of all buyer and
supplier links. Panel C shows that firms in the top sales quartile account for a disproportionately
large share of total links, especially as suppliers. Aside from the statistics plotted in Figure A2,
61% of total links are formed among firms located in the same county, 17% among firms
within the same sector, and 45% among firms in the same sales quartile.
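Shares of this kind can be computed directly from a list of buyer-supplier pairs and a mapping from firms to a characteristic. The following is a minimal sketch under assumed data structures: the firm identifiers, the link list, and the attribute mapping are all hypothetical, and each observed link is counted once.

```python
def same_group_share(links, attribute):
    """Share of links whose two endpoints share the same attribute value.

    `links` is a list of (buyer, supplier) pairs and `attribute` maps each
    firm id to a group such as its county, sector, or sales quartile.
    """
    if not links:
        raise ValueError("links must not be empty")
    same = sum(1 for buyer, supplier in links if attribute[buyer] == attribute[supplier])
    return same / len(links)


if __name__ == "__main__":
    links = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
    county = {"a": "Nairobi", "b": "Nairobi", "c": "Mombasa", "d": "Mombasa"}
    print(same_group_share(links, county))
```

Passing a different mapping (sector or sales quartile instead of county) yields the corresponding within-group share for the same set of links.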


2.2     Data on Informal Activity

Informality in firm-to-firm transaction data can either arise because firms themselves are not
registered (extensive margin) or because transactions of registered firms are not recorded (in-
tensive margin). Not being registered as a firm or not recording a transaction in turn can either
be the result of non-compliance or because of exemptions (e.g. for small firms).12

To obtain an updated distribution of overall (formal and informal) economic activity that we
can compare to the administrative data, we draw on a series of data sources listed in Table
1. Importantly, we not only measure informality as the gap between overall economic activity
and what is captured in the administrative data, but additionally rely on measures generated
independently of the administrative records. This helps us to rule out the possibility that our
measures of informality capture idiosyncratic patterns that are specific to the VAT system.

Throughout this paper, we draw on three types of measures for economic activity: employment
figures, the number of firms, and value added (i.e. the difference between sales and purchases).
Our preferred measure of informality uses formal sector employment as a share of total employ-
ment, drawing on a comprehensive labor force module in the 2019 population census (KNBS,
2019). Alternative measures of informality based on the number of firms rely on estimates of the
universe of businesses in KNBS (2016),13 while a value-added based measure utilizes estimates
  12
     See Appendix A.2 for a detailed discussion of these margins and why administrative data may not fully
capture economic activity.
  13
     KNBS (2016) obtains information on the number of licensed businesses from county governments and estimates
the number of unlicensed businesses based on household survey data.



                                    Table 1: Overview of data sources

 Source                                                   Year    Aggregation          Key indicators
 Population & housing census (census)                     2019    Sector and county    Formal & informal employment
 Gross County Product (GCP)                               2019    Sector and county    Gross County Product
 Census of establishments (CoE)                           2017    Sector or county     Number of formal sector establishments
 Micro, small & medium sized enterprises survey (MSMEs)   2016    Firm-level           Main input source and buyer
 Census of industrial production                          2010    Sector and county    Sales of multi-establishment firms
All data are collected and published by the Kenya National Bureau of Statistics. Sources: 2019 Kenya Population
& Housing Census (KNBS, 2019); Gross County Product (KNBS, 2022); Census of Establishments (KNBS, 2017);
Micro, Small & Medium-Sized Enterprises Survey (KNBS, 2016); Census of Industrial Production 2010 (KNBS, 2010).


of the regional economic size captured by the Gross County Product (KNBS, 2022).14

The employment-based measure, which later serves as a key input for predicting the revised
network with informal firms, offers two distinct advantages. First, it enables joint disaggregation
of informal activity by sector and region. Second, it allows us to distinguish between private and
public sector employment, a distinction that is unavailable in the alternative measures and that lets
us rule out that this measure of informality captures a proportion of public sector activity.
The graph on the right in Figure 2 shows the geographical dispersion of informal activity as per
the employment-based measure derived from the labor force module of the 2019 census. The
measure correlates strongly (ρ= 0.83, Table A3) with a measure of regional formal sector shares
that relies on the administrative data to capture the size of the formal sector (see Figure A9).

Finally, we exclude agricultural and non-market service sectors from our informality measures
where possible, as their tax records only cover a small and very specific sub-population of firms
and employees. In the case of agricultural firms, the administrative data only capture large-scale
commercial agriculture, which is often primarily export-oriented (see Figure A1 and Chacha et al.
(2024)). Non-market services are dominated by non-profit organizations and the government
with only a few for-profit VAT firms.

Data on linking patterns of informal firms: In addition to the above, we use data from a
survey with small and medium size enterprises (KNBS, 2016) to derive some insights into the
sales and purchase patterns of the informal sector. The survey data only record the main type
of buyer and supplier of a firm and hence cannot be directly used to reconstruct a network with
informal firms. However, we will use these data to inform the assumptions of our model.


2.3     Size of the Informal Sector

To assess how much of economic activity is generated by VAT-reporting firms, we first compare
their value added with Kenya’s national accounts. We find that VAT-reporting firms account
  14
     To estimate the Gross County Product, KNBS (2022) relies on a series of data sets including the 2016 MSME
survey and the 2019 population census. Most data sets covering informal activity are only collected intermittently.


for 36% of Kenya’s GDP on average between 2015 and 2019 (Appendix Table A4). Excluding
VAT-exempt sectors such as agriculture, non-market services and finance, the share increases
to 67% of residual economic activity, implying an informal sector share of 33%. The share of
informal firms and people employed in the informal sector is even higher. Of the 7.4 million
businesses identified in the 2016 MSME report, only one fifth were licensed, and only 2.5%
appeared in VAT data (KNBS, 2016). Similarly, VAT-registered private sector firms employed
5% of Kenya’s workforce in 2019.

With data on the formal sector capturing only a proportion of overall economic activity, we now
turn to the question of whether the inability to observe informal firms in the VAT data might
result in distinct trade patterns that deviate from overall firm-to-firm trade.


3    Representativeness of the Formal Firm Network

In this section, we establish two key stylized facts about the extent to which trade among
formal firms might be representative of overall economic activity. First, we show that trade
among formal firms is highly concentrated around urban centers and places greater emphasis on
inter-county rather than within-county trade. This contrasts with other measures of economic
activity that include less formalised activities and which we find to be more geographically
dispersed. Second, we document that the incidence of informality varies systematically across
geography, sectors, and positions in the supply chain, suggesting that informality is indeed not
evenly distributed across the economy. As a result, we expect overall trade patterns to diverge
from the ones we document for the formal sector.


Fact 1: Formal sector trade is more spatially concentrated than overall
economic activity and distinctly centered around urban hubs.

Formal sector data reveal a high degree of spatial concentration compared to
overall economic activity.

Our first observation is that formal sector trade flows are strongly concentrated around Kenya’s
largest metropolitan areas, Nairobi and Mombasa. In 2019, as much as 68% of the total sales
within the network of formal firms was generated by Nairobi-headquartered firms (see Table 2).
However, in the same year, as little as 9% of Kenya’s population lived in Nairobi County and
the city contributed only 33% of Kenya’s GDP outside the agricultural sector (also reported in
Table 2).15 These comparisons suggest that the role of urban centers in the Kenyan firm network
is disproportionate relative to their population and their contribution to aggregate GDP.
  15
     While Nairobi’s metropolitan area extends into neighboring counties, for simplicity, references to Nairobi in
this paper exclude these areas.




As a proxy for the spatial dispersion of economic activity beyond simple shares attributed to
major cities, we compute Pareto exponents (α) (Gabaix, 2009; Soo, 2005). The third column in
Table 2 shows that while county-level GDP and population show relatively even distributions
(α close to one), measures of economic activity derived from the VAT data exhibit much higher
inequality. The Pareto exponents for employment, value added, and trade flows are 57-76%
lower than for overall economic activity, indicating higher spatial concentration.
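As an illustration of the rank regression behind these Pareto exponents (log rank = log A − α log x, see the notes to Table 2), the following minimal Python sketch estimates α by ordinary least squares. The function name `pareto_exponent` and the synthetic county series are assumptions made purely for illustration. The sketch omits standard errors and any small-sample corrections that the estimates in Table 2 may rely on.

```python
import numpy as np

def pareto_exponent(values):
    """Estimate alpha from the rank regression log(rank) = log(A) - alpha * log(x)."""
    x = np.asarray(values, dtype=float)
    x = x[x > 0]                              # logs require strictly positive values
    x_sorted = np.sort(x)[::-1]               # the largest county receives rank 1
    ranks = np.arange(1, x_sorted.size + 1)
    slope, _intercept = np.polyfit(np.log(x_sorted), np.log(ranks), 1)
    return -slope                             # alpha is minus the fitted slope

# Hypothetical county-level series that follow exact power laws (not real data)
county_ranks = np.arange(1, 48)               # 47 counties
concentrated = 1000.0 * county_ranks ** (-1 / 0.5)   # constructed with alpha = 0.5
dispersed = 1000.0 * county_ranks ** (-1 / 1.0)      # constructed with alpha = 1.0

alpha_concentrated = pareto_exponent(concentrated)
alpha_dispersed = pareto_exponent(dispersed)
```

A lower estimated α means that the measure falls off more steeply with rank, i.e. activity is more concentrated in the top-ranked counties.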

Is this spatial concentration driven by firm locations and number of firm-to-firm relationships
(i.e. the extensive margin of trade) or trade volumes (i.e. the intensive margin of trade)?
In Appendix A.5, we show that the spatial concentration of both trade out- and inflows is
primarily driven by firm locations and the number of firm-to-firm relationships. The number
of transactions and average trade volumes per transaction only play a minor role in explaining
spatial variation in trade patterns. This insight later informs our choice to focus on a model
that allows us to predict the structure of the production network along the extensive margins.
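One common way to attribute spatial variation in trade to its margins is a log-linear variance decomposition. The sketch below is an assumption about how such an exercise can be set up, not necessarily the procedure used in Appendix A.5. It writes county sales as the product of the number of firms, links per firm, transactions per link, and value per transaction, and assigns each component the share cov(log component, log sales) / var(log sales). All data and names (`variance_shares`, the four components) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_counties = 47

# Hypothetical county-level components (synthetic, not from the paper)
components = {
    "firms": rng.lognormal(mean=4.0, sigma=1.5, size=n_counties),
    "links_per_firm": rng.lognormal(mean=1.0, sigma=0.8, size=n_counties),
    "transactions_per_link": rng.lognormal(mean=1.5, sigma=0.3, size=n_counties),
    "value_per_transaction": rng.lognormal(mean=8.0, sigma=0.3, size=n_counties),
}
sales = np.prod(list(components.values()), axis=0)   # total sales per county

def variance_shares(total, parts):
    """Share of var(log total) attributed to each multiplicative component."""
    log_total = np.log(total)
    var_total = np.var(log_total)
    shares = {}
    for name, part in parts.items():
        cov = np.cov(np.log(part), log_total, bias=True)[0, 1]
        shares[name] = cov / var_total
    return shares

shares = variance_shares(sales, components)
extensive_share = shares["firms"] + shares["links_per_firm"]
intensive_share = shares["transactions_per_link"] + shares["value_per_transaction"]
```

Because the logs of the components add up to log sales, the shares sum to one by construction, which makes the split between extensive and intensive margins easy to read.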

To assess whether the observed spatial concentration and urban bias are idiosyncratic features of
VAT data or reflect patterns innate to formal activity, we also compute the spatial concentration
for a range of other indicators for economic activity. These range from the least formal to
increasingly formalized economic activities and are included in Table 2. We observe a clear
pattern: activities that capture very little formal activity such as the number of micro-, small-,
and medium-size enterprises (MSMEs) and employment in MSMEs exhibit relatively little spatial
concentration (α = 0.86 and 0.78, respectively). As we focus on more and more formalized
activity, e.g. licensed businesses from micro to medium sized, all the way to firms in the census
of industrial production, the degree of concentration increases. This suggests that the spatial
concentration observed in the VAT data represents a feature of the formal sector more broadly.

We also find that other observable types of domestic flows, such as migration flows, do not
exhibit a spatial concentration similar to that of formal sector trade (see Table 2).16 For instance, while
Nairobi is the destination for 29% of all long-term inter-county migration, it accounts for 69%
of total outlinks and 65% of total inlinks in the formal sector firm network.

A potential concern is that the observed spatial concentration in urban centres is mainly driven
by firm headquarters being more likely to be located in Nairobi or Mombasa. While
this concern is mitigated by the fact that we observe a similar spatial concentration in other
measures of formal economic activity, we also use micro-data from the 2010 Census of Industrial
Production to address the question about the role of multi-establishment firms more explicitly.
In Appendix A.6, we compare the spatial concentration of sales and firm locations with and
  16
     We compute long-term migration flows using data on birthplace and current residence from the 2019 popu-
lation census.



       Table 2: Geographic concentration of economic activity by degree of formalisation

                                                        Nairobi      Mombasa     Rank regression
                                                        (in %)       (in %)       α       SE
            Population overall                                9         3        1.29    0.18
            Population of cities & towns                     31         9        0.85    0.01
            Migration inflows                                29         6        0.59    0.04
            Migration outflows                                5         2        0.77    0.07

            GDP                                              25         5        1.00       0.07
            GDP w/o agriculture                              33         7        0.97       0.05
            GDP w/o non-market services                      25         5        0.91       0.08

            No. MSMEs                                        14         3        0.86       0.17
            Employment in MSMEs                              19         3        0.78       0.13

            No. licensed MSMEs                               18         3        0.73       0.09
            Employment in licensed MSMEs                     28         3        0.67       0.07

            No. SMEs                                         37         3        0.58       0.06
            Employment in SMEs                               36         3        0.60       0.05
            No. census establishments                        36         4        1.10       0.12

            No. firms census of industrial production        48         6        0.54       0.02
            Sales census of industrial production            61         7        0.32       0.03

            No. VAT firms                                    64          9       0.63       0.03
            Employment in VAT firms                          62          9       0.36       0.03
            Value added of VAT firms                         72         10       0.38       0.03
            VAT network sales                                68         13       0.35       0.02
            VAT network outlinks                             69         11       0.35       0.02
            VAT network purchases                            60          9       0.43       0.02
            VAT network inlinks                              65         10       0.48       0.02
The columns for Nairobi and Mombasa report their share of the respective national aggregate figures (e.g.,
Nairobi’s contribution to Kenya’s GDP). The Pareto exponent α is the estimated coefficient from a county-level
regression of each county’s rank (log) on the respective measure x (log): log rank = log A − α log x. All measures
reported in the final section of the table are derived from the VAT data. All other measures are based on data
sources summarised in Table 1.


without multi-establishment firms. We find that the excess spatial concentration introduced by
multi-establishment firms does not explain the aggregate concentration patterns of formal private
sector activity.


Formal firms predominantly source from and sell to firms in urban hubs.

Consistent with the previous section, a plot of county-by-county trade flows in Figure 3 highlights
the primacy of Nairobi and Mombasa. The width of each segment on the left reflects the county’s
total sales within the network, while segments on the right are proportional to total purchases.
The color of the trade flows aligns with the county of origin.

Importantly, visualizing trade flows also reveals that Nairobi not only has more outlinks to other
counties but is also the most important buyer of goods and services from firms in other counties.
Given the high concentration of manufacturing firms and their upstream position in the network,


                       Figure 3: County-level trade flows between formal firms




The figure shows inter-firm trade flows aggregated at the county level. The size of each node (segment) is
proportional to the county’s share of purchases and sales relative to the aggregate volume of firm-to-firm trade
between formal firms in Kenya. The colour of the edges (links between segments) indicates the direction of the
trade flow. They take the colour of the supplying county (e.g., goods and services provided by firms in Nakuru to
firms in Nairobi take the colour of the segment for Nakuru). The width of each edge (links between segments) is
proportional to the share of the trade flow with respect to the aggregate volume of trade flows in the transaction-
level VAT data. To improve readability, we only separate the trade flows for eight counties (prioritizing those
with the largest aggregate amount of transactions and those that act as regional hubs). We bundle the trade
flows for the remaining 39 counties.


we expect that Nairobi-based firms supply inputs to a wide range of other sectors and firms across
the country. Indeed, as seen in Table 2, firm outlinks (α = 0.35) are more spatially concentrated
than inlinks (α = 0.48), indicating that the supply of inputs to the broader network is more
spatially concentrated. Despite sales destinations of firms being relatively less concentrated, it is
perhaps more unexpected that Nairobi also is the top destination for firm-to-firm sales of formal
firms located outside the city. This is well-illustrated by the left matrix in Figure 4, which plots
the share of each county’s total sales across all 47 counties. The column corresponding to sales
to Nairobi stands out with more intense shading.


Formal firms trade significantly more across regions than within.

Lastly, we find surprisingly limited evidence of home bias in the formal sector firm-to-firm data.
Put differently, local county-level markets are not as important for formal firms as one might
expect.

This pattern is salient in the heatmaps in Figure 4. In both heatmaps, the squares along the
diagonal represent the share of sales to, or purchases from, other firms within the same county. While the

                          Figure 4: County-by-county trading relationships

                    Sales shares                                         Purchase shares




Each column and row in the above graph corresponds to one of Kenya’s 47 counties. The graphs plot how much
each county accounts for any other county’s sales (purchase) shares, i.e. the row- (column-) normalised county-
by-county matrix derived from the administrative firm-to-firm transaction-level data. The rows and columns are
ordered alphabetically based on county names, which are omitted to improve readability.


diagonal is clearly visible, particularly in the heatmap highlighting the destinations of firm sales,
the lighter shades indicate that local markets are less important than trade with Nairobi. Indeed,
for the median county, the number of supplier-to-buyer links with Nairobi is twice as large as the
number of intra-county links. Overall we find that the number of inter-county supplier-buyer
links is 4 times larger than the number of intra-county links for the median county. However, as
expected, selling locally is more common for firms than sourcing inputs locally. This is reflected
in the heatmap, where the diagonal, representing local trade, stands out more clearly for sales
than for purchases.
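The quantities discussed above can be made concrete with a small numerical sketch. It builds the row-normalised (sales) and column-normalised (purchase) share matrices from a county-by-county matrix of link counts and compares inter-county with intra-county links. The four-county matrix is entirely hypothetical, and the way inter-county links are counted per county (sales links to plus purchase links from other counties) is an assumption, since the exact definition behind the reported median is not restated here.

```python
import numpy as np

# Hypothetical link counts: rows are supplying counties, columns are buying counties
counties = ["Nairobi", "Mombasa", "Nakuru", "Kisumu"]
links = np.array([
    [500, 60, 40, 30],
    [ 80, 40,  5,  4],
    [ 50,  4, 10,  2],
    [ 30,  3,  2,  8],
], dtype=float)

# Each row sums to one: where a county's sales links go
sales_shares = links / links.sum(axis=1, keepdims=True)
# Each column sums to one: where a county's purchase links come from
purchase_shares = links / links.sum(axis=0, keepdims=True)

# Intra-county links sit on the diagonal
intra = np.diag(links)
# Assumed definition: links with other counties, as supplier or as buyer
inter = links.sum(axis=1) + links.sum(axis=0) - 2 * intra

ratio_inter_to_intra = inter / intra
median_ratio = np.median(ratio_inter_to_intra)

# Share of Nakuru's sales links going to Nairobi versus staying within Nakuru
nakuru_to_nairobi = sales_shares[2, 0]
nakuru_local = sales_shares[2, 2]
```

In this toy matrix, the diagonal entry of a small county is visible but smaller than its entry in the Nairobi column, which is the qualitative pattern the heatmaps in Figure 4 display.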

The concentration of trade flows around firms based in metropolitan areas alongside the relat-
ively less prominent role for intra-county trade are two striking features of formal firm-to-firm
trade. These patterns may not be reflective of overall economic activity. For the remainder of
the paper we will focus on getting a better understanding of whether we expect these to be
features of overall trade, including the informal sector, or whether these patterns are innate to
the formal sector, similar to the documented urban concentration of formal economic activity.


Fact 2: Incidence of informality varies by sector, geography, and position
along the supply chain.

We now investigate whether informality is randomly distributed across the economy or sys-
tematically varies by sector, geography, and position in the supply chain. If informality is not


randomly distributed, we expect that accounting for the informal sector will systematically alter
the structure of the observed network. A first piece of suggestive evidence is that regions with
higher levels of informality have fewer observed firm-to-firm links in the VAT data, even after
controlling for population and travel time to metropolitan areas (Table A1).


Informal-sector shares correlate negatively with regional economic size and
income.

First, we explore the spatial distribution of informal firms, which we find predominantly reside
in smaller markets. Figure A4 plots the distribution of formal sector shares across counties,
measured using both value-added and employment metrics. The graph shows that in most
counties, the formal sector accounts for less than 20% of economic activity.

We find a strong correlation between a county’s formal sector share and both its economic size
(measured by Gross County Product) and income level (measured by Gross County Product
per capita). As shown in Figure A5, economic size alone explains between 35% and 52% of
the variation in formal sector shares across counties. This pattern is consistent across all three
measures of economic activity: employment, value added and the number of firms.

To validate that this positive correlation between market size and formal sector share is not
merely an artifact of the administrative data, Figure 5 presents correlations between Gross
County Product and three additional employment-based formality measures that do not rely on
the administrative data. Notably, while more stringent definitions of informality yield flatter
slopes, the R² remains stable. This consistency suggests that economic size explains similar
proportions of county-level informality variation regardless of the measurement approach.


The incidence of informality systematically varies across sectors.

Beyond geographic patterns, informality also varies systematically across sectors. Figure 6
compares a sector’s value added (from administrative data) with its contribution to Kenya’s
GDP (from national accounts). Manufacturing and business services show the closest alignment
between these measures, which suggests that the bulk of economic activity in those two sectors
takes place in the formal economy. This pattern is consistent with the fact that both sectors
rely predominantly on inputs from other firms and tend to sell to other businesses (Figure A1).

In contrast, downstream sectors closer to final consumers exhibit larger disparities between value
added and GDP contributions (Figure A8). This pattern aligns with weaker self-enforcement
in consumer-facing sectors inherent to most VAT systems (Pomeranz, 2015; Naritomi, 2019).
We observe similar patterns in both the extensive margin of informality (comparing firm counts
across data sources, Figure A6) and the intensive margin (comparing formal and informal em-

              Figure 5: Share of formal sector employment and regional market size




The first measure uses the formal sector employment share according to the 2019 population census, the second
measure considers the number of employees in licensed businesses, the third uses the same measure but disregards
micro-enterprises, and the fourth measure considers employment in the tax records. Each measure represents a
share, i.e. captures the proportion of economic activity that can be attributed to the formal sector.


ployment from the 2019 population census, Figure A7). Both measures indicate higher inform-
ality in downstream sectors such as wholesale, retail, and other services.


Informal firms are located downstream of larger firms.

While the previous section indicates that informal firms predominantly operate in downstream
sectors, we now utilize survey data to further document their location along the supply chain.
We find that informal firms predominantly operate downstream of large formal firms and in
consumer-facing roles. Where interactions between large firms and smaller, often informal firms
occur, they typically follow this pattern: large firms serve as input providers to informal
businesses, whilst informal firms primarily act as distributors, serving end consumers.17

We draw on survey data on trading partners of micro, small and mid-sized enterprises (MSMEs)
by KNBS (2016), which asks about a firm’s main source of input and main type of customer.
Only 2.3% of all MSMEs state that a large firm is their main customer, while 14.5% rely on large
firms as their main source of inputs.18 Figure 7 shows that the pattern holds across sectors.19
  17
     Cordaro et al. (2022) document this pattern in Kenya, showing how microenterprises distribute fast-moving
consumer goods for multinationals.
  18
     KNBS (2016) defines large firms as those with more than 99 employees, which is larger than the average
VAT-paying firm.
  19
     The survey likely provides a lower bound estimate of the interaction between the VAT-registered and non-
VAT-registered firms. While MSMEs primarily trade with each other, the survey does not distinguish micro,


                             Figure 6: Value added by VAT firms vs GDP




This graph compares the sector-level contribution to national GDP to the value added (sales - purchases) of firms
covered in the administrative tax records for 2019.


These results align with findings by Böhme and Thiele (2014) and Zhou (2022), who document similar
linking patterns between formal and informal firms in Benin, Burkina Faso, Côte d’Ivoire, Mali,
Sénégal, and Togo (Böhme and Thiele, 2014) and in India (Zhou, 2022), respectively.20 Gadenne et al.
(2022) use granular data covering a wide range of sectors to document that non-VAT paying firms
that participate in a simplified tax scheme in West Bengal, India, also sell little to VAT-paying
firms, but purchase 50-75% of their inputs from VAT-paying businesses. The higher
incidence of informality in downstream sectors, as well as informal firms being more likely to
purchase from larger firms rather than vice versa, are consistent with the underlying enforcement
structure of VAT systems. The VAT system incentivises downstream firms to request receipts
from suppliers to claim input VAT deductions against their output VAT obligations. However,
end consumers and VAT-exempt entities lack incentives to demand receipts as they cannot claim
VAT refunds (Naritomi, 2019). Consequently, we indeed expect downstream sectors to exhibit
higher shares of economic activity outside the VAT system.
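The receipt incentive described above can be illustrated with a stylised calculation. The sketch assumes a single flat VAT rate of 16% and full creditability of input VAT that is backed by a supplier receipt. Both are simplifying assumptions for illustration, and the function name `net_vat_due` is hypothetical.

```python
VAT_RATE = 0.16  # assumption: illustrative flat rate, not a figure from the paper

def net_vat_due(sales, purchases, has_supplier_receipt):
    """VAT a registered firm remits: output VAT minus creditable input VAT."""
    output_vat = VAT_RATE * sales
    # Input VAT can only be credited when the purchase is documented by a receipt
    input_credit = VAT_RATE * purchases if has_supplier_receipt else 0.0
    return output_vat - input_credit

# A registered firm with sales of 1,000 and input purchases of 600
with_receipt = net_vat_due(1000.0, 600.0, has_supplier_receipt=True)
without_receipt = net_vat_due(1000.0, 600.0, has_supplier_receipt=False)
receipt_gain = without_receipt - with_receipt
```

The registered buyer lowers its own liability by asking its supplier for a receipt, which in turn documents the supplier's sale. A final consumer has no liability to offset and therefore gains nothing from the receipt, which is why the paper trail tends to break at the last stage of the chain.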
small, and medium firms. Small enterprises, defined as those with up to 50 employees and KShs five million
annual turnover (KNBS, 2016), may fall below the VAT threshold, but medium-sized firms often exceed it.
  20
     Böhme and Thiele (2014) look at multiple sectors, while Zhou (2022) focuses on manufacturing. For a review
on links between the formal and informal economy, see Meagher (2013).




               Figure 7: Links of small and medium sized enterprises to large firms

                        Sales                                                Purchases




The figure draws on data from the 2016 Micro, Small and Medium Enterprises (MSME) Survey by the Kenya National
Bureau of Statistics (KNBS, 2016). We restrict the sample to participating firms with an annual revenue below
the VAT registration cut-off. The survey asks each firm for their main input sources and their main customer
type. Note that the customer and supplier category “MSME” also contains medium sized firms which can include
formal tax-registered firms. The percentage captured in the “Large firm” category thus represents a lower bound
on linkages between small non-VAT registered businesses and large VAT-registered private sector firms. KNBS
(2016) defines non-MSMEs/large firms as entities with more than 99 employees.


4     A Network Formation Model with Heterogeneity in Sectors,
      Regions, and Firm Size

We now present and estimate a network formation model to (i) predict the formal firm network
as observed in the data and (ii) estimate a revised network that accounts for informal firms. We
will use the revised network to measure the extent to which ignoring informality has implications
for the spatial patterns of domestic trade and the geographic variation in the pass-through of
domestic and international trade shocks.


4.1     Model Motivation

We rely on the network formation model outlined in Bramoullé et al. (2012). This model is
particularly well-suited for our purposes for three reasons.

First, it focuses on the entry of nodes into the network and the formation of links among them.
In other words, the model captures the extensive margin of trade, i.e. firm location and
firm-to-firm links. As discussed earlier and documented in Appendix A.5, these two
components account for 70-90% of the variation in trade flows.

Second, it allows us to easily incorporate three key dimensions of firm heterogeneity that can
affect network formation - sectors, geography, and size. The sectoral dimension captures the
underlying input-output structure, while the geographic dimension also allows us to study the
question of spatial inequality. The size dimension captures both the well-documented positive

relationship between firm sales and firm-to-firm connections (Bernard et al., 2022; Bacilieri et al.,
2023) as well as potential differences in how small firms organize their supply chains across space
and sectors.21 Table B1 shows that small firms within the same sector and county are less likely
to directly source from manufacturing firms (upstream), but instead are more likely to source
from retailers or wholesalers (downstream). Further, they are less likely to source from Nairobi-
based suppliers and more likely to source locally.

Third, the model incorporates a flexible network formation process such that the emergent degree
distribution can follow a power law. The underlying dynamic network formation process gives
rise to the substantial inequality in outdegrees across firms that has been widely documented
in the literature (Bernard and Moxnes, 2018; Panigrahi, 2022; Bernard et al., 2022; Bacilieri
et al., 2023). Figure B1 plots the outdegree and indegree distribution of the formal firms in
our VAT data, revealing a very unequal degree distribution that resembles a power law. Our
framework is flexible and allows us to estimate the share of firm-to-firm links formed via directed
search (searching among the trade partners of existing suppliers) versus undirected search (often
referred to as random search in the networks literature (Jackson and Rogers, 2007)).


4.2     Model Setup

Consider an economy with a set of firms denoted by N . Each firm i ∈ N is of a given type θi
∈ Θ where Θ is the set of all possible types. In our application, we specify firm types as unique
sector-county-size combinations, i.e. all firms in the same sector, county, and size group are
classified as the same type. Our aim is to predict a matrix π that captures the number of links
that exist between every possible pair of firm types.

The network formation process is as follows. In every period t, a new buyer firm of type θ enters
with probability p(θ). Hence, the number of firms in the network in any given time period is
equal to the number of time periods that have passed since t = 0. In order for its operations to
be viable, the new firm needs to source inputs from a fixed number of suppliers m. To do this,
it first chooses a sector-county-size combination (i.e. a type) with probability p(θ, θ′) for all θ′ in Θ. The
probabilities p(θ, θ′) represent the firm’s bias in terms of sectors, regions, and firm size types it
wants to link with. In other words, the probability that a buyer of type θ finds a supplier of
type θ′ may not necessarily be equal to the probability of θ′ in the firm population. These biases
can reflect production technologies or homophilous preferences arising out of search costs and
information frictions. Firms in a location θ might find it easier to link to firms in location θ′
that is close to them as opposed to firms in location θ′′ that is far. Likewise, firms in sectors that
  21
     Economies of scale in trade cost at the firm level give rise to supply chain structures with several intermediaries
(Grant and Startz, 2022). Hence, we expect the linking patterns of small firms to diverge from large firms. Relevant
for our case, economies of scale can result in firms of different sizes, but operating within the same geography and
sector exhibiting different sourcing patterns.


supply services like electricity or telecommunication, which almost every firm requires as inputs,
might find themselves with linking probabilities p(θ, θ′ ) that exceed their entry probability p(θ).

Having chosen the sector-county-size type it wants to link with, the firm now relies on two
different search technologies to form its m links: first, undirected search (a.k.a. random search).
Here, the new firm ‘randomly’ links to other firms of the chosen type. It forms a fraction r
of its total m links in this manner. Second, preferential attachment. The new firm forms the
remaining fraction 1 − r of its m links to suppliers by searching among the existing suppliers it
acquired via undirected search. In other words, once the buyer firm forms links to the first set
of suppliers, it then ‘randomly’ links with the suppliers of its suppliers. The second step of this
process is preferential in that suppliers that are more connected are more likely to be chosen.
This process continues for several time periods and the network evolves accordingly.22
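The entry and search process described above can be sketched as a small simulation. This is a simplified illustration under stated assumptions, not the estimation code: the network is seeded with one unconnected firm per type, the number of undirected-search links is `round(r * m)`, the directed step draws uniformly from the suppliers of the suppliers found in the first step without re-applying the type bias, and duplicate draws are dropped, so an entrant can end up with fewer than m suppliers. The function name `simulate_network` and all parameter values are hypothetical.

```python
import numpy as np

def simulate_network(p_entry, p_link, m=4, r=0.5, periods=2000, seed=0):
    """Simulate type-biased network formation with undirected and directed search."""
    rng = np.random.default_rng(seed)
    p_entry = np.asarray(p_entry, dtype=float)
    p_link = np.asarray(p_link, dtype=float)
    n_types = len(p_entry)

    # Assumption: start from one firm per type and no links
    types = list(range(n_types))
    suppliers = [[] for _ in range(n_types)]   # suppliers[i]: firms that firm i buys from
    by_type = [[i] for i in range(n_types)]    # firm ids grouped by type
    n_random = max(1, round(r * m))            # links formed via undirected search

    for _ in range(periods):
        theta = int(rng.choice(n_types, p=p_entry))   # type of the entering buyer
        found = []

        # Step 1: undirected search within a type drawn from the bias p(theta, theta')
        for _ in range(n_random):
            target = int(rng.choice(n_types, p=p_link[theta]))
            pick = by_type[target][rng.integers(len(by_type[target]))]
            if pick not in found:
                found.append(pick)
        first_round = list(found)

        # Step 2: directed search among the suppliers of the step-1 suppliers.
        # Firms that supply several step-1 suppliers appear repeatedly in the pool,
        # so better-connected suppliers are more likely to be picked.
        for _ in range(m - n_random):
            pool = [s for i in first_round for s in suppliers[i] if s not in found]
            if not pool:
                break
            found.append(pool[rng.integers(len(pool))])

        new_id = len(types)
        types.append(theta)
        suppliers.append(found)
        by_type[theta].append(new_id)

    return np.array(types), suppliers

# Hypothetical entry probabilities and linking biases for three types
p_entry = [0.5, 0.3, 0.2]
p_link = [[0.6, 0.3, 0.1],
          [0.5, 0.4, 0.1],
          [0.4, 0.2, 0.4]]
types, suppliers = simulate_network(p_entry, p_link)

# Outdegree per firm and link counts by (buyer type, supplier type)
outdegree = np.zeros(len(types), dtype=int)
link_counts = np.zeros((3, 3), dtype=int)
for buyer, sups in enumerate(suppliers):
    for s in sups:
        outdegree[s] += 1
        link_counts[types[buyer], types[s]] += 1
```

The matrix `link_counts` is the simulated counterpart of the type-by-type link matrix that the model predicts, and `outdegree` can be inspected to see how unequal the number of buyers per supplier becomes.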

Ultimately, we are interested in the number of links between each sector-county-size type and
their outdegrees. To this end, consider a matrix B where each row and column represents a type
θ ∈ Θ. Its θθ′’th entry is then equal to p(θ) p(θ, θ′)/p(θ′). Bramoullé et al. (2012) rely on B to derive
the matrix π whose ij’th entry shows the number of directed links at time t between buyers of
type i and suppliers of type j which are born in t0 :

$$\pi^{t}_{t_0} = m \, \frac{r}{1-r} \, \bigl( f(t, B) - I \bigr) \tag{1}$$

Here, t refers to the time period, I is the identity matrix, and f is a scaled geometric series of
the matrix B defined as follows:
$$f(t, B) = \sum_{\mu=0}^{\infty} \frac{\bigl( (1-r) \, \log(t) \, B \bigr)^{\mu}}{\mu!}$$
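
Because each term of the series carries µ! in the denominator, the sum is the power series of a matrix exponential, f(t, B) = exp((1 − r) log(t) B), so Equation 1 can be evaluated numerically by truncating the sum. The sketch below does this in plain Python. The function names and the truncation length `terms` are assumptions of this illustration, and a matrix B with large entries may need more terms than the default.

```python
import math

def mat_mul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def f_series(t, B, r, terms=60):
    """Truncated sum over mu of ((1 - r) * log(t) * B)^mu / mu!."""
    n = len(B)
    scale = (1.0 - r) * math.log(t)
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # mu = 0 term
    total = [row[:] for row in term]
    for mu in range(1, terms):
        # Next term = previous term * (scale * B) / mu
        term = [[scale * x / mu for x in row] for row in mat_mul(term, B)]
        total = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(total, term)]
    return total

def pi_matrix(t, B, r, m):
    """Equation (1): m * r / (1 - r) * (f(t, B) - I)."""
    n = len(B)
    f = f_series(t, B, r)
    c = m * r / (1.0 - r)
    return [[c * (f[i][j] - float(i == j)) for j in range(n)] for i in range(n)]
```

With a single type and B = [1], the series collapses to t^(1 − r), which gives a quick check of the implementation.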

Newly entered buyers form m inlinks in every period. As a result the outdegree of existing
firms, i.e. the suppliers of the newly entered firms, evolves over time. Thus, the matrix $\pi^{t}_{t_0}$ gives
the expected outdegree (i.e. number of buyers) of each column node born in time t0 to a row,
computed at time t. The purpose of the dynamic network formation process is to rationalise the
heterogeneity in outdegree. Our focus on predicting outdegrees while keeping indegrees fixed is
further discussed in Appendix B.3. We motivate this assumption using stylized facts observed
in the data. While this assumption helps us characterize the steady state of the model, we also
discuss how relaxing it would likely reinforce our results.
  22
    The model takes the distribution of firm types as given, abstracting from firms’ endogenous entry decisions
across sectors, regions, and size categories (and subsequently, formality status). Instead, we capture these entry
patterns through exogenous probabilities p(θ) for each firm type, which correspond to the observed spatial,
sectoral, and size distributions in our data. While one could extend the model to microfound these entry choices
— for instance, to explain the concentration of economic activity in Nairobi —we prioritise matching the observed
proportions of different firm types rather than explaining their underlying determinants.




4.3     Estimation Strategy

Given the granular data on the empirical formal sector firm-to-firm network, we are able to obtain
the majority of the model parameters directly from the data (see Table 3 for an overview). These
include all entry probabilities p(θ) ∀ θ ∈ Θ that a firm enters in a given sector, county, and size
group as well as all interaction probabilities p(θ, θ′ ) between all possible sector-county-size types.
We use the cross-section from 2019, the last pre-COVID year of our panel, to obtain the p(θ)s
and p(θ, θ′ )s.23

The parameter we need to estimate is r, the fraction of input links a firm obtains via undirected
search independent of the network environment.

                                        Table 3: Model parameters

 Parameters      Description                             Source      Proxy                               Value
 r               Share of suppliers via random search    Estimated   -                                    0.45
 p(θ)            Entry probability of type θ             Data        Share of firms observed as θ       (0,0.12]
 p(θ, θ′ )       Linking probability of θ and θ′         Data        Share of links between θ and θ′      (0,1]
 m               Indegree                                Data        Avg. number of suppliers              30
 t               Number of entry periods                 Data        Number of firms in admin data       56822


First, we classify firms into types defined by unique sector-location-size combinations. Sectors
refer to 13 aggregate sectors, namely, agriculture, mining, manufacturing, utilities, construc-
tion, transportation and logistics, wholesale, hospitality, retail, business services, non-market
services, other services, and miscellaneous (incl. international organisations and non-classified
firms). Locations are given by the county in which the firm is located. Within each sector and
county we further group firms into large and small firms. We define small firms as firms in the
bottom sales quartile within a sector-county group. By restricting ourselves to two size bins
only, we avoid having too few observations in each firm-type bin and the matrix of linking prob-
abilities becoming too sparse. For example, all firms in the top three sales quartiles of Nairobi’s
manufacturing sector are classified as the same type.

Next, we compute the probability that a type exists for all types in Θ. We do so by dividing the
number of formal firms of a sector-county-size type by the total number of formal firms in the
economy. The interaction probabilities p(θ, θ′ ) then represent the fraction of a sector-county-size
type θ’s supplier relationships that it forms with type θ′ . We compute the above probabilities
for all possible combinations of types and use them to construct the matrix B. Moreover, we
follow Jackson and Rogers (2007) and define m as the average indegree (i.e. average number of
suppliers) in the network. The variable t, by definition, is equal to the number of firms observed
in the data, which is 56,822.
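
The construction of p(θ), p(θ, θ′) and B described above can be sketched as follows. The input layout (a mapping from firm identifiers to types and a list of buyer-supplier pairs), the function names, and the toy data are assumptions made for illustration and do not reflect the structure of the administrative records.

```python
from collections import Counter

def entry_probabilities(firm_types):
    """p(theta): share of firms observed as each type."""
    counts = Counter(firm_types.values())
    n = len(firm_types)
    return {theta: c / n for theta, c in counts.items()}

def linking_probabilities(links, firm_types):
    """p(theta, theta'): share of type theta's supplier links that go to theta'."""
    pair_counts = Counter((firm_types[b], firm_types[s]) for b, s in links)
    buyer_totals = Counter(firm_types[b] for b, _ in links)
    return {(tb, ts): c / buyer_totals[tb] for (tb, ts), c in pair_counts.items()}

def build_B(p, p_link):
    """Matrix B with entries p(theta) * p(theta, theta') / p(theta')."""
    types = sorted(p)
    B = [[p[a] * p_link.get((a, b), 0.0) / p[b] for b in types] for a in types]
    return types, B

# Toy data: types are (sector, county, size) tuples, links are (buyer, supplier).
X = ("retail", "Nairobi", "large")
Y = ("wholesale", "Nairobi", "large")
firm_types = {"f1": X, "f2": X, "f3": Y, "f4": X}
links = [("f1", "f3"), ("f1", "f2"), ("f2", "f3"), ("f3", "f4")]

p = entry_probabilities(firm_types)
p_link = linking_probabilities(links, firm_types)
types, B = build_B(p, p_link)
```

In this toy example, p(θ) sums to one across types and each buyer type's linking probabilities sum to one across supplier types.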
  23
    We exclude a small proportion of firms with zero suppliers in the data, as the model requires all entrants to
form m buying links.



Using the parameters from the empirical data, we are able to predict the matrix of type-to-type
network links π (r) for different choices of r ∈ [0, 1]. Appendix B.2 discusses the practical steps
needed to construct the matrix during the estimation.

In addition to the predicted version of the matrix π , we also observe the actual π in the data
where the ij ’th entry of π is just the number of links between types i and j . We match the model
predicted matrix and the matrix in the data using the method of moments procedure to obtain
r∗ . Each moment is weighted by the probability with which we observe a specific sector-region-
size type in the data. In doing so, we assign greater weight to more common sector-region-size
types whose probabilities tend to be more stable over time. r∗ is defined as follows:

$$r^{*} = \arg\min \sum_{\theta} p(\theta) \sum_{\theta'} \bigl( \pi_{\text{model}}(\theta, \theta'; r) - \pi_{\text{actual}}(\theta, \theta') \bigr)^{2} \tag{2}$$


r∗ is obtained by minimising the distance between the model predicted matrix of type-by-type
interactions and the corresponding matrix obtained from the data (method of moments). We
estimate r using simulated annealing. With only one parameter to estimate, we can plot the
objective function for various values of r to ensure that our estimated value is indeed the global
minimum (see Figure B4).
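
The weighted objective in Equation 2 and the check of its shape over r can be sketched as below. This illustration swaps simulated annealing for a plain grid search, which is feasible because only one parameter is estimated. The function names, the `predict` callable (standing in for the model-predicted matrix at a given r), and the toy inputs are hypothetical.

```python
def weighted_distance(pi_model, pi_actual, p):
    """Equation (2) objective: sum_theta p(theta) * sum_theta' (model - actual)^2."""
    return sum(
        p[i] * sum((pm - pa) ** 2 for pm, pa in zip(row_m, row_a))
        for i, (row_m, row_a) in enumerate(zip(pi_model, pi_actual))
    )

def grid_search_r(predict, pi_actual, p, steps=99):
    """Evaluate the objective on an evenly spaced grid of r inside (0, 1)."""
    grid = [(k + 1) / (steps + 1) for k in range(steps)]
    values = [weighted_distance(predict(r), pi_actual, p) for r in grid]
    best = min(range(steps), key=values.__getitem__)
    return grid[best], list(zip(grid, values))

# Toy stand-in for the model: a 2 x 2 matrix that depends on r.
def toy_predict(r):
    return [[10.0 * r, 5.0 * (1.0 - r)], [2.0 * r, 1.0]]

pi_actual = toy_predict(0.45)   # pretend the data were generated at r = 0.45
p = [0.7, 0.3]                  # row weights p(theta)
r_star, curve = grid_search_r(toy_predict, pi_actual, p)
```

The returned `curve` holds the objective at every grid point, so it can be plotted to inspect whether the minimum is global, in the spirit of the check described above.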


4.4    Estimation Results

Our estimation strategy yields a result of r∗ = 0.45. It suggests that a newly entered firm
chooses 45% of its m suppliers at random, and the remaining 55% among the suppliers of its
existing suppliers. A network with 55% of all links being formed via directed search suggests
a prominent role for information frictions as firms rely on their suppliers to form new links. It
aligns with previous research documenting the importance of relational contracts in Kenya and
neighbouring economies (Fafchamps, 2003). In a variant of this model, Chaney (2014) estimates
r = 0.6 for French exporters forming links with trade partners abroad, which also suggests a
substantial, but not quite as prominent role of information asymmetries.


4.5    Model Fit

To assess how well our model does in fitting the targeted outdegree distribution (i.e. distribution
of the number of buyers), we plot the degree distribution (i.e. total number of outlinks) of each
sector-county-size type as observed in the data and as predicted by the model. Figure B5 shows
that the key properties of the outdegree distribution are replicated by the model’s predictions.
The model and data match particularly well in the right tail of the distribution, i.e. the part
that is specifically targeted by allowing for directed search in the network formation process.



We also estimate the Pareto exponent α, which was not explicitly targeted by the model, for
both degree distributions. We obtain an α of 0.36 from the model and 0.37 from the data. In
addition, since we have previously shown that the formal firm network is spatially concentrated,
we also assess how well the model predicts the share of buyer relationships in the economy that
originate from firms in Nairobi, in Mombasa, and in other counties. We find that the model
performs well on this dimension too (Table 5). In the administrative data, 69% (11%) of all
outlinks in the economy are captured by Nairobi (Mombasa) based firms and the model predicts
a share of 70% (11%).


5     Predicting a Revised Network

With the estimated model at hand, we are now able to tackle the question of how including
informal firms might affect the spatial patterns of domestic trade. Our proposed thought exper-
iment is the following: suppose we were to observe informal firms. What would happen to the
outdegree distribution of various types θ? To obtain a ‘revised network’ that accounts for in-
formal firms, we rely on updated information on the spatial and sectoral dispersion of economic
activity in Kenya – now including the informal sector. In model terms, our exercise shifts the
probabilities p(θ) with which we observe nodes of certain sector-region-size types θ to be born.
Knowing r∗ and our updated p(θ)s, we can then once again predict the type-by-type matrix of
firm-to-firm links π , keeping everything else constant. We will also make additional assumptions
about the linking probabilities of informal firms.


5.1    Predicting the Sector-County Profile of Non-VAT Firms

To incorporate informal firms into the network, we first update the firm-type probabilities p(θ)
for each sector-county-size cell, this time accounting for the entire firm size distribution. To
update p(θ), we ideally would want to observe the number of firms Ncs in each sector s, county
c, and size cell – irrespective of their formality status. However, none of the KNBS records
available to us feature a breakdown of the firm count along both the sector s and the county c
dimension, let alone size dimension. Therefore, instead of the firm count, we rely on granular
sector and region level information on formal and informal employment in the 2019 census labor
force module to compute our alternative entry probabilities p(θa ):

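$$p(\theta_{a,\ \text{large formal}}) = \frac{e^{\text{formal}}_{sc}}{\sum_{c}^{47} \sum_{s}^{13} (o_{sc} + e_{sc})} \times \frac{1}{x^{\text{formal}}_{sc}} \tag{3}$$

$$p(\theta_{a,\ \text{informal/small formal}}) = \frac{o^{\text{formal}}_{sc} + o^{\text{informal}}_{sc} + e^{\text{informal}}_{sc}}{\sum_{c}^{47} \sum_{s}^{13} (o_{sc} + e_{sc})} \times \frac{1}{x^{\text{informal}}_{sc}} \tag{4}$$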
                                                              c




where osc is the number of self-employed people, and esc the number of (wage) employed people.

The denominator sums total private sector employment (both wage and self-employed, formal
and informal) across all 13 sectors and 47 counties. The updated sector-region-size probabilities
p(θa ) will again sum to one and hence capture a relative change in the number of firms rather
than an absolute change.

Using simple employment shares to compute p(θa ) relies on the assumption that the mapping
of employees to firms is the same across all sectors and regions. However, empirically, manufac-
turing firms, for example, tend to be larger than businesses in the hospitality sector. Nairobi
hosts larger firms than Mandera County in Kenya’s north. We therefore re-scale the number
of employees by the average firm size in each sector-county-size cell, $x^{\text{formal, informal}}_{sc}$. For small
formal and informal firms, we rely on the KNBS (2016) survey to compute the average number
of employees, while we use the administrative data for large formal sector firms.24
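
As a worked illustration of Equations 3 and 4, the sketch below computes the alternative entry probabilities for two sector-county cells. All numbers are invented, the dictionary keys and function name are hypothetical, and the final rescaling step is an assumption of this sketch, added so that the probabilities sum to one as the text requires.

```python
def alternative_entry_probabilities(cells):
    """cells maps (sector, county) to employment counts and average firm sizes.

    Hypothetical keys per cell: e_formal, e_informal (wage employed),
    o_formal, o_informal (self-employed), x_formal, x_informal (avg. firm size).
    """
    # Denominator: total private sector employment across all cells.
    total = sum(v["e_formal"] + v["e_informal"] + v["o_formal"] + v["o_informal"]
                for v in cells.values())
    raw = {}
    for (sector, county), v in cells.items():
        # Equation (3): formal wage employment, scaled by average formal firm size.
        raw[(sector, county, "large formal")] = v["e_formal"] / total / v["x_formal"]
        # Equation (4): self-employed plus informal wage employment.
        numerator = v["o_formal"] + v["o_informal"] + v["e_informal"]
        raw[(sector, county, "informal/small formal")] = numerator / total / v["x_informal"]
    norm = sum(raw.values())  # assumption: rescale so the probabilities sum to one
    return {k: x / norm for k, x in raw.items()}

cells = {
    ("manufacturing", "Nairobi"): dict(e_formal=800, e_informal=100, o_formal=50,
                                       o_informal=50, x_formal=40.0, x_informal=2.0),
    ("retail", "Mandera"): dict(e_formal=20, e_informal=80, o_formal=10,
                                o_informal=90, x_formal=10.0, x_informal=1.5),
}
p_alt = alternative_entry_probabilities(cells)
```

Dividing by average firm size converts employment shares into approximate firm-count shares, so cells with many small businesses receive more weight than their employment share alone would suggest.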

For agriculture and non-market services, we estimate their p(θa ) drawing only on formal private
sector employment. Formal VAT-paying firms occupy a very specific niche in both cases (e.g.
formal firms in these sectors are disproportionately export-oriented or the sector is dominated by
non-profit entities, see discussion in Section 2.2) and informal employment takes vastly different
forms (e.g. mainly reflects subsistence farming).

How does the probability p(θ) that a formal firm enters in a given sector-county-size cell shift
to p(θa ) once we account for informal firms? Figure C1 suggests that a 10 percentage point in-
crease in formality corresponds with a 0.5 percentage point increase in p(θ) − p(θa ) (0.35 standard
deviations). As expected, p(θa ) is lower than p(θ) for sectors and counties with a high degree of
formality, indicating that their importance for the overall economy is overstated in the admin-
istrative data. To recap, our proposed revised network accounts for informal firms being born
into the network based on their sector-region profile. Rather than thinking of the exercise as
adding new firms, we adjust the weights of each sector-region-size type.


5.2     Assumptions about Linking Probabilities of Informal Firms

Another challenge for integrating informal firms into the network of formal firms arises due to
the lack of granular data on the sectoral and geographic composition of how informal firms link
  24
     If big formal firms employ informal casual workers not captured in the administrative data, we understate
their size and hence overestimate the probability of big formal firms in the network. As a result, our revised
network becomes biased towards the baseline network that only covers the formal sector. This is also illustrated
in Figure 8 where we compare the spatial inequality in county-level outdegrees in the baseline network to several
scenarios that account for informal firms. Comparing the two scenarios where in one case we use the simple
employment shares to compute p(θa ) and in the other case further adjust for differences in firm size across sectors
and counties, we find that in line with the intuition on the implications of informal workers in formal firms, the
former scenario is closer to the original network with only formal firms.



with both each other and with the formal sector. An ideal data set would provide details on
sourcing and selling patterns by sector, geography, and formality status. In the absence of such
data and given the strong correlation of size and formality status, our default approach will be
to assume that informal firms exhibit linking preferences p(θ, θ′ ) similar to those of small formal
firms, conditional on sector and geography.

This approach is particularly attractive given its straightforward implications for the sectoral
and geographic composition of linking patterns. It further allows for informal and small formal
firms to experience different wedges in link formation, provided these wedges generate aggregate
linking patterns that are still comparable to those of small formal firms at the sector-county-
size level. Consider, for instance, a small formal retailer and an informal retailer both seeking
to purchase soap. While neither can source directly from manufacturers in Nairobi, the small
formal retailer might purchase from a large formal wholesaler locally, whereas the informal
retailer—potentially excluded due to their tax status—might source from an alternative local
wholesaler. Despite sourcing from different individual suppliers, both retailers exhibit the same
patterns once their sourcing is aggregated at the sectoral and geographic levels. The similarity
assumption is also motivated by findings in the administrative data, where we document that
small formal firms tend to source more locally and from intermediaries (see Table B1).

Nevertheless, this assumption may not fully capture the implications of challenges specific to
informal firms for their linking patterns. Therefore, we subsequently introduce modifications to
our approach, drawing on stylized facts from the existing literature, to assess the sensitivity of
our results to alternative assumptions about the linking patterns of informal firms.


5.3    Characteristics of the Revised Network

How does the revised network that accounts for informal firms compare to the baseline network?
First, we find that firm types in sectors and counties with a high incidence of informality are
predicted to have a relatively larger increase in outdegrees (Figure C2). Second, the share of total
outlinks attributed to Nairobi- and Mombasa-based firms declines from 80% to 75% (see Table
5). While the two cities maintain their prominent role in the network, this shift is meaningful
for the remaining counties in relative terms as it represents a 25% increase in their share of outlinks (from 20% to 25%).
Third, accounting for informal firms reduces the variation in outdegrees across counties by 7.5%
(Table 4). We visualize this reduction in inequality by plotting the Lorenz curve for county-level
outlinks in Figure 8.
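
The two inequality measures used here, the coefficient of variation and the Lorenz curve, can be computed from county-level outlinks as in the following sketch. The function names and toy numbers are hypothetical, and reading the change in sd/mean as a percentage change relative to the baseline is an assumption of this illustration.

```python
def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population variance)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return var ** 0.5 / mean

def lorenz_curve(values):
    """Cumulative share of outlinks held by the bottom k counties, k = 0..n."""
    ordered = sorted(values)
    total = sum(ordered)
    points, running = [0.0], 0.0
    for v in ordered:
        running += v
        points.append(running / total)
    return points

def pct_change_cv(baseline, revised):
    """Percentage change in sd/mean between the baseline and revised network."""
    cv0 = coefficient_of_variation(baseline)
    return 100.0 * (coefficient_of_variation(revised) - cv0) / cv0

baseline_outlinks = [1, 1, 2, 6]   # invented county-level outlinks
revised_outlinks = [2, 2, 2, 4]
change = pct_change_cv(baseline_outlinks, revised_outlinks)
curve = lorenz_curve(baseline_outlinks)
```

A negative `change` indicates less dispersion in the revised network, and a Lorenz curve closer to the 45-degree line indicates a more equal distribution of outlinks.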

What drives the reduction of outdegree inequality? First, Nairobi and Mombasa become less
important as a destination for products and services from other counties. In Figure 9 we plot the
row-normalised adjacency matrices, before and after accounting for informal firms, at the county


                    Table 4: County-level changes in the dispersion of outdegrees

                         County outdegree                      ∆ sd/mean (in %)
                         All counties                                -7.5
                         Without Nairobi & Mombasa                  -18.0
The above table reports the difference in outdegrees between the original and the revised network - aggregated
at the county level. We look at the coefficient of variation as the key metric. Adjusting for the mean accounts
for the fact that the change in the number of outlinks predicted by the model needs to be interpreted in relative
rather than absolute terms. We exclude the outdegrees of Nairobi and Mombasa when we compute the coefficient
of variation in the second row.


Figure 8: Inequality in county-level outlinks in the baseline and revised network




To visualise the change in inequality between the baseline and the revised network, we plot the Lorenz curve for
the number of outlinks at the county level. The default scenario uses the entry probabilities p(θ) specified in
Equations 3 and 4 and assumes similar linking patterns for informal firms and small formal firms conditional on
their sector and county of operation.


and sector level, respectively. The matrix is normalised so that each row sums to one. After
accounting for informal firms, a smaller proportion of a county’s total outlinks now connects
with firms in Nairobi (i.e., the column with the lightest color in the baseline matrix).

Second, downstream relationships with firms in the same county now become relatively more
prominent. The values of the diagonal entries of the adjacency matrix increase between baseline
and revised network. This is consistent with the earlier stylized fact that smaller firms are
more likely to source locally (Table B1). With the exception of five counties, most notably
Nairobi and Mombasa, trade within the county gains in importance for the other 42 counties.
This pattern is illustrated even more clearly in Figure C4, which compares changes in both
inter- and intra-county links between the baseline and revised networks. After accounting for


informal firms, both inter- and intra-county outdegree increase for the median county as well as
on average. However, for 83% of counties, the increase in intra-county outdegrees exceeds the
increase in inter-county links.25 If informal firms source an even higher share of their inputs
locally, the predicted shift towards intra-county trade represents a lower bound. We will discuss
this in Section 6.1.
             Table 5: The importance of Kenya’s primary cities in a revised network

                                                       Nairobi   Mombasa   Other counties
                                                                  (in %)
            Population                                     9         3          88
            GDP                                           25         5          70
            GDP w/o agriculture                           33         7          60

            Number of outlinks

            Model fit
            Administrative data                           69        11          20
            Model predicted network                       70        11          19

            Revised network: informal firms ≈ small firms
            Default scenario                              65        10          25
            p(θ) using employ. share only                 66        10          23

            Revised network: alternative linking patterns for informal firms
            0% sales, 50% input to/from formal            59         9          32
            0% sales, 75% input to/from formal            63        10          28
            25% sales, 50% input to/from formal           59         9          32
The above table documents the share of overall firm-to-firm links which have a supplier based in Nairobi, Mombasa
or the remaining 45 counties. We present the spatial dispersion of outlinks in the data, the predicted (formal
sector) network as well as our default scenario for the revised network and four alternative scenarios to assess
sensitivity.



If counties are selling less to Nairobi and Mombasa, where do their inter-county trade links shift?
We find that the number of bilateral trade links is now more sensitive to social ties between
counties. In Table C2, we regress the number of links between two counties on both travel
distance and social connectedness (Bailey et al., 2021), and compare the results for the baseline
and the revised network.26 The findings indicate that, in the revised network, inter-county trade
links are more strongly associated with the strength of social ties between counties.

The increased importance of within-county trade and trade between socially connected counties
gives rise to a network with a more pronounced group structure. We quantify the extent to
which the network is partitioned by measuring the network’s modularity (Newman, 2006). The
modularity of a network is higher when groups of nodes have more links among each other
  25
     While inter-county trade rises for the median county, 18 out of 47 counties actually have fewer links with
other counties in the revised network.
  26
     Social connectedness is measured using friendship network data provided by a popular social media platform,
see Bailey et al. (2021).


than what we would expect in a random network. We compute the modularity of the weighted
adjacency matrix at the sector-county level. We find that modularity in the revised network
with informal firms increases by about 46% suggesting that the revised network exhibits a
more pronounced group structure.27 To further characterize this group structure, we apply a
community detection algorithm to the trade flows between counties as per the revised network.
As illustrated by Figure C3, the group structure now correlates strongly with Kenya’s geography,
i.e., geographically proximate counties are now more likely to be clustered in the same group.
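
For a given partition of nodes into groups, modularity compares the weight of links inside groups with what would be expected if links were placed at random while preserving each node's total in- and out-weight. The sketch below implements one common variant for weighted, directed networks. The choice of this directed formula and the function name are assumptions of this illustration, since the exact variant used in the computation above is not spelled out here, and the partition is taken as given rather than found by the Leiden algorithm.

```python
def modularity(A, groups):
    """Directed, weighted modularity of a partition.

    A[i][j] is the weight of the link from node i to node j and groups[i]
    is the group label of node i. Computes
    Q = (1 / W) * sum_ij (A_ij - out_i * in_j / W) * [groups_i == groups_j].
    """
    n = len(A)
    W = sum(sum(row) for row in A)                       # total link weight
    out_s = [sum(row) for row in A]                      # out-strength per node
    in_s = [sum(A[i][j] for i in range(n)) for j in range(n)]  # in-strength
    q = 0.0
    for i in range(n):
        for j in range(n):
            if groups[i] == groups[j]:
                q += A[i][j] - out_s[i] * in_s[j] / W
    return q / W

# Two pairs of nodes that only trade within their own pair.
A = [[0, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 0, 0, 1],
     [0, 0, 1, 0]]
q_split = modularity(A, [0, 0, 1, 1])   # partition matching the two pairs
q_single = modularity(A, [0, 0, 0, 0])  # everything in one group
```

In this toy network the partition that matches the two trading pairs scores higher than the single-group partition, which is the sense in which higher modularity signals a more pronounced group structure.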

Next, we explore for which group of counties we underestimate the overall number of links the
most. To do so, we regress the change in county-level outlinks between baseline and revised
network on various county characteristics, including the aggregate level of formality, population,
Gross County Product, and market access (Table C3). The results reveal that the baseline
network, which relies exclusively on formal sector data, most significantly underestimates con-
nectivity in smaller counties and those with high market access. Notably, after controlling for
these other characteristics, the relationship between a county’s aggregate formality share and
the change in outlinks is no longer statistically significant.

Finally, we find that incorporating informal firms also reshapes inter-sector trade patterns (Fig-
ure 9). Sectors with substantial informal activity like other services, retail, and wholesale, now
gain prominence as buyers in the network. Manufacturing, wholesale, and mining experience
the largest relative gains in new links.


5.4    Simulating the Effect of Economic Shocks

As a next step, we ask how the newly predicted network that accounts for informal firms com-
pares to the previous network in terms of its role in propagating domestic and international
shocks. How does the predicted impact of the shock depend on whether we account for inform-
ality? Are sectors and regions with more informality more or less vulnerable to shocks than
the administrative data would suggest? To answer these questions, we first simulate a series
of domestic output shocks that reduce each firm type’s output and then analyse how it affects
the output of all other types, both directly and indirectly, by propagating through the network
over multiple time periods. Then, we simulate international supply shocks that affect firm types
depending on their exposure to international markets. We discuss the results for both domestic
and international shocks below.
  27
    We compute this by running the Leiden algorithm multiple times and averaging the resulting modularity
scores. See Traag et al. (2019) for details on the algorithm.




Figure 9: Baseline versus revised network. Panels: county-by-county trading relationships (top) and sector-by-sector trading relationships (bottom), each shown for the baseline network (left) and the revised network (right).


The above figures show heatmaps of the predicted row-normalised adjacency matrix of the network (where row
sells to column) as per the baseline p(θ) on the left and augmented p(θa ) on the right at the county level (top)
and sector level (bottom).


5.4.1        Domestic Shocks

Following the supply-side version of classic input-output models (Sargent and Stachurski, 2022),
we define firm type j ’s output yjt in period t as the sum of inputs it purchases from other types
i plus payments to other factors of production (value added) υjt :28


$$y_{jt} = \sum_{i} g_{ij} \, y_{i,t-1} + \upsilon_{jt} \tag{5}$$



The intermediate inputs purchased from other firm types are the product of each supplier i’s
total output in the previous period yit−1 and the fraction it sells to type j , gij . The gij s represent
  28
       Alternatively, υjt can also be interpreted as a type-specific and period-specific shock to output.


the normalized cells of our type-by-type matrix π that captures the total number of links between
all types. We normalise the rows of π by dividing each entry in a row by the sum of that row. We
abstract from any endogenous network adjustments (see e.g. Panigrahi, 2022; Eaton et al., 2022;
Arkolakis et al., 2023). We assume that υjt is an independent draw from a uniform distribution
U [−10, 10] for every type j in every time period t. Each type starts with a randomly chosen
output drawn from the distribution U [0, 100] in t = 0.

Using this set-up, we first simulate the output process without any shock. Then, we simulate
the output process following a negative output shock to sector-region-size type j ’s value added
υjt in the first time period.29 We repeat this exercise for all types j .

To study the relevance of unobserved informal firms, we conduct our simulation exercise twice.
In the first scenario, we use the matrix π derived from administrative records. In the second
scenario, we use an alternative version of π using our revised network that accounts for the
presence of informal firms.30 Our primary question is: how do domestic shocks impact each
type when informal firms are considered versus when they are not? For each simulated shock,
we compute: (i) the absolute reduction in output of each type using the original adjacency
matrix (excluding informal firms) and (ii) the absolute reduction in output of each type using
the new adjacency matrix (including informal firms), both averaged across all time periods.
This yields a matrix of shock impacts where each row corresponds to a supplier who is shocked
and each column corresponds to a buyer who faces the impact of the shock. We then aggregate
across rows to compute the average impact of the shocks on buyers.
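
The simulation steps above can be sketched in a few lines of Python. This is a minimal illustration: the function names, the toy 3 x 3 link matrix, and the shock size of 10 units are assumptions, and the size of the shock used in the actual exercise is not restated here. Every row of the link matrix is assumed to have at least one link so that it can be normalised.

```python
import random

def row_normalise(pi):
    """Divide each entry by its row sum, so g_ij is the share type i sells to j."""
    out = []
    for row in pi:
        total = sum(row)
        out.append([x / total for x in row])
    return out

def simulate_output(g, y0, shocks):
    """Iterate y_jt = sum_i g_ij * y_i,t-1 + v_jt over the periods in shocks."""
    n = len(g)
    path, y = [], y0
    for v in shocks:
        y = [sum(g[i][j] * y[i] for i in range(n)) + v[j] for j in range(n)]
        path.append(y)
    return path

def shock_impacts(pi, periods=100, shock_size=10.0, seed=0):
    """Rows: shocked supplier type. Columns: average output reduction per type."""
    rng = random.Random(seed)
    n = len(pi)
    g = row_normalise(pi)
    y0 = [rng.uniform(0, 100) for _ in range(n)]
    v = [[rng.uniform(-10, 10) for _ in range(n)] for _ in range(periods)]
    base = simulate_output(g, y0, v)      # output process without any shock
    impacts = []
    for s in range(n):
        shocked = [row[:] for row in v]   # same random draws in both runs
        shocked[0][s] -= shock_size       # hit type s's value added in period 1
        path = simulate_output(g, y0, shocked)
        impacts.append([
            sum(base[t][j] - path[t][j] for t in range(periods)) / periods
            for j in range(n)
        ])
    return impacts

pi_toy = [[4, 1, 0], [2, 2, 1], [1, 0, 3]]   # invented type-by-type link counts
impacts = shock_impacts(pi_toy)
```

Running the function once with the baseline matrix and once with the revised matrix, using the same seed, yields the two impact matrices whose entries can then be compared type by type.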

We find that the higher the incidence of informality in a sector and region, the more we under-
estimate the adverse impact of a domestic output shock. Figure 10 shows that our established
employment-based measure of informality negatively correlates with the impact of the shock
under the two scenarios. To align with our most granular measure of informality, we aggregate
the response to shocks at the sector-region level.31 As shown in Table C4, a one percentage
point decrease in the formal sector share corresponds with a 4.9 percentage point larger output
drop following a domestic shock.

  29 We compute the impact of the shock on each type’s output over 100 periods of time by comparing the two
output processes. All of the outputs reported below are averages across the 100 time periods.
  30 We hold the random component of output, υjt , identical across these two scenarios for each type j
in every time period t so that it does not affect our results.
  31 Put differently, we average the impact across each buyer’s suppliers and then compute a weighted sum for
each sector-region cell. The weights are determined by a type’s entry probabilities.


 Figure 10: How do output shocks pass through in a revised network that takes into account
            informal firms? - % change in output drops and the level of formality

The above graph plots the percentage change of the output reduction in response to domestic output shocks for
two scenarios: the baseline network using only administrative data and the revised network including informal
firms. We aggregate the output reduction across buyer types at the sector and county level, weighted by the entry
probability p(θ) for each size and formality type. The x-axis shows the formal sector share for each sector-region
pair.


Figure C5 presents the distribution of the ratio of the shock impact at the sector-county level
(revised network/baseline network). This ratio exceeds one for 48% of the sector-county pairs
and 42% of the sector-county-size types, indicating that the baseline network underestimates
the impact of domestic shocks for these types. Among the types for which we underestimate
the impact, 73% are types with small firms, an indication that the omission of informal firms is
the primary driver of this result.
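The aggregation and comparison described above can be sketched as follows. The sketch assumes that type-level impacts are combined into sector-county cells as a weighted sum, with weights given by entry probabilities, and that the ratio is taken as revised over baseline. The function names and the toy numbers in the usage note are hypothetical.

```python
def aggregate_by_cell(impacts, cells, weights):
    """Weighted sum of type-level impacts within each sector-county cell."""
    totals = {}
    for impact, cell, weight in zip(impacts, cells, weights):
        totals[cell] = totals.get(cell, 0.0) + weight * impact
    return totals


def share_underestimated(baseline, revised):
    """Share of cells whose revised/baseline impact ratio exceeds one, plus the ratios."""
    ratios = {cell: revised[cell] / baseline[cell]
              for cell in baseline if baseline[cell] != 0}
    share = sum(1 for r in ratios.values() if r > 1) / len(ratios)
    return share, ratios
```

For example, with made-up impacts `[2.0, 4.0, 1.0]` for three types in cells `["a", "a", "b"]` and weights `[0.5, 0.5, 1.0]`, the baseline cell impacts are 3.0 and 1.0. If the revised cell impacts were 4.5 and 0.5, the ratio would exceed one for cell "a" only.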

It is important to note that by considering the changes in output reduction between the two
scenarios, we focus on relative shifts. These shifts are more notable for sector-county pairs that
face a relatively smaller impact initially due to their peripheral role in the formal sector network.
However, even after accounting for informal firms, the aggregate impact of the shock remains
largest for Nairobi- and Mombasa-based sectors. This mirrors the finding that the two cities
still account for a sizable share of outlinks, as discussed in the previous section.


5.4.2     Import Shocks

In addition to a domestic shock, we consider the impact of a reduction in output in response to
an adverse shock to international suppliers whom Kenyan firms source from. As before, firm j ’s
output can be written as follows:

$$
y_{jt} = \sum_{i} g_{ij}\, y_{i,t-1} + m_{jw}\, y_{w} + \upsilon_{jt} \tag{6}
$$


Firm j’s output now additionally depends on world output $y_w$ in line with its import share
$m_{jw}$, which we obtain from the administrative data. We re-normalise the rows of the adjacency
matrix such that $\sum_j g_{ij} + m_{jw} = 1$. Next, we simulate a series of negative shocks to $y_w$ and
analyse how they affect total output in the economy and the heterogeneous effects on various firm
types.
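A minimal sketch of this exercise is given below. It makes several assumptions that are not taken from the text: `g[i][j]` is the weight of supplier i in buyer j's output, the domestic weights of each buyer are rescaled so that they sum to one minus the buyer's import share, the drop in world output is permanent, and the impact is measured as the average deviation of output over the simulated periods. All function and parameter names are hypothetical.

```python
def rescale_domestic_weights(g, m):
    """Scale each buyer's domestic input weights so that they sum to 1 - m[j]."""
    n = len(g)
    scaled = [row[:] for row in g]
    for j in range(n):
        col_sum = sum(g[i][j] for i in range(n))
        if col_sum > 0:  # buyers without domestic suppliers are left unchanged
            for i in range(n):
                scaled[i][j] = g[i][j] * (1.0 - m[j]) / col_sum
    return scaled


def import_shock_impact(g, m, world_drop=1.0, periods=100):
    """Average output deviation per firm after a permanent drop in world output."""
    n = len(g)
    dev_prev = [0.0] * n
    totals = [0.0] * n
    for _ in range(periods):
        # Deviation from the unshocked path: domestic pass-through plus direct import exposure.
        dev_now = [sum(g[i][j] * dev_prev[i] for i in range(n)) - m[j] * world_drop
                   for j in range(n)]
        for j in range(n):
            totals[j] += dev_now[j]
        dev_prev = dev_now
    return [total / periods for total in totals]
```

Because the random component is held fixed across runs and the process is linear, working with deviations from the unshocked path is equivalent to simulating both paths and differencing them.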

The bottom graph of Figure C5 plots the impact of the shock relying on the revised network as a
proportion of the impact based on the baseline network. Unlike domestic shocks, our findings for
import shocks indicate that extrapolating from data on the formal sector network to the overall
economy leads to an overestimation of the reduction in output. The impact is consistently less
negative when using the revised network.

This effect is particularly pronounced in sectors and regions with a higher incidence of informal
activity (see Figure 11). Specifically, Table C4 shows that a 10 percentage point increase in
the informal sector share corresponds to a 1 percentage point overestimation of the reduction in
output. This pattern emerges because formal firms in predominantly informal markets have more
unobserved connections than captured in administrative data, reducing their effective exposure
to import shocks.

Why do the predictions differ for domestic and import shocks? When accounting for informality,
sectors and counties with a high share of informal activity become more prominent in
the network, making them more susceptible to economic shocks. Conversely, this adjustment
reduces the relative importance of formal-dominated sectors, which typically have higher import
shares and international exposure. By adjusting their prominence (i.e., modifying their entry
probabilities and considering informality), we find that the economy seems more resilient to
trade shocks but more vulnerable to domestic shocks. The intuition behind our result aligns
with the mechanism discussed in Di Giovanni and Levchenko (2012), who show that smaller
economies tend to experience more volatility due to having fewer firms and less diversification.
Applied to our setting, focusing only on formal sector firms leads to overstating the importance
of internationally-linked formal firms and underestimating the diversification of the regional
economy.


6    How Sensitive are Results to Alternative Linking Patterns?

What happens if we relax the assumption that informal firms have similar linking patterns to
small formal firms? Given our lack of disaggregated data on how informal firms link with the
formal sector, we stress-test our results relying on alternative assumptions motivated by evidence
provided in the literature and consistent with the stylized facts presented in Section 3. First,
informal firms almost exclusively sell to final consumers. Second, informal firms purchase from
formal firms, but not as much as formal firms purchase from each other (Böhme and Thiele,
2014; Gadenne et al., 2022). Finally, informal firms predominantly source locally (Amodio et al.,
2024). These assumptions allow us to construct extreme bounds for linking patterns that we
might observe if we were to have granular network data with informal firms.

Figure 11: How does a shock to import markets pass through in a revised network that takes
           into account informal firms? - Output drops and the level of formality

The above graph plots the percentage change of the output reduction in response to international trade shocks
for two scenarios: the baseline network using only administrative data and the revised network including informal
firms. We aggregate the output reduction across buyer types at the sector and county level, weighted by the entry
probability p(θ) for each size and formality type. The x-axis shows the formal sector share for each sector-region
pair.


6.1     Alternative Assumptions about Linking Patterns of Informal Firms

Consider now the three groups of firms: small formal firms $s_i \in F_s$, large formal firms $l_i \in F_l$,
and informal firms $n_i \in N$, with their types (i.e. sector and county composition) denoted by
$\theta_{s_i}$, $\theta_{l_i}$, and $\theta_{n_i}$ respectively. We will drop the index i in what follows as the conditions do not
vary across firms once we take their sector, county, and size (informal, small, or large) into
account.

We start by considering the sales patterns of informal firms. We consider two alternative linking
probabilities that capture the differential sales patterns of informal firms. In one scenario, we
set the probability that an informal firm sells to formal firms to zero. In a second scenario,
we assume that informal firms form one-fourth of their outlinks with formal firms, where the
sector and geographic composition of these links follows that of small formal firms. These
assumptions are motivated by stylized facts documented earlier, which show that micro, small,
and medium enterprises with sales below the VAT threshold rarely sell to large firms in Kenya.
This is also in line with findings in Böhme and Thiele (2014) for informal firms in six urban
centers in West Africa, and in Gadenne et al. (2022) for firms in West Bengal, India, that
participate in a simplified tax scheme and are similar to informal firms. Finally, it is consistent with
the setup of VAT systems, which require firms to obtain a receipt from their supplier in order to
claim input VAT (Pomeranz, 2015). In summary, for all formal and informal types we make the
following extreme assumptions regarding the sales patterns of informal firms:

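$$
\begin{aligned}
p(\theta_n, \theta_l) &= 0 &\qquad \text{sensitivity: } p(\theta_n, \theta_l) &= 0.25 \times p(\theta_s, \theta_l) \\
p(\theta_n, \theta_s) &= 0 &\qquad \text{sensitivity: } p(\theta_n, \theta_s) &= 0.25 \times p(\theta_s, \theta_s)
\end{aligned}
\tag{7}
$$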


We now turn to informal firms and their sourcing patterns. In this case, we assume that informal
firms source only 50% of their inputs from formal firms (large or small) in one scenario and 75%
in an additional sensitivity check. Conditional on sourcing from the formal sector, the sectoral
and geographic preferences of informal firms will again follow those of small formal firms. As
documented in our empirical section, MSMEs do source some of their inputs from large firms.
Böhme and Thiele (2014) find that informal firms buy about half as much from formal firms as
formal firms source from each other. Gadenne et al. (2022) find that the smallest
group of firms under the simplified tax scheme in West Bengal source a similar share from VAT
firms, while the largest non-VAT firms source as much as three-quarters of their inputs from
VAT firms. We rely on these point estimates to make the following assumption:

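$$
\begin{aligned}
p(\theta_l, \theta_n) &= 0.5 \times p(\theta_l, \theta_s) &\qquad \text{sensitivity: } p(\theta_l, \theta_n) &= 0.75 \times p(\theta_l, \theta_s) \\
p(\theta_s, \theta_n) &= 0.5 \times p(\theta_s, \theta_s) &\qquad \text{sensitivity: } p(\theta_s, \theta_n) &= 0.75 \times p(\theta_s, \theta_s)
\end{aligned}
\tag{8}
$$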
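To make the scaling in Equations 7 and 8 concrete, the following sketch derives the linking probabilities of informal firms from those of small formal firms. It assumes probabilities are stored in a dictionary keyed by (supplier group, buyer group), with "s", "l", and "n" standing for small formal, large formal, and informal firms. The function name, the data layout, and the toy probabilities in the usage note are illustrative assumptions.

```python
def informal_link_probs(p, sales_scale=0.0, sourcing_scale=0.5):
    """Add informal-firm linking probabilities to p, keyed by (supplier, buyer) group.

    sales_scale:    0.0 in the main scenario, 0.25 in the sensitivity check (Equation 7).
    sourcing_scale: 0.5 in the main scenario, 0.75 in the sensitivity check (Equation 8).
    """
    out = dict(p)
    for buyer in ("l", "s"):
        # Informal firms selling to formal buyers, scaled from small formal sellers.
        out[("n", buyer)] = sales_scale * p[("s", buyer)]
    for supplier in ("l", "s"):
        # Formal firms selling to informal buyers, scaled from small formal buyers.
        out[(supplier, "n")] = sourcing_scale * p[(supplier, "s")]
    return out
```

With hypothetical inputs such as `p[("l", "s")] = 0.6` and `p[("s", "s")] = 0.2`, the main scenario yields 0.3 and 0.1 for sourcing from large and small formal firms respectively, and zero for all sales to formal firms.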


Informal firms will then source the remainder of their inputs from other informal firms. This
requires assumptions about their preferences on which sectors and counties to source from.
Motivated by evidence from the literature where Amodio et al. (2024) find that the vast majority
of small, largely informal firms in Ethiopia obtain their inputs from local sources, we allow inputs
obtained from other informal firms to only be sourced locally.

Consider any counties a and b in the set of counties C and let θf,a be the type of small or large
firm f in Fs ∪ Fl in county a. The assumption implies the following:

$$
p(\theta_{n,a}, \theta_{n,b}) =
\begin{cases}
1 - \sum_{f \in F_s \cup F_l} p(\theta_f, \theta_n) \times p(\theta_{n,b}) & \text{if } a = b \\
0 & \text{if } a \neq b
\end{cases}
\tag{9}
$$
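The local-sourcing restriction can be illustrated with a short sketch. It adopts one possible reading of Equation 9, in which the share of inputs not sourced from formal firms is multiplied by the entry probability of informal firms in the buyer's county. This grouping of terms, the function name, and the argument names are assumptions made for illustration.

```python
def informal_to_informal_prob(formal_probs, entry_prob_b, county_a, county_b):
    """Linking probability between informal firms in counties a and b.

    formal_probs: probabilities p(theta_f, theta_n) over all formal supplier types f.
    entry_prob_b: entry probability p(theta_{n,b}) of informal firms in county b.
    """
    if county_a != county_b:
        return 0.0  # inputs from other informal firms are sourced locally only
    residual = 1.0 - sum(formal_probs)  # share not sourced from formal firms (assumed grouping)
    return residual * entry_prob_b
```

For instance, with hypothetical formal sourcing probabilities of 0.3 and 0.2 and an entry probability of 0.5, the within-county probability is 0.25, while any cross-county pair receives zero.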


Finally, we now compute separate p(θ)s for informal and small formal firms, splitting the main
term of Equation 4. For additional details see Section C.2.




6.2     Implications for the Revised Network

Figure 8 and Table C1 summarize how alternative assumptions about informal firms affect the
distribution of outlinks across counties. If we assume informal firms source half of their inputs
from the formal sector but have no formal buyers, inequality in county-level outlinks declines by
16% compared to our model-predicted network. This represents an 8.5 percentage point larger
reduction than our default scenario where small formal and informal firms share similar linking
patterns. The share of total outlinks accounted for by Nairobi- and Mombasa-based firms now
drops to 69%, down from 80% in the baseline network and 75% in the first scenario with
informal firms (see Table 5).

Allowing informal firms to sell 25% of their output to formal firms, instead of none, has little
implication for both the overall decline in inequality and Nairobi’s share of outlinks. Allowing
for a larger share of informal firms’ inputs to be sourced from the formal sector results in a
smaller reduction in inequality (11%) and a higher share of Nairobi- and Mombasa-based links
(73%), more closely matching the initial network with informal firms. This pattern is driven by
the greater reliance of informal firms on formal suppliers in this scenario, which in turn tend to
locate in counties with larger formal economies.
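The two summary statistics discussed above can be computed along the following lines. The text does not restate which inequality measure underlies the reported figures, so the Gini coefficient is used here purely as an illustrative stand-in. The function names and the county totals in the usage note are hypothetical.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = equal, near 1 = concentrated)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    ranked_sum = sum((rank + 1) * x for rank, x in enumerate(xs))
    return 2.0 * ranked_sum / (n * total) - (n + 1.0) / n


def hub_share(outlinks_by_county, hubs=("Nairobi", "Mombasa")):
    """Share of total outlinks accounted for by firms based in the hub counties."""
    total = sum(outlinks_by_county.values())
    return sum(outlinks_by_county.get(h, 0.0) for h in hubs) / total
```

As a usage example with made-up outlink counts, `hub_share({"Nairobi": 60, "Mombasa": 20, "Kisumu": 20})` returns 0.8. Comparing `gini` across the baseline and revised county-level outlink vectors gives a percentage change in inequality of the kind reported above.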

Our simulations of domestic and international trade shocks under these alternative assumptions
reinforce our earlier findings, showing even stronger relative effects in sectors and counties with
high informality (Figure C6). This more pronounced impact stems from increased intra-county
linkages among informal firms. While more prominent local linkages reduce spatial inequality
and urban concentration, they amplify the impact of domestic shocks through stronger within-
county multiplier effects.


7     Conclusion

Firm-to-firm transaction-level data sourced from tax records have become a valuable tool for
mapping domestic firm networks and trade flows. Aside from studying the micro dynamics
of firm-to-firm relationships, they allow researchers to compute often hard-to-observe flows of
goods and services across regions within national economies, which can be leveraged to estimate
the welfare implications of policy interventions. This paper shows that applying such data in
settings with high informality can pose important challenges. We find evidence that formal
sector trade patterns provide a skewed representation of overall economic activity: they are
more spatially concentrated and underestimate intra-regional trade in favor of trade with urban
hubs.

We incorporate informal firms into a structural model of network formation, leading to a revised
network that is more locally connected and spatially less concentrated. We further provide
evidence for the implications of extrapolating from formal sector data about the aggregate
impact of economic shocks, i.e. ignoring the presence of the informal sector. Simulations of the
pass-through of output shocks using the revised network reveal that formal sector data leads
us to underestimate the impact of domestic output shocks in regions and sectors with high
informality. Conversely, we may overstate the local output effects of international trade shocks
in sectors and regions with a high incidence of informality. This is because formal sector data
places more weight on formal firms with stronger links to international markets, when in fact
the overall economy has weaker ties to import markets.

Our findings are applicable to a variety of contexts with high informality across the world.
Informality matters because it occurs systematically: it is not randomly distributed across
sectors and geographies. Our proposed approach for incorporating informal firms is useful for
settings where data on the links of informal firms are not available with sufficient information
on heterogeneity across key dimensions (e.g. sector and geography) and researchers therefore
have to rely on secondary sources to recover the structure of the broader network. Our findings
highlight the value of complementing administrative firm-to-firm data with auxiliary data on
informality to draw inferences about the aggregate implications of economic policies.

An important question for future research, beyond the scope of this paper, is whether the
observed spatial concentration of formal sector firm networks is a result of market frictions or a
feature of structural transformation (Gollin, 2008). Understanding its drivers can inform policy
recommendations about the optimal distribution of formal economic activity across space.




References
Acemoglu, D., Carvalho, V. M., Ozdaglar, A. and Tahbaz-Salehi, A. (2012), ‘The network origins
  of aggregate fluctuations’, Econometrica 80(5), 1977–2016.

Adão, R., Carrillo, P., Costinot, A., Donaldson, D. and Pomeranz, D. (2022), ‘Imports, exports,
  and earnings inequality: Measures of exposure and estimates of incidence’, The Quarterly
  Journal of Economics 137(3), 1553–1614.

Ades, A. F. and Glaeser, E. L. (1995), ‘Trade and circuses: explaining urban giants’, The
  Quarterly Journal of Economics 110(1), 195–227.

Albert, C., Bustos, P. and Ponticelli, J. (2021), The effects of climate change on labor and
  capital reallocation, Technical report, National Bureau of Economic Research.

Alfaro-Ureña, A., Manelici, I. and Vasquez, J. P. (2022), ‘The effects of joining multinational
  supply chains: New evidence from firm-to-firm linkages’, The Quarterly Journal of Economics
  137(3), 1495–1552.

Almunia, M., Henning, D. J., Knebelmann, J., Nakyambadde, D. and Tian, L. (2023), Leveraging
  Trading Networks to Improve Tax Compliance: Experimental Evidence from Uganda, Centre
  for Economic Policy Research.

Amodio, F., Benveniste, E., Pham, H. and Sanfilippo, M. (2024), ‘The local (informal) multiplier
 of industrial jobs’. Mimeo.

Antràs, P., Chor, D., Fally, T. and Hillberry, R. (2012), ‘Measuring the upstreamness of
  production and trade flows’, American Economic Review 102(3), 412–16.

Arkolakis, C., Huneeus, F. and Miyauchi, Y. (2023), Spatial production networks, Technical
  report, National Bureau of Economic Research.

Atkin, D. and Donaldson, D. (2015), Who’s getting globalized? the size and implications of
  intra-national trade costs, Working Paper 21439, National Bureau of Economic Research.

Atkin, D. and Khandelwal, A. K. (2020), ‘How distortions alter the impacts of international
  trade in developing countries’, Annual Review of Economics 12(1).

Bacilieri, A., Borsos, A., Astudillo-Estevez, P. and Lafond, F. (2023), ‘Firm-level production
  networks: what do we (really) know’, INET Oxford Working Paper 2023.

Bailey, M., Gupta, A., Hillenbrand, S., Kuchler, T., Richmond, R. and Stroebel, J. (2021), ‘In-
  ternational trade and social connectedness’, Journal of International Economics 129, 103418.

Balboni, C., Boehm, J. and Waseem, M. (2023), ‘Firm adaptation in production networks:
  evidence from extreme weather events in pakistan’. Mimeo.

Baqaee, D. R. and Farhi, E. (2019), ‘The macroeconomic impact of microeconomic shocks:
  Beyond hulten’s theorem’, Econometrica 87(4), 1155–1203.

Bernard, A. B., Dhyne, E., Magerman, G., Manova, K. and Moxnes, A. (2022), ‘The ori-
  gins of firm heterogeneity: A production network approach’, Journal of Political Economy
  130(7), 1765–1804.

Bernard, A. B. and Moxnes, A. (2018), ‘Networks and trade’, Annual Review of Economics
  10, 65–85.



Bernard, A. B., Moxnes, A. and Saito, Y. U. (2019), ‘Production networks, geography, and firm
  performance’, Journal of Political Economy 127(2), 639–688.

Blanchard, P., Gollin, D. and Kirchberger, M. (2021), ‘Perpetual motion: Human mobility and
  spatial frictions in three african countries’, CEPR Discussion Papers No. 16661.

Böhme, M. H. and Thiele, R. (2014), ‘Informal–formal linkages and informal enterprise
  performance in urban west africa’, The European Journal of Development Research 26, 473–489.

Boken, J., Gadenne, L., Nandi, T. and Santamaria, M. (2023), ‘Community networks and trade’,
  CEPR Working Paper DP17787.

Bramoullé, Y., Currarini, S., Jackson, M. O., Pin, P. and Rogers, B. W. (2012), ‘Homophily and
  long-run integration in social networks’, Journal of Economic Theory 147(5), 1754–1786.

Bustos, P., Garber, G. and Ponticelli, J. (2020), ‘Capital accumulation and structural trans-
  formation’, The Quarterly Journal of Economics 135(2), 1037–1094.

Carvalho, V. M., Nirei, M., Saito, Y. U. and Tahbaz-Salehi, A. (2021), ‘Supply chain disrup-
  tions: Evidence from the great east japan earthquake’, The Quarterly Journal of Economics
  136(2), 1255–1321.

Castro-Vincenzi, J., Khanna, G., Morales, N. and Pandalai-Nayar, N. (2024), Weathering the
  storm: Supply chains and climate risk, Technical report, National Bureau of Economic Re-
  search.

Chacha, P. W., Kirui, B. K. and Wiedemann, V. (2024), ‘Supply chains in times of crisis:
  Evidence from kenya’s production network’, World Development 173, 106363.

Chandrasekhar, A. (2016), ‘Econometrics of network formation’, The Oxford Handbook of the
  Economics of Networks pp. 303–357.

Chaney, T. (2014), ‘The network structure of international trade’, American Economic Review
  104(11), 3600–3634.

Cordaro, F., Fafchamps, M., Mayer, C., Meki, M., Quinn, S. and Roll, K. (2022), Microequity
  and mutuality: Experimental evidence on credit with performance-contingent repayment,
  Technical report, National Bureau of Economic Research.

De Paula, A. and Scheinkman, J. A. (2010), ‘Value-added taxes, chain effects, and informality’,
  American Economic Journal: Macroeconomics 2(4), 195–221.

Demir, B., Javorcik, B., Michalski, T. K. and Ors, E. (2022), ‘Financial Constraints and Propaga-
  tion of Shocks in Production Networks’, The Review of Economics and Statistics pp. 1–46.

Demir, B., Javorcik, B. and Panigrahi, P. (2024), ‘Breaking invisible barriers: Does fast internet
  improve access to input markets?’, CESifo Working Paper 11567.

Di Giovanni, J. and Levchenko, A. A. (2012), ‘Country size, international trade, and aggregate
  fluctuations in granular economies’, Journal of Political Economy 120(6), 1083–1132.

Dix-Carneiro, R., Goldberg, P. K., Meghir, C. and Ulyssea, G. (2024), Trade and domestic
  distortions: The case of informality, Technical report.

Donaldson, D. (2025), Transport infrastructure and policy evaluation, in ‘Handbook of Regional
 and Urban Economics, vol. 6’, Elsevier North Holland Amsterdam.



Eaton, J., Kortum, S. and Kramarz, F. (2011), ‘An anatomy of international trade: Evidence
  from French firms’, Econometrica 79(5), 1453–1498.

Eaton, J., Kortum, S. S. and Kramarz, F. (2022), Firm-to-firm trade: Imports, exports, and the
  labor market, Technical report, National Bureau of Economic Research.

Elgin, C., Kose, M. A., Ohnsorge, F. and Yu, S. (2021), ‘Understanding informality’, CEPR
  Discussion Paper 16497.

Fafchamps, M. (2003), Market institutions in sub-Saharan Africa: Theory and evidence, MIT
  press.

Fan, T., Peters, M. and Zilibotti, F. (2023), ‘Growing like india—the unequal effects of service-
  led growth’, Econometrica 91(4), 1457–1494.

Gabaix, X. (2009), ‘Power laws in economics and finance’, Annu. Rev. Econ. 1(1), 255–294.

Gadenne, L., Nandi, T. K. and Rathelot, R. (2022), ‘Taxation and supplier networks: Evidence
 from india’, Working Paper .

Goldberg, P. K. and Reed, T. (2023), ‘Demand-side constraints in development: The role of
 market size, trade, and (in)equality’, Econometrica .

Gollin, D. (2008), ‘Nobody’s business but my own: Self-employment and small enterprise in
 economic development’, Journal of Monetary Economics 55(2), 219–233.

Grant, M. and Startz, M. (2022), Cutting out the middleman: The structure of chains of
  intermediation, Technical report, National Bureau of Economic Research.

Hansman, C., Hjort, J., León-Ciliotta, G. and Teachout, M. (2020), ‘Vertical integration,
 supplier behavior, and quality upgrading among exporters’, Journal of Political Economy
 128(9), 3570–3625.

Herrendorf, B., Rogerson, R. and Valentinyi, A. (2022), New evidence on sectoral labor pro-
  ductivity: Implications for industrialization and development, Technical report, National
  Bureau of Economic Research.

Huremovic, K., Jiménez, G., Moral-Benito, E., Peydró, J.-L. and Vega-Redondo, F. (2024),
 ‘Production and financial networks in interplay: Crisis evidence from supplier-customer and
 credit registers’, Available at SSRN 4657236 .

Jackson, M. O. and Rogers, B. W. (2007), ‘Meeting strangers and friends of friends: How random
  are social networks?’, American Economic Review 97(3), 890–915.

Jefferson, M. (1939), ‘The law of the primate city’, Geographical Review 29(2), 226–232.

Jefferson, M. (1989), ‘Why geography? The law of the primate city’, Geographical Review
  79(2), 226–232.

Klenow, P. J. and Rodriguez-Clare, A. (1997), ‘The neoclassical revival in growth economics:
  Has it gone too far?’, NBER macroeconomics annual 12, 73–103.

KNBS (2010), ‘Basic Report on the 2010 Census of Industrial Production’.

KNBS (2016), Micro, Small and Medium Establishment (MSME) Survey: Basic Report 2016,
 Technical report, Kenya National Bureau of Statistics.

KNBS (2017), Report on the 2017 Kenya Census of Establishments (CoE), Technical report,
 Kenya National Bureau of Statistics.

KNBS (2019), 2019 Kenya Population and Housing Census: Volume I, Technical report, Kenya
 National Bureau of Statistics.

KNBS (2022), Gross county product 2021 (gcp), Technical report, KNBS.
 URL: https://www.knbs.or.ke/reports/kenya-gross-county-product-2021/

Kreindler, G. E. and Miyauchi, Y. (2023), ‘Measuring commuting and economic activity inside
  cities with cell phone records’, Review of Economics and Statistics 105(4), 899–909.

Liu, E. (2019), ‘Industrial policies in production networks’, The Quarterly Journal of Economics
  134(4), 1883–1948.

Lü, L. and Zhou, T. (2011), ‘Link prediction in complex networks: A survey’, Physica A:
  statistical mechanics and its applications 390(6), 1150–1170.

Meagher, K. (2013), ‘Unlocking the informal economy: A literature review on linkages between
 formal and informal economies in developing countries’, Work. ePap 27, 1755–1315.

Memon, P. A. (1976), ‘Urban primacy in kenya’, IDS Working Paper Series, University of
 Nairobi 282.

Miyauchi, Y. (2024), ‘Matching and agglomeration: Theory and evidence from japanese firm-
 to-firm trade’, Econometrica 92(6), 1869–1905.

Mungo, L., Lafond, F., Astudillo-Estévez, P. and Farmer, J. D. (2023), ‘Reconstructing
  production networks using machine learning’, Journal of Economic Dynamics and Control
  148, 104607.

Naritomi, J. (2019), ‘Consumers as tax auditors’, American Economic Review 109(9), 3031–72.

Newman, M. E. (2006), ‘Modularity and community structure in networks’, Proceedings of the
  national academy of sciences 103(23), 8577–8582.

Panigrahi, P. (2022), ‘Endogenous spatial production networks: Quantitative implications for
  trade and productivity’, Working Paper .

Pomeranz, D. (2015), ‘No taxation without information: Deterrence and self-enforcement in the
  value added tax’, American Economic Review 105(8), 2539–69.

Sargent, T. J. and Stachurski, J. (2022), ‘Economic networks: Theory and computation’, arXiv
  preprint arXiv:2203.11972 .

Soo, K. T. (2005), ‘Zipf’s law for cities: a cross-country investigation’, Regional Science and
  Urban Economics 35(3), 239–263.

Startz, M. (2021), ‘The value of face-to-face: Search and contracting problems in nigerian trade’,
  Working Paper .

Storeygard, A. (2016), ‘Farther on down the road: transport costs, trade and urban growth in
  sub-saharan africa’, The Review of Economic Studies 83(3), 1263–1295.

Topalova, P. (2010), ‘Factor immobility and regional impacts of trade liberalization: Evidence
  on poverty from india’, American Economic Journal: Applied Economics 2(4), 1–41.

Traag, V. A., Waltman, L. and Van Eck, N. J. (2019), ‘From louvain to leiden: guaranteeing
  well-connected communities’, Scientific reports 9(1), 1–12.

Ulyssea, G. (2018), ‘Firms, informality, and development: Theory and evidence from brazil’,
  American Economic Review 108(8), 2015–47.

Zárate, R. D. (2022), Spatial misallocation, informality, and transit improvements: Evidence
  from mexico city, The World Bank.

Zhou, Y. (2022), The value added tax, cascading sales tax, and informality, in M. Bussolo
  and S. Sharma, eds, ‘Hidden Potential: Rethinking Informality in South Asia’, World Bank
  Publications, pp. 61–90.




Appendix

A     Material for Data and Empirical Section

A.1      Supplementary Graphs and Tables

                      Figure A1: Composition of sales and purchases by sector

                         Sales                                                Purchases




The figures in the first row show sector-level aggregate sales (domestic + exports) and purchases (domestic +
imports) for 2019. In the second row, we plot the share of each buyer and supplier type as a percentage of total
sector-level sales and purchases.




                  Figure A2: Proportion of total network links by firm attributes




The above figure plots the share of total firm-to-firm links accounted for by different groups of suppliers and
buyers. Panels are organized by sector (top), location (middle), and size (bottom). Bars represent the proportion
of total links observed in the administrative data. Q1–Q4 refer to size quartiles, with Q1 being the smallest firms
and Q4 the largest firms based on firms’ average sales volumes for the years 2016-2019.




                  Figure A3: Firm headquarter locations and population density

          Geographic density of firms                                   Population density




The left map shows the density of firm headquarter locations at the sub-county level, i.e. the number of firms
per km2 . The right map shows the population density - also at the sub-county level. The borders of Kenya’s 47
counties, the first administrative layer, are outlined in grey.




Table A1: Firms in counties with a higher informal sector share have fewer links in the administrative data

                                         Total                 Mean                  Median                90th percentile       Final demand
                                         buyers     suppliers  buyers     suppliers  buyers     suppliers  buyers     suppliers  %
Formal sector share (sector-county, %)   0.043***   0.037***   0.016**    0.009*     0.011**    0.007*     0.019**    0.011*     -0.166
                                         (0.014)    (0.012)    (0.007)    (0.005)    (0.005)    (0.004)    (0.008)    (0.006)    (0.181)
Population                               1.559***   1.415***   0.441***   0.282***   -0.180     0.072      0.543***   0.455***   -5.515*
                                         (0.294)    (0.266)    (0.138)    (0.090)    (0.132)    (0.088)    (0.140)    (0.117)    (2.763)
Travel time to Nairobi                   -0.699***  -0.591***  -0.301**   -0.170**   0.007      -0.139**   -0.318**   -0.225**   0.424
                                         (0.243)    (0.178)    (0.128)    (0.078)    (0.083)    (0.067)    (0.132)    (0.098)    (2.200)
Travel time to Mombasa                   -0.552**   -0.449**   -0.326**   -0.246***  -0.164     -0.282***  -0.378***  -0.262***  6.256***
                                         (0.244)    (0.176)    (0.132)    (0.083)    (0.127)    (0.084)    (0.137)    (0.095)    (1.998)
No. observations                         450        472        450        472        379        471        450        472        470
R2                                       0.469      0.540      0.400      0.326      0.266      0.315      0.408      0.307      0.242
Sector FE                                ✓          ✓          ✓          ✓          ✓          ✓          ✓          ✓          ✓

In this table we regress the number of firm-to-firm links aggregated at the sector and county level on formal
sector employment shares from the population census, which we observe at the same level of disaggregation. In the
last column we regress the share of sales to non-registered entities (consumers or firms outside the VAT system)
on the formal sector share. Standard errors are clustered at the county level.




                                    Figure A4: County-level formal sector shares




The above graphs plot the probability density function (pdf) for the dispersion of formal sector shares across
Kenya’s 47 counties. The value added-based measure relies on the difference between county-sector-level national
accounts and the administrative data to obtain formal sector shares. For the employment-based measure we rely
on information about formal and informal employment at the sector and county level from the 2019 census.




Figure A6: The extensive margins of informality - in which sectors do informal firms operate?




The graph compares the number of firms covered in the administrative data and with an annual revenue of over
KShs 5 million in 2016 to the number of firms with annual revenues above KShs 5 million in the 2016 Census
of Establishments (CoE) (KNBS, 2017) as well as any firm captured in either data set. Further, it plots the number of
licensed and unlicensed businesses reported in KNBS (2016).




                       Figure A5: Informality, market size, and income levels

     [Two panels: correlation of the formal sector share with (i) Gross County Product and (ii) Gross County Product per capita]




The two graphs plot the correlation of the formal sector share with the Gross County Product in absolute and
per capita terms respectively. Each marker represents one of Kenya’s 47 counties.




                 Figure A7: Formal and informal employment in private enterprises




The graph compares the number of formal employees employed in VAT-paying firms to the number of people who
stated they are formally or informally employed in a private sector entity. To improve readability we omit the
bar for informal employment in the agricultural sector. As per the 2019 population census, over 11 million people are
informally employed in the agricultural sector.




                      Figure A8: The GDP/value-added gap and upstreamness




We plot the gap between value-added in the VAT and national accounts figures at the sub-sector level for the
most granular sector classification reported in national accounts. We correlate it with a measure of upstreamness
(Antràs et al., 2012), which captures how removed a sector is from final consumers (it takes a value of one if the
sector sells everything directly to final consumers).
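
As an illustration of how an upstreamness measure of this kind behaves, the following is a minimal sketch. It assumes the recursive definition U_i = 1 + Σ_j (Z_ij / Y_i) U_j, where Z_ij denotes sales of sector i to sector j and Y_i is the gross output of sector i. The function name, the two-sector input data, and the fixed-point iteration are illustrative assumptions, not the code used for the figure.

```python
def upstreamness(sales, final_demand, n_iter=200):
    """Iterate U_i = 1 + sum_j (sales[i][j] / Y_i) * U_j to a fixed point.

    sales[i][j]     : value sold by sector i to sector j (intermediate use)
    final_demand[i] : value sold by sector i to final consumers
    """
    n = len(sales)
    # Gross output of each sector: intermediate sales plus final demand.
    output = [sum(sales[i]) + final_demand[i] for i in range(n)]
    u = [1.0] * n
    for _ in range(n_iter):
        u = [1.0 + sum(sales[i][j] / output[i] * u[j] for j in range(n))
             for i in range(n)]
    return u


# Hypothetical toy economy: sector 0 sells only to sector 1,
# sector 1 sells only to final consumers.
toy_sales = [[0.0, 10.0],
             [0.0, 0.0]]
toy_final = [0.0, 25.0]
toy_u = upstreamness(toy_sales, toy_final)
print(toy_u)
```

In this toy economy the sector that sells everything to final consumers obtains a value of one, while its supplier sits one step further upstream.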




A.2     Margins of informality

Table A2 summarises four different margins of informality that can occur in firm networks: an ex-
tensive margin at the firm level and an intensive margin at the transaction level. Within each cat-
egory, informality can occur due to either non-compliance or simply because a firm/transaction
is too small to be taxed. A formally registered wholesaler we interviewed in Nairobi’s Central
Business District explained how the notion of an extensive and intensive margin of informality,
well established in the literature on labour markets (Ulyssea, 2018), extends to firm-to-firm
transactions:
      “All firms purchase from manufacturers and importers paying input VAT. They even have an interest
      in getting purchases that have VAT on it to inflate the input VAT. What they do to mitigate the
      VAT levy, [is that] they downplay their output VAT (i.e. sales). Some customers will purchase with
      receipt and output VAT on it. Some customers will purchase without a receipt.”

Moreover, VAT exemptions can be a legal reason why firms or transactions above the VAT
threshold are not captured in administrative tax records.

                         Table A2: Margins of informality in firm networks

                                                Extensive             Intensive
                   Below tax threshold          Small firms        Small transactions
                   Above tax threshold        Non-compliance        Non-compliance


Depending on the tax code, not all four margins of informality arise in every setting. The only
margin of informality that is not relevant in our setting is the potential neglect of small VAT
transactions. This is because the Kenya Revenue Authority requires firms to record transactions
of any size, conditional on both parties being VAT-registered. Moreover, we address some issues
around non-compliance by relying on information from a firm’s trade partner to recover some
omitted transactions and under-reported trade volumes.


A.3     Measures of Informality

As documented in Table A3, the two employment-based KNBS measures correlate well with
all measures based on the administrative data. The measure capturing licensed businesses as
a share of the universe of businesses in Kenya (including micro-enterprises), in contrast,
correlates only weakly with them. This likely reflects the fact that many of the licensed firms are
very small themselves and their geographic dispersion does not correlate as strongly with the
tax records. Employment in licensed businesses (second row) is likely concentrated in
larger licensed firms, which is why the employment-based measure aligns more strongly with the
administrative data than the simple firm count.




                              Table A3: Correlation of formality measures

                                                Formality measures based on admin data
       KNBS measures                            No. firms Employment         Value added
       Employment (census)                           0.78       0.83                 0.78
       Employment (licensed MSMEs)                   0.58       0.69                 0.62
       No. firms (licensed)                          0.20       0.16                 0.11
The above table shows the correlation coefficients of different measures of the formal sector share. Each measure
represents a share, i.e. captures the proportion of economic activity that can be attributed to the formal sector.
The labels indicate the underlying unit of measurement and the source of the data. All measures are aggregated
at the county level.




Figure A9: Comparison of formal sector shares based on census versus administrative records




The above graph correlates the share of the formal sector computed using employment figures from the admin-
istrative records with the share of formal private sector employment as per the 2019 population census (KNBS,
2019). Each marker represents a county. The size of each marker is proportional to the economic size of the
county, i.e. its Gross County Product. To avoid mechanical correlation between the two measures we use total
employment in licensed firms as the denominator for the administrative data. The KNBS estimate for employment
in licensed firms is based on micro data that is distinct from the population census.




A.4      The VAT-Paying Sector as a Share of GDP

The most relevant sector that is not well captured in the VAT data is agriculture, which gener-
ates 21%-23% of Kenya’s GDP. While part of the sector receives special tax treatment due to
exemptions of mainly unprocessed agricultural commodities, some of the GDP gap can also be
attributed to informality in the classic sense due to the prevalence of small holders in the sector.
Non-market services include education, health, public administration, and real estate (Herren-
dorf et al., 2022). They contribute 22% to Kenya’s GDP, but are barely represented in the
VAT data as most of the entities operating in these sectors are VAT exempt, not-for-profit, or
the underlying sector’s size in the national accounts is estimated using non-market prices (see
penultimate column of Table A4). Figure 6 highlights another sizeable gap for “others”, which
includes international organisations, unclassified firms, and financial services.

Table A4 illustrates that the value added generated by the VAT sector has been declining over
time as a proportion of GDP. This downward trend in value added can be attributed to two
factors. First, the introduction of a tax on fuel, which was previously VAT exempt, in September
2018 has led to a reduction in value added. The impact of this tax is particularly relevant for
the utilities sector. However, this sector alone cannot fully explain the overall downward trend
and kink in the data. Second, sectors that have significantly contributed to Kenya’s growth over
the years, such as agriculture, real estate, financial services, and public administration, are not
well captured in the VAT data.

                    Table A4: Share of GDP covered in the administrative records

                                        Share of GDP (%)
     Year     All   ex Fin.   ex NMS+Fin.   ex Agri.   ex NMS+Fin.+Agri.   NMS   Agri.
     2015      36        39            50         42                  66    22      21
     2016      40        43            56         46                  73    22      21
     2017      37        40            52         45                  71    22      22
     2018      37        40            52         45                  70    22      22
     2019      28        30            39         34                  53    22      23
The mid-section of the above table reports the share of GDP captured by the VAT data sequentially excluding
(ex ) specific sectors. Fin. refers to financial services. NMS refers to non-market services, i.e. education, health,
public administration, and real estate (Herrendorf et al., 2022). Agri. refers to the agricultural sector. The first
five data columns report the proportion of GDP captured by value added of the VAT-paying firms. The final two
columns report the GDP share of non-market services and agriculture respectively. GDP figures are based on
national accounts data (in current prices) published by the Kenya National Bureau of Statistics.




A.5      Firm Location and Relationships Drive Spatial Concentration in
         Trade Flows

The extensive margins of the firm network, namely firm location and firm-to-firm relationships, account
for 70%-90% of the variation in aggregate trade volumes. Using transaction-level data, we are
able to distinguish between four different sales margins: the number of firms, the number of
relationships with buyers per firm, the number of transactions per relationship, and the average
trade volume per transaction. The same is true for purchases. Table A5 summarises the share
of the variance attributed to each term in both upstream (purchases) and downstream (sales)
trade flows.32 The number of firms operating in each county accounts for as much as 67% of
the variance in purchases across counties. This includes purchases the firms make within their
own county and what they buy outside the county. The number of supplier relationships other
counties have with the county accounts for yet another 22%. This leaves a little over 10% of
the variance to be picked up by the intensive margins for trade, i.e. the number of transactions
between firm pairs and the average transaction volume. Turning to downstream trade flows, i.e.
the decomposition of the variance in sales across (sub-)counties, the location of firms plays a
slightly less important role. Instead the number of firm-to-firm relationships now accounts for
one third of the variance in network sales.
                  Table A5: Geographic concentration of economic activity in Kenya

                                                  Purchases

 Aggregation       No. firms   No. relationships/firm   No. transactions/relation   Avg. volume/transaction
 County                 0.67                     0.22                        0.14                     -0.04
 Subcounty              0.53                     0.29                        0.16                      0.06

                                                     Sales

 Aggregation       No. firms   No. relationships/firm   No. transactions/relation   Avg. volume/transaction
 County                 0.60                     0.31                        0.12                     -0.00
 Subcounty              0.39                     0.34                        0.15                      0.16
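
To make the logic of such a decomposition concrete, the following minimal sketch splits the variance of log trade volumes into four margins. It assumes the standard covariance-based approach, in which log volume is written as the sum of four log components and each component's share is its covariance with log volume divided by the variance of log volume. The function name and the toy county figures are hypothetical and serve illustration only.

```python
import math


def variance_shares(counties):
    """Split the variance of log trade volume into four margins.

    Each county is a tuple (n_firms, n_relationships, n_transactions, volume).
    Returns the shares for: firms, relationships per firm,
    transactions per relationship, and volume per transaction.
    """
    comps, totals = [], []
    for n, r, t, v in counties:
        # The four log components add up to log(v) by construction.
        comps.append([math.log(n), math.log(r / n),
                      math.log(t / r), math.log(v / t)])
        totals.append(math.log(v))
    mean_tot = sum(totals) / len(totals)
    var_tot = sum((y - mean_tot) ** 2 for y in totals)
    shares = []
    for k in range(4):
        xs = [c[k] for c in comps]
        mean_x = sum(xs) / len(xs)
        cov = sum((x - mean_x) * (y - mean_tot) for x, y in zip(xs, totals))
        shares.append(cov / var_tot)
    return shares


# Hypothetical county-level aggregates, not taken from the paper.
toy_counties = [(100, 400, 2000, 50000.0),
                (10, 25, 100, 1500.0),
                (40, 120, 480, 9000.0),
                (5, 8, 20, 400.0)]
toy_shares = variance_shares(toy_counties)
print(toy_shares)
```

By construction the four shares sum to one, and an individual share can be negative when a margin covaries negatively with total volume.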




A.6       Spatial Concentration of Economic Activity and Multi-Establishment
          Firms

A potential concern with the VAT data is that it may overstate spatial concentration because
firms are only required to report their headquarters’ locations, which are often situated in major
cities like Nairobi or Mombasa. To assess the sensitivity of measures of spatial concentration
to multi-establishment firms, we use micro-data from the 2010 Census of Industrial Production
(KNBS, 2010), which includes the mining, manufacturing, and utilities sectors. We compare
the spatial distribution of sales and firm locations for all firms, including those with multiple
branches, to that of single-establishment firms in Table A6. Firms covered in the Census of
Industrial Production overlap closely with the group of VAT-paying firms we observe in the tax
records. A 1:1 mapping is not possible due to the anonymous nature of the data sets. However,
 32
      Our decomposition follows Klenow and Rodriguez-Clare (1997); Eaton et al. (2011); Panigrahi (2022).


the overall number of industrial firms observed in each of the two data sources aligns closely.
In 2015, we observed 4,064 VAT-paying firms33 in mining, manufacturing, and utilities, while
KNBS (2010) covered 2,252 firms five years earlier.

Of all firms involved in industrial production, 48% are located in Nairobi County generating as
much as 61% of total sales in 2010.34 When we limit the data from the Census of Industrial
Production to single establishments, the overall concentration of firm locations does not change.
The concentration becomes even slightly more unequal once we consider sales instead of purely
counting the number of firms. We, however, overstate the concentration of sales in Nairobi by six
percentage points if multi-establishment firms are in the sample and their sales are aggregated
geographically based on headquarter information only (i.e. the measure we obtain from the
VAT data by default). Despite this, the discrepancy is not large enough to fully account for the
higher spatial concentration observed among VAT-registered firms compared to overall economic
activity.

                       Table A6: Geographic concentration of industrial activity

                                        All firms          Single est. firms
                                 Nairobi (%)        α      Nairobi (%)    α
                      Census of Industrial Production (2010)
                                                N = 2252
                      No. firms       48           0.54        48        0.54
                      Sales           61           0.32        55        0.30

                      Industrial firms in admin data (2015)
                                                N = 4064
                      No. firms         64         0.50                      -            -
                      Sales             69         0.21                      -            -
The columns for Nairobi report its share of the respective national aggregate figures (e.g., the share of industrial
establishments located in Nairobi). α is the estimated coefficient from a county-level rank regression of each
county’s rank (log) on the respective measure x (log): log rank = log A − α log x. The Census of Industrial
Production was carried out by KNBS (2010).
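
The rank regression described in the note can be sketched as follows. This is a minimal illustration assuming ordinary least squares on log rank and log x, with rank one assigned to the largest county; the function name and the synthetic values are hypothetical and not taken from either data source.

```python
import math


def rank_size_alpha(values):
    """Estimate alpha in log(rank) = log(A) - alpha * log(x) by OLS."""
    ordered = sorted(values, reverse=True)      # rank 1 = largest county
    log_x = [math.log(v) for v in ordered]
    log_rank = [math.log(i + 1) for i in range(len(ordered))]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_rank) / len(log_rank)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_rank))
             / sum((x - mean_x) ** 2 for x in log_x))
    return -slope                               # alpha is minus the slope


# Synthetic example: x falls with the square of the rank, so alpha = 0.5.
toy_values = [1000 / rank ** 2 for rank in range(1, 11)]
toy_alpha = rank_size_alpha(toy_values)
print(round(toy_alpha, 3))
```

Under this specification x is proportional to rank raised to the power -1/α, so a smaller α corresponds to a steeper decline across ranks, i.e. a more concentrated distribution.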




   33
      The earliest year for which the VAT records have been fully digitised is 2015. A later Census of Industrial
Production is available for 2018. However, the data set published by KNBS does not include any information on
firm locations. Further, information on sales is missing for over half of the firms.
   34
      The figures for Kenya are similar to the concentration of formal manufacturing firms reported by Storeygard
(2016) for Tanzania. Dar es Salaam, Tanzania’s primate city, accounts for 8% of its population (Storeygard, 2016)
- a very similar figure to Nairobi’s population share in Kenya (KNBS, 2019).


B     Material for Model Setup and Estimation

B.1      Supplementary Graphs and Tables

                                   Table B1: Linking patterns of small buyers

                     Manufacturing   Wholesale    Retail    Nairobi     Mombasa   Same county   Bigger supplier   Final demand
  Small buyer          -0.023***      0.011      0.038***   -0.046***    -0.003    0.037***         0.002            0.030*
                         (0.01)       (0.01)      (0.01)      (0.01)     (0.01)     (0.01)          (0.00)           (0.02)
  No. observations        892           892        892         892        892         892            892              850
  R2                     0.585         0.593      0.568       0.721      0.860       0.872          0.477            0.637
  Sector-county FE         ✓             ✓          ✓           ✓          ✓           ✓              ✓                ✓

We group firms by sector, county, and size. Small firms represent the bottom sales quartile of a sector and county.
We then compute the share of overall links the firm has with another sector-county-size group. We then aggregate
the share for suppliers with specific characteristics (e.g. any wholesaler, irrespective of location) for each type
of buyer (sector-county-size). The column titles list the characteristics of the suppliers. Finally, we regress the
respective sum of shares on whether or not the buyer type is a small buyer type.




                                          Figure B1: Degree distributions




The figure shows log-log plots of the probability density function (pdf) against firm outdegree and indegree
respectively. The coefficients α shown at the bottom of the plot correspond to the power law exponent indicating
the existence of a thicker tail for the outdegree distribution.




                         Figure B2: Average in- and outdegrees across space

                 Average outdegree                                       Average indegree




The above map plots the average in- and outdegree of firms for each sub-county. The borders of Kenya’s 47
counties, the first administrative layer, are outlined in grey.


                          Figure B3: County-level average in- and outdegree




The histogram plots the average in- and outdegree across firms in each county.
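
For readers who wish to reproduce averages of this kind from a list of supplier-buyer pairs, a minimal sketch is given here. It assumes that outdegree counts a firm's distinct buyers and indegree its distinct suppliers; the firm identifiers, county assignments, and function name are hypothetical.

```python
from collections import defaultdict


def average_degrees(edges, firm_county):
    """Average out- and indegree of firms per county.

    edges       : iterable of (supplier, buyer) firm pairs
    firm_county : dict mapping firm id -> county name
    Returns {county: (average outdegree, average indegree)}.
    """
    buyers = defaultdict(set)      # supplier -> distinct buyers
    suppliers = defaultdict(set)   # buyer -> distinct suppliers
    for supplier, buyer in edges:
        buyers[supplier].add(buyer)
        suppliers[buyer].add(supplier)
    sums = defaultdict(lambda: [0, 0, 0])   # county -> [out, in, n_firms]
    for firm, county in firm_county.items():
        sums[county][0] += len(buyers[firm])
        sums[county][1] += len(suppliers[firm])
        sums[county][2] += 1
    return {c: (o / n, i / n) for c, (o, i, n) in sums.items()}


# Hypothetical three-firm network.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c")]
toy_location = {"a": "Nairobi", "b": "Nairobi", "c": "Kisumu"}
toy_degrees = average_degrees(toy_edges, toy_location)
print(toy_degrees)
```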




                          Figure B4: Objective function for various values r




The graph plots the sum of squared differences between each element of the model-predicted interaction matrix
π and the corresponding element of the matrix π directly observed in the data, for various values of the parameter
r ∈ [0, 1]. The figure shows that r∗ = 0.45, obtained via simulated annealing, minimises the objective function.
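
The objective function can be illustrated with a small sketch. The code evaluates the sum of squared element-wise differences on a grid of candidate values for r; it uses a plain grid search rather than simulated annealing, and `toy_model` is a hypothetical stand-in for the model-predicted matrix, with the minimum placed at an arbitrary value of 0.30.

```python
def objective(pi_model, pi_data):
    """Sum of squared element-wise differences between two matrices."""
    return sum((m - d) ** 2
               for row_m, row_d in zip(pi_model, pi_data)
               for m, d in zip(row_m, row_d))


def toy_model(r):
    """Hypothetical stand-in for the model-predicted matrix at a given r."""
    return [[r, 1.0 - r],
            [1.0 - r, r]]


# Hypothetical "observed" matrix, constructed so the minimum is at r = 0.30.
pi_observed = [[0.30, 0.70],
               [0.70, 0.30]]

grid = [i / 100 for i in range(101)]   # candidate values of r in [0, 1]
best_r = min(grid, key=lambda r: objective(toy_model(r), pi_observed))
print(best_r)
```

A grid search is feasible here only because the toy model is cheap to evaluate; a stochastic optimiser such as simulated annealing is the natural choice when each evaluation is costly.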




                Figure B5: Model Fit - actual and predicted outdegree distribution




The figure plots the inverse CDF for the actual and model-predicted total outdegree for each type (i.e. sector-county-size
cell). The number of outdegrees is standardised. Note the log scale on both the x- and the y-axis.




B.2       Constructing Type-by-Type Matrix πmodel (θ, θ′ ; r)

Each iteration of the estimation requires two additional steps to construct $\pi_t$, which captures
type-by-type network linkages at time $t$.

First, based on Bramoullé et al. (2012)’s formula, predicting $\pi_t$ requires us to compute a geometric
series of matrix $B$. For ease of computation, we restrict this to the first five entries of
the geometric series as subsequent matrix entries become negligible.
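
The truncation can be illustrated with a minimal sketch. It assumes that the series takes the form I + B + B^2 + ... and that "the first five entries" refers to the terms up to B^4; both the starting term and the toy matrix are assumptions made for illustration, since the exact form of B follows from the model and is not reproduced here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def truncated_series(b, n_terms=5):
    """Return I + B + B^2 + ... using n_terms terms in total."""
    n = len(b)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    total = [row[:] for row in identity]
    power = identity
    for _ in range(n_terms - 1):
        power = mat_mul(power, b)       # next power of B
        total = [[t + p for t, p in zip(row_t, row_p)]
                 for row_t, row_p in zip(total, power)]
    return total


# Hypothetical matrix with small entries, so higher powers shrink quickly.
toy_b = [[0.1, 0.2],
         [0.0, 0.1]]
toy_series = truncated_series(toy_b)
print(toy_series)
```

For this toy matrix the truncated sum gives 1.1111 on the diagonal, against 1/0.9 ≈ 1.1111 for the full series, which illustrates why later terms can be dropped when the entries of B are small.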

Second, note from Equation 1 that $\pi_{t_0}^{t}$ represents only the expected outdegree of types born
in $t_0$ evaluated at time $t$. Since new firms are born in every period up until period $t$, we need
to aggregate these matrices across all time periods leading up to $t$ to obtain the type-by-type
adjacency matrix of the entire network. The matrix of type-by-type links at time $t$ is given by

$$\pi_t = \sum_{t_0}^{t} \left( p \cdot (\pi_{t_0}^{t})' \right)'$$

where $p$ is a column vector containing the probabilities for each type to be born. We compute the
probability that a node of a certain type is born in time $t_0$ and its expected links in time $t$
with every other type to get $p \cdot \pi_{t_0}^{t}$. We then repeat this process to compute the
probability that a node of a certain type is born in time $t_0+1$ and its expected degree in time $t$
to get $p \cdot \pi_{t_0+1}^{t}$. We have to undertake this exercise for all time periods leading
up to $t$. In other words, we need to compute $t$ such matrices and add them up to obtain the
type-by-type degree distribution at time $t$.

Computing $\pi_t$ for $t = 56822$ in each iteration while looping through different candidate values
of $r$ is computationally intensive. Therefore, in every iteration, we compute $\pi_{t_0}^{t}$ for 500
‘representative’ time periods that we then aggregate to obtain $\pi_t$. We space these 500 periods equally
between our first period $t_0 = 1$ and final period $t_0 = 56822$. As a result, we compute

$$\pi_t = \sum_{t_0 = 1:100:56822} \left( p \cdot (\pi_{t_0}^{t})' \right)'.$$

This implies that the network is scaled down in terms of firm count.

However, this approach ensures that we do not disproportionately sample from either older or
younger nodes and thereby bias our results. For example, sampling from nodes born in the first
500 periods would lead us to predict the type-by-type outdegree distribution only for firms in the
right tail of the firm degree distribution if the observed network happens to exhibit preferential
attachment, i.e. a high degree of directed search. This stems from the fact that directed search
results in older nodes having a higher chance of being more connected. This can bias our estim-
ation of r as we will match the predicted distribution of such ‘older’ firms with all firms observed
in the data. Our sampling strategy prevents this by ensuring a balanced representation of firms
across different time periods and ensures that the essential features of the network formation
process and network structure remain intact.


B.3     Fixed Indegree in the Network Formation Model

Recall that in our model, while a firm’s number of buyers (outdegree) evolves over time, the
number of suppliers (indegree) that a firm has is fixed to m and does not change as new firms
are born. This assumption can be interpreted as a reflection of a fixed production technology
that the firm needs to operate. Our focus on endogenizing the outdegree distribution (i.e. the
number of buyers of a firm) is also motivated by four key reasons.

First, the outdegree distribution in firm networks has been widely documented to exhibit a
substantially higher degree of heterogeneity than the indegree distribution (Bacilieri et al., 2023),
a fact that also replicates when we consider the spatial distribution of firm links in Figure B2
and Figure B3.

Second, supplier relationships can often be established early and remain stable over time. Con-
sistent with this, we find empirically that firm age is more strongly correlated with outdegree
(number of buyers) than with indegree (number of suppliers), suggesting that customer bases
grow with time while suppliers remain relatively fixed (see Figure B6).

Third, given that informal firms are more likely to operate downstream in the supply chain, not
observing them is more likely to affect the outdegree rather than the indegree of formal firms.

Finally, this assumption is also made for analytical convenience as it simplifies the characterization
of the network’s steady state. This allows us to estimate the model.

We do not think that this assumption would change our key results. To see this, consider the
scenario where firm indegrees were also allowed to grow over time. In that case, informal firms,
which are more likely to be buyers to begin with, would accumulate more supplier links as they
age. This would further increase their integration in the production network relative to formal
firms that are more likely to be suppliers. As a result, by holding indegree fixed, we likely
understate the extent to which informal firms participate in the broader economy.

                               Figure B6: In- and outdegree by firm age




The figure plots the in- and outdegree against firm age. Firm age is truncated at 50 years because only a few firms
are older and the age distribution exhibits a long tail; 46 years corresponds to the 99th percentile.




C     Material on the Revised Network

C.1      Supplementary Material for Revised Network

                Figure C1: Sector-county-size probabilities and formal sector shares




The graph plots each sector-region’s formality share against the normalised difference between the baseline p(θ)
and the augmented version p(θa ) that takes into account informal firms. p(θ)-p(θa ) is reported in terms of standard
deviations. A 10 percentage point increase in formality leads to an increase of p(θ)-p(θa ) by half a percentage
point (0.35 standard deviations). To estimate the slope, we exclude eleven sector-county-size types which are
adjusted by more than two standard deviations. All of the eleven types are Nairobi-based. Nine are large types,
plus small firms in business services and construction. The slope becomes a little more than twice as steep if the
five sector-county pairs are included.




          Figure C2: Predicted change in type-level outdegree and formal sector shares




The figure plots sector-county formal sector shares against changes in type-level outdegree. The change in outde-
gree is measured as the difference between the revised network (including informal firms) and the baseline network
(formal firms only), expressed relative to the baseline.


                        Figure C3: Model versus revised county-level network




The left and right panels show the baseline and revised county-level networks, respectively. We use the row-
normalised county-level adjacency matrix to construct the plots with arrows indicating links from suppliers to
buyers. Colours indicate county groupings identified by a community detection algorithm.




               Figure C4: Inter- and intra-county trade patterns in a revised network

               Inter-county outlinks                                      Intra-county outlinks




The figure plots the ratio of the supplier-to-buyer links for the revised network relative to the baseline for each
county, distinguishing between trade links between counties (inter) and within counties (intra). To the left of the
dotted line, at a value of one, are counties that experienced a decline in outlinks in the respective type of trade
linkages.
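
The ratios shown in the figure can be computed along the following lines. The sketch assumes two county-by-county matrices whose entry [i][j] counts links from suppliers in county i to buyers in county j, so that the diagonal holds intra-county links. The matrices and the function name are hypothetical, and the code does not guard against counties with zero baseline links.

```python
def outlink_ratios(baseline, revised):
    """Ratio revised/baseline of intra- and inter-county outlinks per county.

    Both inputs are square matrices where entry [i][j] counts links from
    suppliers in county i to buyers in county j.
    """
    ratios = []
    for i in range(len(baseline)):
        intra_base, intra_rev = baseline[i][i], revised[i][i]
        # Inter-county outlinks: row total minus the diagonal entry.
        inter_base = sum(baseline[i]) - intra_base
        inter_rev = sum(revised[i]) - intra_rev
        ratios.append({"intra": intra_rev / intra_base,
                       "inter": inter_rev / inter_base})
    return ratios


# Hypothetical two-county example.
toy_baseline = [[10, 20],
                [5, 4]]
toy_revised = [[15, 18],
               [5, 8]]
toy_ratios = outlink_ratios(toy_baseline, toy_revised)
print(toy_ratios)
```

A ratio below one, as for the inter-county outlinks of the first toy county, corresponds to a position to the left of the dotted line.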


     Table C1: County-level changes in the dispersion of outdegrees - alternative scenarios

            Scenario                                                              ∆ sd/mean (in %)

                                       Informal firms ≈ small firms

            Default                                       all counties                     -7.5
                                                          w/o NBO & MSA                   -18.0
            p(θ) using employ. share only                 all counties                     -5.6
                                                          w/o NBO & MSA                   -11.8

                           Alternative linking patterns for informal firms

            0% sales, 50% input to/from formal            all counties                    -16.0
                                                          w/o NBO & MSA                   -33.5
            0% sales, 75% input to/from formal            all counties                    -11.1
                                                          w/o NBO & MSA                   -25.7
            25% sales, 50% input to/from formal           all counties                    -16.1
                                                          w/o NBO & MSA                   -33.7
The above table reports the difference in outdegrees between the original and the revised network - aggregated at
the county level. We look at the coefficient of variation as the key metric. Adjusting for the mean accounts for the
fact that the change in the number of outlinks predicted by the model needs to be interpreted in relative rather
than absolute terms. We exclude the outdegrees of Nairobi and Mombasa when we compute the coefficient of
variation in every other row. The first two scenarios assume similar linking patterns for informal and small formal
firms, conditional on their sector and county. In the second scenario, we use a simple version of the updated entry
probabilities p(θ) that does not account for differences in firm size across sectors and locations. Scenarios three to
five rely on the default assumptions on how to update p(θ) to incorporate informal firms. Instead, assumptions
about p(θ, θ′ ) are modified: Scenarios three and four assume that informal firms do not sell to formal firms at all,
but buy 50% or 75% of their inputs from the formal sector, respectively. Scenario five maintains the assumption
that 50% of inputs are sourced from the formal sector and further allows 25% of the informal firms’ sales to go
to formal firms.
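
A minimal sketch of the metric follows. It assumes that the reported figure is the percentage change in the coefficient of variation (sd/mean) of county-level outdegrees between the baseline and the revised network; whether the table reports a relative change or a difference in percentage points is an assumption here. The outdegree values and the function name are hypothetical.

```python
import statistics


def cv_change_pct(baseline, revised):
    """Percentage change in sd/mean between two lists of county outdegrees."""
    cv_base = statistics.pstdev(baseline) / statistics.mean(baseline)
    cv_rev = statistics.pstdev(revised) / statistics.mean(revised)
    return 100.0 * (cv_rev - cv_base) / cv_base


# Hypothetical outdegrees for three counties before and after the revision.
toy_baseline_deg = [10, 20, 30]
toy_revised_deg = [20, 25, 30]
toy_change = cv_change_pct(toy_baseline_deg, toy_revised_deg)
print(round(toy_change, 1))
```

Dropping selected counties, such as Nairobi and Mombasa, before calling the function would give the analogue of the rows that exclude them.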




       Table C2: Social connectedness, travel time, migration and county-by-county-links

                                                                    Outlinksij
                                                Baseline      Revised       Baseline       Revised
           Travel timeij (log)                 -0.010***     -0.009***      -0.011***     -0.009***
                                                 (0.004)       (0.002)       (0.004)       (0.002)
           Social connectednessij (log)         0.006***      0.014***      0.007***      0.014***
                                                 (0.002)       (0.001)       (0.002)       (0.001)
           Migrationij                         0.530***         0.176
                                                 (0.187)       (0.181)
           Migrationji                                                      0.114***       0.158*
                                                                             (0.029)       (0.096)
           No. observations                       2,124         2,124         2,124         2,124
           R2                                     0.904         0.351         0.901         0.351
           Origin FE                                ✓             ✓             ✓             ✓
           Destination FE                           ✓             ✓             ✓             ✓
We regress the matrix of county-by-county outlinks, more precisely the share of inputs a given county purchases
from another county, on social connectedness, travel time (in hours), and the number of migrants (in millions)
between the two counties. Standard errors are clustered at the level of the origin-destination dyad. Social
connectedness captures the probability of two random individuals being friends on a popular social media platform
(Bailey et al., 2021), conditional on their present location.




              Table C3: County-level changes in outdegree and county characteristics

                                                  Outlinks counterfactual/outlinks baseline

            Formal sector share                   -3.525***      -1.569      0.561         0.839
                                                   (1.303)      (1.711)     (2.293)       (2.244)
            Population (log)                                    -0.365*       0.323        0.871
                                                                (0.213)     (0.542)       (0.613)
            Gross County Product (log)                                       -0.545      -1.063**
                                                                            (0.395)       (0.485)
            Market access (distance, log)                                                 0.330*
                                                                                          (0.187)
            No. observations                           47           47        47           47
            R2                                       0.140        0.194      0.228        0.281
We regress the county-level change in outdegrees on various county characteristics including the formal sector
share, the Gross County Product, and market access. We weight observations by Gross County Product.




 Table C4: Differences in simulated output reduction for revised network with informal firms
                        versus baseline network with formal firms only

                                                     Domestic output shocks                    Import shocks
                                                    (1)           (2)         (3)        (4)        (5)          (6)
  Buyer sector-county formal employment share    -4.948***     -4.803***   -4.066***   0.115***   0.238***      0.053
                                                  (0.931)       (1.458)     (1.001)     (0.032)    (0.052)     (0.034)
  No. observations                                 431           431         431         431        431         431
  Sector FE                                         -             ✓           -           -          ✓           -
  County FE                                         -             -           ✓           -          -           ✓
The outcome of interest measures the ratio of the impact response to an adverse shock if we account for informal
firms versus relying only on the administrative data. The ratio is larger than one if we underestimate the impact
of the shock and smaller than one if we overestimate it by not accounting for informality. The above table shows
the results from regressing this change in output reduction at the sector-county level on the sector-county formal
sector share.




     Figure C5: The ratio of ∆ output (revised network) and ∆ output (baseline network)
     [Figure panels: Domestic output shocks; Import shocks]

We plot the ratio of the impact response to an adverse shock when we account for informal firms versus when we
rely only on the administrative data, for the domestic and import shocks respectively. The ratio is larger than
one if not accounting for informality leads us to underestimate the impact of the shock and smaller than one if
it leads us to overestimate it. We aggregate the impact of the shock at the sector-county level, weighting the
impact for each size and formality type by its entry probability p(θ).
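
The aggregation and ratio described in the note above can be sketched as follows. This is a hedged illustration rather than the paper's implementation: it assumes the weights p(θ) are normalised within each sector-county cell, and the record layout, function names, and numbers are hypothetical.

```python
from collections import defaultdict

def aggregate_by_cell(records):
    """Weighted mean output drop per (sector, county); weights are p(theta)."""
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for r in records:
        cell = (r["sector"], r["county"])
        weighted_sum[cell] += r["p_theta"] * r["output_drop"]
        weight_total[cell] += r["p_theta"]
    return {cell: weighted_sum[cell] / weight_total[cell] for cell in weighted_sum}

def impact_ratio(revised, baseline):
    """Ratio of output drops: > 1 means the baseline underestimates the shock."""
    return {cell: revised[cell] / baseline[cell]
            for cell in baseline
            if cell in revised and baseline[cell] != 0}

# Illustrative records only; these are not values from the paper.
revised_records = [
    {"sector": "manufacturing", "county": "Kisumu", "p_theta": 0.6, "output_drop": 0.10},
    {"sector": "manufacturing", "county": "Kisumu", "p_theta": 0.3, "output_drop": 0.05},
    {"sector": "manufacturing", "county": "Kisumu", "p_theta": 0.1, "output_drop": 0.02},
]
baseline_records = [  # formal firm types only
    {"sector": "manufacturing", "county": "Kisumu", "p_theta": 0.3, "output_drop": 0.04},
    {"sector": "manufacturing", "county": "Kisumu", "p_theta": 0.1, "output_drop": 0.02},
]

ratios = impact_ratio(aggregate_by_cell(revised_records),
                      aggregate_by_cell(baseline_records))
print(ratios)
```

In this toy cell the ratio exceeds one, which in the figure's terms corresponds to underestimating the shock when informal firms are ignored.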




                              Figure C6: % change in output drops
     Scenario: no sales of informal firms to and 50% inputs sourced from the formal sector
     [Figure panels: Domestic output shocks; Import shocks]

The above graphs plot the percentage change in the output reduction in response to domestic and international
output shocks between two scenarios: the baseline network using only administrative data and the revised network
assuming that informal firms do not sell to the formal sector and source 50% of their inputs from formal firms.
We aggregate the output reduction across buyer types at the sector and county level, weighted by the entry
probability p(θ) for each size and formality type. The x-axis shows the formal sector share for each sector-region
pair.




C.2    Updating Entry Probabilities for Scenarios that Distinguish Between Small Formal and Informal Firms

To differentiate informal firms from small formal firms, we simply split the first term in Equation
4 into two components: self-employment in the formal sector (small formal firms) and employ-
ment in the informal sector (informal firms). As outlined in Section 6.1, our proxy for informal
firms’ linking patterns still relies to some extent on small formal firms’ connections to capture
the sector-region composition of formal-informal linkages. Consequently, we can only introduce
informal types where corresponding small formal types exist in the administrative data. This
approach excludes 53 potential types, representing sector-county cells that account for 9% of
private sector employment (excluding agriculture and non-market services). We do not introduce
informal types for the agricultural and non-market service sectors. The resulting classification
yields 1,376 distinct firm types: 419 informal, 493 small formal, and 464 large formal.
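
The restriction described above, introducing informal types only where a matching small formal type exists, can be sketched as a set intersection over sector-county cells. This is a minimal sketch under stated assumptions: the function name, sector labels, and example cells are hypothetical, and only the three type counts are taken from the text.

```python
# Sectors for which no informal types are introduced (labels are illustrative).
EXCLUDED_SECTORS = {"agriculture", "non-market services"}

def build_informal_types(informal_cells, small_formal_cells):
    """Split eligible informal (sector, county) cells into kept and dropped.

    A cell is kept only if a small formal type exists in the same cell,
    since small formal firms proxy for informal linking patterns.
    """
    eligible = {(s, c) for (s, c) in informal_cells if s not in EXCLUDED_SECTORS}
    kept = eligible & small_formal_cells
    dropped = eligible - small_formal_cells
    return kept, dropped

# Illustrative cells only; these are not cells from the paper's data.
informal_cells = {
    ("retail", "Kisumu"),
    ("manufacturing", "Turkana"),
    ("agriculture", "Nakuru"),
}
small_formal_cells = {("retail", "Kisumu"), ("retail", "Nakuru")}

kept, dropped = build_informal_types(informal_cells, small_formal_cells)
print(kept, dropped)

# Type counts as reported in the text; the total is a consistency check.
counts = {"informal": 419, "small formal": 493, "large formal": 464}
total_types = sum(counts.values())
print(total_types)
```

The final sum reproduces the total number of firm types reported in the text from its three components.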



