Policy Research Working Paper 9629

Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements

Holger Breinlich, Valentina Corradi, Nadia Rocha, Michele Ruta, J.M.C. Santos Silva, Tom Zylkin

Development Economics, Development Research Group & Macroeconomics, Trade and Investment Global Practice
April 2021

Abstract: Modern trade agreements contain a large number of provisions besides tariff reductions, in areas as diverse as services trade, competition policy, trade-related investment measures, or public procurement. Existing research has struggled with overfitting and severe multicollinearity problems when trying to estimate the effects of these provisions on trade flows. This paper builds on recent developments in the machine learning and variable selection literature to propose novel data-driven methods for selecting the most important provisions and quantifying their impact on trade flows. The proposed methods have the advantage of not requiring ad hoc assumptions on how to aggregate individual provisions and offer improved selection accuracy over the standard lasso. The analysis finds that provisions related to technical barriers to trade, antidumping, trade facilitation, subsidies, and competition policy are associated with enhancing the trade-increasing effect of trade agreements.

This paper is a product of the Development Research Group, Development Economics and the Macroeconomics, Trade and Investment Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at h.breinlich@surrey.ac.uk, v.corradi@surrey.ac.uk, nrocha@worldbank.org, mruta@worldbank.org, jmcss@surrey.ac.uk, and tzylkin@richmond.edu.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements

Originally published in the Policy Research Working Paper Series in April 2021. This version was updated in May 2022. To obtain the originally published version, please email prwp@worldbank.org.

KEY WORDS: Lasso, Machine Learning, Preferential Trade Agreements, Deep Trade Agreements.
JEL CLASSIFICATION: F14, F15, F17.

Research for this paper has been supported in part by the World Bank's Multidonor Trust Fund for Trade and Development.
We gratefully acknowledge financial support through ESRC grant EST013567/1, and thank Scott Baier, Maia Linask, Yoto Yotov, and seminar participants at the World Bank Economics of Deep Trade Agreements Seminar Series for useful comments. Alvaro Espitia, Diego Ferreras-Garrucho, Jiayi Ni, and Nicolas Apfel provided excellent research assistance. The usual disclaimer applies. An R package (penppml) implementing penalized PPML regressions with high-dimensional fixed effects is available from CRAN.

Author affiliations: Holger Breinlich, University of Surrey, CEP and CEPR (h.breinlich@surrey.ac.uk); Valentina Corradi, University of Surrey (v.corradi@surrey.ac.uk); Nadia Rocha, World Bank (nrocha@worldbank.org); Michele Ruta, World Bank (mruta@worldbank.org); J.M.C. Santos Silva, University of Surrey (jmcss@surrey.ac.uk); Tom Zylkin, University of Richmond (tzylkin@richmond.edu).

1 Introduction

International trade is of vital importance for modern economies, and governments around the world try to shape their countries' export and import patterns through numerous interventions. Given the difficulties facing multilateral trade negotiations through the World Trade Organization (WTO), countries have increasingly turned their focus in the last two decades to preferential trade agreements (PTAs) involving only one or a small number of partners. At the same time, attention has shifted from the reduction of import tariffs to the role of non-tariff barriers and behind-the-border policies, such as differences in regulations, technical standards, or intellectual property rights protections. Accordingly, modern trade agreements contain a host of provisions besides tariff reductions, in areas as diverse as services trade, competition policy, trade-related investment measures, or public procurement (Hofmann, Osnago, and Ruta, 2017).

Against this background, researchers and policy makers interested in the effects of trade agreements face difficult challenges. In particular, recent research has tried to move beyond estimating the overall impact of PTAs and to establish the relative importance of individual trade agreement provisions in determining an agreement's overall impact (e.g., Kohl, Brakman, and Garretsen, 2016, Mulabdic, Osnago, and Ruta, 2017, Dhingra, Freeman, and Mavroeidi, 2018, Regmi and Baier, 2020, and Falvey and Foster-McGregor, 2022). However, such attempts face the difficulty that the large number of provisions, and the fact that similar provisions appear in different trade agreements, create severe multicollinearity problems, which make it very difficult to identify the effects of individual provisions. Traditional methods such as gravity regressions of trade flows on dummies for individual provisions are not able to deal with such multicollinearity. Instead, researchers have grouped or aggregated provisions in different ways. For example, Mattoo, Mulabdic, and Ruta (2017) use the count of provisions in an agreement as a measure of its "depth", hence implicitly giving equal weight to each measure. Dhingra, Freeman, and Mavroeidi (2018) overcome multicollinearity problems by grouping services, investment, and competition provisions and examining the effect of these "provision bundles" on trade flows.
In this paper, we build on recent developments in the machine learning and variable selection literature to propose novel data-driven methods to select the most important provisions and quantify their impact on trade flows. These methods address difficulties arising from the high degree of correlation between individual PTA provisions, without requiring ad hoc assumptions on how to aggregate individual provisions. Although, to be clear, they do not completely answer the question of "which provisions matter for trade?", our proposed methods do lead to substantial improvements in our ability to find the more relevant provisions while narrowing down the large number of potential explanatory variables.

We start by proposing an extension of the well-known lasso (Least Absolute Shrinkage and Selection Operator) method for variable selection (see, e.g., Hastie, Tibshirani, and Friedman, 2009) to the case of nonlinear models with high-dimensional fixed effects, which have become standard in the analysis of trade flows (see, e.g., Yotov, Piermartini, Monteiro, and Larch, 2016). Specifically, we use a Poisson pseudo-maximum likelihood (PPML) version of the lasso and show how to choose the tuning parameter of this estimator using either cross-validation or the "plug-in" (or "theory-driven") approach of Belloni, Chernozhukov, Hansen, and Kozbur (2016), which accounts for heteroskedasticity and clustered errors.1

We apply our PPML-lasso estimators to a comprehensive data set on PTA provisions recently made available by the World Bank (Mattoo, Rocha, and Ruta, 2020). Importantly, this database is very detailed, to the point where the number of provision variables we consider is larger than the number of PTAs we observe in our data. In addition, due to template effects and possible synergies between groups of provisions, the 305 provision variables in our data can be highly correlated with one another. We find that the number of provisions selected when using the PPML-lasso with the tuning parameter chosen by cross-validation is too large for the model to have a meaningful interpretation and that, in contrast, the number of provisions identified when using the plug-in penalty is too small to allow us to be confident that it includes the majority of relevant provisions.2

To address these issues, we introduce two additional methods that seek to identify potentially important variables that may have been missed in an initial lasso step based on the plug-in penalty. One of the methods, which we call the "iceberg lasso", involves regressing each of the provisions selected by the plug-in lasso on all other provisions, with the purpose of identifying relevant variables that were initially missed due to their collinearity with the provisions selected in the initial step. The other method, termed the "bootstrap lasso", augments the set of variables selected by the plug-in lasso with the variables selected when the plug-in lasso is bootstrapped. As we show using simulations, these new methods strike a favorable balance between the parsimony of the plug-in lasso and the lenience of cross-validation methods in small-to-moderate data sets where the true causal variables may be highly correlated with an unknown number of other variables.

To provide some headline results, the PPML-lasso based on cross-validation selects 133 provisions as being relevant, whereas using the plug-in penalty we find that only 8 provisions are associated with enhancing the trade-increasing effect of trade agreements.
In turn, the iceberg lasso procedure identifies a set of 42 provisions and, depending on the cutoff used, the bootstrap lasso identifies between 30 and 74 provisions that may be impacting trade. Therefore, our iceberg lasso and bootstrap lasso methods select sets of provisions that are small enough to be interpretable and large enough to give us some confidence that they include the more relevant provisions, something that is confirmed by the simulation evidence we provide. Reassuringly, both the iceberg lasso and the bootstrap lasso select similar sets of provisions, mainly related to technical barriers to trade, anti-dumping, trade facilitation, subsidies, and competition policy. Having identified the set of provisions that are more likely to have an impact on trade, we also discuss how our findings can be used to estimate the effects of different PTAs and to predict the impact of future ones, as well as the risks associated with such exercises.

1 An R package (penppml) implementing penalized PPML regressions with high-dimensional fixed effects is available from CRAN and can be installed with install.packages("penppml"). For more details see https://github.com/tomzylkin/penppml.
2 Our simulation results in Section 4 suggest that the lasso with a penalty parameter chosen by the plug-in method often fails to select the relevant regressors. A similar result, in a different context, is reported by Wüthrich and Zhu (2021).

Our work contributes to several different literatures. Most directly, we contribute to the large and growing literature on the effects of PTAs on trade flows. As previously discussed, this literature has recently tried to decompose the overall PTA effect by disentangling the effects of individual trade agreement provisions. The new methods we propose allow us to select the most important provisions and to quantify their impact on trade flows, while avoiding the need to make essentially arbitrary assumptions about how to aggregate individual provisions (see Mattoo, Mulabdic, and Ruta, 2017; Dhingra, Freeman, and Mavroeidi, 2018).

In addition, we contribute to the machine learning literature interested in variable selection and prediction. In particular, we extend and adapt existing work by Belloni, Chernozhukov, Hansen, and Kozbur (2016) on the use of the lasso in the presence of heteroskedasticity and clustered errors, to make it applicable to the context of international trade flows and trade agreements. As noted above, this requires an extension of their original method to the estimation of nonlinear models with high-dimensional fixed effects using PPML. The iceberg lasso and bootstrap lasso that we propose build on the results obtained using the plug-in penalty and identify additional sets of provisions that may have a causal effect on trade. Both methods add to the information provided by the standard lasso approaches and, as illustrated in our simulations, are better able to identify the provisions that have a causal effect. Therefore, these new methods can potentially be useful in other contexts, especially when the available sample is relatively small and contains a large number of highly correlated potential explanatory variables.

Finally, we contribute to a small existing literature that has used machine learning and other related methods to study the effects of trade agreements in a gravity context. For example, Regmi and Baier (2020) use an unsupervised learning method to group PTAs by textual similarity, so as to provide a more nuanced notion of PTA depth.
Following a similar motivation, Hofmann, Osnago, and Ruta (2017) propose an earlier depth measure for PTAs based on principal components analysis applied to their provisions data. In contrast, Baier, Yotov, and Zylkin (2019) use a two-step methodology where pair-specific PTA effects are estimated in a first stage and then predicted out of sample using country- and pair-specific variables.

The rest of this paper is structured as follows. Section 2 presents the data on PTA provisions and provides a descriptive analysis of these data, highlighting a number of stylized facts about the provisions present in recent trade agreements. Section 3 introduces the variable selection problem in the three-way gravity model context and explains how we implement PPML-lasso estimation with high-dimensional fixed effects. Section 4 presents the results of a simulation study comparing the relative performance of different lasso methods in a simplified setting with high correlation between regressors. Section 5 applies our methods to our database on PTA provisions and shows which individual provisions are the strongest predictors of trade flows. Section 6 concludes; technical details are gathered in an Appendix.

2 Data

Our analysis combines data on international trade flows from Comtrade with the new database on the content of PTAs collected by Mattoo, Rocha, and Ruta (2020). On trade, we use merchandise exports between 1964 and 2016 from 220 exporters to 270 importers. Country pairs without export information are treated as zeros.

The database on the content of trade agreements includes information on 282 PTAs that were signed and notified to the WTO between 1958 and 2017. The data focus on the sub-sample of 17 policy areas that are most frequently covered in trade agreements; these are areas covered in close to or above 20 percent of the trade agreements mapped in Hofmann, Osnago, and Ruta (2017). These policy areas range from environmental laws and labor market regulations, which are covered in roughly 20 percent of the PTAs, to areas such as rules of origin and trade facilitation, which are present in over 80 percent of the agreements (see Figure 1).

Figure 1: Share of PTAs that cover selected policy areas
Note: The figure shows the share of PTAs that cover each policy area. Source: Mattoo, Rocha, and Ruta (2020).

For each agreement and policy area, the database provides granular information on the specific provisions covering stated objectives and substantive commitments, as well as aspects relating to transparency, procedures, and enforcement. The coding exercise focuses on the legal text of the agreements and therefore excludes information on the actual implementation of the commitments included in the agreements.3

3 In this data set, information coming from secondary law (the body of law that derives from the principles and objectives of the treaties) has not been coded. This is of particular importance for agreements such as the EU, since most policy areas covered have used secondary law such as regulations, directives, and other legal instruments to pursue integration.
Table 1: Distribution of essential provisions by policy area

Policy Area                               Total provisions   Essential provisions   Share
Anti-dumping and Countervailing Duties          53                  11              28.8%
Competition Policy                              35                  14              40.0%
Environmental Laws                              48                  27              56.3%
Export Taxes                                    46                  23              50.0%
Intellectual Property Rights                   120                  67              55.8%
Investment                                      57                  15              26.3%
Labor Market Regulations                        18                  12              66.7%
Movement of Capital                             94                   8               8.5%
Public Procurement                             100                   5               5.0%
Rules of Origin                                 38                  19              50.0%
Sanitary and Phytosanitary Measures             59                  24              40.7%
Services                                        64                  21              32.8%
State-Owned Enterprises                         53                  13              24.5%
Subsidies                                       36                  13              36.1%
Technical Barriers to Trade                     34                  19              55.9%
Trade Facilitation and Customs                  52                  11              21.2%
Visa and Asylum                                 30                   3              10.0%
Total                                          937                 305              32.6%

To alleviate the problems caused by the high dimensionality of the data and the high level of correlation across the provisions included in the agreements, the analysis presented in this paper focuses on a sub-set of "essential" provisions. This includes the set of substantive provisions (those that require specific integration/liberalization commitments and obligations) plus the disciplines among procedures, transparency, enforcement, or objectives that are viewed as indispensable and complementary to achieving the substantive commitments. Non-essential provisions are referred to as "corollary".4

The share of essential provisions in the total number of provisions included in an agreement ranges from less than 10 percent for public procurement, movement of capital, and visa and asylum, to more than 50 percent for policy areas such as environmental laws and labor market regulations. Overall, the sub-set of essential provisions represents almost one-third (305/937) of the total number of provisions coded in this exercise (see Table 1).

The coverage of essential provisions also varies widely across trade agreements and disciplines, indicating that not all PTAs cover the same set of essential provisions. As shown in Table 2, more than three-quarters of agreements cover 25 percent or less of the essential provisions included in policy areas such as environmental laws, anti-dumping, sanitary and phytosanitary measures, and technical barriers to trade. Conversely, for policy areas such as visa and asylum, rules of origin, and trade facilitation and customs, more than 70 percent of the mapped agreements cover between 25 and 75 percent of essential provisions. With the exception of services and investment, coverage of more than 75 percent of essential provisions is rare and happens in less than 15 percent of the mapped agreements.

4 The classification into essential and corollary in the database is based on experts' knowledge and, hence, has an element of subjectivity.

One important caveat regarding this data set is that it does not cover all of the trade agreements that have been in force during the period under study. Specifically, our information on provisions is limited to agreements that are in effect at the present day, i.e., it excludes any agreements that are no longer in effect. For this reason, we drop observations associated with agreements no longer in effect. This means that the effects of newer agreements are identified by changes in trade relative to when a pair did not have any agreement, rather than relative to pre-existing agreements. The majority of the dropped observations are due to pre-accession agreements that new European Union (EU) members sign before joining the EU.
Thus, to use one of these cases as an example, Italy-Croatia is included in our data for years 1992-2000 (after Croatian independence and before the initial EU-Croatia PTA in 2001) and for year 2016 (after Croatia joins the EU in 2013). The EU is treated differently in our analysis for this reason, as we discuss further in Section 4. To identify agreements no longer in effect, we consult the NSF-Kellogg database created by Jeff Bergstrand and Scott Baier, cross-checked with data from the WTO. The EU and the earlier European Community are treated as the same agreement for these purposes, though it is allowed to evolve as new provisions are added.

Table 2: Coverage of essential provisions by policy area

                                          Share of agreements covering:
Policy Area                               0 to 25%   25% to 75%   over 75%
Anti-dumping and Countervailing Duties      99%          1%          0%
Competition Policy                          48%         47%          5%
Environmental Laws                          88%         12%          0%
Export Taxes                                41%         59%          0%
Intellectual Property Rights                76%         23%          1%
Investment                                   6%         64%         30%
Labor Market Regulations                    68%         17%         15%
Movement of Capital                         44%         42%         13%
Public Procurement                          53%         40%          7%
Rules of Origin                              7%         93%          0%
Sanitary and Phytosanitary Measures         87%         13%          0%
Services                                     6%         62%         33%
State-Owned Enterprises                     45%         54%          1%
Subsidies                                   59%         41%          0%
Technical Barriers to Trade                 93%          7%          0%
Trade Facilitation and Customs              21%         78%          0%
Visa and Asylum                             27%         70%          3%

Note: The coverage ratio is the share of essential provisions for a policy area contained in a given agreement, relative to the maximum number of essential provisions in that policy area. Source: Mattoo, Rocha, and Ruta (2020).

3 Determining Which Provisions Matter for Trade

We now outline the methodology we use to identify which PTA provisions have the largest impact on bilateral trade. To preview our approach, we will first specify a typical panel data gravity model for trade flows. Following the latest recommendations from the methodological literature (Yotov, Piermartini, Monteiro, and Larch, 2016; Weidner and Zylkin, 2021), we will use a multiplicative model where expected trade flows are given by an exponential function of our covariates of interest plus three sets of fixed effects. Drawing on this standard framework, we will then consider the estimation challenges that arise when the number of covariates (here, provision variables) is allowed to be very large. As we will discuss, it will be convenient to reformulate the usual estimation problem as a "variable selection" problem, in which we suppose that many of the provisions have zero or approximately zero effect.

Bringing together these elements will require that we extend recent computational advances in high-dimensional fixed effects estimation to incorporate lasso and lasso-type penalties. It will also require that we introduce our own innovations, the iceberg lasso and bootstrap lasso methods, which we motivate as providing a balance between "cross-validation" approaches that tend to select too many variables and more parsimonious "plug-in" methods that may select too few.

3.1 The Gravity Model

Our starting point for estimation is the following multiplicative gravity model:

\mu_{ijt} := E\left(y_{ijt} \mid x_{ijt}, \alpha_{it}, \gamma_{jt}, \eta_{ij}\right) = \exp\left(x_{ijt}'\beta + \alpha_{it} + \gamma_{jt} + \eta_{ij}\right). \qquad (1)

Here, i, j, and t respectively index exporter, importer, and time. Bilateral trade flows from exporter i to importer j at time t are given by y_ijt, x_ijt collects our covariates of interest, and α_it, γ_jt, and η_ij are, respectively, exporter-time, importer-time, and exporter-importer ("pair") fixed effects.
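For concreteness, the specification in (1) with a single PTA dummy can be estimated by PPML with all three sets of fixed effects in a few lines of R. The sketch below is our illustration using the fixest package rather than the paper's own penppml implementation, and the data frame trade and its column names are hypothetical.

# Three-way PPML gravity regression: exporter-time, importer-time, and pair
# fixed effects, with a single PTA dummy as the covariate of interest.
# 'trade' is a hypothetical data frame with columns flows, pta, exporter,
# importer, and year.
library(fixest)
fit <- fepois(flows ~ pta | exporter^year + importer^year + exporter^importer,
              data = trade)
summary(fit, cluster = ~ exporter^importer)  # cluster errors by country pair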
Because of the three sets of fixed effects, the model in (1) is often called the "three-way gravity model". Intuitively, the exporter-time and importer-time fixed effects α_it and γ_jt may be thought of as controlling for changes over time in the "gravitational pull" that the exporter and importer each exert on world trade flows. More formally, these fixed effects can be shown to depend on the market sizes of the two countries as well as on what Anderson and van Wincoop (2003) call "multilateral resistance", a theoretical measure of each country's connectedness to the overall trade network. The inclusion of the pair fixed effect η_ij was suggested by Baier and Bergstrand (2007), who convincingly argue that estimates of the effect of trade agreements and other similar variables would otherwise be biased due to omitted cross-sectional heterogeneity. In terms of a trade model, this omitted heterogeneity is often motivated as coming from unobserved trade costs.

An important point about (1) is that it motivates estimating the model in its original nonlinear form using PPML; see Gourieroux, Monfort, and Trognon (1984). In principle, one could instead use a linear model after taking logs, but Santos Silva and Tenreyro (2006) have pointed out that this estimator is generally inconsistent and recommended that (1) should instead be estimated by PPML. Though the resulting model is nonlinear with three sets of high-dimensional fixed effects, estimation is feasible thanks to recent computational innovations by Correia, Guimarães, and Zylkin (2020) and others.5 Weidner and Zylkin (2021) have recently established the consistency and asymptotic distribution of the three-way PPML estimator, and Yotov, Piermartini, Monteiro, and Larch (2016) recommend it as the workhorse method for estimating the effects of trade policies. It is frequently applied to the context of trade agreements in particular.

Having established these details, our focus is on the set of covariates, x_ijt. In most applications in the trade agreements literature, x_ijt is either a single variable (i.e., a dummy for the presence of a trade agreement) or a minor variant thereof, such as interactions with either the depth of the agreement or the bilateral characteristics of the two countries (Baier, Bergstrand, and Feng, 2014; Baier, Bergstrand, and Clance, 2018). However, a major estimation challenge that arises in our setting is that we must treat the number of provisions as being very large. As we will show, in our data set this high dimensionality, combined with the relatively small number of PTAs, creates strong multicollinearity that results in implausibly large and uninterpretable estimates when a standard estimator is used. Furthermore, the estimated model has poor predictive performance due to overfitting. We therefore discuss how the standard gravity estimation approach must be modified to deal with this additional source of high dimensionality.

3.2 Variable Selection and Gravity

The starting point for our methodological innovations is to suppose that only a handful of our provision variables have a non-negligible effect on trade flows. To be more precise, we have p = 305 essential provision variables, coded as dummies, of which a subset s < p are assumed to have non-zero effects, where s is typically small relative to the sample size.6 We do not know s beforehand, nor do we know the identities of any of the s provisions that substantively affect trade.
Our goal then is to use statistical methods along with the model described in (1) to identify these provisions. Because of the high dimensionality of x_ijt, experimenting with different subsets of provisions to see which has the best performance is unlikely to be fruitful. Instead, we adopt a penalized regression (or "regularization") approach that involves appending a penalty term to the Poisson pseudo-likelihood one would use to estimate the unpenalized gravity model. The idea is that the penalty term "shrinks" all estimated coefficients towards zero and forces some of them to be exactly equal to zero. The higher the penalty, the fewer the variables that are found to have non-zero coefficients and are therefore "selected". By design, the variables that are selected should be those that exert the strongest influence on the fit of the model; coefficients for variables that are not as relevant should end up being shrunk to zero completely.

5 Correia, Guimarães, and Zylkin (2020) and Stammann (2018) have each proposed algorithms for estimating nonlinear fixed effects models based on iteratively re-weighted least squares (IRLS). Heuristically, this type of algorithm exploits the linearity of the weighted least squares step in the IRLS algorithm to wipe out the fixed effects in each iteration, then uses an application of the Frisch-Waugh-Lovell theorem to update the weights, repeating until convergence. For a different approach, see Larch, Wanner, Yotov, and Zylkin (2019).
6 Note that of the 305 provisions in our data, 8 are always equal to zero. Therefore, the effective number of provisions we consider is 297.

Because of its computational feasibility, the most frequently used approach to this type of variable selection problem is the lasso, introduced by Tibshirani (1996). In our setting, the penalized objective function that defines the three-way PPML-lasso is

PL(\beta, \alpha, \gamma, \eta) = \underbrace{-\frac{1}{n}\sum_{i,j,t}\left(y_{ijt}\ln\mu_{ijt} - \mu_{ijt}\right)}_{\text{PPML pseudo-likelihood}} + \underbrace{\frac{\lambda}{n}\sum_{k=1}^{p}\hat{\phi}_k\left|\beta_k\right|}_{\text{lasso penalty}}, \qquad (2)

where n is the number of observations,7 μ_ijt = exp(α_it + γ_jt + η_ij + x′_ijt β) is the conditional mean as in (1) above, and λ ≥ 0 and φ̂_k ≥ 0 are tuning parameters that determine the penalty. As indicated in (2), the first term in this expression is the standard PPML objective function one would minimize in order to estimate the three-way gravity model. Thus, the PPML-lasso nests PPML as a special case when λ is set to zero. The second term in (2) is a modified lasso penalty that allows for regressor-specific penalty weights, as opposed to having λ as the only tuning parameter as in the standard lasso.

Intuitively, larger penalties increasingly shrink the estimated β-coefficients towards zero. The coefficients of any variables that do not sufficiently increase the likelihood are set to exactly zero, thereby giving us a way of identifying which variables to include in the final model. As an illustration, if we let λ → ∞, the only way to minimize PL is to set all β̂_k equal to zero, meaning that no variables are selected. As in Belloni, Chernozhukov, Hansen, and Kozbur (2016), we will use the regressor-specific φ̂_k penalty terms to iteratively refine the model while also reflecting any heteroskedasticity and within-cluster correlation featured in the data.

Importantly, the fixed effects parameters α, γ, and η are not penalized. This is mainly because there is no reason to believe that most of the fixed effects parameters are actually zero.
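Before turning to how the fixed effects are handled computationally, the following minimal R sketch (our illustration, not the penppml implementation) evaluates the objective in (2), taking the fixed effects and the penalty parameters as given.

# Penalized PPML objective from equation (2). 'fe' holds the sum
# alpha_it + gamma_jt + eta_ij for each observation; 'phi' holds the
# regressor-specific penalty loadings phi_hat_k; 'lambda' is the overall
# penalty level.
ppml_lasso_obj <- function(beta, y, X, fe, lambda, phi) {
  mu <- exp(fe + drop(X %*% beta))           # conditional mean mu_ijt
  n  <- length(y)
  -sum(y * log(mu) - mu) / n +               # (negative) PPML pseudo-likelihood
    (lambda / n) * sum(phi * abs(beta))      # weighted lasso penalty
}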
Conveniently, the unpenalized fixed effects also pose no special computational issues, because they do not depend on the penalty: for any given β, the fixed effects can be obtained by solving their usual PPML first-order conditions from the standard unpenalized regression approach. In practice, this means that the fixed effects can be dealt with in exactly the same manner as in Correia, Guimarães, and Zylkin (2020). More details on the computational methods are provided in the Appendix; in essence, we use the original HDFE-IRLS algorithm of Correia, Guimarães, and Zylkin (2020) to take care of the fixed effects, but replace the weighted linear regression step of that algorithm with a weighted lasso regression.8

7 Naturally, the number of observations will depend on the number of countries for which we have data and on the number of years we observe them. For simplicity, we do not make that relation explicit.
8 For the lasso regression step, we use the coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010).

3.3 Implementing the Lasso

The next question, of course, is how to determine the tuning parameters λ and φ̂_k. As a starting point, the two existing approaches we first examine are the "plug-in" lasso of Belloni, Chernozhukov, Hansen, and Kozbur (2016) and the traditional cross-validation approach, both of which we have modified to fit the demands of the three-way PPML setting. As we will discuss, each of these methods has its strengths and weaknesses. We will therefore then turn to describing two extensions of the plug-in lasso, which we call the "iceberg lasso" and the "bootstrap lasso", that are intended to address one of the plug-in lasso's key shortcomings in this context.

Plug-in Lasso

The plug-in lasso is so named because it specifies appropriate functional forms for the penalty parameters based on statistical theory and then uses plug-in estimates for these parameters. It is therefore a "theory-driven" approach to the variable selection problem, whereas cross-validation, discussed next, is a more traditional machine learning method that relies on out-of-sample prediction. The plug-in lasso was first proposed by Belloni, Chen, Chernozhukov, and Hansen (2012), though the specific implementation we build on is the "cluster lasso" method of Belloni, Chernozhukov, Hansen, and Kozbur (2016), which allows for correlated errors within clusters.

Without delving too much into technical details, which we defer to the Appendix, variable selection using the plug-in lasso can be thought of as involving the following three ingredients:

i. the absolute value of the score for each β_k when evaluated at β = 0;
ii. the standard error of the score for each β_k;
iii. values for λ and φ̂_k set high enough that the absolute value of the score for β_k must be large relative to its standard error in order for regressor x_ijt,k to be selected.

Intuitively, the value of the score reflects the impact that a small change in β_k has on the fit of the model. When evaluated at β = 0, it tells us how much the fit of the model improves when we make β_k non-zero. The standard logic of the lasso is that this improvement in fit must be large relative to the penalty in order for β̂_k to be non-zero. One of the main innovations of the plug-in lasso is to allow the regressor-specific penalty φ̂_k to adjust to reflect the standard error of the score. This way, we counteract the possibility that regressors could be mistakenly selected due to estimation noise rather than because of their true impact on the model.
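As a rough illustration of these ingredients, the sketch below computes a penalty level and cluster-robust penalty loadings in the spirit of Belloni, Chernozhukov, Hansen, and Kozbur (2016). The constants c and gamma and the exact form of the loadings are our assumptions for illustration; the precise parameterization used in the paper is given in the Appendix.

# Plug-in penalty sketch: lambda from a Gaussian quantile rule, and
# phi_hat_k as the clustered standard deviation of the score of beta_k,
# evaluated at beta = 0 (so mu comes from a fixed-effects-only fit).
plugin_penalty <- function(y, mu, X, cluster, c = 1.1, gamma = 0.1) {
  n <- nrow(X); p <- ncol(X)
  lambda <- 2 * c * sqrt(n) * qnorm(1 - gamma / (2 * p))
  resid  <- y - mu                         # score of beta_k is sum of x_k * resid
  phi <- apply(X, 2, function(xk) {
    g <- tapply(xk * resid, cluster, sum)  # aggregate scores within clusters
    sqrt(sum(g^2) / n)                     # clustered std. error of the score
  })
  list(lambda = lambda, phi = phi)
}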
These regressor-specific penalties play an important role in the presence of heteroskedasticity, which of course is an important feature of trade data. Because the provision sets in x_ijt vary by agreement, and because we expect errors to be serially correlated over time, we construct these weights using the cluster lasso approach of Belloni, Chernozhukov, Hansen, and Kozbur (2016). Specifically, we cluster all observations belonging to pairs that form agreements by the agreement they eventually belong to, including before the agreement begins. Other observations are clustered by pair.

A principal advantage of the plug-in lasso is that it is very parsimonious in terms of the number of variables it selects. As shown by Drukker and Liu (2019), the plug-in method offers superior performance versus cross-validation approaches in finite samples, in large part because these other methods tend to select too many variables. Furthermore, the "post-lasso" estimates obtained using unpenalized PPML on the covariates selected by the plug-in lasso have a "near-oracle" property that ensures they will capture the correct model if the sample is sufficiently large relative to the number of potential regressors (see Belloni, Chen, Chernozhukov, and Hansen, 2012).9

However, the plug-in lasso's parsimony can also be a weakness, in that it may select too few variables. In general, it attempts to select a small number of variables that are most useful for predicting the outcome. However, in data settings where there is a substantial number of highly correlated regressors, as is the case with our provisions data, it is possible that the plug-in lasso will wrongly select a regressor that does not affect the outcome but is strongly correlated with another regressor that does, since either (or perhaps both) can have similar predictive value for fitting the model. We discuss this issue in more detail when we introduce our extensions of the plug-in lasso.

Cross-Validation

As an alternative to the plug-in method, we also consider a more traditional approach based on cross-validation. Under cross-validation, one repeatedly holds out some of the data and chooses λ in order to maximize the predictive fit of the model when evaluated on the held-out data. The regressor-specific φ̂_k do not play a role and are set equal to 1.

Because of the size of the data and the nature of our model, implementing this approach presents some interesting challenges. A standard implementation would be a "k-fold" approach that randomly partitions the sample into k folds and then uses k − 1 subsets to estimate the parameters and the excluded subset to evaluate the predictive ability of the model. To adapt this idea to our setting, we validate our model by repeatedly dropping the observations corresponding to randomly selected groups of agreements in our data, and then use their provisions to predict trade for the dropped observations, similar to the approach taken by Baier, Yotov, and Zylkin (2019). In this way, all fixed effects are always present in each practice sample, so that we can always form the necessary predictions for the omitted trade flows associated with the PTAs that have been dropped.10

9 The "oracle" property of estimators such as the adaptive lasso of Zou (2006) refers to their ability to correctly recover which parameters are zero and non-zero in a setting where the number of potential regressors is fixed and the number of observations is large. The "near-oracle" property of the plug-in lasso is similar, but its rate of convergence is slower and depends on the number of potential regressors, because in the setting considered by Belloni, Chen, Chernozhukov, and Hansen (2012) the number of potential regressors is allowed to grow with the sample size.
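A sketch of how such an agreement-based partition might be constructed is given below; the data frame columns are hypothetical, with agreement equal to NA for pairs that never sign a PTA.

# Assign each agreement to one of k folds; observations from pairs covered by
# an agreement are held out together with that agreement, while observations
# never covered by any PTA (fold 0) always stay in the estimation sample.
make_agreement_folds <- function(df, k = 25) {
  agreements <- unique(na.omit(df$agreement))
  fold_of <- setNames(sample(rep_len(seq_len(k), length(agreements))),
                      agreements)
  ifelse(is.na(df$agreement), 0L, fold_of[as.character(df$agreement)])
}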
10 It may, however, happen that some provisions are not included in the agreements used in the estimation sample. This is less likely to happen if k is large, and we therefore use k = 25.

The main advantage of cross-validation is that it is explicitly designed to optimize predictive performance. Thus, it may offer a conceptual advantage where forecasting tasks are concerned. However, a known weakness of the standard lasso with cross-validation is that it often errs on the side of selecting too many variables that are not relevant.11 Furthermore, it does not take heteroskedasticity into account when performing the selection, and it generally has neither an oracle nor a near-oracle property in large samples. For these reasons, cross-validation is not our preferred method for answering the question of which provisions matter for trade; we consider it mainly to illustrate the basic mechanics of the lasso and as a check on our plug-in results.12

11 In linear models, tuning λ using cross-validation is analogous to selection based on the Akaike information criterion, which ensures that the probability of selecting too few variables goes to zero but does not eliminate the possibility of selecting too many. Relatedly, Drukker and Liu (2019) find that selection using cross-validation also leads to the inclusion of too many regressors in Poisson regressions. In our own application, we too find that the cross-validation method selects many more provisions than the plug-in method.
12 Alternatively, we could consider the adaptive lasso (Zou, 2006), which adds a second tuning parameter and is known to deliver consistent variable selection. However, in our application we have found that the adaptive lasso is similar to the standard cross-validation lasso in that it is much too lenient and keeps too many regressors that are not relevant. The simulations reported in the next section suggest that this is likely to be the case in relatively small samples.

3.3.1 Extensions of the plug-in lasso

One important feature of the lasso is that it selects variables that are good predictors of the outcome, but these are not necessarily variables that have a causal impact on the outcome. Indeed, Zhao and Yu (2006) show that only when the so-called "irrepresentability condition" is valid can we expect the variables selected by the lasso to have a causal interpretation; the condition essentially imposes limits on the degree of collinearity between the variables with a causal effect on the outcome and the other candidate regressors (see also Wainwright, 2009). As we have noted, in the case of our data set there is a very high degree of collinearity between some of the variables, and therefore we cannot expect the irrepresentability condition to hold. Furthermore, for the plug-in lasso especially, which tends to select a very parsimonious model, we should worry that the selected provisions may mask the effects of a potentially more complex set of other provisions that are often included in the same agreements as the provisions that are selected.

To address this important complication, we now introduce two methods that add variables to the set of regressors selected by the plug-in lasso, and in the next section we evaluate their performance in a simulation experiment; we call these methods the "iceberg lasso" and the "bootstrap lasso".

The Iceberg Lasso

Simply put, the iceberg lasso involves performing a subsequent set of plug-in lasso regressions in which each of the provisions selected by the plug-in lasso estimator is regressed on all of the provisions that were excluded; the set of variables selected by the iceberg lasso is the union of the set selected in the first step with the sets selected in each of the regressions of the second step. The purpose of the second-step regressions is to identify bundles of provisions that are highly correlated with the ones selected in the first step, and that therefore may be representable by them in the sense of Zhao and Yu (2006). That is, each of the variables selected by the PPML-lasso with the plug-in tuning parameter may be just "the tip of the iceberg" of a bundle of variables that have a causal impact on trade, and the lasso regressions in the second step may help to identify these bundles. As such, the iceberg lasso may be interpreted as a data-driven alternative to the method used by Dhingra, Freeman, and Mavroeidi (2018) to construct provision bundles.13
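In pseudo-R terms, the two steps look as follows, where plugin_lasso() is a hypothetical helper that runs a lasso with the plug-in penalty and returns the names of the selected variables.

# Iceberg lasso second step: regress each provision selected in the first
# step on all excluded provisions, and take the union of everything selected.
iceberg_lasso <- function(first_step, provisions) {
  excluded <- setdiff(colnames(provisions), first_step)
  second_step <- unlist(lapply(first_step, function(s)
    plugin_lasso(y = provisions[, s], X = provisions[, excluded])))
  union(first_step, second_step)
}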
The Bootstrap Lasso

It is well documented that in small to moderate samples the set of variables selected by the lasso can be somewhat unstable, in the sense that it is very sensitive to perturbations of the sample (see, e.g., Mullainathan and Spiess, 2017). We use this feature of the lasso to try to alleviate the tendency of the plug-in lasso to select too few variables. In what we call the bootstrap lasso, we apply the plug-in lasso to an additional set of B − 1 samples obtained by bootstrap, and define the set of variables selected by this method as the variables that are selected sufficiently often across the B samples considered. Doing so has several conceptual benefits.

First, because this method is likely to uncover variables that substitute for the originally selected variables in approximating the patterns found in different versions of the sample, the augmented set of variables it selects is likely to contain more of the relevant variables than the initial set selected by the plug-in lasso. Second, the frequency with which each variable is selected provides useful information about the stability of its selection and thus about the degree of confidence we should have in its importance to the model. Third, averaging estimates and predictions across bootstrap samples may reduce overfitting due to sampling error in the original data; in the machine learning literature, this approach is known as "bootstrap aggregating", or "bagging" for short (see, e.g., Hastie, Tibshirani, and Friedman, 2009). Naturally, the performance of the bootstrap lasso will depend on B and on the frequency cutoff used to select the variables, with lower cutoffs increasing the proportion of relevant variables selected but also the number of irrelevant variables included in the model. In our application, we use B = 250 and restrict our attention to variables that are selected with a frequency exceeding 5% or 1%.14

13 The iceberg lasso complements the approach adopted by Regmi and Baier (2020), who use machine learning tools to construct groups of provisions and then use these clusters in a gravity equation. The main difference between the two approaches is that Regmi and Baier (2020) use what is called an unsupervised machine learning method, which uses only information on the provisions to form the clusters. In contrast, the iceberg lasso selects the provisions using a supervised method that considers the impact of the provisions on trade, and then adds another step that can be interpreted as unsupervised learning.
14 In the simulations we use B = 20 and only the 5% cutoff.
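A stylized sketch of this procedure, reusing the hypothetical plugin_lasso() helper from above, is given below; the paper applies the plug-in lasso to the original sample plus B − 1 bootstrap samples, whereas for simplicity this sketch resamples B times.

# Bootstrap lasso: rerun the plug-in lasso on bootstrap resamples and keep
# the variables whose selection frequency exceeds the chosen cutoff.
bootstrap_lasso <- function(y, X, B = 250, cutoff = 0.05) {
  picks <- replicate(B, {
    idx <- sample(length(y), replace = TRUE)
    colnames(X) %in% plugin_lasso(y[idx], X[idx, , drop = FALSE])
  })
  freq <- rowMeans(picks)                 # selection frequency per variable
  colnames(X)[freq > cutoff]
}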
3.4 Discussion and caveats

Having described the ideas behind our methods, several further caveats are in order. First, by construction, not all of the provisions selected by the iceberg lasso and the bootstrap lasso can be said to have causal effects. Whether or not these methods are more informative than other methods that are already known to over-select regressors is an empirical matter, and the answer will depend on the application.

Second, in general, we need to be very humble about the potential causal interpretation of our results. We view our approach as a statistical method to select a group of variables that is likely to include the ones most relevant to the fit of the three-way gravity model. This of course requires taking the model to be an appropriate representation of the determinants of trade. The three-way gravity model has the considerable advantage that it isolates a particular variation in the data that is empirically relevant for the study of trade agreements, namely the within-pair variation that is time-varying and independent of country-specific changes in trade. However, the initial PPML-lasso with the tuning parameter selected by the plug-in method is likely to omit relevant variables, and that obviously complicates the interpretation of those estimates. The additional steps in the iceberg lasso and in the bootstrap lasso are explicitly designed to address this latter issue and should at least partially alleviate this problem, at the cost of possibly selecting some variables that effectively have little or no impact on trade.

4 Simulation Evidence

In this section we report the results of a simulation exercise investigating the finite-sample properties of the variable-selection methods discussed above. The simulation design covers a range of scenarios that, to different degrees, combine two important features of our application: a relatively small sample and a high degree of collinearity between several potential explanatory variables. The results therefore provide information on the performance of the different methods in conditions similar to those we face, and illustrate how these performances change as we progressively move towards less challenging environments.

In all experiments, the n observations of the dependent variable are generated as

y = \exp\left(1 + \beta x_1 + z + \sigma\varepsilon\right),

where β and σ are parameters and x_1, z, and ε are independent random draws from the standard normal distribution. In the estimation, performed by PPML-lasso, ε is not included as a regressor (it is the error term), z is always included as a regressor whose coefficient is not penalized, and we use different methods to select other regressors from a set of p potential explanatory variables x_1, ..., x_p. Therefore, in this design, x_1 plays the role of the presumably small number of provisions that effectively affect trade, x_2, ..., x_p represent the provisions that have no impact on trade, and z mimics the role of the fixed effects, which explain a significant share of the variation of trade and are included without penalty.
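For transparency, one replication of this design can be generated as follows. This is a sketch under our notational assumptions, using the parameter values β = 0.2 and σ = 0.3 set in the next paragraph; the equicorrelated block is drawn via a common factor.

# Simulated data: y = exp(1 + beta*x1 + z + sigma*eps), with the first q of
# the p candidate regressors equicorrelated with coefficient rho.
simulate_design <- function(n, rho, q, beta = 0.2, sigma = 0.3) {
  p <- 5 * ceiling(sqrt(n))                 # p = 80, 160, or 320
  f <- rnorm(n)                             # common factor inducing corr. rho
  X <- cbind(sqrt(rho) * f + sqrt(1 - rho) * matrix(rnorm(n * q), n, q),
             matrix(rnorm(n * (p - q)), n, p - q))
  z <- rnorm(n)
  y <- exp(1 + beta * X[, 1] + z + sigma * rnorm(n))
  list(y = y, X = X, z = z)
}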
The parameters β and σ determine the relevance of x_1 and the signal-to-noise ratio. Because gravity equations typically have an excellent fit, we set β = 0.2 and σ = 0.3, which ensures that the model has a reasonably high R² and that the effect of x_1 is neither too small (which would make its role very difficult to detect) nor too large (in which case all approaches would perform excellently).

The p potential explanatory variables are obtained as random draws from the normal distribution; the first q variables x_1, ..., x_q are equicorrelated with correlation coefficient ρ, and the remaining ones are independent of all other variables. All regressors have zero mean and variance 1, and we perform simulations with q ∈ {5, 10, 20}, ρ ∈ {0.75, 0.90, 0.99}, and n ∈ {250, 1000, 4000}, and set p to 5⌈√n⌉, where ⌈·⌉ denotes the ceiling function; that is, depending on the value of n, p is either 80, 160, or 320.15

In these simulations we consider each of the four methods presented above: the cross-validation lasso, the plug-in lasso, the iceberg lasso, and the bootstrap lasso. The bootstrap lasso is performed with B = 20, and we include in the set of selected variables any variable that is selected in at least one sample (that is, we use a cutoff of 5%). Additionally, we also consider the adaptive lasso of Zou (2006), with the penalty parameter chosen by cross-validation in both steps.16 Unlike the other methods we consider, the adaptive lasso has the so-called oracle property, implying that asymptotically it will choose the right set of regressors, and it therefore provides an interesting benchmark against which the performance of the other methods can be judged.17 We repeat the simulations 1,000 times and study both the ability of each method to correctly select x_1 as a regressor and their predictive performance.

4.1 Variable selection

For each of the cases considered, Table 3 presents the percentage of times the regressor x_1 is selected and, in parentheses, the average number of regressors selected by each method. The results in Table 3 reveal that the various methods can have very different performances.

Starting with the ability of each method to correctly select x_1 as a regressor, we find that for n = 250 the lasso with penalty chosen by the plug-in method (PI) is the method with the worst results, and its performance deteriorates quickly as ρ and q increase. The adaptive lasso (AL) leads to better results, but its performance is also very poor when ρ = 0.99. The lasso with the penalty chosen by cross-validation (CV) provides a substantial improvement, but it also struggles for larger values of ρ. The bootstrap lasso (BL) is at least as successful as CV, but clearly dominates it for the higher values of ρ.

15 A noticeable difference between the simulation design we use and our application is that in the simulations the potential explanatory variables have a continuous distribution, whereas in the application they are dummies. We performed some experiments where the potential explanatory variables are dummies generated using the method described by Lunn and Davies (1998) and found broadly comparable results. However, we prefer to report the results obtained using the normally distributed variables because, when dummies are used, we frequently encounter numerical issues and cases of perfect collinearity that make it more difficult to keep track of the variables selected.
16 We also performed simulations using Zou and Hastie's (2005) elastic net. However, those results are not particularly interesting and are not reported, to conserve space and to simplify the exposition.
17 Note, however, that the plug-in lasso has a related near-oracle property.
Finally, the iceberg lasso (IL) is marginally outperformed by CV and BL when ρ = 0.75, outperforms CV but is again marginally outperformed by BL when ρ = 0.9, but has a substantial advantage over all other methods for ρ = 0.99.18

Table 3: Percentage of times x_1 is selected & average number of variables selected (in parentheses)

                 ρ = 0.75                  ρ = 0.90                  ρ = 0.99
   n       q=5      q=10     q=20     q=5      q=10     q=20     q=5      q=10     q=20
 250 CV   100.0     99.7     99.3     96.6     91.8     85.5     55.2     37.7     23.4
          (8.65)   (8.55)   (8.74)   (8.87)   (8.66)   (8.64)   (8.52)   (8.22)   (7.93)
     AL    99.7     99.4     97.9     93.9     87.4     80.4     45.3     29.4     17.7
          (7.22)   (7.21)   (7.05)   (7.34)   (7.21)   (7.05)   (6.99)   (6.72)   (6.26)
     PI    91.6     89.9     88.1     80.6     72.1     63.7     41.1     26.8     16.9
          (1.26)   (1.52)   (1.89)   (1.45)   (1.73)   (2.06)   (1.23)   (1.33)   (1.41)
     BL   100.0    100.0     99.8     96.6     98.4     96.7     90.4     79.2     64.2
          (11.11)  (12.81)  (15.27)  (11.31)  (13.25)  (15.66)  (11.27)  (12.77)  (14.03)
     IL    95.7     95.9     95.2     95.9     95.8     93.0     95.3     93.4     80.1
          (4.80)   (9.14)   (15.97)  (4.81)   (9.43)   (17.00)  (4.78)   (9.32)   (15.65)
1000 CV   100.0    100.0    100.0    100.0    100.0     99.9     81.0     69.8     56.4
          (9.43)   (9.59)   (10.05)  (9.76)   (10.10)  (10.69)  (9.92)   (10.11)  (10.51)
     AL   100.0    100.0    100.0    100.0     99.7     99.7     68.3     54.8     40.8
          (3.93)   (4.19)   (4.49)   (4.71)   (5.22)   (5.85)   (5.37)   (5.97)   (6.22)
     PI    99.8     99.8     99.7     99.2     98.4     97.5     71.4     55.9     41.4
          (1.31)   (1.54)   (1.88)   (1.63)   (2.02)   (2.57)   (1.75)   (2.02)   (2.34)
     BL   100.0    100.0    100.0    100.0    100.0    100.0     98.0     93.7     87.1
          (8.88)   (10.89)  (13.91)  (9.26)   (11.67)  (15.23)  (9.36)   (11.85)  (14.81)
     IL   100.0    100.0    100.0    100.0    100.0    100.0    100.0    100.0     98.8
          (5.01)   (10.00)  (19.22)  (5.00)   (10.01)  (19.69)  (5.01)   (10.01)  (19.72)
4000 CV   100.0    100.0    100.0    100.0    100.0    100.0     99.0     97.8     94.9
          (10.46)  (10.85)  (11.24)  (10.78)  (11.28)  (11.88)  (11.18)  (12.06)  (12.63)
     AL   100.0    100.0    100.0    100.0    100.0    100.0     91.9     86.0     79.1
          (1.00)   (1.00)   (1.00)   (1.03)   (1.00)   (1.03)   (1.18)   (1.30)   (1.70)
     PI   100.0    100.0    100.0    100.0    100.0    100.0     98.0     93.9     88.1
          (1.23)   (1.43)   (1.68)   (1.53)   (1.96)   (2.42)   (2.00)   (2.60)   (3.18)
     BL   100.0    100.0    100.0    100.0    100.0    100.0    100.0     99.9     99.8
          (7.86)   (9.91)   (13.03)  (8.44)   (11.04)  (14.94)  (8.93)   (11.94)  (16.27)
     IL   100.0    100.0    100.0    100.0    100.0    100.0    100.0    100.0    100.0
          (5.00)   (10.00)  (19.99)  (5.00)   (10.00)  (20.00)  (5.01)   (10.00)  (20.00)

The performance of all methods improves for the larger sample sizes, but the IL maintains its advantage in the more challenging cases with ρ = 0.99, with BL having a very similar performance. Overall, the two extensions of the PI we consider, the BL and the IL, greatly improve the ability to identify the relevant regressor, and there is generally little to choose between them, except in the extreme cases with n = 250 and ρ = 0.99, where the IL has a very clear advantage.

The results for the average number of variables selected are also interesting. In all cases considered, CV tends to lead to a high average number of selected regressors. At the other extreme, PI is generally the most parsimonious, except when n is large, in which case the oracle property of AL starts to become salient.

18 Part of the reason why in some cases IL does not perform well is that sometimes PI selects no regressors at all, and in those cases IL cannot improve on it, but BL can.
Turning now to the extensions of the PI method, we observe that the average number of regressors selected by BL is always reasonably high and that, for the values of ρ we consider, the average number of variables selected by the IL increases with q, suggesting that the method performs as intended. Naturally, this behavior will be less pronounced for lower values of ρ, and we have confirmed this in unreported simulations.

In summary, for very large samples, the adaptive lasso with penalty parameter selected by cross-validation is the preferred method; this is justified both by our simulation results and by its oracle property. However, in small to medium samples, and especially with high correlation between potential explanatory variables, the adaptive lasso is outperformed by other methods. In these cases, the choice of method depends on whether we favor selecting the relevant regressors or having a parsimonious model. If parsimony is paramount, the lasso with penalty parameter selected by the plug-in method is difficult to beat. However, if selecting the relevant regressors is important, the bootstrap lasso and the iceberg lasso are safe bets, with the iceberg lasso being clearly preferable only for smaller samples where there is extremely high collinearity between the relevant variable and other potential controls.

4.2 Prediction

We now consider the predictive ability of the models obtained with the different variable-selection methods. To that end, for each replication of the simulations we generated 100 additional observations and used the different models to predict these observations. In this context, we can consider both lasso predictions, using the penalized lasso estimates, and post-lasso predictions, using unpenalized estimates.19

We computed penalized and unpenalized predictions for all approaches and found that for CV and AL penalized predictions tend to dominate unpenalized ones, while the reverse holds for PI, IL, and BL. Table 4 summarizes these results and reports the mean square error (MSE) of the prediction errors for each of the models considered. To conserve space, we only report the results obtained with the penalized predictors for CV and AL, and with the unpenalized predictors for PI, IL, and BL. For comparison, the table also presents the MSE of the predictions obtained with the unpenalized PPML estimates of the model that includes all p regressors and with the PPML estimates of the "oracle" model that includes just x_1.

The results in Table 4 show that the predictions obtained with the unpenalized estimator of the full model are clearly outperformed by all lasso-based predictions, with the difference being particularly stark in the smaller sample. The results also suggest that the predictive performance of the different methods depends little on the values of ρ and q, but generally improves with n. The exception to this is the IL, for which we see a small but systematic drop in performance as q increases.

19 The unpenalized predictions for the IL are computed from the PPML estimates of a model including the full set of variables selected by the IL; for BL they are computed as the average of the predictions corresponding to the post-lasso PPML estimates in each sample. The penalized predictor for the IL is obtained from a plug-in lasso based on the full set of variables selected by the IL; for BL, the penalized predictor is the average of the predictions obtained with the penalized estimates in each of the bootstrap samples.
Perhaps the most striking feature of the results in Table 4 is, however, the excellent performance of PI, which can be comparable to that of the oracle model even in cases where PI often fails to identify x1 as a predictor. It is also noteworthy that the performance of the BL is very good and better than that of the IL, especially for the larger numbers of correlated regressors. For the larger sample, there is little to choose between the different lasso methods, but AL has the best performance.

Table 4: MSE for prediction errors

                    ρ = 0.75              ρ = 0.90              ρ = 0.99
                  5      10     20      5      10     20      5      10     20
 n = 250
 CV             6.85   6.83   6.86   6.87   6.88   6.88   6.83   6.83   6.80
 AL             7.27   7.23   7.22   7.29   7.26   7.24   7.17   7.18   7.08
 PI             6.57   6.53   6.66   6.59   6.63   6.71   6.53   6.52   6.52
 BL             6.63   6.60   6.66   6.64   6.62   6.66   6.57   6.53   6.53
 IL             6.71   6.83   7.21   6.71   6.84   7.25   6.72   6.85   7.23
 All regressors 10.98  10.98  10.98  10.98  10.98  10.98  10.98  10.98  10.98
 Oracle         6.39   6.39   6.39   6.39   6.39   6.39   6.39   6.39   6.39
 n = 1000
 CV             6.34   6.35   6.35   6.34   6.34   6.35   6.33   6.32   6.34
 AL             6.34   6.31   6.30   6.35   6.39   6.40   6.39   6.41   6.47
 PI             6.19   6.19   6.22   6.18   6.19   6.22   6.16   6.17   6.20
 BL             6.19   6.18   6.21   6.18   6.18   6.21   6.16   6.16   6.18
 IL             6.22   6.31   6.48   6.22   6.31   6.47   6.22   6.31   6.48
 All regressors 8.44   8.44   8.44   8.44   8.44   8.44   8.44   8.44   8.44
 Oracle         6.19   6.19   6.19   6.19   6.19   6.19   6.19   6.19   6.19
 n = 4000
 CV             6.37   6.37   6.37   6.36   6.37   6.38   6.37   6.38   6.38
 AL             6.34   6.34   6.34   6.33   6.33   6.34   6.34   6.34   6.34
 PI             6.34   6.35   6.36   6.34   6.35   6.35   6.33   6.34   6.35
 BL             6.35   6.36   6.37   6.36   6.37   6.37   6.36   6.37   6.38
 IL             6.34   6.35   6.43   6.34   6.35   6.43   6.34   6.35   6.43
 All regressors 7.39   7.39   7.39   7.39   7.39   7.39   7.39   7.39   7.39
 Oracle         6.34   6.34   6.34   6.34   6.34   6.34   6.34   6.34   6.34

 Note: The table reports the mean square error of the prediction error obtained using penalized predictors for CV and AL, and unpenalized predictors for PI, IL, and BL. For comparison, the table also presents the mean square error of the predictions obtained with the model with all regressors and with the "oracle" model that just includes the relevant regressor.

4.3 Summary of the findings

The simulation results presented above, which confirm and extend the findings of Drukker and Liu (2019), have important implications for our work. Given that in our application we only have data on 282 trade agreements,20 we cannot expect any of the methods considered to be able to precisely identify the set of provisions that matter for trade. The task of identifying the correct set of explanatory variables is particularly challenging in our application because many of the provisions are very strongly correlated with others, and there are even cases of perfect collinearity.

In this challenging context, the iceberg lasso and bootstrap lasso emerge as providing a good compromise between parsimony and the ability to identify the relevant variables. The iceberg lasso has the practical advantage of being easier to implement and of not requiring the choice of additional parameters, such as the number of bootstrap samples. Consequently, the iceberg lasso is our preferred approach to select the relevant variables, but the bootstrap lasso is a credible alternative that can be used, at least, as a robustness check. Additionally, the bootstrap lasso is the only approach we have considered that can provide information about model uncertainty, but exploring that possibility is beyond the scope of this paper.

20 Note that the information on the effect of the different provisions is limited by the relatively small number of PTAs that are observed. Therefore, despite having a large number of observations, we effectively only have a small sample to identify the effect of the different provisions.
If the objective of the researcher is to accurately predict the trade impact of a given PTA, the preferred approach is to compute the predictions using the post-lasso estimates obtained with the plug-in penalty. Indeed, this approach performs extremely well in all cases, and is only marginally outperformed by the adaptive lasso in the larger sample we considered.21 However, the bootstrap lasso is also a credible alternative in this context and can serve as a useful robustness check.

5 Empirical Results

In this section, we present the lasso results obtained using the methods described and studied in the previous sections. We first present results for the plug-in method before briefly discussing the results obtained using cross-validation. We then turn to the iceberg and bootstrap lasso results, which each build in their own way on the selection done by the plug-in lasso. We also include a brief discussion of using these methods for prediction.

21 One may wonder why the PPML-lasso with the tuning parameter chosen by the plug-in method predicts so well, even if it often fails to select the right regressor. The answer, of course, is that when the purpose is simply to predict the outcome, the results change little if the regressor with a causal impact is replaced by another that is highly correlated with it.

5.1 Plug-in Lasso Results

Table 5 presents results for the plug-in lasso and post-lasso regressions discussed before.22 In column (1), we start by presenting the results of a traditional PPML gravity estimation with a dummy for the presence of a PTA between the trading partners. This shows that we can replicate the usual finding that PTAs lead to a significant increase in trade flows. Specifically, we find that the PTAs in our data increase trade by 14% (exp(0.131) − 1 = 0.14). Column (2) then shows the results of the plug-in PPML-lasso regression, showing only the coefficients that are found to be non-zero. Using this approach, the lasso selects 8 provisions related to anti-dumping, competition policy, technical barriers to trade (TBT), and trade facilitation.

Broadly speaking, these variables can all be rationalized as having intuitive effects on trade. The selected anti-dumping and competition policy provisions create more certainty as to how disciplinary investigations and proceedings will be carried out in these policy areas.23 This increased certainty may increase entry by foreign exporting firms. The inclusion of provisions related to technical barriers to trade and trade facilitation is likewise intuitive, but the selection of TF45, which facilitates obtaining certificates of origin, seems of particular note in that it highlights the costs of complying with rules of origin. It is worth noting that the plug-in PPML-lasso selects TBT2 and TBT29, two provisions that are perfectly collinear in our data set. This illustrates both the ability of the method to select variables that are perfectly collinear and the challenges faced when trying to interpret the results in this setting.
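Column (1) of Table 5 is a standard three-way PPML gravity regression. A minimal sketch using the fixest package (whose fepois function implements PPML with high-dimensional fixed effects) might look as follows; the data frame panel and the variable names are hypothetical placeholders for the data structure described in the text:

    # Column (1)-style gravity regression: PPML with a PTA dummy and
    # exporter-time, importer-time, and exporter-importer fixed effects.
    library(fixest)
    gravity <- fepois(
      trade ~ pta | exp_time + imp_time + pair,  # hypothetical variable names
      data = panel                               # hypothetical panel data frame
    )
    exp(coef(gravity)["pta"]) - 1                # partial effect: exp(0.131) - 1 = 0.14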
We next estimate a "post-lasso" PPML regression, that is, a standard PPML regression using only the provisions that were selected in the previous step. These post-lasso PPML results, presented in column (3), show that some of the selected provisions have large effects when estimated in the conventional way. For example, the inclusion of anti-dumping provision AD14, which requires that anti-dumping proceedings establish "material injury" to domestic producers, is associated with an increase in trade flows of about 42% (exp(0.349) − 1 = 0.42). Interestingly, not all of the provisions selected by the lasso step are found to be statistically significant in the post-lasso step. This apparent contradiction arises for two reasons. First, the lasso focuses on the contribution of each variable to the pseudo-likelihood function, which is not the same as testing whether its coefficient is statistically different from zero. Second, because the lasso shrinks all coefficients towards zero simultaneously, it reduces the influence of the collinearity between them and can allow individual provisions that are not significant in the conventional regressions to speak more loudly.

In column (4), we re-estimate the model using the same covariates as column (3) but now re-adding our original PTA dummy from column (1). In this case, the coefficient on PTA captures any effect on trade flows that is not already captured by the provision variables that were selected by the lasso. With this in mind, we take the insignificant and near-zero coefficient on PTA in column (4) as an encouraging indication that the selected provisions completely explain the average PTA effect reported in column (1).

22 Both the PPML standard errors and the plug-in lasso estimates account for clustering, which is done at the agreement level for observations that correspond to agreements, and at the pair level for the remaining observations.
23 For more on the effect of anti-dumping provisions, see Prusa, Teh, and Zhu (2022).

Table 5: PPML, PPML-lasso, and post-lasso PPML results for plug-in approach
Dependent variable: Bilateral Trade Flows (1964-2016, every 4 years)

                                                        PPML     Lasso   Post-lasso   PPML     PPML
                                                         (1)      (2)       (3)       (4)      (5)
 PTA                                                    0.131                        0.008    0.087
                                                       (0.044)                      (0.062)  (0.041)
 EU                                                                                           0.658
                                                                                             (0.087)
 AD14. Anti-dumping - Material Injury                            0.329    0.349      0.347
                                                                         (0.117)    (0.119)
 CP23. Competition Policy - Transparency/Coordination            0.002    0.118      0.118
                                                                         (0.077)    (0.078)
 TBT2 / TBT29. Mutual Recognition†                               0.142    0.184      0.182
                                                                         (0.142)    (0.144)
 TBT7. Technical Reg's: use International Standards              0.016    0.032      0.034
                                                                         (0.078)    (0.080)
 TBT8. Conformity Assessment: Mutual Recognition                 0.028    0.123      0.124
                                                                         (0.099)    (0.099)
 TBT33. Standards: use Regional Standards                        0.109    0.113      0.116
                                                                         (0.061)    (0.064)
 TF45. Issuance of Proof of Origin                               0.000    0.089      0.095
                                                                         (0.032)    (0.053)

 Gravity equations with exporter-time, importer-time, and exporter-importer fixed effects, estimated by PPML using 316,317 observations. Columns labelled "Post-lasso" report PPML coefficients for all variables selected by a plug-in lasso method in a prior step. All other columns report further experiments using PPML. Cluster-robust standard errors are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.
 † TBT2 is perfectly collinear with TBT29: TBT2 refers to mutual recognition of technical regulations; TBT29 refers to mutual recognition of standards.

Next, column (5) returns to our original simple model from column (1) but adds a dummy variable for the EU.24 Our reasons for treating the EU separately from other agreements are three-fold.
First, we suspect that not all of the EU's efforts to promote trade are captured in how their provision variables are coded in our data. There could also be unobserved effects that are channeled through the EU's secondary law process, in which the EU's governing institutions are empowered to pass new regulations and directives on an ongoing basis. Second, our provisions data set does not include agreements that are no longer in effect. For the most part, the agreements that cannot be included are EU pre-accession agreements, which obviously are subsumed by the EU agreement once each new member joins the EU. As discussed in Section 2, we deal with this data issue in practice by dropping all observations associated with obsolete agreements. Nonetheless, this could lead to biased estimates of the EU agreement and the provisions associated with it. Third, the latest EU agreement has in place six of the eight provisions selected in column (2) (all except AD14 and TBT7); thus, we want to make sure we are not simply picking up an "EU effect" in the provisions that are selected.

As the PPML results in column (5) show, the estimated EU effect is large, several times that of non-EU PTAs in fact. However, when we treat the EU as a possible predictor in the lasso, we find that it is not selected, and consequently the set of provision variables selected is identical to that in column (2), which is our preferred set to work with in the subsequent iceberg lasso and bootstrap lasso analyses.

5.2 Cross-Validation Lasso Results

As discussed before, the plug-in approach to choosing the penalty parameter λ tends to choose a relatively small set of regressors and may fail to pick the "correct" regressors. For comparison, we now discuss the choice of regressors when we use the cross-validation approach.25 Figure 2 shows how the out-of-sample mean square error (MSE) varies with the log of the tuning parameter, which is scaled by Σijt yijt so that the results do not depend on the scale of the data. At the optimal value of the tuning parameter, λ/Σijt yijt = 0.00025, the cross-validation approach selects 128 provisions to have non-zero effects. Additionally, some of the selected provisions are perfectly collinear with variables that are not selected; if we take this into account, the effective number of provisions selected is 133, which is many more than what we found using the plug-in approach.

For more illustration, Figures 3 and 4 show the corresponding regularization paths for selected provisions.26 That is, the figures show how the value of the estimated (post-lasso) coefficient on the selected provisions changes as we vary λ. As expected, fewer provisions are selected as we increase λ and, for values of λ/Σijt yijt around 0.01, which is forty times larger than the optimal value, we generally see a close correspondence between the results in Figures 3 and 4 and those that we found earlier using the plug-in method.

24 We use EU as shorthand for the EU and EC agreements.
25 As explained before and in the Appendix, the cross-validation is performed clustering by agreement.
26 In each panel of the figures, the fourth set of estimates from the right corresponds to the variables selected by the cross-validation method.

Figure 2: Cross-validation MSE vs. tuning parameter
[Figure: cross-validation MSE plotted against the log of the scaled tuning parameter.]

Note, however, that it is not necessarily the case that the set of provisions selected at lower levels of λ includes the set of provisions selected at higher levels. For example, Figure 3 shows that provision AD14, which was one of the provisions selected by the plug-in approach, is selected with a negative coefficient for the smallest value of λ we consider, drops out when we increase the penalty, and is selected with a positive coefficient for higher values of λ. Intuitively, for small values of λ, the procedure selects many provisions, and the high collinearity between the variables selected makes it difficult to precisely identify their effects. As we increase λ, some provisions are dropped; because many provisions are correlated with AD14, it can be dropped without significant deterioration of the out-of-sample forecasts during cross-validation, and hence it is no longer selected. It is only when the provisions correlated with AD14 are purged from the model, as λ increases even more, that AD14 on its own gains predictive power and is again included.
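Regularization paths like those in Figures 3 and 4 can be traced with glmnet. The sketch below plots the penalized coefficient paths (the figures themselves show post-lasso estimates at each λ, which would require an extra refitting step); X and y stand in for the provision dummies and trade flows after the fixed effects have been dealt with, which we gloss over here:

    # Sketch of a regularization path; X and y are placeholders.
    library(glmnet)
    path <- glmnet(X, y, family = "poisson")
    plot(path, xvar = "lambda", label = TRUE)   # coefficients vs. log(lambda)
    log(path$lambda / sum(y))                   # the scaled tuning parameter used in the text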
Overall, the plug-in and cross-validation approaches lead to the selection of very different sets of trade agreement provisions. While some provisions, such as TBT07 or TF45, are selected by both approaches, others, such as AD14, are only selected by the plug-in method, and many provisions, such as anti-dumping provisions AD05 and AD06, are only selected using cross-validation. Furthermore, we also see in Figures 3 and 4 that many of the estimated effects for the provisions selected by cross-validation are not plausible when interpreted on their own. These observations reflect the known shortcomings of the cross-validation approach that we stated earlier and found support for in our simulations.

Figure 3: Regularization path for selected provisions (AD, ET, CP, STE, SUB, ENV, LM, and MIG)
[Figure omitted.]

Figure 4: Regularization path for selected provisions (IPR, TBT, SPS, SER, ROR, TF, INV, MOC, and PP)
[Figure omitted.]

5.3 Iceberg Lasso Results

As previously mentioned, we cannot be certain whether the variables selected by the lasso have a causal effect on trade or are simply highly correlated with the variables that have a causal effect. In this section, we investigate this issue further by carrying out the iceberg lasso analysis we proposed earlier. That is, for each of the provisions from our preferred set of estimates (those from the third column of Table 5), we run an additional plug-in lasso regression in which we regress each selected provision on all of the provisions excluded by our first-stage lasso.27 As discussed, the purpose of these auxiliary regressions is to construct bundles of provisions that, at least when combined, are likely to have a causal impact on trade flows when included in trade agreements. As we have noted, the reader should be cautioned that we will not be able to say with high certainty whether a given provision is important for promoting trade but, as we will see, this method gives us significantly increased parsimony relative to cross-validation. Furthermore, as we have seen from our simulations, it should also give us more confidence in the results.

Table 6 presents the results of our iceberg lasso analysis. The first two rows of Table 6 list each of the eight provisions selected by the first-stage plug-in lasso, as well as their estimated impact on trade flows from column (3) of Table 5. The subsequent rows of Table 6 report all provisions that were not selected by the lasso in the first step but are identified in the second step of the iceberg lasso; we also report the correlation of each of these provisions with the selected provision in the first row. Finally, the last row reports the R² of the regression of each selected provision on the corresponding correlated provisions. For example, column (1) shows that anti-dumping provision AD14 is highly correlated with two further anti-dumping provisions (AD06 and AD08), as well as with one provision on environmental protection (ENV42); the R² of the regression of AD14 on these three provisions is 0.95.

27 These linear plug-in lasso regressions are performed using only the 34,370 observations for which PTAs are in force. This is because the provisions are identically zero for the remaining observations, which therefore are not informative about the relations of interest. As a consequence, the clustering now is only by agreement.
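In R, the second stage of the iceberg lasso amounts to a loop over the first-stage selections. In this sketch, selected (the column indices picked by the first-stage plug-in lasso), X (the matrix of provision dummies for observations with a PTA in force), and lambda_plugin (the plug-in penalty level) are assumed inputs:

    # Second stage of the iceberg lasso, a sketch: regress each selected
    # provision on all excluded provisions by linear lasso, as in footnote 27.
    library(glmnet)
    iceberg <- lapply(selected, function(k) {
      aux <- glmnet(X[, -selected, drop = FALSE], X[, k], family = "gaussian")
      b   <- as.matrix(coef(aux, s = lambda_plugin))
      rownames(b)[b != 0 & rownames(b) != "(Intercept)"]  # Table 6 entries for provision k
    })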
The results in Table 6 show that the iceberg lasso identifies a total of 42 (= 8 + 34) distinct provisions that are likely to be associated with increased trade. This finding contrasts with the 133 provisions identified by the cross-validation lasso and the 8 provisions selected by the plug-in lasso. Therefore, as in the simulations in the preceding section, the iceberg lasso appears to provide a good compromise between the cross-validation lasso, which selects so many provisions that its results are difficult to interpret, and the plug-in lasso, which is likely to miss important provisions.

Table 6: Iceberg lasso results

 (1) AD14 (+41.7%)   (2) CP23 (+12.5%)   (3) TBT02/29 (+20.2%)   (4) TBT07 (+3.2%)
 (5) TBT08 (+13.1%)  (6) TBT33 (+12.0%)  (7) TF45 (+9.3%)

 Provisions identified in the second step (raw correlation with the column's provision in parentheses):
 AD06 (0.98)   AD06 (0.40)   AD06 (-0.07)  AD06 (0.51)   SUB10 (0.84)  AD11 (-0.05)  AD06 (0.16)
 AD08 (0.98)   AD08 (0.40)   AD08 (-0.07)  AD08 (0.51)   TF42 (0.93)   ENV44 (-0.02) AD08 (0.16)
 ENV42 (0.98)  CP22 (0.80)   CP14 (0.61)   ENV42 (0.51)  MOC26 (-0.10) AD11 (0.08)
 CP24 (0.89)   CP21 (0.77)   ENV44 (0.08)  PP08 (-0.01)  CP15 (0.71)
 ENV42 (0.40)  CP22 (0.80)   SPS21 (0.16)  SUB07 (0.08)  ENV19 (0.40)
 PP08 (0.05)   ENV22 (-0.01) SUB07 (0.10)  TBT05 (0.69)  ENV27 (0.50)
 SPS24 (-0.05) ENV42 (-0.07) TBT15 (0.68)  TBT06 (0.98)  ENV42 (0.16)
 STE31 (0.54)  ENV44 (-0.01) TBT34 (0.93)  TBT14 (0.89)  MOC26 (0.16)
 TBT10 (-0.01) SPS11 (-0.00) TBT15 (0.58)  STE37 (0.06)  TF42 (0.65)
 STE32 (0.66)  TBT32 (0.69)  SUB07 (0.03)
 TF43 (-0.04)  SUB09 (0.78)  TBT34 (0.42)  SUB10 (0.28)  TF44 (0.38)
 SUB10 (0.90)  TF42 (0.64)   TF44 (0.98)   TF42 (0.98)

 R² (columns (1)-(7)): 0.95, 0.82, 0.97, 0.86, 0.86, 0.97, 0.96

 Notes: Table shows PTA provisions associated with increases in bilateral trade flows (row 1), together with the estimated increase in trade flows (row 2), as well as other provisions that predict the provision in row 1 (rows 3-15; numbers in brackets are raw correlations with the provision from row 1). The last row displays the R² of the regression of each selected provision on the corresponding correlated provisions.

Looking in more detail at the results in Table 6, we find that provision AD14 is correlated with other anti-dumping provisions; this correlation is not surprising because all these provisions fulfill a similar purpose, which is to increase transparency in the use of anti-dumping duties. In that sense, one conclusion to be drawn from this exercise is that anti-dumping provisions are likely to increase trade flows, although we cannot say which of them has the biggest effect. Table 6 shows that, more surprisingly, AD14 is also strongly correlated with ENV42. This correlation seems to be due to what might be called a template effect, that is, the tendency of important trading blocs such as the EU and the US to use similar provisions in all their agreements.
For example, most agreements signed by the EU include provisions on anti-dumping and the environment, hence leading to a high correlation between the corresponding provisions in our data.28

The same provisions that were found to be correlated with AD14 also have a reasonably high correlation with CP23, which serves to promote transparency in competition policy. That said, the variables with the strongest correlations with CP23 are other competition policy provisions, namely CP22 and CP24. Thus, it seems likely that the presence of provisions on competition policy is behind the observed trade-increasing effect of CP23, although we are again unable to say exactly which provision is driving this effect.

We find that TBT07 also has a substantial correlation with the above-mentioned AD06, AD08, and ENV42 provisions but, not surprisingly, the strongest correlations are with other TBT provisions (TBT15, TBT34) that also relate to the use of international standards. Thus, it seems that provisions encouraging the use of international standards in the area of technical barriers to trade are likely to be behind the trade increases associated with provision TBT07, although we cannot say which of the individual TBT provisions is driving the observed effect.

As for the other TBT provisions selected in the first step, TBT02/29, TBT08, and TBT33, they are all strongly related to TF42, a trade facilitation provision, with TBT02/29 being also correlated with provisions related to competition policy (CP14, CP21, and CP22), state-owned enterprises (STE32), and subsidies (SUB09 and SUB10), and TBT33 with other TBT provisions such as TBT06 and TBT14. This set of results makes clear that provisions related to TBT are likely to have a significant trade facilitation effect, but we are not able to identify precisely which ones are relevant.

The plug-in PPML-lasso also selects a provision related to the simplification of procedures to issue proof of origin (TF45), and this provision is highly correlated with TF44, which relates to the simplification of requirements for proof of origin. As noted above, Table 6 also indicates that other trade facilitation provisions are correlated with some of the provisions selected in the first stage; this is true for CP23, TBT33, and especially for TBT02/29 and TBT08. Thus, our results suggest that trade facilitation procedures, particularly those related to rules of origin, are likely to play a significant role in increasing trade flows.

Finally, the iceberg lasso also identifies provisions from other areas that help predict the provisions identified in the first step. For example, provisions in policy areas such as movement of capital and public procurement are related to TBT33, but these types of provisions are associated with smaller raw correlations. By the logic of the lasso, it is likely that these provisions are informative for predicting the presence of TBT33 in a relatively small number of agreements where other provisions with higher raw correlations are not found.

In summary, although it is not possible to identify with certainty which provisions are most important for increasing trade, our results allow us to find a relatively small bundle of provisions that are likely to have the desired effect. In particular, provisions related to TBTs, anti-dumping, trade facilitation, subsidies, and competition policy are likely to enhance the trade-increasing effect of trade agreements.

28 In our data, ENV42 is perfectly collinear with AD06 and AD08.
5.4 Bootstrap Lasso Results

As an alternative to the iceberg lasso, we now present the results obtained with the bootstrap lasso. Tables 7 and 8 summarize the results obtained from 250 bootstrap samples. The resampling process treats pairs belonging to the same agreement as belonging to the same cluster, treating pairs as clusters otherwise. In each replication, we perform selection using the plug-in lasso and record which variables are selected, along with their post-lasso PPML coefficient estimates.

Table 7: Bootstrap lasso results

 Provisions with largest          Provisions selected
 average coefficients             most frequently
 AD14       0.079                 AD14       0.372
 CP23       0.065                 CP23       0.320
 CP22       0.063                 TBT07      0.308
 AD05       0.055                 SPS06      0.228
 TBT07      0.054                 TBT08      0.208
 TBT02/29   0.048                 SUB12      0.184
 TBT08      0.038                 TBT02/29   0.168
 SUB12      0.030                 TBT33      0.160
 TBT34      0.029                 CP22       0.156
 SPS06      0.028                 TBT34      0.152
 TF42       0.027                 TBT06      0.148
 TBT33      0.023                 AD05       0.140
 TF41       0.023                 CP21       0.124
 TBT06      0.021                 TF45       0.116
 CP21       0.020                 ENV33      0.116

 Notes: Bootstrap plug-in lasso performed using cluster-bootstrap resampling with 250 replications. The numbers shown are (left) the 15 largest average post-lasso coefficient estimates across all replications and (right) selection frequencies for the 15 most frequently selected provisions.

Table 7 presents the average coefficients for the provisions with the 15 largest average coefficients across all replications (on the left), as well as the selection frequencies for the 15 most frequently selected provisions (on the right). It is worth noting that even the provisions that are selected most frequently in relative terms are selected less than half of the time. For example, AD14 is the most commonly selected provision, and it has the largest coefficient estimate of the variables selected by the plug-in lasso (see Table 5), but it is only selected in 37% of replications. This illustrates that, as discussed before, we should only have limited confidence that AD14 is the provision that delivers the effect indicated by the original plug-in estimates for AD14. At the same time, if we take the method literally, AD14 is found to be quantitatively more likely to matter than other provisions.

Overall, the results in Table 7 are reassuring in that they broadly confirm our earlier findings using the iceberg lasso. Indeed, most of the provisions in Table 7 were previously identified as potentially relevant by the iceberg lasso. Moreover, there are multiple provisions related to anti-dumping, competition policy, trade facilitation, and TBTs that tend to be selected with relatively high frequency and have relatively high average coefficients (when averaged across all the bootstrap replications), reminiscent of the provision groupings that were indicated with the iceberg method.
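The resampling scheme can be sketched as follows. Here run_plugin_lasso is a hypothetical stand-in for the plug-in PPML-lasso step (it is not a function from any package), and cluster_id assigns each observation to its agreement (or pair) cluster:

    # Cluster bootstrap of the plug-in lasso selection step, a sketch.
    set.seed(3)
    B   <- 250
    ids <- split(seq_len(nrow(X)), cluster_id)             # observations by cluster
    sel <- matrix(0, B, ncol(X), dimnames = list(NULL, colnames(X)))
    for (b in 1:B) {
      draw  <- unlist(ids[sample(length(ids), replace = TRUE)])  # resample whole clusters
      fit_b <- run_plugin_lasso(y[draw], X[draw, ])              # hypothetical helper
      sel[b, fit_b$selected] <- 1                                # record selections
    }
    colMeans(sel)     # selection frequencies, as in the right panel of Table 7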
Table 8: Bootstrap lasso results: Summarizing results by provision category

                        Number of provisions   Number of provisions   Sum of average
                        selected more than     selected more than     post-lasso effects
                        5% of the time         1% of the time         across categories
 Anti-dumping                   3                      5                   0.171
 Competition Policy             3                      5                   0.151
 Environment                    1                      5                   0.017
 Export Taxes                   2                      5                   0.049
 Investment                     0                      2                   0.020
 IPR                            0                      5                   0.019
 Labor Markets                  0                      0                   0.000
 Migration                      1                      1                   0.012
 Movement of Capital            1                      2                   0.023
 Public Procurement             0                      1                   0.013
 Rules of Origin                1                      4                   0.021
 Services                       0                      1                   0.004
 SPS                            1                     10                   0.062
 State aid                      2                      2                   0.011
 Subsidies                      5                      7                   0.076
 TBTs                           8                     13                   0.237
 Trade Facilitation             2                      5                   0.064
 Total                         30                     74                   0.951

 Note: The table documents the categories in which provisions were most likely to be selected and the total of the average coefficients of each provision within each category.

Table 8 further summarizes the bootstrap lasso results by documenting the broad provision categories in which provisions were most likely to be selected, as well as the sum of the average coefficients within each category. These results, therefore, show which provision categories, when taken as a whole, are likely to have the biggest impact on trade. The category with the biggest total impact turns out to be TBTs, followed by anti-dumping and competition policy. Next after that are subsidies, sanitary and phytosanitary measures, trade facilitation, and export taxes. Overall, the differences between categories seem to comport with intuition (very small impacts for services and labor markets, for example). They are also, again, broadly in line with the findings obtained with the iceberg lasso.

5.5 Predicting the effect of trade agreements

Having identified sets of provisions that are more likely to positively affect trade flows, it is natural to think of ways to use this information to evaluate the effects of different PTAs, and even to predict the impact of new ones. In the remainder of this section we discuss ways to perform these prediction exercises and the associated caveats.29

The simulation results presented in Section 4 suggest that, in small to moderate samples, the most reliable predictions are the ones based on the (post-lasso) PPML estimates of a model whose regressors are the provisions selected by the plug-in lasso. This kind of prediction can easily be obtained using the results in column (3) of Table 5. For example, we have noted that the latest EU agreement includes all the provisions selected by the plug-in lasso, with the exception of AD14 and TBT7. Therefore, the effect of the latest EU agreement is estimated to be 87% (exp(0.118 + 0.184 + 0.123 + 0.113 + 0.089) − 1 = 0.87). This result is comparable to the effect estimated when the EU dummy is included in the model as in column (5) of Table 5, which is 86% (exp(0.618) − 1 = 0.86).30

In results that are summarized in the third column of Table 9, we repeat this exercise for each of the PTAs in our data.31 As in Baier, Yotov, and Zylkin (2019), we find a wide variety of effects, ranging from very large impacts in agreements such as the Eurasian Economic Union, which includes all of the selected provisions, to no effect at all in agreements that do not include any, such as ASEAN.32 In comparison with column (1) of this table, which describes results for PPML with the full set of provision variables, we see an immediate advantage of using the plug-in method to model PTA heterogeneity: it greatly cuts down on overfitting. The range spanned by the estimates obtained with the full set of provisions reaches implausibly large positive and negative values at the extremes, and their standard deviation is thousands of times that of the estimates produced using the plug-in lasso. As shown in column (2), overfitting may also be a problem for the predictions generated by the cross-validation lasso, which also leads to some implausible estimates. These results resonate with what we found in the simulations reported in Section 4, where both the model with all regressors and the model with regressors selected by cross-validation performed poorly.

29 As in Section 4, in this section we compute penalized predictions when using cross-validation, and post-lasso unpenalized predictions for the plug-in, iceberg, and bootstrap lasso. For the bootstrap lasso, the predictions are obtained by averaging the post-lasso predictions in each of the bootstrap samples.
30 Of course, using the delta method it is possible to obtain confidence intervals for these effects. However, such confidence intervals do not take into account model uncertainty, which is likely to be the main source of uncertainty in this context. We consider this issue below.
31 Note that the average estimated effect is 13.8%, which is very close to the estimated PTA effect of 14.0% corresponding to the result in column (1) of Table 5.
32 In contrast to Baier, Yotov, and Zylkin (2019), we are able to identify heterogeneity across different PTAs but not within PTAs.
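The arithmetic behind these agreement-level predictions is simple. Using the post-lasso coefficients from column (3) of Table 5:

    # Predicted partial effect of a PTA: sum the post-lasso coefficients of
    # the selected provisions it contains (column (3) of Table 5).
    beta <- c(AD14 = 0.349, CP23 = 0.118, TBT02_29 = 0.184, TBT07 = 0.032,
              TBT08 = 0.123, TBT33 = 0.113, TF45 = 0.089)
    eu <- c("CP23", "TBT02_29", "TBT08", "TBT33", "TF45")  # latest EU agreement
    exp(sum(beta[eu])) - 1                                 # = 0.87, as in the text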
We next consider the performance of the two extensions of the plug-in lasso we have proposed, the iceberg lasso and the bootstrap lasso. The iceberg lasso has the advantage that it is likely to select more of the provisions with a causal impact than the plug-in lasso. Moreover, it performed reasonably well as a predictive method in our simulations. However, as is apparent from column (4) of Table 9, in this application predictions based on the iceberg lasso lead to some unrealistic estimates. Intuitively, the provisions selected by the iceberg lasso will, by design, include multiple regressors that are highly collinear with one another. Therefore, although it may be possible to estimate the joint effect of these variables with reasonable precision, the same is unlikely to be the case for each individual effect. This implies that the iceberg lasso is likely to be a good predictor of the effect of PTAs that include all of these variables, but it may lead to unreliable results for PTAs that only include a subset of the highly collinear provisions.

Table 9: Summarizing Estimates of Heterogeneous PTA Effects

                          (1)            (2)       (3)       (4)       (5)
                      All variables      CV      Plug-in   Iceberg   Bootstrap
 Descriptive statistics
 Min                     -81.2%       -50.4%      0.0%     -62.8%      0.0%
 Max                     > 1e6%       387.0%    144.4%     284.9%    101.0%
 Mean                 328,774.6%       32.1%     13.8%      17.2%     12.5%
 Median                   26.4%        14.4%      9.3%       6.7%      7.2%
 Stdev.               300,514.7pp      63.0pp    20.7pp     42.4pp    15.3pp
 Correlations
 PPML                      1           0.146     0.054      0.233     0.041
 CV                       0.146        1         0.391      0.550     0.513
 Plug-in                  0.054        0.391     1          0.507     0.925
 Iceberg                  0.233        0.550     0.507      1         0.679
 Bootstrap                0.041        0.513     0.925      0.679     1
 Estimated partial effects for selected PTAs
 EU                      104.9%       105.4%     87.1%     101.6%     64.2%
 EEA                      80.4%        90.5%      9.3%      94.4%     18.3%
 Eurasian Econ. Union     21.8%        71.8%    144.4%      38.5%    101.0%
 NAFTA                    77.9%        77.5%     79.9%      81.5%     52.9%
 MERCOSUR                145.5%       115.9%     42.1%      76.2%     39.6%
 ECOWAS                  469.6%       379.2%      9.3%      23.3%     19.4%
 ASEAN                     1.8%         9.4%      0.0%       0.0%      3.3%

 This table summarizes estimated partial effects for individual PTAs produced by the different methods we consider.
The column labelled "All variables" refers to an unpenalized PPML regression with all 305 provision variables. The other columns refer to variants of the lasso discussed in Section 3.

The predictions based on the bootstrap lasso performed well in our simulations, and this approach also shows promise here. As shown in column (5) of Table 9, the PTA estimates produced by the bootstrap lasso are less extreme and have the lowest dispersion of any of the methods we consider, consistent with what would be expected for a method based on bootstrap aggregation. Though they are highly correlated with the estimates produced by the plug-in lasso, the selected PTA estimates shown in the bottom panel of Table 9 reveal that the estimated effects obtained with the plug-in lasso and the bootstrap lasso can differ substantially for individual PTAs.

It should be noted that the bootstrap lasso is the only approach we have considered that can provide information about model uncertainty. Indeed, as a by-product of the bootstrap sampling procedure, it can provide confidence intervals showing how sensitive predictions of individual PTA effects are to the particular sample that is used in the estimation. We have not rigorously evaluated the validity of such confidence intervals for bounding prediction uncertainty, but it is certainly an avenue worth exploring.

In summary, the plug-in lasso is our preferred method to estimate the effect of individual PTAs, but the bootstrap lasso may be a worthwhile check at the very least. The results of this exercise, however, need to be treated with some caution. As we have repeatedly noted, the results of the plug-in lasso do not have a causal interpretation. Therefore, their accuracy in predicting the effects of individual PTAs will depend, at least to some extent, on whether the selected provisions themselves have a causal impact on trade or serve as a signal of the presence of provisions that have a causal effect. When this condition holds, the predictions based on this method are likely to be reasonably accurate and, indeed, the simulation results reported in Section 4 show that this approach can work well even in situations where the variables having a causal impact on the outcome are not selected by the plug-in lasso. That said, it is possible to envision scenarios where predictions based on the plug-in lasso fail dramatically. For example, it could be the case that a PTA is incorrectly measured to have zero impact despite having many of the true causal provisions.

6 Conclusions

In this paper, we have proposed new methods for assessing the impact of individual trade agreement provisions on trade flows. While other work in this area has relied on summary measures of agreement depth or on specific provision bundles of interest, our approach is instead to study the rich provision content of PTAs as a variable selection problem. By combining the three-way PPML estimator that is popular in the study of PTAs with lasso methods for variable selection, we are able to identify a relatively parsimonious set of provisions that are most likely to impact trade. While these provisions span a range of policy areas, our results generally support the conclusion that a select number of provisions related to technical barriers to trade, anti-dumping, trade facilitation, subsidies, and competition policy are most effective at promoting trade, as compared to other types of provisions that appear in PTAs.
In spite of the obvious appeal that lasso methods have in this context, we need to be clear that interpreting their results requires some important caveats. In particular, we know that it is possible that even our preferred lasso methods may fail to discover important trade-promoting provisions, and that they are almost certain to lead to the inclusion of provisions that are not relevant. The iceberg lasso and bootstrap lasso methods do, however, improve upon both the standard cross-validation lasso and the plug-in lasso as variable selection methods.

In terms of broader applications, our methods are not limited to just PTAs or even just to trade. There are many other contexts in which the iceberg lasso and bootstrap lasso methods we have introduced could be helpful tools for researchers wishing to determine which of a large number of variables are worth focusing on as most relevant for the outcome. Furthermore, by integrating the lasso into a nonlinear model with high-dimensional fixed effects, we show how machine learning methods for variable selection and related tasks can be utilized in much more general settings than what had been possible previously.

Appendix

Table A1: Provisions selected by the iceberg lasso

Anti-dumping
 AD06  If there are no sales in the normal course of trade in the domestic market of the exporting country
 AD08  Cost of production in the country of origin plus a reasonable amount
 AD11  Price effects of dumped imports
 AD14  Requirement to establish material injury to domestic producers

Competition Policy
 CP14  Does the agreement require the establishment or existence of competition policy (either economy-wide or sector-specific)?
 CP15  Does the agreement prohibit/regulate cartels/concerted practices?
 CP21  Does the agreement regulate mergers and acquisitions?
 CP22  Does the agreement contain provisions that promote predictability?
 CP23  Does the agreement contain provisions that promote transparency?
 CP24  Does the agreement contain provisions that promote the right of defense?

Environmental Laws
 ENV19 Does the agreement regulate pollution by ships?
 ENV22 Does the agreement regulate fishing subsidies?
 ENV27 Does the agreement promote renewable energy and improving energy efficiency?
 ENV42 Does the agreement require states to comply with the UN Conference on Environment and Development?
 ENV44 Does the agreement require states to comply with the International Energy Program?

Movement of Capital
 MOC26 Does the transfer provision explicitly exclude "good faith" and non-discriminatory application of its laws related to prevention of deceptive and fraudulent practices?

Public Procurement
 PP08  Does the agreement contain explicit provisions on MFN treatment of third parties?

Sanitary and Phytosanitary Measures
 SPS11 Does the agreement promote the creation of concerted/regional standards?
 SPS21 Risk Assessment: Is there reference to international standards/procedures?
 SPS24 Is the burden of justifying non-equivalence on the importing country?

Table A1 (cont'd): Provisions selected by the iceberg lasso

State-Owned Enterprises
 STE31 Does the agreement prohibit anti-competitive behavior of state enterprises?
 STE32 Does the agreement require state enterprises not to distort trade?
 STE37 Does the agreement indicate the geographical market where the objectionable conduct or the effect takes place?

Subsidies
 SUB07 Does the agreement introduce any ceiling to permitted subsidies?
 SUB09 Does the agreement include any specific regulation of agricultural subsidies?
 SUB10 Does the agreement include any specific regulation of fisheries subsidies?

Technical Barriers to Trade
 TBT02 Technical Regulations - Is mutual recognition in force?
 TBT05 Technical Regulations - Are there specified existing standards to which countries shall harmonize?
 TBT06 Technical Regulations - Is the use or creation of regional standards promoted?
 TBT07 Technical Regulations - Is the use of international standards promoted?
 TBT08 Conformity Assessment - Is mutual recognition in force?
 TBT10 Conformity Assessment - Do parties participate in international or regional accreditation agencies?
 TBT14 Conformity Assessment - Is the use or creation of regional standards promoted?
 TBT15 Conformity Assessment - Is the use of international standards promoted?
 TBT29 Standards - Is mutual recognition in force?
 TBT32 Standards - Are there specified existing standards to which countries shall harmonize?
 TBT33 Standards - Is the use or creation of regional standards promoted?
 TBT34 Standards - Is the use of international standards promoted?

Trade Facilitation and Customs
 TF42 Does the agreement regulate customs and other duties collection?
 TF43 Does the agreement require the sharing of customs revenues?
 TF44 Do trade facilitation provisions simplify requirements for proof of origin?
 TF45 Do trade facilitation provisions simplify procedures to issue proof of origin?

More Details on HDFE-PPML-Lasso Estimation

The minimization problem that defines the three-way PPML-lasso is

\[
(\hat{\beta},\hat{\alpha},\hat{\gamma},\hat{\eta}) := \arg\min_{\beta,\alpha,\gamma,\eta}\ \frac{1}{n}\sum_{i,j,t}\exp(x_{ijt}'\beta+\alpha_{it}+\gamma_{jt}+\eta_{ij})-\frac{1}{n}\sum_{i,j,t}y_{ijt}\left(x_{ijt}'\beta+\alpha_{it}+\gamma_{jt}+\eta_{ij}\right)+\frac{\lambda}{n}\sum_{k=1}^{p}\hat{\phi}_k|\beta_k|, \qquad (3)
\]

where φ̂_k, to be precisely defined below, is identical to 1 except when the plug-in method is used. The first-order conditions (FOCs) for this problem are

\[
\hat{\alpha}_{it}:\ \frac{1}{n}\sum_{j}\left(y_{ijt}-\hat{\mu}_{ijt}\right)=0,\ \forall i,t;
\]
\[
\hat{\gamma}_{jt}:\ \frac{1}{n}\sum_{i}\left(y_{ijt}-\hat{\mu}_{ijt}\right)=0,\ \forall j,t;
\]
\[
\hat{\eta}_{ij}:\ \frac{1}{n}\sum_{t}\left(y_{ijt}-\hat{\mu}_{ijt}\right)=0,\ \forall i,j;
\]
\[
\hat{\beta}_{k}:\ \frac{1}{n}\sum_{i,j,t}\left(y_{ijt}-\hat{\mu}_{ijt}\right)x_{ijt,k}-\frac{\lambda}{n}\hat{\phi}_k\,\mathrm{sign}(\hat{\beta}_k)=0,\ k=1,\ldots,p;
\]

where μ̂_ijt denotes μ_ijt := exp(x'_ijt β + α_it + γ_jt + η_ij) evaluated at β̂, α̂, γ̂, η̂. Notice that the penalty only affects the FOCs for the main covariates of interest. The FOCs for the fixed effects are exactly the same as they would be in unpenalized PPML. That said, further simplification is still needed because it is generally not possible to estimate all of the parameters directly, with or without the penalty. Instead, we first need to "concentrate out" the fixed-effect parameters. That is, instead of minimizing (3) over all of the parameters, we treat α̂_it, γ̂_jt, and η̂_ij as functions of β that are implicitly defined by their FOCs. The resulting "concentrated" minimization problem is

\[
\hat{\beta} := \arg\min_{\beta}\ \frac{1}{n}\sum_{i,j,t}\exp\left(x_{ijt}'\beta+\hat{\alpha}_{it}(\beta)+\hat{\gamma}_{jt}(\beta)+\hat{\eta}_{ij}(\beta)\right)-\frac{1}{n}\sum_{i,j,t}y_{ijt}\left[x_{ijt}'\beta+\hat{\alpha}_{it}(\beta)+\hat{\gamma}_{jt}(\beta)+\hat{\eta}_{ij}(\beta)\right]+\frac{\lambda}{n}\sum_{k=1}^{p}\hat{\phi}_k|\beta_k|, \qquad (4)
\]

such that β is now the only argument we need to solve for. The FOC for each β̂_k associated with this modified problem is

\[
\hat{\beta}_k:\ \frac{1}{n}\sum_{i,j,t}\left[y_{ijt}-\exp\left(x_{ijt}'\hat{\beta}+\hat{\alpha}_{it}(\hat{\beta})+\hat{\gamma}_{jt}(\hat{\beta})+\hat{\eta}_{ij}(\hat{\beta})\right)\right]\tilde{x}_{ijt,k}-\frac{\lambda}{n}\hat{\phi}_k\,\mathrm{sign}(\hat{\beta}_k)=0,
\]

where

\[
\tilde{x}_{ijt,k} := x_{ijt,k}+\frac{d\hat{\alpha}_{it}(\beta)}{d\beta_k}+\frac{d\hat{\gamma}_{jt}(\beta)}{d\beta_k}+\frac{d\hat{\eta}_{ij}(\beta)}{d\beta_k} \qquad (5)
\]

captures both the direct and indirect effects of a change in β_k on the conditional mean of y_ijt.

To explain how we deal with the fixed effects, assume for the moment that we know the true values of μ_ijt := exp(x'_ijt β + α_it + γ_jt + η_ij) that we will eventually estimate.
If that is the case, then the penalized PPML solution (β, α, γ, η) is also the solution to the following weighted least squares problem:

\[
\min_{\beta,\alpha,\gamma,\eta}\ \frac{1}{2n}\sum_{i,j,t}\mu_{ijt}\left(z_{ijt}-x_{ijt}'\beta-\alpha_{it}-\gamma_{jt}-\eta_{ij}\right)^2+\frac{\lambda}{n}\sum_{k=1}^{p}\hat{\phi}_k|\beta_k|,
\]

where

\[
z_{ijt}=\frac{y_{ijt}-\mu_{ijt}}{\mu_{ijt}}+\log\mu_{ijt}
\]

is the transformed dependent variable that is used to motivate estimation via iteratively re-weighted least squares (IRLS). The convenient thing about this representation of the problem is that we can rewrite it as

\[
\min_{\beta}\ \frac{1}{2}\sum_{i,j,t}\left(\tilde{z}_{ijt}-\tilde{x}_{ijt}'\beta\right)^2+\lambda\sum_{k=1}^{p}\hat{\phi}_k|\beta_k|, \qquad (6)
\]

where z̃_ijt and x̃_ijt are respectively defined as the "partialed-out" versions of z_ijt and x_ijt, which are obtained by within-transforming z_ijt and x_ijt with respect to it, jt, and ij and weighting by μ_ijt. The within-transformation steps involved in computing z̃_ijt and x̃_ijt are the same as in Correia, Guimarães, and Zylkin (2020) and can be computed quickly using the methods of Gaure (2013). Furthermore, one can show that the x̃_ijt that appears in (6) is consistent with the definition given for x̃_ijt,k in (5).

The nice thing about expressing the problem as in (6) is that it now resembles a simple penalized regression problem. It can thus be quickly solved using the coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010). Furthermore, though we do not know the correct estimation weights (the μ_ijt's) beforehand, we can follow the approach of Correia, Guimarães, and Zylkin (2020) by repeatedly updating them until convergence after each new estimate of β, as in IRLS estimation. Altogether, our algorithm closely follows Correia, Guimarães, and Zylkin (2020) and otherwise only involves swapping out their weighted least squares step for a penalized weighted least squares step, as shown in (6). In principle, this algorithm can easily be modified to other settings that feature multi-way fixed effects in order to simplify estimation.
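A stylized version of this IRLS loop, abstracting from the fixed-effect within-transformation (which is the computationally important step in the application), might look as follows; the single-lambda call to glmnet stands in for the penalized weighted least squares step in (6):

    # Stylized IRLS loop for penalized Poisson regression, a sketch.
    library(glmnet)
    irls_plasso <- function(X, y, lambda, tol = 1e-8, max_iter = 100) {
      mu   <- rep(mean(y), length(y))            # initial conditional means
      beta <- rep(0, ncol(X))
      for (it in 1:max_iter) {
        z   <- (y - mu) / mu + log(mu)           # IRLS working dependent variable
        fit <- glmnet(X, z, weights = mu, lambda = lambda, standardize = FALSE)
        beta_new <- as.numeric(coef(fit))[-1]
        mu <- as.numeric(exp(coef(fit)[1] + X %*% beta_new))   # update the means
        if (max(abs(beta_new - beta)) < tol) break
        beta <- beta_new
      }
      beta
    }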
More Details on Plug-in Lasso

Rather than relying on out-of-sample performance, the Belloni, Chernozhukov, Hansen, and Kozbur (2016) "plug-in" lasso method chooses the penalty parameters λ and φ̂_k using statistical arguments. Their specific framework is a simple linear panel-data model, but their reasoning involves modifying the standard lasso penalty to reflect the variance of the score. These concepts are quite general; thus, we can modify their approach to take into account the more complex case of a nonlinear model with multiple fixed effects.

The key condition in choosing these penalty parameters is that they should satisfy the following inequality for all k:

\[
\frac{\lambda\hat{\phi}_k}{n}\ \geq\ \frac{c}{n}\left|\sum_{i,j,t}\left(y_{ijt}-\exp(x_{ijt}'\beta+\alpha_{it}+\gamma_{jt}+\eta_{ij})\right)\tilde{x}_{ijt,k}\right|\quad \forall k, \qquad (7)
\]

for some c > 1. Intuitively,

\[
\frac{1}{n}\left|\sum_{i,j,t}\left(y_{ijt}-\exp(x_{ijt}'\beta+\alpha_{it}+\gamma_{jt}+\eta_{ij})\right)\tilde{x}_{ijt,k}\right|
\]

is the absolute value of the score for β_k. When evaluated at β_k = 0, it tells us to what degree moving each β_k away from zero will affect the fit of the model. If it does not produce a sufficient improvement in fit as compared to the penalty λφ̂_k, then regressor x_ijt,k will not be selected.

Next, suppose that the observations associated with trade agreements are partitioned into G clusters indexed by g = 1, ..., G, and let o = (i, j, t) serve as the unique index for each observation. Set

\[
\hat{\phi}_k^2=\frac{1}{n}\sum_{g}\left(\sum_{o\in g}\tilde{x}_{o,k}\hat{\varepsilon}_o\right)^2=\frac{1}{n}\sum_{g}\sum_{o\in g}\sum_{o'\in g}\tilde{x}_{o,k}\tilde{x}_{o',k}\hat{\varepsilon}_o\hat{\varepsilon}_{o'},
\]

where ε̂_o = ε̂_ijt = y_ijt − exp(x'_ijt β̂ + α̂_it + γ̂_jt + η̂_ij), but can also be obtained as ε̂_o = ε̂_ijt = μ̂_ijt(z̃_ijt − x̃'_ijt β̂). By inspection, this expression provides an estimate of the variance of the score for β_k under the assumption that errors are correlated within their respective clusters. Under suitable regularity conditions, φ̂_k² − φ_k² = o_p(1) uniformly in k, where φ_k² is the analogue of φ̂_k² evaluated at the true values of μ_ijt. By choosing φ̂_k in this way, we ensure that the score for β_k evaluated at zero must be large compared to its standard deviation in order for regressor k to be selected.

The choice of λ then involves setting a value that is sufficiently large that the statistical probability that an irrelevant regressor is selected is small. By the maximal inequality for self-normalized sums (see Jing, Shao, and Wang, 2003), it follows that

\[
\frac{\Pr\left(\hat{\phi}_k^{-1}\,n^{-1/2}\left|\sum_{i,j,t}\tilde{x}_{ijt,k}\,\varepsilon_{ijt}\right|\geq m\right)}{\Pr\left(|N(0,1)|\geq m\right)}-1=o(1),
\]

for |m| = o(n^{1/6}), thus establishing a bound for the tails of the normalized sum. This suggests that, by choosing a λ that is sufficiently large to dominate a p-dimensional standard normal, the inequality in (7) is satisfied. Hence, following Belloni, Chernozhukov, Hansen, and Kozbur (2016), we set

\[
\lambda=\lambda_{plug}=2c\sqrt{n}\,\Phi^{-1}\left(1-\gamma/2p\right),
\]

where c = 1.1 and γ = 0.1/log(n).

As discussed in the main text, after the lasso step we then perform an unpenalized PPML estimation using the selected covariates, a so-called "post-lasso" regression. Let β̂_PL be the estimator of the parameters associated with the s selected covariates. Such an estimator is said to have the "oracle property" if the asymptotic distribution of β̂_PL coincides with that of the estimator we would obtain if we knew exactly which coefficients were equal to zero; i.e., for large enough samples we would have β̂_PL,k = 0 if and only if β_k = 0 for k = 1, ..., p. Hence, for estimators with the oracle property, asymptotically the post-lasso model is indeed the right model. In general, the lasso does not satisfy the oracle property. Nevertheless, under some additional regularization conditions, the use of the plug-in lasso method just described ensures the following "near-oracle" property for β̂_PL:

\[
\left\|\hat{\beta}_{PL}-\beta\right\|_1=O_p\left(\sqrt{\frac{s^2\max(\log n,\log p)}{n}}\right),
\]

and hence the post-lasso estimates are consistent at a rate that differs from the oracle rate only up to the log factor max(log n, log p).

In practice, the plug-in lasso method mainly requires adding one additional step to the procedure used for the estimation of the PPML-lasso with high-dimensional fixed effects described before. Though the φ̂_k penalty terms are not known beforehand, they, too, can be iterated on in the same fashion as μ_ijt: simply use the most recent values of μ̂_ijt (obtained using post-lasso PPML) in each iteration to construct new values for φ̂_k. The method also requires an initial value for μ̂_ijt; for this, we first estimate a three-way gravity model with a single dummy for PTA using PPML.
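The penalty level itself is a one-line computation; evaluated here at the sample size and provision count of the application (n = 316,317 and p = 305):

    # Plug-in penalty level: lambda = 2c * sqrt(n) * qnorm(1 - gamma/(2p)),
    # with c = 1.1 and gamma = 0.1/log(n), as in the text.
    plugin_lambda <- function(n, p, c = 1.1) {
      gamma <- 0.1 / log(n)
      2 * c * sqrt(n) * qnorm(1 - gamma / (2 * p))
    }
    plugin_lambda(n = 316317, p = 305)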
More Details on Cross-Validation

As discussed in the main text, the idea behind cross-validation (CV) is to repeatedly hold out a subset of the sample during estimation and then use it to validate the resulting estimates. In our setup, rather than holding out observations in an unstructured way, we keep together all observations for which a given agreement is in effect, and hold out subsets of agreements. Doing so allows us to obtain estimates for all the fixed effects in the model.

To describe the implementation of CV, suppose that the observations associated with trade agreements are partitioned into G subsets. Each resulting hold-out sample g will have n_g observations, where n_g is the number of observations associated with agreements that are held out in partition g. Because our variables of interest are all dummies, a problem that may occur is that over some subsamples some regressors may not be present, but this is less likely to happen when G is large.

The CV approach sets all regressor-specific penalty weights φ̂_k equal to 1. Let β̂_{L,g}(λ) be the lasso estimator obtained via the minimization of (4) when holding out the n_g observations contained in partition g. Define the CV bandwidth as

\[
\lambda_{CV}=\arg\min_{\lambda}\ \frac{1}{G}\sum_{g=1}^{G}\frac{1}{n_g}\sum_{(i,j,t)\in g}\left[y_{ijt}-\exp\left(x_{ijt}'\hat{\beta}_{L,g}(\lambda)+\hat{\alpha}_{it,L,g}(\lambda)+\hat{\gamma}_{jt,L,g}(\lambda)+\hat{\eta}_{ij,L,g}(\lambda)\right)\right]^2.
\]

Since CV is based on the minimization of the average MSE over different subsamples, we expect it to deliver a much more lenient variable selection. There is some disagreement over whether dummy variables, such as the ones used in our application, should be standardized before applying the CV lasso. This consideration is in contrast to the plug-in lasso, since standardization of the covariates simply causes the φ̂_k terms to be re-scaled without otherwise affecting estimation in that case. We have computed CV lasso results with and without first standardizing and found that the results with standardization are noticeably more similar to the plug-in lasso results. Thus, our preference is to work with standardized dummy covariates.
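With glmnet, agreement-clustered cross-validation can be implemented through the foldid argument, which fixes the fold assignment; agreement_id is a hypothetical vector assigning each observation to its agreement:

    # Clustered cross-validation, a sketch: agreements, not observations,
    # are assigned to folds, so each agreement is held out as a block.
    library(glmnet)
    set.seed(4)
    G          <- 10
    agreements <- unique(agreement_id)
    fold_of    <- sample(rep(1:G, length.out = length(agreements)))
    foldid     <- fold_of[match(agreement_id, agreements)]
    cv <- cv.glmnet(X, y, family = "poisson", foldid = foldid)
    cv$lambda.min / sum(y)        # the scaled optimal tuning parameter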
References

Anderson, J. and E. Van Wincoop (2003). "Gravity with gravitas: A solution to the border puzzle," American Economic Review, 93, 170-192.
Baier, S.L. and J.H. Bergstrand (2007). "Do free trade agreements actually increase members' international trade?," Journal of International Economics, 71, 72-95.
Baier, S.L., J.H. Bergstrand, and M.W. Clance (2018). "Heterogeneous effects of economic integration agreements," Journal of Development Economics, 135, 587-608.
Baier, S.L., J.H. Bergstrand, and M. Feng (2014). "Economic integration agreements and the margins of international trade," Journal of International Economics, 93, 339-350.
Baier, S.L., Y.V. Yotov, and T. Zylkin (2019). "On the widely differing effects of free trade agreements: Lessons from twenty years of trade integration," Journal of International Economics, 116, 206-228.
Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012). "Sparse models and methods for optimal instruments with an application to eminent domain," Econometrica, 80, 2369-2429.
Belloni, A., V. Chernozhukov, C. Hansen, and D. Kozbur (2016). "Inference in high-dimensional panel models with an application to gun control," Journal of Business & Economic Statistics, 34, 590-605.
Correia, S., P. Guimarães, and T. Zylkin (2020). "Fast Poisson estimation with high-dimensional fixed effects," Stata Journal, 20, 90-115.
Dhingra, S., R. Freeman, and E. Mavroeidi (2018). "Beyond tariff reductions: What extra boost to trade from agreement provisions?," LSE Centre for Economic Performance Discussion Paper 1532.
Drukker, D.M. and D. Liu (2019). "A plug-in for Poisson lasso and a comparison of partialing-out Poisson estimators that use different methods for selecting the lasso tuning parameters," mimeo.
Falvey, R. and N. Foster-McGregor (2022). "The breadth of preferential trade agreements and the margins of exports," Review of World Economics, 158, 181-251.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, 33, 1-22.
Gaure, S. (2013). "OLS with multiple high dimensional category variables," Computational Statistics & Data Analysis, 66, 8-18.
Gourieroux, C., A. Monfort, and A. Trognon (1984). "Pseudo maximum likelihood methods: Applications to Poisson models," Econometrica, 52, 701-720.
Hastie, T., R. Tibshirani, and J.H. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York (NY): Springer.
Hofmann, C., A. Osnago, and M. Ruta (2017). "Horizontal depth: A new database on the content of preferential trade agreements," World Bank Policy Research Working Paper 7981.
Jing, B.Y., Q.M. Shao, and Q. Wang (2003). "Self-normalized Cramér-type large deviations for independent random variables," The Annals of Probability, 31, 2167-2215.
Kohl, T., S. Brakman, and H. Garretsen (2016). "Do trade agreements stimulate international trade differently? Evidence from 296 trade agreements," The World Economy, 39, 97-131.
Larch, M., J. Wanner, Y.V. Yotov, and T. Zylkin (2019). "Currency unions and trade: A PPML re-assessment with high-dimensional fixed effects," Oxford Bulletin of Economics and Statistics, 81, 487-510.
Lunn, A.D. and S.J. Davies (1998). "A note on generating correlated binary variables," Biometrika, 85, 487-490.
Mattoo, A., A. Mulabdic, and M. Ruta (2017). "Trade creation and trade diversion in deep agreements," World Bank Policy Research Working Paper 8206.
Mattoo, A., N. Rocha, and M. Ruta (2020). Handbook of Deep Trade Agreements. Washington, DC: World Bank.
Mulabdic, A., A. Osnago, and M. Ruta (2017). "Deep integration and UK-EU trade relations," World Bank Policy Research Working Paper 7947.
Mullainathan, S. and J. Spiess (2017). "Machine learning: An applied econometric approach," Journal of Economic Perspectives, 31, 87-106.
Prusa, T., R. Teh, and M. Zhu (2022). "PTAs and the incidence of antidumping disputes," available at https://tinyurl.com/PTA-PTZ-2022.
Regmi, N. and S. Baier (2020). "Using machine learning methods to capture heterogeneity in free trade agreements," mimeograph.
Santos Silva, J.M.C. and S. Tenreyro (2006). "The log of gravity," Review of Economics and Statistics, 88, 641-658.
Stammann, A. (2018). "Fast and feasible estimation of generalized linear models with high-dimensional k-way fixed effects," arXiv:1707.01815.
Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.
Wainwright, M.J. (2009). "Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso)," IEEE Transactions on Information Theory, 55, 2183-2202.
Weidner, M. and T. Zylkin (2021). "Bias and consistency in three-way gravity models," Journal of International Economics, 132, 103513.
Wüthrich, K. and Y. Zhu (2021). "Omitted variable bias of Lasso-based inference methods: A finite sample analysis," Review of Economics and Statistics, forthcoming.
Yotov, Y.V., R. Piermartini, J.-A. Monteiro, and M. Larch (2016). An Advanced Guide to Trade Policy Analysis: The Structural Gravity Model. Geneva: World Trade Organization.
Zhao, P. and B. Yu (2006). "On model selection consistency of lasso," Journal of Machine Learning Research, 7, 2541-2563.
Zou, H. (2006). "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, 101, 1418-1429.
Zou, H. and T. Hastie (2005). "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, 67, 301-320.