Policy Research Working Paper 9629

Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements

Holger Breinlich, Valentina Corradi, Nadia Rocha, Michele Ruta, J.M.C. Santos Silva, Tom Zylkin

Development Economics, Development Research Group & Macroeconomics, Trade and Investment Global Practice
April 2021

Abstract: Modern trade agreements contain a large number of provisions besides tariff reductions, in areas as diverse as services trade, competition policy, trade-related investment measures, or public procurement. Existing research has struggled with overfitting and severe multicollinearity problems when trying to estimate the effects of these provisions on trade flows. This paper builds on recent developments in the machine learning and variable selection literature to propose novel data-driven methods for selecting the most important provisions and quantifying their impact on trade flows. The proposed methods have the advantage of not requiring ad hoc assumptions on how to aggregate individual provisions and offer improved selection accuracy over the standard lasso. The analysis finds that provisions related to technical barriers to trade, antidumping, trade facilitation, subsidies, and competition policy are associated with enhancing the trade-increasing effect of trade agreements.

This paper is a product of the Development Research Group, Development Economics and the Macroeconomics, Trade and Investment Global Practice. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/prwp. The authors may be contacted at h.breinlich@surrey.ac.uk, v.corradi@surrey.ac.uk, nrocha@worldbank.org, mruta@worldbank.org, jmcss@surrey.ac.uk, and tzylkin@richmond.edu.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent.

Machine Learning in International Trade Research: Evaluating the Impact of Trade Agreements

Originally published in the Policy Research Working Paper Series in April 2021. This version was updated in May 2022. To obtain the originally published version, please email prwp@worldbank.org.

KEY WORDS: Lasso, Machine Learning, Preferential Trade Agreements, Deep Trade Agreements.
JEL CLASSIFICATION: F14, F15, F17.

Research for this paper has been supported in part by the World Bank's Multidonor Trust Fund for Trade and Development.
We gratefully acknowledge financial support through ESRC grant EST013567/1, and thank Scott Baier, Maia Linask, Yoto Yotov, and seminar participants at the World Bank Economics of Deep Trade Agreements Seminar Series for useful comments. Alvaro Espitia, Diego Ferreras-Garrucho, Jiayi Ni, and Nicolas Apfel provided excellent research assistance. The usual disclaimer applies. An R package (penppml) implementing penalized PPML regressions with high-dimensional fixed effects is available from CRAN.

Author affiliations: Holger Breinlich, University of Surrey, CEP and CEPR (h.breinlich@surrey.ac.uk); Valentina Corradi, University of Surrey (v.corradi@surrey.ac.uk); Nadia Rocha, World Bank (nrocha@worldbank.org); Michele Ruta, World Bank (mruta@worldbank.org); J.M.C. Santos Silva, University of Surrey (jmcss@surrey.ac.uk); Tom Zylkin, University of Richmond (tzylkin@richmond.edu).

1 Introduction

International trade is of vital importance for modern economies, and governments around the world try to shape their countries' export and import patterns through numerous interventions. Given the difficulties facing multilateral trade negotiations through the World Trade Organization (WTO), countries have increasingly turned their focus in the last two decades to preferential trade agreements (PTAs) involving only one or a small number of partners. At the same time, attention has shifted from the reduction of import tariffs to the role of non-tariff barriers and behind-the-border policies, such as differences in regulations, technical standards, or intellectual property rights protections. Accordingly, modern trade agreements contain a host of provisions besides tariff reductions, in areas as diverse as services trade, competition policy, trade-related investment measures, or public procurement (Hofmann, Osnago, and Ruta, 2017).

Against this background, researchers and policy makers interested in the effects of trade agreements face difficult challenges. In particular, recent research has tried to move beyond estimating the overall impact of PTAs and to establish the relative importance of individual trade agreement provisions in determining an agreement's overall impact (e.g., Kohl, Brakman, and Garretsen, 2016, Mulabdic, Osnago, and Ruta, 2017, Dhingra, Freeman, and Mavroeidi, 2018, Regmi and Baier, 2020, and Falvey and Foster-McGregor, 2022). However, such attempts face the difficulty that the large number of provisions, and the fact that similar provisions appear in different trade agreements, create severe multicollinearity problems, which make it very difficult to identify the effects of individual provisions. Traditional methods such as gravity regressions of trade flows on dummies for individual provisions are not able to deal with such multicollinearity. Instead, researchers have grouped or aggregated provisions in different ways. For example, Mattoo, Mulabdic, and Ruta (2017) use the count of provisions in an agreement as a measure of its "depth", hence implicitly giving equal weight to each measure. Dhingra, Freeman, and Mavroeidi (2018) overcome multicollinearity problems by grouping services, investment, and competition provisions and examining the effect of these "provision bundles" on trade flows.
In this paper, we build on recent developments in the machine learning and variable selection literature to propose novel data-driven methods to select the most important provisions and quantify their impact on trade flows. These methods address difficulties arising from the high degree of correlation between individual PTA provisions, without requiring ad hoc assumptions on how to aggregate individual provisions. Although, to be clear, they do not completely answer the question of "which provisions matter for trade?", our proposed methods do lead to substantial improvements in our ability to find the more relevant provisions while narrowing down the large number of potential explanatory variables.

We start by proposing an extension of the well-known lasso (Least Absolute Shrinkage and Selection Operator) method for variable selection (see, e.g., Hastie, Tibshirani, and Friedman, 2009) to the case of nonlinear models with high-dimensional fixed effects, which have become standard in the analysis of trade flows (see, e.g., Yotov, Piermartini, Monteiro, and Larch, 2016). Specifically, we use a Poisson pseudo-maximum likelihood (PPML) version of the lasso and show how to choose the tuning parameter of this estimator using either cross-validation or the "plug-in" (or "theory-driven") approach of Belloni, Chernozhukov, Hansen, and Kozbur (2016), which accounts for heteroskedasticity and clustered errors.1

We apply our PPML-lasso estimators to a comprehensive data set on PTA provisions recently made available by the World Bank (Mattoo, Rocha, and Ruta, 2020). Importantly, this database is very detailed, to the point where the number of provision variables we consider is larger than the number of PTAs we observe in our data. In addition, due to template effects and possible synergies between groups of provisions, the 305 provision variables in our data can be highly correlated with one another. We find that the number of provisions selected when using the PPML-lasso with the tuning parameter chosen by cross-validation is too large for the model to have a meaningful interpretation and that, in contrast, the number of provisions identified when using the plug-in penalty is too small to allow us to be confident that it includes the majority of relevant provisions.2

To address these issues, we introduce two additional methods that seek to identify potentially important variables that may have been missed in an initial lasso step based on the plug-in penalty. One of the methods, which we call the "iceberg lasso", involves regressing each of the provisions selected by the plug-in lasso on all other provisions, with the purpose of identifying relevant variables that were initially missed due to their collinearity with the provisions selected in the initial step. The other method, termed the "bootstrap lasso", augments the set of variables selected by the plug-in lasso with the variables selected when the plug-in lasso is bootstrapped. As we show using simulations, these new methods strike a favorable balance between the parsimony of the plug-in lasso and the lenience of cross-validation methods in small-to-moderate data sets where the true causal variables may be highly correlated with an unknown number of other variables.

To provide some headline results, the PPML-lasso based on cross-validation selects 133 provisions as being relevant, whereas using the plug-in penalty we find that only 8 provisions are associated with enhancing the trade-increasing effect of trade agreements.
In turn, the iceberg lasso procedure identifies a set of 42 provisions and, depending on the cutoff used, the bootstrap lasso identifies between 30 and 74 provisions that may be impacting trade. Therefore, our iceberg lasso and bootstrap lasso methods select sets of provisions that are small enough to be interpretable and large enough to give us some confidence that they include the more relevant provisions, something that is confirmed by the simulation evidence we provide. Reassuringly, both the iceberg lasso and the bootstrap lasso select similar sets of provisions, mainly related to technical barriers to trade, anti-dumping, trade facilitation, subsidies, and competition policy. Having identified the set of provisions that are more likely to have an impact on trade, we also discuss how our findings can be used to estimate the effects of different PTAs and to predict the impact of future ones, as well as the risks associated with such exercises.

1 An R package (penppml) implementing penalized PPML regressions with high-dimensional fixed effects is available from CRAN and can be installed with install.packages("penppml"). For more details see https://github.com/tomzylkin/penppml.
2 Our simulation results in Section 4 suggest that the lasso with a penalty parameter chosen by the plug-in method often fails to select the relevant regressors. A similar result, in a different context, is reported by Wüthrich and Zhu (2021).

Our work contributes to several different literatures. Most directly, we contribute to the large and growing literature on the effects of PTAs on trade flows. As previously discussed, this literature has recently tried to decompose the overall PTA effect by disentangling the effects of individual trade agreement provisions. The new methods we propose allow us to select the most important provisions and to quantify their impact on trade flows, while avoiding the need to make essentially arbitrary assumptions about how to aggregate individual provisions (see Mattoo, Mulabdic, and Ruta, 2017; Dhingra, Freeman, and Mavroeidi, 2018).

In addition, we contribute to the machine learning literature interested in variable selection and prediction. In particular, we extend and adapt existing work by Belloni, Chernozhukov, Hansen, and Kozbur (2016) on the use of the lasso in the presence of heteroskedasticity and clustered errors, to make it applicable to the context of international trade flows and trade agreements. As noted above, this requires an extension of their original method to the estimation of nonlinear models with high-dimensional fixed effects using PPML. The iceberg lasso and bootstrap lasso that we propose build on the results obtained using the plug-in penalty and identify additional sets of provisions that may have a causal effect on trade. Both methods add to the information provided by the standard lasso approaches and, as illustrated in our simulations, are better able to identify the provisions that have a causal effect. Therefore, these new methods can potentially be useful in other contexts, especially when the available sample is relatively small and contains a large number of highly correlated potential explanatory variables.

Finally, we contribute to a small existing literature that has used machine learning and other related methods to study the effects of trade agreements in a gravity context. For example, Regmi and Baier (2020) use an unsupervised learning method to group PTAs by textual similarity, so as to provide a more nuanced notion of PTA depth.
Following a similar motivation, Hofmann, Osnago, and Ruta (2017) propose an earlier depth measure for PTAs based on principal components analysis applied to their provisions data. In contrast, Baier, Yotov, and Zylkin (2019) use a two-step methodology where pair-specific PTA effects are estimated in a first stage and then predicted out of sample using country- and pair-specific variables.

The rest of this paper is structured as follows. Section 2 presents the data on PTA provisions and provides a descriptive analysis of these data, highlighting a number of stylized facts about the provisions present in recent trade agreements. Section 3 introduces the variable selection problem in the three-way gravity model context and explains how we implement PPML-lasso estimation with high-dimensional fixed effects. Section 4 presents the results of a simulation study comparing the relative performance of different lasso methods in a simplified setting with high correlation between regressors. Section 5 applies our methods to our database on PTA provisions and shows which individual provisions are the strongest predictors of trade flows. Section 6 concludes; technical details are gathered in an Appendix.

2 Data

Our analysis combines data on international trade flows from Comtrade with the new database on the content of PTAs collected by Mattoo, Rocha, and Ruta (2020). On trade, we use merchandise exports between 1964 and 2016 from 220 exporters to 270 importers. Country pairs without export information are treated as zeros.

The database on the content of trade agreements includes information on 282 PTAs that were signed and notified to the WTO between 1958 and 2017. The data focus on the sub-sample of 17 policy areas that are most frequently covered in trade agreements; these are areas covered in close to or above 20 percent of the trade agreements mapped in Hofmann, Osnago, and Ruta (2017). These policy areas range from environmental laws and labor market regulations, which are covered in roughly 20 percent of the PTAs, to areas such as rules of origin and trade facilitation, which are present in over 80 percent of the agreements (see Figure 1).

Figure 1: Share of PTAs that cover selected policy areas
Note: The figure shows the share of PTAs that cover each policy area. Source: Mattoo, Rocha, and Ruta (2020).

For each agreement and policy area, the database provides granular information on the specific provisions covering stated objectives and substantive commitments, as well as aspects relating to transparency, procedures, and enforcement. The coding exercise focuses on the legal text of the agreements and therefore excludes information on the actual implementation of the commitments included in the agreements.3

3 In this data set, information coming from secondary law (the body of law that derives from the principles and objectives of the treaties) has not been coded. This is of particular importance for agreements such as the EU, since most policy areas covered have used secondary law such as regulations, directives, and other legal instruments to pursue integration.
Table 1: Distribution of essential provisions by policy area

Policy Area                               Total provisions   Essential provisions   Share
Anti-dumping and Countervailing Duties          53                  11              28.8%
Competition Policy                              35                  14              40.0%
Environmental Laws                              48                  27              56.3%
Export Taxes                                    46                  23              50.0%
Intellectual Property Rights                   120                  67              55.8%
Investment                                      57                  15              26.3%
Labor Market Regulations                        18                  12              66.7%
Movement of Capital                             94                   8               8.5%
Public Procurement                             100                   5               5.0%
Rules of Origin                                 38                  19              50.0%
Sanitary and Phytosanitary Measures             59                  24              40.7%
Services                                        64                  21              32.8%
State-Owned Enterprises                         53                  13              24.5%
Subsidies                                       36                  13              36.1%
Technical Barriers to Trade                     34                  19              55.9%
Trade Facilitation and Customs                  52                  11              21.2%
Visa and Asylum                                 30                   3              10.0%
Total                                          937                 305              32.6%

To alleviate the problems caused by the high dimensionality of the data and the high level of correlation across the provisions included in the agreements, the analysis presented in this paper focuses on a sub-set of "essential" provisions. This includes the set of substantive provisions (those that require specific integration/liberalization commitments and obligations) plus the disciplines among procedures, transparency, enforcement, or objectives that are viewed as indispensable and complementary to achieving the substantive commitments. Non-essential provisions are referred to as "corollary".4

The share of essential provisions in the total number of provisions included in an agreement ranges from less than 10 percent for public procurement, movement of capital, and visa and asylum, to more than 50 percent for policy areas such as environmental laws and labor market regulations. Overall, the sub-set of essential provisions represents almost one-third (305/937) of the total number of provisions coded in this exercise (see Table 1).

The coverage of essential provisions also varies widely across trade agreements and disciplines, indicating that not all PTAs cover the same set of essential provisions. As shown in Table 2, more than three-quarters of agreements cover 25 percent or less of the essential provisions included in policy areas such as environmental laws, anti-dumping, sanitary and phytosanitary measures, and technical barriers to trade. Conversely, for policy areas such as visa and asylum, rules of origin, and trade facilitation and customs, more than 70 percent of the mapped agreements cover between 25 and 75 percent of essential provisions. With the exception of services and investment, coverage of more than 75 percent of essential provisions is rare and happens in less than 15 percent of the mapped agreements.

4 The classification into essential and corollary in the database is based on experts' knowledge and, hence, has an element of subjectivity.

One important caveat regarding this data set is that it does not cover all of the trade agreements that have been in force during the period under study. Specifically, our information on provisions is limited to agreements that are in effect at the present day, i.e., it excludes any agreements that are no longer in effect. For this reason, we drop observations associated with agreements no longer in effect. This means that the effects of newer agreements are identified by changes in trade relative to when a pair did not have any agreement, rather than relative to pre-existing agreements. The majority of the dropped observations are due to pre-accession agreements that new European Union (EU) members sign before joining the EU.
Thus, to use one of these cases as an example, Italy-Croatia is included in our data for years 1992-2000 (after Croatian independence and before the initial EU-Croatia PTA in 2001) and for year 2016 (after Croatia joins the EU in 2013). The EU is treated differently in our analysis for this reason, as we discuss further in Section 4. To identify agreements no longer in effect, we consult the NSF-Kellogg database created by Jeff Bergstrand and Scott Baier, cross-checked with data from the WTO. The EU and the earlier European Community are treated as the same agreement for these purposes, though it is allowed to evolve as new provisions are added.

Table 2: Coverage of essential provisions by policy area

                                          Share of agreements covering:
Policy Area                               0 to 25%   25% to 75%   over 75%
Anti-dumping and Countervailing Duties      99%          1%          0%
Competition Policy                          48%         47%          5%
Environmental Laws                          88%         12%          0%
Export Taxes                                41%         59%          0%
Intellectual Property Rights                76%         23%          1%
Investment                                   6%         64%         30%
Labor Market Regulations                    68%         17%         15%
Movement of Capital                         44%         42%         13%
Public Procurement                          53%         40%          7%
Rules of Origin                              7%         93%          0%
Sanitary and Phytosanitary Measures         87%         13%          0%
Services                                     6%         62%         33%
State-Owned Enterprises                     45%         54%          1%
Subsidies                                   59%         41%          0%
Technical Barriers to Trade                 93%          7%          0%
Trade Facilitation and Customs              21%         78%          0%
Visa and Asylum                             27%         70%          3%

Note: The coverage ratio is the share of essential provisions for a policy area contained in a given agreement, relative to the maximum number of essential provisions in that policy area. Source: Mattoo, Rocha, and Ruta (2020).

3 Determining Which Provisions Matter for Trade

We now outline the methodology we use to identify which PTA provisions have the largest impact on bilateral trade. To preview our approach, we will first specify a typical panel data gravity model for trade flows. Following the latest recommendations from the methodological literature (Yotov, Piermartini, Monteiro, and Larch, 2016; Weidner and Zylkin, 2021), we will use a multiplicative model where expected trade flows are given by an exponential function of our covariates of interest plus three sets of fixed effects. Drawing on this standard framework, we will then consider the estimation challenges that arise when the number of covariates (here, provision variables) is allowed to be very large. As we will discuss, it will be convenient to reformulate the usual estimation problem as a "variable selection" problem, in which we suppose that many of the provisions have zero or approximately zero effect.

Bringing together these elements will require that we extend recent computational advances in high-dimensional fixed effects estimation to incorporate lasso and lasso-type penalties. It will also require that we introduce our own innovations, the iceberg lasso and bootstrap lasso methods, which we motivate as providing a balance between "cross-validation" approaches that tend to select too many variables and more parsimonious "plug-in" methods that may select too few.

3.1 The Gravity Model

Our starting point for estimation is the following multiplicative gravity model:

\mu_{ijt} := E\left(y_{ijt} \mid x_{ijt}, \alpha_{it}, \gamma_{jt}, \eta_{ij}\right) = \exp\left(x_{ijt}'\beta + \alpha_{it} + \gamma_{jt} + \eta_{ij}\right). \qquad (1)

Here, i, j, and t respectively index exporter, importer, and time. Bilateral trade flows from exporter i to importer j at time t are given by y_ijt, x_ijt collects our covariates of interest, and α_it, γ_jt, and η_ij are, respectively, exporter-time, importer-time, and exporter-importer ("pair") fixed effects.
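For concreteness, the specification in (1) with a single PTA dummy can be estimated by PPML with all three sets of fixed effects in a few lines of R. The sketch below is our illustration using the fixest package rather than the paper's own penppml implementation, and the data frame trade and its column names are hypothetical.

# Three-way PPML gravity regression: exporter-time, importer-time, and pair
# fixed effects, with a single PTA dummy as the covariate of interest.
# 'trade' is a hypothetical data frame with columns flows, pta, exporter,
# importer, and year.
library(fixest)
fit <- fepois(flows ~ pta | exporter^year + importer^year + exporter^importer,
              data = trade)
summary(fit, cluster = ~ exporter^importer)  # cluster errors by country pair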
Because of the three sets of fixed effects, the model in (1) is often called the "three-way gravity model". Intuitively, the exporter-time and importer-time fixed effects α_it and γ_jt may be thought of as controlling for changes over time in the "gravitational pull" that the exporter and importer each exert on world trade flows. More formally, these fixed effects can be shown to depend on the market sizes of the two countries as well as on what Anderson and van Wincoop (2003) call "multilateral resistance", a theoretical measure of each country's connectedness to the overall trade network. The inclusion of the pair fixed effect η_ij was suggested by Baier and Bergstrand (2007), who convincingly argue that estimates of the effect of trade agreements and other similar variables would otherwise be biased due to omitted cross-sectional heterogeneity. In terms of a trade model, this omitted heterogeneity is often motivated as coming from unobserved trade costs.

An important point about (1) is that it motivates estimating the model in its original nonlinear form using PPML; see Gourieroux, Monfort, and Trognon (1984). In principle, one could instead use a linear model after taking logs, but Santos Silva and Tenreyro (2006) have pointed out that this estimator is generally inconsistent and recommended that (1) should instead be estimated by PPML. Though the resulting model is nonlinear with three sets of high-dimensional fixed effects, estimation is feasible thanks to recent computational innovations by Correia, Guimarães, and Zylkin (2020) and others.5 Weidner and Zylkin (2021) have recently established the consistency and asymptotic distribution of the three-way PPML estimator, and Yotov, Piermartini, Monteiro, and Larch (2016) recommend it as the workhorse method for estimating the effects of trade policies. It is frequently applied to the context of trade agreements in particular.

Having established these details, our focus is on the set of covariates, x_ijt. In most applications in the trade agreements literature, x_ijt is either a single variable (i.e., a dummy for the presence of a trade agreement) or a minor variant thereof, such as interactions with either the depth of the agreement or the bilateral characteristics of the two countries (Baier, Bergstrand, and Feng, 2014; Baier, Bergstrand, and Clance, 2018). However, a major estimation challenge that arises in our setting is that we must treat the number of provisions as being very large. As we will show, in our data set this high dimensionality, combined with the relatively small number of PTAs, creates strong multicollinearity that results in implausibly large and uninterpretable estimates when a standard estimator is used. Furthermore, the estimated model has poor predictive performance due to overfitting. We therefore discuss how the standard gravity estimation approach must be modified to deal with this additional source of high dimensionality.

3.2 Variable Selection and Gravity

The starting point for our methodological innovations is to suppose that only a handful of our provision variables have a non-negligible effect on trade flows. To be more precise, we have p = 305 essential provision variables, coded as dummies, of which a subset s < p are assumed to have non-zero effects, where s is typically small relative to the sample size.6 We do not know s beforehand, nor do we know the identities of any of the s provisions that substantively affect trade.
Our goal then is to use statistical methods along with the model described in (1) to identify these provisions. Because of the high dimensionality of x_ijt, experimenting with different subsets of provisions to see which has the best performance is unlikely to be fruitful. Instead, we adopt a penalized regression (or "regularization") approach that involves appending a penalty term to the Poisson pseudo-likelihood one would use to estimate the unpenalized gravity model. The idea is that the penalty term "shrinks" all estimated coefficients towards zero and forces some of them to be exactly equal to zero. The higher the penalty, the fewer the variables that are found to have non-zero coefficients and are therefore "selected". By design, the variables that are selected should be those that exert the strongest influence on the fit of the model; coefficients for variables that are not as relevant should end up being shrunk to zero completely.

5 Correia, Guimarães, and Zylkin (2020) and Stammann (2018) have each proposed algorithms for estimating nonlinear fixed effects models based on iteratively re-weighted least squares (IRLS). Heuristically, this type of algorithm exploits the linearity of the weighted least squares step in the IRLS algorithm to wipe out the fixed effects in each iteration, then uses an application of the Frisch-Waugh-Lovell theorem to update the weights, repeating until convergence. For a different approach, see Larch, Wanner, Yotov, and Zylkin (2019).
6 Note that of the 305 provisions in our data, 8 are always equal to zero. Therefore, the effective number of provisions we consider is 297.

Because of its computational feasibility, the most frequently used approach to this type of variable selection problem is the lasso, introduced by Tibshirani (1996). In our setting, the penalized objective function that defines the three-way PPML-lasso is

PL(\beta, \alpha, \gamma, \eta) = \underbrace{-\frac{1}{n}\sum_{i,j,t}\left(y_{ijt}\ln\mu_{ijt} - \mu_{ijt}\right)}_{\text{PPML pseudo-likelihood}} + \underbrace{\frac{\lambda}{n}\sum_{k=1}^{p}\hat{\phi}_k\left|\beta_k\right|}_{\text{lasso penalty}}, \qquad (2)

where n is the number of observations,7 μ_ijt = exp(α_it + γ_jt + η_ij + x′_ijt β) is the conditional mean as in (1) above, and λ ≥ 0 and φ̂_k ≥ 0 are tuning parameters that determine the penalty. As indicated in (2), the first term in this expression is the standard PPML objective function one would minimize in order to estimate the three-way gravity model. Thus, the PPML-lasso nests PPML as a special case when λ is set to zero. The second term in (2) is a modified lasso penalty that allows for regressor-specific penalty weights, as opposed to having λ as the only tuning parameter as in the standard lasso.

Intuitively, larger penalties increasingly shrink the estimated β-coefficients towards zero. The coefficients of any variables that do not sufficiently increase the likelihood are set to exactly zero, thereby giving us a way of identifying which variables to include in the final model. As an illustration, if we let λ → ∞, the only way to minimize PL is to set all β̂_k equal to zero, meaning that no variables are selected. As in Belloni, Chernozhukov, Hansen, and Kozbur (2016), we will use the regressor-specific φ̂_k penalty terms to iteratively refine the model while also reflecting any heteroskedasticity and within-cluster correlation featured in the data.

Importantly, the fixed effects parameters α, γ, and η are not penalized. This is mainly because there is no reason to believe that most of the fixed effects parameters are actually zero.
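Before turning to how the fixed effects are handled computationally, the following minimal R sketch (our illustration, not the penppml implementation) evaluates the objective in (2), taking the fixed effects and the penalty parameters as given.

# Penalized PPML objective from equation (2). 'fe' holds the sum
# alpha_it + gamma_jt + eta_ij for each observation; 'phi' holds the
# regressor-specific penalty loadings phi_hat_k; 'lambda' is the overall
# penalty level.
ppml_lasso_obj <- function(beta, y, X, fe, lambda, phi) {
  mu <- exp(fe + drop(X %*% beta))           # conditional mean mu_ijt
  n  <- length(y)
  -sum(y * log(mu) - mu) / n +               # (negative) PPML pseudo-likelihood
    (lambda / n) * sum(phi * abs(beta))      # weighted lasso penalty
}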
Conveniently, the unpenalized fixed effects also pose no special computational issues, because they do not depend on the penalty: for any given β, the fixed effects can be obtained by solving their usual PPML first-order conditions from the standard unpenalized regression approach. In practice, this means that the fixed effects can be dealt with in exactly the same manner as in Correia, Guimarães, and Zylkin (2020). More details on the computational methods are provided in the Appendix; in essence, we use the original HDFE-IRLS algorithm of Correia, Guimarães, and Zylkin (2020) to take care of the fixed effects, but replace the weighted linear regression step of that algorithm with a weighted lasso regression.8

7 Naturally, the number of observations will depend on the number of countries for which we have data and on the number of years we observe them. For simplicity, we do not make that relation explicit.
8 For the lasso regression step, we use the coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010).

3.3 Implementing the Lasso

The next question, of course, is how to determine the tuning parameters λ and φ̂_k. As a starting point, the two existing approaches we first examine are the "plug-in" lasso of Belloni, Chernozhukov, Hansen, and Kozbur (2016) and the traditional cross-validation approach, both of which we have modified to fit the demands of the three-way PPML setting. As we will discuss, each of these methods has its strengths and weaknesses. We will therefore then turn to describing two extensions of the plug-in lasso, which we call the "iceberg lasso" and the "bootstrap lasso", that are intended to address one of the plug-in lasso's key shortcomings in this context.

Plug-in Lasso

The plug-in lasso is so named because it specifies appropriate functional forms for the penalty parameters based on statistical theory and then uses plug-in estimates for these parameters. It is therefore a "theory-driven" approach to the variable selection problem, whereas cross-validation, discussed next, is a more traditional machine learning method that relies on out-of-sample prediction. The plug-in lasso was first proposed by Belloni, Chen, Chernozhukov, and Hansen (2012), though the specific implementation we build on is the "cluster lasso" method of Belloni, Chernozhukov, Hansen, and Kozbur (2016), which allows for correlated errors within clusters.

Without delving too much into technical details, which we defer to the Appendix, variable selection using the plug-in lasso can be thought of as involving the following three ingredients:

i. the absolute value of the score for each β_k when evaluated at β = 0;
ii. the standard error of the score for each β_k;
iii. values for λ and φ̂_k set high enough that the absolute value of the score for β_k must be large relative to its standard error in order for regressor x_ijt,k to be selected.

Intuitively, the value of the score reflects the impact that a small change in β_k has on the fit of the model. When evaluated at β = 0, it tells us how much the fit of the model improves when we make β_k non-zero. The standard logic of the lasso is that this improvement in fit must be large relative to the penalty in order for β̂_k to be non-zero. One of the main innovations of the plug-in lasso is to allow the regressor-specific penalty φ̂_k to adjust to reflect the standard error of the score. This way, we counteract the possibility that regressors could be mistakenly selected due to estimation noise rather than because of their true impact on the model.
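As a rough illustration of these ingredients, the sketch below computes a penalty level and cluster-robust penalty loadings in the spirit of Belloni, Chernozhukov, Hansen, and Kozbur (2016). The constants c and gamma and the exact form of the loadings are our assumptions for illustration; the precise parameterization used in the paper is given in the Appendix.

# Plug-in penalty sketch: lambda from a Gaussian quantile rule, and
# phi_hat_k as the clustered standard deviation of the score of beta_k,
# evaluated at beta = 0 (so mu comes from a fixed-effects-only fit).
plugin_penalty <- function(y, mu, X, cluster, c = 1.1, gamma = 0.1) {
  n <- nrow(X); p <- ncol(X)
  lambda <- 2 * c * sqrt(n) * qnorm(1 - gamma / (2 * p))
  resid  <- y - mu                         # score of beta_k is sum of x_k * resid
  phi <- apply(X, 2, function(xk) {
    g <- tapply(xk * resid, cluster, sum)  # aggregate scores within clusters
    sqrt(sum(g^2) / n)                     # clustered std. error of the score
  })
  list(lambda = lambda, phi = phi)
}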
These regressor-specific penalties play an important role in the presence of heteroskedasticity, which of course is an important feature of trade data. Because the provision sets in x_ijt vary by agreement, and because we expect errors to be serially correlated over time, we construct these weights using the cluster lasso approach of Belloni, Chernozhukov, Hansen, and Kozbur (2016). Specifically, we cluster all observations belonging to pairs that form agreements by the agreement they eventually belong to, including before the agreement begins. Other observations are clustered by pair.

A principal advantage of the plug-in lasso is that it is very parsimonious in terms of the number of variables it selects. As shown by Drukker and Liu (2019), the plug-in method offers superior performance versus cross-validation approaches in finite samples, in large part because these other methods tend to select too many variables. Furthermore, the "post-lasso" estimates obtained using unpenalized PPML on the covariates selected by the plug-in lasso have a "near-oracle" property that ensures they will capture the correct model if the sample is sufficiently large relative to the number of potential regressors (see Belloni, Chen, Chernozhukov, and Hansen, 2012).9

However, the plug-in lasso's parsimony can also be a weakness, in that it may select too few variables. In general, it attempts to select a small number of variables that are most useful for predicting the outcome. However, in data settings where there is a substantial number of highly correlated regressors, as is the case with our provisions data, it is possible that the plug-in lasso will wrongly select a regressor that does not affect the outcome but is strongly correlated with another regressor that does, since either (or perhaps both) can have similar predictive value for fitting the model. We discuss this issue in more detail when we introduce our extensions of the plug-in lasso.

Cross-Validation

As an alternative to the plug-in method, we also consider a more traditional approach based on cross-validation. Under cross-validation, one repeatedly holds out some of the data and chooses λ in order to maximize the predictive fit of the model when evaluated on the held-out data. The regressor-specific φ̂_k do not play a role and are set equal to 1.

Because of the size of the data and the nature of our model, implementing this approach presents some interesting challenges. A standard implementation would be a "k-fold" approach that randomly partitions the sample into k folds and then uses k − 1 subsets to estimate the parameters and the excluded subset to evaluate the predictive ability of the model. To adapt this idea to our setting, we validate our model by repeatedly dropping the observations corresponding to randomly selected groups of agreements in our data, and then use their provisions to predict trade for the dropped observations, similar to the approach taken by Baier, Yotov, and Zylkin (2019). In this way, all fixed effects are always present in each practice sample, so that we can always form the necessary predictions for the omitted trade flows associated with the PTAs that have been dropped.10

9 The "oracle" property of estimators such as the adaptive lasso of Zou (2006) refers to their ability to correctly recover which parameters are zero and non-zero in a setting where the number of potential regressors is fixed and the number of observations is large. The "near-oracle" property of the plug-in lasso is similar, but its rate of convergence is slower and depends on the number of potential regressors, because in the setting considered by Belloni, Chen, Chernozhukov, and Hansen (2012) the number of potential regressors is allowed to grow with the sample size.
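A sketch of how such an agreement-based partition might be constructed is given below; the data frame columns are hypothetical, with agreement equal to NA for pairs that never sign a PTA.

# Assign each agreement to one of k folds; observations from pairs covered by
# an agreement are held out together with that agreement, while observations
# never covered by any PTA (fold 0) always stay in the estimation sample.
make_agreement_folds <- function(df, k = 25) {
  agreements <- unique(na.omit(df$agreement))
  fold_of <- setNames(sample(rep_len(seq_len(k), length(agreements))),
                      agreements)
  ifelse(is.na(df$agreement), 0L, fold_of[as.character(df$agreement)])
}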
10 It may, however, happen that some provisions are not included in the agreements used in the estimation sample. This is less likely to happen if k is large, and we therefore use k = 25.

The main advantage of cross-validation is that it is explicitly designed to optimize predictive performance. Thus, it may offer a conceptual advantage where forecasting tasks are concerned. However, a known weakness of the standard lasso with cross-validation is that it often errs on the side of selecting too many variables that are not relevant.11 Furthermore, it does not take heteroskedasticity into account when performing the selection, and it generally has neither an oracle nor a near-oracle property in large samples. For these reasons, cross-validation is not our preferred method for answering the question of which provisions matter for trade; we consider it mainly to illustrate the basic mechanics of the lasso and as a check on our plug-in results.12

11 In linear models, tuning λ using cross-validation is analogous to selection based on the Akaike information criterion, which ensures that the probability of selecting too few variables goes to zero but does not eliminate the possibility of selecting too many. Relatedly, Drukker and Liu (2019) find that selection using cross-validation also leads to the inclusion of too many regressors in Poisson regressions. In our own application, we too find that the cross-validation method selects many more provisions than the plug-in method.
12 Alternatively, we could consider the adaptive lasso (Zou, 2006), which adds a second tuning parameter and is known to deliver consistent variable selection. However, in our application we have found that the adaptive lasso is similar to the standard cross-validation lasso in that it is much too lenient and keeps too many regressors that are not relevant. The simulations reported in the next section suggest that this is likely to be the case in relatively small samples.

3.3.1 Extensions of the plug-in lasso

One important feature of the lasso is that it selects variables that are good predictors of the outcome, but these are not necessarily variables that have a causal impact on the outcome. Indeed, Zhao and Yu (2006) show that only when the so-called "irrepresentability condition" is valid can we expect the variables selected by the lasso to have a causal interpretation; the condition essentially imposes limits on the degree of collinearity between the variables with a causal effect on the outcome and the other candidate regressors (see also Wainwright, 2009). As we have noted, in the case of our data set there is a very high degree of collinearity between some of the variables, and therefore we cannot expect the irrepresentability condition to hold. Furthermore, for the plug-in lasso especially, which tends to select a very parsimonious model, we should worry that the selected provisions may mask the effects of a potentially more complex set of other provisions that are often included in the same agreements as the provisions that are selected.

To address this important complication, we now introduce two methods that add variables to the set of regressors selected by the plug-in lasso, and in the next section we evaluate their performance in a simulation experiment; we call these methods the "iceberg lasso" and the "bootstrap lasso".

The Iceberg Lasso

Simply put, the iceberg lasso involves performing a subsequent set of plug-in lasso regressions in which each of the provisions selected by the plug-in lasso estimator is regressed on all of the provisions that were excluded; the set of variables selected by the iceberg lasso is the union of the set selected in the first step with the sets selected in each of the regressions of the second step. The purpose of the second-step regressions is to identify bundles of provisions that are highly correlated with the ones selected in the first step, and that therefore may be representable by them in the sense of Zhao and Yu (2006). That is, each of the variables selected by the PPML-lasso with the plug-in tuning parameter may be just "the tip of the iceberg" of a bundle of variables that have a causal impact on trade, and the lasso regressions in the second step may help to identify these bundles. As such, the iceberg lasso may be interpreted as a data-driven alternative to the method used by Dhingra, Freeman, and Mavroeidi (2018) to construct provision bundles.13
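In pseudo-R terms, the two steps look as follows, where plugin_lasso() is a hypothetical helper that runs a lasso with the plug-in penalty and returns the names of the selected variables.

# Iceberg lasso second step: regress each provision selected in the first
# step on all excluded provisions, and take the union of everything selected.
iceberg_lasso <- function(first_step, provisions) {
  excluded <- setdiff(colnames(provisions), first_step)
  second_step <- unlist(lapply(first_step, function(s)
    plugin_lasso(y = provisions[, s], X = provisions[, excluded])))
  union(first_step, second_step)
}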
The Bootstrap Lasso

It is well documented that in small to moderate samples the set of variables selected by the lasso can be somewhat unstable, in the sense that it is very sensitive to perturbations of the sample (see, e.g., Mullainathan and Spiess, 2017). We use this feature of the lasso to try to alleviate the tendency of the plug-in lasso to select too few variables. In what we call the bootstrap lasso, we apply the plug-in lasso to an additional set of B − 1 samples obtained by bootstrap, and define the set of variables selected by this method as the variables that are selected sufficiently often across the B samples considered. Doing so has several conceptual benefits.

First, because this method is likely to uncover variables that substitute for the originally selected variables in approximating the patterns found in different versions of the sample, the augmented set of variables it selects is likely to contain more of the relevant variables than the initial set selected by the plug-in lasso. Second, the frequency with which each variable is selected provides useful information about the stability of its selection and thus about the degree of confidence we should have in its importance to the model. Third, averaging estimates and predictions across bootstrap samples may reduce overfitting due to sampling error in the original data; in the machine learning literature, this approach is known as "bootstrap aggregating", or "bagging" for short (see, e.g., Hastie, Tibshirani, and Friedman, 2009). Naturally, the performance of the bootstrap lasso will depend on B and on the frequency cutoff used to select the variables, with lower cutoffs increasing the proportion of relevant variables selected but also the number of irrelevant variables included in the model. In our application, we use B = 250 and restrict our attention to variables that are selected with a frequency exceeding 5% or 1%.14

13 The iceberg lasso complements the approach adopted by Regmi and Baier (2020), who use machine learning tools to construct groups of provisions and then use these clusters in a gravity equation. The main difference between the two approaches is that Regmi and Baier (2020) use what is called an unsupervised machine learning method, which uses only information on the provisions to form the clusters. In contrast, the iceberg lasso selects the provisions using a supervised method that considers the impact of the provisions on trade, and then adds another step that can be interpreted as unsupervised learning.
14 In the simulations we use B = 20 and only the 5% cutoff.
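A stylized sketch of this procedure, reusing the hypothetical plugin_lasso() helper from above, is given below; the paper applies the plug-in lasso to the original sample plus B − 1 bootstrap samples, whereas for simplicity this sketch resamples B times.

# Bootstrap lasso: rerun the plug-in lasso on bootstrap resamples and keep
# the variables whose selection frequency exceeds the chosen cutoff.
bootstrap_lasso <- function(y, X, B = 250, cutoff = 0.05) {
  picks <- replicate(B, {
    idx <- sample(length(y), replace = TRUE)
    colnames(X) %in% plugin_lasso(y[idx], X[idx, , drop = FALSE])
  })
  freq <- rowMeans(picks)                 # selection frequency per variable
  colnames(X)[freq > cutoff]
}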
3.4 Discussion and caveats

Having described the ideas behind our methods, several further caveats are in order. First, by construction, not all of the provisions selected by the iceberg lasso and the bootstrap lasso can be said to have causal effects. Whether or not these methods are more informative than other methods that are already known to over-select regressors is an empirical matter, and the answer will depend on the application.

Second, in general, we need to be very humble about the potential causal interpretation of our results. We view our approach as a statistical method to select a group of variables that is likely to include the ones most relevant to the fit of the three-way gravity model. This of course requires taking the model to be an appropriate representation of the determinants of trade. The three-way gravity model has the considerable advantage that it isolates a particular variation in the data that is empirically relevant for the study of trade agreements, namely the within-pair variation that is time-varying and independent of country-specific changes in trade. However, the initial PPML-lasso with the tuning parameter selected by the plug-in method is likely to omit relevant variables, and that obviously complicates the interpretation of those estimates. The additional steps in the iceberg lasso and in the bootstrap lasso are explicitly designed to address this latter issue and should at least partially alleviate this problem, at the cost of possibly selecting some variables that effectively have little or no impact on trade.

4 Simulation Evidence

In this section we report the results of a simulation exercise investigating the finite-sample properties of the variable-selection methods discussed above. The simulation design covers a range of scenarios that, to different degrees, combine two important features of our application: a relatively small sample and a high degree of collinearity between several potential explanatory variables. The results therefore provide information on the performance of the different methods in conditions similar to those we face, and illustrate how these performances change as we progressively move towards less challenging environments.

In all experiments, the n observations of the dependent variable are generated as

y = \exp\left(1 + \beta x_1 + z + \sigma\varepsilon\right),

where β and σ are parameters and x_1, z, and ε are independent random draws from the standard normal distribution. In the estimation, performed by PPML-lasso, ε is not included as a regressor (it is the error term), z is always included as a regressor whose coefficient is not penalized, and we use different methods to select other regressors from a set of p potential explanatory variables x_1, ..., x_p. Therefore, in this design, x_1 plays the role of the presumably small number of provisions that effectively affect trade, x_2, ..., x_p represent the provisions that have no impact on trade, and z mimics the role of the fixed effects, which explain a significant share of the variation of trade and are included without penalty.
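For transparency, one replication of this design can be generated as follows. This is a sketch under our notational assumptions, using the parameter values β = 0.2 and σ = 0.3 set in the next paragraph; the equicorrelated block is drawn via a common factor.

# Simulated data: y = exp(1 + beta*x1 + z + sigma*eps), with the first q of
# the p candidate regressors equicorrelated with coefficient rho.
simulate_design <- function(n, rho, q, beta = 0.2, sigma = 0.3) {
  p <- 5 * ceiling(sqrt(n))                 # p = 80, 160, or 320
  f <- rnorm(n)                             # common factor inducing corr. rho
  X <- cbind(sqrt(rho) * f + sqrt(1 - rho) * matrix(rnorm(n * q), n, q),
             matrix(rnorm(n * (p - q)), n, p - q))
  z <- rnorm(n)
  y <- exp(1 + beta * X[, 1] + z + sigma * rnorm(n))
  list(y = y, X = X, z = z)
}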
The parameters β and σ determine the relevance of x_1 and the signal-to-noise ratio. Because gravity equations typically have an excellent fit, we set β = 0.2 and σ = 0.3, which ensures that the model has a reasonably high R² and that the effect of x_1 is neither too small (which would make its role very difficult to detect) nor too large (in which case all approaches would perform excellently).

The p potential explanatory variables are obtained as random draws from the normal distribution; the first q variables x_1, ..., x_q are equicorrelated with correlation coefficient ρ, and the remaining ones are independent of all other variables. All regressors have zero mean and variance 1, and we perform simulations with q ∈ {5, 10, 20}, ρ ∈ {0.75, 0.90, 0.99}, and n ∈ {250, 1000, 4000}, and set p to 5⌈√n⌉, where ⌈·⌉ denotes the ceiling function; that is, depending on the value of n, p is either 80, 160, or 320.15

In these simulations we consider each of the four methods presented above: the cross-validation lasso, the plug-in lasso, the iceberg lasso, and the bootstrap lasso. The bootstrap lasso is performed with B = 20, and we include in the set of selected variables any variable that is selected in at least one sample (that is, we use a cutoff of 5%). Additionally, we also consider the adaptive lasso of Zou (2006), with the penalty parameter chosen by cross-validation in both steps.16 Unlike the other methods we consider, the adaptive lasso has the so-called oracle property, implying that asymptotically it will choose the right set of regressors, and it therefore provides an interesting benchmark against which the performance of the other methods can be judged.17 We repeat the simulations 1,000 times and study both the ability of each method to correctly select x_1 as a regressor and their predictive performance.

4.1 Variable selection

For each of the cases considered, Table 3 presents the percentage of times the regressor x_1 is selected and, in parentheses, the average number of regressors selected by each method. The results in Table 3 reveal that the various methods can have very different performances.

Starting with the ability of each method to correctly select x_1 as a regressor, we find that for n = 250 the lasso with penalty chosen by the plug-in method (PI) is the method with the worst results, and its performance deteriorates quickly as ρ and q increase. The adaptive lasso (AL) leads to better results, but its performance is also very poor when ρ = 0.99. The lasso with the penalty chosen by cross-validation (CV) provides a substantial improvement, but it also struggles for larger values of ρ. The bootstrap lasso (BL) is at least as successful as CV, but clearly dominates it for the higher values of ρ.

15 A noticeable difference between the simulation design we use and our application is that in the simulations the potential explanatory variables have a continuous distribution, whereas in the application they are dummies. We performed some experiments where the potential explanatory variables are dummies generated using the method described by Lunn and Davies (1998) and found broadly comparable results. However, we prefer to report the results obtained using the normally distributed variables because, when dummies are used, we frequently encounter numerical issues and cases of perfect collinearity that make it more difficult to keep track of the variables selected.
16 We also performed simulations using Zou and Hastie's (2005) elastic net. However, those results are not particularly interesting and are not reported, to conserve space and to simplify the exposition.
17 Note, however, that the plug-in lasso has a related near-oracle property.
Finally, the iceberg lasso (IL) is marginally outperformed by CV and BL when ρ = 0.75, outperforms CV but is again marginally outperformed by BL when ρ = 0.9, but has a substantial advantage over all other methods for ρ = 0.99.18

Table 3: Percentage of times x_1 is selected & average number of variables selected (in parentheses)

                 ρ = 0.75                  ρ = 0.90                  ρ = 0.99
   n       q=5      q=10     q=20     q=5      q=10     q=20     q=5      q=10     q=20
 250 CV   100.0     99.7     99.3     96.6     91.8     85.5     55.2     37.7     23.4
          (8.65)   (8.55)   (8.74)   (8.87)   (8.66)   (8.64)   (8.52)   (8.22)   (7.93)
     AL    99.7     99.4     97.9     93.9     87.4     80.4     45.3     29.4     17.7
          (7.22)   (7.21)   (7.05)   (7.34)   (7.21)   (7.05)   (6.99)   (6.72)   (6.26)
     PI    91.6     89.9     88.1     80.6     72.1     63.7     41.1     26.8     16.9
          (1.26)   (1.52)   (1.89)   (1.45)   (1.73)   (2.06)   (1.23)   (1.33)   (1.41)
     BL   100.0    100.0     99.8     96.6     98.4     96.7     90.4     79.2     64.2
          (11.11)  (12.81)  (15.27)  (11.31)  (13.25)  (15.66)  (11.27)  (12.77)  (14.03)
     IL    95.7     95.9     95.2     95.9     95.8     93.0     95.3     93.4     80.1
          (4.80)   (9.14)   (15.97)  (4.81)   (9.43)   (17.00)  (4.78)   (9.32)   (15.65)
1000 CV   100.0    100.0    100.0    100.0    100.0     99.9     81.0     69.8     56.4
          (9.43)   (9.59)   (10.05)  (9.76)   (10.10)  (10.69)  (9.92)   (10.11)  (10.51)
     AL   100.0    100.0    100.0    100.0     99.7     99.7     68.3     54.8     40.8
          (3.93)   (4.19)   (4.49)   (4.71)   (5.22)   (5.85)   (5.37)   (5.97)   (6.22)
     PI    99.8     99.8     99.7     99.2     98.4     97.5     71.4     55.9     41.4
          (1.31)   (1.54)   (1.88)   (1.63)   (2.02)   (2.57)   (1.75)   (2.02)   (2.34)
     BL   100.0    100.0    100.0    100.0    100.0    100.0     98.0     93.7     87.1
          (8.88)   (10.89)  (13.91)  (9.26)   (11.67)  (15.23)  (9.36)   (11.85)  (14.81)
     IL   100.0    100.0    100.0    100.0    100.0    100.0    100.0    100.0     98.8
          (5.01)   (10.00)  (19.22)  (5.00)   (10.01)  (19.69)  (5.01)   (10.01)  (19.72)
4000 CV   100.0    100.0    100.0    100.0    100.0    100.0     99.0     97.8     94.9
          (10.46)  (10.85)  (11.24)  (10.78)  (11.28)  (11.88)  (11.18)  (12.06)  (12.63)
     AL   100.0    100.0    100.0    100.0    100.0    100.0     91.9     86.0     79.1
          (1.00)   (1.00)   (1.00)   (1.03)   (1.00)   (1.03)   (1.18)   (1.30)   (1.70)
     PI   100.0    100.0    100.0    100.0    100.0    100.0     98.0     93.9     88.1
          (1.23)   (1.43)   (1.68)   (1.53)   (1.96)   (2.42)   (2.00)   (2.60)   (3.18)
     BL   100.0    100.0    100.0    100.0    100.0    100.0    100.0     99.9     99.8
          (7.86)   (9.91)   (13.03)  (8.44)   (11.04)  (14.94)  (8.93)   (11.94)  (16.27)
     IL   100.0    100.0    100.0    100.0    100.0    100.0    100.0    100.0    100.0
          (5.00)   (10.00)  (19.99)  (5.00)   (10.00)  (20.00)  (5.01)   (10.00)  (20.00)

The performance of all methods improves for the larger sample sizes, but the IL maintains its advantage in the more challenging cases with ρ = 0.99, with BL having a very similar performance. Overall, the two extensions of the PI we consider, the BL and the IL, greatly improve the ability to identify the relevant regressor, and there is generally little to choose between them, except in the extreme cases with n = 250 and ρ = 0.99, where the IL has a very clear advantage.

The results for the average number of variables selected are also interesting. In all cases considered, CV tends to lead to a high average number of selected regressors. At the other extreme, PI is generally the most parsimonious, except when n is large, in which case the oracle property of AL starts to become salient.

18 Part of the reason why in some cases IL does not perform well is that sometimes PI selects no regressors at all, and in those cases IL cannot improve on it, but BL can.
Turning now to the extensions of the PI method, we observe that the average number of regressors selected by BL is always reasonably high and that, for the values of ρ we consider, the average number of variables selected by the IL increases with q, suggesting that the method performs as intended. Naturally, this behavior will be less pronounced for lower values of ρ, and we have confirmed this in unreported simulations.

In summary, for very large samples, the adaptive lasso with penalty parameter selected by cross-validation is the preferred method; this is justified both by our simulation results and by its oracle property. However, in small to medium samples, and especially with high correlation between potential explanatory variables, the adaptive lasso is outperformed by other methods. In these cases, the choice of method depends on whether we favor selecting the relevant regressors or having a parsimonious model. If parsimony is paramount, the lasso with penalty parameter selected by the plug-in method is difficult to beat. However, if selecting the relevant regressors is important, the bootstrap lasso and the iceberg lasso are safe bets, with the iceberg lasso being clearly preferable only for smaller samples where there is extremely high collinearity between the relevant variable and other potential controls.

4.2 Prediction

We now consider the predictive ability of the models obtained with the different variable-selection methods. To that end, for each replication of the simulations we generated 100 additional observations and used the different models to predict these observations. In this context, we can consider both lasso predictions, using the penalized lasso estimates, and post-lasso predictions, using unpenalized estimates.19

We computed penalized and unpenalized predictions for all approaches and found that for CV and AL penalized predictions tend to dominate unpenalized ones, while the reverse holds for PI, IL, and BL. Table 4 summarizes these results and reports the mean square error (MSE) of the prediction errors for each of the models considered. To conserve space, we only report the results obtained with the penalized predictors for CV and AL, and with the unpenalized predictors for PI, IL, and BL. For comparison, the table also presents the MSE of the predictions obtained with the unpenalized PPML estimates of the model that includes all p regressors and with the PPML estimates of the "oracle" model that includes just x_1.

The results in Table 4 show that the predictions obtained with the unpenalized estimator of the full model are clearly outperformed by all lasso-based predictions, with the difference being particularly stark in the smaller sample. The results also suggest that the predictive performance of the different methods depends little on the values of ρ and q, but generally improves with n. The exception to this is the IL, for which we see a small but systematic drop in performance as q increases.

19 The unpenalized predictions for the IL are computed from the PPML estimates of a model including the full set of variables selected by the IL; for BL they are computed as the average of the predictions corresponding to the post-lasso PPML estimates in each sample. The penalized predictor for the IL is obtained from a plug-in lasso based on the full set of variables selected by the IL; for BL, the penalized predictor is the average of the predictions obtained with the penalized estimates in each of the bootstrap samples.
Perhaps the most striking feature of the results in Table 4 is, however, the excellent performance of PI, which can be comparable to that of the oracle model even in cases where PI often fails to identify x1 as a predictor. It is also noteworthy that the performance of the BL is very good and better than that of the IL, especially for the larger numbers of correlated regressors. For the larger sample, there is little to choose between the different lasso methods, but AL has the best performance.

Table 4: MSE for prediction errors

                    ρ = 0.75              ρ = 0.90              ρ = 0.99
                  5      10     20      5      10     20      5      10     20
 n = 250
 CV             6.85   6.83   6.86   6.87   6.88   6.88   6.83   6.83   6.80
 AL             7.27   7.23   7.22   7.29   7.26   7.24   7.17   7.18   7.08
 PI             6.57   6.53   6.66   6.59   6.63   6.71   6.53   6.52   6.52
 BL             6.63   6.60   6.66   6.64   6.62   6.66   6.57   6.53   6.53
 IL             6.71   6.83   7.21   6.71   6.84   7.25   6.72   6.85   7.23
 All regressors 10.98  10.98  10.98  10.98  10.98  10.98  10.98  10.98  10.98
 Oracle         6.39   6.39   6.39   6.39   6.39   6.39   6.39   6.39   6.39
 n = 1000
 CV             6.34   6.35   6.35   6.34   6.34   6.35   6.33   6.32   6.34
 AL             6.34   6.31   6.30   6.35   6.39   6.40   6.39   6.41   6.47
 PI             6.19   6.19   6.22   6.18   6.19   6.22   6.16   6.17   6.20
 BL             6.19   6.18   6.21   6.18   6.18   6.21   6.16   6.16   6.18
 IL             6.22   6.31   6.48   6.22   6.31   6.47   6.22   6.31   6.48
 All regressors 8.44   8.44   8.44   8.44   8.44   8.44   8.44   8.44   8.44
 Oracle         6.19   6.19   6.19   6.19   6.19   6.19   6.19   6.19   6.19
 n = 4000
 CV             6.37   6.37   6.37   6.36   6.37   6.38   6.37   6.38   6.38
 AL             6.34   6.34   6.34   6.33   6.33   6.34   6.34   6.34   6.34
 PI             6.34   6.35   6.36   6.34   6.35   6.35   6.33   6.34   6.35
 BL             6.35   6.36   6.37   6.36   6.37   6.37   6.36   6.37   6.38
 IL             6.34   6.35   6.43   6.34   6.35   6.43   6.34   6.35   6.43
 All regressors 7.39   7.39   7.39   7.39   7.39   7.39   7.39   7.39   7.39
 Oracle         6.34   6.34   6.34   6.34   6.34   6.34   6.34   6.34   6.34

 Note: The table reports the mean square error of the prediction error obtained using penalized predictors for CV and AL, and unpenalized predictors for PI, IL, and BL. For comparison, the table also presents the mean square error of the predictions obtained with the model with all regressors and with the "oracle" model that just includes the relevant regressor.

4.3 Summary of the findings

The simulation results presented above, which confirm and extend the findings of Drukker and Liu (2019), have important implications for our work. Given that in our application we only have data on 282 trade agreements,20 we cannot expect any of the methods considered to be able to precisely identify the set of provisions that matter for trade. The task of identifying the correct set of explanatory variables is particularly challenging in our application because many of the provisions are very strongly correlated with others, and there are even cases of perfect collinearity.

In this challenging context, the iceberg lasso and bootstrap lasso emerge as providing a good compromise between parsimony and the ability to identify the relevant variables. The iceberg lasso has the practical advantage of being easier to implement and of not requiring the choice of additional parameters, such as the number of bootstrap samples. Consequently, the iceberg lasso is our preferred approach to select the relevant variables, but the bootstrap lasso is a credible alternative that can be used, at least, as a robustness check. Additionally, the bootstrap lasso is the only approach we have considered that can provide information about model uncertainty, but exploring that possibility is beyond the scope of this paper.

20 Note that the information on the effect of the different provisions is limited by the relatively small number of PTAs that are observed. Therefore, despite having a large number of observations, we effectively only have a small sample to identify the effect of the different provisions.
If the objective of the researcher is to accurately predict the trade impact of a given PTA, the preferred approach is to compute the predictions using the post-lasso estimates obtained with the plug-in penalty. Indeed, this approach performs extremely well in all cases, and is only marginally outperformed by the adaptive lasso in the larger sample we considered.21 However, the bootstrap lasso is also a credible alternative in this context and can serve as a useful robustness check.

5 Empirical Results

In this section, we present the lasso results obtained using the methods described and studied in the previous sections. We first present results for the plug-in method before briefly discussing the results obtained using cross-validation. We then turn to the iceberg and bootstrap lasso results, which each build in their own way on the selection done by the plug-in lasso. We also include a brief discussion of using these methods for prediction.

21 One may wonder why the PPML-lasso with the tuning parameter chosen by the plug-in method predicts so well, even if it often fails to select the right regressor. The answer, of course, is that when the purpose is simply to predict the outcome, the results change little if the regressor with a causal impact is replaced by another that is highly correlated with it.

5.1 Plug-in Lasso Results

Table 5 presents results for the plug-in lasso and post-lasso regressions discussed before.22 In column (1), we start by presenting the results of a traditional PPML gravity estimation with a dummy for the presence of a PTA between the trading partners. This shows that we can replicate the usual finding that PTAs lead to a significant increase in trade flows. Specifically, we find that the PTAs in our data increase trade by 14% (exp(0.131) − 1 = 0.14). Column (2) then shows the results of the plug-in PPML-lasso regression, showing only the coefficients that are found to be non-zero. Using this approach, the lasso selects 8 provisions related to anti-dumping, competition policy, technical barriers to trade (TBT), and trade facilitation.

Broadly speaking, these variables can all be rationalized as having intuitive effects on trade. The selected anti-dumping and competition policy provisions create more certainty as to how disciplinary investigations and proceedings will be carried out in these policy areas.23 This increased certainty may increase entry by foreign exporting firms. The inclusion of provisions related to technical barriers to trade and trade facilitation is likewise intuitive, but the selection of TF45, which facilitates obtaining certificates of origin, seems of particular note in that it highlights the costs of complying with rules of origin. It is worth noting that the plug-in PPML-lasso selects TBT2 and TBT29, two provisions that are perfectly collinear in our data set. This illustrates both the ability of the method to select variables that are perfectly collinear and the challenges faced when trying to interpret the results in this setting.
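Column (1) of Table 5 is a standard three-way PPML gravity regression. A minimal sketch using the fixest package (whose fepois function implements PPML with high-dimensional fixed effects) might look as follows; the data frame panel and the variable names are hypothetical placeholders for the data structure described in the text:

    # Column (1)-style gravity regression: PPML with a PTA dummy and
    # exporter-time, importer-time, and exporter-importer fixed effects.
    library(fixest)
    gravity <- fepois(
      trade ~ pta | exp_time + imp_time + pair,  # hypothetical variable names
      data = panel                               # hypothetical panel data frame
    )
    exp(coef(gravity)["pta"]) - 1                # partial effect: exp(0.131) - 1 = 0.14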
We next estimate a "post-lasso" PPML regression, that is, a standard PPML regression using only the provisions that were selected in the previous step. These post-lasso PPML results, presented in column (3), show that some of the selected provisions have large effects when estimated in the conventional way. For example, the inclusion of anti-dumping provision AD14, which requires that anti-dumping proceedings establish "material injury" to domestic producers, is associated with an increase in trade flows of about 42% (exp(0.349) − 1 = 0.42). Interestingly, not all of the provisions selected by the lasso step are found to be statistically significant in the post-lasso step. This apparent contradiction arises for two reasons. First, the lasso focuses on the contribution of each variable to the pseudo-likelihood function, which is not the same as testing whether its coefficient is statistically different from zero. Second, because the lasso shrinks all coefficients towards zero simultaneously, it reduces the influence of the collinearity between them and can allow individual provisions that are not significant in the conventional regressions to speak more loudly.

In column (4), we re-estimate the model using the same covariates as column (3) but now re-adding our original PTA dummy from column (1). In this case, the coefficient on PTA captures any effect on trade flows that is not already captured by the provision variables that were selected by the lasso. With this in mind, we take the insignificant and near-zero coefficient on PTA in column (4) as an encouraging indication that the selected provisions completely explain the average PTA effect reported in column (1).

22 Both the PPML standard errors and the plug-in lasso estimates account for clustering, which is done at the agreement level for observations that correspond to agreements, and at the pair level for the remaining observations.
23 For more on the effect of anti-dumping provisions, see Prusa, Teh, and Zhu (2022).

Table 5: PPML, PPML-lasso, and post-lasso PPML results for plug-in approach
Dependent variable: Bilateral Trade Flows (1964-2016, every 4 years)

                                                        PPML     Lasso   Post-lasso   PPML     PPML
                                                         (1)      (2)       (3)       (4)      (5)
 PTA                                                    0.131                        0.008    0.087
                                                       (0.044)                      (0.062)  (0.041)
 EU                                                                                           0.658
                                                                                             (0.087)
 AD14. Anti-dumping - Material Injury                            0.329    0.349      0.347
                                                                         (0.117)    (0.119)
 CP23. Competition Policy - Transparency/Coordination            0.002    0.118      0.118
                                                                         (0.077)    (0.078)
 TBT2 / TBT29. Mutual Recognition†                               0.142    0.184      0.182
                                                                         (0.142)    (0.144)
 TBT7. Technical Reg's: use International Standards              0.016    0.032      0.034
                                                                         (0.078)    (0.080)
 TBT8. Conformity Assessment: Mutual Recognition                 0.028    0.123      0.124
                                                                         (0.099)    (0.099)
 TBT33. Standards: use Regional Standards                        0.109    0.113      0.116
                                                                         (0.061)    (0.064)
 TF45. Issuance of Proof of Origin                               0.000    0.089      0.095
                                                                         (0.032)    (0.053)

 Gravity equations with exporter-time, importer-time, and exporter-importer fixed effects, estimated by PPML using 316,317 observations. Columns labelled "Post-lasso" report PPML coefficients for all variables selected by a plug-in lasso method in a prior step. All other columns report further experiments using PPML. Cluster-robust standard errors are reported in parentheses. * p < 0.10, ** p < 0.05, *** p < 0.01.
 † TBT2 is perfectly collinear with TBT29: TBT2 refers to mutual recognition of technical regulations; TBT29 refers to mutual recognition of standards.

Next, column (5) returns to our original simple model from column (1) but adds a dummy variable for the EU.24 Our reasons for treating the EU separately from other agreements are three-fold.
First, we suspect that not all of the EU's efforts to promote trade are captured in how their provision variables are coded in our data. There could also be unobserved effects that are channeled through the EU's secondary law process, in which the EU's governing institutions are empowered to pass new regulations and directives on an ongoing basis. Second, our provisions data set does not include agreements that are no longer in effect. For the most part, the agreements that cannot be included are EU pre-accession agreements, which obviously are subsumed by the EU agreement once each new member joins the EU. As discussed in Section 2, we deal with this data issue in practice by dropping all observations associated with obsolete agreements. Nonetheless, this could lead to biased estimates of the EU agreement and the provisions associated with it. Third, the latest EU agreement has in place six of the eight provisions selected in column (2) (all except AD14 and TBT7); thus, we want to make sure we are not simply picking up an "EU effect" in the provisions that are selected.

As the PPML results in column (5) show, the estimated EU effect is large, several times that of non-EU PTAs in fact. However, when we treat the EU as a possible predictor in the lasso, we find that it is not selected, and consequently the set of provision variables selected is identical to that in column (2), which is our preferred set to work with in the subsequent iceberg lasso and bootstrap lasso analyses.

5.2 Cross-Validation Lasso Results

As discussed before, the plug-in approach to choosing the penalty parameter λ tends to choose a relatively small set of regressors and may fail to pick the "correct" regressors. For comparison, we now discuss the choice of regressors when we use the cross-validation approach.25 Figure 2 shows how the out-of-sample mean square error (MSE) varies with the log of the tuning parameter, which is scaled by Σijt yijt so that the results do not depend on the scale of the data. At the optimal value of the tuning parameter, λ/Σijt yijt = 0.00025, the cross-validation approach selects 128 provisions to have non-zero effects. Additionally, some of the selected provisions are perfectly collinear with variables that are not selected; if we take this into account, the effective number of provisions selected is 133, which is many more than what we found using the plug-in approach.

For more illustration, Figures 3 and 4 show the corresponding regularization paths for selected provisions.26 That is, the figures show how the value of the estimated (post-lasso) coefficient on the selected provisions changes as we vary λ. As expected, fewer provisions are selected as we increase λ and, for values of λ/Σijt yijt around 0.01, which is forty times larger than the optimal value, we generally see a close correspondence between the results in Figures 3 and 4 and those that we found earlier using the plug-in method.

24 We use EU as shorthand for the EU and EC agreements.
25 As explained before and in the Appendix, the cross-validation is performed clustering by agreement.
26 In each panel of the figures, the fourth set of estimates from the right corresponds to the variables selected by the cross-validation method.

Figure 2: Cross-validation MSE vs. tuning parameter
[Figure: cross-validation MSE plotted against the log of the scaled tuning parameter.]

Note, however, that it is not necessarily the case that the set of provisions selected at lower levels of λ includes the set of provisions selected at higher levels. For example, Figure 3 shows that provision AD14, which was one of the provisions selected by the plug-in approach, is selected with a negative coefficient for the smallest value of λ we consider, drops out when we increase the penalty, and is selected with a positive coefficient for higher values of λ. Intuitively, for small values of λ, the procedure selects many provisions, and the high collinearity between the variables selected makes it difficult to precisely identify their effects. As we increase λ, some provisions are dropped; because many provisions are correlated with AD14, it can be dropped without significant deterioration of the out-of-sample forecasts during cross-validation, and hence it is no longer selected. It is only when the provisions correlated with AD14 are purged from the model, as λ increases even more, that AD14 on its own gains predictive power and is again included.
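Regularization paths like those in Figures 3 and 4 can be traced with glmnet. The sketch below plots the penalized coefficient paths (the figures themselves show post-lasso estimates at each λ, which would require an extra refitting step); X and y stand in for the provision dummies and trade flows after the fixed effects have been dealt with, which we gloss over here:

    # Sketch of a regularization path; X and y are placeholders.
    library(glmnet)
    path <- glmnet(X, y, family = "poisson")
    plot(path, xvar = "lambda", label = TRUE)   # coefficients vs. log(lambda)
    log(path$lambda / sum(y))                   # the scaled tuning parameter used in the text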
Overall, the plug-in and cross-validation approaches lead to the selection of very different sets of trade agreement provisions. While some provisions, such as TBT07 or TF45, are selected by both approaches, others, such as AD14, are only selected by the plug-in method, and many provisions, such as anti-dumping provisions AD05 and AD06, are only selected using cross-validation. Furthermore, we also see in Figures 3 and 4 that many of the estimated effects for the provisions selected by cross-validation are not plausible when interpreted on their own. These observations reflect the known shortcomings of the cross-validation approach that we stated earlier and found support for in our simulations.

Figure 3: Regularization path for selected provisions (AD, ET, CP, STE, SUB, ENV, LM, and MIG)
[Figure omitted.]

Figure 4: Regularization path for selected provisions (IPR, TBT, SPS, SER, ROR, TF, INV, MOC, and PP)
[Figure omitted.]

5.3 Iceberg Lasso Results

As previously mentioned, we cannot be certain whether the variables selected by the lasso have a causal effect on trade or are simply highly correlated with the variables that have a causal effect. In this section, we investigate this issue further by carrying out the iceberg lasso analysis we proposed earlier. That is, for each of the provisions from our preferred set of estimates (those from the third column of Table 5), we run an additional plug-in lasso regression in which we regress each selected provision on all of the provisions excluded by our first-stage lasso.27 As discussed, the purpose of these auxiliary regressions is to construct bundles of provisions that, at least when combined, are likely to have a causal impact on trade flows when included in trade agreements. As we have noted, the reader should be cautioned that we will not be able to say with high certainty whether a given provision is important for promoting trade but, as we will see, this method gives us significantly increased parsimony relative to cross-validation. Furthermore, as we have seen from our simulations, it should also give us more confidence in the results.

Table 6 presents the results of our iceberg lasso analysis. The first two rows of Table 6 list each of the eight provisions selected by the first-stage plug-in lasso, as well as their estimated impact on trade flows from column (3) of Table 5. The subsequent rows of Table 6 report all provisions that were not selected by the lasso in the first step but are identified in the second step of the iceberg lasso; we also report the correlation of each of these provisions with the selected provision in the first row. Finally, the last row reports the R² of the regression of each selected provision on the corresponding correlated provisions. For example, column (1) shows that anti-dumping provision AD14 is highly correlated with two further anti-dumping provisions (AD06 and AD08), as well as with one provision on environmental protection (ENV42); the R² of the regression of AD14 on these three provisions is 0.95.

27 These linear plug-in lasso regressions are performed using only the 34,370 observations for which PTAs are in force. This is because the provisions are identically zero for the remaining observations, which therefore are not informative about the relations of interest. As a consequence, the clustering now is only by agreement.
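In R, the second stage of the iceberg lasso amounts to a loop over the first-stage selections. In this sketch, selected (the column indices picked by the first-stage plug-in lasso), X (the matrix of provision dummies for observations with a PTA in force), and lambda_plugin (the plug-in penalty level) are assumed inputs:

    # Second stage of the iceberg lasso, a sketch: regress each selected
    # provision on all excluded provisions by linear lasso, as in footnote 27.
    library(glmnet)
    iceberg <- lapply(selected, function(k) {
      aux <- glmnet(X[, -selected, drop = FALSE], X[, k], family = "gaussian")
      b   <- as.matrix(coef(aux, s = lambda_plugin))
      rownames(b)[b != 0 & rownames(b) != "(Intercept)"]  # Table 6 entries for provision k
    })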
The results in Table 6 show that the iceberg lasso identifies a total of 42 (= 8 + 34) distinct provisions that are likely to be associated with increased trade. This finding contrasts with the 133 provisions identified by the cross-validation lasso and the 8 provisions selected by the plug-in lasso. Therefore, as in the simulations in the preceding section, the iceberg lasso appears to provide a good compromise between the cross-validation lasso, which selects so many provisions that its results are difficult to interpret, and the plug-in lasso, which is likely to miss important provisions.

Table 6: Iceberg lasso results

 (1) AD14 (+41.7%)   (2) CP23 (+12.5%)   (3) TBT02/29 (+20.2%)   (4) TBT07 (+3.2%)
 (5) TBT08 (+13.1%)  (6) TBT33 (+12.0%)  (7) TF45 (+9.3%)

 Provisions identified in the second step (raw correlation with the column's provision in parentheses):
 AD06 (0.98)   AD06 (0.40)   AD06 (-0.07)  AD06 (0.51)   SUB10 (0.84)  AD11 (-0.05)  AD06 (0.16)
 AD08 (0.98)   AD08 (0.40)   AD08 (-0.07)  AD08 (0.51)   TF42 (0.93)   ENV44 (-0.02) AD08 (0.16)
 ENV42 (0.98)  CP22 (0.80)   CP14 (0.61)   ENV42 (0.51)  MOC26 (-0.10) AD11 (0.08)
 CP24 (0.89)   CP21 (0.77)   ENV44 (0.08)  PP08 (-0.01)  CP15 (0.71)
 ENV42 (0.40)  CP22 (0.80)   SPS21 (0.16)  SUB07 (0.08)  ENV19 (0.40)
 PP08 (0.05)   ENV22 (-0.01) SUB07 (0.10)  TBT05 (0.69)  ENV27 (0.50)
 SPS24 (-0.05) ENV42 (-0.07) TBT15 (0.68)  TBT06 (0.98)  ENV42 (0.16)
 STE31 (0.54)  ENV44 (-0.01) TBT34 (0.93)  TBT14 (0.89)  MOC26 (0.16)
 TBT10 (-0.01) SPS11 (-0.00) TBT15 (0.58)  STE37 (0.06)  TF42 (0.65)
 STE32 (0.66)  TBT32 (0.69)  SUB07 (0.03)
 TF43 (-0.04)  SUB09 (0.78)  TBT34 (0.42)  SUB10 (0.28)  TF44 (0.38)
 SUB10 (0.90)  TF42 (0.64)   TF44 (0.98)   TF42 (0.98)

 R² (columns (1)-(7)): 0.95, 0.82, 0.97, 0.86, 0.86, 0.97, 0.96

 Notes: Table shows PTA provisions associated with increases in bilateral trade flows (row 1), together with the estimated increase in trade flows (row 2), as well as other provisions that predict the provision in row 1 (rows 3-15; numbers in brackets are raw correlations with the provision from row 1). The last row displays the R² of the regression of each selected provision on the corresponding correlated provisions.

Looking in more detail at the results in Table 6, we find that provision AD14 is correlated with other anti-dumping provisions; this correlation is not surprising because all these provisions fulfill a similar purpose, which is to increase transparency in the use of anti-dumping duties. In that sense, one conclusion to be drawn from this exercise is that anti-dumping provisions are likely to increase trade flows, although we cannot say which of them has the biggest effect. Table 6 shows that, more surprisingly, AD14 is also strongly correlated with ENV42. This correlation seems to be due to what might be called a template effect, that is, the tendency of important trading blocs such as the EU and the US to use similar provisions in all their agreements.
For example, most agreements signed by the EU include provisions on anti-dumping and the environment, hence leading to a high correlation between the corresponding provisions in our data.28

The same provisions that were found to be correlated with AD14 also have a reasonably high correlation with CP23, which serves to promote transparency in competition policy. That said, the variables with the strongest correlations with CP23 are other competition policy provisions, namely CP22 and CP24. Thus, it seems likely that the presence of provisions on competition policy is behind the observed trade-increasing effect of CP23, although we are again unable to say exactly which provision is driving this effect.

We find that TBT07 also has a substantial correlation with the above-mentioned AD06, AD08, and ENV42 provisions but, not surprisingly, the strongest correlations are with other TBT provisions (TBT15, TBT34) that also relate to the use of international standards. Thus, it seems that provisions encouraging the use of international standards in the area of technical barriers to trade are likely to be behind the trade increases associated with provision TBT07, although we cannot say which of the individual TBT provisions is driving the observed effect.

As for the other TBT provisions selected in the first step, TBT02/29, TBT08, and TBT33, they are all strongly related to TF42, a trade facilitation provision, with TBT02/29 being also correlated with provisions related to competition policy (CP14, CP21, and CP22), state-owned enterprises (STE32), and subsidies (SUB09 and SUB10), and TBT33 with other TBT provisions such as TBT06 and TBT14. This set of results makes clear that provisions related to TBT are likely to have a significant trade facilitation effect, but we are not able to identify precisely which ones are relevant.

The plug-in PPML-lasso also selects a provision related to the simplification of procedures to issue proof of origin (TF45), and this provision is highly correlated with TF44, which relates to the simplification of requirements for proof of origin. As noted above, Table 6 also indicates that other trade facilitation provisions are correlated with some of the provisions selected in the first stage; this is true for CP23, TBT33, and especially for TBT02/29 and TBT08. Thus, our results suggest that trade facilitation procedures, particularly those related to rules of origin, are likely to play a significant role in increasing trade flows.

Finally, the iceberg lasso also identifies provisions from other areas that help predict the provisions identified in the first step. For example, provisions in policy areas such as movement of capital and public procurement are related to TBT33, but these types of provisions are associated with smaller raw correlations. By the logic of the lasso, it is likely that these provisions are informative for predicting the presence of TBT33 in a relatively small number of agreements where other provisions with higher raw correlations are not found.

In summary, although it is not possible to identify with certainty which provisions are most important for increasing trade, our results allow us to find a relatively small bundle of provisions that are likely to have the desired effect. In particular, provisions related to TBTs, anti-dumping, trade facilitation, subsidies, and competition policy are likely to enhance the trade-increasing effect of trade agreements.

28 In our data, ENV42 is perfectly collinear with AD06 and AD08.
5.4 Bootstrap Lasso Results

As an alternative to the iceberg lasso, we now present the results obtained with the bootstrap lasso. Tables 7 and 8 summarize the results obtained from 250 bootstrap samples. The resampling process treats pairs belonging to the same agreement as belonging to the same cluster, treating pairs as clusters otherwise. In each replication, we perform selection using the plug-in lasso and record which variables are selected, along with their post-lasso PPML coefficient estimates.

Table 7: Bootstrap lasso results

 Provisions with largest          Provisions selected
 average coefficients             most frequently
 AD14       0.079                 AD14       0.372
 CP23       0.065                 CP23       0.320
 CP22       0.063                 TBT07      0.308
 AD05       0.055                 SPS06      0.228
 TBT07      0.054                 TBT08      0.208
 TBT02/29   0.048                 SUB12      0.184
 TBT08      0.038                 TBT02/29   0.168
 SUB12      0.030                 TBT33      0.160
 TBT34      0.029                 CP22       0.156
 SPS06      0.028                 TBT34      0.152
 TF42       0.027                 TBT06      0.148
 TBT33      0.023                 AD05       0.140
 TF41       0.023                 CP21       0.124
 TBT06      0.021                 TF45       0.116
 CP21       0.020                 ENV33      0.116

 Notes: Bootstrap plug-in lasso performed using cluster-bootstrap resampling with 250 replications. The numbers shown are (left) the 15 largest average post-lasso coefficient estimates across all replications and (right) selection frequencies for the 15 most frequently selected provisions.

Table 7 presents the average coefficients for the provisions with the 15 largest average coefficients across all replications (on the left), as well as the selection frequencies for the 15 most frequently selected provisions (on the right). It is worth noting that even the provisions that are selected most frequently in relative terms are selected less than half of the time. For example, AD14 is the most commonly selected provision, and it has the largest coefficient estimate of the variables selected by the plug-in lasso (see Table 5), but it is only selected in 37% of replications. This illustrates that, as discussed before, we should only have limited confidence that AD14 is the provision that delivers the effect indicated by the original plug-in estimates for AD14. At the same time, if we take the method literally, AD14 is found to be quantitatively more likely to matter than other provisions.

Overall, the results in Table 7 are reassuring in that they broadly confirm our earlier findings using the iceberg lasso. Indeed, most of the provisions in Table 7 were previously identified as potentially relevant by the iceberg lasso. Moreover, there are multiple provisions related to anti-dumping, competition policy, trade facilitation, and TBTs that tend to be selected with relatively high frequency and have relatively high average coefficients (when averaged across all the bootstrap replications), reminiscent of the provision groupings that were indicated with the iceberg method.
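The resampling scheme can be sketched as follows. Here run_plugin_lasso is a hypothetical stand-in for the plug-in PPML-lasso step (it is not a function from any package), and cluster_id assigns each observation to its agreement (or pair) cluster:

    # Cluster bootstrap of the plug-in lasso selection step, a sketch.
    set.seed(3)
    B   <- 250
    ids <- split(seq_len(nrow(X)), cluster_id)             # observations by cluster
    sel <- matrix(0, B, ncol(X), dimnames = list(NULL, colnames(X)))
    for (b in 1:B) {
      draw  <- unlist(ids[sample(length(ids), replace = TRUE)])  # resample whole clusters
      fit_b <- run_plugin_lasso(y[draw], X[draw, ])              # hypothetical helper
      sel[b, fit_b$selected] <- 1                                # record selections
    }
    colMeans(sel)     # selection frequencies, as in the right panel of Table 7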
Table 8: Bootstrap lasso results: Summarizing results by provision category

                        Number of provisions   Number of provisions   Sum of average
                        selected more than     selected more than     post-lasso effects
                        5% of the time         1% of the time         across categories
 Anti-dumping                   3                      5                   0.171
 Competition Policy             3                      5                   0.151
 Environment                    1                      5                   0.017
 Export Taxes                   2                      5                   0.049
 Investment                     0                      2                   0.020
 IPR                            0                      5                   0.019
 Labor Markets                  0                      0                   0.000
 Migration                      1                      1                   0.012
 Movement of Capital            1                      2                   0.023
 Public Procurement             0                      1                   0.013
 Rules of Origin                1                      4                   0.021
 Services                       0                      1                   0.004
 SPS                            1                     10                   0.062
 State aid                      2                      2                   0.011
 Subsidies                      5                      7                   0.076
 TBTs                           8                     13                   0.237
 Trade Facilitation             2                      5                   0.064
 Total                         30                     74                   0.951

 Note: The table documents the categories in which provisions were most likely to be selected and the total of the average coefficients of each provision within each category.

Table 8 further summarizes the bootstrap lasso results by documenting the broad provision categories in which provisions were most likely to be selected, as well as the sum of the average coefficients within each category. These results, therefore, show which provision categories, when taken as a whole, are likely to have the biggest impact on trade. The category with the biggest total impact turns out to be TBTs, followed by anti-dumping and competition policy. Next after that are subsidies, sanitary and phytosanitary measures, trade facilitation, and export taxes. Overall, the differences between categories seem to comport with intuition (very small impacts for services and labor markets, for example). They are also, again, broadly in line with the findings obtained with the iceberg lasso.

5.5 Predicting the effect of trade agreements

Having identified sets of provisions that are more likely to positively affect trade flows, it is natural to think of ways to use this information to evaluate the effects of different PTAs, and even to predict the impact of new ones. In the remainder of this section we discuss ways to perform these prediction exercises and the associated caveats.29

The simulation results presented in Section 4 suggest that, in small to moderate samples, the most reliable predictions are the ones based on the (post-lasso) PPML estimates of a model whose regressors are the provisions selected by the plug-in lasso. This kind of prediction can easily be obtained using the results in column (3) of Table 5. For example, we have noted that the latest EU agreement includes all the provisions selected by the plug-in lasso, with the exception of AD14 and TBT7. Therefore, the effect of the latest EU agreement is estimated to be 87% (exp(0.118 + 0.184 + 0.123 + 0.113 + 0.089) − 1 = 0.87). This result is comparable to the effect estimated when the EU dummy is included in the model as in column (5) of Table 5, which is 86% (exp(0.618) − 1 = 0.86).30

In results that are summarized in the third column of Table 9, we repeat this exercise for each of the PTAs in our data.31 As in Baier, Yotov, and Zylkin (2019), we find a wide variety of effects, ranging from very large impacts in agreements such as the Eurasian Economic Union, which includes all of the selected provisions, to no effect at all in agreements that do not include any, such as ASEAN.32 In comparison with column (1) of this table, which describes results for PPML with the full set of provision variables, we see an immediate advantage of using the plug-in method to model PTA heterogeneity: it greatly cuts down on overfitting. The range spanned by the estimates obtained with the full set of provisions reaches implausibly large positive and negative values at the extremes, and their standard deviation is thousands of times that of the estimates produced using the plug-in lasso. As shown in column (2), overfitting may also be a problem for the predictions generated by the cross-validation lasso, which also leads to some implausible estimates. These results resonate with what we found in the simulations reported in Section 4, where both the model with all regressors and the model with regressors selected by cross-validation performed poorly.

29 As in Section 4, in this section we compute penalized predictions when using cross-validation, and post-lasso unpenalized predictions for the plug-in, iceberg, and bootstrap lasso. For the bootstrap lasso, the predictions are obtained by averaging the post-lasso predictions in each of the bootstrap samples.
30 Of course, using the delta method it is possible to obtain confidence intervals for these effects. However, such confidence intervals do not take into account model uncertainty, which is likely to be the main source of uncertainty in this context. We consider this issue below.
31 Note that the average estimated effect is 13.8%, which is very close to the estimated PTA effect of 14.0% corresponding to the result in column (1) of Table 5.
32 In contrast to Baier, Yotov, and Zylkin (2019), we are able to identify heterogeneity across different PTAs but not within PTAs.
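The arithmetic behind these agreement-level predictions is simple. Using the post-lasso coefficients from column (3) of Table 5:

    # Predicted partial effect of a PTA: sum the post-lasso coefficients of
    # the selected provisions it contains (column (3) of Table 5).
    beta <- c(AD14 = 0.349, CP23 = 0.118, TBT02_29 = 0.184, TBT07 = 0.032,
              TBT08 = 0.123, TBT33 = 0.113, TF45 = 0.089)
    eu <- c("CP23", "TBT02_29", "TBT08", "TBT33", "TF45")  # latest EU agreement
    exp(sum(beta[eu])) - 1                                 # = 0.87, as in the text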
We next consider the performance of the two extensions of the plug-in lasso we have proposed, the iceberg lasso and the bootstrap lasso. The iceberg lasso has the advantage that it is likely to select more of the provisions with a causal impact than the plug-in lasso. Moreover, it performed reasonably well as a predictive method in our simulations. However, as is apparent from column (4) of Table 9, in this application predictions based on the iceberg lasso lead to some unrealistic estimates. Intuitively, the provisions selected by the iceberg lasso will, by design, include multiple regressors that are highly collinear with one another. Therefore, although it may be possible to estimate the joint effect of these variables with reasonable precision, the same is unlikely to be the case for each individual effect. This implies that the iceberg lasso is likely to be a good predictor of the effect of PTAs that include all of these variables, but it may lead to unreliable results for PTAs that only include a subset of the highly collinear provisions.

Table 9: Summarizing Estimates of Heterogeneous PTA Effects

                          (1)            (2)       (3)       (4)       (5)
                      All variables      CV      Plug-in   Iceberg   Bootstrap
 Descriptive statistics
 Min                     -81.2%       -50.4%      0.0%     -62.8%      0.0%
 Max                     > 1e6%       387.0%    144.4%     284.9%    101.0%
 Mean                 328,774.6%       32.1%     13.8%      17.2%     12.5%
 Median                   26.4%        14.4%      9.3%       6.7%      7.2%
 Stdev.               300,514.7pp      63.0pp    20.7pp     42.4pp    15.3pp
 Correlations
 PPML                      1           0.146     0.054      0.233     0.041
 CV                       0.146        1         0.391      0.550     0.513
 Plug-in                  0.054        0.391     1          0.507     0.925
 Iceberg                  0.233        0.550     0.507      1         0.679
 Bootstrap                0.041        0.513     0.925      0.679     1
 Estimated partial effects for selected PTAs
 EU                      104.9%       105.4%     87.1%     101.6%     64.2%
 EEA                      80.4%        90.5%      9.3%      94.4%     18.3%
 Eurasian Econ. Union     21.8%        71.8%    144.4%      38.5%    101.0%
 NAFTA                    77.9%        77.5%     79.9%      81.5%     52.9%
 MERCOSUR                145.5%       115.9%     42.1%      76.2%     39.6%
 ECOWAS                  469.6%       379.2%      9.3%      23.3%     19.4%
 ASEAN                     1.8%         9.4%      0.0%       0.0%      3.3%

 This table summarizes estimated partial effects for individual PTAs produced by the different methods we consider.
The column labelled "All variables" refers to an unpenalized PPML regression with all 305 provision variables. The other columns refer to variants of the lasso discussed in Section 3.

The predictions based on the bootstrap lasso performed well in our simulations, and this approach also shows promise here. As shown in column (5) of Table 9, the PTA estimates produced by the bootstrap lasso are less extreme and have the lowest dispersion of any of the methods we consider, consistent with what would be expected for a method based on bootstrap aggregation. Though they are highly correlated with the estimates produced by the plug-in lasso, the selected PTA estimates shown in the bottom panel of Table 9 reveal that the estimated effects obtained with the plug-in lasso and the bootstrap lasso can differ substantially for individual PTAs.

It should be noted that the bootstrap lasso is the only approach we have considered that can provide information about model uncertainty. Indeed, as a by-product of the bootstrap sampling procedure, it can provide confidence intervals showing how sensitive predictions of individual PTA effects are to the particular sample that is used in the estimation. We have not rigorously evaluated the validity of such confidence intervals for bounding prediction uncertainty, but it is certainly an avenue worth exploring.

In summary, the plug-in lasso is our preferred method to estimate the effect of individual PTAs, but the bootstrap lasso may be a worthwhile check at the very least. The results of this exercise, however, need to be treated with some caution. As we have repeatedly noted, the results of the plug-in lasso do not have a causal interpretation. Therefore, their accuracy in predicting the effects of individual PTAs will depend, at least to some extent, on whether the selected provisions themselves have a causal impact on trade or serve as a signal of the presence of provisions that have a causal effect. When this condition holds, the predictions based on this method are likely to be reasonably accurate and, indeed, the simulation results reported in Section 4 show that this approach can work well even in situations where the variables having a causal impact on the outcome are not selected by the plug-in lasso. That said, it is possible to envision scenarios where predictions based on the plug-in lasso fail dramatically. For example, it could be the case that a PTA is incorrectly measured to have zero impact despite having many of the true causal provisions.

6 Conclusions

In this paper, we have proposed new methods for assessing the impact of individual trade agreement provisions on trade flows. While other work in this area has relied on summary measures of agreement depth or on specific provision bundles of interest, our approach is instead to study the rich provision content of PTAs as a variable selection problem. By combining the three-way PPML estimator that is popular in the study of PTAs with lasso methods for variable selection, we are able to identify a relatively parsimonious set of provisions that are most likely to impact trade. While these provisions span a range of policy areas, our results generally support the conclusion that a select number of provisions related to technical barriers to trade, anti-dumping, trade facilitation, subsidies, and competition policy are most effective at promoting trade, as compared to other types of provisions that appear in PTAs.
In spite of the obvious appeal that lasso methods have in this context, we need to be clear that interpreting their results requires some important caveats. In particular, we know that it is possible that even our preferred lasso methods may fail to discover important trade-promoting provisions, and that they are almost certain to lead to the inclusion of provisions that are not relevant. The iceberg lasso and bootstrap lasso methods do, however, improve upon both the standard cross-validation lasso and the plug-in lasso as variable selection methods.

In terms of broader applications, our methods are not limited to just PTAs or even just to trade. There are many other contexts in which the iceberg lasso and bootstrap lasso methods we have introduced could be helpful tools for researchers wishing to determine which of a large number of variables are worth focusing on as most relevant for the outcome. Furthermore, by integrating the lasso into a nonlinear model with high-dimensional fixed effects, we show how machine learning methods for variable selection and related tasks can be utilized in much more general settings than what had been possible previously.

Appendix

Table A1: Provisions selected by the iceberg lasso

Anti-dumping
 AD06  If there are no sales in the normal course of trade in the domestic market of the exporting country
 AD08  Cost of production in the country of origin plus a reasonable amount
 AD11  Price effects of dumped imports
 AD14  Requirement to establish material injury to domestic producers

Competition Policy
 CP14  Does the agreement require the establishment or existence of competition policy (either economy-wide or sector-specific)?
 CP15  Does the agreement prohibit/regulate cartels/concerted practices?
 CP21  Does the agreement regulate mergers and acquisitions?
 CP22  Does the agreement contain provisions that promote predictability?
 CP23  Does the agreement contain provisions that promote transparency?
 CP24  Does the agreement contain provisions that promote the right of defense?

Environmental Laws
 ENV19 Does the agreement regulate pollution by ships?
 ENV22 Does the agreement regulate fishing subsidies?
 ENV27 Does the agreement promote renewable energy and improving energy efficiency?
 ENV42 Does the agreement require states to comply with the UN Conference on Environment and Development?
 ENV44 Does the agreement require states to comply with the International Energy Program?

Movement of Capital
 MOC26 Does the transfer provision explicitly exclude "good faith" and non-discriminatory application of its laws related to prevention of deceptive and fraudulent practices?

Public Procurement
 PP08  Does the agreement contain explicit provisions on MFN treatment of third parties?

Sanitary and Phytosanitary Measures
 SPS11 Does the agreement promote the creation of concerted/regional standards?
 SPS21 Risk Assessment: Is there reference to international standards/procedures?
 SPS24 Is the burden of justifying non-equivalence on the importing country?

Table A1 (cont'd): Provisions selected by the iceberg lasso

State-Owned Enterprises
 STE31 Does the agreement prohibit anti-competitive behavior of state enterprises?
 STE32 Does the agreement require state enterprises not to distort trade?
 STE37 Does the agreement indicate the geographical market where the objectionable conduct or the effect takes place?

Subsidies
 SUB07 Does the agreement introduce any ceiling to permitted subsidies?
 SUB09 Does the agreement include any specific regulation of agricultural subsidies?
 SUB10 Does the agreement include any specific regulation of fisheries subsidies?

Technical Barriers to Trade
 TBT02 Technical Regulations - Is mutual recognition in force?
 TBT05 Technical Regulations - Are there specified existing standards to which countries shall harmonize?
 TBT06 Technical Regulations - Is the use or creation of regional standards promoted?
 TBT07 Technical Regulations - Is the use of international standards promoted?
 TBT08 Conformity Assessment - Is mutual recognition in force?
 TBT10 Conformity Assessment - Do parties participate in international or regional accreditation agencies?
 TBT14 Conformity Assessment - Is the use or creation of regional standards promoted?
 TBT15 Conformity Assessment - Is the use of international standards promoted?
 TBT29 Standards - Is mutual recognition in force?
 TBT32 Standards - Are there specified existing standards to which countries shall harmonize?
 TBT33 Standards - Is the use or creation of regional standards promoted?
 TBT34 Standards - Is the use of international standards promoted?

Trade Facilitation and Customs
 TF42 Does the agreement regulate customs and other duties collection?
 TF43 Does the agreement require the sharing of customs revenues?
 TF44 Do trade facilitation provisions simplify requirements for proof of origin?
 TF45 Do trade facilitation provisions simplify procedures to issue proof of origin?

More Details on HDFE-PPML-Lasso Estimation

The minimization problem that defines the three-way PPML-lasso is

\[
(\hat{\beta},\hat{\alpha},\hat{\gamma},\hat{\eta}) := \arg\min_{\beta,\alpha,\gamma,\eta}\ \frac{1}{n}\sum_{i,j,t}\exp(x_{ijt}'\beta+\alpha_{it}+\gamma_{jt}+\eta_{ij})-\frac{1}{n}\sum_{i,j,t}y_{ijt}\left(x_{ijt}'\beta+\alpha_{it}+\gamma_{jt}+\eta_{ij}\right)+\frac{\lambda}{n}\sum_{k=1}^{p}\hat{\phi}_k|\beta_k|, \qquad (3)
\]

where φ̂_k, to be precisely defined below, is identical to 1 except when the plug-in method is used. The first-order conditions (FOCs) for this problem are

\[
\hat{\alpha}_{it}:\ \frac{1}{n}\sum_{j}\left(y_{ijt}-\hat{\mu}_{ijt}\right)=0,\ \forall i,t;
\]
\[
\hat{\gamma}_{jt}:\ \frac{1}{n}\sum_{i}\left(y_{ijt}-\hat{\mu}_{ijt}\right)=0,\ \forall j,t;
\]
\[
\hat{\eta}_{ij}:\ \frac{1}{n}\sum_{t}\left(y_{ijt}-\hat{\mu}_{ijt}\right)=0,\ \forall i,j;
\]
\[
\hat{\beta}_{k}:\ \frac{1}{n}\sum_{i,j,t}\left(y_{ijt}-\hat{\mu}_{ijt}\right)x_{ijt,k}-\frac{\lambda}{n}\hat{\phi}_k\,\mathrm{sign}(\hat{\beta}_k)=0,\ k=1,\ldots,p;
\]

where μ̂_ijt denotes μ_ijt := exp(x'_ijt β + α_it + γ_jt + η_ij) evaluated at β̂, α̂, γ̂, η̂. Notice that the penalty only affects the FOCs for the main covariates of interest. The FOCs for the fixed effects are exactly the same as they would be in unpenalized PPML. That said, further simplification is still needed because it is generally not possible to estimate all of the parameters directly, with or without the penalty. Instead, we first need to "concentrate out" the fixed-effect parameters. That is, instead of minimizing (3) over all of the parameters, we treat α̂_it, γ̂_jt, and η̂_ij as functions of β that are implicitly defined by their FOCs. The resulting "concentrated" minimization problem is

\[
\hat{\beta} := \arg\min_{\beta}\ \frac{1}{n}\sum_{i,j,t}\exp\left(x_{ijt}'\beta+\hat{\alpha}_{it}(\beta)+\hat{\gamma}_{jt}(\beta)+\hat{\eta}_{ij}(\beta)\right)-\frac{1}{n}\sum_{i,j,t}y_{ijt}\left[x_{ijt}'\beta+\hat{\alpha}_{it}(\beta)+\hat{\gamma}_{jt}(\beta)+\hat{\eta}_{ij}(\beta)\right]+\frac{\lambda}{n}\sum_{k=1}^{p}\hat{\phi}_k|\beta_k|, \qquad (4)
\]

such that β is now the only argument we need to solve for. The FOC for each β̂_k associated with this modified problem is

\[
\hat{\beta}_k:\ \frac{1}{n}\sum_{i,j,t}\left[y_{ijt}-\exp\left(x_{ijt}'\hat{\beta}+\hat{\alpha}_{it}(\hat{\beta})+\hat{\gamma}_{jt}(\hat{\beta})+\hat{\eta}_{ij}(\hat{\beta})\right)\right]\tilde{x}_{ijt,k}-\frac{\lambda}{n}\hat{\phi}_k\,\mathrm{sign}(\hat{\beta}_k)=0,
\]

where

\[
\tilde{x}_{ijt,k} := x_{ijt,k}+\frac{d\hat{\alpha}_{it}(\beta)}{d\beta_k}+\frac{d\hat{\gamma}_{jt}(\beta)}{d\beta_k}+\frac{d\hat{\eta}_{ij}(\beta)}{d\beta_k} \qquad (5)
\]

captures both the direct and indirect effects of a change in β_k on the conditional mean of y_ijt.

To explain how we deal with the fixed effects, assume for the moment that we know the true values of μ_ijt := exp(x'_ijt β + α_it + γ_jt + η_ij) that we will eventually estimate.
If that is the case, then the penalized PPML solution (β, α, γ, η) is also the solution to the following weighted least squares problem:

\[
\min_{\beta,\alpha,\gamma,\eta}\ \frac{1}{2n}\sum_{i,j,t}\mu_{ijt}\left(z_{ijt}-x_{ijt}'\beta-\alpha_{it}-\gamma_{jt}-\eta_{ij}\right)^2+\frac{\lambda}{n}\sum_{k=1}^{p}\hat{\phi}_k|\beta_k|,
\]

where

\[
z_{ijt}=\frac{y_{ijt}-\mu_{ijt}}{\mu_{ijt}}+\log\mu_{ijt}
\]

is the transformed dependent variable that is used to motivate estimation via iteratively re-weighted least squares (IRLS). The convenient thing about this representation of the problem is that we can rewrite it as

\[
\min_{\beta}\ \frac{1}{2}\sum_{i,j,t}\left(\tilde{z}_{ijt}-\tilde{x}_{ijt}'\beta\right)^2+\lambda\sum_{k=1}^{p}\hat{\phi}_k|\beta_k|, \qquad (6)
\]

where z̃_ijt and x̃_ijt are respectively defined as the "partialed-out" versions of z_ijt and x_ijt, which are obtained by within-transforming z_ijt and x_ijt with respect to it, jt, and ij and weighting by μ_ijt. The within-transformation steps involved in computing z̃_ijt and x̃_ijt are the same as in Correia, Guimarães, and Zylkin (2020) and can be computed quickly using the methods of Gaure (2013). Furthermore, one can show that the x̃_ijt that appears in (6) is consistent with the definition given for x̃_ijt,k in (5).

The nice thing about expressing the problem as in (6) is that it now resembles a simple penalized regression problem. It can thus be quickly solved using the coordinate descent algorithm of Friedman, Hastie, and Tibshirani (2010). Furthermore, though we do not know the correct estimation weights (the μ_ijt's) beforehand, we can follow the approach of Correia, Guimarães, and Zylkin (2020) by repeatedly updating them until convergence after each new estimate of β, as in IRLS estimation. Altogether, our algorithm closely follows Correia, Guimarães, and Zylkin (2020) and otherwise only involves swapping out their weighted least squares step for a penalized weighted least squares step, as shown in (6). In principle, this algorithm can easily be modified to other settings that feature multi-way fixed effects in order to simplify estimation.
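A stylized version of this IRLS loop, abstracting from the fixed-effect within-transformation (which is the computationally important step in the application), might look as follows; the single-lambda call to glmnet stands in for the penalized weighted least squares step in (6):

    # Stylized IRLS loop for penalized Poisson regression, a sketch.
    library(glmnet)
    irls_plasso <- function(X, y, lambda, tol = 1e-8, max_iter = 100) {
      mu   <- rep(mean(y), length(y))            # initial conditional means
      beta <- rep(0, ncol(X))
      for (it in 1:max_iter) {
        z   <- (y - mu) / mu + log(mu)           # IRLS working dependent variable
        fit <- glmnet(X, z, weights = mu, lambda = lambda, standardize = FALSE)
        beta_new <- as.numeric(coef(fit))[-1]
        mu <- as.numeric(exp(coef(fit)[1] + X %*% beta_new))   # update the means
        if (max(abs(beta_new - beta)) < tol) break
        beta <- beta_new
      }
      beta
    }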
More Details on Plug-in Lasso

Rather than relying on out-of-sample performance, the Belloni, Chernozhukov, Hansen, and Kozbur (2016) "plug-in" lasso method chooses the penalty parameters λ and φ̂_k using statistical arguments. Their specific framework is a simple linear panel-data model, but their reasoning involves modifying the standard lasso penalty to reflect the variance of the score. These concepts are quite general; thus, we can modify their approach to take into account the more complex case of a nonlinear model with multiple fixed effects.

The key condition in choosing these penalty parameters is that they should satisfy the following inequality for all k:

\[
\frac{\lambda\hat{\phi}_k}{n}\ \geq\ \frac{c}{n}\left|\sum_{i,j,t}\left(y_{ijt}-\exp(x_{ijt}'\beta+\alpha_{it}+\gamma_{jt}+\eta_{ij})\right)\tilde{x}_{ijt,k}\right|\quad \forall k, \qquad (7)
\]

for some c > 1. Intuitively,

\[
\frac{1}{n}\left|\sum_{i,j,t}\left(y_{ijt}-\exp(x_{ijt}'\beta+\alpha_{it}+\gamma_{jt}+\eta_{ij})\right)\tilde{x}_{ijt,k}\right|
\]

is the absolute value of the score for β_k. When evaluated at β_k = 0, it tells us to what degree moving each β_k away from zero will affect the fit of the model. If it does not produce a sufficient improvement in fit as compared to the penalty λφ̂_k, then regressor x_ijt,k will not be selected.

Next, suppose that the observations associated with trade agreements are partitioned into G clusters indexed by g = 1, ..., G, and let o = (i, j, t) serve as the unique index for each observation. Set

\[
\hat{\phi}_k^2=\frac{1}{n}\sum_{g}\left(\sum_{o\in g}\tilde{x}_{o,k}\hat{\varepsilon}_o\right)^2=\frac{1}{n}\sum_{g}\sum_{o\in g}\sum_{o'\in g}\tilde{x}_{o,k}\tilde{x}_{o',k}\hat{\varepsilon}_o\hat{\varepsilon}_{o'},
\]

where ε̂_o = ε̂_ijt = y_ijt − exp(x'_ijt β̂ + α̂_it + γ̂_jt + η̂_ij), but can also be obtained as ε̂_o = ε̂_ijt = μ̂_ijt(z̃_ijt − x̃'_ijt β̂). By inspection, this expression provides an estimate of the variance of the score for β_k under the assumption that errors are correlated within their respective clusters. Under suitable regularity conditions, φ̂_k² − φ_k² = o_p(1) uniformly in k, where φ_k² is the analogue of φ̂_k² evaluated at the true values of μ_ijt. By choosing φ̂_k in this way, we ensure that the score for β_k evaluated at zero must be large compared to its standard deviation in order for regressor k to be selected.

The choice of λ then involves setting a value that is sufficiently large that the statistical probability that an irrelevant regressor is selected is small. By the maximal inequality for self-normalized sums (see Jing, Shao, and Wang, 2003), it follows that

\[
\frac{\Pr\left(\hat{\phi}_k^{-1}\,n^{-1/2}\left|\sum_{i,j,t}\tilde{x}_{ijt,k}\,\varepsilon_{ijt}\right|\geq m\right)}{\Pr\left(|N(0,1)|\geq m\right)}-1=o(1),
\]

for |m| = o(n^{1/6}), thus establishing a bound for the tails of the normalized sum. This suggests that, by choosing a λ that is sufficiently large to dominate a p-dimensional standard normal, the inequality in (7) is satisfied. Hence, following Belloni, Chernozhukov, Hansen, and Kozbur (2016), we set

\[
\lambda=\lambda_{plug}=2c\sqrt{n}\,\Phi^{-1}\left(1-\gamma/2p\right),
\]

where c = 1.1 and γ = 0.1/log(n).

As discussed in the main text, after the lasso step we then perform an unpenalized PPML estimation using the selected covariates, a so-called "post-lasso" regression. Let β̂_PL be the estimator of the parameters associated with the s selected covariates. Such an estimator is said to have the "oracle property" if the asymptotic distribution of β̂_PL coincides with that of the estimator we would obtain if we knew exactly which coefficients were equal to zero; i.e., for large enough samples we would have β̂_PL,k = 0 if and only if β_k = 0 for k = 1, ..., p. Hence, for estimators with the oracle property, asymptotically the post-lasso model is indeed the right model. In general, the lasso does not satisfy the oracle property. Nevertheless, under some additional regularization conditions, the use of the plug-in lasso method just described ensures the following "near-oracle" property for β̂_PL:

\[
\left\|\hat{\beta}_{PL}-\beta\right\|_1=O_p\left(\sqrt{\frac{s^2\max(\log n,\log p)}{n}}\right),
\]

and hence the post-lasso estimates are consistent at a rate that differs from the oracle rate only up to the log factor max(log n, log p).

In practice, the plug-in lasso method mainly requires adding one additional step to the procedure used for the estimation of the PPML-lasso with high-dimensional fixed effects described before. Though the φ̂_k penalty terms are not known beforehand, they, too, can be iterated on in the same fashion as μ_ijt: simply use the most recent values of μ̂_ijt (obtained using post-lasso PPML) in each iteration to construct new values for φ̂_k. The method also requires an initial value for μ̂_ijt; for this, we first estimate a three-way gravity model with a single dummy for PTA using PPML.
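The penalty level itself is a one-line computation; evaluated here at the sample size and provision count of the application (n = 316,317 and p = 305):

    # Plug-in penalty level: lambda = 2c * sqrt(n) * qnorm(1 - gamma/(2p)),
    # with c = 1.1 and gamma = 0.1/log(n), as in the text.
    plugin_lambda <- function(n, p, c = 1.1) {
      gamma <- 0.1 / log(n)
      2 * c * sqrt(n) * qnorm(1 - gamma / (2 * p))
    }
    plugin_lambda(n = 316317, p = 305)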
More Details on Cross-Validation

As discussed in the main text, the idea behind cross-validation (CV) is to repeatedly hold out a subset of the sample during estimation and then use it to validate the resulting estimates. In our setup, rather than holding out observations in an unstructured way, we keep together all observations for which a given agreement is in effect, and hold out subsets of agreements. Doing so allows us to obtain estimates for all the fixed effects in the model.

To describe the implementation of CV, suppose that the observations associated with trade agreements are partitioned into G subsets. Each resulting hold-out sample g will have n_g observations, where n_g is the number of observations associated with agreements that are held out in partition g. Because our variables of interest are all dummies, a problem that may occur is that over some subsamples some regressors may not be present, but this is less likely to happen when G is large.

The CV approach sets all regressor-specific penalty weights φ̂_k equal to 1. Let β̂_{L,g}(λ) be the lasso estimator obtained via the minimization of (4) when holding out the n_g observations contained in partition g. Define the CV bandwidth as

\[
\lambda_{CV}=\arg\min_{\lambda}\ \frac{1}{G}\sum_{g=1}^{G}\frac{1}{n_g}\sum_{(i,j,t)\in g}\left[y_{ijt}-\exp\left(x_{ijt}'\hat{\beta}_{L,g}(\lambda)+\hat{\alpha}_{it,L,g}(\lambda)+\hat{\gamma}_{jt,L,g}(\lambda)+\hat{\eta}_{ij,L,g}(\lambda)\right)\right]^2.
\]

Since CV is based on the minimization of the average MSE over different subsamples, we expect it to deliver a much more lenient variable selection. There is some disagreement over whether dummy variables, such as the ones used in our application, should be standardized before applying the CV lasso. This consideration is in contrast to the plug-in lasso, since standardization of the covariates simply causes the φ̂_k terms to be re-scaled without otherwise affecting estimation in that case. We have computed CV lasso results with and without first standardizing and found that the results with standardization are noticeably more similar to the plug-in lasso results. Thus, our preference is to work with standardized dummy covariates.
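With glmnet, agreement-clustered cross-validation can be implemented through the foldid argument, which fixes the fold assignment; agreement_id is a hypothetical vector assigning each observation to its agreement:

    # Clustered cross-validation, a sketch: agreements, not observations,
    # are assigned to folds, so each agreement is held out as a block.
    library(glmnet)
    set.seed(4)
    G          <- 10
    agreements <- unique(agreement_id)
    fold_of    <- sample(rep(1:G, length.out = length(agreements)))
    foldid     <- fold_of[match(agreement_id, agreements)]
    cv <- cv.glmnet(X, y, family = "poisson", foldid = foldid)
    cv$lambda.min / sum(y)        # the scaled optimal tuning parameter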
References

Anderson, J. and E. Van Wincoop (2003). "Gravity with gravitas: A solution to the border puzzle," American Economic Review, 93, 170-192.
Baier, S.L. and J.H. Bergstrand (2007). "Do free trade agreements actually increase members' international trade?," Journal of International Economics, 71, 72-95.
Baier, S.L., J.H. Bergstrand, and M.W. Clance (2018). "Heterogeneous effects of economic integration agreements," Journal of Development Economics, 135, 587-608.
Baier, S.L., J.H. Bergstrand, and M. Feng (2014). "Economic integration agreements and the margins of international trade," Journal of International Economics, 93, 339-350.
Baier, S.L., Y.V. Yotov, and T. Zylkin (2019). "On the widely differing effects of free trade agreements: Lessons from twenty years of trade integration," Journal of International Economics, 116, 206-228.
Belloni, A., D. Chen, V. Chernozhukov, and C. Hansen (2012). "Sparse models and methods for optimal instruments with an application to eminent domain," Econometrica, 80, 2369-2429.
Belloni, A., V. Chernozhukov, C. Hansen, and D. Kozbur (2016). "Inference in high-dimensional panel models with an application to gun control," Journal of Business & Economic Statistics, 34, 590-605.
Correia, S., P. Guimarães, and T. Zylkin (2020). "Fast Poisson estimation with high-dimensional fixed effects," Stata Journal, 20, 90-115.
Dhingra, S., R. Freeman, and E. Mavroeidi (2018). "Beyond tariff reductions: What extra boost to trade from agreement provisions?," LSE Centre for Economic Performance Discussion Paper 1532.
Drukker, D.M. and D. Liu (2019). "A plug-in for Poisson lasso and a comparison of partialing-out Poisson estimators that use different methods for selecting the lasso tuning parameters," mimeo.
Falvey, R. and N. Foster-McGregor (2022). "The breadth of preferential trade agreements and the margins of exports," Review of World Economics, 158, 181-251.
Friedman, J., T. Hastie, and R. Tibshirani (2010). "Regularization paths for generalized linear models via coordinate descent," Journal of Statistical Software, 33, 1-22.
Gaure, S. (2013). "OLS with multiple high dimensional category variables," Computational Statistics & Data Analysis, 66, 8-18.
Gourieroux, C., A. Monfort, and A. Trognon (1984). "Pseudo maximum likelihood methods: Applications to Poisson models," Econometrica, 52, 701-720.
Hastie, T., R. Tibshirani, and J.H. Friedman (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York (NY): Springer.
Hofmann, C., A. Osnago, and M. Ruta (2017). "Horizontal depth: A new database on the content of preferential trade agreements," World Bank Policy Research Working Paper 7981.
Jing, B.Y., Q.M. Shao, and Q. Wang (2003). "Self-normalized Cramér-type large deviations for independent random variables," The Annals of Probability, 31, 2167-2215.
Kohl, T., S. Brakman, and H. Garretsen (2016). "Do trade agreements stimulate international trade differently? Evidence from 296 trade agreements," The World Economy, 39, 97-131.
Larch, M., J. Wanner, Y.V. Yotov, and T. Zylkin (2019). "Currency unions and trade: A PPML re-assessment with high-dimensional fixed effects," Oxford Bulletin of Economics and Statistics, 81, 487-510.
Lunn, A.D. and S.J. Davies (1998). "A note on generating correlated binary variables," Biometrika, 85, 487-490.
Mattoo, A., A. Mulabdic, and M. Ruta (2017). "Trade creation and trade diversion in deep agreements," World Bank Policy Research Working Paper 8206.
Mattoo, A., N. Rocha, and M. Ruta (2020). Handbook of Deep Trade Agreements. Washington, DC: World Bank.
Mulabdic, A., A. Osnago, and M. Ruta (2017). "Deep integration and UK-EU trade relations," World Bank Policy Research Working Paper 7947.
Mullainathan, S. and J. Spiess (2017). "Machine learning: An applied econometric approach," Journal of Economic Perspectives, 31, 87-106.
Prusa, T., R. Teh, and M. Zhu (2022). "PTAs and the incidence of antidumping disputes," available at https://tinyurl.com/PTA-PTZ-2022.
Regmi, N. and S. Baier (2020). "Using machine learning methods to capture heterogeneity in free trade agreements," mimeograph.
Santos Silva, J.M.C. and S. Tenreyro (2006). "The log of gravity," Review of Economics and Statistics, 88, 641-658.
Stammann, A. (2018). "Fast and feasible estimation of generalized linear models with high-dimensional k-way fixed effects," arXiv:1707.01815.
Tibshirani, R. (1996). "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B, 58, 267-288.
Wainwright, M.J. (2009). "Sharp thresholds for high-dimensional and noisy sparsity recovery using ℓ1-constrained quadratic programming (Lasso)," IEEE Transactions on Information Theory, 55, 2183-2202.
Weidner, M. and T. Zylkin (2021). "Bias and consistency in three-way gravity models," Journal of International Economics, 132, 103513.
Wüthrich, K. and Y. Zhu (2021). "Omitted variable bias of Lasso-based inference methods: A finite sample analysis," Review of Economics and Statistics, forthcoming.
Yotov, Y.V., R. Piermartini, J.-A. Monteiro, and M. Larch (2016). An Advanced Guide to Trade Policy Analysis: The Structural Gravity Model. Geneva: World Trade Organization.
Zhao, P. and B. Yu (2006). "On model selection consistency of lasso," Journal of Machine Learning Research, 7, 2541-2563.
Zou, H. (2006). "The adaptive lasso and its oracle properties," Journal of the American Statistical Association, 101, 1418-1429.
Zou, H. and T. Hastie (2005). "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, Series B, 67, 301-320.