44278

World Bank's ENTERPRISE SURVEY

UNDERSTANDING THE SAMPLING METHODOLOGY

January 15, 2007

available at: www.enterprisesurveys.org

Introduction

The World Bank's Enterprise Surveys (ES) collect data from key manufacturing and service sectors in every region of the world. The Surveys use standardized survey instruments and a uniform sampling methodology to minimize measurement error and to yield data that are comparable across the world's economies.

Most importantly, the Enterprise Surveys are designed to provide panel data sets. Because panel data are one of the best ways to pinpoint which changes in the business environment affect firm-level productivity and job creation, and how they do so over time and across countries, the Enterprise Survey team has made panel data a top priority. The use of properly designed survey instruments and a uniform sampling methodology enhances the credibility of World Bank analysis and of the recommendations that stem from this analysis.

The World Bank's Enterprise Survey aims to achieve the following objectives:

· provide statistically significant investment climate indicators that are comparable across countries;
· assess the constraints to private sector growth and job creation;
· build a panel of firm-level data that will make it possible to track changes in the business environment over time, thus allowing impact assessments of reforms; and
· stimulate dialogue on reform opportunities.

This note provides information to implementing contractors and researchers on the sampling methodology. Two complementary documents, the Implementation Note and the Questionnaire Note, complete the documentation. The Implementation Note is geared to survey field managers, field supervisors and enumerators. The Questionnaire Note provides a detailed explanation of the questions contained in the questionnaire.

2. What is in an Enterprise Survey questionnaire

To generate internationally comparable data, the questions in the Core questionnaire are asked in all countries and for all industries where the survey is implemented. In addition to this Core instrument, the Manufacturing Module and Services Module questions are asked of establishments in the manufacturing and services sectors, respectively. Attachments A, B and C contain the Core, Core plus Manufacturing Module, and Core plus Services Module, respectively. Attachment D is the Screener Questionnaire. The Screener Questionnaire is used to screen out establishments that do not fit the sampling criteria and should not be surveyed.

The Core instrument comprises eleven (11) sections. The first group deals with the characteristics of the business and the investment climate in which it operates, including:

· Section A - Control Information
· Section B - General information: ownership, start-up.
· Section C - Infrastructure and Services: power, water, transport, and communication technologies.
· Section D - Sales and Supplies: imports, exports, supply and demand conditions.
· Section E - Degree of Competition: price and supply changes, competitors.
· Section G - Land: land ownership, land access issues.
· Section I - Crime: extent and losses due to crime.
· Section J - Business-Government Relations: quality of public services, consistency of policy, regulatory compliance costs (management time, bribes); and
· Section M - Investment Climate Constraints: evaluation of general obstacles.

These eight sections contain qualitative questions, asking for the manager's opinion on the business environment and for the motivation behind business decisions. Section F (Capacity: use of production capacity, hours of operation) is included only in the Manufacturing Module.

The last three sections of the questionnaire deal with facts and figures specific to the transactions businesses make in order to operate.
More specifically, these sections contain questions on production costs, investment flows, balance sheet information and workforce statistics. These sections include:

· Section K - Finance: sources of finance, terms of finance, financial services.
· Section L - Labor: worker skills training, skill availability, employment, education levels of workers; and
· Section N - Productivity: numbers and figures needed to estimate productivity.

3. SAMPLING METHODOLOGY

The sampling methodology of the World Bank's Enterprise Survey generates sample sizes appropriate to achieve two main objectives: first, to benchmark the investment climate of individual economies across the world and, second, to conduct firm performance analyses focusing on determining how investment climate constraints affect productivity and job creation in selected sectors. To achieve both objectives the sampling methodology:

· generates a sample representative of the whole economy that substantiates assertions about the whole economy, not only about the manufacturing sector. The overall sample should include, in addition to selected manufacturing industries, services industries and other relevant sectors of the economy; and
· generates large enough sample sizes for selected industries to conduct statistically robust analyses with a precision of at least 7.5% for 90% confidence intervals about: [1]
   i. estimates of population proportions (percentages), at the industry level; and
   ii. estimates of the mean of log of sales at the industry level.

3.1 Stratification

Since the overall sample reflects the composition of the economy, the population of industries to be included in the Enterprise Surveys comprises the following (according to ISIC, revision 3.1): all manufacturing sectors (group D), construction (group F), services (groups G and H), and transport, storage, and communications (group I).
Also, to limit the surveys to the formal economy, the sample frame for each country should include only establishments with five (5) or more employees. (A separate module for micro-enterprises complements the main survey in countries with large informal economies.)

The universe of industries is stratified into several manufacturing industries, one or two services industries, and a residual. The number of manufacturing industries to be defined as individual strata in each country will be chosen according to the Gross National Income (GNI) level of each country. The number of strata by type of country will be:

Size         GNI as of 2005     # of manuf.   # of services   Rest of the   Total sample
                                industries    industries      economy       size
Small        $5-25 billion      2             1               1             480
Medium       $25-80 billion     3             1               1             600
Large        $80-200 billion    4             2               1             840
Very large   >$200 billion      6             2               1             1080

[1] A 7.5% precision of an estimate in a 90% confidence interval means that we can guarantee that the population parameter is within the 7.5% range of the observed sample parameter, except in 10% of the cases.

One representative services industry will also be sampled for all countries: retail trade (ISIC 52). For large and very large economies the second services industry will be wholesale trade (ISIC 51). The rest of the economy will be grouped together in a residual stratum (including other manufacturing, other services, construction, and other sectors). Including the residual stratum guarantees that, provided that individual observations are properly weighted, inferences can be made for the whole economy.

To keep comparability with previous surveys and across countries, two (2) manufacturing industries will be selected in all countries: manufacture of food products and beverages (ISIC 15), and manufacture of wearing apparel and fur (ISIC 18).
Additional industries are chosen at the two-digit ISIC level depending on the characteristics of the economy as summarized in three (3) variables: contribution to value added, employment, and number of firms. The final choice of industries will be made on a country-by-country basis, trying to keep similar industries across countries to facilitate cross-country comparability.

In small economies, there may not be enough establishments to stratify at the two-digit level. In that case, a sample of 240 firms randomly chosen from the whole manufacturing sector will be selected, plus 120 retail trade firms, plus 120 cases for the rest of the economy; the rest of the economy will include construction, other services, hotels and restaurants, transport, storage and communications. In some cases, it may be necessary to interview the whole population.

4. SAMPLE SIZE

When determining the minimum sample size for each stratum, the size required for proportions may differ from the size required for the mean of log sales. Choosing the maximum of the two guarantees that both requirements are met.

The following table exhibits minimum sample sizes for different population sizes for estimates of proportions with 5% and 7.5% precision in 90% confidence intervals (assuming maximum variance). [2]

[2] $n = \left[ \frac{1}{N} + \frac{N-1}{N} \cdot \frac{1}{PQ} \left( \frac{k}{z_{1-\alpha/2}} \right)^{2} \right]^{-1}$, where $N$ = population size, $P$ = population proportion, $Q = 1 - P$, $k$ = desired level of precision, and $z_{1-\alpha/2}$ is the value of the standard normal coordinate for a desired level of confidence, $1-\alpha$.
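As a rough illustration, the formula in footnote 2 can be evaluated with a few lines of Python. This is a minimal sketch, not the survey team's own code: the function name `sample_size_proportion` and its parameter names are assumptions made for this example, P = 0.5 stands for the maximum-variance case, and rounding to the nearest integer is an assumption that appears to match the entries of Table 1.

```python
from statistics import NormalDist


def sample_size_proportion(N, k, confidence=0.90, P=0.5):
    """Minimum sample size for a proportion in a population of size N.

    k is the desired precision (0.075 for 7.5%). P = 0.5 is the
    maximum-variance assumption. Names are illustrative only.
    """
    alpha = 1.0 - confidence
    # Standard normal coordinate z_{1-alpha/2}; about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    Q = 1.0 - P
    # Footnote 2: n = [1/N + ((N-1)/N) * (1/(P*Q)) * (k/z)^2]^(-1)
    inverse_n = 1.0 / N + ((N - 1) / N) * (1.0 / (P * Q)) * (k / z) ** 2
    return 1.0 / inverse_n


if __name__ == "__main__":
    # Print population size, then sample sizes at 5% and 7.5% precision.
    for N in (50, 1000, 100000):
        n_5 = round(sample_size_proportion(N, 0.05))
        n_75 = round(sample_size_proportion(N, 0.075))
        print(N, n_5, n_75)
```

Because the 1/N term vanishes as N grows, the result levels off at roughly 270 (5% precision) and 120 (7.5% precision) for large populations.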
Table 1: Sample Sizes Required with 5% and 7.5% Precision and 90% Confidence

Population size    Sample size (5%)    Sample size (7.5%)
50                 42                  36
100                73                  55
200                115                 75
300                143                 86
400                162                 93
500                176                 97
600                187                 100
700                195                 103
800                202                 105
900                208                 106
1000               213                 107
1250               223                 110
1500               229                 111
1750               234                 113
2000               238                 113
2500               244                 115
3000               248                 116
5000               257                 117
10000              263                 119
50000              269                 120
100000             270                 120

With the 5% precision, the minimum sample size tends to 270 as population size increases; with 7.5% precision the sample size tends to 120. Note that if the population size of an industry falls below 1,500, the required sample size for proportions may be reduced considerably (see Figures 1 and 2). Though a 5% precision would be most desirable, a precision of 7.5% is in line with current budget constraints.

Figure 1: Optimal sample size, 5% precision, 90% confidence interval (sample size plotted against population size)

Figure 2: Optimal sample size, 7.5% precision, 90% confidence interval (sample size plotted against population size)

To determine the minimum sample size required for a given level of precision about estimates of the mean of a chosen variable, such as log of sales, it is necessary to have an estimate of the variance for the variable in question. Where a survey has been implemented in the past, this information (the variance of sales) can be obtained from historical data. Since most enterprise surveys have focused on the most important manufacturing industries, there should be enough observations to obtain an accurate estimate. Because sales have a skewed distribution, the required sample size for inferences about its mean is typically too large. [3] However, it is standard practice to work with sales in log form, which takes away its large variability.
By transforming the sales variable, the minimum sample sizes are typically below those required for assertions about proportions. The following table illustrates the sample sizes required for 3 different examples of countries, large, medium, and small, using actual numbers of firms in manufacturing for three (3) representative countries (Poland, Romania, and Moldova, respectively) and estimates of the number of firms in services and in the rest of the economy.

                                                                        Population   Sample size     Sample size
                                                                                     (Prop. 7.5%)    (Mean 5%)
Large economy (4 manuf. industries)
15 Manufacture of food products and beverages                           31,212       120             52
18 Manufacture of other wearing apparel and accessories                 40,017       120             73
17 Manufacture of textiles                                              11,200       119             54
29 Manufacture of machinery and equipment                               21,309       120             91
Chosen services sector: retail and wholesale                            245,644      120             107
Chosen services sector: IT                                              5,849        118             104
All others (other manufacturing, other services, construction, etc.)    355,231      120             107
Total                                                                                837             589

Medium economy (3 manufacturing industries)
15 Manufacture of food products and beverages                           12,061       119             71
18 Manufacture of other wearing apparel and accessories                 5,251        118             62
17 Manufacture of textiles                                              3,417        116             85
Chosen services sector: retail and wholesale                            24,408       120             84
All others (other manufacturing, other services, construction, etc.)    46,623       120             101
Total                                                                   91,760       593             403

Small economy (3 manufacturing industries)
15 Manufacture of food products and beverages                           851          105             95
18 Manufacture of other wearing apparel and accessories                 134          64              70
17 Manufacture of textiles                                              73           46              43
Chosen services sector: retail and wholesale                            1,364        111             80
All others (other manufacturing, other services, construction, etc.)    1,759        113             124
Total                                                                   4,181        438             411

As the table shows, working with the minimum sample size required for proportions with a 7.5% precision, in general guarantees the minimum sample size required for inferences about the mean of log sales with a more demanding level of precision of 5%.
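The sample size for the mean of a variable (the formula given in footnote 3) can be sketched in Python as follows. This is only an illustrative sketch: the function name `sample_size_mean`, its parameter names, and the coefficient of variation of 0.3 used in the example are assumptions made here, not values taken from the survey data. Here k is read as precision relative to the mean.

```python
from statistics import NormalDist


def sample_size_mean(N, cv, k, confidence=0.90):
    """Minimum sample size for the mean of a variable y in a population of size N.

    cv is the coefficient of variation of y and k the desired relative
    precision (0.05 for 5%). Names are illustrative only.
    """
    alpha = 1.0 - confidence
    # Standard normal coordinate z_{1-alpha/2}; about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # Footnote 3: n = [1/N + (k / (z * CV_y))^2]^(-1)
    inverse_n = 1.0 / N + (k / (z * cv)) ** 2
    return 1.0 / inverse_n


if __name__ == "__main__":
    # Hypothetical industry: 10,000 firms, CV of log sales assumed to be 0.3.
    print(round(sample_size_mean(10000, cv=0.3, k=0.05)))
```

The required size grows with the square of the coefficient of variation, which is why the less variable log of sales needs far fewer observations than sales in levels.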
This result (that the sample size chosen for proportions also suffices for the mean of log sales) depends on the value of the coefficient of variation being less than 0.5. According to the existing survey information, the coefficient of variation of log of sales is typically below 0.5 for all industries in all countries.

Note that by following this procedure, the overall level of precision for the whole economy is likely to be much better than 7.5%. For example, with overall sample sizes of 837, 593 and 438, the overall level of precision on inferences about proportions for the whole economy, provided that observations are properly weighted, is 2.84%, 3.37% and 3.71%, respectively.

[3] $n = \left[ \frac{1}{N} + \left( \frac{k}{z_{1-\alpha/2}\, CV_y} \right)^{2} \right]^{-1}$, where $CV_y$ is the coefficient of variation of variable $y$.

In sum, whenever previous data are available, the information on the log of observed sales can be used to check that the minimum sample size required for assertions about both proportions and the mean of sales is indirectly met when choosing the sample size required for proportions. When there are no previous data, these results suggest that working with the sample size defined by the formula for proportions provides a safe alternative for assertions about the mean of sales.

4.1 Additional Levels of Stratification and Sample Selection

Including other dimensions of interest, such as location or size, adds another dimension to the sampling strategy. If the second dimension is included only to guarantee variability, a sample of firms chosen randomly from a sample frame that covers the whole economy will resemble the distribution of the population. For example, if location is the second dimension, making sure that the sample frame includes all desired locations will be enough to meet this requirement.

In the global enterprise surveys a required second level of stratification is firm size, defined as: small (between 5 and 19 employees), medium (between 20 and 99 employees), and large (100 and more employees). Regional variability will be considered by including the main industrial areas in a country.
Only for very large economies will the sample be stratified at the regional level, to the extent that statistically significant estimates can be made at the regional level.

Adding a second dimension of stratification requires that sample sizes be distributed along the second dimension in order to achieve the desired minimum level of precision. This must be done without compromising the minimum sample sizes required for the first dimension of stratification. A good starting point for such a distribution is a proportional allocation of the optimal sample size for a first-level stratum across all second-level strata. Adjustments to the proportional allocation are then needed to reach the required level of precision for each stratum.

An example of how to distribute the sample for a large economy is included below (Table A1). This example also includes the computation of the weights, which are indispensable for making assertions about the whole population.

4.2 Non-Response, Panel Data and Attrition

A potential problem of the World Bank's Enterprise Surveys is that in the majority of cases the resulting data sets represent only firms that were willing to participate in the survey. Firms' systematic refusal to participate may compromise the random nature of the sample. In most cases, this problem has been tackled by substituting with willing participants. Regardless of the solution undertaken, it is important to determine the non-response rate from the overall population and to distinguish it from substitutions emerging from problems of the sample frame, such as firms with unknown location and/or firms that have gone out of business. For this reason, it is crucial to prepare a field-work report containing the following information:

Stratum                                                              Non-Response   Out of scope
Industry   Size     Target    Substitutes   Complete   Incomplete    Refusals       Wrong or changed   Out of business/
                    Sample                                                          classification     impossible to locate
Ind. 1     Small
           Medium
           Large
Ind. 2     Small
           Medium
           Large

This report is essential not only to record response rates per stratum but also to identify problems with the sample frame in order to adjust the design weights.

An additional objective of the global roll-out is to build a panel of firms by re-interviewing them at regular intervals. Every region will be surveyed every 3 years to achieve this objective. For this reason, it is imperative that every implementing firm submit all the contact information of the participating firms to facilitate their location in future iterations of the survey. This information is kept by the World Bank and not by the implementing firm. If legal restrictions or internal bylaws require measures to guarantee the confidentiality of firms' identities, names and addresses can be kept separately from the main data set.

For surveys beyond the first iteration, attrition becomes a major concern. This problem compounds the non-response bias present in most enterprise surveys. It is important to allocate resources to minimize attrition. It is also important to identify it and differentiate it from non-response emerging from firms going out of business. Observationally, both manifest themselves as non-response. In reality, one reflects a structural characteristic of the economy (firms dropping out of the market), while the other reflects a potentially endogenous reaction by firms' managers: it could be that less productive firms systematically reject the survey, that firms more affected by negative features of the investment climate refuse to participate, or that refusals are the result of previous experience with the survey. Econometric techniques make it possible to test and correct for this potentially endogenous attrition.

Attrition may seriously compromise sample sizes per industry/size stratum. Consequently, substitutions may be needed following the original sample design to reach the target sample size per stratum.
Substitutions by stratum attempt to reconstruct the original sample design of the survey at the first iteration. If in later iterations a representative sample of the new characteristics of the economy is desired, this will add an additional layer of analysis to the sample design. The structure of an economy changes over time and the relative distribution of different sectors may vary drastically. Potential solutions are using a rotating panel or a split panel. A split panel combines rotating and non-rotating panels. Both solutions, however, require additional resources to preserve the benefits obtained from pure panel data.

Table A1: Sample sizes to reach 7.5% precision by sector and location

LARGE ECONOMY: EX. POLAND

Population
                                                                        N          Mazowieckie   Śląskie      Łódzkie   Other
                                                                                   (Warsaw)      (Katowice)             locations
4 chosen manufacturing industries
15 Manufacture of food products and beverages                           31,212     4,620         4,558        2,784     19,250
18 Manufacture of other wearing apparel and accessories                 40,017     6,188         4,879        8,923     20,027
17 Manufacture of textiles                                              11,200     1,284         3,469        3,469     2,978
29 Manufacture of machinery and equipment                               21,309     3,312         1,390        1,173     15,434
Chosen services sector: retail and wholesale                            245,644    38,180        16,024       13,522    177,919
Chosen services sector: IT                                              5,849      909           382          322       4,236
All others (other manufacturing, other services, construction, etc.)    375,852    58,418        24,517       20,690    272,228
TOTAL                                                                   731,083    112,911       55,218       50,883    512,072

Proportional Distribution
                                                                        Mazowieckie   Śląskie      Łódzkie   Other       Total
                                                                        (Warsaw)      (Katowice)             locations
15 Manufacture of food products and beverages                           18            17           11        74          120
18 Manufacture of other wearing apparel and accessories                 19            15           27        60          120
17 Manufacture of textiles                                              14            37           37        32          119
29 Manufacture of machinery and equipment                               19            8            7         87          120
Chosen services sector: retail and wholesale                            19            8            7         87          120
Chosen services sector: IT                                              18            8            6         85          118
All others (other manufacturing, other services, construction, etc.)    19            8            7         87          120
TOTAL                                                                   124           100          101       512         837
Actual precision (second level of stratification)                       7.4%          8.2%         8.2%      3.6%

Modified Distribution
                                                                        Mazowieckie   Śląskie      Łódzkie   Other       Total
                                                                        (Warsaw)      (Katowice)             locations
15 Manufacture of food products and beverages                           18            17           11        74          120
18 Manufacture of other wearing apparel and accessories                 19            15           27        60          120
17 Manufacture of textiles                                              14            37           37        32          119
29 Manufacture of machinery and equipment                               19            15           15        72          121
Chosen services sector: retail and wholesale                            19            15           15        72          121
Chosen services sector: IT                                              18            15           15        69          117
All others (other manufacturing, other services, construction, etc.)    19            15           15        72          121
TOTAL                                                                   124           129          134       451         838
Required sample for precision 7.5%                                      120           120          120       120
Actual precision (second level of stratification)                       7.4%          7.2%         7.1%      3.9%

Weights (for the population and modified distribution shown above)
                                                                        Mazowieckie   Śląskie      Łódzkie   Other
                                                                        (Warsaw)      (Katowice)             locations
15 Manufacture of food products and beverages                           261           261          261       261
18 Manufacture of other wearing apparel and accessories                 334           334          334       334
17 Manufacture of textiles                                              94            94           94        94
29 Manufacture of machinery and equipment                               178           93           78        214
Chosen services sector: retail and wholesale                            2,043         1,068        901       2,471
Chosen services sector: IT                                              50            25           21        61
All others (other manufacturing, other services, construction, etc.)    3,126         1,634        1,379     3,781
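The mechanics behind Table A1 can be sketched in Python under a few simplifying assumptions. The function names are made up for this illustration. Each cell of the proportional allocation is simply rounded to the nearest integer. The precision is computed for a proportion with maximum variance and ignores the finite-population correction, since the location populations are large. The weight is taken as the plain ratio of population to sample in a cell. The table's own entries may include further adjustments, so this sketch is not expected to reproduce every cell.

```python
from statistics import NormalDist

# Standard normal coordinate for a two-sided 90% confidence interval (about 1.645).
Z_90 = NormalDist().inv_cdf(0.95)


def proportional_allocation(populations, n_total):
    """Split n_total across locations in proportion to population.

    Each cell is rounded separately, so the cells need not sum exactly to n_total.
    """
    N = sum(populations)
    return [round(n_total * N_j / N) for N_j in populations]


def precision(n, P=0.5, z=Z_90):
    """Precision (half-width) for a proportion with sample size n, large population."""
    return z * (P * (1.0 - P) / n) ** 0.5


def design_weight(N_cell, n_cell):
    """Number of population firms represented by each sampled firm in a cell."""
    return N_cell / n_cell


if __name__ == "__main__":
    # ISIC 17 (textiles) population by location, taken from Table A1.
    textiles = [1284, 3469, 3469, 2978]
    print(proportional_allocation(textiles, 119))

    # Precision implied by the location totals of the proportional distribution.
    for n in (124, 100, 101, 512):
        print(n, f"{100 * precision(n):.1f}%")

    # Weight for retail and wholesale in Śląskie under the modified distribution.
    print(round(design_weight(16024, 15)))
```

Raising the two smaller locations from about 100 sampled firms to 129 and 134 is what brings their precision from roughly 8.2% to under the 7.5% target in the modified distribution.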