Conducting Surveys and Interventions Entirely Online: A Virtual Lab Practitioner's Manual

December 2020

Nandan Rao (UAB, BGSE), Dante Donati (UPF, BGSE, IPEG), Victor Orozco (DIME, The World Bank)[1]

[1] This research is part of the entertainment-education program of the World Bank's Development Impact Evaluation Department (DIME). The views expressed herein are those of the authors and do not necessarily reflect the views of the World Bank.

Contents

Executive Summary
I. Introduction
II. Reference Studies
  PFI - Using Social Media to Change Gender Norms
  ITALY - Stereotypes and Political Attitudes in the Age of Coronavirus
  MNM - Targeting and Evaluating Malaria Campaigns Using Social Media
III. Recruiting with Online Ads
  Non-probability Sampling and Poststratification
  Integrating Recruitment and Surveying
IV. Asking Questions via Chatbot
V. Virtual Lab Study Archetypes
VI. Steps for Designing a Virtual Lab Study
VII. Lessons From Previous Virtual Lab Studies
VIII. Future Research
Bibliography

Executive Summary

Online and social media campaigns reach billions of people every day. While commercial companies have built up extensive expertise in using these tools to recruit and build relationships with customers, researchers and policymakers have been slower to take full advantage of these new digital tools. The growth of mobile access in developing countries is opening up many opportunities to use social media tools to pursue development objectives: from surveys and impact evaluations to digital delivery of behavior-change campaigns.

Virtual Lab is an open-source set of tools developed by the World Bank's Development Impact Evaluation Department (DIME) for online recruitment, intervention, and surveying via digital advertising and social-media platforms. It was built specifically for researchers and policymakers. It can be used to perform impact evaluations, inform the targeting of large-scale digital ad campaigns, collect and visualize cross-sectional survey data in real time, or build high-frequency longitudinal panel surveys.

An online study can only work if it performs inference on groups that can at least be found online, even if they are underrepresented, and only if such inference can be performed entirely from those individuals who are found online, even if they are hard to find. If this is the case, then collecting data via digital advertising and social-media platforms can provide a set of advantages:

1. Costs are naturally low and all solutions are trivially scalable, even beyond country borders.
2. Data is available in real time, for policymaking or to adjust the study design.
3. Recruiting a sample that is representative across a fixed set of measurable variables can be performed efficiently with targeted advertising.
4. If the intervention of interest is itself delivered online (e.g., an online advertising campaign), then performing an impact evaluation on the same platform can be extremely efficient.
5. Targeted advertising allows the recruitment of entirely bespoke and potentially hard-to-reach subpopulations that may be of interest to policymakers or researchers.

This manual provides an introduction to the survey theory underpinning Virtual Lab, lays out detailed guidelines for designing online studies within the platform, shares cost data and lessons learned from recent trials, and lays out promising areas for future research. Virtual Lab can be self-hosted on private or public cloud infrastructure. All code is open source.
Further details about this platform, including the code to independently run it, can be found at https://vlab.digital/.

I. Introduction

Online and social media campaigns reach billions of people every day. They do so through global ad networks, such as those run by Google and Facebook, that provide increasingly efficient tools to target and engage a significant portion of internet users. While commercial companies have built up extensive expertise in using these tools to recruit and build relationships with customers, researchers are just now starting to use these new digital tools (Grow et al. 2020). With smartphone ownership and internet access rates rapidly growing in developing countries, the reach of global ad networks is widespread in these countries. In India, for example, more than 35% of the population regularly accessed the internet through a smartphone and 70% of young people between 18 and 34 years old used Facebook in 2018 (Statista 2020). This opens up many opportunities to use social media tools to pursue development objectives: from surveys and impact evaluations to digital delivery of behavior-change campaigns and development programs (e.g., training of frontline workers).

Developed by the entertainment-education program of the World Bank's Development Impact Evaluation Department (DIME), Virtual Lab is an open-source set of tools for online recruitment, intervention, and surveying via digital advertising and social-media platforms. It was built specifically for researchers and policymakers. It can be used to perform impact evaluations, inform the targeting of large-scale digital ad campaigns, collect and visualize cross-sectional survey data in real time, or build high-frequency longitudinal panel surveys. It should be noted that an online study can only work if it performs inference on groups that can at least be found online, even if they are underrepresented, and only if such inference can be performed entirely from those individuals who are found online, even if they are hard to find.[2]

[2] Collecting data online has one natural and strong disadvantage compared to traditional methods: not everyone is online. If a study must include individuals who do not have access to the internet, or must perform inference on groups entirely absent from the internet, then it cannot be performed online.

Gathering survey data, whether as part of an experiment or opinion polling, necessarily involves two distinct steps:

1. Recruiting respondents.
2. Asking questions.

Virtual Lab is an integrated platform for performing both steps. Recruiting respondents is performed via targeted digital advertising. The novelty of the platform, and the idea more generally, rests in using the segmentation capabilities and ad-placement APIs of advertising platforms to stratify and optimize recruitment for statistically efficient analysis.[3] By integrating survey responses into the recruiting platform, ad placement can be optimized based on the answers coming in from respondents in real time. For example, if a particular subpopulation (stratum) is underrepresented or has higher attrition or nonresponse, then the budget for ads targeting that subpopulation can be automatically increased to ensure proper representation from that subgroup in the final sample.

[3] An API, or application programming interface, is a feature that allows one software application to exchange data with or control another software application, in this case over the internet.
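To make this adaptive loop concrete, the sketch below shows the kind of logic involved. It is a minimal illustration only, not Virtual Lab's actual optimization engine: it takes per-stratum targets and the completed-survey counts observed so far and splits a fixed daily ad budget in proportion to each stratum's remaining shortfall. The resulting budgets would then be pushed to the ad platform via its marketing API.

```python
# Minimal sketch: reallocate daily ad budget toward strata that are falling behind.

def allocate_budgets(targets, completes, total_daily_budget):
    shortfalls = {s: max(targets[s] - completes.get(s, 0), 0) for s in targets}
    total_short = sum(shortfalls.values())
    if total_short == 0:
        return {s: 0.0 for s in targets}          # all targets met: stop spending
    return {s: total_daily_budget * short / total_short
            for s, short in shortfalls.items()}

# Example: women are lagging behind, so they receive most of tomorrow's budget.
targets   = {"men_18_24": 500, "women_18_24": 500}
completes = {"men_18_24": 420, "women_18_24": 180}
print(allocate_budgets(targets, completes, total_daily_budget=100.0))
# {'men_18_24': 20.0, 'women_18_24': 80.0}
```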
While most survey platforms could theoretically be used to ask questions, Virtual Lab provides its own chatbot engine designed especially for longitudinal panel studies and information interventions via messenger apps (currently Facebook Messenger). The chatbot integrates payment flows to send incentives to study participants and includes a number of methods for delivering media-based interventions, such as via videos or links to external websites. Chatbot surveys have a couple of natural advantages over other online surveys: (1) respondents already know and use the messenger applications through which the questions are asked and (2) follow-up questions are sent in the same chat thread, with users receiving push notifications on their phone, reducing the friction for longitudinal studies.

The primary purpose of this researcher's manual is to introduce potential users to Virtual Lab and help them design studies that take advantage of it. We hope, however, that the ideas included here, and the design of this platform more generally, are useful to the research community at large when considering digital study designs and can inform the future development of similar tools.

Section III of this manual makes two motivational arguments for online recruitment of respondents. First, that a non-representative sampling frame combined with targeted digital advertising has the potential to be as accurate and efficient as traditional "representative" sampling techniques. In particular, this potential is especially strong whenever a large portion of study invitees refuse to participate or drop out early. Second, that integrating recruitment and data collection in a single platform can open up new and significant opportunities for improved estimation efficiency by adjusting recruitment dynamically based on survey responses. For example, one could learn which variables should be used to stratify at the same time as one is collecting the data.

Section IV of this manual discusses chatbot surveying. This is a valuable survey mode in a world of mobile phones, with unique strengths compared to traditional web-based surveys. It is especially adapted to sending simple and frequent follow-up messages that can be used for constructing panel surveys or experience sampling with repeated observations.

The rest of the manual is devoted to helping researchers build a study online, using Virtual Lab. We suggest some study designs that are a natural fit for the platform (Section V) and then provide a step-by-step guide for designing a study (Section VI). Following that, we share some results, costs, and numbers from previous studies along with a set of mistakes made and lessons learned (Section VII). Finally, we lay out (plentiful) avenues for future research in this space (Section VIII).

Open-source code ensures full transparency and auditing of security practices. All code is open source and can be self-hosted on any public or private cloud. All public network communication is encrypted by default with TLS[4] and data at rest can be encrypted at the block-storage layer provided by the cloud infrastructure. For inquiries and further details about this open-source platform, including the code to independently run it, please visit https://vlab.digital/.

[4] The Transport Layer Security protocol provides the encrypted communication that secures all websites accessed via HTTPS. This encryption ensures that data moving over the internet cannot be read, even if it is intercepted.
II. Reference Studies

Throughout this manual, we will refer to three initial studies that the research team has completed with Virtual Lab. While these studies do not take advantage of all the possibilities or use-cases we lay out in this manual, they provide concrete examples of some features. The studies will be referred to as "PFI," "ITALY," and "MNM" in the rest of this manual. All studies will be publicly available by early 2021.

PFI - Using Social Media to Change Gender Norms

A collaboration with the Population Foundation of India (PFI), this study is an online randomized control trial (RCT) testing whether two low-dosage 25-minute edutainment web series delivered through Facebook Messenger are effective at changing gender norms and promoting positive behaviors towards violence against women (VAW) in northern India. The team recruited 18-to-24-year-old youths living in New Delhi and six other cities and randomly assigned them to the two treatment conditions: an entertainment drama web series and a documentary web series. Recruitment was performed on Facebook with a one-week campaign stratified by gender. Approximate gender balance between men and women was achieved at recruitment costs three times higher for women. As participation incentives, individuals who filled in the baseline and at least one follow-up survey were eligible for a raffle to win Samsung Galaxy smartphones or a "selfie" picture with a Bollywood celebrity.

Clicking on the ad directed respondents to Facebook Messenger, where the survey was administered by the Virtual Lab chatbot. After filling in a baseline survey on demographic characteristics and knowledge, attitudes, and beliefs regarding gender norms and violence against women, respondents were sent a series of short videos directly within the chat. Subsequent videos were sent two hours after respondents had watched the previous one. Follow-up surveys were sent one week after respondents finished all the videos and again three months after they finished the videos. In follow-up surveys, in addition to questions to gauge attitudes, knowledge, and beliefs about gender norms and gender-based violence, users received two calls-to-action: 1) an invitation to visit the websites of various NGOs working on issues relating to gender and 2) an invitation to add a "frame" to their Facebook profile with the phrase "end violence against women." Both the videos and the links used Virtual Lab's video-hosting and link-sharing integrations, which collect data on A) whether users clicked the link and B) whether users clicked "play" to actually watch the video. This allowed researchers to track compliance and estimate the treatment effect on the treated.

ITALY - Stereotypes and Political Attitudes in the Age of Coronavirus

This project consists of a panel survey conducted in Italy to study how beliefs about the infectiousness of people from different countries and political attitudes towards China and the EU evolve in response to local exposure to COVID-19. Measures of beliefs and attitudes were collected over six waves during the spread of the pandemic, beginning in late February 2020. Respondents were recruited via Facebook ads targeted to a set of regions in middle and southern Italy which, initially, had little-to-no exposure to the virus. Ads were not stratified and all recruiting finished within a week. Respondents self-reported the postal codes of their homes, which were combined with separate, provincial-level data on the spread of the virus.
This was compared to self-reported knowledge about cases in their community. In each wave, respondents were asked to estimate the spread of the disease among different national groups ("Consider all the people in the world of Chinese nationality; according to you, how many out of 1 million are infected by Coronavirus?"). While individual estimates naturally varied widely, the research team was able to compare the repeated estimates from the same individual over time to track how estimates for different nationalities changed throughout the evolution of the global pandemic.

The team also conducted an information experiment that exposed a random subset of respondents to information about cooperation between EU countries in response to the COVID-19 virus. A set of images, with text containing facts about support Italy had received from specific EU countries during the pandemic, were sent within the Virtual Lab chatbot to two treatment groups. One group received information about Germany and Austria, while the other received information about France and the Czech Republic. The control group received no images. Two months after the intervention, respondents were asked their opinion on the solidarity that those specific countries had shown to other members of the EU during the last 20 years.

MNM - Targeting and Evaluating Malaria Campaigns Using Social Media

This study is a cluster RCT in India measuring community-level impacts of a social media campaign on malaria incidence, bednet usage, and treatment-seeking behavior. Eighty districts in three north Indian states were randomly assigned to treatment and control conditions. Treatment districts received three months of a Facebook ad campaign designed to raise awareness of malaria as well as improve both preventative and treatment-seeking behaviors. The campaign was specifically designed for engagement and social sharing of Facebook posts. Facebook users living in control districts were excluded from the campaign's ads. The social media campaign was designed and run by Malaria No More, an international NGO aiming to eliminate deaths from malaria, in association with Upswell, a consulting firm specializing in running social media campaigns for social good. The campaign, as well as the impact evaluation, was funded by Facebook's Campaigns for a Healthier World initiative.

Virtual Lab was used to recruit a panel of respondents and interview them via chatbot on Facebook Messenger. Respondents were sent messages every two weeks in which they were asked to report incidences of fever or malaria as well as their daily behavior regarding the use of bednets or other preventative tools. They were incentivized with two tranches of mobile credit top-ups: one third of the total credit was sent after the third follow-up and the rest after the endline survey. The mobile credit delivery was handled within the chatbot by Virtual Lab's Reloadly payment integration.

A panel of survey respondents was recruited via Facebook ads. An initial sample of 250 respondents per cluster was recruited, where clusters were defined by administrative "districts." Clusters varied widely in per-respondent acquisition cost. Virtual Lab was used to generate separate ad sets for each cluster and continuously optimize the daily ad spend to efficiently recruit respondents from all clusters (lowering spend in "cheap" clusters made them even cheaper; increasing spend in "expensive" clusters ensured quicker recruitment of those respondents).
Recruitment was also adaptively optimized by Virtual Lab to increase the representation of individuals living in kutcha (mud, tin, and/or straw) dwellings. These respondents reported a higher incidence of malaria than other groups and were additionally under-represented when following a naive recruitment strategy. With optimization, the number of districts with more than 20% of respondents living in kutcha dwellings increased from 29 to 65, while the number with less than 10% decreased from 13 to 0. Cost-per-respondent increased under this optimization, leading to an estimated ~20% increase in total recruitment costs compared to the naive recruitment strategy.

III. Recruiting with Online Ads

Non-probability Sampling and Poststratification

The traditional approach to recruiting survey participants starts with probability sampling: the entire population is in the sampling frame and N individuals are selected with known probabilities. This can be contrasted with non-probability sampling, where the selection probability of individuals is unknown to the researcher. Recruiting via digital advertising, as done by the Virtual Lab platform, is a non-probability sampling technique because you cannot calculate the probability of being shown an ad and surveyed for everyone in your population.

Even in probability sampling, and even if the initial frame is perfect and selection probabilities are known, non-random nonresponse implies that survey responses are not themselves representative and thus not unbiased out-of-the-box. For this reason it can be helpful to decompose survey error into (among others) frame error and nonresponse error.[5] Both, however, create the same problem: your sample is biased and responses must be reweighted in order to create an unbiased estimate of a population parameter (Kolenikov 2016).

[5] This has been discussed heavily in the Total Survey Error framework. See Biemer (2010).

Reweighting methods can be separated into three categories:

1. Design weights. If the selection probability of each individual in the sampling frame is known, responses can be weighted by the inverse probability of selection. No external data or modeling assumptions are needed.
2. Nonresponse weights. An estimated probability of nonresponse can be calculated within the survey, given the observed variables recorded in the survey. No external data is required, but the researcher must make modeling assumptions to estimate the nonresponse probability as a function of observable covariates available in the initial sampling frame.
3. Poststratification weights. If external data on the population is available (e.g., census data), response weights for a subpopulation can simply be set to reflect their total proportion of the population. Techniques such as raking[6] can be used if only marginal distributions are available in the population (Deville and Särndal 1992; Battaglia, Hoaglin, and Frankel 2009) and techniques such as multilevel regression and poststratification (MRP)[7] can be used if the number of variables, and hence subpopulations, is large (Gelman and Little 1997). These techniques do not require known selection probabilities, only non-zero representation of all subpopulations of interest in the final survey results (and, by extension, in the original sampling frame). They do require modeling assumptions over the observable covariates available both in the survey data and in the external data.

[6] Raking, also known as iterative proportional fitting, consists of iteratively looping over each variable and adjusting the weights of individual observations such that the weighted marginal probability of the variable in the sample approximates that of the population. The process goes like this: given a set of variables, start with one of them and reweight your observations so that the weighted distribution of your sample matches the marginal distribution of this variable in your target population. Repeat for the next variable, and then the next, all the way down the line for all your variables. By the time you adjust for the last variable, your first one is likely off again, so you iterate the process until convergence. While the technique provides no theoretical guarantee on the closeness of the joint density, it has been shown to work well in practice.

[7] Multilevel regression and poststratification consists of using a Bayesian hierarchical model, which has the advantage of being able to estimate cell-specific parameters for populations stratified on many variables, even when the sample size in individual cells is small or non-existent. For example, consider a survey in the US where you would like to include the variables race, gender, and state. You might not have many observations of each combination of race/gender within each state (or even many observations within each state, period). However, if you model states as belonging to higher-level groups (regions), you can estimate the region-level parameters and then regularize the state-level parameters toward the corresponding region-level parameters. This allows the individual state-level parameters to deviate from the region only when there is enough state-level data to justify the deviation. Given estimates for every cell in your stratification, poststratification weighting consists of directly applying the target population weights for each cell.
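As an illustration of how raking works in practice, the sketch below implements the iterative proportional fitting loop described in footnote 6. It is a minimal example using only numpy and pandas, with made-up sample data and assumed population margins; production implementations add convergence checks and weight trimming.

```python
import numpy as np
import pandas as pd

def rake(sample: pd.DataFrame, margins: dict, n_iter: int = 50) -> np.ndarray:
    """Iterative proportional fitting: return one weight per row such that the
    weighted marginal distribution of each variable matches `margins`."""
    weights = np.ones(len(sample))
    for _ in range(n_iter):
        for var, targets in margins.items():            # loop over raking variables
            for level, target_share in targets.items():
                mask = (sample[var] == level).to_numpy()
                current_share = weights[mask].sum() / weights.sum()
                if current_share > 0:
                    weights[mask] *= target_share / current_share
    return weights / weights.mean()                      # normalize to mean 1

# Toy example: an online sample that over-represents men and the young,
# raked to assumed population margins of 50/50 gender and 40/60 age.
sample = pd.DataFrame({
    "gender": ["m"] * 40 + ["m"] * 20 + ["f"] * 25 + ["f"] * 15,
    "age":    ["18-34"] * 40 + ["35+"] * 20 + ["18-34"] * 25 + ["35+"] * 15,
})
margins = {"gender": {"m": 0.5, "f": 0.5}, "age": {"18-34": 0.4, "35+": 0.6}}
sample["weight"] = rake(sample, margins)
```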
Figure: Non-probability sampling and poststratification. Design weights and nonresponse weights can together be used to recover an estimate of a parameter in the population represented by the initial sampling frame. Alternatively, poststratification weights can be used to build an estimate of a parameter in the target population. If the target population is different from that of the initial sampling frame, or design weights are not available (as in non-probability sampling), then some form of poststratification weighting is needed to estimate a population parameter from sample data.

Design weights together with nonresponse weights are sufficient to recover an estimate of the parameter of interest in the population that makes up the initial sampling frame. Alternatively, poststratification weights with external data are sufficient to estimate a parameter of interest in the population represented by the external data. If the initial sampling frame is the same as the "external data" about your target population (i.e., both are an official government census), then the two techniques are identical. In practice, however, many surveys start with a sampling frame that is relatively representative but not very covariate-rich. This is the case with random-digit dialing (RDD), the most common technique used by companies and organizations conducting phone surveys and a staple of opinion polling for many years. Nonresponse rates for RDD surveys have been steadily rising for decades, reaching 91% in the 2010s (Keeter et al. 2017; Shirani-Mehr et al. 2018).
Because phone numbers do not come with demographic variables that allow for the creation of a nonresponse model, it is standard practice for these surveys to employ poststratification weighting to estimate their population parameter (Gelman and Little 1997). What, then, is the value of a representative initial frame if nonresponse is so significant and poststratification techniques are needed anyway? Can non-probability sampling techniques with poststratification replace traditional probability-based sampling methods? If so, in what settings is this feasible and advisable?

These questions have been posed by a number of researchers of late. Wang et al. (2015) applied multilevel regression and poststratification (MRP) to data from an opt-in poll that was made available on the Xbox gaming platform to estimate vote share between the two primary candidates in the 2012 US presidential election. They compare their predictions with polling averages from pollster.com and find that their predictions track the polling averages very closely and indeed even produce better results than the polls in the days running up to the election. Goel, Obeng, and Rothschild (2015) applied MRP to survey data from a sample of 1,000 Amazon Mechanical Turk (AMT) workers and from 1,000 respondents collected via the online survey company Pollfish to calculate mean outcomes for a variety of questions from the General Social Survey (GSS) and similar questions from surveys performed by Pew Research. They report that their calculated outcomes from AMT and Pollfish respondents have a mean absolute deviation (MAD) from GSS/Pew benchmarks of 7.2 and 7.4 percentage points respectively, while the GSS and Pew differ from each other with a MAD of 8.6. Further research from Pew shows a similar error magnitude between online opt-in surveys with poststratification methods and their traditional phone-based survey results, reporting average deviations of 6 percentage points (Mercer, Lau, and Kennedy 2018).

What do these results imply? The GSS is a rigorous survey in which considerable time and resources are poured into creating an inclusive sampling frame and then getting responses (in person) from each sampled individual (minimizing nonresponse), coming in at the considerable price tag of $3 per respondent per question (Goel, Obeng, and Rothschild 2015). While there is no ground truth for the opinion questions measured in these surveys, the GSS is widely considered the best we have. Phone-based surveys from firms like Pew Research start with a significantly more representative sampling frame (all phone numbers) than the convenience samples studied (MTurkers or opt-in visitors to a set of websites/apps). Despite this advantage, they do not systematically outperform the convenience samples. This would seem to imply that either A) the magnitude of nonresponse error in those techniques overwhelms any improvement in frame error and/or B) poststratification techniques successfully made up for a significant portion of frame error, reducing the importance of the initial frame in the resulting total survey error.

Critically, however, poststratification techniques (like nonresponse modeling) require the researcher to select a set of relevant variables on which to stratify. This need to model the outcome as a function of covariates is happily absent in a pure probability-sampling method with minimal nonresponse.
A simple example can help illustrate this point: imagine you are interested in surveying your city's population to know how concerned they are about a particular global pandemic. Consider, additionally, that you collect information on respondents' race and gender, but do not think to collect information on their age. Assume that in reality, older people are more vulnerable to this disease and thus more concerned. If you have a comprehensive sampling frame and a crack team of door-knockers (think GSS), your resulting sample (given sufficient size) will likely be representative for age, and thus your lack of information about respondents' age will not affect the estimate of the average level of concern in your city. Now consider you are running an RDD-based phone survey and it turns out that young people are much less likely to take your call. You did not collect information on age (you did not know it would matter!), so poststratification cannot make up for this nonresponse bias and you end up woefully overestimating the concern in your city. Conversely, consider you run an internet-based convenience survey in which young people are overrepresented in your sampling frame. Without information on the age of each respondent in your collected data, poststratification cannot overcome this bias.

While the importance of "age" might be obvious in modeling citizens' concern regarding a pandemic, the inclusion or exclusion of other variables might not be so obvious a priori (e.g., political affiliation or social media diet). Thus, any technique relying on poststratification implicitly brings a modeling and, in particular, a variable-selection problem. Indeed, Mercer, Lau, and Kennedy (2018) show in a comparison of poststratification techniques that what matters most for removing the bias of non-probability samples is not the exact technique used, but rather the variables chosen. This should be a strong cautionary tale to anyone who thinks they have a representative sample but has taken the variables required for representation as given and not chosen them based on the specific outcome. In the following section we will discuss how integrating recruitment and survey responses can allow for the variable-selection problem to be formally solved in an online fashion, potentially leading to significant increases in outcome estimation accuracy.

An additional disadvantage of a non-inclusive sampling frame is that, while poststratification weighting can make up for certain populations being underrepresented, it can do nothing for populations entirely absent from the sampling frame. Digital advertising, along with the other online convenience methods compared in the studies reported above, by definition excludes from its initial sampling frame everyone without internet access. This could potentially exclude very poor households entirely, as well as those of certain religious or ethical beliefs. This is a deficit that no amount of poststratification weighting can make up for.

Many new studies have begun to use Facebook specifically as a recruitment tool and apply poststratification to estimate population quantities (Zagheni, Weber, and Gummadi 2017; Perrotta et al. 2020). We are not aware, however, of any study that has systematically compared recruitment via online advertising to traditional probability-based recruitment such as RDD.

We see several advantages that online advertising has as a sampling frame as opposed to other online convenience sampling methods:
1. Large population coverage. Facebook registers 2.7 billion active users (Statista 2020). The Google Ads Display Network reports that it covers 90% of internet users worldwide across millions of websites (Google 2020).
2. Targeted advertising allows researchers to intentionally pay more to reach under-represented groups. This allows researchers to trade off cost and representativeness in a way that the studied convenience methods (e.g., Xbox Live players) do not allow. For example, in the PFI study, gender balance between men and women was achieved by stratifying ads and paying three times the cost for women compared to men. Similarly, in the MNM study, balance across regions and dwellings was achieved by paying up to 50 times more for the most expensive strata compared to the cheapest.
3. Real-time communication. Digital advertisers expose APIs, which allow software run by the researcher to communicate with the advertising platform in real time and adjust ad placement. While on the one hand this makes it convenient to create hundreds of audiences for stratified recruitment, the potential extends further. We explore this feature more in the following section, but it is worth highlighting that it is novel: it is not present in traditional sampling frames, nor is it considered in any of the research on this topic that we are aware of.

The combination of the last two points (targeted ads controlled via API) is very powerful: it implies that we can build our own ad-optimization engine to optimize the goals of researchers, policymakers, and the public good. This does not come out-of-the-box from ad platforms, whose built-in ad optimization routines are designed to maximize value under the assumption that diversity of audience (customers) is not in-and-of-itself valuable. While this may be the case in retail, it is not the case for research. The value (information gain) of an individual decreases with the number of similar individuals we already have. This is why Virtual Lab has its own ad optimization engine that uses the available tools to optimize for heterogeneity rather than homogeneity.

These additional advantages of digital advertising over the convenience-based methods studied give solid reasons to believe that results could potentially be even better for digital advertising. Future research is needed to test that hypothesis.

Integrating Recruitment and Surveying

Virtual Lab is a platform for recruitment and surveying via digital advertising. As discussed above, this technique presents new opportunities but also challenges that are worth discussion and research. Many important possibilities are opened up by integrating digital recruiting (with all the micro-targeting and dynamic-optimization power of modern digital advertising) with survey answers in a single platform. In particular, the question becomes: what can we gain if we have the ability to adjust recruitment in real time based on initial survey responses as they come in? We will consider several increasingly complex (and realistic) scenarios and see how, in each case, a platform which integrates recruitment with surveying (and can adaptively adjust recruitment) can achieve more efficient results than a traditional process.

Scenario A: Adjusting for attrition

Consider you have a set of simple, demographic variables on which to stratify your population and a target number of respondents for each subpopulation. For example, consider that you want responses from 500 respondents aged 65 and older and 500 respondents younger than 65.
How many should you recruit from each group? If attrition rates are not equal among your subpopulations, and not known ahead of time, this question can be hard to answer. If, however, your recruitment platform knows who finishes your survey, it can continue recruiting from each subpopulation until it gets exactly the numbers you want. Virtual Lab's recruitment engine, for example, can be set up to continue recruiting from each subgroup until you have 500 who finish your survey (however you define "finish").

Scenario B: Adjusting for hard-to-find subgroups

Consider you have a set of variables on which you want to stratify your population, but they are not restricted to traditional demographic variables and instead include variables that you ask in your survey. For example, in MNM, we wanted a sample stratified by the respondent's dwelling type (cement dwelling vs. non-cement dwelling). This is not a demographic variable available for sampling from any traditional sampling frame, nor was it a variable available for targeting in Facebook's ad platform. Traditionally, the only way to stratify on these variables would be to "over-recruit" and hope that none of your subpopulations of interest are overly rare. When you run the analysis, you can either lament that your estimates are a bit unstable due to an unlucky low number or lament that you paid more than you needed to estimate your effect. In an integrated recruitment/survey environment, however, recruitment can continue until exactly the point where individual subpopulation targets are fulfilled, even if the subpopulations are defined by survey variables such as "dwelling type."

Taking it one step further, Virtual Lab uses advanced targeting techniques of digital advertising platforms to target ads to groups of users even if they are not defined by traditional demographic variables available for explicit targeting via the platform. Specifically, ad platforms provide two features that allow this platform to optimize for populations that can't be explicitly defined by demographics: A) custom optimization events and B) "lookalike audiences" (also called "similar audiences"). In both cases, Virtual Lab sends the ad platform a list of users who, after answering the survey, are revealed to belong to a certain subpopulation and tells the ad platform to target ads at "people like this." The ad platform then uses the (massive) set of private variables at its disposal to determine how to find similar people and target ads to them. In the MNM study, for example, Facebook didn't know for certain we were interested in people living in kutcha dwellings, nor could it even know for certain if people lived in kutcha dwellings, but by providing it with a continuously growing list of users who fit that criterion, we were able to continuously reduce the cost-to-acquire for this subgroup throughout the recruitment process.
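The sketch below illustrates the mechanics of this approach against Facebook's Graph API: upload the hashed identifiers of respondents who turned out to belong to the subpopulation of interest as a custom audience, then request a lookalike audience seeded from it. This is a simplified, illustrative sketch only; the API version, endpoint parameters, and payload formats are abbreviated assumptions and should be checked against the platform's current marketing-API documentation, and Virtual Lab's own implementation may differ.

```python
import hashlib, json, requests

GRAPH = "https://graph.facebook.com/v9.0"      # API version is illustrative
TOKEN = "..."                                   # system-user access token (placeholder)
AD_ACCOUNT = "act_1234567890"                   # placeholder ad account id

def sha256(value: str) -> str:
    # Identifiers must be normalized and hashed before upload.
    return hashlib.sha256(value.strip().lower().encode()).hexdigest()

# 1. Create a custom audience to hold respondents revealed by their survey
#    answers to belong to the subpopulation of interest.
audience = requests.post(
    f"{GRAPH}/{AD_ACCOUNT}/customaudiences",
    data={"name": "kutcha-dwelling respondents", "subtype": "CUSTOM",
          "customer_file_source": "USER_PROVIDED_ONLY", "access_token": TOKEN},
).json()

# 2. Add hashed identifiers of matching respondents as they come in.
emails = ["respondent1@example.com", "respondent2@example.com"]
requests.post(
    f"{GRAPH}/{audience['id']}/users",
    data={"payload": json.dumps({"schema": "EMAIL_SHA256",
                                 "data": [sha256(e) for e in emails]}),
          "access_token": TOKEN},
)

# 3. Ask the platform for a lookalike audience seeded from that list,
#    which new ad sets can then target.
requests.post(
    f"{GRAPH}/{AD_ACCOUNT}/customaudiences",
    data={"name": "lookalike: kutcha-dwelling", "subtype": "LOOKALIKE",
          "origin_audience_id": audience["id"],
          "lookalike_spec": json.dumps({"type": "similarity", "country": "IN"}),
          "access_token": TOKEN},
)
```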
Scenario C: Adjusting for outcome variance

Consider now that you have a set of variables on which to stratify your population but do not know how many finished surveys you need from each subpopulation. If you're interested in the expectation of a population value (e.g., average household income), the optimal sample size to allocate to each subpopulation is described by the Neyman allocation (Neyman 1934; Groves et al. 2010):

n_h = n \frac{W_h S_h}{\sum_{h'} W_{h'} S_{h'}}

where two factors influence how many individuals you sample from each subpopulation: W_h, the share of your population that belongs to that subpopulation, and S_h, the standard deviation of their outcome. For example, if the outcome is average household income and the strata are defined by profession, you may only need to sample a few "students" while you may need to sample many "managers." In the general case, however, this variance may not be known ahead of time. Once again, if you have a platform with dynamic recruitment, you can adjust the sample size per subpopulation in an adaptive fashion, based on sample estimates of the variance, spending more to recruit respondents from subpopulations with high outcome variance and spending less on more homogenous subpopulations (a worked allocation sketch follows Scenario D below).

Scenario D: Adjusting the subgroups themselves

Consider now that you do not actually know the correct set of variables by which to stratify your population. In general, these should be variables with strong associations to the outcome, but that might not be known a priori. Without integrating recruitment and surveying, the best variables for stratification can only be determined a posteriori, and even then only with a bit of luck: to consider a new variable for stratification you need to have strong representation across all possible values of that variable. Consider the example of election polling. Despite the industry being well-funded and the techniques well-polished, new variables are constantly being added at the end of every election cycle, and the lack of new variables is constantly being blamed for previous prediction failures and polling biases. If the process is integrated, however, both the outcome values and the ideal feature representation can be learned simultaneously in one adaptive survey. We believe this is an exciting area of research for both statistical theory and survey platforms to explore.
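To make Scenario C concrete, the sketch below computes the Neyman allocation from population shares and within-stratum standard deviations. The numbers are made up for illustration; in an adaptive design, the standard deviations would be replaced by running sample estimates as responses arrive.

```python
def neyman_allocation(n_total, shares, sds):
    """Allocate a total sample of n_total across strata in proportion to W_h * S_h."""
    weights = {h: shares[h] * sds[h] for h in shares}
    total = sum(weights.values())
    return {h: round(n_total * w / total) for h, w in weights.items()}

# Toy example: 'managers' incomes vary much more than 'students' incomes,
# so managers receive a disproportionately large share of the sample.
shares = {"students": 0.3, "managers": 0.2, "farmers": 0.5}       # population shares W_h
sds    = {"students": 50.0, "managers": 400.0, "farmers": 150.0}  # outcome SDs S_h
print(neyman_allocation(1000, shares, sds))
# {'students': 88, 'managers': 471, 'farmers': 441}
```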
IV. Asking Questions via Chatbot

There are many ways in which one can deliver a survey: face-to-face, phone calls, web forms (Qualtrics), etc. "Chat" is a relatively new form of surveying. Respondents receive questions as text messages in a messaging application on their smartphone and reply by writing text messages in return. By using the chatbot capabilities provided by many popular messaging apps (WhatsApp, Facebook Messenger, Telegram, Viber, etc.), one can deliver survey questions and collect responses within the apps that many people already use to communicate every day.

Very little research exists on the mode effects of chat (also known as "messenger" or "chatbot" survey designs) as a format for questionnaires, and this is an important question for further research. Toepoel et al. (2020) do present, however, some initial results comparing the two modes. They show that there are indeed differences, including that users provide shorter answers to open-ended questions when provided with a chat design as opposed to a web survey design. There are, however, some key limitations to their study, for example that the majority of their users were on desktop and the messenger application was a custom-built interface that users had never used before. Despite being a new survey mode, chat has key advantages that lead us to believe it can be a powerful data-gathering tool:

1. The entire process is asynchronous. Unlike all other methods, users do not stop everything for a contiguous block of uninterrupted time to fill in the survey. Instead, they respond to the next question at their own convenience. This could, of course, be considered a disadvantage as well if you don't want your respondents getting distracted by their daily lives in the middle of answering your questionnaires.
2. The interface is familiar. Web surveys inherently require respondents to learn a new digital interface (how do I move to the next question, how do I enter my response, etc.). In a chat survey, on the other hand, respondents use an interface they are already familiar with.
3. Seamless follow-ups. Sending a follow-up, whether it's one question or 20, after two hours or two months, is experienced by the respondent as "just another" message in a thread, which can be read and responded to with minimal friction. Web surveys, on the other hand, have no natural way to follow up with respondents and must resort to external tools such as email. In the ITALY study, attrition over six distinct waves and three months was only 50%, with more than 90% of respondents voluntarily completing each wave from a single push notification, without any additional reminders or encouragement.
4. Notifications. Sending messages can be used to "push" notifications to respondents based on external events. For example, when incentive payments have been processed, respondents can be notified directly in the chat ("you've been paid!" or "your payment could not be processed, please provide a different number"). In MNM, the implementation of this payment processing inside the bot dropped the failed payment rate from 5% to 0.2%, by communicating errors directly to respondents and allowing them to provide alternative mobile numbers for top-up payments.
5. Mobile first. Web surveys are digital imitations of paper surveys. Modern survey software has re-boxed that imitation to work well on small screens like mobile devices. Text messaging, on the other hand, is a mobile idea. Chat surveys replicate a text-message conversation, not a paper survey, and thus provide a mobile-native experience.

Thus, while much empirical research is still needed to understand the effects of these differences, the affordances of the design itself allow us to draw some initial hypotheses as to when chat, as a survey mode, might be advantageous. Table 1 shows a side-by-side comparison of the two modes and their relative advantages across different features.

Table 1. Mode comparison

| Feature | Chat Survey | Web Survey |
| --- | --- | --- |
| Follow-ups | Seamlessly integrated. Can be many, potentially very frequent and with few questions. | Must use an external process (email). Best for few follow-ups, preferably with many questions at once. |
| Open-ended responses | Better for relatively short answers. | Short or long answers could both work. |
| Mobile vs. desktop | Great on mobile and desktop; allows platform switching. | Favors desktop. Cannot switch platforms mid-survey. |
| Integrated payment | Yes (in Virtual Lab's chatbot). | Yes (depending on platform). |
| Users share media (photos, audio) | Seamless and natural. | Possible. |
| Users can share content with friends or contacts | Seamless and natural. | Possible. |

V. Virtual Lab Study Archetypes

We will consider several study design archetypes/patterns that we think are particularly well-suited to Virtual Lab, along with examples of how they could be (or were) designed. These should be considered as rough sketches of potential research designs meant to encourage brainstorming, rather than detailed blueprints to be followed directly.[8]

[8] Where appropriate, it is important to submit research protocols to institutional review boards for ethical approval, as well as collect informed consent from all research participants.
Individual-level randomized control trials

In this type of study, participants are randomized into treatment groups at the individual level and outcomes are measured via survey responses or recordable online actions (e.g., clicking links or agreeing to share content). The treatment itself could be administered via:

1. A digital ad campaign, where individuals in the study are "retargeted" in the same ad platform used for recruitment and shown ads. The ads are shown to the individuals "in the wild," anywhere in the ad network (e.g., in their Facebook feed or while browsing websites), and targeted directly to treatment-group respondents.
2. A within-survey intervention, where the treatment is delivered within the survey itself. For example, the survey chatbot might send respondents a video (or several!) to watch, as in the PFI study, or might send users images, as in the ITALY study.
3. A real-world intervention. For example, respondents can provide their address and researchers can send bednets to their houses to encourage them to sleep under bednets in order to prevent mosquito-borne diseases.

Clustered randomized control trials

In this type of study, participants are randomized at the cluster level, where clusters might, for example, consist of geographic regions or communities. Outcomes, again, are measured via survey responses or recordable online actions. The treatment is administered to the entire cluster and can take many forms: for example, any advertising campaign targeted by region, or region-specific government policies.

An example of this type of study is MNM. The treatment was an ad campaign meant to promote preventative behaviors and proper treatment, and one of the primary outcomes of interest was malaria incidence. Because there are potentially large geographic spillovers from engaging in malaria-preventative behavior (malaria spreads from person to person) and high correlation of malaria incidence within regions, it made sense to randomize at the cluster level (in this case, the administrative "district"). Virtual Lab was used to generate control and treatment regions in the Facebook ad platform that could then be used by the advertising team to target their ads everywhere except for control districts.
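For the clustered design, the random assignment itself is straightforward. A minimal sketch is shown below, with a fixed seed for reproducibility and hypothetical district names; stratified or pairwise-matched assignment would follow the same pattern.

```python
import random

def assign_clusters(clusters, treatment_share=0.5, seed=42):
    """Randomly assign clusters (e.g., districts) to treatment and control arms."""
    rng = random.Random(seed)            # fixed seed so the assignment is reproducible
    shuffled = clusters[:]
    rng.shuffle(shuffled)
    cutoff = int(len(shuffled) * treatment_share)
    return {c: ("treatment" if i < cutoff else "control")
            for i, c in enumerate(shuffled)}

districts = [f"district_{i:02d}" for i in range(80)]    # 80 clusters, as in MNM
assignment = assign_clusters(districts)
control_districts = [c for c, arm in assignment.items() if arm == "control"]
# control_districts would then be excluded from the campaign's ad targeting.
```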
Survey-response targeted ad campaign

Rather than (or in addition to) using Virtual Lab to perform impact evaluation, it can be used to directly improve the targeting of the main ad campaign itself. In this pattern, a survey is run in parallel to the main ad campaign. The responses to the survey questions divide respondents into ideal-audience and not-ideal-audience categories. This information is returned to the ad platform, which generates a model to predict new individuals who would likely form part of the ideal audience if asked the survey questions (this is done with the models traditionally used to predict "high-value customers" for retail advertisers). This prediction model is then used to target the main ad campaign at those more likely to be the ideal audience.

Consider as an example that you are running a vaccination campaign. Ideally, you only want to advertise to those who have not yet been vaccinated. A survey can be used to gather data on who has been vaccinated, and the ad platform creates a model to predict those who have not yet been vaccinated. With that, you can target your ads specifically at those likely to not yet have been vaccinated! This could be performed continuously so that the audience is updated in real time. This has the potential to drastically improve the cost-effectiveness of the campaign, as those who actually see the ads are more likely to be those who might actually benefit from it.

Quick population surveys with real-time data visibility (dashboard)

The ability to instantly deploy ads, quickly recruit respondents, and immediately visualize response data in a dashboard can be extremely useful for any policymaker or researcher who needs to gather data on new outcomes of interest. This can have advantages over going to a pre-existing pool of respondents in the following scenarios:

1. When the variables that define "representativeness," or subpopulations of interest, for the new outcome are different from those around which previous respondent pools were created. In this case, creating a new respondent pool, stratified around a different set of variables, can be done extremely quickly with digital advertising.
2. When the subpopulations of interest are unknown, as is likely the case if the outcome is new or of recent interest, an integrated recruitment/data-collection platform like Virtual Lab could theoretically jointly learn the important stratification covariates along with the collected outcome data.

Consider, for example, a policymaker wanting to understand the social impacts of new social-distancing measures to counter COVID-19. Populations of particular interest might be those who have pre-existing health conditions as well as those who are employed in heavily impacted sectors. Such a sample likely does not exist in any survey company's portfolio, but could be created with the targeting tools of digital advertising.

Population panels with high-frequency waves and real-time data visibility (dashboard)

Building high-frequency data based on low-touch interactions is a perfect fit for Virtual Lab and the chatbot survey engine. Consider a panel of respondents who provide an answer to a repeated question on a weekly, or even daily, basis ("Do you have a fever?" or "How worried are you about ___?"). As an example, in MNM, respondents were asked several questions bi-weekly during the malaria season, including "Did you or anyone in your family have malaria in the past two weeks?" With hundreds of responses each day, researchers were able to track in real time the evolution of the malaria self-reports across multiple geographic regions. It should be noted that the choice between a panel and a repeated one-off survey design will depend on the outcome of interest, but in both cases the Virtual Lab platform allows for quickly deploying a recruiting and monitoring solution.

VI. Steps for Designing a Virtual Lab Study

1. Select population parameter(s)

Digital advertising provides a lot of flexibility and power in sample recruitment. This means, however, that it is especially important to be explicit about the population and population parameter that you wish to estimate with the sample you recruit. In individual-level experiments, your population may be the sample itself and the treatment effect estimated for the sample only. Many behavioral lab experiments are (for better or worse) designed this way. If your experiment seeks to test something that could be a policy, however, you might want to estimate the effect of the policy on the population that would be affected by it.
2. Decide how to measure parameters based on users' online interactions

Decide how to measure that parameter from a sample of individuals based on their interactions within a survey, chat application, or website. While the simplest outcome is an answer to a survey question, it is also possible to collect data on users' actions. While you cannot track everything a user does on their phone, you can send them to a website that you control, collect data on their actions there, and link those actions back to them. This opens up many possibilities for measured outcomes. For example, you can direct them to a page with streaming videos and record how much they watched. You can also invite them to click on a link to an external website and record whether or not they clicked on it.

Even if you can't directly record the actions, with a bit of legwork and creativity you can potentially still recover the variables. In PFI, for example, the survey chatbot asked users if they would like to add a "frame" supporting an end to violence against women to their Facebook profile pictures. If users answered "yes," they were sent a link to a specific frame. While the survey could record users' intentions (the "yes" answer), there was no automated way to check if they did, in the end, change their profile picture. However, the researchers could manually go through the profile pictures of the respondents and record who had added the frame.
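As a sketch of how "send them to a website you control and link actions back to them" can work, the snippet below implements a minimal redirect endpoint with Flask: each respondent gets a personalized link; opening it logs a click tied to their respondent id before redirecting to the real destination. This is an illustration only (Virtual Lab's link-sharing integration handles this for you), and the destinations and logging format are made up for the example.

```python
# pip install flask
import csv, time
from flask import Flask, redirect, request

app = Flask(__name__)

DESTINATIONS = {"ngo1": "https://example.org/ngo1"}   # short codes -> real URLs (illustrative)

@app.route("/r/<code>")
def track_and_redirect(code):
    respondent = request.args.get("rid", "unknown")   # id embedded in the personalized link
    with open("clicks.csv", "a", newline="") as f:    # log who clicked what, and when
        csv.writer(f).writerow([time.time(), respondent, code])
    return redirect(DESTINATIONS.get(code, "https://example.org"))

# Each respondent is sent a link such as https://<your-domain>/r/ngo1?rid=12345;
# clicks.csv can later be joined back to survey responses on the respondent id.

if __name__ == "__main__":
    app.run()
```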
3. Pick stratification variables

These can be explicit demographic variables available via the digital advertising platform or they can be variables you will collect yourself in your survey. In general, stratification variables should be discrete (or discretized) variables that exhibit strong dependence with your outcome and/or interaction with your treatment. Ideally, the strata defined by these variables can be used for both recruitment and analysis, allowing you to target recruitment to maximize statistical power.

In the case of a well-defined population of interest, you should have measures for your stratification variables (or a subset of them) in the population. For example, if you're interested in population-level outcomes for a country's voting population, voter registration records with information on covariates such as age, gender, location, etc. might be available. Targeted digital advertising implies that many of the problems traditionally left to poststratification techniques can be worked on significantly through the adaptive recruitment process itself: spending extra advertising money on respondents in under-populated cells and saving money on respondents in would-be-overpopulated cells.

In addition to stratifying recruitment on poststratification variables, you might have some variables for which you do not have population data but still wish to use in your primary analysis, as would be the case when estimating heterogeneous treatment effects, for example. In that case, you would also want to stratify recruitment on those variables. Geographic variables are easily forgotten in online experiments. If, however, you have any reason to believe your outcome depends on urban/rural divides or varies from major cities to smaller cities, then it is likely extremely important to consider stratifying on geography as well.

Create a desired target sample size for each stratum. Virtual Lab will use the marketing API of the digital advertising platform to generate separate ad sets for each individual stratum and optimize their spend and duration to recruit the target number of respondents. Keep in mind: greater stratification naturally implies higher advertising costs, so be sure to estimate your budget accordingly. Consider your stratification variables in two groups: those which are available via digital advertising directly (demographic data such as age, gender, location, etc.; call these "demographic variables") and those which you will collect in your questionnaire ("custom variables"). The platform can use variables from either group to stratify and recruit; with custom variables, however, the targeted advertising becomes more efficient as the sample size grows. More details on targeted advertising over custom variables are available in the documentation on the Virtual Lab website.

4. Panel? Cross-section? Repeated cross-section?

Digital advertising allows for continuous, reliable, automated recruitment. If you are not interested in panel data per se but want data over time, you can still get repeated "waves." Turning the recruitment process on and off is as easy as clicking a button, which greatly reduces the friction of repeating surveys. Repeated cross-sectional surveys will likely have less attrition and lower costs than the same data collected in panel format. Chat, however, is extremely well suited to panel data as it allows for natural follow-up questions. Unlike in traditional panel survey design, you have extreme flexibility over how you design the timing of your follow-ups. Want to send a follow-up question four hours later? Four days? Four months? All are equally possible. This flexibility, combined with the fact that users can, at any time, put down your survey and pick it up again at their convenience, requires you to think differently about the respondent experience and design your survey accordingly.

5. Select an incentive strategy

You may want to incentivize respondents to answer your questions or participate in your experiment. Two common schemes are A) respondents are entered into a lottery or B) respondents each receive a small individual reward. In the case of individual rewards, Virtual Lab includes a set of payment-provider integrations and an integrated payment system that allows the chatbot to notify respondents when payments have been processed, inform them of any errors, and advance in the chat only after payment has successfully been processed. This greatly reduces (or eliminates!) the back-office work required to successfully deliver incentives to respondents in a large study. A current list of payment-provider integrations can be found in the platform's online documentation on the official website.

Attrition is a major concern for any study, but especially so when the involvement is more demanding, such as in multiple-wave panel data or experiments with time-intensive interventions. Integrating individual rewards into the chat can be a great way to build trust with respondents and can potentially help to reduce attrition. Even a very small reward, successfully delivered early in the survey process, can help legitimize the process and build trust, reducing attrition. Large payments at the end of the study (i.e., after the final follow-up) increase the incentive to complete the study, which can similarly reduce attrition.
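The flow described above (pay, confirm in the chat, and recover from failures by asking for a different number) is simple to express in code. The sketch below is a hypothetical illustration, not Virtual Lab's implementation: `payments.top_up`, `chat.send`, and `chat.wait_for_reply` stand in for a payment-provider client (such as a mobile top-up API) and the chatbot's messaging layer.

```python
def deliver_incentive(chat, payments, respondent, amount, max_attempts=3):
    """Send a mobile top-up and only advance the survey once it succeeds."""
    number = respondent.phone_number
    for _ in range(max_attempts):
        result = payments.top_up(number=number, amount=amount)   # hypothetical provider call
        if result.ok:
            chat.send(respondent, f"You've been paid {amount}! Thanks for participating.")
            return True
        # Communicate the error and let the respondent supply an alternative number.
        chat.send(respondent, "Your payment could not be processed. "
                              "Please reply with a different mobile number.")
        number = chat.wait_for_reply(respondent).text.strip()
    chat.send(respondent, "We couldn't process your payment; our team will follow up.")
    return False
```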
Table 2 compares lottery and individual-reward incentive schemes in the context of digital advertising.

Table 2. Comparison of incentives

Up-front (advertising) costs
  Lottery: Potentially high if the lottery is not that attractive.
  Individual reward: Potentially low if the incentive is attractive.

Incentive (reward) costs
  Lottery: Potentially low if the lottery prize is cheap.
  Individual reward: Potentially high if the incentive is expensive.

Total costs
  Lottery: Depends.
  Individual reward: Depends.

Building trust and attrition
  Lottery: Hard to build trust; attrition is potentially high as a result.
  Individual reward: Possible to build trust if part of the reward can be delivered early in the process, which can reduce attrition.

Costs increase/decrease linearly with respondents
  Lottery: No. The lottery cost is fixed and advertising costs are paid at the beginning; they do not change with the number of respondents who complete the survey/experiment.
  Individual reward: Yes. You pay per respondent, so payment is proportional to the number of respondents who complete the survey/experiment.

6. Design and Test

Consider running several pilot surveys to improve the respondent experience and reduce attrition. Virtual Labʼs dashboards allow you to monitor how respondents progress through each wave of your survey and to see which questions they tend to get stuck on. Randomize survey length and intensity and see how that changes attrition, especially attrition conditional on potential covariates or outcomes of interest. Viewing the behavior of your respondents in a real-time dashboard allows you to iterate on survey design quickly and efficiently.

VI. Lessons From Previous Virtual Lab Studies

Table 3 compares costs and attrition numbers for the three example studies cited in this manual. Itʼs worth noting some major drivers of cost:

Stratification. Stratification increases representation but costs more. This tradeoff highlights the danger of not stratifying: the low cost of unstratified digital recruiting can be attractive, but it comes at the price of representativeness.

Recruitment time. Short recruitment periods force higher per-day ad budgets and higher per-impression ad costs. Cost-per-acquisition does not generally stay constant across levels of daily budget: as the budget increases, the cost-per-acquisition increases as well. This is a natural consequence of the auction format of digital advertising, in which your ads compete with other advertisers for a limited number of daily viewers.

Attrition. Attrition in each wave of a panel study can be very high, and because it compounds across waves it can multiply the advertising cost needed to reach a target final sample (a back-of-the-envelope sketch follows below). Techniques to reduce attrition were discussed in the previous section, but we do not yet have rigorous evidence on what exactly works. ITALY experienced extremely low attrition compared to the other two example studies. This might be driven by differences between the countries, but also by the fact that ITALY took place during a country-wide Covid-19 confinement and the subject of the survey was Covid-19: respondents potentially had both more time on their hands to fill in surveys and greater interest in the subject matter.
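To see how per-wave attrition compounds into advertising cost, consider the following back-of-the-envelope sketch. The retention rate, target sample, and per-respondent cost are illustrative assumptions, not figures from PFI, ITALY, or MNM.

```python
# Back-of-the-envelope sketch of how per-wave attrition inflates recruitment
# needs and advertising cost in a panel study. All numbers are illustrative.
target_final_sample = 1000           # respondents needed at the last wave
waves = 3                            # follow-up waves after baseline
retention_per_wave = 0.5             # fraction retained from one wave to the next
cost_per_baseline_respondent = 2.00  # assumed all-in recruiting cost (US$)

# Respondents needed at baseline so that enough survive every wave.
baseline_needed = target_final_sample / (retention_per_wave ** waves)
advertising_cost = baseline_needed * cost_per_baseline_respondent

print(f"Baseline recruits needed: {baseline_needed:.0f}")        # 8000
print(f"Approximate recruiting cost: ${advertising_cost:,.0f}")  # $16,000
```

Halving attrition in each wave (retention of 0.75 instead of 0.5) would cut the required baseline sample in this sketch from 8,000 to roughly 2,370, which is why small investments in retention can pay for themselves many times over.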
Table 3. Stratification, Recruitment Time and Attrition

                                        PFI                  ITALY            MNM
Clicks on ad                            33,000               3,500            302,000
Cost per click (US$)                    0.66                 0.10             0.20
Incentives                              Lottery ticket       Amazon voucher   Mobile credit
Stratification of recruiting campaign   Yes                  No               Yes
Type of study                           Longitudinal (RCT)   Longitudinal     Longitudinal (RCT)
Individualsʼ participation time         60 minutes           60 minutes       40 minutes
Length of study (base-to-endline)       4 months             3 months         6 months
Initial sample size                     5,200                1,220            18,800
Final sample size (% of initial)        620 (12%)            600 (50%)        Study in progress

VII. Future Research

Recruiting respondents with digital ads and asking them questions via chatbot is a relatively new approach. Virtual Lab was built as a research tool to take advantage of these new technologies and to test new techniques. As such, there is much work to be done!

Methods to measure bias of ads and ad optimization

Digital ad platforms generally have a very large population to which they can target ads. The sample of individuals who are both shown ads and respond to them, however, is not a random sample from that population. It is worth pointing out several factors (among many) driving that non-randomness: A) the bids of other companies for users, which are based on those usersʼ value to them; B) the prediction engine of the ad platform, which determines who is most likely to click on your ad; and C) the ad creative (image and copy) itself, which greatly determines who actually clicks on the ad. If these factors lead to a strongly non-random sample from the potential population, then understanding how they relate to your outcome or interact with your treatment is crucial to knowing whether they introduce a bias in and of themselves. Luckily, all of these factors can be partially controlled by the researcher: A) bids can be increased to outbid other advertisers; B) daily budgets can be increased, which forces ad platforms to show your ad to more people, including those the platform does not predict to be the most likely to click; and C) the ad creative (image and text) can be swapped out and different versions tested against each other.9 Coming up with a reasonable methodology for these sensitivity checks is vital to doing research with respondents recruited via digital advertising.

9 Note that, as opposed to traditional A/B testing of creative by ad companies, we are not suggesting that ads be compared in order to choose the most effective one, but rather that they be compared in downstream analysis to see whether they recruit qualitatively different audiences: for example, whether the intervention shows a differential treatment effect across the audiences recruited by different ads, or whether the estimated population parameter differs across those audiences.

Mode effects and Attrition of Digital Advertising

Attrition is a major concern in any survey process, but little is known about the specific mode effects on attrition of messenger surveys and digitally recruited participants. Our experience with the example studies in India and Italy suggests that these modes might be “easy in, easy out”: individuals opt in to start a survey very easily and, potentially, without much commitment, which might make them opt back out again with equal ease. Understanding how these digital modes of recruitment and interaction affect attrition is therefore an important area for further research.
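As a simple starting point for studying such attrition patterns, here is a minimal sketch of computing per-wave retention from a response log. The table layout and column names are hypothetical and do not correspond to Virtual Labʼs data exports.

```python
import pandas as pd

# Hypothetical long-format response log: one row per respondent per completed
# wave. Column names and data are illustrative, not Virtual Lab exports.
responses = pd.DataFrame({
    "respondent_id": [1, 1, 1, 2, 2, 3, 4, 4, 5],
    "wave":          [0, 1, 2, 0, 1, 0, 0, 1, 0],
})

# Number of unique respondents completing each wave.
completed = responses.groupby("wave")["respondent_id"].nunique()

# Retention relative to the baseline wave and wave-over-wave.
retention_vs_baseline = completed / completed.loc[0]
retention_per_wave = completed / completed.shift(1)

print(pd.DataFrame({
    "completed": completed,
    "vs_baseline": retention_vs_baseline.round(2),
    "vs_previous_wave": retention_per_wave.round(2),
}))
```

Breaking the same computation down by recruiting ad, stratum, or incentive arm is one simple way to start probing where the “easy out” behavior concentrates.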
Mode effects and Attrition of Chatbot Surveying

Every survey format has a “mode effect” that influences the way respondents answer questions. Some initial research shows that chat does, indeed, have mode effects that differ from those of a traditional web survey (Toepoel et al. 2020). More work is needed to fully understand these effects. It is of special interest to see how attrition across multi-wave surveys differs between chatbot surveying and other modes (web survey plus email, for example).

Automatic Stratification

As mentioned in the motivation section on integrating recruitment and surveying, an integrated platform creates many opportunities to automate the stratification process based on real-time outcome data: both for optimizing the number of respondents per stratum and for selecting the variables that make up the strata themselves. Given the technology to create integrated platforms like Virtual Lab, and the APIs of modern digital advertising tools, online optimal-stratification methods could substantially increase the accuracy of many surveys and polls. We believe this is a promising area for statistical and methodological research that addresses these problems and proposes solutions suited to this context.

Comparing accuracy of results to gold-standard surveys

Similar to previous work using post-stratification to estimate election or survey outcomes (Wang et al. 2015; Goel, Obeng, and Rothschild 2015), it will be important to show how well samples recruited via digital advertising can replicate gold-standard probability-sampling results, such as those from national census-based surveys (e.g., the General Social Survey or the European Social Survey). In the motivation section of this manual, we lay out the hypothesis that purposefully targeted subpopulations recruited via large digital advertising platforms can improve upon results from pure convenience samples such as Xbox players or MTurk workers. If such an improvement exists and is substantial, targeted digital advertising could quickly gain prominence as a proven and affordable tool for accurate survey recruitment.

Bibliography

Battaglia, Michael P., David C. Hoaglin, and Martin R. Frankel. 2009. “Practical Considerations in Raking Survey Data.” Survey Practice 2 (5): 1–10. https://doi.org/10.29115/sp-2009-0019.

Biemer, Paul P. 2010. “Total Survey Error: Design, Implementation, and Evaluation.” Public Opinion Quarterly 74 (5): 817–48. https://doi.org/10.1093/poq/nfq058.

Deville, Jean-Claude, and Carl-Erik Särndal. 1992. “Calibration Estimators in Survey Sampling.” Journal of the American Statistical Association 87 (418): 376–82. http://www.jstor.org/stable/2290268.

Gelman, Andrew, and Thomas C. Little. 1997. “Poststratification Into Many Categories Using Hierarchical Logistic Regression.”

Goel, Sharad, Adam Obeng, and David Rothschild. 2015. “Non-Representative Surveys: Fast, Cheap, and Mostly Accurate.”

Groves, Robert M., Eleanor Singer, James M. Lepkowski, Steven G. Heeringa, and Duane F. Alwin. 2010. Survey Methodology. https://doi.org/10.4324/9780429314254-2.

Grow, André, Daniela Perrotta, Emanuele Del Fava, Jorge Cimentada, Francesco Rampazzo, Sofia Gil-Clavel, and Emilio Zagheni. 2020. “Addressing Public Health Emergencies via Facebook Surveys: Advantages, Challenges, and Practical Considerations.” https://doi.org/10.31235/osf.io/ez9pb.

Keeter, Scott, Nick Hatley, Courtney Kennedy, and Arnold Lau. 2017.
“What Low Response Rates Mean for Telephone Surveys.” Pew Research Center. http://www.pewresearch.org/wp-content/uploads/2017/05/RDD-Non-response-Full-Report.pdf.

Kolenikov, Stas J. 2016. “Post-Stratification or a Non-Response Adjustment?” Survey Practice 9 (3): 1–12. https://doi.org/10.29115/SP-2016-0014.

Mercer, Andrew, Arnold Lau, and Courtney Kennedy. 2018. “For Weighting Online Opt-In Samples, What Matters Most?” Pew Research Center.

Neyman, Jerzy. 1934. “On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection.” Journal of the Royal Statistical Society 97 (4): 558–625.

Perrotta, Daniela, André Grow, Francesco Rampazzo, Jorge Cimentada, Emanuele Del Fava, Sofia Gil-Clavel, and Emilio Zagheni. 2020. “Behaviors and Attitudes in Response to the COVID-19 Pandemic: Insights from a Cross-National Facebook Survey.” https://doi.org/10.1101/2020.05.09.20096388.

Shirani-Mehr, Houshmand, David Rothschild, Sharad Goel, and Andrew Gelman. 2018. “Disentangling Bias and Variance in Election Polls.” Journal of the American Statistical Association 113 (522): 607–14. https://doi.org/10.1080/01621459.2018.1448823.

Toepoel, Vera, Peter Lugtig, Bella Struminskaya, Anne Elevelt, and Marieke Haan. 2020. “Adapting Surveys to the Modern World: Comparing a Research Messenger Design to a Regular Responsive Design for Online Surveys.” Survey Practice 13 (1): 1–10. https://doi.org/10.29115/sp-2020-0010.

Wang, Wei, David Rothschild, Sharad Goel, and Andrew Gelman. 2015. “Forecasting Elections with Non-Representative Polls.” International Journal of Forecasting 31 (3): 980–91. https://doi.org/10.1016/j.ijforecast.2014.06.001.

Zagheni, Emilio, Ingmar Weber, and Krishna Gummadi. 2017. “Leveraging Facebookʼs Advertising Platform to Monitor Stocks of Migrants.” Population and Development Review 43 (4): 721–34. https://doi.org/10.1111/padr.12102.