Policy Research Working Paper 8594

Training to Teach Science: Experimental Evidence from Argentina

Facundo Albornoz, María Victoria Anauati, Melina Furman, Mariana Luzuriaga, María Eugenia Podestá, and Inés Taylor

Development Economics Vice Presidency, Strategy and Operations Team
September 2018

Abstract

This paper evaluates the learning impact of different teacher training methods using a randomized controlled trial implemented in 70 state schools in Buenos Aires, Argentina. A control group receiving standard teacher training was compared with two alternative treatment arms: providing a structured curriculum unit, or receiving both the unit and weekly coaching. Following a 12-week intervention, there are substantial learning gains for students whose teachers were trained using structured curriculum units, as well as for those whose teachers received coaching (between 55 percent and 64 percent of a standard deviation more than students in the control group). Coaching teachers does not appear to be cost-effective, as the unit cost per 0.1 standard deviation is more than twice the cost of using only the structured curriculum unit. However, additional coaching is particularly beneficial for inexperienced teachers with less than two years of teaching Science. Coaching teachers also showed specific gains for girls, who both learned more and declared that they enjoyed Science lessons more. Higher-performing students especially benefited from both interventions, with students of coached teachers performing particularly well on harder questions. Using structured curriculum units and providing coaching also affected teacher perceptions: teachers expressed that they enjoyed teaching Science more, that they taught more hours of Science, and that their students developed more skills. Results from a follow-up survey suggest persistent change in teacher practice, with the vast majority reporting using the structured curriculum unit one year after the intervention.

This paper is a product of the Strategy and Operations Team, Development Economics Vice Presidency. It is part of a larger effort by the World Bank to provide open access to its research and make a contribution to development policy discussions around the world. Policy Research Working Papers are also posted on the Web at http://www.worldbank.org/research. The authors may be contacted at facundo.albornoz@nottingham.ac.uk.

The Policy Research Working Paper Series disseminates the findings of work in progress to encourage the exchange of ideas about development issues. An objective of the series is to get the findings out quickly, even if the presentations are less than fully polished. The papers carry the names of the authors and should be cited accordingly. The findings, interpretations, and conclusions expressed in this paper are entirely those of the authors. They do not necessarily represent the views of the International Bank for Reconstruction and Development/World Bank and its affiliated organizations, or those of the Executive Directors of the World Bank or the governments they represent. Produced by the Research Support Team.

JEL classification: C93, I21, I28
Keywords: Science education; teacher training; experimental study

Facundo Albornoz (corresponding author) is a Professor of Economics at the University of Nottingham, UK, and a researcher at CONICET; his email address is facundo.albornoz@nottingham.ac.uk. María Victoria Anauati is a doctoral student at CONICET, Argentina; her email address is victoria.anauati@gmail.com. Melina Furman is an Assistant Professor at Universidad de San Andrés and a researcher at CONICET, Argentina; her email address is mfurman@udesa.edu.ar.
Mariana Luzuriaga is a Research Assistant at the Universidad de San Andrés, Argentina; her email address is mluzuriaga@udesa.edu.ar. María Eugenia Podestá is Director of the Science Education Program at the Universidad de San Andrés, Argentina; her email address is mepodesta@udesa.edu.ar. Inés Taylor is a Research Assistant at the Universidad de San Andrés, Argentina; her email address is itaylor@udesa.edu.ar. The research for this article was financed by CIAESA (Centro de Investigación Aplicada en Educación San Andrés). We are very grateful to Samuel Berlinski, Guillermo Cruces, Alejandro Ganimian, Lucila Minvielle, and Abhijeet Singh for comments and suggestions.

1. Introduction

Teacher training programs are ubiquitous across educational systems and constitute an essential tool to improve student learning and, thus, to promote economic growth and development. Surprisingly, however, current approaches to teacher training are largely uninformed by high-quality evidence of their impact (Yoon et al. 2007). This is a serious issue, especially because the different ways that programs can be designed and implemented involve substantial variation in costs. In Latin American countries, for example, total investment in teacher training represents a major element of non-salary public spending in education, but there are no rigorous evaluations of its impact on learning (Bruns and Luque 2014), let alone of the cost-effectiveness of different program designs. How to design cost-effective teacher training programs is thus a central question of education policy.

This paper provides experimental evidence on the impact and cost-effectiveness of different teacher professional development interventions on student Science learning, drawing on a specifically designed large-scale study implemented in state primary schools in the Autonomous City of Buenos Aires, Argentina (hereafter CABA, for its Spanish acronym).
While the experiment is specific to the teaching of Science in Argentina, the results may have broad relevance for other curriculum subjects and contexts. A typical training program consists of a one-off short training session (Darling-Hammond et al. 2009). This field experiment assesses the marginal gain of complementing this basic training with two distinct teacher training models that provide different degrees of ongoing scaffolding. The first treatment is the provision of a structured curriculum unit (SC henceforth): a detailed teaching guide comprising lesson-by-lesson plans that provide teachers with objectives, content knowledge, and specific activities to implement with their students. The second treatment supplements the short training session and the SC unit with weekly coaching. This allows us to study and compare a basic one-off teacher training session with two distinct follow-up models, each with different degrees of support and associated costs.

More specifically, this paper reports the main findings and associated policy lessons of a randomized controlled experiment designed to assess the effect of different working modalities with in-service teachers on student learning in Science. As primary education is considered of key importance to lay the foundations of scientific literacy (Novak 2005; Näslund-Hadley and Bando 2016), this study focuses on seventh grade, the last level of primary school in CABA. The study involves 70 schools, which constitute a representative sample of CABA state primary schools.

Although the study seeks to provide experimental evidence on teacher training in general, Science as a curriculum area has its own specific merit. Over the past decades, many governments and international organizations have advocated Science, Technology, Engineering, and Math (STEM) subjects and degrees to promote economic growth in a context of highly technological and rapidly changing societies and jobs.
The promotion of scientific literacy has also been emphasized by standardized international student assessment programs such as the Program for International Student Assessment (PISA) (OECD 2016).

The interest in this specific educational setting is easy to explain. Argentina, like the rest of Latin America, is an ideal setting in which to study the effect of different teacher training strategies for Science. Despite several government initiatives aimed at encouraging Science education (see, e.g., Serra 2001; Argentine Ministry of Education and Sports 2007), the performance of Argentine students in standardized assessments is still poor (Vegas, Ganimian, and Bos 2014; UNESCO 2016). Even in CABA, the best-performing Argentine district, 41 percent of students achieved only the minimum level in Science, placing them among the lowest-performing groups in the world (Martin et al. 2016; OECD 2016).

For this study, the participating schools were randomly assigned to three groups. All teachers in the three groups received a short-term training session. Besides being widely used in other countries, this approach to teacher professional development is also the most common one in Argentina (Argentine National Institute of Teacher Training 2016). Teachers who received only this short-term training form the Control Group. However, the literature indicates that gains in student achievement are weak and can only be observed in longer training interventions with ongoing support (Yoon et al. 2007).[1] In a recent review, Arancibia, Popova, and Evans (2016) conclude that only a few characteristics of teacher training programs, such as the inclusion of supplemental materials, follow-up visits, and a focus on a specific subject, are positively associated with student test score gains. Following this, as well as other indicators of best practice, both treatment arms include such characteristics.
The second group of teachers (Sequence Group henceforth) received the same short-term training, complemented with ongoing support through the use of a structured curriculum unit, which guided teachers in the organization, content, and pedagogy of a given topic. Research shows that well-designed structured curriculum units can enhance training sessions by providing concrete ways of taking the approaches learned in training directly to the classroom and by serving as catalysts for local customization (Brown 2009). Developing curriculum units is a key strategy followed by the Argentine education authorities as part of their efforts to improve teaching (Educ.ar 2005; Argentine Ministry of Education and Sports 2017). However, the literature also highlights challenges associated with the use of structured curriculum units. In some cases, teachers adapt these units, making the lessons easier and more aligned with their regular practice, which in turn lowers their cognitive load (Davis, Janssen, and Van Driel 2016). Additionally, many factors may influence how and why teachers choose to adapt curriculum units, such as their previous teaching experience, knowledge, and beliefs about science and education, among others (Forbes and Davis 2010; Arias et al. 2016). One way to bridge the gap between structured curriculum units and the classroom, and to help teachers truly understand the rationale behind each activity proposed in such units, is to provide teachers with pedagogical support (Kraft and Blazar 2016).

[1] Some studies even show that a minimum of 50 or even 80 hours of training and continuous post-training support are required to observe any result (Gulamhussein 2013).
Thus, in the third group (Coaches Group henceforth), teachers received the same short-term training and structured curriculum unit as the Sequence Group, with the addition of individual tutoring from pedagogical coaches. Coaches worked with teachers on a weekly basis to promote a full understanding of the nature of the activities proposed in the curriculum unit, and provided extra support, explanations, and feedback depending on each teacher's needs. The literature shows that coaches seem to increase the fidelity of implementation and improve teacher and student performance (Kretlow and Bartholomew 2010). Based on effect sizes reported in 44 studies that used experimental or quasi-experimental designs, Kraft, Blazar, and Hogan (2016) estimate that coaching raised student performance on standardized tests by 0.15 standard deviations and improved instructional practice by 0.58 standard deviations. This effect compares favorably with the larger body of literature on teacher training (Yoon et al. 2007; Garet et al. 2011). Comparing these two treatment groups with the Control Group allows us to confidently establish the marginal effect of complementing training sessions with either a structured curriculum unit alone or with additional coaching.

The first set of results clearly suggests a gain in terms of learning. Specifically, students in the Sequence Group and the Coaches Group learned 55 percent and 64 percent of a standard deviation more, respectively, than students in the Control Group. This is equivalent to an average increase in student achievement from approximately the 50th to the 66th (70th) percentile for a student moved from the control condition to the structured curriculum (coaches) condition. The marginal costs of doing so are also relatively low compared to the benefits.
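As a rough guide to the magnitudes discussed in this section: an effect size in standard deviations can be converted into an approximate percentile shift with the standard normal CDF, and cost-effectiveness can be summarized as cost per 0.1 standard deviations. The sketch below illustrates both calculations; the per-student cost inputs are assumed values chosen to reproduce the reported ratios, not figures taken directly from the paper, and percentiles computed under normality need not match the paper's reported values exactly.

```python
from math import erf, sqrt

def percentile_after_gain(effect_size_sd, start_percentile=50.0):
    """New percentile for a student who starts at start_percentile and
    gains effect_size_sd standard deviations, assuming normal scores."""
    # Invert the starting percentile to a z-score by bisection
    # (avoids external dependencies such as scipy.stats.norm.ppf).
    lo, hi, target = -10.0, 10.0, start_percentile / 100.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if 0.5 * (1 + erf(mid / sqrt(2))) < target else (lo, mid)
    z = (lo + hi) / 2
    return 100.0 * 0.5 * (1 + erf((z + effect_size_sd) / sqrt(2)))

def cost_per_01_sd(per_student_cost, effect_size_sd):
    """Per-student cost of each 0.1 SD of learning gain."""
    return per_student_cost / (effect_size_sd / 0.1)

# Assumed per-student costs (~4.62 and ~14.59 USD) chosen to reproduce
# the reported ratios of ~0.84 and ~2.28 USD per 0.1 SD.
print(round(percentile_after_gain(0.55)))     # median student, sequence arm
print(round(cost_per_01_sd(4.62, 0.55), 2))   # ≈ 0.84
print(round(cost_per_01_sd(14.59, 0.64), 2))  # ≈ 2.28
```

The per-0.1-SD framing is what makes the two arms directly comparable: dividing cost by the number of 0.1 SD units of impact normalizes interventions with different effect sizes onto a common scale.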
Complementing training sessions with a structured curriculum unit costs 0.84 dollars per student per 0.1 standard deviations; in other words, it costs 0.84 dollars to move a child from approximately the 50th to the 53rd percentile. Complementing this intervention with additional coaching costs 2.28 dollars per student per 0.1 standard deviations.

Empirically establishing the additional effect of coaching relative to a structured curriculum unit alone is also relevant for policy. Although coaches generally seem to increase the impact of teacher professional development policies, hiring, training, and providing coaches is an expensive and human-resource-intensive approach.[2] According to the results, there is no general additional benefit in terms of student learning from providing ongoing coaching compared to the structured curriculum unit on its own. However, qualifying this result is another contribution of this paper: additional coaching does make a difference for relatively inexperienced teachers. Specifically, among the least-experienced teachers, students in the Coaches Group learned 82 percent of a standard deviation more than students in the Sequence Group. Tutors therefore add value for teachers who are relatively inexperienced in teaching Science, particularly for higher-order skills, which require more intensive teaching. This suggests that improving the teaching of Science is not a matter of choosing the best strategy, but rather the one that best suits the specific teachers and learning goals in question.

Finally, a potential long-lasting effect of teacher training requires that the targeted teachers adopt the structured sequence and change their way of teaching. Thus, the obvious follow-up question is whether teachers in the treatment groups continued using the sequence a year after the training, when teaching the same topic.
To answer this question, participating teachers in the Sequence and Coaches Groups were contacted after the intervention and asked whether they had continued using the sequence provided the prior year, even though this time their students would not be externally assessed. From this follow-up, almost every "treated" seventh-grade Science teacher who remained in the same school continued using the sequence (100 percent in the Sequence Group and 89 percent in the Coaches Group, respectively). This is an encouraging finding, as it suggests that the training produced persistent effects on teaching practices.

[2] In Argentina, exact figures are not publicly available, but many in-service teacher professional development programs, in particular those which provide support for rural or non-central provinces, include and finance the training and deployment of coaches. For instance, a recent national initiative involved the hiring of coaches to support the work of 800,000 teachers (Argentine Ministry of Education 2015).

The effect of teacher training on the learning experience goes beyond test scores. Its impact on other dimensions of the learning process is of independent interest, and this study sheds further light on how the effective adoption of evidence-based Science teaching techniques affects the perceptions of students and teachers. The structured curriculum unit appears to be an effective instrument for enhancing curiosity and interest among students. In particular, on an index that captures these aspects, pupils in the Sequence Group score 20 percent of a standard deviation higher than those in the Control Group. Results also show that both treatments favorably changed teacher perceptions of their practices and their expectations of student learning.
Compared to the Control Group, teachers in the Sequence and Coaches Groups score between 63 percent and 100 percent higher on indices capturing the perception that their teaching practices changed meaningfully, that they enjoyed teaching Science more, that they taught more hours of Science, and that their students learned more and developed more skills. Coaching also showed specific gains for girls, who both learned more and declared that they enjoyed Science lessons more, relative to those assigned to the Sequence Group.

There is a growing body of literature in economics devoted to evaluating the impact of different policy interventions at the school level. Most of this effort has gone into identifying the causal effects of two broad categories of interventions: (a) improving school inputs, such as textbooks or classroom libraries (Glewwe, Kremer, and Moulin 2009; He, Linden, and Margaret 2009;[3] Abeberese, Kumler, and Linden 2014[4]), remedial education and/or assistant teachers (Jacob and Lefgren 2004a; Banerjee et al. 2007), computers and computer-aided instruction (Linden 2008; Barrera-Osorio and Linden 2009; Cristia et al. 2012; Mo et al. 2014; Muralidharan, Singh, and Ganimian 2016; Berlinski and Busso 2017), and other instructional technology, such as flash cards (He, Linden, and MacLeod 2008) or flipcharts (Glewwe et al. 2004); and (b) providing additional educational resources and changing their management, including the effect of voucher programs (Angrist et al. 2002) or lump-sum grants to schools (Das et al. 2013), as well as organizational changes such as curricular design (Harris et al. 2014; De Philippis 2016), reducing class size (Angrist and Lavy 1999; Krueger and Whitmore 2002; Urquiola 2006; Fredriksson, Ockert, and Oosterbeek 2012), group tracking (Duflo, Dupas, and Kremer 2011), enhancing teacher incentives (Glewwe, Ilias, and Kremer 2010; Duflo, Hanna, and Ryan 2012), and providing large-scale assessments to inform improvements in school management and classroom instruction (de Hoyos, Ganimian, and Holland 2017). This study contributes to both strands insofar as training teachers has a direct effect on school inputs, and it identifies and evaluates alternative ways to organize and deliver this training.

The identification of the causal effect of on-the-job or in-service teacher training has received far less attention.[5] Most of this research in education economics uses regression discontinuity strategies to estimate the effect of different training programs. For example, Jacob and Lefgren (2004b) find no effect on student achievement of an in-service training program targeting teachers of Math and Reading in elementary schools located in relatively poor areas of the United States. Angrist and Lavy (2001) estimate the effect of in-service teacher training on achievement in Jerusalem elementary schools. In this case, results are more encouraging: they find that the training program improved test scores by 0.2 to 0.4 standard deviations in secular schools, but seemed to have no effect in religious schools (which were poorly organized).

More closely related to this study, there is an emerging literature on teacher training based on experimental evidence. Bassi, Meghir, and Reynoso (2016) use a randomized controlled trial to estimate the effectiveness of guided instruction methods in underperforming schools in Chile. Teachers in treated schools received detailed classroom guides and scripted material to follow in their lectures (similar to the Sequence intervention in this study). They find that only the most advantaged students within treated schools (students from higher-income families within the lower-income population) benefit from the program, improving test scores by almost 0.2 of a standard deviation. Finally, Cilliers and Taylor (2017) conduct a randomized evaluation of two interventions in South Africa aimed at improving early-grade Reading. As in this study, both interventions involved a short training session, but one of the groups received additional coaching sessions. They find that only the intervention complemented with coaching had an impact on reading proficiency (about 0.25 standard deviations relative to the Control Group).

[3] He, Linden, and Margaret (2009) assessed a program that consisted of two main components: the child library and the activities carried out in class, which included using story books, flash cards for word and letter recognition, and charts to instruct children.
[4] The main component of the program evaluated by these authors was providing schools with a set of age-appropriate books. This component was completed with training teachers to incorporate Reading into the curriculum, with a 31-day "read-a-thon" to encourage children to read, and with support for teachers as they incorporated Reading into their classes.
[5] There are a number of papers in the education literature studying the effect of on-the-job or in-service teacher training programs. This literature has been recently reviewed by McEwen (2015), who concludes that most of these studies do not identify the pure effect of training, as it usually overlaps with other types of treatments, such as class size reductions or other institutional changes. Also, these papers are based on small-scale studies. An example of an RCT on the effect of teacher training in Science is Sloan (1993), which involved a sample of 173 students and whose positive results were later discarded by Yoon et al. (2007) for not addressing clustering and multiple outcomes.
The remainder of the paper is organized as follows: section 2 presents the research context; section 3 explains the design of the experiment, describes the components of the intervention, and explains the data collection process; section 4 presents the research sample; section 5 presents descriptive statistics for the main variables; section 6 discusses the identification strategy; and section 7 shows the main results of the paper. Finally, section 8 concludes and reflects on the implications for educational policy.

2. Research Context

In Argentina, education from primary school through high school is compulsory and free of charge. The country has one of the highest rates of literacy (98 percent) and school-life expectancy (16 years) in the world (World Bank 2014). Although attendance and completion at the secondary level remain an issue, primary education is considered universal.

According to official statistics, Argentina has 11 million students enrolled in four education levels: preschool (ages 3–5, 15.6 percent); primary (ages 6–11, 41 percent); secondary (ages 12–17, 35.5 percent); and tertiary (ages 18–22, 7.9 percent). The majority of these students (71 percent) attend public schools (DiNIECE 2015).[6] Between 2003 and 2013, student numbers increased by approximately 10 percent, while the number of teachers increased by more than 20 percent over roughly the same period (DiNIECE 2004, 2015). This allowed Argentina to reach a pupil-teacher ratio of 11, the lowest in Latin America after Cuba (OECD 2016), although this ratio varies considerably across provinces. Although there have been large successes in terms of increasing coverage, the Argentine education system fails to provide high-quality education (at least as measured by standardized test scores).
While other countries in the region have improved learning outcomes since 2000, as measured by the OECD's Program for International Student Assessment (PISA), Argentina's scores show no progress (at best) in Science, or even a marginal decline between 2000 and 2012.[7] According to the 2012 study,[8] Argentina ranked among the lowest of the participating countries (59 out of a possible 65) (OECD 2014).[9] CABA, the wealthiest jurisdiction in Argentina, exhibits some specific features; namely, a lower share of students attending public schools (49 percent) and higher levels of student achievement (DiNIECE 2015). Despite this, international assessments show that the achievement of CABA students in Science is still well below the OECD average (OECD 2016).

[6] These figures do not include special and adult education.
[7] According to De Hoyos, Holland, and Troiano (2015), there is a gradual increase in Argentina's Science scores between 2006 and 2012, but it is not statistically significant at conventional levels (95 percent).
[8] Argentina participated in PISA 2015, but its results were excluded from the main report due to problems with sample design. However, CABA participated as an adjudicated region and was included in the results.
[9] In a similar vein, results of the UNESCO Second Regional Comparative and Explanatory Study (SERCE), applied to third- and sixth-grade students, show that only 11.4 percent of sixth-grade students were able to explain everyday situations based on scientific evidence, use models to explain natural phenomena, or draw conclusions based on data (UNESCO 2009).
Problems with teaching and learning Science in CABA were also highlighted by the latest wave of the Trends in International Mathematics and Science Study (TIMSS), according to which CABA students (fourth and eighth graders) place at the bottom of the world ranking, just above Egypt and South Africa (Martin et al. 2016). These results are not surprising given the reality of Science education in Argentina, where lessons are mostly teacher-centered and focused on the transmission of encyclopedic content, far from competency-based international learning standards such as those assessed by TIMSS and PISA (Argentine Ministry of Education 2007). Countries with high levels of scientific literacy tend to implement inquiry-based approaches that position students as active knowledge producers in a classroom community of practice, placing importance on the development of specific Science skills and deep understanding (OECD 2016).

3. Experimental Design

A randomized controlled experiment was carried out to assess the effect of different teacher professional development approaches on student learning in Science. The intervention focused on a compulsory unit of the seventh-grade Science national curriculum: the Human Body. The intervention consisted of a random allocation of 70 CABA state primary schools to one of three experimental groups (Appendix 1); thus, the unit of randomization was the school.[10] All participating teachers received one in-service four-hour training session and were then asked to teach the Human Body unit according to national curriculum guidelines over the following 12 weeks.
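A school-level random allocation of this kind can be sketched as follows. This is a simple balanced draw for illustration only; the study does not describe its exact randomization procedure, which may, for example, have stratified on school characteristics.

```python
import random

ARMS = ("Control", "Sequence", "Coaches")

def assign_schools(school_ids, arms=ARMS, seed=0):
    """Randomly allocate schools (the unit of randomization) to arms,
    keeping arm sizes as balanced as possible."""
    rng = random.Random(seed)  # fixed seed for a reproducible draw
    shuffled = list(school_ids)
    rng.shuffle(shuffled)
    return {s: arms[i % len(arms)] for i, s in enumerate(shuffled)}

# 70 schools, as in the study: arms end up with 24/23/23 schools.
assignment = assign_schools(range(1, 71))
```

Randomizing whole schools rather than classrooms sacrifices some statistical power but, as the paper notes, was operationally necessary because many schools shared a Science teacher across classrooms.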
During the training session, teachers discussed and took part in inquiry-based activities related to the teaching of the Human Body topic.[11] The session was designed and run by specialists in Science education at the School of Education, University of San Andrés (Argentina). Teachers receiving only this training form the Control Group.

In the first treatment group (Sequence Group), teachers received the same four-hour training session and a structured curriculum unit that outlined how to teach the Human Body using an inquiry-based approach.[12] The structured curriculum unit focused not only on Human Body content, but also on the development of Science competencies, defined as the ability to explain phenomena scientifically, evaluate and design scientific inquiry, and interpret data and evidence scientifically (OECD 2016). The document included experiential learning activities, which are a departure from more common and traditional teaching methods,[13] along with questions, approaches, and worksheets for students. Science education specialists designed the structured curriculum unit in consultation with a group of seventh-grade teachers who were not part of the schools selected for the study. Teachers were expected to adapt and implement these activities in their classrooms over the following 12 weeks.

In the second treatment group (Coaches Group), teachers also received the same four-hour training session and structured curriculum unit, but their training was complemented with weekly sessions with a pedagogical coach.

[10] The randomization was at the school level and not at the classroom level because 46 percent of the schools in our sample shared the same Science teacher for at least one of their classrooms. Therefore, assigning classrooms to different treatments was operationally impossible.
The coaches met with teachers at their schools during planning periods, in 60-minute sessions over 12 weeks, with the aim of guiding and supporting teachers in implementing the structured curriculum unit, as well as enhancing teacher reflection on their practice. The pedagogical coaches were recruited by the School of Education, University of San Andrés. They were selected based on their knowledge and prior experience in Science education, as well as their potential to create a positive working relationship with participating teachers. They all held at least a bachelor's degree in Science and/or a pedagogical certification (Table 2.1 of Appendix 2 reports their main demographic characteristics). In addition, coaches received training sessions every fortnight throughout the intervention (a total of eight three-hour meetings) and were given access to an extensive library of guiding documents and resources to support their work.

[11] Details on the specific activities carried out during this session are available upon request.
[12] Given that inquiry-based pedagogies have been shown to promote Science competencies (Minner, Levy, and Century 2010), this approach was chosen for this unit, with a particular emphasis on active learning.
[13] Examples of these activities include investigating changes in heart rate, measuring lung capacity, dissecting organs, and evaluating historical experiments. The structured curriculum unit designed for this study is available upon request.

Design of the assessment instrument

A central part of the design of any experiment is determining the outcome measure, which in this case is student achievement in Science. Together with university specialists in Science education, an assessment instrument (hereafter the "Science test") was developed to measure learning and to distinguish the gains from the different working modalities with teachers.
First, following an in-depth analysis of the Human Body unit, the topics included in the Science test were determined. In addition, three levels of skills were outlined: (a) basic skills, which required students to recall scientific content (such as identifying organs) and read simple tables and graphs, (b) medium-order skills, which required students to explain scientific phenomena and develop conclusions based on simple experimental data, and (c) higher-order skills, which required students to describe how different body systems work together, identify researchable questions, design experiments to address a hypothesis, explain scientific phenomena, and draw conclusions based on more complex experimental data (Appendix 3 details the differences between these skills). The Science test was developed using the following procedure: (a) a pool of items for basic, medium-, and higher-order content areas was created following the structure of PISA and TIMSS Science questions, (b) experts reviewed the items, (c) the test was piloted in two seventh-grade classrooms at schools not participating in the project, and think-aloud exercises were performed with students to better understand their answers and make adjustments, and (d) a panel of experts reviewed the final assessment instrument. The outcome of this process was an 11-item Science test of approximately one hour’s duration. It consisted of both multiple-choice and open-ended questions. This combination allowed evaluators to capture a wider range of student responses, including stronger evidence of critical thinking skills, than is typically associated with multiple-choice tests alone (Stanger-Hall 2012). The test was administered at the end of the intervention at each school by external observers to guarantee the fidelity of its implementation under strict exam conditions. The test had sound psychometric properties.
The scale reliability coefficient (Cronbach’s alpha) is 0.79 in the full sample data and 0.76 in the Control Group. The test also shows a statistically significant correlation of 0.37 across schools with the Language score in the local end-of-primary exam of seventh grade (FEPBA, for its Spanish acronym).14 The Science test questions were weighted according to difficulty, with higher-order questions scoring three points, medium-order questions scoring two points, and basic-skills questions scoring one point. Answers were classified as “Correct,” for which they achieved full marks; “Partially correct,” for which they achieved half of the maximum marks for the given question; “Incorrect,” for which no marks were given; and “Omitted,” when no answer was given, for which no marks were given either. Specialists corrected all tests using a common rubric, which was shared and discussed during a half-day training session. Answers that were challenging to classify were discussed and determined by multiple assessors.

Experimental Data

Data was collected on all students, teachers, and schools. After the intervention, a student survey was conducted to collect socio-demographic data in order to check whether the randomly created groups of schools were comparable.15 The Science test was administered after schools had completed their 12-week intervention, followed by a student questionnaire designed to measure fidelity of implementation, students’ perceptions of the teaching they had experienced, and their attitudes toward Science.

14 The FEPBA test was prepared and administered to seventh-grade students of both private and public schools in CABA by the Ministry of Education of CABA in 2014.
These questions were later used to build an index that measured whether learning was interesting and relevant, as well as whether teaching practices inspired curiosity (see Appendix 4 for a detailed description of the index).16 These questions were based on a validated instrument, the Tripod Survey for Students (MET Project 2012). Before and after the intervention took place, teachers provided further information. As a baseline, a survey gathered background characteristics of teachers and general information about their Science class. In the post-intervention survey, teachers responded to a set of questions designed to assess the fidelity of implementation of the intervention, as well as perceived changes in class dynamics and teaching practices. Finally, administrative information was collected at the school level. This information included data on school and seventh-grade enrollment, number of classrooms and teachers, repetition rate, promotion rate, over-aged rate, location of the schools, and the Language score in the local end-of-primary exam of seventh grade (FEPBA).

4. The Sample

The sample consisted of 70 state primary schools from six (out of 21) school districts within CABA, giving a representative sample of state primary schools in the jurisdiction.17 These 70 schools involved about 3,000 students, grouped into 136 seventh-grade classes, and 99 Science teachers (table 1). Participating schools were randomized into a Control Group of 24 schools and two treatment groups, each composed of 23 schools.

15 The student survey was collected in the classroom and contained information on students’ characteristics, their family, and their socioeconomic background.
16 See, for example, the Measures of Effective Teaching (MET) project, where results show that student surveys produce more consistent results than classroom observations or achievement gain measures (MET Project 2012).
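As a rough illustration of the school-level assignment just described, the sketch below draws a 24/23/23 split over 70 schools. The school identifiers and random seed are hypothetical, not the study's actual procedure:

```python
import random

# Hypothetical sketch: 70 schools are shuffled and split into a Control Group
# (24 schools) and two treatment arms (23 schools each). Randomizing at the
# school level avoids assigning classrooms that share a Science teacher to
# different arms.
schools = [f"school_{i:02d}" for i in range(70)]  # placeholder identifiers
rng = random.Random(2015)  # illustrative seed, not from the study

shuffled = schools[:]
rng.shuffle(shuffled)
assignment = {
    "Control": shuffled[:24],
    "Sequence": shuffled[24:47],
    "Coaches": shuffled[47:70],
}

sizes = {arm: len(group) for arm, group in assignment.items()}
# sizes == {"Control": 24, "Sequence": 23, "Coaches": 23}
```

Each school appears in exactly one arm, matching the group sizes reported in table 1.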
On average, schools in the experiment were comparable to the rest of the primary state schools throughout CABA. The average characteristics of the 70 participating schools were also compared with those of the non-participating primary state schools in CABA (see Table 5.1 of Appendix 5). As shown, there are no statistically significant differences between participating and non-participating schools in their size, seventh-grade size, and number of seventh-grade classrooms, or in students’ promotion rate, over-aged rate, and dropout rate per school.18 There is only one statistically significant difference, at the 90 percent confidence level, in students’ repetition rate per school, which is slightly higher for the schools included in the experiment. However, there are no statistically different results in the FEPBA scores. The distributions of study participants’ and non-study participants’ scores also share a substantial common support (see Figure 5.1 of Appendix 5). Finally, there are no statistically significant differences in the Social Vulnerability Index at the school district level, which ranks households in each school district according to their degree of vulnerability in terms of material and non-material assets.19 Based on these results, we can confidently state that the participating schools constitute a representative sample of CABA primary state schools.

5. Randomization and Descriptive Statistics

Pre-treatment main sample means and standard deviations for the full sample and experimental groups show that half of the students in the research sample are female and on average they are approximately 12 years old (see table 2). The majority were born in Argentina (86 percent). In relation to their socioeconomic background, approximately 70 percent of the students have parents with secondary education. In addition, 90 percent of the students have internet access at home, and 59 percent and 64 percent of them have at least one air conditioner and one car in their homes, respectively. Finally, about 65 percent of students missed, at most, one class per month since the beginning of this investigation (see Appendix 6 for more details on other variables). In terms of participating teachers, 88 percent of them are female, with an average age of 42 years (see table 2). Nearly 45 percent of teachers have a postgraduate certificate and, on average, they have about 12 years of teaching experience and 6.5 years of experience teaching Science. Almost all of the teachers had participated in some form of teacher training in the past two years, and half of them had never used a structured curriculum unit. Schools enroll an average of 301 students, with 42 students in seventh grade, most commonly divided into two classrooms (see table 2). The repetition rate in seventh grade is 3 percent on average, and the school over-aged and promotion rates are on average 15 percent and 97 percent, respectively. The average FEPBA score in Language, 448, is also reported.20

17 Currently, there are 455 state primary schools in CABA. Thus, the share of schools included in our sample is 15 percent.
18 Over-aged students are those who are older than the normal age for a grade level, as defined by law.
19 The Social Vulnerability Index (SVI) is a weighted index, calculated by the Ministry of Education of CABA, which assigns a value to each household according to its characteristics with respect to material and non-material assets. In this way, households are ranked according to their degree of vulnerability. Households with the highest vulnerability take the value of 1 in the index, while those with the lowest vulnerability take the value of 0.
We do not discuss the meaning of this score here, but use this variable with the sole intention of comparing student academic performance across our groups of CABA schools. With regard to whether the experimental groups were similar to each other and representative, the differences in means, along with p-values from two-tailed t-tests of equality of means across experimental groups, are shown in table 3. As seen, the treatment and Control Groups do not differ significantly in any observable dimension. The only variable with a statistical difference at the 95 percent confidence level is student age, where Control Group students are slightly younger than those in the Sequence Group. However, this difference is very small and vanishes when considering the seventh-grade repetition rate, which is balanced across the three experimental groups (see Appendix 6 for further differences in the means of other variables). Only 14.5 percent, 15.3 percent, and 17 percent of students in the Control, Sequence, and Coaches Groups, respectively, did not complete the Human Body test. These are relatively low non-response rates, and there is no statistically significant difference in the number of students who missed or omitted the test across the experimental groups (see table 4). Finally, only seven out of the 136 classrooms failed to teach the Human Body unit, which implies an attrition rate of 5 percent. Excluding classrooms that did not complete the Human Body unit has no effect on the balance across the experimental groups (see Table 7.1 in Supplementary Appendix 7).

6. Identification Strategy

20 It is important to note that there is no school-specific measure of the Science knowledge of sixth- or seventh-grade students available in Argentina or CABA. Therefore, to the best of our knowledge, the FEPBA score in Language is the best approximation, based on administrative data, that we can make.
Our goal is to understand how using a structured curriculum or receiving, as a complement, weekly coaching meetings can influence learning outcomes in a randomized controlled experimental setting. In this setting, the Control Group estimates what would have happened to the treated groups in the absence of the intervention. The validity of the Control Group is evaluated by examining the exogeneity of treatment status with respect to the potential outcomes, and by testing whether pre-intervention characteristics of the treatment and Control Groups are reasonably similar. As discussed in section 5, there is a strong similarity across the three experimental groups. The similarity across pre-treatment characteristics is consistent with exogeneity in the allocation of schools to each treatment. When the treatment status is exogenous, estimating the average treatment effects is straightforward. The random assignment of schools to treatment/Control Groups allows us to identify the average treatment effect by simply comparing the means of each of the two treatment groups with respect to the Control Group. Operationally, we estimate by Ordinary Least Squares a set of models of the following form:

$y_{is} = \alpha + \beta T_s + \gamma' X_{is} + \varepsilon_{is}$   (1)

where $i$ indexes students and $s$ indexes schools. $y_{is}$ is the outcome of interest (e.g., student performance in the Human Body test) of student $i$ in school $s$. $T_s$ is a dummy variable indicating treatment status. We also include control variables ($X_{is}$). Specifically, we control for students’ characteristics (gender, age, nationality, parents’ education, whether the student missed at most one class per month, whether the student has internet at home), teacher characteristics (gender, age, years of teaching experience, whether she/he has postgraduate certificates), and school characteristics (school size, seventh-grade size, seventh-grade repetition rate, FEPBA score in Language, school district, or location).
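As an illustrative sketch (not the authors' code) of estimating this kind of school-randomized regression with standard errors clustered at the school level, the following simulates data with a true treatment effect of 0.55 standard deviations and within-school correlation; controls are omitted for brevity, and all numbers are made up:

```python
import numpy as np

# Simulated version of equation (1): y_is = a + b*T_s + e_is, with treatment
# assigned at the school level and errors correlated within schools.
rng = np.random.default_rng(0)
n_schools, n_students = 40, 40
school = np.repeat(np.arange(n_schools), n_students)
treat = np.repeat(rng.permutation([0] * 20 + [1] * 20), n_students)
school_effect = rng.normal(0, 0.3, n_schools)[school]   # within-school correlation
y = 0.55 * treat + school_effect + rng.normal(0, 1, n_schools * n_students)

X = np.column_stack([np.ones_like(y), treat])           # intercept + treatment dummy
beta = np.linalg.lstsq(X, y, rcond=None)[0]             # OLS point estimates
resid = y - X @ beta

# Cluster-robust (CR0) variance: sum outer products of X'e within each school,
# sandwiched between (X'X)^-1 terms.
XtX_inv = np.linalg.inv(X.T @ X)
meat = np.zeros((2, 2))
for s in range(n_schools):
    m = school == s
    g = X[m].T @ resid[m]
    meat += np.outer(g, g)
se = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
# beta[1] is the estimated average treatment effect; se[1] its clustered SE.
```

Clustering at the school level matters here because treatment varies only across schools, so student-level errors within a school cannot be treated as independent.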
The parameter of interest, $\beta$, is the average treatment effect (e.g., the average effect on student performance in the Human Body test of being in the treatment group versus the status quo). Finally, $\varepsilon_{is}$ is the error term. The specifications of the model stated in equation (1) take into account the potential correlation between students’ and teachers’ performance and behavior by clustering the standard errors at the school level (i.e., the unit of randomization). However, the standard error estimates are typically not sensitive to the level of clustering. For expositional reasons, we estimate equation (1) for each treatment separately. The results hold if we include both treatments in the same regression.21

7. Results

Section 7.1 contains the main results regarding the effect of the treatments on student learning. As their effects may depend on how teachers respond to the intervention, section 7.2 explores the average treatment effect conditional on teacher experience. Finally, section 7.3 extends our analysis beyond the test results and explores how the different interventions affected students’ and teachers’ perceptions about learning and teaching Science.

Student Learning

Table 5 shows the mean and standard deviation of the standardized score in the Science test, which was calculated using the mean and standard deviation of the Control Group, the score according to different levels of skills (Basic, Medium-, and Higher-order skills), as well as the percentage of correct, incorrect, and omitted answers across experimental groups. This shows that the average score for the Sequence Group is 0.36 standard deviations higher than the average score for the Control Group, whereas the average score of the Coaches Group is 0.53 standard deviations higher than that of the Control Group.
This implies that both treatments were more effective in promoting middle- and higher-order skill development in students than basic skills. In addition, the percentage of correct answers increased in both treatments, while the percentage of incorrect and omitted answers decreased. With regard to student learning, the dependent variable is the standardized score in the Science test, calculated using the mean and standard deviation of the Control Group (see table 6). This allows for interpreting the coefficients as treatment effects in terms of standard deviations. Columns 1 and 2 of panel A show the effect of the structured curriculum unit (Sequence Group) and the coaches (Coaches Group) in comparison with the Control Group. The estimated coefficients are all statistically significant and positive. The average treatment effect of the structured curriculum is an increase of 55 percent of a standard deviation in Science scores, and the average treatment effect of the coaches is an increase of 64 percent of a standard deviation in Science scores. Thus, students in the Sequence and Coaches Groups learned between 55 percent and 64 percent of a standard deviation more than students in the Control Group. This is equivalent to an average increase in student achievement from the 50th to the 66th percentile in the case of the Sequence Group, and from the 50th to approximately the 70th percentile in the case of the Coaches Group. These effects are considered rather large for interventions with similar characteristics (see, e.g., Matsumura et al. 2010; Allen et al. 2011; Campbell and Malkus 2011; Sailors and Price 2010, 2015; Bassi et al. 2016).

21 The results of this estimation are available upon request.
Although the estimated coefficient of the Coaches Group (column 2) is higher than that of the Sequence Group (column 1), the difference between the two is not statistically significant. This is shown in column 3, which presents the result of the Wald test evaluating the difference between the coefficients in columns 1 and 2. This finding is relevant in terms of policy, as it suggests that the implementation of a structured curriculum alone is sufficient to improve short-term average learning outcomes in Science. Panel B of table 6 reports the same analysis but splits the score according to different levels of skills (Basic, Medium-, and Higher-order skills). The findings are similar: although both treatments improve learning, there is no significant difference in their effects. Interestingly, the average treatment effect of the Coaches Group increases as the content evaluated (or items) becomes more complex (see column 2). This implies that, in these cases, the Coach treatment was more effective in promoting higher-order skill development in students than either the four-hour training session or the provision of the structured curriculum unit. The mechanism through which both the Sequence and Coaches treatments appear to increase Science test scores involves an increase in the percentage of correct answers. Students in the Sequence and Coaches Groups exhibit 10 percent more correct answers than students in the Control Group (panel C). The treatments also reduce the number of omitted and incorrect answers. This suggests that the interventions not only increased Science learning (shown in both the increase of correct answers and the reduction of incorrect ones), but also motivated students to answer more questions.22 These findings are especially important given the large difference in the treatments’ costs.
While the cost per student for the Control Group was 1.4 dollars, for the Sequence Group it was 4.6 dollars and for the Coaches Group it was 14.7 dollars.23 This includes the costs of hiring and training the tutors, teacher seminar materials, as well as curriculum unit design and printing. The estimates of table 6 allow us to calculate the cost-effectiveness of the program. Providing teachers with short-term training complemented with a structured curriculum costs (per student) 0.84 dollars per 0.1 standard deviations, whereas providing teachers with the same short-term training complemented with ongoing coaching through the use of the same structured curriculum costs (per student) 2.28 dollars per 0.1 standard deviations. In other words, it costs 0.84 dollars to move a child from the 50th to approximately the 53rd percentile in the first intervention, and 2.28 dollars in the second intervention. Therefore, providing teachers with a structured curriculum is 2.7 times more cost-effective for the total score than complementing it with ongoing coaching. Even though cost-effectiveness calculations might not be perfectly comparable across programs, in general terms, our calculations are in line with other interventions based on teacher training programs (see, e.g., Banerjee et al. 2007). In order to deepen our analysis, we investigated whether any particular group of students experienced greater gains in test score results. Table 8.1 in Appendix 8 displays separate estimates for students below (first panel, “low performance”) and above (second panel, “high performance”) the mean test score for each group.

22 The results of estimating equation (1) without controls are consistent with those reported in table 6 and are available upon request.
23 Our calculations correspond to 2016 US dollars.
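The cost-effectiveness arithmetic above follows directly from the reported per-student costs and effect sizes. A quick check using the rounded inputs (which yields 2.30 rather than the more precise 2.28 for the Coaches Group, a gap due only to rounding):

```python
# Cost per 0.1 standard deviation of learning gain, per student, using the
# rounded figures reported in the text (2016 US dollars).
cost_per_student = {"Sequence": 4.6, "Coaches": 14.7}
effect_sd = {"Sequence": 0.55, "Coaches": 0.64}

cost_per_01sd = {
    arm: round(0.1 * cost_per_student[arm] / effect_sd[arm], 2)
    for arm in cost_per_student
}
# cost_per_01sd["Sequence"] -> 0.84; cost_per_01sd["Coaches"] -> 2.3
ratio = round(cost_per_01sd["Coaches"] / cost_per_01sd["Sequence"], 1)  # ~2.7
```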
In general, both the use of the structured curriculum unit and the coaches seem to benefit higher-performing students more than their lower-performing peers: in both treatment groups, high-performing students obtained a significantly higher percentage of correct answers and a lower percentage of incorrect and omitted ones. In particular, the gain of high-performing students in the Coaches Group is almost twice the gain of low-performing students in the same group, while the gain of high-performing students in the Sequence Group is 19 percent higher than the gain of low-performing students in the same group. However, although the Coaches Group has a slightly higher impact on test scores than the Sequence Group for high-performing students, there is still no statistically significant difference between these two treatments (see column 3). Another factor of interest is whether learning results differ according to student gender. Separate estimates of the average treatment effects for female and male students are reported in Table 8.2 in Appendix 8. As observed in column 1, there is no difference in the average treatment effect of the Sequence between girls and boys. However, column 2 shows that the average treatment effect of the coaches is higher for girls than for boys. In particular, test scores for girls in this group are almost 33 percent higher than for boys, while the percentage of correct answers is nearly 25 percent higher and the percentage of omitted answers nearly 39 percent lower for girls than for boys. Nevertheless, there is no difference in the average treatment effect of the coaches in comparison with the Sequence for either males or females (see column 3).

The Role of Teachers’ Experience

An important message conveyed by our results is that there is no statistical difference between supporting teachers with a structured curriculum unit and providing them the same unit with a pedagogical coach.
This suggests that the additional learning gain from coaching in Science is weak. We explore in this section whether this result is conditional on teaching experience. To do so, we estimate the following model using OLS:

$y_{is} = \alpha + \beta T_s + \delta (T_s \times E_{is}) + \lambda E_{is} + \gamma' X_{is} + \varepsilon_{is}$   (2)

where, as in equation (1), $i$ indexes students and $s$ indexes schools. $y_{is}$ is the outcome of interest, that is, student scores on the Human Body test. $T_s$ is a dummy variable indicating treatment status. $X_{is}$ represents a set of control variables (students’ characteristics: gender, age, nationality, parents’ education, whether the student missed at most one class per month, whether he/she has internet at home; teacher characteristics: gender, age, general teaching experience (in years), whether she/he has postgraduate certificates; and school characteristics: school size, seventh-grade size, seventh-grade repetition rate, FEPBA score in Language, school district, or location). $T_s \times E_{is}$ is an interaction term between treatment status ($T_s$) and a dichotomous variable ($E_{is}$) that equals 1 if the teacher has less than two years of experience in teaching Science (first quartile in our sample) and zero otherwise; we call this variable “low experience.” Now, our parameters of interest are $\beta$ (the average treatment effect) and $\delta$ (the marginal effect of teaching experience). Finally, $\varepsilon_{is}$ is the error term. From this, the average treatment effect of the Coaches Group versus the Sequence Group is conditioned by teacher experience in Science (see table 7). Specifically, the average treatment effect of the coaches is an increase of 82 percent of a standard deviation in Science scores in comparison with the teaching sequence when considering the least-experienced teachers (column 3). This increase in test scores is considerable and implies, therefore, that coaches add value for teachers who were relatively inexperienced in teaching Science. The specific effect of each treatment is not conditional on the level of experience when compared to the Control Group.
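As a sketch of how the interaction regressor in equation (2) can be constructed (all data values below are hypothetical examples, not study data):

```python
import numpy as np

# Illustrative design matrix for equation (2): treatment dummy, a "low
# experience" indicator (teacher has < 2 years teaching Science), and their
# interaction, whose coefficient captures the extra treatment effect for
# students of inexperienced teachers.
treat = np.array([0, 0, 1, 1, 1, 0])                       # T: school in treatment arm
years_science = np.array([5.0, 1.0, 0.5, 8.0, 1.5, 3.0])   # teacher's Science experience
low_exp = (years_science < 2).astype(int)                  # E: low-experience indicator
interaction = treat * low_exp                              # T * E

X = np.column_stack([np.ones_like(treat), treat, low_exp, interaction])
# Columns: intercept; average treatment effect; experience main effect;
# marginal effect of the treatment for low-experience teachers.
```

Regressing the test score on these columns (plus controls) yields the treatment effect and its low-experience increment as separate coefficients.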
Also, comparing the coefficients associated with the Coaches Group in tables 6 and 7, we observe that the effect of the coaches is lower when controlling for experience. The fact that the effect of the coaches is conditional on experience, whereas the effect of the Sequence is not, is consistent with the finding that providing teachers with coaches is more relevant for low-experience teachers. Furthermore, we explore whether Science teaching experience affects the average treatment effect for the higher-order skills that we expect students to develop. This is particularly relevant, as higher-order skills are those that underlie the development of both complex reasoning and scientific competencies, and therefore could be more challenging for teachers to foster in students. The results of estimating equation (2) on the test scores for higher-order skills confirm the finding that coaches add value in comparison with the teaching sequence for the least-experienced teachers, and this holds for the higher-order skills, which often require more intensive teaching (see table 8).

Effects on Perceptions

Looking beyond student test performance, we also analyzed whether the treatments had any effects on student and teacher perceptions of Science lessons. As student motivation and perceptions are associated with learning outcomes (Christophel 1990; MET Project 2012), this is an important issue. The next subsections explore these issues from the perspectives of the students and teachers.

Student Perceptions

Students themselves are a primary source of information on the quality of teaching and the learning environment in individual classrooms (MET Project 2012).
To explore whether our treatments affect student perceptions about learning Science, a “captivate” index was constructed to evaluate whether teaching practices inspired curiosity and interest, and whether teachers were able to hold the students’ attention in class and provide the basis for continuing interest. The construction of this index is explained in Appendix 4. Relating the captivate index (which ranges from 0 to 6)24 to Science test scores shows that classrooms in which students rated their teachers higher on the captivate index also tended to produce greater average achievement gains (see fig. 1). The black line, which shows the statistically significant partial correlation (0.08) between the scores and the captivate index controlling for students’, teachers’, and schools’ characteristics, confirms this relation. The results of estimating equation (1) on the captivate index, as well as on each separate question (with a four-point scale) that composes it, can be seen in columns 1 and 2 of table 9, while column 3 shows the results of the Wald test, which evaluates the difference between the average treatment effects of the Sequence Group and the Coaches Group. The results suggest that the Sequence treatment is an effective instrument for enhancing curiosity and interest among students. With regard to the gender perspective previously considered, we explore whether girls experienced greater gains in the captivate index than boys. Estimating equation (1) with the captivate index as the dependent variable for female and male students shows that female students in the Sequence Group were more interested than their male classmates in comparison with the Control Group (see Table 9.1 in Appendix 9). Indeed, the captivate index for female students is 35 percent of a standard deviation higher than that of the Control Group.
In contrast, the Coaches treatment seems to reduce the captivate index for females in comparison with the Sequence treatment, which does not happen when restricting the sample to male students.

Teacher Perceptions

Finally, this subsection explores the effect of the intervention on how teachers perceived their experience and the effects they observed in their students. For that, a four-point scale25 was constructed to measure the extent to which teachers agreed with the following statements: A. I feel that the way I teach Science changed a lot; B. I liked or enjoyed teaching Science more than in previous years; C. I feel that by implementing the ideas of this training my students learned more in comparison with other groups and/or subjects; D. I feel my students developed more skills than in previous years; and E. I taught more hours of Science classes. The results of estimating the effect of the treatments on these variables, controlling for teacher characteristics, suggest that both the Sequence and Coaches treatments favorably changed teacher perceptions of their practices and their expectations of student learning (see columns 1 and 2 of table 10). Compared to the Control Group, teachers in the Sequence Group (Coaches Group) score 86 percent (95 percent) higher on the scale measuring their perception that their teaching practices meaningfully changed.

24 The value of 0 indicates that the student strongly disagrees with all the questions included in the index (see Appendix 4). This means that, for that student, teacher practices do not inspire curiosity and interest at all, or fail to keep his/her attention in class. In contrast, the value of 6 indicates that the student strongly agrees with all the questions that make up the index, suggesting that, for that student, teacher practices do inspire curiosity and interest or are successful in keeping his/her full attention in class.
In a similar vein, teachers in the Sequence Group (Coaches Group) score 72 percent (97 percent) higher in their perception that they enjoyed teaching Science more, in comparison with teachers in the Control Group. Furthermore, teachers in the Sequence Group (Coaches Group) score between 63 percent and 67 percent (77 percent and 88 percent) higher in their perception that students learned more and developed more skills than teachers in the Control Group. Finally, according to column 3, which shows the results of a test evaluating the difference between the coefficients in columns 1 and 2, teachers in the Coaches Group expressed that they taught more hours of Science than teachers in the Control and Sequence Groups. This is important, as increasing class teaching hours is associated with improvements in learning outcomes (OECD 2016). All of these differences are statistically significant.

Follow-Up

Finally, this study explores the possibility of a longer-run impact of the different training interventions evaluated. As shown above, a structured sequence using an inquiry-based approach has a positive effect on how students learn specific pedagogical content. Although learning of a particular topic is likely to have a limited aggregate impact over time for a typical student, a potential long-lasting effect of teacher training clearly requires that the targeted teachers adopt new teaching practices consistently over time. Thus, the obvious follow-up question is whether teachers in our treatment groups continued using the sequence when teaching the same topic the following year.

25 In the four-point scale, 1 represents “strongly disagree,” 2 “disagree,” 3 “agree,” and 4 “strongly agree.”
To answer this question, participant teachers in the Sequence and Coaches Groups were contacted after the intervention to ask whether they had continued using the sequence provided the previous year, even though this time their students were not going to be externally assessed. In fact, almost every "treated" seventh-grade Science teacher who remained in the same school continued using the sequence (100 percent in the Sequence Group and 89 percent in the Coaches Group) (see table 11). This is an encouraging finding, as it provides evidence suggesting a persistent change in teaching practices. However, between 67 percent and 73 percent of the participant teachers were reassigned to other grades or other schools. This means that teacher turnover may dissipate part of the effect of the training. This issue goes beyond the scope of our paper but increases in relevance when assessing the long-run impact of training initiatives.

8. Conclusion

This study used a randomized controlled trial to assess the impact of different CPD approaches on student Science learning. Seventy participating schools were randomly assigned to one of three conditions: (a) short-term teacher training (Control Group), (b) short-term teacher training complemented with a structured curriculum unit (Sequence Group), and (c) short-term teacher training complemented with both a structured curriculum unit and weekly tutoring from pedagogical coaches (Coaches Group). The study included 2,965 students and 99 teachers in the seventh grade of public schools in CABA, Argentina. The experiment was internally valid and performed on a representative sample of schools of CABA. Providing teachers with a structured curriculum unit increased student performance by 55 percent of a standard deviation compared to a short-term training session.
This finding is consistent with the literature showing that high-quality structured curriculum units are valid instruments for assisting teachers and promoting more effective teaching and learning (Brown 2009). The structured curriculum unit also sparked interest and curiosity among students. Using an index that measures whether learning was interesting and relevant, as well as whether teaching practices inspired curiosity, the results show that students in the Sequence Group presented a scale 20 percent of a standard deviation higher than those in the Control Group. This finding supports research showing that motivation is an essential factor in generating and sustaining student learning (Ercan, Ural, and Ates 2016). Also, students in the Coaches Group learned significantly more than those in the Control Group. Specifically, students whose teachers had a coach learned about 64 percent of a standard deviation more than those in the Control Group. However, there is no general additional benefit in terms of student learning between the structured curriculum unit by itself and the unit with the addition of coaching. Nevertheless, the marginal effect of coaches is statistically significant for teachers who are relatively inexperienced in Science education. Specifically, when considering the least-experienced teachers, students in the Coaches Group learned 82 percent of a standard deviation more than students in the Sequence Group. This is particularly true when focusing on higher-order skills, which may require more targeted and specific teaching. This implies that coaches should be deployed for teachers who have little prior experience in teaching Science, and should focus their support on helping teachers master the teaching of cognitively demanding activities, rather than simply implement basic active-learning strategies (which teachers seem able to pick up on their own just by working with a structured curriculum unit).
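The heterogeneity result above comes from interacting treatment status with a low-experience indicator (see table 7). In notation we adopt here purely for exposition (the symbols are ours, not the paper's), the estimating equation takes the form

```latex
y_{ics} = \alpha + \beta \, T_{s} + \delta \, Low_{cs}
        + \theta \, \left( T_{s} \times Low_{cs} \right) + X_{ics}'\gamma + \varepsilon_{ics}
```

where $y_{ics}$ is the standardized Science score of student $i$ in classroom $c$ of school $s$, $T_{s}$ indicates treatment assignment, $Low_{cs}$ equals 1 when the teacher has less than two years of experience teaching Science, $X_{ics}$ collects the student, teacher, and school controls, and standard errors are clustered at the school level. The coefficient $\theta$ captures the additional treatment effect for inexperienced teachers, e.g., the 0.824 standard deviations for Coaches vs. Sequence in table 7.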
Additionally, the average treatment effect of the coaches is higher for girls than for boys. This is an encouraging finding, as not only girls' participation and achievement in school Science (Liben and Coyle 2014) but also the choice of careers related to Science show a noticeable gender gap (Beede et al. 2011). This result may indicate that coaches help teachers create more gender-inclusive Science lessons, supporting them with strategies that better cater to the active participation of girls. In this sense, one important value of working with coaches would be promoting greater female engagement in Science. Both the structured curriculum unit and the pedagogical coaching improve teacher satisfaction with their practice and their perceptions of student learning. Compared to the Control Group, teachers in the Sequence Group and Coaches Group present a scale between 63 percent and 100 percent higher in their perception that their teaching practices meaningfully changed, that students learned more and developed more skills, that they enjoyed teaching Science more, and that they taught more hours of Science. This last finding is particularly interesting, as a study analyzing the effective class time devoted to Science showed that CABA teachers tend not to meet the required hours of Science (often teaching less than half the jurisdictional requirement of hours, as suggested by Furman et al. [2018]). As more hours of teaching are associated with higher learning outcomes, and in particular with improvements in higher-order skills (OECD 2016), increasing Science teaching hours is a valuable outcome. In this case, our findings support and extend findings that teachers perceive tension between Science and other curricular areas, but that introducing coaches increases the time spent on, and the confidence teachers feel with regard to, teaching Science (Berg and Mensah 2014).
The first policy recommendation that emerges from our study is that short-term teacher training complemented with a structured curriculum unit should be considered a cost-effective CPD intervention to increase student learning in Science. Specifically, complementing training sessions with a structured curriculum unit costs (per student) 0.84 dollars per 0.1 standard deviations, implying that it costs 0.84 dollars to move a child from the 50th to roughly the 54th percentile. The second policy recommendation is that providing additional coaching does improve student scores, but only for teachers who are relatively inexperienced in teaching Science. This finding suggests that experienced teachers already have the pedagogical tool kit that enables them to confidently implement the lessons outlined in the curriculum unit, at least up to a basic level. For less experienced teachers, coaching can bridge the gap between structured lesson plans and the complex world of the actual Science classroom. This suggests that improving teachers' practice in Science is not a matter of choosing the best ("one size fits all") CPD strategy, but of selecting the strategy that best suits the specific population of teachers and the student learning goals being targeted. These are relevant contributions for public policies focused on CPD interventions, as hiring, training, and providing coaches is an expensive and human-resource-intensive approach. Our study shows that providing teachers with a structured curriculum unit is 2.7 times more cost-effective than complementing the unit with ongoing coaching. In all, our study speaks to the need to tailor CPD interventions in order to maximize their effects, based on evidence of what works better and taking into account the cost-effectiveness of each strategy.
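The back-of-the-envelope percentile arithmetic can be checked with the normal CDF. The sketch below is our own illustration (the function names and the normality assumption are ours, not the paper's):

```python
from statistics import NormalDist

def percentile_after_gain(effect_size_sd, start_percentile=50.0):
    """Percentile reached after gaining `effect_size_sd` standard
    deviations, assuming normally distributed test scores."""
    z = NormalDist().inv_cdf(start_percentile / 100.0)
    return 100.0 * NormalDist().cdf(z + effect_size_sd)

def cost_per_student(effect_size_sd, dollars_per_tenth_sd=0.84):
    """Per-student cost of a given gain, at 0.84 dollars per 0.1 SD."""
    return dollars_per_tenth_sd * effect_size_sd / 0.1

gain = percentile_after_gain(0.1)   # a 0.1 SD gain: 50th -> ~54th percentile
cost = cost_per_student(0.55)       # cost of the full 0.55 SD Sequence effect
```

Under this linear scaling, the full 0.55 standard deviation effect of the Sequence treatment would cost roughly 4.6 dollars per student.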
We believe that evidence of this nature is urgent and necessary for the development of effective public policies aimed at promoting students' scientific literacy and thus effective participation in the global knowledge economy.

References

Abeberese, A. B., T. J. Kumler, and L. L. Linden. 2014. "Improving Reading Skills by Encouraging Children to Read in School: A Randomized Evaluation of the Sa Aklat Sisikat Reading Program in the Philippines." Journal of Human Resources 49 (3): 611–33.

Allen, J. P., R. C. Pianta, A. Gregory, A. Y. Mikami, and J. Lun. 2011. "An Interaction-Based Approach to Enhancing Secondary School Instruction and Student Achievement." Science 333: 1034–37.

Angrist, J., and V. Lavy. 2001. "Does Teacher Training Affect Pupil Learning? Evidence from Matched Comparisons in Jerusalem Public Schools." Journal of Labor Economics 19 (2): 343–69.

Angrist, J., and V. Lavy. 1999. "Using Maimonides' Rule to Estimate the Effect of Class Size on Scholastic Achievement." Quarterly Journal of Economics 114 (2): 533–75.

Angrist, J., E. Bettinger, E. Bloom, E. King, and M. Kremer. 2002. "Vouchers for Private Schooling in Colombia: Evidence from a Randomized Natural Experiment." American Economic Review 92 (5): 1535–58.

Arancibia, V., A. Popova, and D. K. Evans. 2016. "Training Teachers on the Job: What Works and How to Measure It." Policy Research Working Paper No. 7834, World Bank, Washington, DC.

Argentine Ministry of Education. 2007. Mejorar la Enseñanza de las Ciencias y las Matemáticas—Una Prioridad Nacional. Buenos Aires: Argentine Ministry of Education. http://repositorio.educacion.gov.ar/dspace/handle/123456789/95085.

Argentine Ministry of Education and Sports. 2017. Aprender 2016. Primer Informe de Resultados. Buenos Aires: Argentine Ministry of Education and Sports.

Argentine Ministry of Education. 2015. Presentación Nuestra Escuela—Programa Nacional de Formación Docente. Buenos Aires: Federal Council for Education.
http://nuestraescuela.educacion.gov.ar/pdf/presentacionnuestraescuela.pdf.

Argentine National Institute of Teacher Training. 2016. Plan Nacional de Formación Docente 2016–2021. Buenos Aires: Argentine National Institute of Teacher Training. http://cedoc.infd.edu.ar/upload/Plan_Nacional_de_Formacion_Docente1.pdf.

Arias, A. M., E. A. Davis, J. C. Marino, S. M. Kademian, and A. S. Palincsar. 2016. "Teachers' Use of Educative Curriculum Materials to Engage Students in Science Practices." International Journal of Science Education 38 (9): 1504–26.

Banerjee, A., S. Cole, E. Duflo, and L. Linden. 2007. "Remedying Education: Evidence from Two Randomized Experiments in India." Quarterly Journal of Economics 122 (3): 1235–64.

Barrera-Osorio, F., and L. L. Linden. 2009. The Use and Misuse of Computers in Education: Evidence from a Randomized Controlled Trial of a Language Arts Program. Cambridge, MA: Abdul Latif Jameel Poverty Action Lab (JPAL). http://www.leighlinden.com/Barrera-Linden%20Computadores_2009-03-25.pdf.

Bassi, M., C. Meghir, and A. Reynoso. 2016. "Education Quality and Teaching Practices." NBER Working Paper No. 22719, National Bureau of Economic Research, Cambridge, MA. http://www.nber.org/papers/w22719.

Beede, D. N., T. A. Julian, D. Langdon, G. McKittrick, B. Khan, and M. E. Doms. 2011. "Women in STEM: A Gender Gap to Innovation." Issue Brief No. 04-11, Economics and Statistics Administration, Washington, DC. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1964782.

Berg, A., and F. M. Mensah. 2014. "De-marginalizing Science in the Elementary Classroom by Coaching Teachers to Address Perceived Dilemmas." Education Policy Analysis Archives 22 (57): 1–57.

Berlinski, S., and M. Busso. 2017. "Challenges in Educational Reform: An Experiment on Active Learning in Mathematics." Economics Letters 156 (C): 172–75.

MET Project. 2012. "Asking Students about Teaching: Student Perception Surveys and Their Implementation. Policy and Practice Brief."
Bill and Melinda Gates Foundation. http://www.metproject.org/downloads/Asking_Students_Practitioner_Brief.pdf.

Brown, M. W. 2009. "The Teacher-Tool Relationship: Theorizing the Design and Use of Curriculum Materials." In Mathematics Teachers at Work: Connecting Curriculum Materials and Classroom Instruction, edited by J. T. Remillard, B. A. Herbel-Eisenmann, and G. M. Lloyd, 37–56. New York: Routledge.

Bruns, B., and J. Luque. 2014. Profesores Excelentes: Cómo Mejorar el Aprendizaje en América Latina y el Caribe. Washington, DC: Banco Mundial. http://www.bancomundial.org/content/dam/Worldbank/Highlights%20&%20Features/lac/LC5/Spanish-excellent-teachers-report.pdf.

Cilliers, J., and S. Taylor. 2017. "Monitoring Teachers and Changing Teaching Practice: Evidence from a Field Experiment." Working Paper, McCourt School of Public Policy, Georgetown University, and Department of Basic Education, South Africa. https://sites.tufts.edu/neudc2017/files/2017/10/paper_110.pdf.

Christophel, D. M. 1990. "The Relationships Among Teacher Immediacy Behaviors, Student Motivation, and Learning." Communication Education 39 (4): 323–40.

Cristia, J. P., P. Ibarrarán, S. Cueto, A. Santiago, and E. Severín. 2012. "Technology and Child Development: Evidence from the One Laptop per Child Program." Working Paper No. IDB-WP-304, Inter-American Development Bank, Washington, DC.

Darling-Hammond, L., R. C. Wei, A. Andree, N. Richardson, and S. Orphanos. 2009. Professional Learning in the Learning Profession. Washington, DC: National Staff Development Council. https://learningforward.org/docs/default-source/pdf/nsdcstudy2009.pdf.

Das, J., S. Dercon, J. Habyarimana, P. Krishnan, K. Muralidharan, and V. Sundararaman. 2013. "When Can School Inputs Improve Test Scores?" American Economic Journal: Applied Economics 5 (2): 29–57.

Davis, E. A., F. J. Janssen, and J. H. Van Driel. 2016.
"Teachers and Science Curriculum Materials: Where We Are and Where We Need to Go." Studies in Science Education 52 (2): 127–60.

De Hoyos, R., A. J. Ganimian, and P. Holland. 2017. "Teaching with the Test: Experimental Evidence on Diagnostic Feedback and Capacity Building for Public Schools in Argentina." Policy Research Working Paper No. 8261, World Bank, Washington, DC.

De Hoyos, R., P. A. Holland, and S. Troiano. 2015. "Understanding the Trends in Learning Outcomes in Argentina, 2000 to 2012." Policy Research Working Paper No. 7518, World Bank, Washington, DC.

De Philippis, M. 2016. "STEM Graduates and Secondary School Curriculum: Does Early Exposure to Science Matter?" CEP Discussion Paper No. 1443, Centre for Economic Performance, London School of Economics and Political Science.

DiNIECE. 2004. Programme for International Student Assessment. Informe Nacional República Argentina. Buenos Aires: DiNIECE. http://repositorio.educacion.gov.ar/dspace/handle/123456789/55289.

DiNIECE. 2015. Anuario Estadístico Educativo. Relevamientos anuales 2007–2014. Buenos Aires: DiNIECE. http://portales.educacion.gov.ar/diniece/.

Duflo, E., R. Hanna, and S. Ryan. 2012. "Incentives Work: Getting Teachers to Come to School." American Economic Review 102 (4): 1241–78.

Duflo, E., P. Dupas, and M. Kremer. 2011. "Peer Effects, Teacher Incentives, and the Impact of Tracking: Evidence from a Randomized Evaluation in Kenya." American Economic Review 101 (5): 1739–74.

Educ.ar. 2005. Anuario-En cifras. Buenos Aires: Educ.ar. http://portal.educ.ar/acercade/anuarios/2006/cifras.html.

Ercan, O., E. Ural, and D. Ateş. 2016. "The Effect of Educational Software Based on Ausubel's Expository Learning on Students' Academic Achievement, Science and Computer Attitudes: 'Human and Environment' Unit Example." British Journal of Education, Society & Behavioural Science 14 (1): 1–10.

Forbes, C. T., and E. A. Davis. 2010.
"Curriculum Design for Inquiry: Preservice Elementary Teachers' Mobilization and Adaptation of Science Curriculum Materials." Journal of Research in Science Teaching 47 (7): 820–39.

Fredriksson, P., B. Ockert, and H. Oosterbeek. 2012. "Long Term Effects of Class Size." Quarterly Journal of Economics 128 (1): 249–85.

Furman, M., M. Luzuriaga, I. Taylor, M. V. Anauati, and M. E. Podestá. 2018. "Abriendo la 'caja negra' del aula de ciencias: un estudio sobre la relación entre las prácticas de enseñanza sobre el cuerpo humano y las capacidades de pensamiento que se promueven en los alumnos de séptimo grado" [Opening the "Black Box" of the Science Classroom: A Study on the Relationship between Teaching Practices and Thinking Skills that Are Promoted in Seventh-Grade Students]. Enseñanza de las Ciencias 36 (2): 81–103.

Garet, M. S., A. J. Wayne, F. Stancavage, J. Taylor, M. Eaton, K. Walters, M. Song, S. Brown, S. Hurlburt, P. Zhu, S. Sepanik, and F. Doolittle. 2011. "Middle School Mathematics Professional Development Impact Study: Findings after the Second Year of Implementation." NCEE 2011-4024, National Center for Education Evaluation and Regional Assistance, Washington, DC.

Glewwe, P., M. Kremer, and S. Moulin. 2009. "Many Children Left Behind? Textbooks and Test Scores in Kenya." American Economic Journal: Applied Economics 1 (1): 112–35.

Glewwe, P., M. Kremer, S. Moulin, and E. Zitzewitz. 2004. "Retrospective vs. Prospective Analyses of School Inputs: The Case of Flip Charts in Kenya." Journal of Development Economics 74 (1): 251–68.

Glewwe, P., N. Ilias, and M. Kremer. 2010. "Teacher Incentives." American Economic Journal: Applied Economics 2 (3): 205–27.

Gulamhussein, A. 2013. "Teaching the Teachers: Effective Professional Development in an Era of High Stakes Accountability." Alexandria, VA: Center for Public Education, National School Boards Association, 1–47.

Harris, C. J., W. R. Penuel, A. DeBarger, C.
D'Angelo, and L. P. Gallagher. 2014. Curriculum Materials Make a Difference for Next Generation Science Learning: Results from Year 1 of a Randomized Controlled Trial. Menlo Park, CA: SRI International. https://www.sri.com/sites/default/files/publications/pbis-efficacy-study-y1-outcomes-report-2014_0.pdf.

He, F., L. Linden, and M. MacLeod. 2009. A Better Way to Teach Children to Read? Evidence from a Randomized Control Trial. New York: Columbia University.

He, F., L. L. Linden, and M. MacLeod. 2008. How to Teach English in India: Testing the Relative Productivity of Instruction Methods with Pratham English Language Education Program. New York: Columbia University.

Jacob, B. A., and L. Lefgren. 2004a. "Remedial Education and Student Achievement: A Regression Discontinuity Approach." Review of Economics and Statistics 86 (1): 226–44.

Jacob, B., and L. Lefgren. 2004b. "The Impact of Teacher Training on Student Achievement: Quasi-Experimental Evidence from School Reform Efforts in Chicago." Journal of Human Resources 39 (1): 50–79.

Kraft, M. A., and D. Blazar. 2016. "Individualized Coaching to Improve Teacher Practice Across Grades and Subjects: New Experimental Evidence." Educational Policy (advance online publication). http://journals.sagepub.com/doi/abs/10.1177/0895904816631099.

Kraft, M. A., D. Blazar, and D. Hogan. 2016. "The Effect of Teacher Coaching on Instruction and Achievement: A Meta-Analysis of the Causal Evidence." Working Paper, Brown University.

Kretlow, A. G., and C. C. Bartholomew. 2010. "Using Coaching to Improve the Fidelity of Evidence-Based Practices: A Review of Studies." Teacher Education and Special Education: The Journal of the Teacher Education Division of the Council for Exceptional Children 33 (4): 279–99.

Krueger, A., and D. Whitmore. 2002. "Would Smaller Classes Help Close the Black-White Achievement Gap?" In Bridging the Achievement Gap, edited by J. E. Chubb and T. Loveless, 1–35.
Washington, DC: Brookings Institution Press.

Liben, L. S., and E. F. Coyle. 2014. "Chapter Three—Developmental Interventions to Address the STEM Gender Gap: Exploring Intended and Unintended Consequences." Advances in Child Development and Behavior 47: 77–115.

Linden, L. L. 2008. "Complement or Substitute? The Effect of Technology on Student Achievement in India." New York: Columbia University.

Martin, M. O., I. V. S. Mullis, P. Foy, and M. Hooper. 2016. TIMSS 2015 International Results in Science. Boston College: TIMSS & PIRLS International Study Center. http://timssandpirls.bc.edu/timss2015/international-results/.

Matsumura, L. C., H. E. Garnier, R. Correnti, B. Junker, and D. DiPrima Bickel. 2010. "Investigating the Effectiveness of a Comprehensive Literacy Coaching Program in Schools with High Teacher Mobility." Elementary School Journal 111 (1): 35–62.

McEwan, P. J. 2015. "Improving Learning in Primary Schools of Developing Countries: A Meta-Analysis of Randomized Experiments." Review of Educational Research 85 (3): 353–94.

Minner, D. D., A. J. Levy, and J. Century. 2010. "Inquiry-Based Science Instruction—What Is It and Does It Matter? Results from a Research Synthesis Years 1984 to 2002." Journal of Research in Science Teaching 47 (4): 474–96.

Mo, D., L. Zhang, R. Luo, Q. Qu, W. Huang, J. Wang, Y. Qiao, M. Boswell, and S. Rozelle. 2014. "Integrating Computer-Assisted Learning into a Regular Curriculum: Evidence from a Randomised Experiment in Rural Schools in Shaanxi." Journal of Development Effectiveness 6 (3): 300–323.

Muralidharan, K., A. Singh, and A. J. Ganimian. 2016. "Disrupting Education? Experimental Evidence on Technology-Aided Instruction in India." NBER Working Paper No. 22923, National Bureau of Economic Research, Cambridge, MA.

Näslund-Hadley, E., and R. Bando, eds. 2016. Todos los niños cuentan: enseñanza temprana de las matemáticas y ciencias en América Latina y el Caribe. Washington, DC: Banco Interamericano de Desarrollo.
http://dx.doi.org/10.18235/0000226.

Novak, J. D. 2005. "Results and Implications of a 12-Year Longitudinal Study of Science Concept Learning." Research in Science Education 35 (1): 23–40.

OECD. 2014. PISA 2012 Results: What Students Know and Can Do (Volume I, Revised Edition). Student Performance in Mathematics, Reading and Science. Paris: OECD Publishing.

OECD. 2016. PISA 2015 Results (Volume I): Excellence and Equity in Education. Paris: OECD Publishing.

Organización de las Naciones Unidas para la Educación, la Ciencia y la Cultura (UNESCO). 2009. Aportes para la Enseñanza de las Ciencias Naturales: Segundo Estudio Regional Comparativo y Explicativo (SERCE). Santiago de Chile: OREALC/UNESCO.

Organización de las Naciones Unidas para la Educación, la Ciencia y la Cultura (UNESCO). 2016. Informe de resultados TERCE. Logros de Aprendizaje. Santiago de Chile: OREALC/UNESCO. http://unesdoc.unesco.org/images/0024/002435/243532S.pdf.

Sailors, M., and L. Price. 2010. "Professional Development That Supports the Teaching of Cognitive Reading Strategy Instruction." Elementary School Journal 110: 301–323.

Sailors, M., and L. Price. 2015. "Support for the Improvement of Practices Through Intensive Coaching (SIPIC): A Model of Coaching for Improving Reading Instruction and Reading Achievement." Teaching and Teacher Education 45: 115–27.

Serra, J. C. 2001. "La política de Capacitación Docente en Argentina: La Red Federal de Formación Docente Continua (1994–1999)." Buenos Aires: Argentine Ministry of Education. http://repositorio.educacion.gov.ar/dspace/bitstream/handle/123456789/96628/EL000689.pdf.

Sloan, H. A. 1993. "Direct Instruction in Fourth and Fifth Grade Classrooms." Dissertation Abstracts International 54 (8): 2837A.

Stanger-Hall, K. F. 2012. "Multiple-Choice Exams: An Obstacle for Higher-Level Thinking in Introductory Science Classes." CBE-Life Sciences Education 11 (3): 294–306.

Urquiola, M. 2006.
"Identifying Class Size Effects in Developing Countries: Evidence from Rural Bolivia." Review of Economics and Statistics 88 (1): 171–77.

Vegas, E., A. Ganimian, and M. S. Bos. 2014. América Latina en PISA 2012: ¿Cuántos Estudiantes Tienen Bajo Desempeño? Washington, DC: Inter-American Development Bank (IADB). https://publications.iadb.org/handle/11319/701.

World Bank. 2014. Argentina—Country Partnership Strategy for the Period of FY2015-18. Washington, DC: World Bank Group. http://documents.worldbank.org/curated/en/846861468210572315/Argentina-Country-partnership-strategy-for-the-period-of-FY2015-18.

Yoon, K. S., T. Duncan, S. W. Y. Lee, B. Scarloss, and K. L. Shapley. 2007. "Reviewing the Evidence on How Teacher Professional Development Affects Student Achievement." Issues & Answers Report, REL 2007-No. 033, US Department of Education, Institute of Education Sciences, National Center for Education Evaluation and Regional Assistance, Regional Educational Laboratory Southwest, Washington, DC.

Figure 1. Science Scores and Captivate Index

Source: Author's analysis based on experimental data generated for this study.
Note: Controls include: (i) student characteristics (gender, age, nationality, parents' education, whether the student missed, at most, one class per month, whether the student has internet in his/her home), (ii) teacher characteristics (gender, age, years of experience, whether she/he has postgraduate certificates), and (iii) school characteristics (school size, seventh-grade size, seventh-grade repetition rate, FEPBA score in Language, school district, or location). Both the captivate index and the Science scores are standardized in terms of the Control Group. The captivate index is a combined scale, whose construction is described in Appendix 4.

Table 1.
Background characteristics of the sample

                                            All sample   Control Group   Sequence Group   Coaches Group
Number of schools                                   70              24               23              23
Average students per school                        301             316              289             297
Number of class divisions in 7th grade             136              50               44              42
Number of students in 7th grade                   2965            1086              917             962
Number of Science teachers in 7th grade             99              36               32              31

Source: Author's analysis based on survey experimental data generated for this study.
Note: Numbers refer to the schools, class divisions, and students involved in the study.

Table 2. Pre-treatment characteristics

                                                     All sample            Control Group   Sequence Group   Coaches Group
                                                     N      Mean    Sd     Mean    Sd      Mean    Sd       Mean    Sd
Student-level variables
Percent female                                       2359   0.49    0.50   0.51    0.50    0.48    0.50     0.49    0.50
Age                                                  2346   12.19   0.52   12.17   0.49    12.22   0.55     12.18   0.52
Percent of Argentines                                2341   0.86    0.34   0.86    0.35    0.87    0.34     0.86    0.35
Mother or father education (secondary)               1858   0.71    0.46   0.72    0.45    0.71    0.45     0.69    0.46
Have internet in their home                          2279   0.90    0.31   0.91    0.29    0.89    0.32     0.89    0.31
Have air conditioning in their home                  2130   0.59    0.49   0.60    0.49    0.60    0.49     0.58    0.49
Have at least one car in their home                  2162   0.64    0.48   0.64    0.48    0.66    0.48     0.62    0.49
At most, missed one class per month                  2288   0.65    0.48   0.66    0.48    0.64    0.48     0.64    0.48
Teacher-level variables
Percent female                                       91     0.88    0.33   0.85    0.36    0.89    0.32     0.90    0.31
Age                                                  90     41.52   8.75   39.59   8.69    42.64   9.46     42.75   7.94
Percent with Post-Graduate Certificate               91     0.43    0.50   10.42   6.67    12.68   6.49     12.26   8.36
Percent with University degree                       91     0.10    0.30   5.81    5.58    7.04    5.62     6.75    7.79
Seniority in teaching (in years)                     91     11.70   7.19   3.52    3.28    4.09    3.49     3.23    4.29
Seniority in teaching Science (in years)             88     6.48    6.33   0.35    0.49    0.50    0.51     0.45    0.51
Percent of teachers that participated in trainings   91     0.90    0.30   0.91    0.29    0.96    0.19     0.83    0.38
Percent of teachers that used a teaching sequence    91     0.55    0.50   0.62    0.49    0.46    0.51     0.55    0.51
School-level variables
Students per school                                  70     301.20  132.10 316.50  139.60  289.60  127.70   296.80  132.80
Students of 7th grade                                70     42.36   19.42  45.25   22.16   39.87   16.85    41.83   19.21
School promotion rate (%)                            70     0.97    0.02   0.98    0.03    0.97    0.02     0.97    0.02
School drop-out rate (%)                             70     0.00    0.01   0.00    0.00    0.00    0.01     0.00    0.00
School over-aged rate (%)                            70     0.15    0.07   0.15    0.08    0.16    0.08     0.14    0.06
FEPBA score in Language                              70     488.16  18.05  487.03  18.17   487.86  20.02    489.66  16.49

Source: Author's analysis based on survey experimental data generated for this study.
Note: N means number of observations in the full sample, and Sd means standard deviation.

Table 3. Balance across Treatments

                                                     Sequence vs.   Coaches vs.   Coaches vs.
                                                     Control        Control       Sequence
Student-level variables
Percent female                                       -0.03          -0.02          0.01
Age                                                   0.06**         0.01         -0.04
Percent of Argentines                                 0.01           0.00         -0.01
Mother or father education (secondary)               -0.01          -0.04         -0.03
Have internet in their home                          -0.03          -0.02          0.00
Have air conditioning in their home                   0.00          -0.02         -0.02
Have at least one car in their home                   0.02          -0.02         -0.04
At most, missed one class per month                  -0.02          -0.02         -0.01
Teacher-level variables
Percent female                                        0.04           0.04          0.00
Age                                                   3.06           3.16          0.11
Percent with Post-Graduate Certificate                0.15           0.10         -0.05
Percent with University degree                        0.13           0.11          0.10
Seniority in teaching (in years)                      2.26           1.84         -0.42
Seniority in teaching Science (in years)              1.23           0.94         -0.29
Percent of teachers that participated in trainings    0.05           0.08         -0.14*
Percent of teachers that used a teaching sequence     0.15           0.07          0.09
School-level variables
Students per school                                  -26.98         -19.72         7.26
Students of 7th grade                                -5.38          -3.42          1.96
School promotion rate (%)                            -0.01           0.00          0.01
School drop-out rate (%)                              0.00           0.00          0.00
School over-aged rate (%)                             0.01           0.00         -0.01
7th student's repetition rate (%)                     0.02           0.00         -0.02
FEPBA score in Language                               0.83           2.63          1.80

Source: Author's analysis based on experimental data generated for this study.
Note: Each entry indicates the mean difference between the two experimental groups in the column for the corresponding variable in each line. * indicates that the difference of means test is significant at 10 percent; ** significant at 5 percent; *** significant at 1 percent.

Table 4. Differences in Non-Response Rates

                                     Sequence vs.   Coaches vs.   Coaches vs.
                                     Control (1)    Control (2)   Sequence (3)
Missing (or omitting) student test   0.008          0.025         0.017

Source: Author's analysis based on experimental data generated for this study.
Note: Each entry indicates the mean difference between the two experimental groups in the column for the students who missed or did not complete the student test. * indicates that the difference of means test is significant at 10 percent; ** significant at 5 percent; *** significant at 1 percent.

Table 5. Mean and Standard Deviation of Learning Outcomes

                                     Control Group         Sequence Group        Coaches Group
Variable                             N     Mean    Sd      N     Mean    Sd      N     Mean    Sd
Science score                        790   0       1       771   0.506   1.119   801   0.681   1.156
Science score (Basic skills)         790   0       1       771   0.185   0.998   801   0.292   0.966
Science score (Medium skills)        790   0       1       771   0.554   1.088   801   0.583   1.102
Science score (Higher-order skills)  790   0       1       771   0.386   1.134   801   0.654   1.195
Percent of correct answers           790   0.326   0.202   771   0.409   0.240   801   0.436   0.243
Percent of incorrect answers         790   0.217   0.160   771   0.166   0.158   801   0.161   0.153
Percent of omitted answers           790   0.203   0.203   771   0.145   0.171   801   0.113   0.166

Source: Author's analysis based on experimental data generated for this study.
Note: N means number of observations and Sd means standard deviation.

Table 6. Results on Science Learning

                                     Sequence vs.   Coaches vs.   Wald Test
Dependent variable                   Control (1)    Control (2)   (3)
Panel A
Science score                        0.550***       0.647***      0.45
                                     (0.125)        (0.137)       [0.502]
Observations                         1,105          1,110
Panel B
Science score (Basic skills)         0.276***       0.282***      0.01
                                     (0.081)        (0.081)       [0.951]
Observations                         1,105          1,110
Science score (Medium skills)        0.571***       0.534***      0.07
                                     (0.130)        (0.130)       [0.791]
Observations                         1,105          1,110
Science score (Higher-order skills)  0.416***       0.638***      2.39
                                     (0.124)        (0.139)       [0.122]
Observations                         1,105          1,110
Panel C
Percent of correct answers           0.103***       0.110***      0.06
                                     (0.023)        (0.026)       [0.800]
Observations                         1,105          1,110
Percent of incorrect answers         -0.057***      -0.050***     0.17
                                     (0.014)        (0.015)       [0.680]
Observations                         1,105          1,110
Percent of omitted answers           -0.053**       -0.079***     1.62
                                     (0.020)        (0.020)       [0.203]
Observations                         1,105          1,110

Source: Author's analysis based on experimental data generated for this study.
Note: ** significant at 5 percent; *** significant at 1 percent. Robust standard errors in parentheses, clustered at the school level. The p-values are in brackets in the third column. Controls: (a) students' characteristics (gender, age, nationality, parents' education, whether the student missed, at most, one class per month, whether the student has internet in his/her home), (b) teacher characteristics (gender, age, years of experience, whether she/he has postgraduate certificates), and (c) school characteristics (school size, seventh-grade size, seventh-grade repetition rate, FEPBA score in Language, school district, or location). All the regressions exclude the classrooms where the Human Body unit was not taught.

Table 7. Results on Science Learning According to Teacher Experience

Independent variable                        (1)         (2)        (3)
Sequence vs. Control                        0.574***
                                            (0.131)
Low experience                              -0.301      -0.113     -0.752***
                                            (0.208)     (0.256)    (0.199)
(Sequence vs. Control)*Low experience       -0.333
                                            (0.282)
Coaches vs. Control                                     0.570***
                                                        (0.147)
(Coaches vs. Control)*Low experience                    0.288
                                                        (0.335)
Coaches vs. Sequence                                               -0.175
                                                                   (0.141)
(Coaches vs. Sequence)*Low experience                              0.824**
                                                                   (0.333)
Observations                                1,042       1,072      1,102

Source: Author's analysis based on experimental data generated for this study.
Note: ** significant at 5 percent; *** significant at 1 percent. Robust standard errors in parentheses, clustered at the school level. Controls: (a) students' characteristics (gender, age, nationality, parents' education, whether the student missed, at most, one class per month, whether the student has internet in his/her home), (b) teacher characteristics (gender, age, general teaching experience [in years], whether she/he has postgraduate certificates), and (c) school characteristics (school size, seventh-grade size, seventh-grade repetition rate, FEPBA score in Language, school district, or location). Low experience represents a dummy variable equal to 1 if the teacher has less than two years of experience in teaching Science and zero otherwise. An interaction term between treatment status and the dichotomous variable of low experience is included. All the regressions exclude the classrooms where the Human Body unit was not taught.

Table 8. Results on Science Test Scores in Higher-Order Items According to Teacher Experience

Independent variable                        (1)         (2)        (3)
Sequence vs. Control                        0.434***
                                            (0.130)
Low experience                              -0.240      -0.034     -0.618***
                                            (0.188)     (0.223)    (0.155)
(Sequence vs. Control)*Low experience       -0.253
                                            (0.253)
Coaches vs. Control                                     0.537***
                                                        (0.167)
(Coaches vs. Control)*Low experience                    0.317
                                                        (0.299)
Coaches vs. Sequence                                               -0.051
                                                                   (0.142)
(Coaches vs. Sequence)*Low experience                              0.752**
                                                                   (0.301)
Observations                                1,042       1,072      1,102

Source: Author's analysis based on experimental data generated for this study.
Note: ** significant at 5 percent; *** significant at 1 percent. Robust standard errors in parentheses, clustered at the school level.
Controls: (a) student characteristics (gender, age, nationality, parents' education, whether the student missed at most one class per month, and whether the student has internet at home); (b) teacher characteristics (gender, age, general teaching experience in years, and whether she/he has postgraduate certificates); and (c) school characteristics (school size, seventh grade size, seventh grade repetition rate, FEPBA score in Language, and school district or location). Low experience is a dummy variable equal to 1 if the teacher has less than two years of experience in teaching Science and zero otherwise. An interaction term between treatment status and the low-experience dummy is included. All regressions exclude the classrooms where the Human Body unit was not taught.

Table 9. Effects on Students' Perceptions

                                                     Sequence vs.   Coaches vs.   Wald Test
                                                     Control (1)    Control (2)   (3)
Standardized captivate index (A+B+C+D)               0.197**        0.051         1.91
                                                     (0.094)        (0.117)       [0.166]
Observations                                         974            964
A. This (Science) class keeps my attention           0.100**        0.118**       0.13
                                                     (0.044)        (0.046)       [0.722]
Observations                                         1,016          1,015
B. My (Science) teacher makes learning enjoyable     0.064          -0.049        2.82
                                                     (0.063)        (0.067)       [0.093]
Observations                                         1,016          1,038
C. My (Science) teacher makes lessons interesting    0.118          -0.012        3.10
                                                     (0.073)        (0.078)       [0.078]
Observations                                         1,021          1,043
D. I like the way we learn in this (Science) class   0.127***       0.062         1.27
                                                     (0.043)        (0.064)       [0.260]
Observations                                         1,002          991
E. I like the Science class                          0.153***       0.151**       0.00
                                                     (0.055)        (0.070)       [0.970]
Observations                                         1,007          999

Source: Author's analysis based on experimental data generated for this study.
Note: ** significant at 5 percent; *** significant at 1 percent. Robust standard errors in parentheses, clustered at the school level. The p-values are in brackets in the third column.
Controls: (a) student characteristics (gender, age, nationality, parents' education, whether the student missed at most one class per month, whether the student has internet at home, and whether he/she has a car); (b) teacher characteristics (gender, age, years of experience, and whether she/he has postgraduate certificates); and (c) school characteristics (school size, seventh grade size, seventh grade repetition rate, FEPBA score in Language, and school district or location). The dependent variable in A-E is measured on a four-point scale, where 1 means strongly disagree, 2 disagree, 3 agree, and 4 strongly agree. The captivate index is a combined scale standardized in terms of the Control Group, whose construction is described in Appendix 4. All regressions exclude the classrooms where the Human Body unit was not taught.

Table 10. Teacher Perceptions

                                                         Sequence vs.   Coaches vs.   Wald Test
Dependent variable                                       Control (1)    Control (2)   (3)
A. I feel the way I teach Science changed a lot          0.864***       0.953***      0.14
                                                         (0.312)        (0.247)       [0.708]
B. I like and/or enjoy teaching Science more than in     0.717**        0.974***      0.93
previous years                                           (0.345)        (0.307)       [0.334]
C. I feel that by implementing the ideas of this
training, my students learned more in comparison         0.632          0.769**       0.30
with other groups and/or subjects                        (0.333)        (0.324)       [0.586]
D. I feel my students developed more skills than in      0.672**        0.876***      0.86
previous years                                           (0.293)        (0.306)       [0.354]
E. I taught more hours of Science classes                0.260          1.107***      18.33
                                                         (0.247)        (0.275)       [0.000]

Source: Author's analysis based on experimental data generated for this study.
Note: ** significant at 5 percent; *** significant at 1 percent. Robust standard errors in parentheses, clustered at the school level. The p-values are in brackets in the third column. Controls: teacher characteristics (gender, age, years of experience, and whether she/he has postgraduate certificates).
The dependent variables are measured on a four-point scale, where 1 means strongly disagree, 2 disagree, 3 agree, and 4 strongly agree.

Table 11. Use of the Sequences One Year Later

                                                         Coaches group   Sequence group
                                                         (1)             (2)
A. Teachers who continued teaching 7th grade Science     27%             33%
at the same school one year after the training
B. 7th grade Science teachers in the same school who     100%            89%
continued using the sequence after the training

Source: Author's analysis based on experimental data generated for this study.
Note: Responses to the one-year follow-up. The numbers show the percentages of teachers who still taught seventh grade Science, and who still used the teaching sequence, one year after the intervention.

Appendixes

Appendix 1: Timeline

We started the school selection process in October 2015 (see Figure A1.1). We selected 75 primary state schools from six school districts of CABA. We notified the Ministry of Education of Ciudad de Buenos Aires of the lottery results in the first days of November, and the Ministry then communicated the results to the schools. We invited those schools to participate in the experiment during November 2015; 70 schools agreed to participate and were effectively included in the experiment. In December 2015 we organized meetings with the school supervisors to explain how the experiment would be implemented. During these meetings, supervisors received a letter with the following information: a description of the project, its main objectives, the grade and topic to be covered, the teacher training, a calendar for the implementation of the experiment, and contact details. Pedagogical coaches were recruited by the Education School of Universidad de San Andrés to carry out weekly sessions with teachers in the Coaches Group. The main aim of these sessions was to guide and coach the teachers on how to implement the structured curriculum unit, as well as to reflect on their practice at the end of each week.
The coaches received an initial training session during February 2016, in which key aspects of Science inquiry were discussed and the curriculum unit was reviewed. Throughout the intervention they also attended regular training sessions every fortnight and had access to an extensive library of guiding documents and videos. The intervention started at the end of February 2016, when all the teachers in the sample received an initial 4-hour CPD session. During this session, teachers in the Sequence and Coaches Groups received the structured curriculum unit. Meanwhile, coaches coordinated the meeting agenda with their teachers. Coaches visited the teachers at their school during free periods over 12 weeks. The first coaching meeting was carried out in mid-March and the last in mid-June. The data collection process was carried out in two stages. First, at the beginning of the intervention (late February), we collected information about schools and teachers. Then, at the end of the intervention (late June), we evaluated the students and collected information about students, teachers, and schools.

Figure A1.1: Investigation timeline.

Appendix 2: Descriptive Statistics of Coaches

Table A2.1 Descriptive Statistics of Coaches

           Age      Maximum educational level   Field                             Seniority (in years)
Coach 1    45-49    Graduate Degree             Science Education                 20
Coach 2    40-44    B.A.                        Biology and Chemistry Education   14
Coach 3    40-44    Ph.D.                       Biology                           10
Coach 4    30-34    B.A.                        Health Science                    7
Coach 5    30-34    M.A.                        Science Education                 7
Coach 6    35-39    M.A.                        Science Education                 8
Coach 7    30-34    M.A.                        Science Education                 6
Coach 8    25-29    B.Sc.                       Biology                           5
Coach 9    25-29    M.A.                        Science Education                 3
Coach 10   25-29    B.A.                        Education                         4

Source: Author's analysis based on experimental data generated for this study.

Appendix 3: Test Items

Table A3.1 Description of the levels of skills addressed in the Science test

Basic skills                      Medium skills                    Higher-order skills
Recall of scientific knowledge.   Describe scientific phenomena.   Identify research questions.
Read information from tables.     Interpret conclusions from       Design Science experiments to
                                  Science experiments.             test hypotheses.
                                                                   Explain more complex Science
                                                                   phenomena (such as the integration
                                                                   of the systems in the Human Body).

Source: Author's analysis based on experimental data generated for this study.

Appendix 4. Student Perception: Captivate Index

Throughout the student questionnaire we asked students a series of questions about specific teachers and their practices in order to construct an index that aims to measure whether teaching practices inspire curiosity and interest, and whether teachers are able to hold students' attention in class and provide the basis for continuing interest. The questions were based on the Student Tripod Survey, a pre-specified, well-known questionnaire validated by the Measures of Effective Teaching (MET) project sponsored by the Bill and Melinda Gates Foundation.26 Column (1) of Table A4.1 lists the questions used to build this index, which we named the Captivate Index. These questions were randomly mixed in the student survey instrument. Students gave categorical answers of the type "strongly agree", "agree", "disagree", and "strongly disagree". We aggregated these answers into an index using a maximum likelihood principal components estimator. According to our estimation, only one factor is retained because it has an eigenvalue over one. Specifically, the eigenvalue is 2.327 and the Cronbach's alpha reliability coefficient for our sample is 0.759. Column (2) shows the loading associated with each variable. After the prediction was computed to produce the index, we standardized it using the mean and standard deviation of the Control Group. The index ranges from -6 to 2, where -6 indicates that the student strongly disagrees with all the questions included in the index. This means that, for that student, teacher practices do not inspire curiosity and interest, or fail to keep his or her attention in class.
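The aggregation just described can be sketched as follows. This is a minimal illustration on simulated Likert responses, using a plain principal-components extraction with the eigenvalue-over-one retention rule as a stand-in for the paper's maximum likelihood estimator; the sample sizes, item values, and group split are all hypothetical.

```python
import numpy as np

# Simulated data: 300 students answer 4 Likert items (1-4); the first 100
# students play the role of the Control Group. All values are hypothetical.
rng = np.random.default_rng(0)
n = 300
latent = rng.normal(size=n)  # underlying "captivation" trait
items = np.column_stack(
    [np.clip(np.round(2.5 + latent + rng.normal(size=n)), 1, 4) for _ in range(4)]
)
control = np.zeros(n, dtype=bool)
control[:100] = True

# Factor extraction from the item correlation matrix; retain the factor
# with eigenvalue > 1 (here, the first principal component).
R = np.corrcoef(items, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
loadings = eigvecs[:, np.argmax(eigvals)]
if loadings.sum() < 0:
    loadings = -loadings  # eigenvector sign is arbitrary; orient positively

# Cronbach's alpha as a reliability check on the combined scale.
k = items.shape[1]
alpha = k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum()
                       / items.sum(axis=1).var(ddof=1))

# Score each student on the retained factor, then standardize with the
# Control Group's mean and standard deviation, as in Appendix 4.
z = (items - items.mean(axis=0)) / items.std(axis=0)
raw = z @ loadings
index = (raw - raw[control].mean()) / raw[control].std(ddof=1)
```

By construction, the standardized index has mean 0 and standard deviation 1 in the Control Group, so treatment-group values are read directly as control-group standard deviations.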
In contrast, 2 indicates that the student strongly agrees with all the questions that make up the index, suggesting that, for that student, teacher practices do inspire curiosity and interest, or succeed in keeping his or her full attention in class.

26 Tripod is the leading US provider of classroom-level survey assessments for K-12 education. Tripod surveys are in their 18th generation, refined over more than a decade of field experience and in response to valuable feedback from educators. For further information see http://tripoded.com.

Table A4.1 Captivate Index

                                                                          Factor
Question                                                                  Loadings
A. I feel the way I teach Science changed a lot.                          0.67
B. I like and/or enjoy teaching Science more than in previous years.      0.80
C. I feel that by implementing the ideas of this training, my students    0.80
learned more in comparison with other groups and/or subjects.
D. I feel my students developed more skills than in previous years.       0.78

Source: Author's analysis based on experimental data generated for this study.

Appendix 5: Sample representativeness

The following table compares the average characteristics of the 70 participating schools with those of the non-participating primary state schools in CABA. Figure A5.1 then shows the distribution of FEPBA test scores for study participants and non-participants.
Table A5.1 Mean test between non-participating and participating schools

                                      Non-participating schools   Participating schools
Variable                              N      Average              N     Average      Difference
School enrollment                     385    321.51               70    298.50
7th grade enrollment                  385    45.30                70    40.67
Number of 7th grade classrooms        385    2.12                 70    1.97
School promotion rate (%)             385    97.56                70    97.32
School drop-out rate (%)              385    0.17                 70    0.17
School repetition rate (%)            385    2.06                 70    2.47         *
School over-aged rate (%)             385    14.28                70    15.02
7th grade score in Language           341    487.45               70    488.74
(FEPBA test)
Social Vulnerability Index            371    0.18                 70    0.17

Note: N indicates the number of schools. * indicates that the difference of means test is significant at 10%.
Source: Author's analysis based on experimental data generated for this study.

Figure A5.1: Distribution of FEPBA scores
Source: Author's analysis based on experimental data generated for this study.

Appendix 6: Balance across experimental groups

Table A6.1 Pre-treatment characteristics (continuation)

                                        All sample          Control Group  Sequence Group  Coaches Group
                                        N      Mean   Sd    Mean   Sd      Mean   Sd       Mean   Sd
Student-level variables
Took holidays in the last two years     2297   0.80   0.40  0.78   0.41    0.81   0.40     0.80   0.40
Scored "Very Good" in Language          2249   0.45   0.50  0.46   0.50    0.45   0.50     0.43   0.50
Scored "Very Good" in Mathematics       2243   0.44   0.50  0.46   0.50    0.44   0.50     0.43   0.50
Teacher-level variables
Seniority in teaching 7th grade         88     4.39   5.10  4.57   5.11    4.35   3.36     4.23   6.48
(in years)
Seniority in teaching at the current    87     3.60   3.65  0.06   0.24    0.07   0.26     0.17   0.38
school (in years)
School-level variables
Percent of schools with double          70     0.39   0.49  0.46   0.51    0.35   0.49     0.35   0.49
school day
Classrooms of 7th grade                 70     1.99   0.81  2.08   0.93    1.96   0.71     1.91   0.79
Percent of schools in school district:
2                                       70     0.17   0.38  0.21   0.42    0.09   0.29     0.22   0.42
3                                       70     0.19   0.39  0.21   0.42    0.17   0.39     0.17   0.39
4                                       70     0.16   0.37  0.21   0.42    0.13   0.34     0.13   0.34
5                                       70     0.14   0.35  0.08   0.28    0.17   0.39     0.17   0.39
7                                       70     0.17   0.38  0.13   0.34    0.30   0.47     0.09   0.29
8                                       70     0.17   0.38  0.17   0.38    0.13   0.34     0.22   0.42

Note: N is the number of observations in the full sample and Sd the standard deviation.
Source: Author's analysis based on experimental data generated for this study.

Table A6.2. Balance across Treatments (continuation)

                                                      Sequence vs.   Coaches vs.   Coaches vs.
                                                      Control        Control       Sequence
Student-level variables
Took holidays in the last two years                   0.03           0.02          0.00
Scored "Very Good" in Language                        -0.01          -0.02         -0.01
Scored "Very Good" in Mathematics                     -0.02          -0.03         -0.01
Teacher-level variables
Seniority in teaching 7th grade (in years)            -0.22          -0.34         -0.12
Seniority in teaching at the current school           0.58           -0.29         -0.87
(in years)
School-level variables
Percent of schools with double school day             -0.11          -0.11         0.00
Classrooms of 7th grade                               -0.13          -0.17         -0.04
Percent of schools of district:
2                                                     -0.12          0.01          0.13
3                                                     -0.03          -0.03         0.00
4                                                     -0.08          -0.08         0.00
5                                                     0.09           0.09          0.00
7                                                     0.18           -0.04         -0.22*
8                                                     -0.04          0.05          0.09

Note: Each entry indicates the mean difference between the two experimental groups in the column for the corresponding variable in each line. * indicates that the difference of means test is significant at 10%; ** significant at 5%; *** significant at 1%.
Source: Author's analysis based on experimental data generated for this study.

Appendix 7: Attrition

This Appendix explores sample attrition. If individuals move in or out of the sample randomly, the design loses some power, but no adjustment is needed beyond documenting the attrition. One way to check whether individuals move in or out of the sample randomly is to test whether those who leave the sample differ from those who remain (Card, Ibarraran and Villa, 2011). As mentioned in Section 5, the attrition rate is 5% of classrooms, all of which belonged to the Control Group. In what follows we inspect the differences in basic characteristics between the students, teachers, and schools that remained in the sample and those that left.
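The check described above reduces to a battery of difference-in-means tests between leavers and stayers, one covariate at a time. A minimal sketch on simulated data follows; the covariates, group sizes, and values are hypothetical stand-ins for the paper's actual sample.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical covariates for Control Group students who stayed vs. left.
stayers = {"age": rng.normal(12.3, 0.9, 750), "female": rng.binomial(1, 0.5, 750)}
leavers = {"age": rng.normal(12.4, 0.9, 40),  "female": rng.binomial(1, 0.5, 40)}

def significance_stars(p):
    """Star convention used throughout the paper's tables."""
    return "***" if p < 0.01 else "**" if p < 0.05 else "*" if p < 0.10 else ""

results = {}
for var in stayers:
    # Welch's t-test: no equal-variance assumption across the two groups.
    t, p = stats.ttest_ind(stayers[var], leavers[var], equal_var=False)
    diff = stayers[var].mean() - leavers[var].mean()
    results[var] = (round(diff, 3), significance_stars(p))
```

If most covariates show insignificant differences, attrition can plausibly be treated as random, which is the argument this appendix makes for the realized sample.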
The first column of Table A7.1 shows the differences between those who left the sample in the Control Group and those who remained, while the second and third columns show the differences between treatments and control in terms of the realized sample (i.e., the sample of 67 schools, or 129 seventh grade classrooms, in which the Human Body unit was taught). Regarding the first column, there are practically no differences, at the 95% confidence level, between those who left the sample in the Control Group and those who remained. The only significant difference is observed in student nationality, at the 99% confidence level. However, this difference vanishes when we compare the control with the treatment groups (both the Sequence Group and the Coaches Group), suggesting that the balance between groups remains the same (columns 2 and 3). Although there are some statistical differences (at the 90% confidence level), they are substantively very small; the balance is therefore maintained overall. As in Table 3, the realized samples of the treatment and Control Groups do not differ significantly in any observable dimension except age. This is the only variable with a statistical difference at the 99% confidence level between the control and Sequence Groups (students in the former group are slightly younger than those in the latter). Nevertheless, this difference is very small and vanishes when we consider 7th grade repetition rates, which are balanced across the three experimental groups.

Table A7.1. Mean differences

                                                   Left the sample vs.        Realized Sample
                                                   Control Group          Sequence vs.  Coaches vs.
                                                                          Control       Control
Student-level variables
Percent female                                     -0.001                 0.026         0.021
Age                                                0.081*                 -0.07***      -0.023
Percent of Argentines                              -0.141***              0.015         0.022
Mother or father education (secondary)             -0.082*                0.024         0.051*
Have internet in their home                        -0.02                  0.028*        0.025
Have air conditioning in their home                -0.03                  0.007         0.023
Have at least one car in their home                -0.025                 -0.015        0.02
Took holidays in the last two years                -0.016                 -0.024        -0.02
At most, missed one class per month                -0.01                  0.018         0.024
Scored "Very Good" in Language                     -0.015                 0.015         0.027
Scored "Very Good" in Mathematics                  0.023                  0.017         0.024
Teacher-level variables
Percent female                                     0.172                  0.061         0.069
Age                                                -3.503                 2.304         2.647
Percent with Post-Graduate Certificate             -0.179                 0.14          0.069
Percent with University degree                     -1.572                 1.604         1.607
Seniority in teaching (in years)                   0.166                  0.003         0.138*
Seniority in teaching 7th grade (in years)         -2.222                 -0.424        -0.676
Seniority in teaching Science (in years)           1.27                   1.656         1.137
Seniority in teaching at the current school        0.434                  0.416         -0.224
(in years)
Percent of teachers that used a teaching           -0.021                 -0.14         -0.069
sequence
School-level variables
Students per school                                19.202                 -27.387       -20.126
Students of 7th grade                              -0.119                 -5.511        -3.555
Classrooms of 7th grade                            0.095                  -0.138        -0.182
Percent of schools with double school day          -0.024                 -0.128        -0.128
School promotion rate (%)                          1.74                   -1.181*       -0.512
School drop-out rate (%)                           0.033                  0.176         -0.046
School over-aged rate (%)                          -1.049                 1.185         -0.246
7th grade repetition rate (%)                      -0.692                 0.997*        0.58
Percent of schools of district: 2                  -0.012                 -0.151        -0.021
Percent of schools of district: 3                  0.238                  -0.064        -0.064
Percent of schools of district: 4                  -0.357                 -0.013        -0.013
Percent of schools of district: 5                  0.095                  0.079         0.079
Percent of schools of district: 7                  0.143                  0.161         -0.056
Percent of schools of district: 8                  -0.107                 -0.013        0.074

Note: * significant at 10%; ** significant at 5%; *** significant at 1%.
Source: Author's analysis based on experimental data generated for this study.
Appendix 8: The role of student ability and gender in student learning

This Appendix explores whether any experimental group shows larger test-score gains. Table A8.1 presents separate estimates for students below (Panel A) and above (Panel B) the mean test score, while Table A8.2 shows separate estimates of the average treatment effects for female and male students.

Table A8.1. Results on Science learning according to student ability

                                    Sequence vs.   Coaches vs.   Wald Test
Dependent variable                  Control (1)    Control (2)   (3)
Panel A: Low ability students
Science score                       0.155**        0.119*        0.31
                                    (0.066)        (0.069)       [0.575]
Observations                        598            586
Percent of correct answers          0.013          0.001         0.64
                                    (0.012)        (0.014)       [0.425]
Observations                        598            586
Percent of incorrect answers        -0.011         0.023         2.31
                                    (0.019)        (0.016)       [0.128]
Observations                        598            586
Percent of omitted answers          -0.035         -0.059**      0.91
                                    (0.024)        (0.024)       [0.339]
Observations                        598            586
Panel B: High ability students
Science score                       0.184**        0.237***      0.58
                                    (0.078)        (0.073)       [0.448]
Observations                        507            524
Percent of correct answers          0.047***       0.037**       0.47
                                    (0.016)        (0.017)       [0.493]
Observations                        507            524
Percent of incorrect answers        -0.033***      -0.026***     0.48
                                    (0.009)        (0.009)       [0.490]
Observations                        507            524
Percent of omitted answers          0.004          -0.017**      6.27
                                    (0.011)        (0.007)       [0.012]
Observations                        507            524

Note: * significant at 10%; ** significant at 5%; *** significant at 1%. Robust standard errors in parentheses, clustered at the school level. The p-values are in brackets in the third column. Controls: (a) student characteristics (gender, age, nationality, parents' education, whether the student missed at most one class per month, and whether the student has internet at home); (b) teacher characteristics (gender, age, years of experience, and whether she/he has postgraduate certificates); and (c) school characteristics (school size, 7th grade size, 7th grade repetition rate, FEPBA score in Language, and school district or location). All regressions exclude the classrooms where the Human Body unit was not taught.
Source: Author's analysis based on experimental data generated for this study.
Table A8.2. Results on Science learning according to student gender

                                    Sequence vs.   Coaches vs.   Wald Test
Dependent variable                  Control (1)    Control (2)   (3)
Panel A: Female students
Science score                       0.519***       0.733***      1.93
                                    (0.146)        (0.138)       [0.165]
Observations                        544            545
Percent of correct answers          0.089***       0.121***      1.06
                                    (0.030)        (0.025)       [0.304]
Observations                        544            545
Percent of incorrect answers        -0.035         -0.049***     0.49
                                    (0.021)        (0.016)       [0.486]
Observations                        544            545
Percent of omitted answers          -0.062***      -0.092***     2.14
                                    (0.022)        (0.019)       [0.143]
Observations                        544            545
Panel B: Male students
Science score                       0.541***       0.548***      0.00
                                    (0.141)        (0.168)       [0.970]
Observations                        561            565
Percent of correct answers          0.108***       0.097***      0.13
                                    (0.024)        (0.032)       [0.720]
Observations                        561            565
Percent of incorrect answers        -0.071***      -0.048**      1.02
                                    (0.016)        (0.021)       [0.313]
Observations                        561            565
Percent of omitted answers          -0.045**       -0.066***     0.81
                                    (0.021)        (0.023)       [0.367]
Observations                        561            565

Note: ** significant at 5 percent; *** significant at 1 percent. Robust standard errors in parentheses, clustered at the school level. Controls: (a) student characteristics (gender, age, nationality, parents' education, whether the student missed at most one class per month, and whether the student has internet at home); (b) teacher characteristics (gender, age, years of experience, and whether she/he has postgraduate certificates); and (c) school characteristics (school size, 7th grade size, 7th grade repetition rate, FEPBA score in Language, and school district or location). The number of observations for the female (male) subsample is 544 (561) in column (1), 545 (565) in column (2), and 543 (584) in column (3). All regressions exclude the classrooms where the Human Body unit was not taught.
Source: Author's analysis based on experimental data generated for this study.
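The estimates throughout these tables come from OLS regressions of an outcome on treatment indicators and controls, with robust standard errors clustered at the school level. A minimal numpy-only sketch of that estimator on simulated data follows; the effect size, sample sizes, and data-generating process are hypothetical, and this is the textbook CR0 sandwich rather than the exact small-sample correction a statistical package would apply.

```python
import numpy as np

rng = np.random.default_rng(2)
n_schools, students = 30, 20                 # hypothetical sample sizes
school = np.repeat(np.arange(n_schools), students)
treat = (school % 2 == 0).astype(float)      # treatment assigned at the school level

# A school-level shock induces within-school correlation, which is why
# standard errors must be clustered at the school level.
shock = rng.normal(size=n_schools)[school]
y = 0.55 * treat + shock + rng.normal(size=school.size)

# OLS of the outcome on a constant and the treatment dummy.
X = np.column_stack([np.ones_like(y), treat])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Cluster-robust "sandwich" variance: (X'X)^-1 [sum_g Xg'ug ug'Xg] (X'X)^-1.
bread = np.linalg.inv(X.T @ X)
meat = np.zeros((X.shape[1], X.shape[1]))
for g in range(n_schools):
    members = school == g
    score = X[members].T @ resid[members]
    meat += np.outer(score, score)
cluster_se = np.sqrt(np.diag(bread @ meat @ bread))
```

Because treatment varies only across schools, ignoring the clustering would understate the standard error on the treatment coefficient, overstating significance.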