Report No: ACS11163
Central America 6C Big Data
Big Data in Action for Development
GMFDR
LATIN AMERICA AND CARIBBEAN
Document of the World Bank

Standard Disclaimer: This volume is a product of the staff of the International Bank for Reconstruction and Development / The World Bank. The findings, interpretations, and conclusions expressed in this paper do not necessarily reflect the views of the Executive Directors of The World Bank or the governments they represent. The World Bank does not guarantee the accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any territory or the endorsement or acceptance of such boundaries.

Copyright Statement: The material in this publication is copyrighted. Copying and/or transmitting portions or all of this work without permission may be a violation of applicable law. The International Bank for Reconstruction and Development / The World Bank encourages dissemination of its work and will normally grant permission to reproduce portions of the work promptly. For permission to photocopy or reprint any part of this work, please send a request with complete information to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA, telephone 978-750-8400, fax 978-750-4470, http://www.copyright.com/. All other queries on rights and licenses, including subsidiary rights, should be addressed to the Office of the Publisher, The World Bank, 1818 H Street NW, Washington, DC 20433, USA, fax 202-522-2422, e-mail pubrights@worldbank.org.

BIG DATA IN ACTION FOR DEVELOPMENT

This volume is the result of a collaboration of World Bank staff (Andrea Coppola and Oscar Calvo-Gonzalez) and SecondMuse associates (Elizabeth Sabet, Natalia Arjomand, Ryan Siegel, Carrie Freeman, and Neisan Massarrat).
Design of the report by SecondMuse (Nick Skytland). The design of the "Data for Action Framework" and the data palette used in the case studies was created by The Phuse.

TABLE OF CONTENTS

Executive Summary
Section 1. What is big data?
  The world is filled with data
  Hello "big data"
  Use and estimated value of big data
  Big data in action for international development
Section 2. How can we better understand and utilize big data?
  Insights and behaviors of interest
  CASE STUDY: The Billion Prices Project and PriceStats
  Generation to interpretation of data
    Data generating process
    Data content and structure
    Data interpretation process
    Insight implementation process
  CASE STUDY: Understanding Labor Market Shocks using Mobile Phone Data
Section 3. What can big data look like for the development sector?
  Examples, by medium and relevant data set
  Big Data for Development in Central America: World Bank Pilot Efforts
  Examples by medium and purpose
  Areas of high potential for big data
    Early warning
    Enhancing awareness and enabling real-time feedback
    Understanding and interacting with social systems
  Specific challenges and ongoing processes
  CASE STUDY: Forecasting and Awareness of Weather Patterns using Satellite Data
Section 4. How can we work with big data?
  Technological capabilities
  Human capabilities and data intermediaries
  CASE STUDY: Connected Farmer Alliance
Section 5. What are some of the challenges and considerations when working with big data?
  Data generation process and structure
  Data interpretation process
    Access
    CASE STUDY: Tracking Food Price Inflation Using Twitter Data
    Preparation
    Analysis
  Insights and their implementation
    Insight scope and purpose
    Insight implementation process
  CASE STUDY: Using Google Trends to nowcast economic activity in Colombia
Section 6. Retrospect and Prospect
References
Annex 1: Selected Bibliography
Annex 2: Interview List
Annex 3: Glossary

FOREWORD

When we started this study, our main objective was to explore the potential of big data to close some of the existing data gaps in Central America. For us at the World Bank, data are critical to design efficient and effective development policy recommendations, support their implementation, and evaluate results. Paradoxically, many of the countries where poverty is high, and hence where good programs are most needed, are also those countries where data are scarcest. Not surprisingly, then, we were seduced by the potential of big data. Not only has connectivity through access to mobile phones, the internet, and social media increased dramatically; it has also generated an endless source of precious information. This has been well understood by marketing experts who now target individual needs. In addition to commercial use, these large amounts of data are a potential asset for the development community, which could use them to help end poverty and promote shared prosperity. And indeed, as this report documents, there are now good examples of how big data is used to improve growth forecasts by ministries of finance, track population movements, or plan emergency responses to weather-related disasters. In the end, we need to recognize that we did not advance as much as we wished in filling the blanks in our Central American databases. True, we are working on three pilots that are helping us think about how to use these approaches in our daily work.
For example, our teams are exploring whether we can use (i) night-time illumination patterns captured by satellites to infer the spatial distribution of poverty; (ii) internet search keyword data to improve our forecasts of price series; and (iii) Twitter data to better understand public reactions to policy decisions. Admittedly, we did not have a major breakthrough, but the work done helped us to start appreciating the potential of big data, and we will continue pursuing this agenda, trying to find country-specific solutions that may emerge from big data analysis. I consider it worth sharing the work done by the team so far in (i) structuring a work program around a topic where we at the Bank had little, if any, expertise; (ii) presenting existing examples where big data is being used to improve development prospects; (iii) reflecting on the many development aspects that can be touched by this type of analysis; (iv) considering the technical and human capacity needs to make the best of big data; and (v) assessing the challenges of working with big data. I also think there are important lessons emerging from the collaboration with SecondMuse and a good number of groups in the Bank: the staff in the Central American Country Management Unit, the Macroeconomics and Fiscal Management Global Practice, the Transport and ICT Global Practice, the Open Finances Group, the Development Economics Group, and the Innovation Lab in the World Bank Institute. As the report notes, the age of big data is upon us. I hope that policy makers and development practitioners alike will find the work described in this report interesting and useful.

Humberto López
World Bank Country Director for Central America

EXECUTIVE SUMMARY

This report stemmed from a World Bank pilot activity to explore the potential of big data to address development challenges in Central American countries. As part of this activity we collected and analyzed a number of examples of leveraging big data for development. Because of the growing interest in this topic, this report makes available to a broader audience those examples as well as the underlying conceptual framework for thinking about big data for development.

In the development sector, various individuals and institutions are exploring the potential of big data. Call detail records via mobile phones are being used, often in combination with other data sources, to analyze population displacement, understand migration patterns, and improve emergency preparedness. Remote sensing images from satellites are showing promise to improve food security and minimize traffic congestion. Search queries and various text sources on the internet and social media are being ably analyzed for quick identification of disease epidemic changes or for inference of a population's sentiment on an event. Hence, big data shows promise to enhance real-time awareness, anticipate challenges, and deepen understanding of social systems by governments and other institutions. Yet moving things forward will require the collaborative formulation of key questions of interest which lend themselves to the utilization of big data, and the engagement of data scientists around the world to explore ways to address them.

To make effective use of big data, many practitioners emphasize the importance of beginning with a question instead of the data itself. A question clarifies the purpose of utilizing big data, whether it is for awareness, understanding, and/or forecasting. In addition, a question suggests the kinds of real-world behaviors or conditions that are of interest. These behaviors are encoded into data through some generating process which includes the media through which behavior is captured. Then various data sources are accessed, prepared, consolidated and analyzed. This ultimately gives rise to insights into the question of interest, which are implemented to effect changes in the relevant behaviors.

Utilizing big data for any given endeavor requires a host of capabilities. Hardware and software capabilities are needed for interaction of data from a variety of sources in a way which is efficient and scalable. Human capabilities are needed not only to make sense of data but to ensure a question-centered approach, so that insights are actionable and relevant. To this end, cooperation between development experts as well as social scientists and computer scientists is extremely important [1].

Big data is no panacea. Having a nuanced understanding of the challenges of applying big data in development will actually help to make the most of it. First, being able to count on plenty of data does not mean that you have the right data, and biased data could lead to misleading conclusions. Second, the risks of spurious correlations increase with the amount of data used. Third, sectoral expertise remains critical regardless of the amount of data available. And these are just some of the criticisms that one needs to take into account when thinking about the possibilities offered by big data. Care is needed in the use of big data and its interpretation, particularly since we are still in an early stage of big data analytics.

Several challenges and considerations with big data must be kept in mind. This report touches on some of them and does not pretend to provide answers and solutions but rather to promote discussion. For example, the process through which behaviors are encoded into data may have implications for the kinds of biases which must be accounted for when conducting statistical analyses. Data may be difficult to access, especially if it is held by private institutions. Even in the case of public institutions, datasets are often available but difficult to find due to limited metadata. Once data is opened, challenges around ensuring privacy and safety arise. This is also linked with the issue of personal data ownership. Even preparing data and ensuring its scalable and efficient use presents challenges, such as the time and effort required to clean data. Analysis, especially when using big data to understand systems, must carefully consider modeling assumptions; algorithm transparency is critical to maximize trust in data-driven interventions; and the work of translating insights into changes in the original behaviors of interest requires attention to the institutional structures and culture that will support the process.

The age of data is upon us. The means of its generation are undoubtedly multiplying, the technologies with which to analyze it are maturing, and efforts to apply such technologies to address social problems are emerging. Through a concerted and collaborative effort on the part of participants at various levels, big data can be utilized within meaningful systems and processes which seek to generate and apply insights to address the complex problems humanity faces.

SECTION 1
WHAT IS BIG DATA?

Data is a growing element of our lives. More and more data is being produced and becoming known in the popular literature as "big data", its usage is becoming more pervasive, and its potential for international development is just beginning to be explored.

The world is filled with data

In 2007, the world's capacity to store data was just over 10^20 bytes, approaching the amount of information stored in a human's DNA, and the numbers are growing.

KEY FINDINGS

• Big data can be used to enhance awareness (e.g. capturing population sentiments), understanding (e.g. explaining changes in food prices), and/or forecasting (e.g. predicting human migration patterns).

• Mediums that provide effective sources of big data include, inter alia, satellite, mobile phone, social media, internet text, internet search queries, and financial transactions. Added benefits accrue when data from
various sources are carefully combined to create "mashups" which may reveal new insights.

• It is key to begin with questions, not with data. Once the setting for the analysis is defined, the focus of the research can move to the behaviors of interest and the consequent data generation process. The interpretation of this information will be used to produce actionable insights, with the possible objective of influencing the behaviors of interest considered. Along these lines, this report develops a Data for Action Framework to better understand and utilize big data.

• Making good use of big data will require collaboration of various actors, including data scientists and practitioners, leveraging their strengths to understand the technical possibilities as well as the context within which insights can be practically implemented.

Between 1986 and 2007, the world's capacity to store information increased by approximately four orders of magnitude. The technological capacity to process all this data has, in fact, experienced even more rapid growth [2]. To cite but one example of the growth rates, the Sloan Digital Sky Survey collected more data over a few weeks in the year 2000 than had been collected by astronomers to date [3].

Simultaneous with the rise of the capacity to store and analyze data is the increasing capacity for people to contribute to and access it. It is estimated that as of 2013, 39% of the world's population has access to the internet. More astounding is the fact that mobile phone subscriptions are near 7 billion, approximately equaling the number of people on the face of the earth, yet access is not equal. While 75% of Europe's population has access to the internet, only 16% have access in Africa. Interestingly, discrepancies in mobile phone subscription rates are not as pronounced. While Europe has approximately 126 subscriptions per 100 inhabitants, Africa has 63 [4]. Furthermore, the use of more advanced "smart" cell phones is expected to increase in the coming years. McKinsey Global Institute (McKinsey) estimates, for example, that for countries in Asia other than China, India, or Japan, the number of "basic phones" is expected to decrease by 12% per year between 2010 and 2015, while "advanced phones" will rise by 17% during the same period [5].

Hello "big data"

It is no surprise that, given such a rising tide, the horizon of possibilities considered by decision-makers over the years has increasingly taken into account how to make best use of such a deluge of data. As early as 1975, attendees at the Very Large Databases conference discussed how to manage the then-considered massive US census data [6]. In the late '90s, practitioners were already using massive, high-frequency store-level scanner data to compute optimal pricing schedules [7]. Indeed, it was around that time that the term "big data" was used to refer to the storage and analysis of large data collections [8]. Since the '90s, a plethora of media, often automatically capturing behaviors and conditions of people or places, provide new data sources for analysis. These media sources include online shopping websites capturing transaction data, retail computers capturing purchase data, internet-enabled devices capturing environmental data, mobile phones capturing location data, and social media capturing data on consumer sentiment.

Decreasing costs of storage and computing power have further stimulated the use of data-intensive decision making [8], and decision making based on ever larger and more complex datasets requires more sophisticated methods of analysis. For example, in the case of visual analysis, smaller datasets lend themselves to simple visual interpretations, say, using a scatter plot or line chart, while bigger datasets can't always readily be captured using similarly structured small and fast renderings [6].

As a result of the rapidly evolving landscape, the popular press, such as the New York Times [9], as well as academic discourses, have increasingly used the term "big data", yet its definition has remained somewhat elusive. Technical audiences will often distinguish big data with respect to the largeness of the dataset(s), say, 200 gigabytes of data for a researcher in 2012. Practitioner audiences, on the other hand, will emphasize the value that comes from utilizing various kinds and sizes of datasets to make better decisions. Indeed, there does not appear to be any real and rigorous definition of big data; instead, it is often described in relative terms [3]. As an example, McKinsey uses the term big data to refer to datasets "whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze", thereby allowing the definition to vary by setting, such as industry, and time [5]. Yet with the great diversity of current storage and processing power, not to mention the doubling of such capacity in short time scales, such a definition makes comparability difficult: exactly the problem a definition seeks to avoid.

Several other authors [10], [11] often refer to the "Three V's" of big data (volume, variety, and velocity, originally discussed by Laney in 2001 [8]) to distinguish big data. Volume refers to the actual size of the dataset(s) analyzed, variety to the various types of datasets possibly combined to produce new insights, and velocity to the frequency with which data is recorded and/or analyzed for action. The concept of variety often underlies use of the term big data. For example, in examining the use of big data for governance, Milakovich points out how "single sources of data are no longer sufficient to cope with the increasingly complicated problems in many policy arenas" [12]. In this vein, Boyd and Crawford point out that big data "is not notable because of its size, but because of its relationality to other data. Due to efforts to mine and aggregate data, Big Data is fundamentally networked" [13]. For the purposes of this report, which considers the development context, the use of big data will refer to the use of any dataset(s) which are distinguished by one or more of the three "V" features mentioned above, for the purpose of generating actionable insights.

Other related terms, such as open data and crowdsourced data, have also come into vogue. "Open data", for example, refers to data which is made technically and legally open, i.e. available in machine-readable format and licensed to permit commercial and non-commercial utilization [14]. Cities such as New York sometimes open their data to stimulate innovation by drawing upon outside resources [15]. Countries such as Singapore, India, and the United States have also consolidated and opened data sets to the public. McKinsey distinguishes open data by explaining that, although big data may also be open data, it need not be. Open data, they explain, refers to the degree to which data is liquid and transferrable [16]. Some data, particularly privately held mobile phone data for example, is decidedly closed yet is certainly big data. Crowdsourced data is another popular term, which refers to data collected through the aggregation of the input of large numbers of people. Crowdsourced data can also be big data, but need not be. "Crowdsourced" emphasizes the means through which data is collected, whereas "big data" emphasizes the depth and complexity of the dataset(s).

Figure 1: Relationship between Big Data, Crowdsourced Data, and Open Data

Use and estimated value of big data

The use of big data has, over the past several years, been motivated largely by private interests. In a survey done around 2010, with over 3000 business executives in over 100 countries, it was found that the top-performing organizations were "twice as likely to apply analytics to activities" including day-to-day operations and future strategies [17]. Businesses have analyzed consumer purchase data to make personalized recommendations, video and shopping cart transponder data to streamline a grocery store's layout, store- and product-level purchases together with climate data to maximize sales and minimize inventory costs, and location data on trucks together with geographic data to minimize fuel use and delivery times [5]. Accessing mobile data such as foot-traffic patterns or even phone operating systems has helped companies engage in more effective advertising [18]. Other specific examples include how Microsoft improved the accuracy of its grammar checker by increasing the relevant dataset from a million to a billion words, or how Google utilized a trillion words to provide more effective language translation services [3]. In short, intelligent use of big data is becoming an effective way for companies to outperform their competitors, often through more effective foresight and understanding of market dynamics [5]. While business has succeeded in demonstrating the marginal benefits which can accrue from big data, e.g. efficiency gains in more effective retail forecasting, more substantive or qualitative effects of big data, particularly in terms of social practices, are just beginning to emerge [19]. That said, the marginal benefits are substantial for private and public interests.

McKinsey estimates the potential additional value of big data in the sectors of US health care, European public sector administration, global personal location data, US retail, and global manufacturing to be over $1 trillion US dollars per year, half of which comes from manufacturing alone. Such value often comes from efficiency gains by reducing the inputs required for the same amount of output [5]. Another study estimated the value of big data via improvements in customer intelligence, supply chain intelligence, performance improvements, fraud detection, as well as quality and risk management. For the United Kingdom alone, this value was estimated at $41 billion US dollars per year in the private and public sectors [20].

Big data in action for international development

In addition to providing insight to make businesses more profitable, big data is showing promise to improve, and perhaps substantively change, the international development sector in novel ways [10]. Of general interest is the fact that big data is often produced at a much more disaggregated level, e.g. the individual instead of, say, the country level. Whereas aggregated data glosses over the often wide-ranging disparities within a population, disaggregated data allows decision makers to consider more objectively those portions of the population that were previously neglected [21].

Two basic approaches appear to stand out with respect to big data in an international development context. One is when big data is utilized for projects or processes which seek to analyze behaviors outside of government or development agencies in order to heighten awareness and inform decision making. Another approach is when big data is utilized for the analysis of behaviors internal to a single institution, such as the government, often to streamline and improve services.

Several examples demonstrate the use of big data for projects or processes more concerned with matters outside of governments or other types of agencies. For example, in one case it was found that a country's gross domestic product could be estimated in real time using light emission data collected via remote sensing. In this way, alternate data sources serve as a proxy for official statistics. This is especially helpful in the development context, since there is often a scarcity of reliable quantitative data in such settings [22]. Changes in the number of Tweets mentioning the price of rice in Indonesia were closely correlated with more directly measured indicators of food price inflation. Similarly, conversations about employment, including the sentiment of "confusion", via blogs, online forums, and news conversations in Ireland were found to precede by three months official statistics showing increases in unemployment [10]. In Guatemala, a pilot project explored how mobile phone movement patterns could be used to predict socioeconomic status, thereby approximating census maps which are often prohibitively expensive to produce. In Mexico, analysis of call detail records enabled tracking of population movements in response to the spread of epidemic disease and provided insight into the impact of policy levers like transportation hub closures, such that the velocity of infection rates was reduced by as much as 40 hours [23]. In Kenya, once the impact of mobile money transfers was evident, governmental regulations were changed to enable their increased use [18]. Text analysis of social media data has the potential to identify issues pertaining to various population segments over time, e.g. refugee challenges or political opinions, thereby allowing development organizations to listen more effectively to the needs of a population [24].

Several governments have used big data in a variety of ways to streamline processes, thereby increasing cost and time savings. In Sweden, the government used previous years' data combined with user confirmation via text messaging to streamline tax filings. In Germany, the Federal Labor Agency used its multidimensional historical customer data to assist unemployed workers more effectively, thereby reducing costs by approximately $15 billion USD annually. Altogether, McKinsey estimates that Europe's 23 largest governments could create $200-400 billion USD per year in new value over 10 years through the use of big data to reduce mistakes and fraudulent tax reporting behaviors [5].

The examples above may be used to stimulate thinking on similar or analogous uses of big data to drive resource efficiency, process innovation, and citizen involvement where resources are constrained, thus laying a strong foundation for poverty alleviation and shared prosperity around the world.

SECTION 2
HOW CAN WE BETTER UNDERSTAND AND UTILIZE BIG DATA?

Before delving into a technical and political analysis of the use of big data, it is helpful to have a contextual understanding of how such data may be generated, accessed, and acted upon. Figure 2 below describes this framework of data for action. This follows from the recognition that data is not the same as knowledge and that, therefore, a whole host of capacities are required to generate data-driven actionable insights for social betterment.

At the most basic level, there are behaviors or conditions existing in the world, which include, among many other things, climate, human sentiments, population movement, demographics, infrastructure, and market-based interactions. These behaviors and conditions, named behaviors henceforth, are encoded through some data generating process, which includes the triggers and media through which data is recorded. Once data in whatever form is generated, a data interpretation process takes place through which the raw data is accessed, consolidated, and analyzed to produce some actionable insight. Often, this interpretation process is cyclical and interactive rather than strictly linear. For example, the analysis may shed light on new data needed, thus requiring access to new data, or the very act of consolidating data may already reveal insights which will inform the analysis or the need to access data in a different way. Once actionable insights are determined, they must change the behaviors of interest through some implementation process. This process includes distilling insights into next steps, structuring organizational processes or projects accordingly, and engaging in the corresponding actions. To the extent that the insights gained are accurate and the implementation is done thoroughly, the behaviors will change as expected. In whatever case, however, all phases of the cycle inform each other through a rich process of learning.

Figure 2: Data for Action Framework (a cycle in which behaviors are encoded as data through a generation process defined by unit, trigger, and media; data yields insights through an interpretation process of question formulation, access, preparation, and analysis; and insights feed back into behaviors through an implementation process)

The above framework is general enough to permit an exploration of the use of big data in a variety of contexts, yet also specific enough to frame the subsequent discussion. Each element of the figure above is discussed in further detail in the sections below, starting with a careful consideration of the "end" use of such data, i.e. the kinds of insights and actions we wish to generate.

Insights and behaviors of interest

As expressed by experts who were interviewed for this report [25], [26], [27], [28], [29], as well as in a report distilling the experience of over 3,000 business executives around the world [17], the best way to proceed when considering how to use big data is to begin with questions, not data.
Undoubtedly, as data is collected and analyzed, the question will be refined.

CASE STUDY: The Billion Prices Project and PriceStats

Data palette: retail prices via websites

Motivation

The Billion Prices Project began when Alberto Cavallo, Sloan Faculty Director at MIT, noticed that Argentina could benefit from a more transparent, reliable, and low-cost method to track inflation. The primary purpose, therefore, was to enhance awareness of price changes and the purchase behaviors of interest, particularly the prices consumers were paying. A methodological challenge was also identified: could data via the Internet feasibly and reliably provide an alternative data source for traditional price index measures? In time, as the data and the methods to access them proved reliable, the data could be used by researchers to enhance understanding, as well as by central banks and financial traders to enhance their forecasting abilities.

Online price data provide distinct advantages over alternative sources of information, such as the direct survey methods used for traditional price indices or scanner data obtained through companies. In particular, online price data is very high-frequency (daily), in fact available in real time without delay, and has detailed product information such as the size or exact product identifier (i.e. SKU) as well as whether the product is on sale. Unfortunately, data on quantity sold is relatively unavailable via outward-facing Internet webpages.

Data Generation

As indicated by the data palette above, the data used for this study is generated by various firms via the Internet. While procedures to update prices will vary across companies, it is safe to state that, in general, such information is provided on a high-frequency and specific geographical basis (i.e. with store location), or, as indicated by the palette, at a high level of temporal frequency and spatial granularity.
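To make the "scraping" step concrete, here is a minimal sketch of how price and SKU information might be pulled out of a product listing using Python's standard library. Everything in it is invented for illustration: the markup, class names, and SKU formats are hypothetical, and real retailer pages (and the project's actual collection software) are far more elaborate.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real retailer markup varies by site.
PAGE = """
<div class="product" data-sku="A1-2345">
  <span class="name">Rice, 1 kg</span>
  <span class="price">$2.89</span>
</div>
<div class="product" data-sku="B9-0012">
  <span class="name">Flour, 1 kg</span>
  <span class="price">$1.15</span>
</div>
"""

class PriceScraper(HTMLParser):
    """Collects (sku, price) pairs from product listings."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._sku = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("class") == "product":
            # Remember which product the following price belongs to.
            self._sku = attrs.get("data-sku")
        elif attrs.get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            # Strip the currency symbol and parse the numeric price.
            self.records.append((self._sku, float(data.strip().lstrip("$"))))
            self._in_price = False

scraper = PriceScraper()
scraper.feed(PAGE)
print(scraper.records)  # [('A1-2345', 2.89), ('B9-0012', 1.15)]
```

Run against live sites, a collector like this would also need request throttling and careful timing which, as noted above, the project adjusts so that collection places no more load on a server than a regular page view.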
The data is relatively structured, since contextual text on the webpage will indicate the SKU for the product as well as other associated parameters. Once posted online, such data become publicly accessible for anyone to access.

Data Interpretation

In the initial efforts to compute price indices for Argentina, all Dr. Cavallo needed was his laptop and time. As the endeavor grew in scope, however, methods and considerations grew in sophistication. Today, a curation process ensures that the best sources of online data are selected. In some cases, data is collected from a retailer for a few years to evaluate whether the quality of the data is sufficiently high for inclusion. Also important is capturing offline retailers' price information, since fewer individuals in developing countries, in particular, purchase goods online. Although each download or "scraping" places no more pressure on a server than a regular page view, the timing is even adjusted to reduce demands on retailers' servers. Finally, retailers' privacy concerns are addressed in a variety of ways, including sharing data at an aggregated level and with time lags.

While the technology for "scraping" price and associated product information off of websites is inexpensive and readily available, Dr. Cavallo and his team realized that a careful process of cleaning the pulled data needed to be put in place, to ensure that all data sources are homogenized and prepared for aggregation and analysis.

The consolidated price data has been used in two ways. The first is to produce price indices which give an indication of inflation. This information is sold and/or shared with central banks and financial traders. The second is for academics seeking to understand market behaviors. Academics will often utilize econometric techniques, for example, which leverage the data sets' high degree of granularity and dimensionality in order to understand short-run or disaggregated effects of various policies.
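The scraping and index-construction steps described above can be sketched in miniature. The HTML structure (a price tag carrying a data-sku attribute) and the use of a Jevons-style index (a geometric mean of price relatives, one common construction for elementary price indices) are illustrative assumptions, not the project's actual pipeline:

```python
from html.parser import HTMLParser
from math import exp, log

class PriceParser(HTMLParser):
    """Collect (SKU, price) pairs from markup such as
    <span class="price" data-sku="A">$2.00</span>.
    The tag, class name, and data-sku attribute are hypothetical;
    each retailer's page structure must be inspected individually."""
    def __init__(self):
        super().__init__()
        self._sku = None
        self.prices = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("class") == "price" and "data-sku" in a:
            self._sku = a["data-sku"]

    def handle_data(self, data):
        if self._sku is not None:
            self.prices[self._sku] = float(data.strip().lstrip("$"))
            self._sku = None

def scrape(html):
    parser = PriceParser()
    parser.feed(html)
    return parser.prices

def jevons_index(prev, curr):
    """Geometric mean of price relatives over SKUs seen on both days."""
    common = prev.keys() & curr.keys()
    return exp(sum(log(curr[s] / prev[s]) for s in common) / len(common))

day1 = scrape('<span class="price" data-sku="A">$2.00</span>'
              '<span class="price" data-sku="B">$4.00</span>')
day2 = scrape('<span class="price" data-sku="A">$2.20</span>'
              '<span class="price" data-sku="B">$4.40</span>')
print(round(jevons_index(day1, day2), 3))  # prints 1.1 (both prices rose 10%)
```

In practice the pages would be fetched on a schedule and the pulled data cleaned and homogenized across retailers before any index is computed, which is where most of the effort described above goes.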
Insight Implementation

Individuals working at central banks and/or traders use such data to enhance decision making. For example, due to its high frequency, central banks can see day-to-day volatility combined with sector-by-sector comparisons which traditional measures cannot capture. As experience with such data sources deepens, many governmental statistical offices are shifting their mentality to accept alternative methods of data collection.

Academics also use such data to conduct research. The fact that the data has so many dimensions allows economists to avoid the use of complicated techniques to account for expected gaps of information in traditional data sources. In one paper, economists used the Internet-scraped data to show that the Law of One Price (a central economic theoretical result) tends to hold across countries that belong to formal or informal currency unions. In another paper, the dataset was used to show how natural disasters affect the availability and prices of goods. Researchers found that more indispensable goods tended to become unavailable more quickly and to recover from price increases more slowly.

From Idea to Ongoing Process

What started as an idea by one individual grew, in time, into a research initiative at MIT named the Billion Prices Project, which was primarily supported by grants. In time, the company PriceStats was created, through which high-frequency and industry-specific price indices could be sold. Through an agreement between PriceStats and the Billion Prices Project to share the data as well as some of the earnings, the research work can continue uninterrupted. In time, it is anticipated that the data will be more widely shared with academic institutions for research purposes.

References
BPP. (2014). The Billion Prices Project @ MIT. Retrieved from http://bpp.mit.edu/usa/
Cavallo, A., Neiman, B., & Rigobon, R. (2012). Currency Unions, Product Introductions, and the Real Exchange Rate (No. w18563).
National Bureau of Economic Research.
Cavallo, A., Cavallo, E., & Rigobon, R. (2013). Prices and Supply Disruptions during Natural Disasters (No. w19474). National Bureau of Economic Research.
Conversation with Alberto Cavallo, February 2014.
PriceStats. (2014). History. Retrieved from http://www.pricestats.com/about-us/history

A well defined question will clarify three basic elements: the primary purpose for using big data, the kinds of real-world behaviors of interest, and the scope of analysis. The following are possible questions:

• How do incomes in Uganda change throughout the year?
• How can assistance be more effectively provided to those in need after a natural disaster near the Pacific Ocean?
• What cities in the world should prepare for increased flooding due to climate change?
• How can governmental systems be designed to efficiently and appropriately tax individuals according to transparently defined principles?
• How can a country's employment sector be more effectively coordinated with training and education?

An analysis of multiple case studies, such as the ones included in the introduction, suggests that endeavors which utilize big data are primarily focused on advancing either awareness, understanding, or forecasting. Use of big data to heighten awareness is exemplified by projects which utilize nontraditional sources of data to serve as proxies for official statistics, such as the gross domestic product or Kenyan examples above. Real-time monitoring of disasters provides another avenue in which awareness is especially needed [30]. Big data in some cases is used to more deeply understand a phenomenon so that better policy levers can be utilized. The Mexico call detail records case described above is one relevant example. Finally, big data may be utilized to more accurately forecast behaviors so that institutions and populations can more effectively prepare. The unemployment case in Ireland is one such example.

Without a doubt, these three purposes are deeply interrelated. It is impossible to advance understanding, for example, without heightening awareness. Understanding, in turn, is often the foundation upon which forecasting methods are utilized. Conversely, certain machine-learning and inductive algorithms may be used to enhance forecasting ability, which can itself give rise to a better understanding of a system. That said, the three categories may be useful to stimulate thinking about the ways in which big data can be utilized. Awareness, understanding, and forecasting aptly correspond, for example, to the actions institutions may take to respond to present situations, design policy, or prepare for future events.

[Figure 3: Awareness/Understanding/Forecasting]

Regarding the question of behaviors, the following categories (with corresponding examples) are suggestive of areas of interest in the international socio-economic development context:

• product/service usage (e.g. non-market food consumption)
• market transactions (e.g. wheat purchase prices)
• human population movement (e.g. regional migration patterns)
• human population sentiment (e.g. public opinion on policies)
• human population conditions (e.g. extent of disease epidemic)
• weather conditions (e.g. ground temperatures)
• natural resource conditions (e.g. extent of forests)
• physical infrastructure conditions (e.g. locations of usable roads)
• agricultural production (e.g. extent of rice cultivation)

The table below shows how a few of the example behaviors may be used in the context of heightening awareness, advancing understanding, or enhancing forecasting.

Wheat purchase prices
  Awareness: How much are farmers currently receiving for the wheat they are selling?
  Understanding: What is driving changes in wheat purchase prices?
  Forecasting: What will wheat purchase prices be in a week?
Public opinion on policies
  Awareness: How favorably do citizens feel about a particular policy?
  Understanding: What factors drive public opinion on foreign relation policies?
  Forecasting: How will public opinion change in the coming months?

Regional migration patterns
  Awareness: During what times of the year do people engage in migration?
  Understanding: How do labor wage differences stimulate changes in migration patterns?
  Forecasting: How is country-to-country migration expected to change in the next few years?

In determining the scope of analysis, it is helpful to know if the use of big data in a process or project is intended to consider the situation for a single individual, city, region, nation, and/or the entire planet. By determining the scope, data requirements as far as size and interoperability become clear.

Generation to interpretation of data

Once the setting for the analysis is well defined, it will be helpful to consider the data available corresponding to the behaviors of interest. The first step in this is to describe and categorize the various types of data available.

Many authors have attempted to do this. One report notes that data may record what people say or do [32]. Another report points out that big data may have one or more of several features, including whether it is digitally generated, passively produced, automatically collected, geographically or temporally trackable, and/or continuously analyzed. The same report discusses a data taxonomy with four categories: data exhaust (i.e. passively generated, often real-time data), online information, physical sensors, and citizen-reported or crowdsourced data. Also included in the report is the division of data sources into traditional (e.g. census or survey data) vs. nontraditional (e.g. social media, mobile phone data) [10]. Another paper discusses how data may be: raw (primary, unprocessed data directly from the source), real-time (measured and accessible with minimal delay), and/or linked (published in a format which lends itself to identifying elements and links between datasets) [33]. Yet another paper explores how data may be structured (i.e. readily stored and accessed in terms of columns and rows) or unstructured (e.g. images or video) [34].

An examination of the above categories reveals that underlying the discourse to understand big data are three aspects of the cycle described above: the way in which data is generated, its content and structure, and the process through which it is accessed and interpreted. The following sections elucidate these three aspects, providing relevant categories with which to organize thinking about the opportunities in the big data space.

Data generating process

The data generating process has at least three features: the data-recording trigger, the level at which data is collected, and the media through which data is generated. To begin, the trigger that is utilized to encode behaviors into data may be active or passive. For example, much mobile data is passively or constantly collected by cell phone towers. Data such as a Twitter or Facebook post is generated actively, since a human decision actively precipitates its recording. Data is also generated at a particular, most granular level of analysis. These levels of analysis may pertain to temporal, spatial, human, or other characteristics. For example, retail sales data is generated at a high temporal frequency (daily, in some cases), spatially identified at a store level (i.e. latitude and longitude), and by product SKU. Finally, the media through which data is generated may include one or more of the following: satellite, mobile phone, point-of-sale, internet purchase, environmental sensors, and social media, among others.

Data content and structure

In terms of the data content and structure itself, at least six features shed light on the kind of big data one may choose. First, as mentioned above, data may be in a structured or unstructured form. Second, data may be temporally referenced; in other words, each record or instance has some form of temporal identification attached. Third, data may be spatially referenced, i.e. be tied to some geographic location data. Fourth, data may be person-identifiable; in other words, records are not only separable¹ and unique by person, but fail the test of anonymity. Fifth, data may have various sizes, from, say, hundreds of megabytes to several petabytes. Finally, a dataset may or may not be a compilation of other datasets.

¹ Separability implies that an individual's records can be separated and distinguished from others, whereas identifiability implies that it is known who that individual (often simply represented by a number or code in a database) is. As long as a database is person-identifiable, it is individually separable. However, a database could include separable yet not identifiable individuals. One such example is an individual wage record database where individuals' names have been scrambled with random, yet consistent, identifiers. It is worth noting, however, that even in cases where such data has been scrambled, there are settings and corresponding methods through which an analyst could inductively identify people.

Data interpretation process

Once datasets corresponding to the behaviors of interest have been identified, these must be collected and analyzed so that actionable insights may be generated.

Several authors have elucidated features pertaining specifically to the process of analyzing data. In one paper, several phases are described, including: data acquisition; information extraction and cleaning; integration, aggregation, and representation of the information; query processing, data modeling, and analysis; and, finally, interpretation. Each of these phases presents challenges, such as the heterogeneity of the data sources, the scale of the data, the speed at which data is generated and a response is needed, concerns for privacy, as well as enabling human collaboration [35]. A study done with sixteen data analysts using big data at Microsoft found five major steps which most engaged in: acquiring data, choosing an architecture (based on cost and performance), shaping the data to the architecture, writing and editing code, and finally reflecting and iterating on the results [6]. Finally, another report by the United Nations discusses how big data analysis requires the steps of filtering (i.e. keeping only relevant observations), summarizing, and categorizing the data [10].

[Figure 4: Data Interpretation Process]

In reviewing the above and several other papers, three interrelated components are involved in generating insights from data. First, the data must be accessed. Second, the data must be prepared. Third, the data must be analyzed. Undoubtedly, these three components are highly interrelated and need not progress linearly. For example, analysis can inform the need to access new data, and the data preparation component may reveal certain preliminary insights for analysis.

Access

In categorizing the ways to work with big data, many sources have actually been describing features pertaining to the access and/or generation of data. At a most basic level, data may be accessed using one of three sources: via public institutions, via private institutions, and directly from individuals or the crowd [36]. Publicly-sourced data includes zip-code level US census data, or massive weather station data from the US National Oceanic and Atmospheric Administration
(NOAA). Privately-sourced data may include store-level retail sales data for hundreds of thousands of stores across a country, or mobile phone location data for millions of individuals across a country. Crowd-sourced data may include millions of image artifact-identification analyses done by hundreds of thousands of individuals.

Management

Given that one or more dataset(s) can be created and/or accessed using the means discussed above, the data needs to be combined and prepared for analysis. At this stage, the steps mentioned above, such as choosing an architecture, extracting and cleaning the data, and integrating the datasets, become relevant. This component of the process, which will be further discussed in Section 4, takes a great deal of time and is closely tied with interpretation.

Interpretation

Once data has been formatted in such a way that it can be more readily accessed by the analyst, various methods can be used to interact with it and generate insights, or at least elucidate other questions which will refine data access and preparation. Two main mechanisms are often used to interpret big data: visualization and modeling.

Visualization plays a key role since it leverages the human strength to see patterns. It also often helps the analyst scrutinize the data more closely [37] and may enable ready comprehension of findings that would otherwise be difficult to achieve [10]. Indeed, in a survey of over 3,000 business executives around the world, a great number indicated that "visualizing data differently will become increasingly valuable" [17]. Visualization requires the thoughtful selection of the relevant pieces of information, displayed in a visually appealing way, to help a decision-maker understand the data.

Modeling is essential to interpreting data, especially if the purpose is to understand and/or forecast behaviors. Models attempt to describe the underlying processes occurring in the world which give rise to the behaviors of interest. Through the lens of a particular model, the data sheds light on the presence and degree to which relationships exist. For this reason, model selection is especially important; otherwise true relationships may remain undetected. To ensure a proper model, many researchers will emphasize how experience, expertise, and human intuition are critical [38]. In addition, it is important to consider the fact that, when modeling human, non-laboratory/controlled settings using high-frequency big data, several constraints and parameters affect the behaviors of interest. In this regard, models must be thoughtfully designed [39]. From a statistical-scientific standpoint, the use of big data has significant implications for modeling and theory development. At a basic level, an analysis of substantial amounts of data can inform the design of models and vice versa, such that a dynamic interplay exists between them [37].

Insight implementation process

Simply having insights pertaining to relevant behaviors is insufficient to cause a change in those behaviors. A process whereby insights generated are translated into action is necessary. In considering this process, at least three features stand out: defining next steps, adjusting or creating structures to ensure these steps are carried out, and taking the necessary actions.

The insights which are ultimately generated as a result of the analysis of data may be more or less actionable. Therefore, it is critical to articulate next steps arising from the insights and, of course, to act upon these steps. However, it is also necessary that some kind of structure is in place to ensure continuity of action.

In this regard, the distinction between using big data for one-time projects versus integrating its usage into an ongoing process merits consideration. Many short-term projects have utilized big data to explore its application as a proxy for official statistics or, in some cases, as a way to allocate resources after a disaster. Once immediate disaster relief is carried out, for example, continued use and refinement of the original big data may cease. At best the data is used as reference material for future projects. However, big data may be used in the context of ongoing processes through which its use is refined and the insights it leads to are continually acted upon. Retail giants like Walmart provide examples of the ways in which big data can be integrated into the ongoing functioning of a company, such as using forecasts based upon their large stores of data in order to adjust inventories, re-organize content, and price items. Cities like Chicago are gathering and tracking both historical and real-time big data on an ongoing basis to streamline operations and uncover meaningful correlations [31].

[Figure 5: Insight Implementation Process]

Whatever the nature of the use of big data, whether for a project or process, the three facets of the implementation process (defining, (re)structuring, and acting) are closely tied and are not necessarily linear. It is possible that action on the ground informs how structure is defined. For example, when aid agencies assist potential polio victims, they may discover that methods of communication need enhancing in particular ways. Alternatively, creating a structure to implement next steps may itself help lend further shape to them. For example, a non-profit may see that the insights generated from an analysis very clearly point to the need to increase the number of vaccinations. However, when beginning to define the structures which will actually carry this out, it may be discovered that other organizations are already doing this and that what is actually necessary is to more appropriately identify those individuals who need vaccinations.

Critical to ensuring that the process has a healthy degree of interactivity among its various elements is a culture of learning characterized by a willingness to share observations and learn from mistakes. Once insights are generated and implemented, behaviors change and new data is generated, whereby the cycle resumes and a rich process of learning ensues.
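As a deliberately minimal illustration of the modeling mechanism discussed under Interpretation above, the sketch below fits a linear trend to a short, hypothetical daily price series and extrapolates one step ahead. The series and the choice of ordinary least squares are illustrative assumptions only:

```python
def fit_line(ys):
    """Ordinary least squares fit of y = a + b*x for x = 0, 1, 2, ...
    using the closed-form formulas for intercept and slope."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

prices = [100.0, 101.0, 102.0, 103.0]  # hypothetical daily index values
a, b = fit_line(prices)
forecast = a + b * len(prices)         # one-step-ahead extrapolation
print(forecast)  # prints 104.0
```

Real series are noisy and rarely linear, which is why, as noted above, model selection and human judgment matter far more than the mechanics of the fit.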
CASE STUDY: Understanding Labor Market Shocks using Mobile Phone Data
DATA PALETTES: Call detail records via mobile phones; Weather conditions via remote sensing imagery; Labor market data via public agency surveys

Motivation

The use of call detail records (CDRs) to track fine-grained patterns of population movement has been a topic of much research in the past few years. Joshua Blumenstock, a University of Washington professor, has spearheaded projects in Rwanda and Afghanistan using CDR records which enhance the ability to gather deeper insights into more nuanced forms of migration, including seasonal and temporary migration. Upon the basis of that research, Blumenstock and his colleague Dave Donaldson are now pursuing evidence to support the long-held theory of migrant workers acting as arbitrageurs whose movement among labor markets serves to bring those markets into equilibrium. The highly detailed internal migration data necessary to support that theory empirically has been unavailable until now. In addressing their central research question regarding the extent to which the movement of people over space helps stabilize and equilibrate wage dispersion, Blumenstock and Donaldson are seeking to analyze migration patterns in response to labor market shocks to shed light on the dynamics of internal migration in low income economies.

Data Generation

Although access to CDR data is typically a significant hurdle in projects of this kind, in this case Blumenstock and Donaldson already had access to the necessary CDR data from prior projects. Data is generated automatically as telecommunications companies encode phone calls into mobile phone records. For this study, the data spans several years, accounts for three developing economies, and is highly detailed both temporally and spatially.
In addition to CDR data, the project is making use of census data and external data sources underlying labor demand shocks, including weather conditions and domestic and international commodity prices. Public census data, while it changes rarely, includes a high volume of information and adds depth of knowledge to the longer-term dynamics of rural-to-urban migration. Government labor market data is highly structured and includes regional price and wage information at two regular intervals. Finally, the team purchased the high-frequency, high-resolution satellite weather data needed to assess climate-related labor shocks for the project.

Data Interpretation

CDR data is a remarkably unwieldy and inconsistent dataset to work with, received in unstructured repositories of millions of individual files. The initial steps in the project, from pre-processing the data to teasing out the relationships between the different datasets (e.g. wage and crop price data), are incredibly time-consuming. The team works with the data on a daily basis and builds models iteratively based on the data.

The data analysis process goes through three steps. The first step is to isolate local shocks to labor demand (e.g. weather or commodity price changes) and to identify the resulting labor market outcomes. The second step involves using the identified shocks to labor demand and drawing on CDR data to estimate the migration response to those shocks. This step aims to understand the migration response to labor market shocks and answer the following questions: What types of individuals are more or less likely to migrate in response to shocks? What regions are more receptive to migrants? How long after a shock do people stop migrating? How do these dynamics affect urbanization? The final step in the analysis is to estimate the actual effects of migration dynamics on the creation of wage equilibrium and to understand the speed and depth of the impact.
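The second analysis step, estimating migration responses from CDR data, relies on inferring where a subscriber "lives" in a given period. The sketch below uses one common simplification from the CDR literature (assigning each subscriber the modal district of their calls per month and flagging changes as moves); it is not the project team's actual method, and the record format is hypothetical:

```python
from collections import Counter

def monthly_home(cdr):
    """cdr: iterable of (subscriber, month, district) call records.
    Assigns each subscriber a 'home' district per month as the modal
    district of their calls, a common simplification in CDR research."""
    calls = {}
    for sub, month, district in cdr:
        calls.setdefault((sub, month), Counter())[district] += 1
    return {key: c.most_common(1)[0][0] for key, c in calls.items()}

def movers(cdr, month_a, month_b):
    """Subscribers whose inferred home district changed between months."""
    home = monthly_home(cdr)
    return {s for (s, m) in home
            if m == month_a and (s, month_b) in home
            and home[(s, month_a)] != home[(s, month_b)]}

cdr = [("u1", 1, "north"), ("u1", 1, "north"), ("u1", 2, "south"),
       ("u2", 1, "north"), ("u2", 2, "north")]
print(movers(cdr, 1, 2))  # prints {'u1'}
```

At scale, the same aggregation runs over billions of records, and the resulting district-to-district flows are what get related to the labor demand shocks isolated in the first step.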
There is already significant existing research on models used to determine measures of mobility using CDR data, and several research papers on the topic. Moreover, the project draws on classic theoretical models from the economic literature, like the Harris-Todaro model, regarding the relationship between wages, labor market shocks, and migration. The project team is currently building a quantitative framework that will allow them to test those theories based on the iterative models being developed through analysis of the wage and migration data.

Throughout the process of analysis, small insights continuously hone the questions asked and help determine relevant inputs. As data comes in, determinations are continuously made which refine the quantitative models that exist for each factor within the framework and the relationship between factors. Benchmarking against other existing data sources, including census data, is relevant in making those determinations.

Insight Implementation

Though this project is still in its early stages, it will deepen understanding of the role of migrants in labor markets. While the theories underlying many current policies designed to impact migration and urban-rural population flow are well developed, empirical evidence is remarkably thin due to the lack of detailed data. As tracking of population movements using CDR data becomes more commonplace, the insights gained through this historical analysis of data will play an integral part in understanding migration patterns.

From Idea to Ongoing Process

Like many of Blumenstock's projects which track migration patterns on the basis of CDR data, this retrospective project contributes to a fundamentally deeper understanding of how wages and labor markets are determined. Through such an understanding, labor policy can be more effectively designed for low income countries, including selection of actions to incentivize and disincentivize migratory behavior.
By pioneering, and rigorously documenting, a process for gathering insights based on a quantitative framework of evidence, this project could be a foundation upon which ongoing evaluation of government policies could be conducted.

References
Blumenstock, J. & Donaldson, D. (2013). How Do Labor Markets Equilibrate? Using Mobile Phone Records to Estimate the Effect of Local Labor Demand Shocks on Internal Migration and Local Wages. Proposal Summary for Application C2-RA4-205.
Conversation with Joshua Blumenstock, March 2014.
Harris, J. & Todaro, M. (1970). Migration, Unemployment and Development: A Two-Sector Analysis. American Economic Review 60 (1).

SECTION 3
WHAT CAN BIG DATA LOOK LIKE FOR THE DEVELOPMENT SECTOR?

Big data shows potential to advance development work in a variety of ways. In the first section above, several examples were provided which highlighted the ways in which big data could be used: as a proxy for conventional official statistics, thereby enhancing institutional awareness of the conditions of a population; to better organize governmental processes, thereby delivering more effective services; or to enhance understanding of the drivers of health epidemics, thereby guiding policy decisions.

Any point in the framework discussed in Section 2 can be used to stimulate the imagination on the horizon of possibilities for big data in development. The case studies presented throughout the text provide concrete applications of the lens of the framework in various settings. Moreover, this section describes several examples of data sets utilized by medium as well as by purpose, and reports information on the first World Bank attempts to leverage big data to address development challenges in Central American countries. By cross-referencing primary media with the primary purpose of the use of big data (awareness, understanding, or forecasting), one can easily see how big data projects can take a variety of configurations depending on the context. Then a summary is presented detailing what institutions and individuals are saying about where big data shows promise for development. Finally, recommendations for next steps in advancing the application of big data for development are provided.

Examples by medium and relevant data set

Mobile | Call Detail Records. Although usage of call detail record (CDR) data for development is still in early phases, applications such as using Digicel's data to track population displacement after the Haiti earthquake and modeling of infectious disease spread show great promise [40]. One study in Afghanistan showed that CDR data could be used to detect impacts from small-scale violence, such as skirmishes and improvised explosive devices, in terms of their impacts on communication methods and patterns of mobility. Another project done by the lead researcher in the Afghanistan study was to capture seasonal and temporary migration, usually overlooked by traditional survey models, permitting a more precise quantification of its prevalence. An ongoing project which builds upon these results aims to measure precisely the extent to which wage disparities in Rwanda, Afghanistan, and Pakistan are arbitrated by migration [22].

Satellite | Remote Sensing Images. Usage of satellite data abounds. For example, the United Nations University engaged in a project using satellite rainfall data combined with qualitative data sources and agent-based modeling to understand how rainfall variability affects migration as well as food and livelihood security in South and Southeast Asia, Sub-Saharan Africa, and Latin America [41]. In Stockholm, GPS-equipped vehicles provided real-time traffic assessments and, when combined with other data sets such as those pertaining to weather, made traffic predictions. Such analyses inform urban planning and also can increase time and cost savings for drivers [42].
and by researchers to predict a film’s success at the box office or a person’s likelihood to get flu shots [3]. Internet | Search Queries. The internet stores a vast amount of information, much of which is Financial | Credit Card Transactions. Credit card unstructured. Search queries present one source of companies have increasingly been using their massive data on the internet. In this vein, Google searches stores of data to enhance their services. In several for “unemployment” were, found, for example, to cases, companies use purchase data to identify correlate with actual unemployment data. Similar unusual behavior in real time and quickly address data was used to notice changes in the Swine Flu potential credit card fraud [3]. In other cases, financial epidemic roughly two weeks before official US institutions have been cited as being able to predict Centers for Disease Control and Prevention data whether someone is dating [45] or even infer the sources reflected it [42]. The Bank of England uses strength of a marriage [46]. search queries related to property, for example, to infer housing price changes [3]. Colombia’s Ministry Big Data for Development in Central of Finance uses the information generated by Google America: World Bank pilot efforts searches to assess short-term GDP trends in Colombia and publish monthly macroeconomic reports which Since 2014 the World Bank has been exploring the discuss the results of the model developed [43]. potential utility of big data to support public policies in Central American countries. The starting point Internet | Text. Text analysis is critical for data was addressing data availability issues, in a context generated via the internet not only for sentiment where traditional data collection methods, such as analysis (e.g. 
favorable/unfavorable views on a policy) household surveys, are undertaken with a relatively but also for lexical analysis to understand elements low frequency in Central America and incur high of culture. One group analyzed the concept of honor costs. Therefore, the goal was to explore the potential in the Middle East, for example, and found how it of alternative data sources, such as those one differed by region and changed over time in response described in the paragraphs above, to fill a data gap. to the events of September 11th. Such analysis With this objective, three different exploratory pilots could inform the appropriate selection of language focusing on different sources of information (internet in, say, diplomacy or educational materials. Further data, social network data, and satellite data) were applications in this regard could include, for example, developed. developing a contextual lexicon on financial literacy in order to tailor microlending by region [44]. By The objective of the first pilot was to assess the combining topic modeling methods--whereby one possibility of using web search keyword data (from explores and understands concepts and topics from Google Trends) for nowcasting price series in Central text--with sentiment analysis, one can gain a richer America. The study, which focused on Costa Rica, El understanding of unstructured text data [24]. Salvador, and Honduras, highlighted the challenges in using Google Trends data. The findings, based Social Media | Tweets. Similar to the example of on a number of indexes constructed to summarize analyzing search queries above, social media data Google Trends data, showed that Google Trends data such as Twitter tweets can be used as an early can improve the ability to forecast certain price series indicator of an unemployment hike or to evaluate (especially in Costa Rica and El Salvador, where the crisis-related stress [32]. Another case utilized tweets web search data was of higher quality). 
to know about a cholera outbreak in Haiti up to two weeks prior to official statistics [42]. Both of these The second pilot, jointly carried out with the United cases demonstrate the ability to reduce reaction Nations initiative working on big data (UN Global time and improve process with which to deal with Pulse), explored the potential of social network various crises. Tweets have been used by hedge fund content to analyze public perception of a policy 27 reform in Central America. The project focused on the gas subsidy reform in El Salvador and consisted of gathering data from Twitter. After geo-referencing on-line content to the country and categorizing information based on the content, the study used text analytics to see if the results from the social media analysis closely followed the public opinion as measured through of a series of household surveys conducted in the El Salvador before and after the reform. By undertaking what can be thought of as a replication study the goal was to establish the validity of the alternative method (social media text analysis) to capture the underlying phenomenon under study. Preliminary results confirmed that Twitter data provides a useful complement to analyze the public perception of a policy reform. The third pilot tried to use satellite data to understand poverty levels in Nicaragua and Guatemala. In particular, the objective of the analysis was to produce a first assessment of the information content of night- time illumination measures and explore correlations with poverty at high levels of geographical disaggregation. The analysis showed that the one-to- one correlation is negative and statistically significant, indicating that night-time illumination data may contain information relevant for analyzing poverty conditions. These pilots are just a starting point. The World Bank launched a Big Data Innovation Challenge in September 2014 to promote big data driven internal projects. 
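The third pilot's core computation is a correlation between night-time luminosity and poverty rates across geographic units. Below is a minimal sketch of that idea with synthetic data; the municipality values and the strength of the inverse relationship are invented for illustration, not drawn from the pilot's actual satellite or survey measures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic municipality-level data: poverty rate (%) and mean
# night-time luminosity; the inverse relationship is built in
# purely for illustration.
n = 150
poverty_rate = rng.uniform(10.0, 80.0, n)
luminosity = 60.0 - 0.5 * poverty_rate + rng.normal(0.0, 5.0, n)

# Pearson correlation between illumination and poverty.
r = np.corrcoef(luminosity, poverty_rate)[0, 1]

# t-statistic for the null hypothesis of zero correlation.
t = r * np.sqrt((n - 2) / (1.0 - r**2))

print(f"correlation: {r:.2f}, t-statistic: {t:.1f}")
```

In practice the luminosity measure would be averaged from satellite imagery over each administrative unit before being matched to survey-based poverty rates.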
In less than a month, more than 130 project proposals were submitted to the Challenge to keep exploring the potential of big data for development.

Examples by medium and purpose

Mobile | Awareness. A study in Afghanistan has shown that CDR data can be used to detect impacts from "microviolence" like skirmishes and IEDs. Microviolence has clear effects on the ways people communicate and on patterns of mobility and migration, similar to what might be seen after a natural disaster. [22]

Mobile | Understanding. A study in the UK used mobile and census socioeconomic data to examine the connection between the diversity of social networks and socioeconomic opportunity and wellbeing, validating an assumption in network science previously untested at the population level: that greater diversity of ties provides greater access to social and economic opportunities. [47]

Mobile | Forecasting. Research has shown that when mobile operators see airtime top-off amounts shrinking in a certain area, it tends to indicate a loss of income in the resident population. Such information might indicate increased economic distress before that data shows up in official indicators. [36]

Financial | Awareness. Xoom, a company specializing in international money transfers, noticed in 2011 that there were more frequent than usual payments being funded by Discover credit cards originating in New Jersey. All looked legitimate, but it was a pattern where one should not have existed. Further investigation revealed the fraudulent activity of a criminal group. [3]

Financial | Understanding. The Oversea-Chinese Banking Corporation (OCBC) increased understanding of individual customer preferences by analyzing historic customer data, then designed an event-based marketing strategy focused on using a large volume of coordinated and personalized marketing messages. Their precise targeting positively impacted numerous key performance metrics and increased campaign revenues by over 400%. [48]

Financial | Forecasting. Predictive analytics tools like FlexEdge allow traders on US equity markets to engage in advanced forecasting, including overnight and intraday forecasts updated by the minute, resulting in an error reduction of up to 25% over standard forecasting techniques, which typically take a historical window average. [49]

Satellite | Awareness. Following the 2013 typhoon in the Philippines, Tomnod (now DigitalGlobe) took their high-resolution satellite images, divided them into pieces and then shared them publicly to crowdsource identification of features of interest and enable rapid assessment of the situation on the ground: where buildings were damaged, where debris was located, and where roads were impassable. First responders used maps generated through this system, and the Red Cross relied on the data to determine resources. The Philippine government will also analyze the data to better prepare for the future. [50]

Satellite | Understanding. The Open Data for Resilience Initiative fosters the provision and analysis of data from climate scientists, local governments and communities to reduce the impact of natural disasters by empowering decision-makers in 25 primarily developing countries with better information on where and how to build safer schools, how to insure farmers against drought, and how to protect coastal cities against future climate impacts, among other intelligence. [2]

Satellite | Forecasting. AWhere's "Mosquito Abatement Decision Information System (MADIS)" crunches petabytes of satellite data imagery to locate the spectral signature of water primed for breeding mosquitoes and combines it with location intelligence algorithms and models of weather and mosquito biology to identify nascent outbreaks of mosquitoes even before they hatch. [46]

Internet | Awareness. PriceStats uses software to crawl the internet daily and collect prices on products from thousands of online retailers, enabling it to calculate daily inflation statistics which are used by academic partners to conduct economic research and by public institutions to improve public policy decision-making and anticipate commodity shocks on vulnerable populations. [3], [51]

Internet | Understanding. Logawi engaged in a research project using lexical analysis--the use of the internet to create a cultural context for particular words and phrases, enabling deeper understanding of how cultures view particular ideas--to assess how different populations across the Middle East understood the concept of "honor." Based on interviews and analysis of internet data, Logawi developed a lexicon of words and phrases mapped onto the region, showing how definitions and use of "honor" change for different cultures over time. [44]

Internet | Forecasting. Research has shown that trends in the increasing or decreasing volumes of housing-related search queries in Google are a more accurate predictor of house sales in the next quarter than the forecasts of real estate economists. [9]

Social Media | Awareness. Using social media analytics in Syria, SecDev Group was able to identify the locations of ceasefire violations or regime deployments within 15 minutes after they took place, enabling them to rapidly inform UN monitors and ensure a swift response. [52]

Social Media | Understanding. A project by UNICEF used social media monitoring tools to track parents' attitudes towards vaccination in Eastern Europe by identifying patterns in the sentiments of their public posts on blogs and social media. The study increased understanding of how to respond to vaccine hesitancy and educate parents to make informed choices, including engagement strategies and messaging. [53]

Social Media | Forecasting. A collaborative research project between Global Pulse and the SAS Institute analyzing unemployment through the lens of social media in the US and Ireland revealed that increases in the volume of employment-related conversations on public blogs, online forums and news in Ireland which were characterized by the sentiment "confusion" show up three months before official increases in unemployment, while in the US conversations about the loss of housing increased two months after unemployment spikes. [10]

Areas of high potential for big data

A variety of authors and institutions have pointed out what they see as areas of high potential for big data. At a broad level, many authors emphasize the potential of combining datasets to enhance understanding [54], [55]. The OECD points to four broad international research topic areas which would benefit from a variety of data types: population dynamics and societal change; public health risks; economic growth, innovation, research and development activity; and social and environmental vulnerability and resilience [55].

Beyond research, however, there is a need for more specific, practical arenas within which big data shows promise. The United Nations' Global Pulse argues that global development work can be improved using big data in three ways: strengthening early warning systems to shorten crisis response times, enhancing awareness of situations on the ground to better design programs and policies, and enabling real-time feedback to make appropriate and timely adjustments [10]. These categories are examined below. In addition to these areas, several individuals have highlighted the promise that big data shows in terms of strengthening understanding of complex systems dynamics, thereby enabling better policy-making. To the extent that specific challenges are elucidated and data is used in the context of ongoing processes rather than one-time projects, big data will have stronger impacts on international development.
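Several of the forecasting examples above (housing-related queries predicting house sales, online conversations anticipating unemployment figures) reduce to regressing an official series on a lagged activity index. The sketch below illustrates that pattern with synthetic monthly data and ordinary least squares; the one-month lead and the coefficient are invented, not estimates from any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic monthly data: a search-volume index that leads the
# official series by one month (illustrative, not real data).
months = 60
index = rng.normal(100.0, 10.0, months)
noise = rng.normal(0.0, 1.0, months - 1)
official = 5.0 + 0.3 * index[:-1] + noise  # official[t] tracks index[t-1]

# Ordinary least squares: regress this month's official figure
# on last month's search index.
X = np.column_stack([np.ones(months - 1), index[:-1]])
beta, *_ = np.linalg.lstsq(X, official, rcond=None)

# Nowcast the not-yet-published figure from the latest index value.
nowcast = beta[0] + beta[1] * index[-1]
print(f"lag coefficient: {beta[1]:.2f}, nowcast: {nowcast:.1f}")
```

Real applications add seasonal controls and compare out-of-sample forecast errors with and without the search index, which is how the studies cited above judge whether the index helps.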
Early warning

Various other sources have also emphasized the potential of big data for early warning [54], [30]. Two concrete examples of such work include the forecasting of riots using food price data with relevant proxies [56], and predictive policing, whereby various data sources, including police databases, are combined with mathematical tools to route police patrols in anticipation of future crime [57].

Enhancing awareness and enabling real-time feedback

The potential of big data for enhancing real-time awareness, also known as nowcasting, is also repeatedly discussed [37], [12]. In fact, approximately 1,200 business and IT professionals cited "real-time information" as one of the top three defining characteristics of big data [58]. Several examples testify to the power of using big data to enhance awareness. Two MIT economists, for example, used internet data to collect prices on half a million products and detected price deflation two months prior to the release of the official, and expensive-to-produce, Consumer Price Index [3]. Alternatively, after a recent tsunami in Japan, Honda was able to provide road closure data within a day using GPS data from newer generation cars [45]. One idea discussed among several big data experts and practitioners included creating publicly accessible databases to enable anyone to assess financial markets, thereby protecting consumers and investors [37].

Understanding and interacting with social systems

One interviewee discussed the possibility of studying the growth of urban boundaries, such as favela growth in Brazil, using historical satellite data combined with complex systems modeling. This could lead to understanding city growth patterns and improved city and regional planning [59]. A few authors have also discussed the possibilities of opening big data, as well as relevant analytical capabilities, to level the playing field for labor and/or product supply from a variety of sources [39], [60]. With larger-scale players dominating the retail market, sourcing is often simplified by working with large-scale suppliers, crowding out smaller producers. A third-party organization could, however, utilize big data analytics to ensure replenishment and coordinate supply from a variety of product sources, large or small [39].

Specific challenges and ongoing processes

To move the agenda of big data for development forward, more than general categories or approaches will be needed. A point reiterated during one interview [48] is that what is especially needed in the development big data space is the specification of challenges that lend themselves to the utilization of big data. Put another way, the kinds of insights that need to be generated should be specified, and such a specification process would benefit from the input of practitioners, data scientists, designers, and relevant thought leaders. Once specified, a space can be created for those with the necessary contextual and analytical ability to propose methods to address the carefully defined challenges using big data. The World Bank, in collaboration with other institutions and organizations, may play a crucial role in this regard as a convener of various parties, both to specify challenges and to explore ways to address them [23], [24]. By convening parties, a shared way of speaking and thinking about big data can be created which is general enough to be inclusive of a diversity of approaches yet specific enough to generate meaningful action.

Furthermore, while short-term projects using big data can be helpful to increase awareness or begin to adjust systems to be more effective, the value of big data is perhaps most evident when it is integrated into ongoing processes. Examples of such process-oriented uses of big data range from private sector retailers using big data to minimize inventories to public sector governments streamlining tax collection mechanisms, unemployment services, or city emergency services. Indeed, if big data is to be used to address challenges, it will have to be integrated into ongoing processes. Only in this way can its use be refined over time and the necessary knowledge be generated more effectively to improve those systems of interest.

CASE STUDY: Forecasting and Awareness of Weather Patterns Using Satellite Data

DATA PALETTE: Weather and land conditions via remote sensing imagery

Motivation

Heavy rainfall in the city of Rio de Janeiro often leads to severe landslides and flooding, causing significant public safety issues. When this happens, rescue efforts require coordination among several different emergency agencies. On April 6, 2010 the city had one of its worst storms, with mudslides leaving over 200 people dead and thousands without a home. It was this event, along with the fact that the city was preparing for the 2014 World Cup and the 2016 Summer Olympics, that pushed the city to use data and predictive modeling for emergency response.

Data Generation

As indicated by the data palette above, the data used to predict weather is primarily satellite data gathered from open sources such as the National Oceanic and Atmospheric Administration (NOAA), while sea surface temperatures are collected directly from NASA. For predicting landslides and flooding, data is pulled from the river basin, topographic surveys, the municipality's historical rainfall logs, and radar feeds. Data such as temperatures at different altitudes, wind data, and soil absorption is captured to help develop accurate predictions. The city is also using loop detectors in the street, GPS data, and video data to help plan emergency responses.
As indicated in the palette above, the weather pattern data is not highly structured; however, it is highly spatially and temporally referenced, as it provides frequent and specific geographic information representing a state in time.

Data Interpretation

Data from across 30 different city agencies is housed in the Rio Operations Center, providing a holistic perspective on how the city is operating. The city created the Rio Operations Center to allow for real-time decision making for emergency responsiveness and to improve safety based on consolidated data from various urban systems. The idea behind creating this operations center was to remove silos between the different emergency response services (i.e. police department, firefighters, etc.). Data is analyzed using a variety of algorithms that allow projections of floods and landslides to be made on a half-kilometer basis and heavy rains to be predicted up to 48 hours in advance. The basic model used for predicting heavy rainfall is IBM's Watson Weather Model, which has been configured to the city of Rio based on a comprehensive list of weather-related events that the city provided to IBM.

A combination of real-time and historical data is currently being used for this analysis. Rio has very good historical flooding data available, cataloguing at least 232 recurrent points of flooding. The algorithms are very granular, taking raw data about wind currents, altitude temperatures, humidity levels of the soil, and the geography of the city to create accurate predictions of landslides. This data is then analyzed against the infrastructure of the city to determine the likelihood of floods and landslides. For example, rainfall predictions are compared against the layout of city infrastructure, the number of trees that can help absorb some of the water, and the conditions of the soil to predict greater risk areas for floods.
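In spirit, the half-kilometer, 48-hour projections combine a rainfall forecast with terrain and soil conditions cell by cell. The toy sketch below illustrates gridded risk scoring only in the most general sense; the inputs, weights and thresholds are invented, not those of the calibrated hydrological model Rio uses.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy grid of half-kilometer cells with three illustrative inputs.
shape = (20, 20)
rain_48h_mm = rng.gamma(2.0, 15.0, shape)       # forecast rainfall, next 48 hours
soil_saturation = rng.uniform(0.0, 1.0, shape)  # 0 = dry, 1 = saturated
slope_deg = rng.uniform(0.0, 40.0, shape)       # terrain steepness

# Invented linear risk score: saturated soil and steep slopes
# amplify the effect of forecast rainfall on each cell.
risk = rain_48h_mm * (0.5 + soil_saturation) * (1.0 + slope_deg / 40.0)

# Flag the top 5% of cells for emergency planners.
alert_cells = np.argwhere(risk > np.percentile(risk, 95))
print(f"{len(alert_cells)} cells flagged for alert")
```

A production system replaces the invented score with physically based hydrological equations and validates predictions against the city's catalogue of recurrent flood points.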
The city uses IBM's Intelligent Operations Center, which pulls in information from multiple sources and provides an executive dashboard for city officials to quickly gain insight into what is happening across Rio in real time. City officials are able to see high-resolution animations of two- and three-dimensional visualizations of key weather variables and obtain detailed tables of weather data at various locations.

Insight Implementation

The new alert system notifies city officials and emergency responders in real time via automated email notifications and instant messaging. As a result of its high-resolution weather forecasting and hydrological modeling systems, Rio has improved emergency response time by 30%. An additional benefit of the new alert system is all of the data that it generates, from the receipt of a message to the response taken. Analysis of this data allows city responders to improve their current procedures, resulting in lower response times and greater coordination of activities.

From Idea to Ongoing Process

The Rio Operations Center was the first center in the world to integrate all stages of disaster management, from prediction, mitigation and preparedness to immediate response and feedback capture for future incidents. By having access to each other's data in a non-siloed environment, emergency response services communicated and collaborated more closely, leading to more rapid response times. In addition to being able to predict rain and flash floods, the city is now able to assess the effects of weather on city traffic and power outages by using a unified mathematical model of Rio. Moreover, Rio is now going beyond weather forecasting to integrate big data into other areas of municipal management. For example, data on waste collection via GPS systems installed in trucks is also collected by the Rio Operations Center to create a comprehensive picture of how public services are operating in the city.
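The alert-log analysis mentioned above (the interval from message receipt to the response taken) is a simple computation over timestamped records; the incidents and timestamps below are invented for illustration.

```python
from datetime import datetime
from statistics import mean

# Invented alert log: (notification sent, first response taken).
alert_log = [
    (datetime(2014, 3, 1, 14, 0), datetime(2014, 3, 1, 14, 25)),
    (datetime(2014, 3, 7, 9, 30), datetime(2014, 3, 7, 10, 2)),
    (datetime(2014, 3, 19, 22, 5), datetime(2014, 3, 19, 22, 41)),
]

# Response time in minutes for each incident.
minutes = [(done - sent).total_seconds() / 60.0 for sent, done in alert_log]

print(f"mean response time: {mean(minutes):.1f} minutes")
```

Tracking this distribution over time is what lets responders verify that procedural changes actually lower response times.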
The city of Rio has also made this data publicly available so that its citizens can better manage their lives. The mayor of Rio de Janeiro, Eduardo Paes, stated: "in addition to using all information available for municipal management, we share that data with the population on mobile devices and social networks, so as to empower them with initiatives that can contribute to an improved flow of city operations". Citizens can receive daily data feeds by following the Rio Operations Center updates on Twitter @OperacoesRio and on Facebook at Centro de Operações Rio. These sites also provide recommendations for alternative routes during special events as well as current traffic and weather conditions. Paes has stated that these updates have significantly increased the quality of life in Rio and are helping bring more businesses and people to the city. Rio is using big data as part of its daily operations to transition to being a smarter city and provide a higher quality of life for its citizens. In fact, the mayor has made technology part of his "4 commandments for smarter cities", stating that "a city of the future has to use technology to be present". Rio has continued its partnership with IBM to continuously improve upon its original algorithms so that, as technology advances, the city is also able to stay ahead of the game. As a result of this pilot, Rio is now fully committed to using technology as a way to help govern the city.

References

Hilbert, M. (2013). Big Data for Development: From Information- to Knowledge Societies. Available at SSRN 2205145.
Conversation with Jean-Francois Barsoum from IBM, March 2014.
Conversation with Renato de Gusmao from IBM, March 2014.
Treinish, Lloyd. (2014). Operational Forecasting of Severe Flooding Events in Rio de Janeiro. Retrieved from: http://hepex.irstea.fr/operational-forecasting-of-severe-flooding-events-in-rio-de-janeiro/
IBM. (2011).
City of Rio de Janeiro and IBM Collaborate to Advance Emergency Response System; Access to Real-Time Information Empowers Citizens. Retrieved from: http://www-03.ibm.com/press/us/en/pressrelease/35945.wss
Eduardo Paes TED Talk. (2012). The 4 Commandments of Cities. Retrieved from: http://www.ted.com/talks/eduardo_paes_the_4_commandments_of_cities

SECTION 4
HOW CAN WE WORK WITH BIG DATA?

Technology alone is not sufficient to understand and interpret results from the use of big data. Turning big data into insights which are then acted upon requires an effective combination of both technological and human capabilities.

Technological capabilities

Each phase of the data interpretation process discussed above highlights the technological capabilities necessary to work with big data effectively. First, in terms of accessing data, it is important to have the necessary hardware and software to collect data, depending on whether a dynamic or static method is utilized. If data is dynamically fed from an online source such as Twitter, for example, then the analysis software must allow for such real-time, continuously updated analysis. If, instead, data is being downloaded from some source and then kept for later analysis, it is important to ensure sufficient hardware capacity to store such data.

Given sufficient technological capacity simply to access data from various sources, it is necessary to have software and hardware capabilities to connect and interact with large and diverse datasets. The larger the datasets, the greater the hardware storage and processing power, and the more scalable the software platform, needed to process queries and pull data for analysis. With large-scale analyses, software is often needed to make use of parallel computing, through which computations on massive amounts of data can be processed simultaneously over multiple processors [6]. Relational database software such as Microsoft Access, for example, will scale very poorly in the face of dozens of gigabytes, let alone terabytes, of data. Instead of using supercomputers, software may be used to conduct parallel data processing over multiple computers, including even videogame consoles, more cheaply. Examples of such software include Hadoop clusters and products such as Microsoft's Azure and Amazon's EC2 [42], [6]. Open source tools, such as those used by the city of Chicago's predictive analytics platform, which utilizes big data [31], present financially inexpensive software options for analysis. Altogether, these cheaper parallel processing and analysis options are promising when considering the lack of big data hardware and software capacity in many developing countries [42].

The more diverse the datasets of interest, the more robust the software platform through which to interact with them must be. For example, in the case of structured column/row datasets, interacting datasets may be as simple as identifying unique keys to join tables, as is done using relational database software like Microsoft Access. However, when considering relatively unstructured data, such as satellite infrared data or a collection of hundreds of millions of strings of text, software capabilities are critical in order to analyze such data effectively and connect it to structured datasets for the generation of appropriately formulated insights.

Beyond software and hardware requirements, it is immensely helpful when the data which is utilized has appropriately encoded metadata, i.e. data which describes each dataset. In particular, Global Pulse recommends that metadata describe the "type of information contained in the data", "the observer or reporter", "the channel through which the data was acquired", "whether the data is quantitative or qualitative," and "the spatio-temporal granularity of the data, i.e. the level of geographic disaggregation (province, village, or household) and the interval at which data is collected" [32]. Given such metadata, analysts and decision-makers can more easily identify the provenance of a particular dataset. This is especially helpful when analyzing data mashups, or interactions of multiple datasets. With complete metadata, in other words, the analysis is more transparent regarding assumptions.

Beyond simply being aware of the data sources via metadata, some authors have highlighted the need for analysts and decision-makers to understand more effectively the assumptions of the model and/or combined dataset by, in essence, "playing" with the assumptions. In this regard, several authors have explored the concept of a hypothetical Analytics Cloud Environment through which a user can change assumptions and see their impact on an analysis [6]. In designing such software to be scalable, well-designed Application Programming Interfaces (APIs) must be created to channel data at an optimal level of aggregation, so that users may fluidly interact with a large database [25].

Human capabilities and data intermediaries

A reading of the above technological capabilities also indicates the undoubted necessity for human capabilities to interact meaningfully with big data. The need for human capacity to understand and use big data effectively, particularly in government and the public sector [12], [5], or specifically in developing countries [42], is reiterated by various authors and agencies [10], [33], [36], [61], [40]. At a fundamental level, working with big data requires a shift in mindset from working with "small" data. Some authors emphasize that this implies the ability to analyze vast amounts of information rather than only samples, a willingness to work with messiness, and an appreciation for correlation rather than a strict interest in causation [3]. A senior statistician at Google pointed out that a good data scientist needs to have computer science and math skills as well as a "deep, wide-ranging curiosity, is innovative and is guided by experience as well as data" [62]. Other necessary skills include the ability to clean and organize large data sets, particularly those that are unstructured, and to communicate insights in actionable language [11].

In its report on big data, McKinsey points out, however, that a "significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data" [16]. Indeed, the human capabilities needed are wide-ranging, from those having to do with technology to those related to people and real-world problems, e.g. hardware setup, data management, software development, mathematical and statistical analysis, real-world model development and assessment, as well as the distillation and communication of actionable insights. Beyond such skills, an intimate knowledge of the real-world situation of interest is critical [40].

Beyond the individual skills and capacities required, effective spaces and environments need to be created in order for multiple viewpoints to advance the analysis collaboratively. At one level, distributed labor coordination mechanisms such as crowdsourcing can be utilized to aggregate thousands or even millions of people's perspectives on a dataset in order to complement and strengthen the big data analysis [50]. At a smaller, yet more complex, level, collaborative environments need to be created through which the diverse perspectives of those working with big data can come together to produce new, more comprehensive insights.

Given the required individual and collective capacities to work with big data, it is no surprise that in a survey of thousands of businesses around the world, six out of ten said that their "organization has more data than it can use effectively" and that the leading obstacle was a "lack of understanding of how to use analytics to improve the business." The majority of these businesses which frequently used data analytics actually used a "centralized enterprise unit" which served as a space in which the right combination of skill sets among individuals could come together to do big data analysis [17]. These enterprise units are examples of the data intermediaries that will undoubtedly be needed in the coming years to make sense of big data [37].

Volunteer Technical Communities

Governmental or non-governmental institutions that wish to transform raw datasets into practical tools often lack the expertise to do so. Yet this should not be a limiting factor. One avenue which is proving valuable for leveraging the skill sets of individuals outside an organization is volunteer technical communities, such as hackathons. In these settings, subject matter experts, technologists, designers, and a variety of other participants come together to define problems, share relevant datasets, and rapidly prototype solutions. Often these events are followed by acceleration phases in which the most promising solutions are further developed. Examples of hackathons that have been used in an international development context include the Domestic Violence Hackathon held in Washington D.C. as well as in countries in Central America. Another example is Random Hacks of Kindness which, between 2009 and 2013, organized hundreds of community-focused hackathons around the world and engaged in a similar process.
These volunteers created, for example, InaSAFE, a web-based tool which combines several relevant datasets and strengthens decision making around natural disaster response [63].

CASE STUDY: Connected Farmer Alliance

DATA PALETTE: Crowdsourced supplier data via mobile phones

Motivation

Vodafone has a deep interest in development in emerging markets, particularly in Africa, where its mobile financial solutions like M-PESA (a mobile money transfer platform) and M-Shwari (a partnership between M-PESA and CBA to provide a savings and loans service) have a strong presence. Vodafone's interest in development and the pursuit of disruptive innovation, tied with a clear potential for commercial businesses to play a role in supporting the agricultural sector (an area of focus for many of Vodafone's African enterprise clients), led it to join with USAID and TechnoServe to form the Connected Farmer Alliance. This Alliance pilots initiatives aiming to create a better ecosystem for mobile services in the agricultural sector, impacting production through the supply chain to enterprise use.

Data Generation

The program focuses on Kenya, Tanzania, and Mozambique, and is divided into three distinct areas of focus: enterprise solutions to source from small farmers, improving mobile financial services, and mobile value-added services. The first area, where much of the testing has already taken place, involves enterprise solutions which enable enterprises to better source from small farmers and allow farmers better access to markets. The data is gathered and distributed through a suite of modules, including a registration module allowing an agent of an enterprise to register farmers who supply a particular produce (or allowing farmers to register themselves as suppliers). The service enables a remote, crowdsourced data-gathering method to identify who and where farmers are and the crops they specialize in producing.
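The registration module described above can be pictured as a simple record store that is referenced by time, place, and person. The field names and sample entries below are invented for illustration; they are not the actual Connected Farmer Alliance schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FarmerRecord:
    """One crowdsourced supplier entry; fields are illustrative only."""
    farmer_id: str
    name: str
    crop: str
    lat: float            # spatial reference
    lon: float
    registered_by: str    # an enterprise agent, or the farmer herself
    registered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))  # temporal reference

registry = []

def register(record: FarmerRecord):
    registry.append(record)

register(FarmerRecord("F001", "A. Mwangi", "maize", -1.29, 36.82, "agent-07"))

# Enterprises can then filter available suppliers by crop:
maize_suppliers = [r.farmer_id for r in registry if r.crop == "maize"]
print(maize_suppliers)  # ['F001']
```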
The data gathered through mobile phones in this module is highly structured, referenced both temporally and spatially, and highly person-identifiable, enabling enterprise participants to distinguish specific farmers and their products. The typical enterprise participants are mid-sized national companies who source their produce from small farmers and are seeking more detailed data on, and interaction with, available suppliers.

Building upon the crowdsourced supplier data are a series of additional modules, including two-way communication that enables enterprises to share information with, or survey, farmers. A receipting module, integrated with M-PESA, allows enterprises to send receipts and pay farmers at the point of sale, identifying volume of purchase, time, and price, and increasing transparency. Another module allows enterprises to offer short-term loans through M-PESA, enabling cash advances that are later deducted from payment for produce. Finally, a tracking module enables enterprises to better track collection processes and points to streamline product collection. At the pilot phase, the size of the crowdsourced dataset does not yet approach big data; however, Vodafone is currently preparing to bring this first suite of modules to commercial markets for much broader deployment.

The second area of focus, currently in the conceptual development and partnership-building phase, involves the improvement of mobile financial services. One area of research is the extent to which big datasets of historical mobile financial transactions, generated through other Vodafone products and services, can prove useful in assessing the credit-worthiness of loan applicants. This product area may also work with local partners to incorporate the use of mobile financial data in streamlining insurance pricing and payouts for farmers, by using location data to assist insurers in more rapid analysis of claims.
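One way transaction histories could feed a credit assessment, as the research area above contemplates, is to reduce each history to simple features and score them against thresholds. The features, thresholds, and numbers below are invented placeholders for illustration, not Vodafone's actual methodology.

```python
def credit_features(transactions):
    """Summarize a mobile-money history into simple features.
    `transactions` is a list of amounts; a real system would also use
    timing, counterparties, and repayment records."""
    n = len(transactions)
    total = sum(transactions)
    return {"count": n, "avg_amount": total / n if n else 0.0}

def approve_loan(transactions, min_count=10, min_avg=5.0):
    """Toy rule: enough activity and a sufficient average transaction size.
    Both thresholds are arbitrary placeholders."""
    f = credit_features(transactions)
    return f["count"] >= min_count and f["avg_amount"] >= min_avg

history = [12.0, 8.5, 20.0, 5.0, 9.0, 11.0, 7.5, 14.0, 6.0, 10.0]
print(approve_loan(history))  # True: 10 transactions averaging above 5.0
```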
The third focus area, only at the earliest conceptual stages, is to use the enterprise solutions and mobile financial services created in the first two stages of the product to create an overall supportive environment for mobile value-added services for anyone wanting to take products to market. This area will also include the growth of business development and incubation services to support sustainable mobile business growth in the agriculture sector.

Data Interpretation

Vodafone works with its subsidiary Mezzanine on the development and management of the data collection platform, which is locally hosted in the Kenyan, Tanzanian, and Mozambican markets themselves and protected by high-level security mechanisms. In the pilot phase, data is available only to the enterprise and participating farmers, and for the surveys, enterprises receive only aggregated responses, not individual records. Vodafone is working with enterprise customers on the most convenient way for farmers to submit data whilst ensuring confidentiality for them and for businesses. The details of data privacy will be governed by Vodafone's data privacy policies to ensure ongoing protection.

Within the Connected Farmer Alliance partnership, TechnoServe is charged with analysis and interpretation of how the modules are performing for the enterprises and farmers. However, given the small sample set involved in the pilot of the enterprise modules, insights are currently being gathered through traditional survey methods. Those methods include assessing goals for the participants at the project outset, determining areas of measurement, and collecting input through questionnaires during the process. Additionally, the Connected Farmer Alliance supports enterprise partners in their own data analyses of information and outcomes.

Insights for Action

Although the project is still in its early phases, insights are beginning to emerge around the benefits of the enterprise modules.
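The rule that enterprises receive only aggregated survey responses, never individual records, can be sketched as a single aggregation step. The response data and answer format below are invented for illustration.

```python
from collections import Counter

# Hypothetical raw survey responses, keyed by farmer ID (never shown to enterprises).
raw_responses = {
    "F001": "yes",
    "F002": "no",
    "F003": "yes",
    "F004": "yes",
}

def aggregate(responses):
    """Drop individual identifiers and return only counts per answer,
    mirroring the pilot's rule that enterprises see aggregates, not records."""
    return dict(Counter(responses.values()))

print(aggregate(raw_responses))  # {'yes': 3, 'no': 1}
```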
Cost savings have been shown on the part of farmers who receive M-PESA for loans and payments. By receiving M-PESA, these farmers avoid costly, time-consuming, and risky trips to the enterprise office to collect cash. The receipting module has resulted in cost savings for enterprises due to operational efficiencies and improved process transparency. A key benefit of mobile solutions for farmers is an increase in access to information. Nonetheless, it is difficult to make generic content services meaningful to small farmers whose local realities may vary significantly within a distance of just a few kilometers. The targeted information flow permitted by the two-way information module has been shown to provide information particularly relevant to the stakeholder farmers, as well as to enhance face-to-face interactions among farmers and enterprises.

From Idea to Ongoing Process

Although the Connected Farmer Alliance has a clear social transformation element which inspires the partners and enterprises alike, growth and use of these mobile tools and the long-term sustainability of the piloted approach will fall under Vodafone's commercial initiatives. With the specific intent of going beyond the pilot phase and putting in place publicly accessible tools, Vodafone is currently in the process of scaling up the mobile tools used in the first phase of the Connected Farmer Alliance project for commercial use, as a method to generate large-scale, targeted, and valuable data for small farmers and enterprises alike.

References

Conversation with Laura Crow, Principal Product Manager for M-PESA, March 2014.
Correspondence with Drew Johnson, Program Manager, TechnoServe Tanzania, June 2014.
TechnoServe. (2014). Projects: Connected Farmer Alliance. Retrieved from: http://www.technoserve.org/our-work/projects/connected-farmer-alliance

SECTION 5

WHAT ARE SOME OF THE CHALLENGES AND CONSIDERATIONS WHEN WORKING WITH BIG DATA?
As the data deluge continues to grow and forward-thinking managers and policy makers seek to make use of it, challenges at the levels of expertise, appropriate use, and institutional arrangements come to the forefront. Whereas in the past smaller-scale and less diverse datasets could be analyzed by a limited number of individuals, big data requires a wider skill set. An added dimension to the use of bigger data is its ability to understand and predict ever-smaller segments of the population, sometimes even to the level of the individual. The ability to be aware of, and even forecast, the behaviors of such small segments, including individuals, raises new ethical questions. Finally, whereas data analysis was often restricted to that collected by a single institution, the nature of big data requires new forms of inter-institutional relationships in order to leverage data resources, human talent, and decision-making capacity.

The following sections organize challenges and considerations according to the stages of the Data for Action Framework [Figure 2] discussed in section two. First, a series of considerations is discussed with respect to the various ways in which data is generated, stored, and accessed. Then, challenges around how to effectively manage and analyze the data are enumerated. Along similar lines, practical considerations arise when discussing how insights are actually defined. Cultural and ethical challenges then come to the forefront when considering how to actually implement insights.

Data generation process and structure

Several challenges must be overcome, and considerations kept in mind, regarding the data generating process and the data structure itself.

To begin, the very trigger which encodes behaviors into data can have implications for analysis. If data is passively recorded, then it is less susceptible to the statistical problem of selection bias, i.e. that the data collected is systematically unrepresentative of the behavior of interest. If instead the data is actively selected, then it is more susceptible to such a bias. If an individual is interested in collecting her walking data throughout a week, she may, for example, input data into a spreadsheet on those days when she remembers. This may, however, paint a biased picture of her movement if, say, she only remembers to record data when she walks great distances. If, instead of collecting data actively, she used a wristband which passively collected data, then a more representative picture would be drawn.

Once encoded, it is important to consider the features of the datasets of interest. If they are unstructured, for example, they will require the development of appropriate processing methods, especially when integrated with structured data. Mobile data, like that analyzed in the case study on Understanding Labor Market Shocks using Mobile Phone Data, is received in unstructured repositories of millions of individual files, requiring time-intensive processing, programming, and the use of expensive hardware to obtain indicators of population movement and prepare for interaction with other data [22]. Text analysis is one example of making sense of, say, unstructured Facebook status updates or online Tweets. The method used to structure unstructured data adds yet another point at which decisions are made, making the analysis further susceptible to biases that the researcher may have.

The benefit of beginning with exploratory, visual analysis of social media in order to see patterns before building a formal statistical model has also been noted [24]. Social media often store actively generated data, such as Tweets, and may therefore suffer from selection bias. On a related note, retweets and the sharing of links actually make up the bulk of Twitter traffic, such that a very small minority controls the agenda of what is originally created.
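The contrast between passive and active recording can be made concrete with a small simulation of the walking example: a log kept only on memorable days overstates typical distance, while a passive recorder does not. All distances are invented.

```python
# A week of true daily walking distances in km (invented values).
true_distances = [1.0, 2.0, 8.0, 1.5, 9.0, 2.5, 1.0]

# Passive recording (the wristband) captures every day.
passive = true_distances

# Active recording: she only remembers to log the long walks (> 5 km).
active = [d for d in true_distances if d > 5]

avg = lambda xs: sum(xs) / len(xs)
print(round(avg(passive), 2))  # 3.57 -> representative picture
print(round(avg(active), 2))   # 8.5  -> selection-biased upward
```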
This presents challenges of ensuring that analyses which use social media place results in the right population context [45]. Other information found on the internet, such as webpages, blogs, and videos, may share similar problems to those noted above regarding identifying the population actually being represented, as well as effectively interacting with unstructured data. Point-of-sale or internet sales data are often high-frequency and high-volume datasets which require effective structures to constantly process massive stores of data, as well as thoughtfully constructed models which appropriately consider the relevant, short-run nature of the decisions being made by humans [39].

Whether data sets have methods to identify temporal, geographical, or individual characteristics, such as time/date stamps, latitude/longitude information, or unique personal IDs, respectively, will determine to what extent data mashups are possible. However, a challenge that must be addressed in combining such data sets is to ensure proper aggregation. For example, if one data set is collected every minute but another is collected every day, then, to ensure comparability, the analyst must carefully consider how to aggregate the minute-based data so that it can be effectively joined with the daily data.

The media through which behaviors are encoded into data each present their own series of challenges. Some of these challenges are inherent to the medium, while others are due to association with certain features of the data structure or generating process. Mobile phone data, by their very nature, are highly sensitive and should be treated carefully.
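The minute-to-daily aggregation needed before joining two differently sampled datasets can be sketched as follows; timestamps and values are invented for illustration.

```python
from collections import defaultdict

# Minute-level readings: (ISO timestamp, value) pairs -- invented example data.
minute_data = [
    ("2014-06-01T09:00", 4.0),
    ("2014-06-01T09:01", 6.0),
    ("2014-06-02T10:30", 3.0),
]
# Daily dataset to join against.
daily_data = {"2014-06-01": 100.0, "2014-06-02": 120.0}

# Step 1: aggregate minutes up to daily means so the keys become comparable.
buckets = defaultdict(list)
for ts, value in minute_data:
    buckets[ts[:10]].append(value)          # the "YYYY-MM-DD" prefix is the day
daily_means = {day: sum(v) / len(v) for day, v in buckets.items()}

# Step 2: join on the shared daily key.
joined = {day: (daily_means[day], daily_data[day])
          for day in daily_means if day in daily_data}
print(joined)  # {'2014-06-01': (5.0, 100.0), '2014-06-02': (3.0, 120.0)}
```

Whether a mean, sum, or end-of-day value is the right aggregate depends on what the minute-level variable measures; that choice is itself an analytical decision.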
Although mobile data is highly disaggregated and can be very rich, it has been observed that its analysis should be validated through corroboration with other sources such as household surveys or satellite data [40]. Satellite data can become very large very fast, since it is primarily high-resolution image data. It can also be highly unstructured, particularly when it comes to visual pattern analysis, and may benefit especially from human review. Social media data is often in the form of unstructured text which requires specific analytical capabilities to codify and identify useful patterns. One researcher identified techniques such as topic modeling and named entity recognition to be useful in this regard. In the case of the Global Pulse program tracking food price inflation in Indonesia through Twitter, a researcher trained a sentiment classification algorithm by manually classifying Tweets according to various sentiments, allowing the algorithm to, in turn, classify other Tweets automatically [64].

Data interpretation process

To access, prepare, and analyze data sets effectively presents a series of institutional and technical challenges. Beginning with access, several institutional challenges must be overcome just to enable the ready sharing of data sets. Once data is accessible, several technical and data management challenges must be overcome.

Access

One of the first challenges which must be overcome in order to conduct big data analyses is for data to be more openly shared [65]. This is particularly important for those development-oriented institutions which do not, themselves, generate the data of interest. Indeed, one data scientist/artist pointed out how one of the biggest challenges he faces is simply trying to access data from institutions, including government agencies, which hold vested interests and/or a commodity-ownership perspective over the data they store [25].

An entire ecosystem is needed to open and use big data effectively [16], [36]. Common data standards and sharing incentives constitute two aspects of such an ecosystem. Leading international agencies will have to address the challenge of collaborating to define and agree on efficient and well-coordinated sharing mechanisms [55]. Standards for data integration, such as APIs, are needed, as are standards to coordinate data generation. Examples of mechanisms to develop both kinds of standards include IEEE, IEC for the smart grid, or the Global Earthquake Model for disaster preparedness.

As the challenge of shared standards is overcome, the incentive to share the data must be strengthened. To this end, business models need to be developed to ensure that private sector actors are willing to share data [36]. Also, governments need to design policy to help capture the value of big data and enable sharing across agencies [5]. A particular challenge in this regard is the definition of intellectual property rights which retain data ownership yet allow researchers and/or decision makers to use the data for perhaps originally unintended purposes [26]. In addition, governments may have to consider the design of privacy policies for personal and proprietary data, safeguards against information misuse, and regulations on financial transactions and online purchases [12].

As the challenge of opening data has partly been addressed in the last decade, several nascent phenomena have emerged. Beyond enabling analysis, opening data and making it available to the public motivates citizens to engage personally with the data and, in some cases, correct information and improve the accuracy of government databases [5]. Furthermore, opening data can serve as a catalyst for engaging large groups of citizens to apply their technical capacities and desire for social betterment to design novel ways to leverage data for the public good (e.g. the National Day of Civic Hacking or the International Space Apps Challenge). Opening data has also contributed to the emergence of an open science paradigm among academics concerned with enabling research accessibility by the public, permitting wider use of research findings, facilitating research collaborations, and/or simply sharing datasets with other researchers [66].

An additional challenge, however, that has emerged is the fact that as data is opened and its value is recognized by private players, they are less willing to share their data. One researcher pointed to the fact that accessing data from companies 10 years ago was much easier, for example, than it is today, partially for this same reason [26]. The researchers in the case study Understanding Labor Market Shocks using Mobile Phone Data indicated that the ability to obtain mobile data would be virtually impossible for private sector agencies and extremely challenging even for governments and multilaterals, often relying on personal relationships. Much of the research in that area, therefore, is being done by academia [22].

Whether data is accessed through public or private institutions presents different challenges. Public institutions can often release data for free; however, administrative hoops, which the above discussion emphasizes, can present great barriers to access. Moreover, as the researchers in the Billion Prices Project discovered, public agencies are often so accustomed to using traditional data sources that they face an additional cultural hurdle to engaging with big data, one that may take time and use to overcome before its value is internalized [29]. Private institutions, on the other hand, have massive stores of data, the value of which is being increasingly recognized. In this case, security and intellectual property concerns may exist. The Billion Prices Project methodology of scraping price data from retailers addressed the privacy concerns of enterprises by building in lag time between data collection and data sharing, and by sharing data at aggregated levels [29].

On the other hand, directly sourcing data from crowds of people, often via an online aggregation mechanism such as one of the many social media tools, presents the unique challenge of ensuring wide and high-quality participation. If participation is meager, then the data collected will not only be insufficient from a quantitative standpoint, but the perspective may not be reflective of a broader set of the population. The Global Pulse project profiled as a case study in this report—Tracking Food Price Inflation using Twitter Data—chose to focus on a part of the world where many people Tweet, such that Twitter represents a broader segment of the population, for this very reason [64].

CASE STUDY: Tracking Food Price Inflation using Twitter Data

DATA PALETTES: Sentiments posted via Twitter; official price statistics via public agency surveys

Motivation

The Global Pulse lab set out to investigate the possibility of utilizing social media data to give an indication of social and/or economic conditions. In particular, they investigated the relationship between food and fuel price Twitter posts and the corresponding changes in official price index measures. By conducting the research in the Indonesian context, the research benefited from a large user base: the city of Jakarta has the largest Twitter presence in the world, with 20 million user accounts.

Data Generation

The Twitter data used was generated between March 2011 and April 2013 and formed a largely unstructured dataset of over 100,000 Tweets that were highly temporally referenced, spatially referenced by region, and identifiable by Twitter account, and at times by person. This data was complemented by structured public datasets regarding food and fuel prices, including official price indices from the Indonesian State Ministry of National Development Planning (BAPPENAS) and the World Food Program (WFP).
In particular, CPI data for general foodstuffs came from the Indonesian Office of Statistics (BPS), and data on milk and rice prices from the WFP; both datasets are typically generated through questionnaires and surveys. As Indonesia was also experiencing soybean shortages during the period of study, leading to the import of soy from the U.S., soybean inflation data for the U.S. was also collected from the World Bank.

Data Interpretation

Data from the Twitter "firehose", which streams all tweets live, was captured from March 2011 to April 2013. Full access to the Twitter firehose is difficult to obtain; however, Global Pulse was able to secure it through their use of the Crimson Hexagon software, which collected and stored tweets in a database that could then be analyzed. Other services that could provide similar firehose access include DataSift and Gnip. The Crimson Hexagon ForSight software includes a classification algorithm which can analyze strings of text and sort them into categories of interest. For this study, data was categorized through an initial filter of content, based on keyword searches, as being related to food price increases or fuel price increases. Specific words in the Bahasa Indonesia language were utilized as keywords, which the algorithm used to filter those tweets which dealt with the aforementioned price increases. Then, a researcher manually classified randomly selected tweets based on sentiment as "positive", "negative", "confused/wondering", or "realised price high/high-no emotion". This manual selection by the researcher essentially "trains" the algorithm, which can, in turn, classify the remaining tweets automatically. By the end of the process, over 100,000 tweets were collected and categorized, forming a dataset which could be analyzed using simple statistical regression techniques.
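The filter-then-train workflow described above can be illustrated with a toy pipeline: a keyword filter followed by a classifier that generalizes from manually labeled examples. The keywords, labels, and tweets are invented English stand-ins, not the study's Bahasa Indonesia terms or its actual algorithm.

```python
# Toy version of the study's two-step pipeline: keyword filtering, then a
# classifier "trained" on manually labeled examples. All strings are invented.
PRICE_KEYWORDS = {"price", "expensive", "cost"}   # stand-ins for Bahasa terms

def relevant(tweet):
    """Step 1: keep only tweets mentioning price-related keywords."""
    return any(word in tweet.lower().split() for word in PRICE_KEYWORDS)

# Step 2: manually labeled tweets serve as the training examples.
labeled = [
    ("rice price so expensive now", "negative"),
    ("glad the fuel price dropped", "positive"),
]

def classify(tweet):
    """Label a tweet by word overlap with the labeled examples
    (a crude stand-in for the real sentiment algorithm)."""
    words = set(tweet.lower().split())
    best = max(labeled, key=lambda ex: len(words & set(ex[0].split())))
    return best[1]

stream = ["nice weather today", "milk price is expensive again"]
hits = [t for t in stream if relevant(t)]
print(hits)                 # ['milk price is expensive again']
print(classify(hits[0]))    # 'negative'
```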
Using such techniques, correlation coefficients could be estimated to analyze the relationship between Twitter conversations and official food price inflation data, among other questions.

Insight Implementation

Some of the final conclusions of the report indicated a relationship between official food inflation statistics and the number of tweets about food price increases. In addition, another relationship was discerned between the topics of food and fuel prices within the Twitter data. Altogether, this initial effort indicated the potential of social media data for analyzing public sentiment as well as objective economic conditions. That said, the research demonstrated that, while there was certainly a relationship between the Twitter data and official statistics, there was also an abundance of false positives, i.e. large changes in Twitter data with no corresponding change in actual inflation measures. More research is certainly needed to improve the classification process, as well as the process of geolocation (using arbitrary strings in social media profiles to arrive at exact geographic coordinates), in order to more fully take advantage of the heterogeneity of social media data and associate sentiment with particular regions of a country. Finally, higher granularity of official statistics is needed in order to compare them more effectively with the correspondingly spatially and temporally specific Twitter data.

From Idea to Ongoing Process

The research has indicated that semi-automatic, retrospective analysis is possible for social media data. To the extent that classification algorithms are strengthened, and more fine-grained economic data with which to train algorithms are made available, the potential to implement ongoing real-time analysis of such data appears to be within close reach.

References

UN Global Pulse. (2014). Mining Indonesian Tweets to Understand Food Price Crises. Retrieved from: http://www.unglobalpulse.org/sites/default/files/Global-Pulse-Mining-Indonesian-Tweets-Food-Price-Crises%20copy.pdf
Correspondence with Alex Rutherford, Data Scientist, UN Global Pulse, April 2014.

Given that individuals who participate in crowd-sourced endeavors vary in their skill, interest, and motivation level, it is possible that mechanisms will need to be put in place to reward desired behavior, as well as to develop effective quality control processes through which participants check each other's inputs.

Preparation

Once data has been accessed, analysts sometimes consider filtering the quantity and types of data needed. These filters, however, need to be carefully considered; otherwise they may preclude useful information [65]. Regarding data filtration, many big data practitioners estimate that 80% of the effort when working with data is simply cleaning it so that it can be readily analyzed [34]. A critical step in Global Pulse's project tracking food price inflation in Indonesia was the filtration of the data from the Twitter firehose; for example, the researchers had to filter out Tweets in English and Indonesia's numerous local dialects to isolate Tweets in the predominant Bahasa Indonesia language [64].

Once the data is cleaned, scientists must deal with the challenge of how to manage large datasets [65]. In managing these, subjectivity can be introduced when attempting to lend structure to unstructured datasets. With multiple sources, ensuring accurate metadata is critical [65]. When well documented, metadata gives an indication of the provenance of a dataset, thereby increasing transparency and improving understanding of results [37], [54]. With regard to combining multiple data sources, one interviewee warned against using too many all at once. Instead, it is helpful to begin with a couple of relevant data sets and then, as capabilities develop and questions are refined, other datasets can be added [25].

A frequently cited challenge when managing data fed from various companies and/or agencies is ensuring individual privacy and security [36], [65], [12], [32]. For example, although individual-level health or financial data can be used to assist with specifying appropriate medical treatment or a financial product, consumers may be concerned with sharing such information [5]. Adding to the complexity of this challenge is the fact that each country has different regulations around data privacy [18]. One way to address such a challenge is the development of an internationally recognized code of conduct regarding the use of personal data, which might include best practices regarding data acquisition, sharing, and anonymization² [54].

Analysis

Given a well-prepared and structured dataset, a series of considerations must be kept in mind. First, large datasets do not always preclude the use of statistical methodology to account for the degree to which a finding is representative of a population. A data scientist will be aware of the various ways in which, for example, posts by Twitter users are not representative of the world's thoughts and opinions [42]. In particular, the selection bias which occurs when inferring real-world behavior using digital sources of information must be kept in mind. In other words, people who use digital sources of information, such as blog posts or online shopping, may be systematically unrepresentative of the larger population under consideration. A related concern pertains to the fact that analyzing what has been said about behaviors is different from analyzing behaviors themselves [10].

Also important is that modeling take into account the data generating process. For example, one interviewee pointed out how small price changes derived from retail scanner data may be misleading, due to the fact that such prices are actually imputed from data which often comes in the form of weekly total value and quantity sold.
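The interviewee's point about imputed prices can be made concrete: a unit price derived from weekly totals is a quantity-weighted average that hides a mid-week price change. All numbers below are invented.

```python
# A week where the shelf price changed mid-week: 4 days at 2.00, then 3 days
# at 2.50, with invented daily quantities sold.
daily = [(2.00, 10), (2.00, 10), (2.00, 10), (2.00, 10),
         (2.50, 10), (2.50, 10), (2.50, 10)]

# Scanner data often arrives only as weekly totals:
total_value = sum(p * q for p, q in daily)      # 155.0
total_quantity = sum(q for _, q in daily)       # 70

# The "price" an analyst imputes is a quantity-weighted weekly average...
imputed_price = total_value / total_quantity
print(round(imputed_price, 4))  # 2.2143

# ...which matches neither actual shelf price and hides the mid-week change.
print(sorted({p for p, _ in daily}))  # [2.0, 2.5]
```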
Using such imputed prices may fail to take into account systematic relationships underlying the data due to, say, mid-week price changes, the use of coupons by certain individuals, etc. [29].

Another consideration when working with big data is the tendency toward apophenia, or seeing patterns where there are in fact none [13]. One interviewee for this report pointed out how, by its very nature, big data can give rise to highly significant statistical correlations. The tendency to assign meaning to these statistically significant relationships must, however, be tempered by rigorous model development [39]. Although the adage that correlation is different from causation is critical to understand, and every big data scientist keeps it in mind, the value of correlations remains. For example, correlations may provoke interpretive stories, stimulate sales (such as Amazon's "suggest" feature), or assist with forecasts in the very short run. Computers excel at estimating such numerical correlations; however, they are less capable of assessing visual patterns. Data visualization therefore leverages the human capability to discover such patterns, thereby stimulating further scrutiny of the data [37]. As indicated by the preceding ideas, a conducive environment must be created within which data can be managed, visualized, and analyzed [65].

Given an analysis which integrates discovery and creativity through direct observation of the data, human intuition, and strong model development, a communicative challenge remains when working with big data. Documenting and communicating relevant data sources is one part of it; however, it pales in comparison with the challenge of communicating the methodology whereby insights were generated. This is a challenge which has to be overcome in order to create an environment through which others, such as policy makers or other collaborators, can make sense of the results and understand how to use them. The idea of an "Analytics Cloud Environment" discussed above indirectly addresses this by giving users the opportunity to explore how various model assumptions affect results. As potential task complexity grows, institutions will have to face the challenge of considering the cost of analysis using big data, as well as integrating feedback into the use of big data in order to adjust its use [65].

Insights and their implementation

Addressing challenges with respect to the data itself (its generation, features, and interpretation) is insufficient, as elucidated in section two, to interact meaningfully with the data. The interpretation must effectively give rise to insights which can then be acted upon to effect the desired change in the world. At one level, the very kinds of insights of interest can present their own challenges or considerations. At another, the process through which insights are translated into action must be strengthened by overcoming several intertwined challenges. Each of these aspects is considered, in turn, below.

Insight scope and purpose

As discussed in section two, one way to think about the kinds of insights to generate is in terms of scope. One big data effort may be used to better understand the situation of a microregion, while another may cover an entire continent, or even the whole world. Clearly, the larger the scope, the greater the potential data required. More subtle, however, is the fact that the broader the scope, the greater the potential diversity of datasets, especially when no singly-managed dataset covers the entire scope of interest. Analysts will need to ensure that datasets can interact with each other to form an accurate representation of the whole. This requires overcoming data management challenges, as well as making careful decisions about aggregation to ensure helpful comparisons among the various datasets.

Alternatively, insights of interest may be considered in terms of their primary purpose. Is the use of big data meant to generate insights which heighten awareness, deepen understanding, and/or enhance forecasting? If the primary purpose is to enhance awareness, then it is likely that capabilities around visualization will be especially important, since datasets will have to be shown in an accessible way to enable shared human understanding. If, instead, the primary purpose is to understand or forecast, capabilities around data modeling will be of primary importance. This is especially the case for endeavors which seek to understand systems or processes. By utilizing sophisticated inductive techniques such as machine learning, for example, forecasting may be improved through the use of additional variables and functional forms, while not necessarily enhancing understanding of the causal web underlying a system's functioning. Understanding a system requires more than showing statistical significance; it requires the development of sensible, internally consistent models of behavior.

Insight implementation process

Given a sufficient degree of clarity on the kinds of insights of interest, some thought should be given to the ways in which the analysis of data will be translated into actions which will change the original behaviors of interest. As discussed in section two, such an implementation process requires defining next steps, creating structures to take them, and, finally, actually taking those steps. Below are discussed some considerations and challenges which are connected with all three aspects.

Project vs. process

Whether big data is used for a project or a process will impact the degree to which a landscape analysis should be conducted to understand what data sources were used in the past to accomplish similar objectives. In addition, if an institution seeks ongoingly to utilize big data to assist in managerial tasks, then such a process-orientation will need to

Beyond a culture which stifles data openness, several beliefs or perspectives may inhibit the use of big data by institutions. One study showed that public administrators often hold three major viewpoints about big data. Some view big data as presenting an opportunity to improve services and solve practical tasks. Others view big data as a fad which will not necessarily improve services; instead, these think that big data may actually strengthen governmental control of the masses. Finally, others believe that, while big data may offer opportunities for the government to reach more citizens and tailor services, the use of big data will not necessarily constitute an actual improvement; in fact, those individuals who are not electronically connected may be marginalized [21]. Particularly in contrast to the last two views, those in the public sector seeking to realize the value of big data will have to face the challenge of creating a culture which seeks to improve systems and processes based on data³ [5].

Even if a single institution is convinced of the benefits of using big data, a culture of cooperation and collaboration among multiple institutions is critical [28]. At one level, cooperation is critical for the establishment of standards by which various agencies or individuals may access and use data.

2. The question of anonymization itself must be examined carefully. Even when a dataset with personal identifiers (e.g. name or social security number) is randomized, such assignments to individuals may be imputed. One way to deal with this may be simply to aggregate the data sufficiently to enable analysis while precluding privacy or security concerns [26].
Beyond this, place emphasis on technical capacities to analyze collaboration would help elucidate the possible uses as well as institutional capacities to act upon and of big data and help nurture and channel energies ongoingly inform such analysis. to utilize it. One example of this is the Global Partnership on Development Data recommended by Culture a panel in the United Nations concerned with poverty. This Partnership would bring “together diverse Challenges at the level of culture can have a but interested stakeholders” who would, as a first substantial bearing on the fruitful execution of a step, “develop a global strategy to fill critical gaps, project endeavor which utilizes big data. At one expand data accessibility, and galvanize international level, institutions may avoid releasing data due to efforts to ensure a baseline for post-2015 [poverty] paternalistic beliefs that others cannot make use of targets” [21]. Whether in public or non-profit sectors, the data or due to concerns that the data will be used cooperation among managers and elected politicians against them [33]. Overcoming cultural challenges is essential to use big data to inform decision such as these will be particularly critical when opening making [12]. One way to begin collaborations with data for citizen engagement [15]. an institution as a whole is to find those individuals within it who are willing to share data and collaborate 3. One author has noted that, where such a data-driven culture exists, the tendency of conceiving of society as simply a set of sub-populations rather than as a single social body will have to be avoided. Although this appears to be primarily a philosophical concern, it may well have real implications on defining the public good and defining the means to achieve it, including the role of government [13]. 50 on generating insights for policy or management know about them and the way that data is utilized decisions [26]. [37]. 
Beyond awareness, people need to be educated on the value of their data and how to control its Use and abuse dissemination through means as simple as, for example, privacy settings on social media [45]. At Another set of challenges that pertain to big data an even higher level, substantive ethical questions relate to people’s interactions with the results need to be discussed in the public sphere regarding or prescriptions from big data analyses. In some the use of predictive, albeit somewhat imperfect, cases, a challenge that must be overcome is the information by companies or other institutions possibility that many organizations, companies, and seeking to advance their own interests [39]. institutions would rather avoid the use of big data since it may reveal truths which they would prefer Due to the somewhat unstoppable nature of the remain hidden. Such non-truth-seeking behavior use of big data, dealing with such challenges at could include pharmaceutical companies avoiding an institutional level will likely have to be handled the use of big data to assist with understanding the through management, rather than preventing effects of a drug once introduced in the market. adoption [39] [37], such as by increasing transparency Alternatively, healthcare companies may avoid health on how data is used [54]. Institutions may have to personalization via big data since it will encourage move beyond hierarchical, centralized, and rule-driven patients to use more preventive medicine thus structures in order to deal with the variety of uses and reducing their income from doctor visits [37]. abuses of big data [37]. As the potential for big data is increasingly Another, more subtle challenge arises when the use of understood, the potential for it to be used for big data is integrated into systems and outputs. When purposes contrary to the public good and/or ethical big data is systematically incorporated to generate principles is just beginning to be explored. 
Such uses some kind of output, such as predictions on crime hot of big data often leverage the seeming predictability spots or product preferences, the very people whose of human behavior. Some cases have already begun behavior or preferences are being predicted may tentatively to demonstrate the ability to use big data, consciously interact with the system in a manipulative for example, to infer when an individual is dating way. One example of this is “google bombing,” or [45], the strength of a marriage [46], or the effect cases in which people craft webpage features to trick of the size of a retail markdown sign on sales [39]. the Google search algorithm in order to raise their Such predictive information may be used for various page-rankings. Another dangerous possibility is for purposes, not all of which are malicious. However, criminals to outsmart police agencies relying on big the fact that big data permits such predictions raises data. In particular, criminals could use the same data concerns about ensuring that it is used in a way which and algorithms to infer the police’s expectations harmonizes with the public good. thereby enabling them to behave in unexpected ways [37]. These are examples of a broader class of At one level, such powerfully predictive deductions problems arising from the study of social systems as from big data raise the challenge of ensuring that compared to less self-aware systems. The subject people are aware of what companies or institutions may, in fact, become the object of its own study. 51 CASE STUDY DATA PALETTES Using Google Google Trend Data Trends to nowcast economic activity in Colombia Motivation GDP Official Data For the well-timed design of economic policy, it is desirable to count with reliable statistics that allow constant monitoring of economic activity by sectors, preferably in real time. 
However, it is a well-recognized problem that policymakers must make decisions before all data about the current economic environment are available. In Colombia, the publication of the leading economic indicators that the Administrative Department for National Statistics (DANE, for its acronym in Spanish) uses to analyze economic activity at the sectorial level has an average lag of 10 weeks. In this context, the Ministry of Finance of Colombia looked for coincident indicators that allow tracking the short-term trends of economic activity. In particular, the approach was to exploit the information provided by the Google search statistics tool known as "Google Trends".

Data Generation

The data for this study comes from Google web searches. Based on the web searches performed by Google users, Google Trends (GT) provides daily information on the query volume for a search term in a given geographic region (for Colombia, GT data are available at the departmental level and also for the largest municipalities). Each individual Google Trends series is a relative, not an absolute, measure of search volume. That is, the period in which the search interest in a keyword is highest within the dates of inquiry receives a value of 100, and all other periods for that series are measured relative to this highest period. To assess the performance of the indexes built using GT data, official economic activity data (both at the aggregate and at the sectorial level) from DANE are used. Both GT and DANE data are publicly available.

Data Interpretation

In order to exploit the information provided by GT data, it is critical to choose a set of keywords that can be used as a proxy for consumer behavior or beliefs. In some sense, GT data takes the place of traditional consumer-sentiment surveys.
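The relative scaling described above can be illustrated with a minimal sketch. The raw query counts here are invented for illustration; Google Trends itself never exposes absolute volumes, only the already-scaled index:

```python
# Sketch of a Google-Trends-style relative index: the peak period becomes 100
# and every other period is expressed relative to that peak.
def relative_index(raw_counts):
    """Scale a series of (hypothetical) raw counts so its maximum equals 100."""
    peak = max(raw_counts)
    return [round(100 * c / peak, 1) for c in raw_counts]

raw = [120, 180, 240, 150]      # hypothetical weekly query counts
print(relative_index(raw))      # -> [50.0, 75.0, 100.0, 62.5]; week 3 is the peak
```

Note that because each series is rescaled to its own peak, two GT series can only be compared in shape (trends and turning points), not in absolute search volume, which is why the case study uses them as indicators rather than direct measures.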
For example, the use of data for a certain keyword (such as the brand of a certain product) might be justified when a drop or a surge in web searches for that keyword can be linked to a fall or an increase in its demand and, therefore, to lower or higher production in the specific sector producing that product. The analysis carried out by the Ministry of Finance of Colombia identifies keywords meaningfully related to the different economic sectors and leverages GT data for these keywords to produce leading indicators of economic activity at the sectorial level (ISAAC, for its acronym in Spanish). It is important to highlight that this approach is used only for some of the key sectors of the economy (such as agriculture, industry, commerce, construction, and transport). The performance of other sectors (such as mining, financial services, or personal services) cannot be assessed using web searches, and other leading indicators need to be used. Once the sectorial ISAACs are produced, the information is used to produce an aggregate leading indicator for economic activity in the country (ISAAC+).

Insight Implementation

The research carried out by the Ministry of Finance of Colombia showed the potential of web search information to explain variations in economic activity for some specific sectors in Colombia (in particular, agriculture, industry, commerce, construction, and transport). GT queries allow the construction of leading indicators which determine in real time the short-term trend of the different economic sectors, as well as their turning points. In addition, the leading indicator produced for aggregate economic activity (ISAAC+) shows a high correlation with its reference variable (the annual variation of quarterly GDP), capturing its turning points and short-term behavior.
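The kind of validation described here, checking how closely a leading indicator tracks its reference variable, can be sketched as a plain Pearson correlation. This is only an illustration of the comparison, not the Ministry's published ISAAC methodology, and both series below are invented numbers:

```python
# Sketch: Pearson correlation between a hypothetical leading indicator and a
# hypothetical reference series (e.g. quarterly GDP annual variation, in %).
def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

indicator  = [1.2, 2.0, 1.5, 2.8, 3.1]   # hypothetical ISAAC-style index
gdp_growth = [1.0, 2.2, 1.4, 2.5, 3.3]   # hypothetical GDP annual variation, %
print(round(pearson(indicator, gdp_growth), 2))   # -> 0.97
```

A value near 1 indicates that the indicator moves with the reference series and shares its turning points; it says nothing, of course, about causation, which is why such indicators are used for nowcasting rather than structural analysis.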
The production of these leading indicators reduces the lag associated with the publication of traditional statistics and helps policy makers in the country to make timely decisions. The main limitation of this work is that the level of Internet penetration in Colombia is still relatively low (about 50%), which implies that GT data reflects information from just a part of the country's consumers: those who have access to the Internet and use Google for their web searches. As Internet penetration deepens in the future, the representativeness of GT data will improve and make the ISAAC indicators even more relevant.

From Idea to Ongoing Process

The research project, led by Luis Fernando Mejía, General Director of the Macroeconomic Policy Department in the Ministry of Finance, raised interest inside and outside Colombia. The ISAAC indicators produced with GT data are currently published on a monthly basis in reports which show the historical correlation between the ISAAC and GDP data at the sectorial level and highlight sectorial trends projected by the ISAAC indicators. Other countries are looking at this interesting project and might start producing similar big-data-driven forecasts in the future.

References

L. F. Mejía, D. Monsalve, Y. Parra, S. Pulido and Á. M. Reyes, "Indicadores ISAAC: Siguiendo la actividad sectorial a partir de Google Trends," Ministerio de Hacienda y Crédito Público, Bogotá, 2013. Available: http://www.minhacienda.gov.co/portal/page/portal/HomeMinhacienda/politicafiscal/reportesmacroeconomicos/NotasFiscales/22 Siguiendo la actividad sectorial a partir de Google Trends.pdf.
[Accessed 28 August 2014].

Correspondence with Luis Fernando Mejía, General Director of Macroeconomic Policy, Ministry of Finance and Public Credit, Republic of Colombia.

SECTION 6
RETROSPECT AND PROSPECT

With the advent of computing, humanity has entered a new era of an unprecedented and exponentially rising capacity to generate, store, process, and interact with data. Seeking ways to maximize efficiency and increase product offerings by leveraging such data, the private sector has gained substantial experience over the last few decades. Such experience undoubtedly built upon scientific capabilities to analyze data and practical business acumen in order to ensure effective, real-world applications. In the public sphere, leveraging various, often very large, data sources to effect real improvements has only just begun in the last decade. The cases in this document testify to the promise of such data to enhance perception, deepen understanding, and hone forecasting abilities.

Although experience with big data is relatively nascent, several conclusions can already be drawn. For data to be effective, it must be seen in the context of an ongoing process to better understand and interact with the world. In this light, use of big data should begin with a question and a description of the behaviors of interest. Use of big data from various sources requires effective and scalable methods to access, manage, and interpret data. Data must be well documented to ensure traceability. Models through which data is interpreted must be carefully selected to correspond to the data generating process. Collaboration among practitioners, social scientists, and data scientists will be critical in order to ensure that the requisite understanding of real-world conditions, data generation mechanisms, and methods of interpretation are effectively combined.

Such collaboration will also enable the overcoming of major technological, scientific, and institutional challenges. Technical challenges include managing unstructured data, linking diverse data sets, and scaling systems to respond to increasing volumes of data. Scientific challenges include the rising demand for data scientists, developing appropriate models to work with large and diverse data sets, dealing with selection bias in various forms, and communicating analytical results in order to yield actionable insights. Institutional challenges include limited access to data, cultures that don't promote learning, ethical concerns about the utilization of personal data, and the lack of standards with regard to data storage.

As the technological capacities to generate, store, and process data continue unabated, humanity will need to develop a corresponding measure of various technical, social, cultural, and institutional capabilities to ensure that big data is used toward helpful and effective ends such as strengthening early warning for disasters, enhancing awareness by providing real-time feedback, or better understanding social systems. The necessary capabilities enable the integration of big data into ongoing processes rather than one-time projects, thereby enabling its value to be continually released and refined. Spaces will be needed in which such technical, cultural, and institutional capabilities can commensurately develop. For example, members of various institutions, corporations, and governments may convene to develop a shared perspective on the usefulness of big data for poverty reduction and agree to standards on its utilization. Given the variety and pervasiveness of the capabilities necessary to utilize big data to address big problems, collaborative spaces are needed to enhance the capacity of individuals, organizations, businesses, and institutions to elucidate challenges and solutions in an interactive manner, strengthening a global culture of learning to reduce poverty and promote shared prosperity.

REFERENCES

[1] H. Varian, "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives, vol. 28, no. 2, pp. 3-28, 2014.

[2] M. Hilbert and P. López, "The world's technological capacity to store, communicate, and compute information," Science, vol. 332, no. 6025, pp. 60-65, 2011.

[3] V. Mayer-Schonberger and K. Cukier, Big Data: A Revolution that Will Transform how We Live, Work and Think, New York: Houghton Mifflin Harcourt Publishing Company, 2013.

[4] International Telecommunication Union, "The World in 2013: ICT Facts and Figures," [Online]. Available: http://www.itu.int/en/ITU-D/Statistics/Pages/facts/default.aspx. [Accessed January 2014].

[5] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh and A. H. Byers, "Big data: The next frontier for innovation, competition, and productivity," McKinsey & Company, 2011.

[6] D. Fisher, R. DeLine, M. Czerwinski and S. Drucker, "Interactions with big data analytics," Interactions, vol. 19, no. 3, pp. 50-59, 2012.

[7] R. Bucklin and S. Gupta, "Commercial use of UPC scanner data: Industry and academic perspectives," Marketing Science, vol. 18, no. 3, pp. 247-273, 1999.

[8] F. Diebold, "A personal perspective on the origin(s) and development of 'big data': The phenomenon, the term, and the discipline," Working Paper, 2012.

[9] S. Lohr, "The age of big data," New York Times, 2012.

[10] United Nations Global Pulse, "Big Data for Development: Challenges and Opportunities," United Nations, 2012.

[11] A. McAfee and E. Brynjolfsson, "Big data: the management revolution," Harvard Business Review, vol. 90, no. 10, pp. 60-68, 2012.

[12] M. Milakovich, "Anticipatory Government: Integrating Big Data for Smaller Government," in Oxford Internet Institute "Internet, Politics, Policy 2012" Conference, Oxford, 2012.

[13] D. Boyd and K. Crawford, "Six provocations for big data," in A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 2011.

[14] The World Bank, "Open Data Essentials," 2014. [Online]. Available: http://data.worldbank.org/about/open-government-data-toolkit/knowledge-repository. [Accessed July 2014].

[15] L. Hoffmann, "Data mining meets city hall," Communications of the ACM, vol. 55, no. 6, pp. 19-21, 2012.

[16] J. Manyika, M. Chui, D. Farrell, S. V. Kuiken, P. Groves and E. A. Doshi, "Open Data: Unlocking Innovation and Performance with Liquid Information," McKinsey & Company, 2013.

[17] S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins and N. Kruschwitz, "Big data, analytics and the path from insights to value," MIT Sloan Management Review, vol. 52, no. 2, pp. 21-31, 2011.

[18] S. Lucas, Interviewee, [Interview]. December 2013.

[19] A. Asquer, "The Governance of Big Data: Perspectives and Issues," in First International Conference on Public Policy, Grenoble, 2013.

[20] Centre for Economics and Business Research, "Data equity: unlocking the value of big data," SAS, 2012.

[21] United Nations, "A New Global Partnership: Eradicate Poverty and Transform Economies Through Sustainable Development," United Nations Publications, New York, 2013.

[22] J. Blumenstock, Interviewee, [Interview]. January 2014.

[23] V. Frias-Martinez, Interviewee, [Interview]. December 2013.

[24] M. Khouja, Interviewee, [Interview]. December 2013.

[25] A. Siegel, Interviewee, [Interview]. December 2013.

[26] N. Eagle, Interviewee, [Interview]. December 2013.

[27] A. Leshinsky, Interviewee, [Interview]. January 2014.

[28] R. Kirkpatrick, Interviewee, [Interview]. February 2014.

[29] A. Cavallo, Interviewee, [Interview]. February 2014.

[30] N. Scott and S. Batchelor, "Real Time Monitoring in Disasters," IDS Bulletin, vol. 44, no. 2, pp. 122-134, 2013.

[31] S. Thornton, Interviewee, [Interview]. February 2014.

[32] United Nations Global Pulse, "Big Data for Development: A Primer," United Nations Publications, 2013.

[33] W. Hall, N. Shadbolt, T. Tiropanis, K. O'Hara and T. Davies, "Open data and charities," Nominet Trust, 2012.

[34] A. R. Syed, K. Gillela and C. Venugopal, "The Future Revolution on Big Data," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 6, pp. 2446-2451, 2013.

[35] E. Bertino, P. Bernstein, D. Agrawal, S. Davidson, U. Dayal, M. Franklin, ... and J. Widom, "Challenges and Opportunities with Big Data," White Paper, 2011.

[36] World Economic Forum, "Big Data, Big Impact: New Possibilities for International Development," World Economic Forum, Geneva, 2012.

[37] D. Bollier, "The Promise and Peril of Big Data," The Aspen Institute, Washington, D.C., 2010.

[38] M. Ghasemali, Interviewee, [Interview]. January 2014.

[39] R. Cross, Interviewee, [Interview]. December 2013.

[40] E. Wetter and L. Bengtsson, Interviewees, [Interview]. January 2014.

[41] K. Warner, T. Afifi, T. Rawe, C. Smith and A. De Sherbinin, "Where the rain falls: Climate change, food and livelihood security, and migration," 2012.

[42] M. Hilbert, "Big Data for Development: From Information- to Knowledge Societies," Working Paper, 2013.

[43] L. F. Mejía, D. Monsalve, Y. Parra, S. Pulido and Á. M. Reyes, "Indicadores ISAAC: Siguiendo la actividad sectorial a partir de Google Trends," Ministerio de Hacienda y Crédito Público, Bogotá, 2013. Available: http://www.minhacienda.gov.co/portal/page/portal/HomeMinhacienda/politicafiscal/reportesmacroeconomicos/NotasFiscales/22 Siguiendo la actividad sectorial a partir de Google Trends.pdf. [Accessed 28 August 2014].

[44] L. T., Interviewee, [Interview]. December 2013.

[45] R. Vasa, Interviewee, [Interview]. January 2014.

[46] R. Smolan and J. Erwitt, The Human Face of Big Data, Sausalito, CA: Against All Odds Productions, 2012.

[47] N. Eagle, M. Macy and R. Claxton, "Network Diversity and Economic Development," Science, vol. 328, no. 5981, pp. 1029-1031, 2010.

[48] IBM Global Business Services, "Analytics: The real-world use of big data in financial services," IBM Global Services, Somers, NY, 2013.

[49] NYSE Euronext, "NYXdata > Data Products > NYSE Euronext > FlexTrade," 2013. [Online]. Available: http://www.nyxdata.com/nysedata/Default.aspx?tabid=1171. [Accessed February 2014].

[50] L. Barrington, Interviewee, [Interview]. January 2014.

[51] PriceStats, "PriceStats," 2014. [Online]. Available: http://www.pricestats.com. [Accessed February 2014].

[52] A. Robertson and S. Olson, "Sensing and Shaping Emerging Conflicts: Report of a Joint Workshop of the National Academy of Engineering and the United States Institute of Peace: Roundtable on Technology, Science, and Peacebuilding," The National Academies Press, 2013.

[53] UNICEF, Regional Office for Central and Eastern Europe and the Commonwealth of Independent States, "Tracking Anti-Vaccination Sentiment in Eastern European Social Media Networks," 2013.

[54] A. Howard, Data for the Public Good, O'Reilly Media, 2012.

[55] OECD, "New Data for Understanding the Human Condition: International Perspectives," OECD, 2013.

[56] M. Lagi, K. Bertrand and Y. Bar-Yam, "The food crises and political instability in North Africa and the Middle East," arXiv:1108.2455, 2011.

[57] S. Greengard, "Policing the Future," Communications of the ACM, vol. 55, no. 3, pp. 19-21, 2012.

[58] IBM Institute for Business Value, "Analytics: The real-world use of big data," 2012.

[59] B. Wescott, Interviewee, [Interview]. January 2014.

[60] V. Lehdonvirta and M. Ernkvist, "Converting the virtual economy into development potential: knowledge map of the virtual economy," InfoDev/World Bank White Paper, pp. 5-17, 2011.

[61] K. Kitner, Interviewee, [Interview]. December 2013.

[62] S. Lohr, "Sure, Big Data is Great. But So Is Intuition," New York Times, 29 December 2012.

[63] SecondMuse, "Random Hacks of Kindness 2013 Report," 2013.

[64] United Nations Global Pulse, "Mining Indonesian Tweets to Understand Food Price Crises," United Nations Publications, 2014.

[65] F. Almeida and C. Calistru, "The main challenges and issues of big data management," International Journal of Research Studies in Computing, vol. 2, no. 1, 2012.

[66] B. Fecher, "Open Science: One Term, Five Schools of Thought," in The 1st International Conference on Internet Science, Brussels, 2013.

ANNEX 1
SELECTED BIBLIOGRAPHY

Big data, general topics

Almeida, F. L. F., & Calistru, C. (2012). The main challenges and issues of big data management. International Journal of Research Studies in Computing, 2(1).

Asquer, A. (2013). The Governance of Big Data: Perspectives and Issues. Retrieved from http://ssrn.com/abstract=2272608 or http://dx.doi.org/10.2139/ssrn.2272608

Bertino, E., Bernstein, P., Agrawal, D., Davidson, S., Dayal, U., Franklin, M., ... & Widom, J. (2011). Challenges and Opportunities with Big Data. Whitepaper presented for the Computing Community Consortium.

Bollier, D., & Firestone, C. M. (2010). The promise and peril of big data (p. 56). Washington, DC, USA: Aspen Institute, Communications and Society Program.

Boyd, D., & Crawford, K. (2011). Six provocations for big data. Presented at the Oxford Internet Institute's "A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society" on September 21, 2011.

Bucklin, R. E., & Gupta, S. (1999). Commercial use of UPC scanner data: Industry and academic perspectives. Marketing Science, 18(3), 247-273.

Centre for Economics and Business Research. (2012). Data equity: unlocking the value of big data. Centre for Economics and Business Research White Paper, 4, 7-26.

Fisher, D., DeLine, R., Czerwinski, M., & Drucker, S. (2012). Interactions with big data analytics. Interactions, 19(3), 50-59.

Hilbert, M., & López, P. (2011). The world's technological capacity to store, communicate, and compute information. Science, 332(6025), 60-65.

IBM Institute for Business Value. (2012). Analytics: The real-world use of big data. Executive Report. Retrieved from http://www-935.ibm.com/services/us/gbs/thoughtleadership/ibv-big-data-at-work.html

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-31.

Lohr, S. (2012). Sure, Big Data is Great. But So Is Intuition. Retrieved from http://www.nytimes.com/2012/12/30/technology/big-data-is-great-but-dont-forget-intuition.html?_r=0

Lohr, S. (2012). The age of big data. New York Times. Retrieved from http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution that Will Transform how We Live, Work, and Think. Eamon Dolan/Houghton Mifflin Harcourt.

McAfee, A., & Brynjolfsson, E. (2012). Big data: the management revolution. Harvard Business Review, 90(10), 60-66.

Smolan, R., & Erwitt, J. (2012). The human face of big data. Sausalito, CA: Against All Odds Productions.

Syed, A. R., Gillela, K., & Venugopal, C. (2013). The Future Revolution on Big Data. International Journal of Advanced Research in Computer and Communication Engineering, 2(6), 2446-2451.

Big data case studies

Eagle, N., Macy, M., & Claxton, R. (2010). Network Diversity and Economic Development. Science, 328(5981), 1029-1031. Retrieved from http://realitymining.com/pdfs/Eagle_Science10.pdf

Greengard, S. (2012). Policing the Future. Communications of the ACM, 55(3), 19-21.

Herrera, J. C., Work, D. B., Herring, R., Ban, X. J., Jacobson, Q., & Bayen, A. M. (2010). Evaluation of traffic data obtained via GPS-enabled mobile phones: The Mobile Century field experiment. Transportation Research Part C, 18, 568-583.

Hoffmann, L. (2012). Data mining meets city hall. Communications of the ACM, 55(6), 19-21.

Lagi, M., Bertrand, K. Z., & Bar-Yam, Y. (2011). The Food Crises and Political Instability in North Africa and the Middle East. arXiv preprint arXiv:1108.2455.

Milakovich, M. (2012). Anticipatory Government: Integrating Big Data for Smaller Government. Paper presented at the Oxford Internet Institute "Internet, Politics, Policy 2012" Conference, Oxford, 20-21 September.

Robertson, A., & Olson, S. (2013). Sensing and Shaping Emerging Conflicts. The National Academies Press, 13-14. Retrieved from http://www.nap.edu/catalog.php?record_id=18349

Robertson, C., Sawford, K., Daniel, S. L., Nelson, T. A., & Stephen, C. (2010). Mobile phone-based infectious disease surveillance system, Sri Lanka. Emerging Infectious Diseases, 16(10), 1524.

Scott, N., & Batchelor, S. (2013). Real Time Monitoring in Disasters. IDS Bulletin, 44(2), 122-134.

UNICEF, Regional Office for Central and Eastern Europe and the Commonwealth of Independent States. (2013). Tracking Anti-Vaccination Sentiment in Eastern European Social Media Networks. Retrieved from http://www.unicef.org/ceecis/Tracking-anti-vaccination-sentiment-in-Eastern-European-social-media-networks.pdf

Big data for development

Hilbert, M. (2013). Big Data for Development: From Information- to Knowledge Societies. Available at SSRN 2205145.

Howard, A. (2012). Data for the public good. O'Reilly.

Karlsrud, J. (2014). Peacekeeping 4.0: Harnessing the Potential of Big Data, Social Media, and Cyber Technologies. In Cyberspace and International Relations (pp. 141-160). Springer Berlin Heidelberg.

Lehdonvirta, V., & Ernkvist, M. (2011). Converting the virtual economy into development potential: knowledge map of the virtual economy. InfoDev/World Bank White Paper, 1, 5-17.

Organisation for Economic Co-operation and Development. (2013). New Data for Understanding the Human Condition. Retrieved from http://www.oecd.org/sti/sci-tech/new-data-for-understanding-the-human-condition.htm

United Nations Global Pulse. (2012). Big Data for Development: Challenges & Opportunities. UN, New York, NY.

United Nations Global Pulse. (2013). Big Data for Development: A Primer. Retrieved from http://www.unglobalpulse.org/bigdataprimer

World Economic Forum. (2012). Big Data, Big Impact: New Possibilities for International Development. Retrieved from http://www.weforum.org/reports/big-data-big-impact-new-possibilities-international-development

Open data / Open science

Fecher, B., & Friesike, S. (2013). Open Science: One Term, Five Schools of Thought (No. 218). German Council for Social and Economic Data (RatSWD).

Hall, W., Shadbolt, N., Tiropanis, T., O'Hara, K., & Davies, T. (2012). Open data and charities. Nominet Trust. Retrieved from http://www.nominettrust.org.uk/knowledge-centre/articles/open-data-and-charities

McKinsey Global Institute. (2013). Open Data: Unlocking innovation and performance with liquid information. Retrieved from http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information

ANNEX 2
INTERVIEW LIST

Aaron Siegel, Head of Interaction and Online Experience, Fabrica Benetton. Topic areas: data visualization.

Alberto Cavallo, Billion Prices Project Lead, MIT Sloan School of Management. Topic areas: economics, price indices.

Anthony Leshinsky, Media Services Lead, Coldlight Solutions. Topic areas: business intelligence, data analytics.

Bill Wescott, Executive Vice President, The CoSMo Company. Topic areas: satellite data, systems modeling.

Bayan Bruss, Mohamad Khouja, Jian Khoddedad and Jon Pouya Ehsani, Co-Founders, Logawi. Topic areas: text analytics and analysis.

Erik Wetter, Assistant Professor, Stockholm School of Economics; Co-founder, Flowminder. Topic areas: mobile data, big data for development.

Graham Dodge, CEO and Founder, Sickweather. Topic areas: social media data.

Jean Francois Barsoum, Senior Managing Consultant, Smarter Cities Water and Transportation, IBM. Topic areas: big data, urban infrastructure and operations.

Joshua Blumenstock, Assistant Professor, University of Washington. Topic areas: mobile and social media data.

Kathi Kitner, Senior Researcher, Anthropologist, Intel Labs. Topic areas: big data for development.

Laura Crow, Principal Product Manager, M-PESA. Topic areas: mobile data and financial data.

Linus Bengtsson, Executive Director, Flowminder/Karolinska Institutet. Topic areas: mobile data, big data for development.

Luke Barrington, Senior Manager, Research and Development, Digital Globe. Topic areas: satellite data, big data analysis.

Mahyar Ghasemali, Partner and Co-founder, dbSeer. Topic areas: data infrastructure and processing.

Mohamad Khouja, Data Scientist and Big Data Solutions Specialist, IREX/Logawi. Topic areas: opinion mining, sentiment and lexical analysis.

Nathan Eagle, CEO, Jana.

Robin Cross, Research Director, DemandLink. Topic areas: retail data and predictive analysis.

Sean Thornton, Research Fellow, Data-Smart City Solutions. Topic areas: data and
Shannon Lucas
Senior Enterprise Innovation Manager, Vodafone
Topic areas: mobile and financial data

Rajesh Vasa
Senior Lecturer, Faculty of Information and Communication Technologies, Swinburne University of Technology
Topic areas: social media big data

Vanessa Frias-Martinez
Research Scientist, Telefonica Research
Topic areas: mobile data

Renato de Gusmao Cerqueria
Senior Manager, Natural Resources Solutions, IBM Research Brazil
Topic areas: data analytics and urban systems

Robert Kirkpatrick
Director, UN Global Pulse
Topic areas: big data for development

ANNEX 3
GLOSSARY

3 "V"s - A term defining certain properties of big data: volume (the quantity of data), velocity (the speed at which data is processed), and variety (the various types of data).

Algorithm - A formula or step-by-step procedure for solving a problem.

Anonymization - The process of removing specific identifiers (often personal information) from a dataset.

API (Application Programming Interface) - A set of tools and protocols for building software applications that specify how software components interact.

Business intelligence - The use of software tools to gain insight and understanding into a company's operations.

Clickstream analytics (analysis) - The collection, analysis, and reporting of data about the quantity and succession of mouse clicks made by website visitors.

Crowdsourcing - The collection of data through contributions from a large number of individuals.

Data cleaning/cleansing - The detection and removal, or correction, of inaccurate records in a dataset.

Data exhaust - Data collected as a digital by-product of other behaviors.

Data governance - The process of handling and managing the data used in an endeavor, including policies, data quality, and risk management.

Data migration - The transition of data from one format or system to another.

Data science - The gleaning of knowledge from data as a discipline that includes elements of programming, mathematics, modeling, engineering, and visualization.

Data silos - Fixed or isolated data repositories that do not interact dynamically with other systems.

Data warehousing - The practice of copying data from operational systems into secondary, offline databases.

Geospatial analysis - A form of data visualization that overlays data on maps to facilitate better understanding of the data.

Hadoop - An open-source platform for developing distributed, data-intensive applications.

Internet of things - The unique digital identifiers in objects that can automatically share data and be represented in a virtual environment.

Latency - The delay in the delivery of data from one point to another, or in one system's response to another.

Machine learning - The creation of systems that can learn or improve themselves on the basis of data; often linked to artificial intelligence.

Map/reduce - A method of breaking a complex problem into many chunks, distributing them across many computers, and then reassembling the results into a single answer.

Mashup - The use of data from more than one source to generate new insight.

Metadata - Information about, and descriptions of, data.

Nowcasting - A combination of "now" and "forecasting," used in both meteorology and economics to refer to immediate-term forecasting on the basis of real-time data flows.

Open data - Public, freely available data.

Open science - An initiative to make scientific data and research openly accessible.

Petabyte - 1 thousand terabytes.

Predictive analytics/modeling - The analysis of contemporary and historic trends using data and modeling to predict future occurrences.

Quantitative data analysis - The use of complex mathematical or statistical modeling to explain, or predict, financial and business behavior.

Reality mining - The study and analysis of human interactions and behavior through the use of mobile phone, GPS, and other machine-sensed environmental data.

Relational database - A database in which information is formally described and organized in tables representing relations.

Sentiment analysis (opinion mining) - The use of text analysis and natural language processing to assess the attitudes of a speaker, author, or group.

Structured data - Data arranged in an organized data model, like a spreadsheet or relational database.

Terabyte - 1 thousand gigabytes.

Text analytics - The process of deriving insight from unstructured, text-based data.

Topic modelling - The use of statistical models or algorithms to decipher themes and structure in datasets.

Tweet - A post on the Twitter social networking site, restricted to a string of up to 140 characters.

Unstructured data - Data that cannot be stored in a relational database and can be more challenging to analyze, ranging from documents and tweets to photos and videos.

Visualization - Graphic ways of presenting data that help people make sense of huge amounts of information.

Photo credits: cover page: Mor Naaman / flickr.com; page 2: Martin Sojka / flickr.com; page 7: Uncultured / flickr.com; page 12: CGIAR Climate / flickr.com; page 25: CGIAR Climate / flickr.com; page 35: Linh Nguyen / unsplash.com; page 42: NASA's Marshall Space Flight Center / flickr.com; page 54: unsplash.com; page 56: Wojtek Witkowski / flickr.com
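The glossary's map/reduce entry can be illustrated with a minimal sketch. The example below runs the two phases on a single machine using Python's standard library (in a real system such as Hadoop, the mappers and reducers would run on many computers); the sample text chunks are invented for illustration.

```python
from collections import Counter
from functools import reduce

# Map/reduce sketch: each "mapper" counts words in one chunk of text;
# the "reducer" merges the partial counts into a single answer.

def map_chunk(chunk: str) -> Counter:
    """Mapper: produce a partial word count for one chunk."""
    return Counter(chunk.lower().split())

def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reducer: merge two partial counts."""
    return a + b

chunks = [
    "big data in action",
    "data for development",
    "big data for development",
]

partial_counts = [map_chunk(c) for c in chunks]           # map phase
total = reduce(reduce_counts, partial_counts, Counter())  # reduce phase
print(total["data"])  # "data" appears once in each of the three chunks
```

The same split-process-merge pattern scales because each mapper needs only its own chunk, so chunks can be processed in parallel before the reduce step combines them.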
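The sentiment analysis (opinion mining) entry can likewise be made concrete with a toy lexicon-based scorer. The word lists below are invented for this sketch; production systems use large lexicons or machine-learned models rather than a handful of hand-picked words.

```python
# Toy lexicon-based sentiment scorer: count positive words minus
# negative words in a text. Hypothetical word lists for illustration.

POSITIVE = {"good", "great", "helpful", "reliable"}
NEGATIVE = {"bad", "poor", "slow", "unreliable"}

def sentiment_score(text: str) -> int:
    """Return (# positive words) - (# negative words) for a text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("the service was good and reliable"))  # positive score
print(sentiment_score("slow and unreliable network"))        # negative score
```

A positive score suggests favorable sentiment and a negative score unfavorable sentiment; aggregating such scores over many social media posts is the kind of analysis several case studies in this report rely on.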