Report No: ACS11163
Central America 6C Big Data
Big Data in Action for Development
GMFDR
LATIN AMERICA AND CARIBBEAN
Document of the World Bank

Standard Disclaimer: This volume is a product of the staff of the International Bank for Reconstruction and Development / The World Bank. The findings, interpretations, and conclusions expressed in this paper do not necessarily reflect the views of the Executive Directors of The World Bank or the governments they represent. The World Bank does not guarantee the accuracy of the data included in this work. The boundaries, colors, denominations, and other information shown on any map in this work do not imply any judgment on the part of The World Bank concerning the legal status of any territory or the endorsement or acceptance of such boundaries.

Copyright Statement: The material in this publication is copyrighted. Copying and/or transmitting portions or all of this work without permission may be a violation of applicable law. The International Bank for Reconstruction and Development / The World Bank encourages dissemination of its work and will normally grant permission to reproduce portions of the work promptly. For permission to photocopy or reprint any part of this work, please send a request with complete information to the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA, telephone 978-750-8400, fax 978-750-4470, http://www.copyright.com/. All other queries on rights and licenses, including subsidiary rights, should be addressed to the Office of the Publisher, The World Bank, 1818 H Street NW, Washington, DC 20433, USA, fax 202-522-2422, e-mail pubrights@worldbank.org.

BIG DATA IN ACTION FOR DEVELOPMENT

This volume is the result of a collaboration of World Bank staff (Andrea Coppola and Oscar Calvo-Gonzalez) and SecondMuse associates (Elizabeth Sabet, Natalia Arjomand, Ryan Siegel, Carrie Freeman, and Neisan Massarrat).
Design of the report by SecondMuse (Nick Skytland). The design of the "Data for Action Framework" and the data palette used in the case studies was created by The Phuse.

TABLE OF CONTENTS

Executive Summary
Section 1. What is big data?
  The world is filled with data
  Hello "big data"
  Use and estimated value of big data
  Big data in action for international development
Section 2. How can we better understand and utilize big data?
  Insights and behaviors of interest
  CASE STUDY: The Billion Prices Project and PriceStats
  Generation to interpretation of data
    Data generating process
    Data content and structure
    Data interpretation process
    Insight implementation process
  CASE STUDY: Understanding Labor Market Shocks using Mobile Phone Data
Section 3. What can big data look like for the development sector?
  Examples, by medium and relevant data set
  Big Data for Development in Central America: World Bank Pilot Efforts
  Examples by medium and purpose
  Areas of high potential for big data
    Early warning
    Enhancing awareness and enabling real-time feedback
    Understanding and interacting with social systems
  Specific challenges and ongoing processes
  CASE STUDY: Forecasting and Awareness of Weather Patterns using Satellite Data
Section 4. How can we work with big data?
  Technological capabilities
  Human capabilities and data intermediaries
  CASE STUDY: Connected Farmer Alliance
Section 5. What are some of the challenges and considerations when working with big data?
  Data generation process and structure
  Data interpretation process
    Access
    CASE STUDY: Tracking Food Price Inflation Using Twitter Data
    Preparation
    Analysis
  Insights and their implementation
    Insight scope and purpose
    Insight implementation process
  CASE STUDY: Using Google Trends to nowcast economic activity in Colombia
Section 6. Retrospect and Prospect
References
Annex 1: Selected Bibliography
Annex 2: Interview List
Annex 3: Glossary

FOREWORD

When we started this study, our main objective was to explore the potential of big data to close some of the existing data gaps in Central America. For us at the World Bank, data are critical to design efficient and effective development policy recommendations, support their implementation, and evaluate results. Paradoxically, many of the countries where poverty is high, and hence where good programs are most needed, are also those countries where data are scarcest. Not surprisingly, then, we were seduced by the potential of big data. Not only has connectivity through access to mobile phones, the internet, and social media increased dramatically; it has also generated an endless source of precious information. This has been well understood by marketing experts who now target individual needs. In addition to commercial use, these large amounts of data are a potential asset for the development community, which could use them to help end poverty and promote shared prosperity. And indeed, as this report documents, there are now good examples of how big data is used to improve growth forecasts by ministries of finance, track population movements, or plan emergency responses to weather-related disasters. In the end, we need to recognize that we did not advance as much as we wished in filling the blanks in our Central American databases. True, we are working on three pilots that are helping us think about how to use these approaches in our daily work.
For example, our teams are exploring whether we can use (i) night-time illumination patterns captured by satellites to infer the spatial distribution of poverty; (ii) internet search keyword data to improve our forecasts of price series; and (iii) Twitter data to better understand public reactions to policy decisions. Admittedly, we did not have a major breakthrough, but the work done helped us to start appreciating the potential of big data, and we will continue pursuing this agenda, trying to find country-specific solutions that may emerge from big data analysis. I consider it worth sharing the work done by the team so far in (i) structuring a work program around a topic where we at the Bank had little, if any, expertise; (ii) presenting existing examples where big data is being used to improve development prospects; (iii) reflecting on the many development aspects that can be touched by this type of analysis; (iv) considering the technical and human capacity needs to make the best of big data; and (v) assessing the challenges of working with big data. I also think there are important lessons emerging from the collaboration with SecondMuse and a good number of groups in the Bank: the staff in the Central American Country Management Unit, the Macroeconomics and Fiscal Management Global Practice, the Transport and ICT Global Practice, the Open Finances Group, the Development Economics Group, and the Innovation Lab in the World Bank Institute. As the report notes, the age of big data is upon us. I hope that policy makers and development practitioners alike will find the work described in this report interesting and useful.

Humberto López
World Bank Country Director for Central America

EXECUTIVE SUMMARY

This report stemmed from a World Bank pilot activity to explore the potential of big data to address development challenges in Central American countries. As part of this activity we collected and analyzed a number of examples of leveraging big data for development. Because of the growing interest in this topic, this report makes available to a broader audience those examples as well as the underlying conceptual framework for thinking about big data for development.

In the development sector, various individuals and institutions are exploring the potential of big data. Call detail records via mobile phones are being used, often in combination with other data sources, to analyze population displacement, understand migration patterns, and improve emergency preparedness. Remote sensing images from satellites are showing promise to improve food security and minimize traffic congestion. Search queries and various text sources on the internet and social media are being ably analyzed for quick identification of disease epidemic changes or for inference of a population's sentiment on an event. Hence, big data shows promise to enhance real-time awareness, anticipate challenges, and deepen understanding of social systems by governments and other institutions. Yet moving things forward will require the collaborative formulation of key questions of interest which lend themselves to the utilization of big data, and the engagement of data scientists around the world to explore ways to address them.

To make effective use of big data, many practitioners emphasize the importance of beginning with a question instead of the data itself. A question clarifies the purpose of utilizing big data, whether it is for awareness, understanding, and/or forecasting. In addition, a question suggests the kinds of real-world behaviors or conditions that are of interest. These behaviors are encoded into data through some generating process which includes the media through which behavior is captured. Then various data sources are accessed, prepared, consolidated and analyzed. This ultimately gives rise to insights into the question of interest, which are implemented to effect changes in the relevant behaviors.

Utilizing big data for any given endeavor requires a host of capabilities. Hardware and software capabilities are needed for interaction of data from a variety of sources in a way which is efficient and scalable. Human capabilities are needed not only to make sense of data but to ensure a question-centered approach, so that insights are actionable and relevant. To this end, cooperation between development experts as well as social scientists and computer scientists is extremely important [1].

Big data is no panacea. Having a nuanced understanding of the challenges of applying big data in development will actually help to make the most of it. First, being able to count on plenty of data does not mean that you have the right data, and biased data could lead to misleading conclusions. Second, the risks of spurious correlations increase with the amount of data used. Third, sectoral expertise remains critical regardless of the amount of data available. And these are just some of the criticisms that one needs to take into account when thinking about the possibilities offered by big data. Care is needed in the use of big data and its interpretation, particularly since we are still in an early stage of big data analytics.

Several challenges and considerations with big data must be kept in mind. This report touches on some of them and does not pretend to provide answers and solutions but rather to promote discussion. For example, the process through which behaviors are encoded into data may have implications for the kinds of biases which must be accounted for when conducting statistical analyses. Data may be difficult to access, especially if it is held by private institutions. Even in the case of public institutions, datasets are often available but difficult to find due to limited metadata. Once data is opened, challenges around ensuring privacy and safety arise. This is also linked with the issue of personal data ownership. Even preparing data and ensuring its scalable and efficient use presents challenges, such as the time and effort required to clean data. Analysis, especially when using big data to understand systems, must carefully consider modeling assumptions; algorithm transparency is critical to maximize trust in data-driven interventions; and the work of translating insights into changes in the original behaviors of interest requires attention to the institutional structures and culture that will support the process.

The age of data is upon us. The means of its generation are undoubtedly multiplying, the technologies with which to analyze it are maturing, and efforts to apply such technologies to address social problems are emerging. Through a concerted and collaborative effort on the part of participants at various levels, big data can be utilized within meaningful systems and processes which seek to generate and apply insights to address the complex problems humanity faces.

SECTION 1
WHAT IS BIG DATA?

Data is a growing element of our lives. More and more data is being produced and becoming known in the popular literature as "big data", its usage is becoming more pervasive, and its potential for international development is just beginning to be explored.

The world is filled with data

In 2007, the world's capacity to store data was just over 10^20 bytes, approaching the amount of information stored in a human's DNA, and the numbers are growing.

KEY FINDINGS

• Big data can be used to enhance awareness (e.g. capturing population sentiments), understanding (e.g. explaining changes in food prices), and/or forecasting (e.g. predicting human migration patterns).

• Mediums that provide effective sources of big data include, inter alia, satellite, mobile phone, social media, internet text, internet search queries, and financial transactions. Added benefits accrue when data from
various sources are carefully combined to create "mashups" which may reveal new insights.

• It is key to begin with questions, not with data. Once the setting for the analysis is defined, the focus of the research can move to the behaviors of interest and the consequent data generation process. The interpretation of this information will be used to produce actionable insights, with the possible objective of influencing the behaviors of interest considered. Along these lines, this report develops a Data for Action Framework to better understand and utilize big data.

• Making good use of big data will require collaboration of various actors, including data scientists and practitioners, leveraging their strengths to understand the technical possibilities as well as the context within which insights can be practically implemented.

Between 1986 and 2007, the world's capacity to store information increased by approximately four orders of magnitude. The technological capacity to process all this data has, in fact, experienced even more rapid growth [2]. To cite but one example of the growth rates, the Sloan Digital Sky Survey collected more data over a few weeks in the year 2000 than had been collected by astronomers to date [3].

Simultaneous with the rise of the capacity to store and analyze data is the increasing capacity for people to contribute to and access it. It is estimated that as of 2013, 39% of the world's population has access to the internet. More astounding is the fact that mobile phone subscriptions are near 7 billion, approximately equaling the number of people on the face of the earth, yet access is not equal. While 75% of Europe's population has access to the internet, only 16% have access in Africa. Interestingly, discrepancies in mobile phone subscription rates are not as pronounced. While Europe has approximately 126 subscriptions per 100 inhabitants, Africa has 63 [4]. Furthermore, the use of more advanced "smart" cell phones is expected to increase in the coming years. McKinsey Global Institute (McKinsey) estimates, for example, that for countries in Asia other than China, India, or Japan, the number of "basic phones" is expected to decrease by 12% per year between 2010 and 2015, while "advanced phones" will rise by 17% during the same period [5].

Hello "big data"

It is no surprise that, given such a rising tide, the horizon of possibilities considered by decision-makers over the years has increasingly taken into account how to make best use of such a deluge of data. As early as 1975, attendees at the Very Large Databases conference discussed how to manage the then-considered massive US census data [6]. In the late '90s, practitioners were already using massive, high-frequency store-level scanner data to compute optimal pricing schedules [7]. Indeed, it was around that time that the term "big data" was used to refer to the storage and analysis of large data collections [8]. Since the '90s, a plethora of media, often automatically capturing behaviors and conditions of people or places, provide new data sources for analysis. These media sources include online shopping websites capturing transaction data, retail computers capturing purchase data, internet-enabled devices capturing environmental data, mobile phones capturing location data, and social media capturing data on consumer sentiment.

Decreasing costs of storage and computing power have further stimulated the use of data-intensive decision making [8], and decision making based on ever larger and more complex datasets requires more sophisticated methods of analysis. For example, in the case of visual analysis, smaller datasets lend themselves to simple visual interpretations, say, using a scatter plot or line chart, while bigger datasets can't always readily be captured using similarly structured small and fast renderings [6].

As a result of the rapidly evolving landscape, the popular press, such as the New York Times [9], as well as academic discourses, have increasingly used the term "big data", yet its definition has remained somewhat elusive. Technical audiences will often distinguish big data with respect to the largeness of the dataset(s), say, 200 gigabytes of data for a researcher in 2012. Practitioner audiences, on the other hand, will emphasize the value that comes from utilizing various kinds and sizes of datasets to make better decisions. Indeed, there does not appear to be any real and rigorous definition of big data; instead, it is often described in relative terms [3]. As an example, McKinsey uses the term big data to refer to datasets "whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze", thereby allowing the definition to vary by setting, such as industry, and time [5]. Yet with the great diversity of current storage and processing power, not to mention the doubling of such capacity in short time scales, such a definition makes comparability difficult: exactly the problem a definition seeks to avoid.

Several other authors [10], [11] often refer to the "Three V's" of big data (volume, variety, and velocity, originally discussed by Laney in 2001 [8]) to distinguish big data. Volume refers to the actual size of the dataset(s) analyzed, variety to the various types of datasets possibly combined to produce new insights, and velocity to the frequency with which data is recorded and/or analyzed for action. The concept of variety often underlies use of the term big data. For example, in examining the use of big data for governance, Milakovich points out how "single sources of data are no longer sufficient to cope with the increasingly complicated problems in many policy arenas" [12]. In this vein, Boyd and Crawford point out that big data "is not notable because of its size, but because of its relationality to other data. Due to efforts to mine and aggregate data, Big Data is fundamentally networked" [13]. For the purposes of this report, which considers the development context, the use of big data will refer to the use of any dataset(s) which are distinguished by one or more of the three "V" features mentioned above, for the purpose of generating actionable insights.

Other related terms, such as open data and crowdsourced data, have also come into vogue. "Open data", for example, refers to data which is made technically and legally open, i.e. available in machine-readable format and licensed to permit commercial and non-commercial utilization [14]. Cities such as New York sometimes open their data to stimulate innovation by drawing upon outside resources [15]. Countries such as Singapore, India, and the United States have also consolidated and opened data sets to the public. McKinsey distinguishes open data by explaining that, although big data may also be open data, it need not be. Open data, they explain, refers to the degree to which data is liquid and transferrable [16]. Some data, particularly privately held mobile phone data for example, is decidedly closed yet is certainly big data. Crowdsourced data is another popular term, which refers to data collected through the aggregation of the input of large numbers of people. Crowdsourced data can also be big data, but need not be. "Crowdsourced" emphasizes the means through which data is collected, whereas "big data" emphasizes the depth and complexity of the dataset(s).

Figure 1: Relationship between Big Data, Crowdsourced Data, and Open Data

Use and estimated value of big data

The use of big data has, over the past several years, been motivated largely by private interests. In a survey done around 2010, with over 3000 business executives in over 100 countries, it was found that the top-performing organizations were "twice as likely to apply analytics to activities" including day-to-day operations and future strategies [17]. Businesses have analyzed consumer purchase data to make personalized recommendations, video and shopping cart transponder data to streamline a grocery store's layout, store- and product-level purchases together with climate data to maximize sales and minimize inventory costs, and location data on trucks together with geographic data to minimize fuel use and delivery times [5]. Accessing mobile data such as foot-traffic patterns or even phone operating systems has helped companies engage in more effective advertising [18]. Other specific examples include how Microsoft improved the accuracy of its grammar checker by increasing the relevant dataset from a million to a billion words, or how Google utilized a trillion words to provide more effective language translation services [3]. In short, intelligent use of big data is becoming an effective way for companies to outperform their competitors, often through more effective foresight and understanding of market dynamics [5]. While business has succeeded in demonstrating the marginal benefits which can accrue from big data, e.g. efficiency gains in more effective retail forecasting, more substantive or qualitative effects of big data, particularly in terms of social practices, are just beginning to emerge [19]. That said, the marginal benefits are substantial for private and public interests.

McKinsey estimates the potential additional value of big data in the sectors of US health care, European public sector administration, global personal location data, US retail, and global manufacturing to be over $1 trillion US dollars per year, half of which comes from manufacturing alone. Such value often comes from efficiency gains by reducing the inputs required for the same amount of output [5]. Another study estimated the value of big data via improvements in customer intelligence, supply chain intelligence, performance improvements, fraud detection, as well as quality and risk management. For the United Kingdom alone, this value was estimated at $41 billion US dollars per year in the private and public sectors [20].

Big data in action for international development

In addition to providing insight to make businesses more profitable, big data is showing promise to improve, and perhaps substantively change, the international development sector in novel ways [10]. Of general interest is the fact that big data is often produced at a much more disaggregated level, e.g. the individual instead of, say, the country level. Whereas aggregated data glosses over the often wide-ranging disparities within a population, disaggregated data allows decision makers to consider more objectively those portions of the population that were previously neglected [21].

Two basic approaches appear to stand out with respect to big data in an international development context. One is when big data is utilized for projects or processes which seek to analyze behaviors outside of government or development agencies in order to heighten awareness and inform decision making. Another approach is when big data is utilized for the analysis of behaviors internal to a single institution, such as the government, often to streamline and improve services.

Several examples demonstrate the use of big data for projects or processes more concerned with matters outside of governments or other types of agencies. For example, in one case it was found that a country's gross domestic product could be estimated in real time using light emission data collected via remote sensing. In this way, alternate data sources serve as a proxy for official statistics. This is especially helpful in the development context, since there is often a scarcity of reliable quantitative data in such settings [22]. Changes in the number of Tweets mentioning the price of rice in Indonesia were closely correlated with more directly measured indicators of food price inflation. Similarly, conversations about employment, including the sentiment of "confusion", via blogs, online forums, and news conversations in Ireland were found to precede by three months official statistics showing increases in unemployment [10]. In Guatemala, a pilot project explored how mobile phone movement patterns could be used to predict socioeconomic status, thereby approximating census maps which are often prohibitively expensive to produce. In Mexico, analysis of call detail records enabled tracking of population movements in response to the spread of epidemic disease and provided insight into the impact of policy levers like transportation hub closures, such that the velocity of infection rates was reduced by as much as 40 hours [23]. In Kenya, once the impact of mobile money transfers was evident, governmental regulations were changed to enable their increased use [18]. Text analysis of social media data has the potential to identify issues pertaining to various population segments over time, e.g. refugee challenges or political opinions, thereby allowing development organizations to listen more effectively to the needs of a population [24].

Several governments have used big data in a variety of ways to streamline processes, thereby increasing cost and time savings. In Sweden, the government used previous years' data combined with user confirmation via text messaging to streamline tax filings. In Germany, the Federal Labor Agency used its multidimensional historical customer data to assist unemployed workers more effectively, thereby reducing costs by approximately $15 billion USD annually. Altogether, McKinsey estimates that Europe's 23 largest governments could create $200-400 billion USD per year in new value over 10 years through the use of big data to reduce mistakes and fraudulent tax reporting behaviors [5].

The examples above may be used to stimulate thinking on similar or analogous uses of big data to drive resource efficiency, process innovation, and citizen involvement where resources are constrained, thus laying a strong foundation for poverty alleviation and shared prosperity around the world.

SECTION 2
HOW CAN WE BETTER UNDERSTAND AND UTILIZE BIG DATA?

Before delving into a technical and political analysis of the use of big data, it is helpful to have a contextual understanding of how such data may be generated, accessed, and acted upon. Figure 2 below describes this framework of data for action. This follows from the recognition that data is not the same as knowledge and that, therefore, a whole host of capacities are required to generate data-driven actionable insights for social betterment.

At the most basic level, there are behaviors or conditions existing in the world, which include, among many other things, climate, human sentiments, population movement, demographics, infrastructure, and market-based interactions. These behaviors and conditions, named behaviors henceforth, are encoded through some data generating process, which includes the triggers and media through which data is recorded. Once data in whatever form is generated, a data interpretation process takes place through which the raw data is accessed, consolidated, and analyzed to produce some actionable insight. Often, this interpretation process is cyclical and interactive rather than strictly linear. For example, the analysis may shed light on new data needed, thus requiring access to new data, or the very act of consolidating data may already reveal insights which will inform the analysis or the need to access data in a different way. Once actionable insights are determined, they must change the behaviors of interest through some implementation process. This process includes distilling insights into next steps, structuring organizational processes or projects accordingly, and engaging in the corresponding actions. To the extent that the insights gained are accurate and the implementation is done thoroughly, the behaviors will change as expected. In whatever case, however, all phases of the cycle inform each other through a rich process of learning.

Figure 2: Data for Action Framework (a cycle in which behaviors are encoded as data through a generation process defined by unit, trigger, and media; data yields insights through an interpretation process of question formulation, access, preparation, and analysis; and insights feed back into behaviors through an implementation process)

The above framework is general enough to permit an exploration of the use of big data in a variety of contexts, yet also specific enough to frame the subsequent discussion. Each element of the figure above is discussed in further detail in the sections below, starting with a careful consideration of the "end" use of such data, i.e. the kinds of insights and actions we wish to generate.

Insights and behaviors of interest

As expressed by experts who were interviewed for this report [25], [26], [27], [28], [29], as well as in a report distilling the experience of over 3,000 business executives around the world [17], the best way to proceed when considering how to use big data is to begin with questions, not data.
Undoubtedly, as data is collected and analyzed, the question will be refined.

CASE STUDY: The Billion Prices Project and PriceStats

Data palette: retail prices via websites

Motivation

The Billion Prices Project began when Alberto Cavallo, Sloan Faculty Director at MIT, noticed that Argentina could benefit from a more transparent, reliable, and low-cost method to track inflation. The primary purpose, therefore, was to enhance awareness of price changes and the purchase behaviors of interest, particularly the prices consumers were paying. A methodological challenge was also identified: could data via the Internet feasibly and reliably provide an alternative data source for traditional price index measures? In time, as the data and the methods to access them proved reliable, the data could be used by researchers to enhance understanding, as well as by central banks and financial traders to enhance their forecasting abilities.

Online price data provide distinct advantages over alternative sources of information, such as the direct survey methods used for traditional price indices or scanner data obtained through companies. In particular, online price data is very high-frequency (daily), in fact available in real time without delay, and has detailed product information such as the size or exact product identifier (i.e. SKU) as well as whether the product is on sale. Unfortunately, data on quantity sold is relatively unavailable via outward-facing Internet webpages.

Data Generation

As indicated by the data palette above, the data used for this study is generated by various firms via the Internet. While procedures to update prices will vary across companies, it is safe to state that, in general, such information is provided on a high-frequency and specific geographical basis (i.e. with store location), or, as indicated by the palette, at a high level of temporal frequency and spatial granularity.
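To make the "scraping" step concrete, here is a minimal sketch of how price and SKU information might be pulled out of a product listing using Python's standard library. Everything in it is invented for illustration: the markup, class names, and SKU formats are hypothetical, and real retailer pages (and the project's actual collection software) are far more elaborate.

```python
from html.parser import HTMLParser

# Hypothetical page fragment; real retailer markup varies by site.
PAGE = """
<div class="product" data-sku="A1-2345">
  <span class="name">Rice, 1 kg</span>
  <span class="price">$2.89</span>
</div>
<div class="product" data-sku="B9-0012">
  <span class="name">Flour, 1 kg</span>
  <span class="price">$1.15</span>
</div>
"""

class PriceScraper(HTMLParser):
    """Collects (sku, price) pairs from product listings."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._sku = None
        self._in_price = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("class") == "product":
            # Remember which product the following price belongs to.
            self._sku = attrs.get("data-sku")
        elif attrs.get("class") == "price":
            self._in_price = True

    def handle_data(self, data):
        if self._in_price:
            # Strip the currency symbol and parse the numeric price.
            self.records.append((self._sku, float(data.strip().lstrip("$"))))
            self._in_price = False

scraper = PriceScraper()
scraper.feed(PAGE)
print(scraper.records)  # [('A1-2345', 2.89), ('B9-0012', 1.15)]
```

Run against live sites, a collector like this would also need request throttling and careful timing which, as noted above, the project adjusts so that collection places no more load on a server than a regular page view.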
The data is relatively structured, since contextual text on the webpage will indicate the SKU for the product as well as other associated parameters. Once posted online, such data become publicly accessible for anyone to access.

Data Interpretation

In the initial efforts to compute price indices for Argentina, all Dr. Cavallo needed was his laptop and time. As the endeavor grew in scope, however, methods and considerations grew in sophistication. Today, a curation process ensures that the best sources of online data are selected. In some cases, data is collected from a retailer for a few years to evaluate whether the quality of the data is sufficiently high for inclusion. Also important is capturing offline retailers' price information, since fewer individuals in developing countries, in particular, purchase goods online. Although each download or "scraping" places no more pressure on a server than a regular page view, the timing is even adjusted to reduce demands on retailers' servers. Finally, retailers' privacy concerns are addressed in a variety of ways, including sharing data at an aggregated level and with time lags.

While the technology for "scraping" price and associated product information off of websites is inexpensive and readily available, Dr. Cavallo and his team realized that a careful process of cleaning the pulled data needed to be put in place, to ensure that all data sources are homogenized and prepared for aggregation and analysis.

The consolidated price data has been used in two ways. The first is to produce price indices which give an indication of inflation. This information is sold and/or shared with central banks and financial traders. The second is for academics seeking to understand market behaviors. Academics will often utilize econometric techniques, for example, which leverage the data sets' high degree of granularity and dimensionality in order to understand short-run or disaggregated effects of various policies.
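The scraping and index-construction steps described above can be sketched in miniature. The HTML structure (a price tag carrying a data-sku attribute) and the use of a Jevons-style index (a geometric mean of price relatives, one common construction for elementary price indices) are illustrative assumptions, not the project's actual pipeline:

```python
from html.parser import HTMLParser
from math import exp, log

class PriceParser(HTMLParser):
    """Collect (SKU, price) pairs from markup such as
    <span class="price" data-sku="A">$2.00</span>.
    The tag, class name, and data-sku attribute are hypothetical;
    each retailer's page structure must be inspected individually."""
    def __init__(self):
        super().__init__()
        self._sku = None
        self.prices = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("class") == "price" and "data-sku" in a:
            self._sku = a["data-sku"]

    def handle_data(self, data):
        if self._sku is not None:
            self.prices[self._sku] = float(data.strip().lstrip("$"))
            self._sku = None

def scrape(html):
    parser = PriceParser()
    parser.feed(html)
    return parser.prices

def jevons_index(prev, curr):
    """Geometric mean of price relatives over SKUs seen on both days."""
    common = prev.keys() & curr.keys()
    return exp(sum(log(curr[s] / prev[s]) for s in common) / len(common))

day1 = scrape('<span class="price" data-sku="A">$2.00</span>'
              '<span class="price" data-sku="B">$4.00</span>')
day2 = scrape('<span class="price" data-sku="A">$2.20</span>'
              '<span class="price" data-sku="B">$4.40</span>')
print(round(jevons_index(day1, day2), 3))  # prints 1.1 (both prices rose 10%)
```

In practice the pages would be fetched on a schedule and the pulled data cleaned and homogenized across retailers before any index is computed, which is where most of the effort described above goes.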
Insight Implementation

Individuals working at central banks and/or traders use such data to enhance decision making. For example, due to its high frequency, central banks can see day-to-day volatility combined with sector-by-sector comparisons which traditional measures cannot capture. As experience with such data sources deepens, many governmental statistical offices are shifting their mentality to accept alternative methods of data collection.

Academics also use such data to conduct research. The fact that the data has so many dimensions allows economists to avoid the use of complicated techniques to account for expected gaps of information in traditional data sources. In one paper, economists used the Internet-scraped data to show that the Law of One Price (a central economic theoretical result) tends to hold across countries that belong to formal or informal currency unions. In another paper, the dataset was used to show how natural disasters affect the availability and prices of goods. Researchers found that more indispensable goods tended to become unavailable more quickly and to recover from price increases more slowly.

From Idea to Ongoing Process

What started as an idea by one individual grew, in time, into a research initiative at MIT named the Billion Prices Project, which was primarily supported by grants. In time, the company PriceStats was created, through which high-frequency and industry-specific price indices could be sold. Through an agreement between PriceStats and the Billion Prices Project to share the data as well as some of the earnings, the research work can continue uninterrupted. In time, it is anticipated that the data will be more widely shared with academic institutions for research purposes.

References
BPP. (2014). The Billion Prices Project @ MIT. Retrieved from http://bpp.mit.edu/usa/
Cavallo, A., Neiman, B., & Rigobon, R. (2012). Currency Unions, Product Introductions, and the Real Exchange Rate (No. w18563).
National Bureau of Economic Research.
Cavallo, A., Cavallo, E., & Rigobon, R. (2013). Prices and Supply Disruptions during Natural Disasters (No. w19474). National Bureau of Economic Research.
Conversation with Alberto Cavallo, February 2014.
PriceStats. (2014). History. Retrieved from http://www.pricestats.com/about-us/history

A well defined question will clarify three basic elements: the primary purpose for using big data, the kinds of real-world behaviors of interest, and the scope of analysis. The following are possible questions:

• How do incomes in Uganda change throughout the year?
• How can assistance be more effectively provided to those in need after a natural disaster near the Pacific Ocean?
• What cities in the world should prepare for increased flooding due to climate change?
• How can governmental systems be designed to efficiently and appropriately tax individuals according to transparently defined principles?
• How can a country's employment sector be more effectively coordinated with training and education?

An analysis of multiple case studies, such as the ones included in the introduction, suggests that endeavors which utilize big data are primarily focused on advancing either awareness, understanding, or forecasting. Use of big data to heighten awareness is exemplified by projects which utilize nontraditional sources of data to serve as proxies for official statistics, such as the gross domestic product or Kenyan examples above. Real-time monitoring of disasters provides another avenue in which awareness is especially needed [30]. Big data in some cases is used to more deeply understand a phenomenon so that better policy levers can be utilized. The Mexico call detail records case described above is one relevant example. Finally, big data may be utilized to more accurately forecast behaviors so that institutions and populations can more effectively prepare. The unemployment case in Ireland is one such example.

Without a doubt, these three purposes are deeply interrelated. It is impossible to advance understanding, for example, without heightening awareness. Understanding, in turn, is often the foundation upon which forecasting methods are utilized. Conversely, certain machine-learning and inductive algorithms may be used to enhance forecasting ability, which can itself give rise to a better understanding of a system. That said, the three categories may be useful to stimulate thinking about the ways in which big data can be utilized. Awareness, understanding, and forecasting aptly correspond, for example, to the actions institutions may take to respond to present situations, design policy, or prepare for future events.

[Figure 3: Awareness/Understanding/Forecasting]

Regarding the question of behaviors, the following categories (with corresponding examples) are suggestive of areas of interest in the international socio-economic development context:

• product/service usage (e.g. non-market food consumption)
• market transactions (e.g. wheat purchase prices)
• human population movement (e.g. regional migration patterns)
• human population sentiment (e.g. public opinion on policies)
• human population conditions (e.g. extent of disease epidemic)
• weather conditions (e.g. ground temperatures)
• natural resource conditions (e.g. extent of forests)
• physical infrastructure conditions (e.g. locations of usable roads)
• agricultural production (e.g. extent of rice cultivation)

The table below shows how a few of the example behaviors may be used in the context of heightening awareness, advancing understanding, or enhancing forecasting.

Wheat purchase prices
  Awareness: How much are farmers currently receiving for the wheat they are selling?
  Understanding: What is driving changes in wheat purchase prices?
  Forecasting: What will wheat purchase prices be in a week?
Public opinion on policies
  Awareness: How favorably do citizens feel about a particular policy?
  Understanding: What factors drive public opinion on foreign relation policies?
  Forecasting: How will public opinion change in the coming months?

Regional migration patterns
  Awareness: During what times of the year do people engage in migration?
  Understanding: How do labor wage differences stimulate changes in migration patterns?
  Forecasting: How is country-to-country migration expected to change in the next few years?

In determining the scope of analysis, it is helpful to know if the use of big data in a process or project is intended to consider the situation for a single individual, city, region, nation, and/or the entire planet. By determining the scope, data requirements as far as size and interoperability become clear.

Generation to interpretation of data

Once the setting for the analysis is well defined, it will be helpful to consider the data available corresponding to the behaviors of interest. The first step in this is to describe and categorize the various types of data available.

Many authors have attempted to do this. One report notes that data may record what people say or do [32]. Another report points out that big data may have one or more of several features, including whether it is digitally generated, passively produced, automatically collected, geographically or temporally trackable, and/or continuously analyzed. The same report discusses a data taxonomy with four categories: data exhaust (i.e. passively generated, often real-time data), online information, physical sensors, and citizen-reported or crowdsourced data. Also included in the report is the division of data sources into traditional (e.g. census or survey data) vs. nontraditional (e.g. social media, mobile phone data) [10]. Another paper discusses how data may be: raw (primary, unprocessed data directly from the source), real-time (measured and accessible with minimal delay), and/or linked (published in a format which lends itself to identifying elements and links between datasets) [33]. Yet another paper explores how data may be structured (i.e. readily stored and accessed in terms of columns and rows) or unstructured (e.g. images or video) [34].

An examination of the above categories reveals that underlying the discourse to understand big data are three aspects of the cycle described above: the way in which data is generated, its content and structure, and the process through which it is accessed and interpreted. The following sections elucidate these three aspects, providing relevant categories with which to organize thinking about the opportunities in the big data space.

Data generating process

The data generating process has at least three features: the data-recording trigger, the level at which data is collected, and the media through which data is generated. To begin, the trigger that is utilized to encode behaviors into data may be active or passive. For example, much mobile data is passively or constantly collected by cell phone towers. Data such as a Twitter or Facebook post is generated actively, since a human decision actively precipitates its recording. Data is also generated at a particular, most granular level of analysis. These levels of analysis may pertain to temporal, spatial, human, or other characteristics. For example, retail sales data is generated at a high temporal frequency (daily, in some cases), spatially identified at a store level (i.e. latitude and longitude), and by product SKU. Finally, the media through which data is generated may include one or more of the following: satellite, mobile phone, point-of-sale, internet purchase, environmental sensors, and social media, among others.

Data content and structure

In terms of the data content and structure itself, at least six features shed light on the kind of big data one may choose. First, as mentioned above, data may be in a structured or unstructured form. Second, data may be temporally referenced; in other words, each record or instance has some form of temporal identification attached. Third, data may be spatially referenced, i.e. be tied to some geographic location data. Fourth, data may be person-identifiable; in other words, records are not only separable¹ and unique by person, but fail the test of anonymity. Fifth, data may have various sizes, from, say, hundreds of megabytes to several petabytes. Finally, a dataset may or may not be a compilation of other datasets.

¹ Separability implies that an individual's records can be separated and distinguished from others, whereas identifiability implies that it is known who that individual (often simply represented by a number or code in a database) is. As long as a database is person-identifiable, it is individually separable. However, a database could include separable yet not identifiable individuals. One such example is an individual wage record database where individuals' names have been scrambled with random, yet consistent, identifiers. It is worth noting, however, that even in cases where such data has been scrambled, there are settings and corresponding methods through which an analyst could inductively identify people.

Data interpretation process

Once datasets corresponding to the behaviors of interest have been identified, these must be collected and analyzed so that actionable insights may be generated.

Several authors have elucidated features pertaining specifically to the process of analyzing data. In one paper, several phases are described, including: data acquisition; information extraction and cleaning; integration, aggregation, and representation of the information; query processing, data modeling, and analysis; and, finally, interpretation. Each of these phases presents challenges, such as the heterogeneity of the data sources, the scale of the data, the speed at which data is generated and a response is needed, concerns for privacy, as well as enabling human collaboration [35]. A study done with sixteen data analysts using big data at Microsoft found five major steps which most engaged in: acquiring data, choosing an architecture (based on cost and performance), shaping the data to the architecture, writing and editing code, and finally reflecting and iterating on the results [6]. Finally, another report by the United Nations discusses how big data analysis requires the steps of filtering (i.e. keeping only relevant observations), summarizing, and categorizing the data [10].

[Figure 4: Data Interpretation Process]

In reviewing the above and several other papers, three interrelated components are involved in generating insights from data. First, the data must be accessed. Second, the data must be prepared. Third, the data must be analyzed. Undoubtedly, these three components are highly interrelated and need not progress linearly. For example, analysis can inform the need to access new data, and the data preparation component may reveal certain preliminary insights for analysis.

Access

In categorizing the ways to work with big data, many sources have actually been describing features pertaining to the access and/or generation of data. At a most basic level, data may be accessed using one of three sources: via public institutions, via private institutions, and directly from individuals or the crowd [36]. Publicly-sourced data includes zip-code level US census data, or massive weather station data from the US National Oceanic and Atmospheric Administration
(NOAA). Privately-sourced data may include store-level retail sales data for hundreds of thousands of stores across a country, or mobile phone location data for millions of individuals across a country. Crowd-sourced data may include millions of image artifact-identification analyses done by hundreds of thousands of individuals.

Management

Given that one or more dataset(s) can be created and/or accessed using the means discussed above, the data needs to be combined and prepared for analysis. At this stage, the steps mentioned above, such as choosing an architecture, extracting and cleaning the data, and integrating the datasets, become relevant. This component of the process, which will be further discussed in Section 4, takes a great deal of time and is closely tied with interpretation.

Interpretation

Once data has been formatted in such a way that it can be more readily accessed by the analyst, various methods can be used to interact with it and generate insights, or at least elucidate other questions which will refine data access and preparation. Two main mechanisms are often used to interpret big data: visualization and modeling.

Visualization plays a key role since it leverages the human strength to see patterns. It also often helps the analyst scrutinize the data more closely [37] and may enable ready comprehension of findings that would otherwise be difficult to achieve [10]. Indeed, in a survey of over 3,000 business executives around the world, a great number indicated that "visualizing data differently will become increasingly valuable" [17]. Visualization requires the thoughtful selection of the relevant pieces of information, displayed in a visually appealing way, to help a decision-maker understand the data.

Modeling is essential to interpreting data, especially if the purpose is to understand and/or forecast behaviors. Models attempt to describe the underlying processes occurring in the world which give rise to the behaviors of interest. Through the lens of a particular model, the data sheds light on the presence and degree to which relationships exist. For this reason, model selection is especially important; otherwise true relationships may remain undetected. To ensure a proper model, many researchers will emphasize how experience, expertise, and human intuition are critical [38]. In addition, it is important to consider the fact that, when modeling human, non-laboratory/controlled settings using high-frequency big data, several constraints and parameters affect the behaviors of interest. In this regard, models must be thoughtfully designed [39]. From a statistical-scientific standpoint, the use of big data has significant implications for modeling and theory development. At a basic level, an analysis of substantial amounts of data can inform the design of models and vice versa, such that a dynamic interplay exists between them [37].

Insight implementation process

Simply having insights pertaining to relevant behaviors is insufficient to cause a change in those behaviors. A process whereby insights generated are translated into action is necessary. In considering this process, at least three features stand out: defining next steps, adjusting or creating structures to ensure these steps are carried out, and taking the necessary actions.

The insights which are ultimately generated as a result of the analysis of data may be more or less actionable. Therefore, it is critical to articulate next steps arising from the insights and, of course, to act upon these steps. However, it is also necessary that some kind of structure is in place to ensure continuity of action.

In this regard, the distinction between using big data for one-time projects versus integrating its usage into an ongoing process merits consideration. Many short-term projects have utilized big data to explore its application as a proxy for official statistics or, in some cases, as a way to allocate resources after a disaster. Once immediate disaster relief is carried out, for example, continued use and refinement of the original big data may cease. At best the data is used as reference material for future projects. However, big data may be used in the context of ongoing processes through which its use is refined and the insights it leads to are continually acted upon. Retail giants like Walmart provide examples of the ways in which big data can be integrated into the ongoing functioning of a company, such as using forecasts based upon their large stores of data in order to adjust inventories, re-organize content, and price items. Cities like Chicago are gathering and tracking both historical and real-time big data on an ongoing basis to streamline operations and uncover meaningful correlations [31].

[Figure 5: Insight Implementation Process]

Whatever the nature of the use of big data, whether for a project or process, the three facets of the implementation process (defining, (re)structuring, and acting) are closely tied and are not necessarily linear. It is possible that action on the ground informs how structure is defined. For example, when aid agencies assist potential polio victims, they may discover that methods of communication need enhancing in particular ways. Alternatively, creating a structure to implement next steps may itself help lend further shape to them. For example, a non-profit may see that the insights generated from an analysis very clearly point to the need to increase the number of vaccinations. However, when beginning to define the structures which will actually carry this out, it may be discovered that other organizations are already doing this and that what is actually necessary is to more appropriately identify those individuals who need vaccinations.

Critical to ensuring that the process has a healthy degree of interactivity among its various elements is a culture of learning characterized by a willingness to share observations and learn from mistakes. Once insights are generated and implemented, behaviors change and new data is generated, whereby the cycle resumes and a rich process of learning ensues.
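As a deliberately minimal illustration of the modeling mechanism discussed under Interpretation above, the sketch below fits a linear trend to a short, hypothetical daily price series and extrapolates one step ahead. The series and the choice of ordinary least squares are illustrative assumptions only:

```python
def fit_line(ys):
    """Ordinary least squares fit of y = a + b*x for x = 0, 1, 2, ...
    using the closed-form formulas for intercept and slope."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return intercept, slope

prices = [100.0, 101.0, 102.0, 103.0]  # hypothetical daily index values
a, b = fit_line(prices)
forecast = a + b * len(prices)         # one-step-ahead extrapolation
print(forecast)  # prints 104.0
```

Real series are noisy and rarely linear, which is why, as noted above, model selection and human judgment matter far more than the mechanics of the fit.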
CASE STUDY: Understanding Labor Market Shocks using Mobile Phone Data
DATA PALETTES: Call detail records via mobile phones; Weather conditions via remote sensing imagery; Labor market data via public agency surveys

Motivation

The use of call detail records (CDRs) to track fine-grained patterns of population movement has been a topic of much research in the past few years. Joshua Blumenstock, a University of Washington professor, has spearheaded projects in Rwanda and Afghanistan using CDR records which enhance the ability to gather deeper insights into more nuanced forms of migration, including seasonal and temporary migration. Upon the basis of that research, Blumenstock and his colleague Dave Donaldson are now pursuing evidence to support the long-held theory of migrant workers acting as arbitrageurs whose movement among labor markets serves to bring those markets into equilibrium. The highly detailed internal migration data necessary to support that theory empirically has been unavailable until now. In addressing their central research question regarding the extent to which the movement of people over space helps stabilize and equilibrate wage dispersion, Blumenstock and Donaldson are seeking to analyze migration patterns in response to labor market shocks to shed light on the dynamics of internal migration in low income economies.

Data Generation

Although access to CDR data is typically a significant hurdle in projects of this kind, in this case Blumenstock and Donaldson already had access to the necessary CDR data from prior projects. Data is generated automatically as telecommunications companies encode phone calls into mobile phone records. For this study, the data spans several years, accounts for three developing economies, and is highly detailed both temporally and spatially.
In addition to CDR data, the project is making use of census data and external data sources underlying labor demand shocks, including weather conditions and domestic and international commodity prices. Public census data, while it changes rarely, includes a high volume of information and adds depth of knowledge to the longer-term dynamics of rural-to-urban migration. Government labor market data is highly structured and includes regional price and wage information at two regular intervals. Finally, the team purchased the high-frequency, high-resolution satellite weather data needed to assess climate-related labor shocks for the project.

Data Interpretation

CDR data is a remarkably unwieldy and inconsistent dataset to work with, received in unstructured repositories of millions of individual files. The initial steps in the project, from pre-processing the data to teasing out the relationships between the different datasets (e.g. wage and crop price data), are incredibly time-consuming. The team works with the data on a daily basis and builds models iteratively based on the data.

The data analysis process goes through three steps. The first step is to isolate local shocks to labor demand (e.g. weather or commodity price changes) and to identify the resulting labor market outcomes. The second step involves using the identified shocks to labor demand and drawing on CDR data to estimate the migration response to those shocks. This step aims to understand the migration response to labor market shocks and answer the following questions: What types of individuals are more or less likely to migrate in response to shocks? What regions are more receptive to migrants? How long after a shock do people stop migrating? How do these dynamics affect urbanization? The final step in the analysis is to estimate the actual effects of migration dynamics on the creation of wage equilibrium and to understand the speed and depth of the impact.
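The second analysis step, estimating migration responses from CDR data, relies on inferring where a subscriber "lives" in a given period. The sketch below uses one common simplification from the CDR literature (assigning each subscriber the modal district of their calls per month and flagging changes as moves); it is not the project team's actual method, and the record format is hypothetical:

```python
from collections import Counter

def monthly_home(cdr):
    """cdr: iterable of (subscriber, month, district) call records.
    Assigns each subscriber a 'home' district per month as the modal
    district of their calls, a common simplification in CDR research."""
    calls = {}
    for sub, month, district in cdr:
        calls.setdefault((sub, month), Counter())[district] += 1
    return {key: c.most_common(1)[0][0] for key, c in calls.items()}

def movers(cdr, month_a, month_b):
    """Subscribers whose inferred home district changed between months."""
    home = monthly_home(cdr)
    return {s for (s, m) in home
            if m == month_a and (s, month_b) in home
            and home[(s, month_a)] != home[(s, month_b)]}

cdr = [("u1", 1, "north"), ("u1", 1, "north"), ("u1", 2, "south"),
       ("u2", 1, "north"), ("u2", 2, "north")]
print(movers(cdr, 1, 2))  # prints {'u1'}
```

At scale, the same aggregation runs over billions of records, and the resulting district-to-district flows are what get related to the labor demand shocks isolated in the first step.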
There is already significant existing research on models used to determine measures of mobility using CDR data, and several research papers on the topic. Moreover, the project draws on classic theoretical models from the economic literature, like the Harris-Todaro model, regarding the relationship between wages, labor market shocks, and migration. The project team is currently building a quantitative framework that will allow them to test those theories based on the iterative models being developed through analysis of the wage and migration data.

Throughout the process of analysis, small insights continuously hone the questions asked and help determine relevant inputs. As data comes in, determinations are continuously made which refine the quantitative models that exist for each factor within the framework and the relationship between factors. Benchmarking against other existing data sources, including census data, is relevant in making those determinations.

Insight Implementation

Though this project is still in its early stages, it will deepen understanding of the role of migrants in labor markets. While the theories underlying many current policies designed to impact migration and urban-rural population flow are well developed, empirical evidence is remarkably thin due to the lack of detailed data. As tracking of population movements using CDR data becomes more commonplace, the insights gained through this historical analysis of data will play an integral part in understanding migration patterns.

From Idea to Ongoing Process

Like many of Blumenstock's projects which track migration patterns on the basis of CDR data, this retrospective project contributes to a fundamentally deeper understanding of how wages and labor markets are determined. Through such an understanding, labor policy can be more effectively designed for low income countries, including selection of actions to incentivize and disincentivize migratory behavior.
By pioneering, and rigorously documenting, a process for gathering insights based on a quantitative framework of evidence, this project could be a foundation upon which ongoing evaluation of government policies could be conducted.

References
Blumenstock, J. & Donaldson, D. (2013). How Do Labor Markets Equilibrate? Using Mobile Phone Records to Estimate the Effect of Local Labor Demand Shocks on Internal Migration and Local Wages. Proposal Summary for Application C2-RA4-205.
Conversation with Joshua Blumenstock, March 2014.
Harris, J. & Todaro, M. (1970). Migration, Unemployment and Development: A Two-Sector Analysis. American Economic Review 60 (1).

SECTION 3
WHAT CAN BIG DATA LOOK LIKE FOR THE DEVELOPMENT SECTOR?

Big data shows potential to advance development work in a variety of ways. In the first section above, several examples were provided which highlighted the ways in which big data could be used: as a proxy for conventional official statistics, thereby enhancing institutional awareness of the conditions of a population; to better organize governmental processes, thereby delivering more effective services; or to enhance understanding of the drivers of health epidemics, thereby guiding policy decisions.

Any point in the framework discussed in Section 2 can be used to stimulate the imagination on the horizon of possibilities for big data in development. The case studies presented throughout the text provide concrete applications of the lens of the framework in various settings. Moreover, this section describes several examples of data sets utilized by medium as well as by purpose, and reports information on the first World Bank attempts to leverage big data to address development challenges in Central American countries. By cross-referencing primary media with the primary purpose of the use of big data (awareness, understanding, or forecasting), one can easily see how big data projects can take a variety of configurations depending on the context. Then a summary is presented detailing what institutions and individuals are saying about where big data shows promise for development. Finally, recommendations for next steps in advancing the application of big data for development are provided.

Examples by medium and relevant data set

Mobile | Call Detail Records. Although usage of call detail record (CDR) data for development is still in early phases, applications such as using Digicel's data to track population displacement after the Haiti earthquake and modeling of infectious disease spread show great promise [40]. One study in Afghanistan showed that CDR data could be used to detect impacts from small-scale violence, such as skirmishes and improvised explosive devices, in terms of their impacts on communication methods and patterns of mobility. Another project done by the lead researcher in the Afghanistan study was to capture seasonal and temporary migration, usually overlooked by traditional survey models, permitting a more precise quantification of its prevalence. An ongoing project which builds upon these results aims to measure precisely the extent to which wage disparities in Rwanda, Afghanistan, and Pakistan are arbitrated by migration [22].

Satellite | Remote Sensing Images. Usage of satellite data abounds. For example, the United Nations University engaged in a project using satellite rainfall data combined with qualitative data sources and agent-based modeling to understand how rainfall variability affects migration as well as food and livelihood security in South and Southeast Asia, Sub-Saharan Africa, and Latin America [41]. In Stockholm, GPS-equipped vehicles provided real-time traffic assessments and, when combined with other data sets such as those pertaining to weather, made traffic predictions. Such analyses inform urban planning and also can increase time and cost savings for drivers [42].
and by researchers to predict a film’s success at the box office or a person’s likelihood to get flu shots [3]. Internet | Search Queries. The internet stores a vast amount of information, much of which is Financial | Credit Card Transactions. Credit card unstructured. Search queries present one source of companies have increasingly been using their massive data on the internet. In this vein, Google searches stores of data to enhance their services. In several for “unemployment” were, found, for example, to cases, companies use purchase data to identify correlate with actual unemployment data. Similar unusual behavior in real time and quickly address data was used to notice changes in the Swine Flu potential credit card fraud [3]. In other cases, financial epidemic roughly two weeks before official US institutions have been cited as being able to predict Centers for Disease Control and Prevention data whether someone is dating [45] or even infer the sources reflected it [42]. The Bank of England uses strength of a marriage [46]. search queries related to property, for example, to infer housing price changes [3]. Colombia’s Ministry Big Data for Development in Central of Finance uses the information generated by Google America: World Bank pilot efforts searches to assess short-term GDP trends in Colombia and publish monthly macroeconomic reports which Since 2014 the World Bank has been exploring the discuss the results of the model developed [43]. potential utility of big data to support public policies in Central American countries. The starting point Internet | Text. Text analysis is critical for data was addressing data availability issues, in a context generated via the internet not only for sentiment where traditional data collection methods, such as analysis (e.g. 
favorable/unfavorable views on a policy) household surveys, are undertaken with a relatively but also for lexical analysis to understand elements low frequency in Central America and incur high of culture. One group analyzed the concept of honor costs. Therefore, the goal was to explore the potential in the Middle East, for example, and found how it of alternative data sources, such as those one differed by region and changed over time in response described in the paragraphs above, to fill a data gap. to the events of September 11th. Such analysis With this objective, three different exploratory pilots could inform the appropriate selection of language focusing on different sources of information (internet in, say, diplomacy or educational materials. Further data, social network data, and satellite data) were applications in this regard could include, for example, developed. developing a contextual lexicon on financial literacy in order to tailor microlending by region [44]. By The objective of the first pilot was to assess the combining topic modeling methods--whereby one possibility of using web search keyword data (from explores and understands concepts and topics from Google Trends) for nowcasting price series in Central text--with sentiment analysis, one can gain a richer America. The study, which focused on Costa Rica, El understanding of unstructured text data [24]. Salvador, and Honduras, highlighted the challenges in using Google Trends data. The findings, based Social Media | Tweets. Similar to the example of on a number of indexes constructed to summarize analyzing search queries above, social media data Google Trends data, showed that Google Trends data such as Twitter tweets can be used as an early can improve the ability to forecast certain price series indicator of an unemployment hike or to evaluate (especially in Costa Rica and El Salvador, where the crisis-related stress [32]. Another case utilized tweets web search data was of higher quality). 
to know about a cholera outbreak in Haiti up to two weeks prior to official statistics [42]. Both of these The second pilot, jointly carried out with the United cases demonstrate the ability to reduce reaction Nations initiative working on big data (UN Global time and improve process with which to deal with Pulse), explored the potential of social network various crises. Tweets have been used by hedge fund content to analyze public perception of a policy 27 reform in Central America. The project focused on the gas subsidy reform in El Salvador and consisted of gathering data from Twitter. After geo-referencing on-line content to the country and categorizing information based on the content, the study used text analytics to see if the results from the social media analysis closely followed the public opinion as measured through of a series of household surveys conducted in the El Salvador before and after the reform. By undertaking what can be thought of as a replication study the goal was to establish the validity of the alternative method (social media text analysis) to capture the underlying phenomenon under study. Preliminary results confirmed that Twitter data provides a useful complement to analyze the public perception of a policy reform. The third pilot tried to use satellite data to understand poverty levels in Nicaragua and Guatemala. In particular, the objective of the analysis was to produce a first assessment of the information content of night- time illumination measures and explore correlations with poverty at high levels of geographical disaggregation. The analysis showed that the one-to- one correlation is negative and statistically significant, indicating that night-time illumination data may contain information relevant for analyzing poverty conditions. These pilots are just a starting point. The World Bank launched a Big Data Innovation Challenge in September 2014 to promote big data driven internal projects. 
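The third pilot's core computation is a correlation between night-time luminosity and poverty rates across geographic units. Below is a minimal sketch of that idea with synthetic data; the municipality values and the strength of the inverse relationship are invented for illustration, not drawn from the pilot's actual satellite or survey measures.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic municipality-level data: poverty rate (%) and mean
# night-time luminosity; the inverse relationship is built in
# purely for illustration.
n = 150
poverty_rate = rng.uniform(10.0, 80.0, n)
luminosity = 60.0 - 0.5 * poverty_rate + rng.normal(0.0, 5.0, n)

# Pearson correlation between illumination and poverty.
r = np.corrcoef(luminosity, poverty_rate)[0, 1]

# t-statistic for the null hypothesis of zero correlation.
t = r * np.sqrt((n - 2) / (1.0 - r**2))

print(f"correlation: {r:.2f}, t-statistic: {t:.1f}")
```

In practice the luminosity measure would be averaged from satellite imagery over each administrative unit before being matched to survey-based poverty rates.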
In less than a month, more than 130 project proposals were submitted to the Challenge to keep exploring the potential of big data for development.

Examples by medium and purpose

Mobile | Awareness. A study in Afghanistan has shown that CDR data can be used to detect impacts from "microviolence" like skirmishes and IEDs. Microviolence has clear effects on the ways people communicate and on patterns of mobility and migration, similar to what might be seen after a natural disaster. [22]

Mobile | Understanding. A study in the UK used mobile and census socioeconomic data to examine the connection between the diversity of social networks and socioeconomic opportunity and wellbeing, validating an assumption in network science previously untested at the population level: that greater diversity of ties provides greater access to social and economic opportunities. [47]

Mobile | Forecasting. Research has shown that when mobile operators see airtime top-off amounts shrinking in a certain area, it tends to indicate a loss of income in the resident population. Such information might indicate increased economic distress before that data shows up in official indicators. [36]

Financial | Awareness. Xoom, a company specializing in international money transfers, noticed in 2011 that there were more frequent than usual payments being funded by Discover credit cards originating in New Jersey. All looked legitimate, but it was a pattern where one should not have existed. Further investigation revealed the fraudulent activity of a criminal group. [3]

Financial | Understanding. The Oversea-Chinese Banking Corporation (OCBC) increased understanding of individual customer preferences by analyzing historic customer data, then designed an event-based marketing strategy focused on using a large volume of coordinated and personalized marketing messages. Their precise targeting positively impacted numerous key performance metrics and increased campaign revenues by over 400%. [48]

Financial | Forecasting. Predictive analytics tools like FlexEdge allow traders on US equity markets to engage in advanced forecasting, including overnight and intraday forecasts updated by the minute, resulting in an error reduction of up to 25% over standard forecasting techniques, which typically take a historical window average. [49]

Satellite | Awareness. Following the 2013 typhoon in the Philippines, Tomnod (now DigitalGlobe) took their high-resolution satellite images, divided them into pieces and then shared them publicly to crowdsource identification of features of interest and enable rapid assessment of the situation on the ground: where buildings were damaged, where debris was located, and where roads were impassable. First responders used maps generated through this system, and the Red Cross relied on the data to determine resources. The Philippine government will also analyze the data to better prepare for the future. [50]

Satellite | Understanding. The Open Data for Resilience Initiative fosters the provision and analysis of data from climate scientists, local governments and communities to reduce the impact of natural disasters by empowering decision-makers in 25 primarily developing countries with better information on where and how to build safer schools, how to insure farmers against drought, and how to protect coastal cities against future climate impacts, among other intelligence. [2]

Satellite | Forecasting. AWhere's "Mosquito Abatement Decision Information System (MADIS)" crunches petabytes of satellite data imagery to locate the spectral signature of water primed for breeding mosquitoes and combines it with location intelligence algorithms and models of weather and mosquito biology to identify nascent outbreaks of mosquitoes even before they hatch. [46]

Internet | Awareness. PriceStats uses software to crawl the internet daily and collect prices on products from thousands of online retailers, enabling it to calculate daily inflation statistics which are used by academic partners to conduct economic research and by public institutions to improve public policy decision-making and anticipate commodity shocks on vulnerable populations. [3], [51]

Internet | Understanding. Logawi engaged in a research project using lexical analysis--the use of the internet to create a cultural context for particular words and phrases, enabling deeper understanding of how cultures view particular ideas--to assess how different populations across the Middle East understood the concept of "honor." Based on interviews and analysis of internet data, Logawi developed a lexicon of words and phrases mapped onto the region, showing how definitions and use of "honor" change for different cultures over time. [44]

Internet | Forecasting. Research has shown that trends in the increasing or decreasing volumes of housing-related search queries in Google are a more accurate predictor of house sales in the next quarter than the forecasts of real estate economists. [9]

Social Media | Awareness. Using social media analytics in Syria, SecDev Group was able to identify the locations of ceasefire violations or regime deployments within 15 minutes after they took place, enabling them to rapidly inform UN monitors and ensure a swift response. [52]

Social Media | Understanding. A project by UNICEF used social media monitoring tools to track parents' attitudes towards vaccination in Eastern Europe by identifying patterns in the sentiments of their public posts on blogs and social media. The study increased understanding of how to respond to vaccine hesitancy and educate parents to make informed choices, including engagement strategies and messaging. [53]

Social Media | Forecasting. A collaborative research project between Global Pulse and the SAS Institute analyzing unemployment through the lens of social media in the US and Ireland revealed that increases in the volume of employment-related conversations on public blogs, online forums and news in Ireland which were characterized by the sentiment "confusion" show up three months before official increases in unemployment, while in the US conversations about the loss of housing increased two months after unemployment spikes. [10]

Areas of high potential for big data

A variety of authors and institutions have pointed out what they see as areas of high potential for big data. At a broad level, many authors emphasize the potential of combining datasets to enhance understanding [54], [55]. The OECD points to four broad international research topic areas which would benefit from a variety of data types: population dynamics and societal change; public health risks; economic growth, innovation, research and development activity; and social and environmental vulnerability and resilience [55].

Beyond research, however, there is a need for more specific, practical arenas within which big data shows promise. The United Nations' Global Pulse argues that global development work can be improved using big data in three ways: strengthening early warning systems to shorten crisis response times, enhancing awareness of situations on the ground to better design programs and policies, and enabling real-time feedback to make appropriate and timely adjustments [10]. These categories are examined below. In addition to these areas, several individuals have highlighted the promise that big data shows in terms of strengthening understanding of complex systems dynamics, thereby enabling better policy-making. To the extent that specific challenges are elucidated and data is used in the context of ongoing processes rather than one-time projects, big data will have stronger impacts on international development.
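Several of the forecasting examples above (housing-related queries predicting house sales, online conversations anticipating unemployment figures) reduce to regressing an official series on a lagged activity index. The sketch below illustrates that pattern with synthetic monthly data and ordinary least squares; the one-month lead and the coefficient are invented, not estimates from any of the cited studies.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic monthly data: a search-volume index that leads the
# official series by one month (illustrative, not real data).
months = 60
index = rng.normal(100.0, 10.0, months)
noise = rng.normal(0.0, 1.0, months - 1)
official = 5.0 + 0.3 * index[:-1] + noise  # official[t] tracks index[t-1]

# Ordinary least squares: regress this month's official figure
# on last month's search index.
X = np.column_stack([np.ones(months - 1), index[:-1]])
beta, *_ = np.linalg.lstsq(X, official, rcond=None)

# Nowcast the not-yet-published figure from the latest index value.
nowcast = beta[0] + beta[1] * index[-1]
print(f"lag coefficient: {beta[1]:.2f}, nowcast: {nowcast:.1f}")
```

Real applications add seasonal controls and compare out-of-sample forecast errors with and without the search index, which is how the studies cited above judge whether the index helps.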
Early warning

Various other sources have also emphasized the potential of big data for early warning [54], [30]. Two concrete examples of such work include the forecasting of riots using food price data with relevant proxies [56], and predictive policing, whereby various data sources, including police databases, are combined with mathematical tools to route police patrols in anticipation of future crime [57].

Enhancing awareness and enabling real-time feedback

The potential of big data for enhancing real-time awareness, also known as nowcasting, is also repeatedly discussed [37], [12]. In fact, approximately 1,200 business and IT professionals cited "real-time information" as one of the top three defining characteristics of big data [58]. Several examples testify to the power of using big data to enhance awareness. Two MIT economists, for example, used internet data to collect prices on half a million products and detected price deflation two months prior to the release of the official, and expensive-to-produce, Consumer Price Index [3]. Alternatively, after a recent tsunami in Japan, Honda was able to provide road closure data within a day using GPS data from newer generation cars [45]. One idea discussed among several big data experts and practitioners included creating publicly accessible databases to enable anyone to assess financial markets, thereby protecting consumers and investors [37].

Understanding and interacting with social systems

One interviewee discussed the possibility of studying the growth of urban boundaries, such as favela growth in Brazil, using historical satellite data combined with complex systems modeling. This could lead to understanding city growth patterns and improved city and regional planning [59]. A few authors have also discussed the possibilities of opening big data, as well as relevant analytical capabilities, to level the playing field for labor and/or product supply from a variety of sources [39], [60]. With larger-scale players dominating the retail market, sourcing is often simplified by working with large-scale suppliers, crowding out smaller producers. A third-party organization could, however, utilize big data analytics to ensure replenishment and coordinate supply from a variety of product sources, large or small [39].

Specific challenges and ongoing processes

To move the agenda of big data for development forward, more than general categories or approaches will be needed. A point reiterated during one interview [48] is that what is especially needed in the development big data space is the specification of challenges that lend themselves to the utilization of big data. Put another way, the kinds of insights that need to be generated should be specified, and such a specification process would benefit from the input of practitioners, data scientists, designers, and relevant thought leaders. Once specified, a space can be created for those with the necessary contextual and analytical ability to propose methods to address the carefully defined challenges using big data. The World Bank, in collaboration with other institutions and organizations, may play a crucial role in this regard as a convener of various parties, both to specify challenges and to explore ways to address them [23], [24]. By convening parties, a shared way of speaking and thinking about big data can be created which is general enough to be inclusive of a diversity of approaches yet specific enough to generate meaningful action.

Furthermore, while short-term projects using big data can be helpful to increase awareness or begin to adjust systems to be more effective, the value of big data is perhaps most evident when it is integrated into ongoing processes. Examples of such process-oriented uses of big data range from private sector retailers using big data to minimize inventories to public sector governments streamlining tax collection mechanisms, unemployment services, or city emergency services. Indeed, if big data is to be used to address challenges, it will have to be integrated into ongoing processes. Only in this way can its use be refined over time and the necessary knowledge be generated more effectively to improve those systems of interest.

CASE STUDY: Forecasting and Awareness of Weather Patterns Using Satellite Data

DATA PALETTE: Weather and land conditions via remote sensing imagery

Motivation

Heavy rainfall in the city of Rio de Janeiro often leads to severe landslides and flooding, causing significant public safety issues. When this happens, rescue efforts require coordination among several different emergency agencies. On April 6, 2010 the city had one of its worst storms, with mudslides leaving over 200 people dead and thousands without a home. It was this event, along with the fact that the city was preparing for the 2014 World Cup and the 2016 Summer Olympics, that pushed the city to use data and predictive modeling for emergency response.

Data Generation

As indicated by the data palette above, the data used to predict weather is primarily satellite data gathered from open sources such as the National Oceanic and Atmospheric Administration (NOAA), while sea surface temperatures are collected directly from NASA. For predicting landslides and flooding, data is pulled from the river basin, topographic surveys, the municipality's historical rainfall logs, and radar feeds. Data such as temperatures at different altitudes, wind data, and soil absorption is captured to help develop accurate predictions. The city is also using loop detectors in the street, GPS data, and video data to help plan emergency responses.
As indicated in the palette above, the weather pattern data is not highly structured; however, it is highly spatially and temporally referenced, as it provides frequent and specific geographic information representing a state in time.

Data Interpretation

Data from across 30 different city agencies is housed in the Rio Operations Center, providing a holistic perspective on how the city is operating. The city created the Rio Operations Center to allow for real-time decision making for emergency responsiveness and to improve safety based on consolidated data from various urban systems. The idea behind creating this operations center was to remove silos between the different emergency response services (i.e. police department, firefighters, etc.). Data is analyzed using a variety of algorithms that allow projections of floods and landslides to be made on a half-kilometer basis and heavy rains to be predicted up to 48 hours in advance. The basic model used for predicting heavy rainfall is IBM's Watson Weather Model, which has been configured to the city of Rio based on a comprehensive list of weather-related events that the city provided to IBM.

A combination of real-time and historical data is currently being used for this analysis. Rio has very good historical flooding data available, cataloguing at least 232 recurrent points of flooding. The algorithms are very granular, taking raw data about wind currents, altitude temperatures, humidity levels of the soil, and the geography of the city to create accurate predictions of landslides. This data is then analyzed against the infrastructure of the city to determine the likelihood of floods and landslides. For example, rainfall predictions are compared against the layout of city infrastructure, the number of trees that can help absorb some of the water, and the conditions of the soil to predict greater risk areas for floods.
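In spirit, the half-kilometer, 48-hour projections combine a rainfall forecast with terrain and soil conditions cell by cell. The toy sketch below illustrates gridded risk scoring only in the most general sense; the inputs, weights and thresholds are invented, not those of the calibrated hydrological model Rio uses.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy grid of half-kilometer cells with three illustrative inputs.
shape = (20, 20)
rain_48h_mm = rng.gamma(2.0, 15.0, shape)       # forecast rainfall, next 48 hours
soil_saturation = rng.uniform(0.0, 1.0, shape)  # 0 = dry, 1 = saturated
slope_deg = rng.uniform(0.0, 40.0, shape)       # terrain steepness

# Invented linear risk score: saturated soil and steep slopes
# amplify the effect of forecast rainfall on each cell.
risk = rain_48h_mm * (0.5 + soil_saturation) * (1.0 + slope_deg / 40.0)

# Flag the top 5% of cells for emergency planners.
alert_cells = np.argwhere(risk > np.percentile(risk, 95))
print(f"{len(alert_cells)} cells flagged for alert")
```

A production system replaces the invented score with physically based hydrological equations and validates predictions against the city's catalogue of recurrent flood points.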
The city uses IBM's Intelligent Operations Center, which pulls in information from multiple sources and provides an executive dashboard for city officials to quickly gain insight into what is happening across Rio in real time. City officials are able to see high-resolution animations of two- and three-dimensional visualizations of key weather variables and obtain detailed tables of weather data at various locations.

Insight Implementation

The new alert system notifies city officials and emergency responders in real time via automated email notifications and instant messaging. As a result of its high-resolution weather forecasting and hydrological modeling systems, Rio has improved emergency response time by 30%. An additional benefit of the new alert system is all of the data that it generates, from the receipt of a message to the response taken. Analysis of this data allows city responders to improve their current procedures, resulting in lower response times and greater coordination of activities.

From Idea to Ongoing Process

The Rio Operations Center was the first center in the world to integrate all stages of disaster management, from prediction, mitigation and preparedness to immediate response and feedback capture for future incidents. By having access to each other's data in a non-siloed environment, emergency response services communicated and collaborated more closely, leading to more rapid response times. In addition to being able to predict rain and flash floods, the city is now able to assess the effects of weather on city traffic and power outages by using a unified mathematical model of Rio. Moreover, Rio is now going beyond weather forecasting to integrate big data into other areas of municipal management. For example, data on waste collection via GPS systems installed in trucks is also collected by the Rio Operations Center to create a comprehensive picture of how public services are operating in the city.
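The alert-log analysis mentioned above (the interval from message receipt to the response taken) is a simple computation over timestamped records; the incidents and timestamps below are invented for illustration.

```python
from datetime import datetime
from statistics import mean

# Invented alert log: (notification sent, first response taken).
alert_log = [
    (datetime(2014, 3, 1, 14, 0), datetime(2014, 3, 1, 14, 25)),
    (datetime(2014, 3, 7, 9, 30), datetime(2014, 3, 7, 10, 2)),
    (datetime(2014, 3, 19, 22, 5), datetime(2014, 3, 19, 22, 41)),
]

# Response time in minutes for each incident.
minutes = [(done - sent).total_seconds() / 60.0 for sent, done in alert_log]

print(f"mean response time: {mean(minutes):.1f} minutes")
```

Tracking this distribution over time is what lets responders verify that procedural changes actually lower response times.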
The city of Rio has also made this data publicly available so that its citizens can better manage their lives. The mayor of Rio de Janeiro, Eduardo Paes, stated: "in addition to using all information available for municipal management, we share that data with the population on mobile devices and social networks, so as to empower them with initiatives that can contribute to an improved flow of city operations". Citizens can receive daily data feeds by following the Rio Operations Center updates on Twitter @OperacoesRio and on Facebook at Centro de Operações Rio. These sites also provide recommendations for alternative routes during special events as well as current traffic and weather conditions. Paes has stated that these updates have significantly increased the quality of life in Rio and are helping bring more businesses and people to the city. Rio is using big data as part of its daily operations to transition to being a smarter city and provide a higher quality of life for its citizens. In fact, the mayor has made technology part of his "4 commandments for smarter cities", stating that "a city of the future has to use technology to be present". Rio has continued its partnership with IBM to continuously improve upon its original algorithms so that, as technology advances, the city is also able to stay ahead of the game. As a result of this pilot, Rio is now fully committed to using technology as a way to help govern the city.

References

Hilbert, M. (2013). Big Data for Development: From Information- to Knowledge Societies. Available at SSRN 2205145.
Conversation with Jean-Francois Barsoum from IBM, March 2014.
Conversation with Renato de Gusmao from IBM, March 2014.
Treinish, Lloyd. (2014). Operational Forecasting of Severe Flooding Events in Rio de Janeiro. Retrieved from: http://hepex.irstea.fr/operational-forecasting-of-severe-flooding-events-in-rio-de-janeiro/
IBM. (2011).
City of Rio de Janeiro and IBM Collaborate to Advance Emergency Response System; Access to Real-Time Information Empowers Citizens. Retrieved from: http://www-03.ibm.com/press/us/en/pressrelease/35945.wss
Eduardo Paes TED Talk. (2012). The 4 Commandments of Cities. Retrieved from: http://www.ted.com/talks/eduardo_paes_the_4_commandments_of_cities

SECTION 4
HOW CAN WE WORK WITH BIG DATA?

Technology alone is not sufficient to understand and interpret results from the use of big data. Turning big data into insights which are then acted upon requires an effective combination of both technological and human capabilities.

Technological capabilities

Each phase of the data interpretation process discussed above highlights the technological capabilities necessary to work with big data effectively. First, in terms of accessing data, it is important to have the necessary hardware and software to collect data, depending on whether a dynamic or static method is utilized. If data is dynamically fed from an online source such as Twitter, for example, then the analysis software must allow for such real-time, continuously updated analysis. If, instead, data is being downloaded from some source and then kept for later analysis, it is important to ensure sufficient hardware capacity to store such data.

Given sufficient technological capacity simply to access data from various sources, it is necessary to have software and hardware capabilities to connect and interact with large and diverse datasets. The larger the datasets, the greater the hardware storage and processing power, and the more scalable the software platform, needed to process queries and pull data for analysis. With large-scale analyses, software is often needed to make use of parallel computing, through which computations on massive amounts of data can be processed simultaneously over multiple processors [6]. Relational database software such as Microsoft Access, for example, will scale very poorly in the face of dozens of gigabytes, let alone terabytes, of data. Instead of using supercomputers, software may be used to conduct parallel data processing over multiple computers, including even videogame consoles, more cheaply. Examples of such software include Hadoop clusters and products such as Microsoft's Azure and Amazon's EC2 [42], [6]. Open source tools, such as those used by the city of Chicago's predictive analytics platform, which utilizes big data [31], present financially inexpensive software options for analysis. Altogether, these cheaper parallel processing and analysis options are promising when considering the lack of big data hardware and software capacity in many developing countries [42].

The more diverse the datasets of interest, the more robust the software platform through which to interact with them must be. For example, in the case of structured column/row datasets, interacting datasets may be as simple as identifying unique keys to join tables, as is done using relational database software like Microsoft Access. However, when considering relatively unstructured data, such as satellite infrared data or a collection of hundreds of millions of strings of text, software capabilities are critical in order to analyze such data effectively and connect it to structured datasets for the generation of appropriately formulated insights.

Beyond software and hardware requirements, it is immensely helpful when the data which is utilized has appropriately encoded metadata, i.e. data which describes each dataset. In particular, Global Pulse recommends that metadata describe the "type of information contained in the data", "the observer or reporter", "the channel through which the data was acquired", "whether the data is quantitative or qualitative," and "the spatio-temporal granularity of the data, i.e. the level of geographic disaggregation (province, village, or household) and the interval at which data is collected" [32]. Given such metadata, analysts and decision-makers can more easily identify the provenance of a particular dataset. This is especially helpful when analyzing data mashups, or interactions of multiple datasets. With complete metadata, in other words, the analysis is more transparent regarding assumptions.

Beyond simply being aware of the data sources via metadata, some authors have highlighted the need for analysts and decision-makers to understand more effectively the assumptions of the model and/or combined dataset by, in essence, "playing" with the assumptions. In this regard, several authors have explored the concept of a hypothetical Analytics Cloud Environment through which a user can change assumptions and see their impact on an analysis [6]. In designing such software to be scalable, well-designed Application Programming Interfaces (APIs) must be created to channel data at an optimal level of aggregation, so that users may fluidly interact with a large database [25].

Human capabilities and data intermediaries

A reading of the above technological capabilities also indicates the undoubted necessity for human capabilities to interact meaningfully with big data. The need for human capacity to understand and use big data effectively, particularly in government and the public sector [12], [5], or specifically in developing countries [42], is reiterated by various authors and agencies [10], [33], [36], [61], [40]. At a fundamental level, working with big data requires a shift in mindset from working with "small" data. Some authors emphasize that this implies the ability to analyze vast amounts of information rather than only samples, a willingness to work with messiness, and an appreciation for correlation rather than a strict interest in causation [3]. A senior statistician at Google pointed out that a good data scientist needs to have computer science and math skills as well as a "deep, wide-ranging curiosity, is innovative and is guided by experience as well as data" [62]. Other necessary skills include the ability to clean and organize large data sets, particularly those that are unstructured, and to communicate insights in actionable language [11].

In its report on big data, McKinsey points out, however, that a "significant constraint on realizing value from big data will be a shortage of talent, particularly of people with deep expertise in statistics and machine learning, and the managers and analysts who know how to operate companies by using insights from big data" [16]. Indeed, the human capabilities needed are wide-ranging, from those having to do with technology to those related to people and real-world problems, e.g. hardware setup, data management, software development, mathematical and statistical analysis, real-world model development and assessment, as well as the distillation and communication of actionable insights. Beyond such skills, an intimate knowledge of the real-world situation of interest is critical [40].

Beyond the individual skills and capacities required, effective spaces and environments need to be created in order for multiple viewpoints to advance the analysis collaboratively. At one level, distributed labor coordination mechanisms such as crowdsourcing can be utilized to aggregate thousands or even millions of people's perspectives on a dataset in order to complement and strengthen the big data analysis [50]. At a smaller, yet more complex, level, collaborative environments need to be created through which the diverse perspectives of those working with big data can come together to produce new, more comprehensive insights.

Given the required individual and collective capacities to work with big data, it is no surprise that in a survey of thousands of businesses around the world, six out of ten said that their "organization has more data than it can use effectively" and that the leading obstacle was a "lack of understanding of how to use analytics to improve the business." The majority of these businesses which frequently used data analytics actually used a "centralized enterprise unit" which served as a space in which the right combination of skill sets among individuals could come together to do big data analysis [17]. These enterprise units are examples of the data intermediaries that will undoubtedly be needed in the coming years to make sense of big data [37].

Volunteer Technical Communities

Governmental or non-governmental institutions that wish to transform raw datasets into practical tools often lack the expertise to do so. Yet this should not be a limiting factor. One avenue which is proving valuable for leveraging the skill sets of individuals outside an organization is volunteer technical communities, such as hackathons. In these settings, subject matter experts, technologists, designers, and a variety of other participants come together to define problems, share relevant datasets, and rapidly prototype solutions. Often these events are followed by acceleration phases in which the most promising solutions are further developed. Examples of hackathons that have been used in an international development context include the Domestic Violence Hackathon held in Washington D.C. as well as in countries in Central America. Another example is Random Hacks of Kindness which, between 2009 and 2013, organized hundreds of community-focused hackathons around the world and engaged in a similar process.
These volunteers created, for example, InaSAFE, a web-based tool which combines several relevant datasets and strengthens decision making around natural disaster response [63].

CASE STUDY: Connected Farmer Alliance

DATA PALETTE: Crowdsourced supplier data via mobile phones

Motivation

Vodafone has a deep interest in development in emerging markets, particularly in Africa, where its mobile financial solutions like M-PESA (a mobile money transfer platform) and M-Shwari (a partnership between M-PESA and CBA to provide a savings and loans service) have a strong presence. Vodafone's interest in development and the pursuit of disruptive innovation, tied with a clear potential for commercial businesses to play a role in supporting the agricultural sector (an area of focus for many of Vodafone's African enterprise clients), led it to join with USAID and TechnoServe to form the Connected Farmer Alliance. This Alliance pilots initiatives aiming to create a better ecosystem for mobile services in the agricultural sector, impacting production through the supply chain to enterprise use.

Data Generation

The program focuses on Kenya, Tanzania, and Mozambique, and is divided into three distinct areas of focus: enterprise solutions to source from small farmers, improving mobile financial services, and mobile value-added services. The first area, where much of the testing has already taken place, involves enterprise solutions which enable enterprises to better source from small farmers and allow farmers better access to markets. The data is gathered and distributed through a suite of modules, including a registration module allowing an agent of an enterprise to register farmers who supply a particular produce (or allowing farmers to register themselves as suppliers). The service enables a remote, crowdsourced data-gathering method to identify who and where farmers are and the crops they specialize in producing.
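The registration module described above can be pictured as a simple record store that is referenced by time, place, and person. The field names and sample entries below are invented for illustration; they are not the actual Connected Farmer Alliance schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class FarmerRecord:
    """One crowdsourced supplier entry; fields are illustrative only."""
    farmer_id: str
    name: str
    crop: str
    lat: float            # spatial reference
    lon: float
    registered_by: str    # an enterprise agent, or the farmer herself
    registered_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))  # temporal reference

registry = []

def register(record: FarmerRecord):
    registry.append(record)

register(FarmerRecord("F001", "A. Mwangi", "maize", -1.29, 36.82, "agent-07"))

# Enterprises can then filter available suppliers by crop:
maize_suppliers = [r.farmer_id for r in registry if r.crop == "maize"]
print(maize_suppliers)  # ['F001']
```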
The data gathered through mobile phones in this module is highly structured, referenced both temporally and spatially, and highly person-identifiable, enabling enterprise participants to distinguish specific farmers and their products. The typical enterprise participants are mid-sized national companies who source their produce from small farmers and are seeking more detailed data on, and interaction with, available suppliers.

Building upon the crowdsourced supplier data are a series of additional modules, including two-way communication that enables enterprises to share information with, or survey, farmers. A receipting module, integrated with M-PESA, allows enterprises to send receipts and pay farmers at the point of sale, identifying volume of purchase, time, and price, and increasing transparency. Another module allows enterprises to offer short-term loans through M-PESA, enabling cash advances that are later deducted from payment for produce. Finally, a tracking module enables enterprises to better track collection processes and points to streamline product collection. At the pilot phase, the size of the crowdsourced dataset does not yet approach big data; however, Vodafone is currently preparing to bring this first suite of modules to commercial markets for much broader deployment.

The second area of focus, currently in the conceptual development and partnership-building phase, involves the improvement of mobile financial services. One area of research is the extent to which big datasets of historical mobile financial transactions, generated through other Vodafone products and services, can prove useful in assessing the credit-worthiness of loan applicants. This product area may also work with local partners to incorporate the use of mobile financial data in streamlining insurance pricing and payouts for farmers, by using location data to assist insurers in more rapid analysis of claims.
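One way transaction histories could feed a credit assessment, as the research area above contemplates, is to reduce each history to simple features and score them against thresholds. The features, thresholds, and numbers below are invented placeholders for illustration, not Vodafone's actual methodology.

```python
def credit_features(transactions):
    """Summarize a mobile-money history into simple features.
    `transactions` is a list of amounts; a real system would also use
    timing, counterparties, and repayment records."""
    n = len(transactions)
    total = sum(transactions)
    return {"count": n, "avg_amount": total / n if n else 0.0}

def approve_loan(transactions, min_count=10, min_avg=5.0):
    """Toy rule: enough activity and a sufficient average transaction size.
    Both thresholds are arbitrary placeholders."""
    f = credit_features(transactions)
    return f["count"] >= min_count and f["avg_amount"] >= min_avg

history = [12.0, 8.5, 20.0, 5.0, 9.0, 11.0, 7.5, 14.0, 6.0, 10.0]
print(approve_loan(history))  # True: 10 transactions averaging above 5.0
```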
The third focus area, only at the earliest conceptual stages, is to use the enterprise solutions and mobile financial services created in the first two stages of the product to create an overall supportive environment for mobile value-added services for anyone wanting to take products to market. This area will also include the growth of business development and incubation services to support sustainable mobile business growth in the agriculture sector.

Data Interpretation

Vodafone works with its subsidiary Mezzanine on the development and management of the data collection platform, which is locally hosted in the Kenyan, Tanzanian, and Mozambican markets themselves and protected by high-level security mechanisms. In the pilot phase, data is available only to the enterprise and participating farmers, and for the surveys, enterprises receive only aggregated responses, not individual records. Vodafone is working with enterprise customers on the most convenient way for farmers to submit data whilst ensuring confidentiality for them and for businesses. The details of data privacy will be governed by Vodafone's data privacy policies to ensure ongoing protection.

Within the Connected Farmer Alliance partnership, TechnoServe is charged with analysis and interpretation of how the modules are performing for the enterprises and farmers. However, given the small sample set involved in the pilot of the enterprise modules, insights are currently being gathered through traditional survey methods. Those methods include assessing goals for the participants at the project outset, determining areas of measurement, and collecting input through questionnaires during the process. Additionally, the Connected Farmer Alliance supports enterprise partners in their own data analyses of information and outcomes.

Insights for Action

Although the project is still in its early phases, insights are beginning to emerge around the benefits of the enterprise modules.
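The rule that enterprises receive only aggregated survey responses, never individual records, can be sketched as a single aggregation step. The response data and answer format below are invented for illustration.

```python
from collections import Counter

# Hypothetical raw survey responses, keyed by farmer ID (never shown to enterprises).
raw_responses = {
    "F001": "yes",
    "F002": "no",
    "F003": "yes",
    "F004": "yes",
}

def aggregate(responses):
    """Drop individual identifiers and return only counts per answer,
    mirroring the pilot's rule that enterprises see aggregates, not records."""
    return dict(Counter(responses.values()))

print(aggregate(raw_responses))  # {'yes': 3, 'no': 1}
```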
Cost savings have been shown on the part of farmers who receive M-PESA for loans and payments. By receiving M-PESA, these farmers avoid costly, time-consuming, and risky trips to the enterprise office to collect cash. The receipting module has resulted in cost savings for enterprises due to operational efficiencies and improved process transparency. A key benefit of mobile solutions for farmers is an increase in access to information. Nonetheless, it is difficult to make generic content services meaningful to small farmers whose local realities may vary significantly within a distance of just a few kilometers. The targeted information flow permitted by the two-way information module has been shown to provide information particularly relevant to the stakeholder farmers, as well as to enhance face-to-face interactions among farmers and enterprises.

From Idea to Ongoing Process

Although the Connected Farmer Alliance has a clear social transformation element which inspires the partners and enterprises alike, growth and use of these mobile tools and the long-term sustainability of the piloted approach will fall under Vodafone's commercial initiatives. With the specific intent of going beyond the pilot phase and putting in place publicly accessible tools, Vodafone is currently in the process of scaling up the mobile tools used in the first phase of the Connected Farmer Alliance project for commercial use, as a method to generate large-scale, targeted, and valuable data for small farmers and enterprises alike.

References

Conversation with Laura Crow, Principal Product Manager for M-PESA, March 2014.
Correspondence with Drew Johnson, Program Manager, TechnoServe Tanzania, June 2014.
TechnoServe. (2014). Projects: Connected Farmer Alliance. Retrieved from: http://www.technoserve.org/our-work/projects/connected-farmer-alliance

SECTION 5

WHAT ARE SOME OF THE CHALLENGES AND CONSIDERATIONS WHEN WORKING WITH BIG DATA?
As the data deluge continues to grow and forward-thinking managers and policy makers seek to make use of it, challenges at the levels of expertise, appropriate use, and institutional arrangements come to the forefront. Whereas in the past smaller-scale and less diverse datasets could be analyzed by a limited number of individuals, big data requires a wider skill set. An added dimension to the use of bigger data is its ability to understand and predict ever-smaller segments of the population, sometimes even to the level of the individual. The ability to be aware of, and even forecast, the behaviors of such small segments, including individuals, raises new ethical questions. Finally, whereas data analysis was often restricted to that collected by a single institution, the nature of big data requires new forms of inter-institutional relationships in order to leverage data resources, human talent, and decision-making capacity.

The following sections organize challenges and considerations according to the stages of the Data for Action Framework [Figure 2] discussed in section two. First, a series of considerations is discussed with respect to the various ways in which data is generated, stored, and accessed. Then, challenges around how to effectively manage and analyze the data are enumerated. Along similar lines, practical considerations arise when discussing how insights are actually defined. Cultural and ethical challenges then come to the forefront when considering how to actually implement insights.

Data generation process and structure

Several challenges must be overcome, and considerations kept in mind, regarding the data generating process and the data structure itself.

To begin, the very trigger which encodes behaviors into data can have implications for analysis. If data is passively recorded, then it is less susceptible to the statistical problem of selection bias, i.e. that the data collected is systematically unrepresentative of the behavior of interest. If instead the data is actively selected, then it is more susceptible to such a bias. If an individual is interested in collecting her walking data throughout a week, she may, for example, input data into a spreadsheet on those days when she remembers. This may, however, paint a biased picture of her movement if, say, she only remembers to record data when she walks great distances. If, instead of collecting data actively, she used a wristband which passively collected data, then a more representative picture would be drawn.

Once encoded, it is important to consider the features of the datasets of interest. If they are unstructured, for example, they will require the development of appropriate processing methods, especially when integrated with structured data. Mobile data, like that analyzed in the case study on Understanding Labor Market Shocks using Mobile Phone Data, is received in unstructured repositories of millions of individual files, requiring time-intensive processing, programming, and the use of expensive hardware to obtain indicators of population movement and prepare for interaction with other data [22]. Text analysis is one example of making sense of, say, unstructured Facebook status updates or online Tweets. The method used to structure unstructured data adds yet another point at which decisions are made, making the analysis further susceptible to biases that the researcher may have.

The benefit of beginning with exploratory, visual analysis of social media in order to see patterns before building a formal statistical model has also been noted [24]. Social media often store actively generated data, such as Tweets, and may therefore suffer from selection bias. On a related note, retweets and the sharing of links actually make up the bulk of Twitter traffic, such that a very small minority controls the agenda of what is originally created.
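The contrast between passive and active recording can be made concrete with a small simulation of the walking example: a log kept only on memorable days overstates typical distance, while a passive recorder does not. All distances are invented.

```python
# A week of true daily walking distances in km (invented values).
true_distances = [1.0, 2.0, 8.0, 1.5, 9.0, 2.5, 1.0]

# Passive recording (the wristband) captures every day.
passive = true_distances

# Active recording: she only remembers to log the long walks (> 5 km).
active = [d for d in true_distances if d > 5]

avg = lambda xs: sum(xs) / len(xs)
print(round(avg(passive), 2))  # 3.57 -> representative picture
print(round(avg(active), 2))   # 8.5  -> selection-biased upward
```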
This presents challenges of ensuring that analyses which use social media place results in the right population context [45]. Other information found on the internet, such as webpages, blogs, and videos, may share similar problems to those noted above regarding identifying the population actually being represented, as well as effectively interacting with unstructured data. Point-of-sale or internet sales data are often high-frequency and high-volume datasets which require effective structures to constantly process massive stores of data, as well as thoughtfully constructed models which appropriately consider the relevant, short-run nature of the decisions being made by humans [39].

Whether data sets have methods to identify temporal, geographical, or individual characteristics, such as time/date stamps, latitude/longitude information, or unique personal IDs, respectively, will determine to what extent data mashups are possible. However, a challenge that must be addressed in combining such data sets is to ensure proper aggregation. For example, if one data set is collected every minute but another is collected every day, then, to ensure comparability, the analyst must carefully consider how to aggregate the minute-based data so that it can be effectively joined with the daily data.

The media through which behaviors are encoded into data each present their own series of challenges. Some of these challenges are inherent to the medium, while others are due to association with certain features of the data structure or generating process. Mobile phone data, by their very nature, are highly sensitive and should be treated carefully.
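The minute-to-daily aggregation needed before joining two differently sampled datasets can be sketched as follows; timestamps and values are invented for illustration.

```python
from collections import defaultdict

# Minute-level readings: (ISO timestamp, value) pairs -- invented example data.
minute_data = [
    ("2014-06-01T09:00", 4.0),
    ("2014-06-01T09:01", 6.0),
    ("2014-06-02T10:30", 3.0),
]
# Daily dataset to join against.
daily_data = {"2014-06-01": 100.0, "2014-06-02": 120.0}

# Step 1: aggregate minutes up to daily means so the keys become comparable.
buckets = defaultdict(list)
for ts, value in minute_data:
    buckets[ts[:10]].append(value)          # the "YYYY-MM-DD" prefix is the day
daily_means = {day: sum(v) / len(v) for day, v in buckets.items()}

# Step 2: join on the shared daily key.
joined = {day: (daily_means[day], daily_data[day])
          for day in daily_means if day in daily_data}
print(joined)  # {'2014-06-01': (5.0, 100.0), '2014-06-02': (3.0, 120.0)}
```

Whether a mean, sum, or end-of-day value is the right aggregate depends on what the minute-level variable measures; that choice is itself an analytical decision.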
Although mobile data is highly disaggregated and can be very rich, it has been observed that its analysis should be validated through corroboration with other sources such as household surveys or satellite data [40]. Satellite data can become very large very fast, since it is primarily high-resolution image data. It can also be highly unstructured, particularly when it comes to visual pattern analysis, and may benefit especially from human review. Social media data is often in the form of unstructured text which requires specific analytical capabilities to codify and identify useful patterns. One researcher identified techniques such as topic modeling and named entity recognition to be useful in this regard. In the case of the Global Pulse program tracking food price inflation in Indonesia through Twitter, a researcher trained a sentiment classification algorithm by manually classifying Tweets according to various sentiments, allowing the algorithm to, in turn, classify other Tweets automatically [64].

Data interpretation process

To access, prepare, and analyze data sets effectively presents a series of institutional and technical challenges. Beginning with access, several institutional challenges must be overcome just to enable the ready sharing of data sets. Once data is accessible, several technical and data management challenges must be overcome.

Access

One of the first challenges which must be overcome in order to conduct big data analyses is for data to be more openly shared [65]. This is particularly important for those development-oriented institutions which do not, themselves, generate the data of interest. Indeed, one data scientist/artist pointed out how one of the biggest challenges he faces is simply trying to access data from institutions, including government agencies, which hold vested interests and/or a commodity-ownership perspective over the data they store [25].

An entire ecosystem is needed to open and use big data effectively [16], [36]. Common data standards and sharing incentives constitute two aspects of such an ecosystem. Leading international agencies will have to address the challenge of collaborating to define and agree on efficient and well-coordinated sharing mechanisms [55]. Standards for data integration, such as APIs, are needed, as are standards to coordinate data generation. Examples of mechanisms to develop both kinds of standards include IEEE, IEC for the smart grid, or the Global Earthquake Model for disaster preparedness.

As the challenge of shared standards is overcome, the incentive to share the data must be strengthened. To this end, business models need to be developed to ensure that private sector actors are willing to share data [36]. Also, governments need to design policy to help capture the value of big data and enable sharing across agencies [5]. A particular challenge in this regard is the definition of intellectual property rights which retain data ownership yet allow researchers and/or decision makers to use the data for perhaps originally unintended purposes [26]. In addition, governments may have to consider the design of privacy policies for personal and proprietary data, safeguards against information misuse, and regulations on financial transactions and online purchases [12].

As the challenge of opening data has partly been addressed in the last decade, several nascent phenomena have emerged. Beyond enabling analysis, opening data and making it available to the public motivates citizens to engage personally with the data and, in some cases, correct information and improve the accuracy of government databases [5]. Furthermore, opening data can serve as a catalyst for engaging large groups of citizens to apply their technical capacities and desire for social betterment to design novel ways to leverage data for the public good (e.g. the National Day of Civic Hacking or the International Space Apps Challenge). Opening data has also contributed to the emergence of an open science paradigm among academics concerned with enabling research accessibility by the public, permitting wider use of research findings, facilitating research collaborations, and/or simply sharing datasets with other researchers [66].

An additional challenge, however, that has emerged is the fact that as data is opened and its value is recognized by private players, they are less willing to share their data. One researcher pointed to the fact that accessing data from companies 10 years ago was much easier, for example, than it is today, partially for this same reason [26]. The researchers in the case study Understanding Labor Market Shocks using Mobile Phone Data indicated that the ability to obtain mobile data would be virtually impossible for private sector agencies and extremely challenging even for governments and multilaterals, often relying on personal relationships. Much of the research in that area, therefore, is being done by academia [22].

Whether data is accessed through public or private institutions presents different challenges. Public institutions can often release data for free; however, administrative hoops, which the above discussion emphasizes, can present great barriers to access. Moreover, as the researchers in the Billion Prices Project discovered, public agencies are often so accustomed to using traditional data sources that they face an additional cultural hurdle to engaging with big data, one that may take time and use to overcome before its value is internalized [29]. Private institutions, on the other hand, have massive stores of data, the value of which is being increasingly recognized. In this case, security and intellectual property concerns may exist. The Billion Prices Project methodology of scraping price data from retailers addressed the privacy concerns of enterprises by building in lag time between data collection and data sharing, and by sharing data at aggregated levels [29].

On the other hand, directly sourcing data from crowds of people, often via an online aggregation mechanism such as one of the many social media tools, presents the unique challenge of ensuring wide and high-quality participation. If participation is meager, then the data collected will not only be insufficient from a quantitative standpoint, but the perspective may not be reflective of a broader set of the population. The Global Pulse project profiled as a case study in this report—Tracking Food Price Inflation using Twitter Data—chose to focus on a part of the world where many people Tweet, such that Twitter represents a broader segment of the population, for this very reason [64].

CASE STUDY: Tracking Food Price Inflation using Twitter Data

DATA PALETTES: Sentiments posted via Twitter; official price statistics via public agency surveys

Motivation

The Global Pulse lab set out to investigate the possibility of utilizing social media data to give an indication of social and/or economic conditions. In particular, they investigated the relationship between food and fuel price Twitter posts and the corresponding changes in official price index measures. By conducting the research in the Indonesian context, the research benefited from a large user base: the city of Jakarta has the largest Twitter presence in the world, with 20 million user accounts.

Data Generation

The Twitter data used was generated between March 2011 and April 2013 and formed a largely unstructured dataset of over 100,000 Tweets that were highly temporally referenced, spatially referenced by region, and identifiable by Twitter account, and at times by person. This data was complemented by structured public datasets regarding food and fuel prices, including official price indices from the Indonesian State Ministry of National Development Planning (BAPPENAS) and the World Food Program (WFP).
In particular, CPI data for general foodstuffs came from the Indonesian Office of Statistics (BPS), and data on milk and rice prices from the WFP; both datasets are typically generated through questionnaires and surveys. As Indonesia was also experiencing soybean shortages during the period of study, leading to the import of soy from the U.S., soybean inflation data for the U.S. was also collected from the World Bank.

Data Interpretation

Data from the Twitter "firehose", which streams all tweets live, was captured from March 2011 to April 2013. Full access to the Twitter firehose is difficult to obtain; however, Global Pulse was able to secure it through their use of the Crimson Hexagon software, which collected and stored tweets in a database that could then be analyzed. Other services that could provide similar firehose access include DataSift and Gnip. The Crimson Hexagon ForSight software includes a classification algorithm which can analyze strings of text and sort them into categories of interest. For this study, data was categorized through an initial filter of content, based on keyword searches, as being related to food price increases or fuel price increases. Specific words in the Bahasa Indonesia language were utilized as keywords, which the algorithm used to filter those tweets which dealt with the aforementioned price increases. Then, a researcher manually classified randomly selected tweets based on sentiment as "positive", "negative", "confused/wondering", or "realised price high/high-no emotion". This manual selection by the researcher essentially "trains" the algorithm, which can, in turn, classify the remaining tweets automatically. By the end of the process, over 100,000 tweets were collected and categorized, forming a dataset which could be analyzed using simple statistical regression techniques.
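The filter-then-train workflow described above can be illustrated with a toy pipeline: a keyword filter followed by a classifier that generalizes from manually labeled examples. The keywords, labels, and tweets are invented English stand-ins, not the study's Bahasa Indonesia terms or its actual algorithm.

```python
# Toy version of the study's two-step pipeline: keyword filtering, then a
# classifier "trained" on manually labeled examples. All strings are invented.
PRICE_KEYWORDS = {"price", "expensive", "cost"}   # stand-ins for Bahasa terms

def relevant(tweet):
    """Step 1: keep only tweets mentioning price-related keywords."""
    return any(word in tweet.lower().split() for word in PRICE_KEYWORDS)

# Step 2: manually labeled tweets serve as the training examples.
labeled = [
    ("rice price so expensive now", "negative"),
    ("glad the fuel price dropped", "positive"),
]

def classify(tweet):
    """Label a tweet by word overlap with the labeled examples
    (a crude stand-in for the real sentiment algorithm)."""
    words = set(tweet.lower().split())
    best = max(labeled, key=lambda ex: len(words & set(ex[0].split())))
    return best[1]

stream = ["nice weather today", "milk price is expensive again"]
hits = [t for t in stream if relevant(t)]
print(hits)                 # ['milk price is expensive again']
print(classify(hits[0]))    # 'negative'
```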
Using such techniques, correlation coefficients could be estimated to analyze the relationship between Twitter conversations and official food price inflation data, among other questions.

Insight Implementation

Some of the final conclusions of the report indicated a relationship between official food inflation statistics and the number of tweets about food price increases. In addition, another relationship was discerned between the topics of food and fuel prices within the Twitter data. Altogether, this initial effort indicated the potential of social media data for analyzing public sentiment as well as objective economic conditions. That said, the research demonstrated that, while there was certainly a relationship between the Twitter data and official statistics, there was also an abundance of false positives, i.e. large changes in Twitter data with no corresponding change in actual inflation measures. More research is certainly needed to improve the classification process, as well as the process of geolocation (using arbitrary strings in social media profiles to arrive at exact geographic coordinates), in order to more fully take advantage of the heterogeneity of social media data and associate sentiment with particular regions of a country. Finally, higher granularity of official statistics is needed in order to compare them more effectively with the correspondingly spatially and temporally specific Twitter data.

From Idea to Ongoing Process

The research has indicated that semi-automatic, retrospective analysis is possible for social media data. To the extent that classification algorithms are strengthened, and more fine-grained economic data with which to train algorithms are made available, the potential to implement ongoing real-time analysis of such data appears to be within close reach.

References

UN Global Pulse. (2014). Mining Indonesian Tweets to Understand Food Price Crises. Retrieved from: http://www.unglobalpulse.org/sites/default/files/Global-Pulse-Mining-Indonesian-Tweets-Food-Price-Crises%20copy.pdf
Correspondence with Alex Rutherford, Data Scientist, UN Global Pulse, April 2014.

Given that individuals who participate in crowd-sourced endeavors vary in their skill, interest, and motivation level, it is possible that mechanisms will need to be put in place to reward desired behavior, as well as to develop effective quality control processes through which participants check each other's inputs.

Preparation

Once data has been accessed, analysts sometimes consider filtering the quantity and types of data needed. These filters, however, need to be carefully considered; otherwise they may preclude useful information [65]. Regarding data filtration, many big data practitioners estimate that 80% of the effort when working with data is simply cleaning it so that it can be readily analyzed [34]. A critical step in Global Pulse's project tracking food price inflation in Indonesia was the filtration of the data from the Twitter firehose; for example, the researchers had to filter out Tweets in English and Indonesia's numerous local dialects to isolate Tweets in the predominant Bahasa Indonesia language [64].

Once the data is cleaned, scientists must deal with the challenge of how to manage large datasets [65]. In managing these, subjectivity can be introduced when attempting to lend structure to unstructured datasets. With multiple sources, ensuring accurate metadata is critical [65]. When well documented, metadata gives an indication of the provenance of a dataset, thereby increasing transparency and improving understanding of results [37], [54]. With regard to combining multiple data sources, one interviewee warned against using too many all at once. Instead, it is helpful to begin with a couple of relevant data sets and then, as capabilities develop and questions are refined, other datasets can be added [25].

A frequently cited challenge when managing data fed from various companies and/or agencies is ensuring individual privacy and security [36], [65], [12], [32]. For example, although individual-level health or financial data can be used to assist with specifying appropriate medical treatment or a financial product, consumers may be concerned with sharing such information [5]. Adding to the complexity of this challenge is the fact that each country has different regulations around data privacy [18]. One way to address such a challenge is the development of an internationally recognized code of conduct regarding the use of personal data, which might include best practices regarding data acquisition, sharing, and anonymization² [54].

Analysis

Given a well-prepared and structured dataset, a series of considerations must be kept in mind. First, large datasets do not always preclude the use of statistical methodology to account for the degree to which a finding is representative of a population. A data scientist will be aware of the various ways in which, for example, posts by Twitter users are not representative of the world's thoughts and opinions [42]. In particular, the selection bias which occurs when inferring real-world behavior using digital sources of information must be kept in mind. In other words, people who use digital sources of information, such as blog posts or online shopping, may be systematically unrepresentative of the larger population under consideration. A related concern pertains to the fact that analyzing what has been said about behaviors is different from analyzing behaviors themselves [10].

Also important is that modeling take into account the data generating process. For example, one interviewee pointed out how small price changes derived from retail scanner data may be misleading, due to the fact that such prices are actually imputed from data which often comes in the form of weekly total value and quantity sold.
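The interviewee's point about imputed prices can be made concrete: a unit price derived from weekly totals is a quantity-weighted average that hides a mid-week price change. All numbers below are invented.

```python
# A week where the shelf price changed mid-week: 4 days at 2.00, then 3 days
# at 2.50, with invented daily quantities sold.
daily = [(2.00, 10), (2.00, 10), (2.00, 10), (2.00, 10),
         (2.50, 10), (2.50, 10), (2.50, 10)]

# Scanner data often arrives only as weekly totals:
total_value = sum(p * q for p, q in daily)      # 155.0
total_quantity = sum(q for _, q in daily)       # 70

# The "price" an analyst imputes is a quantity-weighted weekly average...
imputed_price = total_value / total_quantity
print(round(imputed_price, 4))  # 2.2143

# ...which matches neither actual shelf price and hides the mid-week change.
print(sorted({p for p, _ in daily}))  # [2.0, 2.5]
```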
Using such imputed prices may fail to take into account systematic relationships underlying the data due to, say, mid-week price changes, the use of coupons by certain individuals, etc. [29].

Another consideration when working with big data is the tendency toward apophenia, or seeing patterns where there are in fact none [13]. One interviewee for this report pointed out how, by its very nature, big data can give rise to highly significant statistical correlations. The tendency to assign meaning to these statistically significant relationships must, however, be tempered by rigorous model development [39]. Although the adage that correlation is different from causation is critical to understand, and every big data scientist keeps it in mind, the value of correlations remains. For example, correlations may provoke interpretive stories, stimulate sales (such as Amazon's "suggest" feature), or assist with forecasts in the very short run. Computers excel at estimating such numerical correlations; however, they are less capable of assessing visual patterns. Data visualization therefore leverages the human capability to discover such patterns, thereby stimulating further scrutiny of the data [37]. As indicated by the preceding ideas, a conducive environment must be created within which data can be managed, visualized, and analyzed [65].

Given an analysis which integrates discovery and creativity through direct observation of the data, human intuition, and strong model development, a communicative challenge remains when working with big data. Documenting and communicating relevant data sources is one part of it; however, it pales in comparison with the challenge of communicating the methodology whereby insights were generated. This is a challenge which has to be overcome in order to create an environment through which others, such as policy makers or other collaborators, can make sense of the results and understand how to use them. The idea of an "Analytics Cloud Environment" discussed above indirectly addresses this by giving users the opportunity to explore how various model assumptions affect results. As potential task complexity grows, institutions will have to face the challenge of considering the cost of analysis using big data, as well as integrating feedback into the use of big data in order to adjust its use [65].

Insights and their implementation

Addressing challenges with respect to the data itself (its generation, features, and interpretation) is insufficient, as elucidated in section two, to interact meaningfully with the data. The interpretation must effectively give rise to insights which can then be acted upon to effect the desired change in the world. At one level, the very kinds of insights of interest can present their own challenges or considerations. At another, the process through which insights are translated into action must be strengthened by overcoming several intertwined challenges. Each of these aspects is considered, in turn, below.

Insight scope and purpose

As discussed in section two, one way to think about the kinds of insights to generate is in terms of scope. One big data effort may be used to better understand the situation of a microregion, while another may cover an entire continent, or even the whole world. Clearly, the larger the scope, the greater the potential data required. More subtle, however, is the fact that the broader the scope, the greater the potential diversity of datasets, especially when no singly-managed dataset covers the entire scope of interest. Analysts will need to ensure that datasets can interact with each other to form an accurate representation of the whole. This requires overcoming data management challenges, as well as making careful decisions about aggregation to ensure helpful comparisons among the various datasets.

Alternatively, insights of interest may be considered in terms of their primary purpose. Is the use of big data meant to generate insights which heighten awareness, deepen understanding, and/or enhance forecasting? If the primary purpose is to enhance awareness, then it is likely that capabilities around visualization will be especially important, since datasets will have to be shown in an accessible way to enable shared human understanding. If, instead, the primary purpose is to understand or forecast, capabilities around data modeling will be of primary importance. This is especially the case for endeavors which seek to understand systems or processes. By utilizing sophisticated inductive techniques such as machine learning, for example, forecasting may be improved through the use of additional variables and functional forms, while not necessarily enhancing understanding of the causal web underlying a system's functioning. Understanding a system requires more than showing statistical significance; it requires the development of sensible, internally consistent models of behavior.

Insight implementation process

Given a sufficient degree of clarity on the kinds of insights of interest, some thought should be given to the ways in which the analysis of data will be translated into actions which will change the original behaviors of interest. As discussed in section two, such an implementation process requires defining next steps, creating structures to take them, and, finally, actually taking those steps. Below are discussed some considerations and challenges which are connected with all three aspects.

Project vs. process

Whether big data is used for a project or a process will impact the degree to which a landscape analysis should be conducted to understand what data sources were used in the past to accomplish similar objectives. In addition, if an institution seeks ongoingly to utilize big data to assist in managerial tasks, then such a process-orientation will need to

Beyond a culture which stifles data openness, several beliefs or perspectives may inhibit the use of big data by institutions. One study showed that public administrators often hold three major viewpoints about big data. Some view big data as presenting an opportunity to improve services and solve practical tasks. Others view big data as a fad which will not necessarily improve services; instead, these think that big data may actually strengthen governmental control of the masses. Finally, others believe that, while big data may offer opportunities for the government to reach more citizens and tailor services, the use of big data will not necessarily constitute an actual improvement; in fact, those individuals who are not electronically connected may be marginalized [21]. Particularly in contrast to the last two views, those in the public sector seeking to realize the value of big data will have to face the challenge of creating a culture which seeks to improve systems and processes based on data³ [5].

Even if a single institution is convinced of the benefits of using big data, a culture of cooperation and collaboration among multiple institutions is critical [28]. At one level, cooperation is critical for the establishment of standards by which various agencies or individuals may access and use data.

2. The question of anonymization itself must be examined carefully. Even when a dataset with personal identifiers (e.g. name or social security number) is randomized, such assignments to individuals may be imputed. One way to deal with this may be simply to aggregate the data sufficiently to enable analysis while precluding privacy or security concerns [26].
Beyond this, place emphasis on technical capacities to analyze collaboration would help elucidate the possible uses as well as institutional capacities to act upon and of big data and help nurture and channel energies ongoingly inform such analysis. to utilize it. One example of this is the Global Partnership on Development Data recommended by Culture a panel in the United Nations concerned with poverty. This Partnership would bring “together diverse Challenges at the level of culture can have a but interested stakeholders” who would, as a first substantial bearing on the fruitful execution of a step, “develop a global strategy to fill critical gaps, project endeavor which utilizes big data. At one expand data accessibility, and galvanize international level, institutions may avoid releasing data due to efforts to ensure a baseline for post-2015 [poverty] paternalistic beliefs that others cannot make use of targets” [21]. Whether in public or non-profit sectors, the data or due to concerns that the data will be used cooperation among managers and elected politicians against them [33]. Overcoming cultural challenges is essential to use big data to inform decision such as these will be particularly critical when opening making [12]. One way to begin collaborations with data for citizen engagement [15]. an institution as a whole is to find those individuals within it who are willing to share data and collaborate 3. One author has noted that, where such a data-driven culture exists, the tendency of conceiving of society as simply a set of sub-populations rather than as a single social body will have to be avoided. Although this appears to be primarily a philosophical concern, it may well have real implications on defining the public good and defining the means to achieve it, including the role of government [13]. 50 on generating insights for policy or management know about them and the way that data is utilized decisions [26]. [37]. 
Beyond awareness, people need to be educated on the value of their data and how to control its Use and abuse dissemination through means as simple as, for example, privacy settings on social media [45]. At Another set of challenges that pertain to big data an even higher level, substantive ethical questions relate to people’s interactions with the results need to be discussed in the public sphere regarding or prescriptions from big data analyses. In some the use of predictive, albeit somewhat imperfect, cases, a challenge that must be overcome is the information by companies or other institutions possibility that many organizations, companies, and seeking to advance their own interests [39]. institutions would rather avoid the use of big data since it may reveal truths which they would prefer Due to the somewhat unstoppable nature of the remain hidden. Such non-truth-seeking behavior use of big data, dealing with such challenges at could include pharmaceutical companies avoiding an institutional level will likely have to be handled the use of big data to assist with understanding the through management, rather than preventing effects of a drug once introduced in the market. adoption [39] [37], such as by increasing transparency Alternatively, healthcare companies may avoid health on how data is used [54]. Institutions may have to personalization via big data since it will encourage move beyond hierarchical, centralized, and rule-driven patients to use more preventive medicine thus structures in order to deal with the variety of uses and reducing their income from doctor visits [37]. abuses of big data [37]. As the potential for big data is increasingly Another, more subtle challenge arises when the use of understood, the potential for it to be used for big data is integrated into systems and outputs. When purposes contrary to the public good and/or ethical big data is systematically incorporated to generate principles is just beginning to be explored. 
Such uses some kind of output, such as predictions on crime hot of big data often leverage the seeming predictability spots or product preferences, the very people whose of human behavior. Some cases have already begun behavior or preferences are being predicted may tentatively to demonstrate the ability to use big data, consciously interact with the system in a manipulative for example, to infer when an individual is dating way. One example of this is “google bombing,” or [45], the strength of a marriage [46], or the effect cases in which people craft webpage features to trick of the size of a retail markdown sign on sales [39]. the Google search algorithm in order to raise their Such predictive information may be used for various page-rankings. Another dangerous possibility is for purposes, not all of which are malicious. However, criminals to outsmart police agencies relying on big the fact that big data permits such predictions raises data. In particular, criminals could use the same data concerns about ensuring that it is used in a way which and algorithms to infer the police’s expectations harmonizes with the public good. thereby enabling them to behave in unexpected ways [37]. These are examples of a broader class of At one level, such powerfully predictive deductions problems arising from the study of social systems as from big data raise the challenge of ensuring that compared to less self-aware systems. The subject people are aware of what companies or institutions may, in fact, become the object of its own study. 51 CASE STUDY DATA PALETTES Using Google Google Trend Data Trends to nowcast economic activity in Colombia Motivation GDP Official Data For the well-timed design of economic policy, it is desirable to count with reliable statistics that allow constant monitoring of economic activity by sectors, preferably in real time. 
However, it is a well-recognized problem that policymakers must make decisions before all data about the current economic environment are available. In Colombia, the publication of the leading economic indicators that the Administrative Department for National Statistics (DANE, for its acronym in Spanish) uses to analyze economic activity at the sectorial level has an average lag of 10 weeks. In this context, the Ministry of Finance of Colombia looked for coincident indicators that allow tracking the short-term trends of economic activity. In particular, the approach was to exploit the information provided by the Google search statistics tool known as "Google Trends".

Data Generation

The data for this study comes from Google web searches. Based on the web searches performed by Google users, Google Trends (GT) provides daily information on the query volume for a search term in a given geographic region (for Colombia, GT data are available at the departmental level and also for the largest municipalities). Each individual Google Trends series is a relative, not an absolute, measure of search volume. That is, the period in which the search interest in a keyword is highest within the dates of inquiry receives a value of 100, and all other periods for that series are measured relative to this highest period. To assess the performance of the indexes built using GT data, official economic activity data (both at the aggregate and at the sectorial level) from DANE are used. Both GT and DANE data are publicly available.

Data Interpretation

In order to exploit the information provided by GT data, it is critical to choose a set of keywords that can be used as a proxy for consumer behavior or beliefs. In some sense, GT data takes the place of traditional consumer-sentiment surveys.
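The relative scaling described above can be illustrated with a minimal sketch. The raw query counts here are invented for illustration; Google Trends itself never exposes absolute volumes, only the already-scaled index:

```python
# Sketch of a Google-Trends-style relative index: the peak period becomes 100
# and every other period is expressed relative to that peak.
def relative_index(raw_counts):
    """Scale a series of (hypothetical) raw counts so its maximum equals 100."""
    peak = max(raw_counts)
    return [round(100 * c / peak, 1) for c in raw_counts]

raw = [120, 180, 240, 150]      # hypothetical weekly query counts
print(relative_index(raw))      # -> [50.0, 75.0, 100.0, 62.5]; week 3 is the peak
```

Note that because each series is rescaled to its own peak, two GT series can only be compared in shape (trends and turning points), not in absolute search volume, which is why the case study uses them as indicators rather than direct measures.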
For example, the use of data for a certain keyword (such as the brand of a certain product) might be justified when a drop or a surge in web searches for that keyword can be linked to a fall or an increase in its demand and, therefore, to lower or higher production in the specific sector producing that product. The analysis carried out by the Ministry of Finance of Colombia identifies keywords meaningfully related to the different economic sectors and leverages GT data for these keywords to produce leading indicators of economic activity at the sectorial level (ISAAC, for its acronym in Spanish). It is important to highlight that this approach is used only for some of the key sectors of the economy (such as agriculture, industry, commerce, construction, and transport). The performance of other sectors (such as mining, financial services, or personal services) cannot be assessed using web searches, and other leading indicators need to be used. Once the sectorial ISAACs are produced, the information is used to produce an aggregate leading indicator for economic activity in the country (ISAAC+).

Insight Implementation

The research carried out by the Ministry of Finance of Colombia showed the potential of web search information to explain variations in economic activity for some specific sectors in Colombia (in particular, agriculture, industry, commerce, construction, and transport). GT queries allow the construction of leading indicators which determine in real time the short-term trend of the different economic sectors, as well as their turning points. In addition, the leading indicator produced for aggregate economic activity (ISAAC+) shows a high correlation with its reference variable (the annual variation of quarterly GDP), capturing its turning points and short-term behavior.
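The kind of validation described here, checking how closely a leading indicator tracks its reference variable, can be sketched as a plain Pearson correlation. This is only an illustration of the comparison, not the Ministry's published ISAAC methodology, and both series below are invented numbers:

```python
# Sketch: Pearson correlation between a hypothetical leading indicator and a
# hypothetical reference series (e.g. quarterly GDP annual variation, in %).
def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

indicator  = [1.2, 2.0, 1.5, 2.8, 3.1]   # hypothetical ISAAC-style index
gdp_growth = [1.0, 2.2, 1.4, 2.5, 3.3]   # hypothetical GDP annual variation, %
print(round(pearson(indicator, gdp_growth), 2))   # -> 0.97
```

A value near 1 indicates that the indicator moves with the reference series and shares its turning points; it says nothing, of course, about causation, which is why such indicators are used for nowcasting rather than structural analysis.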
The production of these leading indicators reduces the lag associated with the publication of traditional statistics and helps policy makers in the country to make timely decisions. The main limitation of this work is that the level of Internet penetration in Colombia is still relatively low (about 50%), which implies that GT data reflects information from just a part of the country's consumers: those who have access to the Internet and use Google for their web searches. As Internet penetration deepens in the future, the representativeness of GT data will improve and make the ISAAC indicators even more relevant.

From Idea to Ongoing Process

The research project, led by Luis Fernando Mejía, General Director of the Macroeconomic Policy Department in the Ministry of Finance, raised interest inside and outside Colombia. The ISAAC indicators produced with GT data are currently published on a monthly basis in reports which show the historical correlation between the ISAAC and GDP data at the sectorial level and highlight sectorial trends projected by the ISAAC indicators. Other countries are looking at this interesting project and might start producing similar big-data-driven forecasts in the future.

References

L. F. Mejía, D. Monsalve, Y. Parra, S. Pulido and Á. M. Reyes, "Indicadores ISAAC: Siguiendo la actividad sectorial a partir de Google Trends," Ministerio de Hacienda y Crédito Público, Bogotá, 2013. Available: http://www.minhacienda.gov.co/portal/page/portal/HomeMinhacienda/politicafiscal/reportesmacroeconomicos/NotasFiscales/22 Siguiendo la actividad sectorial a partir de Google Trends.pdf.
[Accessed 28 August 2014].

Correspondence with Luis Fernando Mejía, General Director of Macroeconomic Policy, Ministry of Finance and Public Credit, Republic of Colombia.

SECTION 6
RETROSPECT AND PROSPECT

With the advent of computing, humanity has entered a new era of an unprecedented and exponentially rising capacity to generate, store, process, and interact with data. Seeking ways to maximize efficiency and increase product offerings by leveraging such data, the private sector has gained substantial experience over the last few decades. Such experience undoubtedly built upon scientific capabilities to analyze data and practical business acumen in order to ensure effective, real-world applications. In the public sphere, leveraging various, often very large, data sources to effect real improvements has only just begun in the last decade. The cases in this document testify to the promise of such data to enhance perception, deepen understanding, and hone forecasting abilities.

Although experience with big data is relatively nascent, several conclusions can already be drawn. For data to be effective, it must be seen in the context of an ongoing process to better understand and interact with the world. In this light, use of big data should begin with a question and a description of the behaviors of interest. Use of big data from various sources requires effective and scalable methods to access, manage, and interpret data. Data must be well documented to ensure traceability. Models through which data is interpreted must be carefully selected to correspond to the data generating process. Collaboration among practitioners, social scientists, and data scientists will be critical in order to ensure that the requisite understanding of real-world conditions, data generation mechanisms, and methods of interpretation are effectively combined.

Such collaboration will also enable the overcoming of major technological, scientific, and institutional challenges. Technical challenges include managing unstructured data, linking diverse data sets, and scaling systems to respond to increasing volumes of data. Scientific challenges include the rising demand for data scientists, developing appropriate models to work with large and diverse data sets, dealing with selection bias in various forms, and communicating analytical results in order to yield actionable insights. Institutional challenges include limited access to data, cultures that don't promote learning, ethical concerns about the utilization of personal data, and the lack of standards with regard to data storage.

As the technological capacities to generate, store, and process data continue unabated, humanity will need to develop a corresponding measure of various technical, social, cultural, and institutional capabilities to ensure that big data is used toward helpful and effective ends such as strengthening early warning for disasters, enhancing awareness by providing real-time feedback, or better understanding social systems. The necessary capabilities enable the integration of big data into ongoing processes rather than one-time projects, thereby enabling its value to be continually released and refined. Spaces will be needed in which such technical, cultural, and institutional capabilities can commensurately develop. For example, members of various institutions, corporations, and governments may convene to develop a shared perspective on the usefulness of big data for poverty reduction and agree to standards on its utilization. Given the variety and pervasiveness of the capabilities necessary to utilize big data to address big problems, collaborative spaces are needed to enhance the capacity of individuals, organizations, businesses, and institutions to elucidate challenges and solutions in an interactive manner, strengthening a global culture of learning to reduce poverty and promote shared prosperity.

REFERENCES

[1] H. Varian, "Big Data: New Tricks for Econometrics," Journal of Economic Perspectives, vol. 28, no. 2, pp. 3-28, 2014.

[2] M. Hilbert and P. López, "The world's technological capacity to store, communicate, and compute information," Science, vol. 332, no. 6025, pp. 60-65, 2011.

[3] V. Mayer-Schonberger and K. Cukier, Big Data: A Revolution that Will Transform how We Live, Work and Think, New York: Houghton Mifflin Harcourt Publishing Company, 2013.

[4] International Telecommunication Union, "The World in 2013: ICT Facts and Figures," [Online]. Available: http://www.itu.int/en/ITU-D/Statistics/Pages/facts/default.aspx. [Accessed January 2014].

[5] J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh and A. H. Byers, "Big data: The next frontier for innovation, competition, and productivity," McKinsey & Company, 2011.

[6] D. Fisher, R. DeLine, M. Czerwinski and S. Drucker, "Interactions with big data analytics," Interactions, vol. 19, no. 3, pp. 50-59, 2012.

[7] R. Bucklin and S. Gupta, "Commercial use of UPC scanner data: Industry and academic perspectives," Marketing Science, vol. 18, no. 3, pp. 247-273, 1999.

[8] F. Diebold, "A personal perspective on the origin(s) and development of 'big data': The phenomenon, the term, and the discipline," Working Paper, 2012.

[9] S. Lohr, "The age of big data," New York Times, 2012.

[10] United Nations Global Pulse, "Big Data for Development: Challenges and Opportunities," United Nations, 2012.

[11] A. McAfee and E. Brynjolfsson, "Big data: the management revolution," Harvard Business Review, vol. 90, no. 10, pp. 60-68, 2012.

[12] M. Milakovich, "Anticipatory Government: Integrating Big Data for Smaller Government," in Oxford Internet Institute "Internet, Politics, Policy 2012" Conference, Oxford, 2012.

[13] D. Boyd and K. Crawford, "Six provocations for big data," in A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 2011.

[14] The World Bank, "Open Data Essentials," 2014. [Online]. Available: http://data.worldbank.org/about/open-government-data-toolkit/knowledge-repository. [Accessed July 2014].

[15] L. Hoffmann, "Data mining meets city hall," Communications of the ACM, vol. 55, no. 6, pp. 19-21, 2012.

[16] J. Manyika, M. Chui, D. Farrell, S. V. Kuiken, P. Groves and E. A. Doshi, "Open Data: Unlocking Innovation and Performance with Liquid Information," McKinsey & Company, 2013.

[17] S. LaValle, E. Lesser, R. Shockley, M. S. Hopkins and N. Kruschwitz, "Big data, analytics and the path from insights to value," MIT Sloan Management Review, vol. 52, no. 2, pp. 21-31, 2011.

[18] S. Lucas, Interviewee, [Interview]. December 2013.

[19] A. Asquer, "The Governance of Big Data: Perspectives and Issues," in First International Conference on Public Policy, Grenoble, 2013.

[20] Centre for Economics and Business Research, "Data equity: unlocking the value of big data," SAS, 2012.

[21] United Nations, "A New Global Partnership: Eradicate Poverty and Transform Economies Through Sustainable Development," United Nations Publications, New York, 2013.

[22] J. Blumenstock, Interviewee, [Interview]. January 2014.

[23] V. Frias-Martinez, Interviewee, [Interview]. December 2013.

[24] M. Khouja, Interviewee, [Interview]. December 2013.

[25] A. Siegel, Interviewee, [Interview]. December 2013.

[26] N. Eagle, Interviewee, [Interview]. December 2013.

[27] A. Leshinsky, Interviewee, [Interview]. January 2014.

[28] R. Kirkpatrick, Interviewee, [Interview]. February 2014.

[29] A. Cavallo, Interviewee, [Interview]. February 2014.

[30] N. Scott and S. Batchelor, "Real Time Monitoring in Disasters," IDS Bulletin, vol. 44, no. 2, pp. 122-134, 2013.

[31] S. Thornton, Interviewee, [Interview]. February 2014.

[32] United Nations Global Pulse, "Big Data for Development: A Primer," United Nations Publications, 2013.

[33] W. Hall, N. Shadbolt, T. Tiropanis, K. O'Hara and T. Davies, "Open data and charities," Nominet Trust, 2012.

[34] A. R. Syed, K. Gillela and C. Venugopal, "The Future Revolution on Big Data," International Journal of Advanced Research in Computer and Communication Engineering, vol. 2, no. 6, pp. 2446-2451, 2013.

[35] E. Bertino, P. Bernstein, D. Agrawal, S. Davidson, U. Dayal, M. Franklin, ... and J. Widom, "Challenges and Opportunities with Big Data," White Paper, 2011.

[36] World Economic Forum, "Big Data, Big Impact: New Possibilities for International Development," World Economic Forum, Geneva, 2012.

[37] D. Bollier, "The Promise and Peril of Big Data," The Aspen Institute, Washington, D.C., 2010.

[38] M. Ghasemali, Interviewee, [Interview]. January 2014.

[39] R. Cross, Interviewee, [Interview]. December 2013.

[40] E. Wetter and L. Bengtsson, Interviewees, [Interview]. January 2014.

[41] K. Warner, T. Afifi, T. Rawe, C. Smith and A. De Sherbinin, "Where the rain falls: Climate change, food and livelihood security, and migration," 2012.

[42] M. Hilbert, "Big Data for Development: From Information- to Knowledge Societies," Working Paper, 2013.

[43] L. F. Mejía, D. Monsalve, Y. Parra, S. Pulido and Á. M. Reyes, "Indicadores ISAAC: Siguiendo la actividad sectorial a partir de Google Trends," Ministerio de Hacienda y Crédito Público, Bogotá, 2013. Available: http://www.minhacienda.gov.co/portal/page/portal/HomeMinhacienda/politicafiscal/reportesmacroeconomicos/NotasFiscales/22 Siguiendo la actividad sectorial a partir de Google Trends.pdf. [Accessed 28 August 2014].

[44] L. T., Interviewee, [Interview]. December 2013.

[45] R. Vasa, Interviewee, [Interview]. January 2014.

[46] R. Smolan and J. Erwitt, The Human Face of Big Data, Sausalito, CA: Against All Odds Productions, 2012.

[47] N. Eagle, M. Macy and R. Claxton, "Network Diversity and Economic Development," Science, vol. 328, no. 5981, pp. 1029-1031, 2010.

[48] IBM Global Business Services, "Analytics: The real-world use of big data in financial services," IBM Global Services, Somers, NY, 2013.

[49] NYSE Euronext, "NYXdata > Data Products > NYSE Euronext > FlexTrade," 2013. [Online]. Available: http://www.nyxdata.com/nysedata/Default.aspx?tabid=1171. [Accessed February 2014].

[50] L. Barrington, Interviewee, [Interview]. January 2014.

[51] PriceStats, "PriceStats," 2014. [Online]. Available: http://www.pricestats.com. [Accessed February 2014].

[52] A. Robertson and S. Olson, "Sensing and Shaping Emerging Conflicts: Report of a Joint Workshop of the National Academy of Engineering and the United States Institute of Peace: Roundtable on Technology, Science, and Peacebuilding," The National Academies Press, 2013.

[53] UNICEF, Regional Office for Central and Eastern Europe and the Commonwealth of Independent States, "Tracking Anti-Vaccination Sentiment in Eastern European Social Media Networks," 2013.

[54] A. Howard, Data for the Public Good, O'Reilly Media, 2012.

[55] OECD, "New Data for Understanding the Human Condition: International Perspectives," OECD, 2013.

[56] M. Lagi, K. Bertrand and Y. Bar-Yam, "The food crises and political instability in North Africa and the Middle East," arXiv:1108.2455, 2011.

[57] S. Greengard, "Policing the Future," Communications of the ACM, vol. 55, no. 3, pp. 19-21, 2012.

[58] IBM Institute for Business Value, "Analytics: The real-world use of big data," 2012.

[59] B. Wescott, Interviewee, [Interview]. January 2014.

[60] V. Lehdonvirta and M. Ernkvist, "Converting the virtual economy into development potential: knowledge map of the virtual economy," InfoDev/World Bank White Paper, pp. 5-17, 2011.

[61] K. Kitner, Interviewee, [Interview]. December 2013.

[62] S. Lohr, "Sure, Big Data is Great. But So Is Intuition," New York Times, 29 December 2012.

[63] SecondMuse, "Random Hacks of Kindness 2013 Report," 2013.

[64] United Nations Global Pulse, "Mining Indonesian Tweets to Understand Food Price Crises," United Nations Publications, 2014.

[65] F. Almeida and C. Calistru, "The main challenges and issues of big data management," International Journal of Research Studies in Computing, vol. 2, no. 1, 2012.

[66] B. Fecher, "Open Science: One Term, Five Schools of Thought," in The 1st International Conference on Internet Science, Brussels, 2013.

ANNEX 1
SELECTED BIBLIOGRAPHY

Big data, general topics

Almeida, F. L. F., & Calistru, C. (2012). The main challenges and issues of big data management. International Journal of Research Studies in Computing, 2(1).

Asquer, A. (2013). The Governance of Big Data: Perspectives and Issues. Retrieved from http://ssrn.com/abstract=2272608 or http://dx.doi.org/10.2139/ssrn.2272608

Bertino, E., Bernstein, P., Agrawal, D., Davidson, S., Dayal, U., Franklin, M., ... & Widom, J. (2011). Challenges and Opportunities with Big Data. Whitepaper presented for the Computing Community Consortium.

Bollier, D., & Firestone, C. M. (2010). The promise and peril of big data (p. 56). Washington, DC, USA: Aspen Institute, Communications and Society Program.

Boyd, D., & Crawford, K. (2011). Six provocations for big data. Presented at the Oxford Internet Institute's "A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society" on September 21, 2011.

Bucklin, R. E., & Gupta, S. (1999). Commercial use of UPC scanner data: Industry and academic perspectives. Marketing Science, 18(3), 247-273.

Centre for Economics and Business Research. (2012). Data equity: unlocking the value of big data. Centre for Economics and Business Research White Paper, 4, 7-26.

Fisher, D., DeLine, R., Czerwinski, M., & Drucker, S. (2012). Interactions with big data analytics. Interactions, 19(3), 50-59.

Hilbert, M., & López, P. (2011). The world's technological capacity to store, communicate, and compute information. Science, 332(6025), 60-65.

IBM Institute for Business Value. (2012). Analytics: The real-world use of big data. Executive Report. Retrieved from http://www-935.ibm.com/services/us/gbs/thoughtleadership/ibv-big-data-at-work.html

LaValle, S., Lesser, E., Shockley, R., Hopkins, M. S., & Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21-31.

Lohr, S. (2012). Sure, Big Data is Great. But So Is Intuition. Retrieved from http://www.nytimes.com/2012/12/30/technology/big-data-is-great-but-dont-forget-intuition.html?_r=0

Lohr, S. (2012). The age of big data. New York Times. Retrieved from http://www.nytimes.com/2012/02/12/sunday-review/big-datas-impact-in-the-world.html?pagewanted=all

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

Mayer-Schönberger, V., & Cukier, K. (2013). Big Data: A Revolution that Will Transform how We Live, Work, and Think. Eamon Dolan/Houghton Mifflin Harcourt.

McAfee, A., & Brynjolfsson, E. (2012). Big data: the management revolution. Harvard Business Review, 90(10), 60-66.

Smolan, R., & Erwitt, J. (2012). The human face of big data. Sausalito, CA: Against All Odds Productions.

Syed, A. R., Gillela, K., & Venugopal, C. (2013). The Future Revolution on Big Data. International Journal of Advanced Research in Computer and Communication Engineering, 2(6), 2446-2451.

Big data case studies

Eagle, N., Macy, M., & Claxton, R. (2010). Network Diversity and Economic Development. Science, 328(5981), 1029-1031. Retrieved from http://realitymining.com/pdfs/Eagle_Science10.pdf

Greengard, S. (2012). Policing the Future. Communications of the ACM, 55(3), 19-21.

Herrera, J. C., Work, D. B., Herring, R., Ban, X. J., Jacobson, Q., & Bayen, A. M. (2010). Evaluation of traffic data obtained via GPS-enabled mobile phones: The Mobile Century field experiment. Transportation Research Part C, 18, 568-583.

Hoffmann, L. (2012). Data mining meets city hall. Communications of the ACM, 55(6), 19-21.

Lagi, M., Bertrand, K. Z., & Bar-Yam, Y. (2011). The Food Crises and Political Instability in North Africa and the Middle East. arXiv preprint arXiv:1108.2455.

Milakovich, M. (2012). Anticipatory Government: Integrating Big Data for Smaller Government. Paper presented at the Oxford Internet Institute "Internet, Politics, Policy 2012" Conference, Oxford, 20-21 September.

Robertson, A., & Olson, S. (2013). Sensing and Shaping Emerging Conflicts. The National Academies Press, 13-14. Retrieved from http://www.nap.edu/catalog.php?record_id=18349

Robertson, C., Sawford, K., Daniel, S. L., Nelson, T. A., & Stephen, C. (2010). Mobile phone-based infectious disease surveillance system, Sri Lanka. Emerging Infectious Diseases, 16(10), 1524.

Scott, N., & Batchelor, S. (2013). Real Time Monitoring in Disasters. IDS Bulletin, 44(2), 122-134.

UNICEF, Regional Office for Central and Eastern Europe and the Commonwealth of Independent States. (2013). Tracking Anti-Vaccination Sentiment in Eastern European Social Media Networks. Retrieved from http://www.unicef.org/ceecis/Tracking-anti-vaccination-sentiment-in-Eastern-European-social-media-networks.pdf

Big data for development

Hilbert, M. (2013). Big Data for Development: From Information- to Knowledge Societies. Available at SSRN 2205145.

Howard, A. (2012). Data for the public good. O'Reilly.

Karlsrud, J. (2014). Peacekeeping 4.0: Harnessing the Potential of Big Data, Social Media, and Cyber Technologies. In Cyberspace and International Relations (pp. 141-160). Springer Berlin Heidelberg.

Lehdonvirta, V., & Ernkvist, M. (2011). Converting the virtual economy into development potential: knowledge map of the virtual economy. InfoDev/World Bank White Paper, 1, 5-17.

Organisation for Economic Co-operation and Development. (2013). New Data for Understanding the Human Condition. Retrieved from http://www.oecd.org/sti/sci-tech/new-data-for-understanding-the-human-condition.htm

United Nations Global Pulse. (2012). Big Data for Development: Challenges & Opportunities. UN, New York, NY.

United Nations Global Pulse. (2013). Big Data for Development: A Primer. Retrieved from http://www.unglobalpulse.org/bigdataprimer

World Economic Forum. (2012). Big Data, Big Impact: New Possibilities for International Development. Retrieved from http://www.weforum.org/reports/big-data-big-impact-new-possibilities-international-development

Open data / Open science

Fecher, B., & Friesike, S. (2013). Open Science: One Term, Five Schools of Thought (No. 218). German Council for Social and Economic Data (RatSWD).

Hall, W., Shadbolt, N., Tiropanis, T., O'Hara, K., & Davies, T. (2012). Open data and charities. Nominet Trust. Retrieved from http://www.nominettrust.org.uk/knowledge-centre/articles/open-data-and-charities

McKinsey Global Institute. (2013). Open Data: Unlocking innovation and performance with liquid information. Retrieved from http://www.mckinsey.com/insights/business_technology/open_data_unlocking_innovation_and_performance_with_liquid_information

ANNEX 2
INTERVIEW LIST

Aaron Siegel, Head of Interaction and Online Experience, Fabrica Benetton. Topic areas: data visualization.

Alberto Cavallo, Billion Prices Project Lead, MIT Sloan School of Management. Topic areas: economics, price indices.

Anthony Leshinsky, Media Services Lead, Coldlight Solutions. Topic areas: business intelligence, data analytics.

Bill Wescott, Executive Vice President, The CoSMo Company. Topic areas: satellite data, systems modeling.

Bayan Bruss, Mohamad Khouja, Jian Khoddedad and Jon Pouya Ehsani, Co-Founders, Logawi. Topic areas: text analytics and analysis.

Erik Wetter, Assistant Professor, Stockholm School of Economics; Co-founder, Flowminder. Topic areas: mobile data, big data for development.

Graham Dodge, CEO and Founder, Sickweather. Topic areas: social media data.

Jean Francois Barsoum, Senior Managing Consultant, Smarter Cities Water and Transportation, IBM. Topic areas: big data, urban infrastructure and operations.

Joshua Blumenstock, Assistant Professor, University of Washington. Topic areas: mobile and social media data.

Kathi Kitner, Senior Researcher, Anthropologist, Intel Labs. Topic areas: big data for development.

Laura Crow, Principal Product Manager, M-PESA. Topic areas: mobile data and financial data.

Linus Bengtsson, Executive Director, Flowminder/Karolinska Institutet. Topic areas: mobile data, big data for development.

Luke Barrington, Senior Manager, Research and Development, Digital Globe. Topic areas: satellite data, big data analysis.

Mahyar Ghasemali, Partner and Co-founder, dbSeer. Topic areas: data infrastructure and processing.

Mohamad Khouja, Data Scientist and Big Data Solutions Specialist, IREX/Logawi. Topic areas: opinion mining, sentiment and lexical analysis.

Nathan Eagle, CEO, Jana.

Robin Cross, Research Director, DemandLink. Topic areas: retail data and predictive analysis.

Sean Thornton, Research Fellow, Data-Smart City Solutions. Topic areas: data and
Shannon Lucas
Senior Enterprise Innovation Manager, Vodafone
Topic areas: mobile and financial data

Rajesh Vasa
Senior Lecturer, Faculty of Information and Communication Technologies, Swinburne University of Technology
Topic areas: social media big data

Vanessa Frias-Martinez
Research Scientist, Telefonica Research
Topic areas: mobile data

Renato de Gusmao Cerqueria
Senior Manager, Natural Resources Solutions, IBM Research Brazil
Topic areas: data analytics and urban systems

Robert Kirkpatrick
Director, UN Global Pulse
Topic areas: big data for development

ANNEX 3
GLOSSARY

3 "V"s - A term defining certain properties of big data: volume (the quantity of data), velocity (the speed at which data is processed), and variety (the various types of data).

Algorithm - A formula or step-by-step procedure for solving a problem.

Anonymization - The process of removing specific identifiers (often personal information) from a dataset.

API (Application Programming Interface) - A set of tools and protocols for building software applications that specify how software components interact.

Business intelligence - The use of software tools to gain insight and understanding into a company's operations.

Clickstream analytics (analysis) - The collection, analysis, and reporting of data about the quantity and succession of mouse clicks made by website visitors.

Crowdsourcing - The collection of data through contributions from a large number of individuals.

Data cleaning/cleansing - The detection and removal, or correction, of inaccurate records in a dataset.

Data exhaust - Data collected as a digital by-product of other behaviors.

Data governance - The process of handling and managing the data used in an endeavor, including policies, data quality, and risk management.

Data migration - The transition of data from one format or system to another.

Data science - The gleaning of knowledge from data as a discipline that includes elements of programming, mathematics, modeling, engineering, and visualization.

Data silos - Fixed or isolated data repositories that do not interact dynamically with other systems.

Data warehousing - The practice of copying data from operational systems into secondary, offline databases.

Geospatial analysis - A form of data visualization that overlays data on maps to facilitate better understanding of the data.

Hadoop - An open-source platform for developing distributed, data-intensive applications.

Internet of things - The unique digital identifiers in objects that can automatically share data and be represented in a virtual environment.

Latency - The delay in the delivery of data from one point to another, or in one system's response to another.

Machine learning - The creation of systems that can learn or improve themselves on the basis of data; often linked to artificial intelligence.

Map/reduce - A method of breaking a complex problem into many chunks, distributing them across many computers, and then reassembling the results into a single answer.

Mashup - The use of data from more than one source to generate new insight.

Metadata - Information about, and descriptions of, data.

Nowcasting - A combination of "now" and "forecasting," used in both meteorology and economics to refer to immediate-term forecasting on the basis of real-time data flows.

Open data - Public, freely available data.

Open science - An initiative to make scientific data and research openly accessible.

Petabyte - 1 thousand terabytes.

Predictive analytics/modeling - The analysis of contemporary and historic trends using data and modeling to predict future occurrences.

Quantitative data analysis - The use of complex mathematical or statistical modeling to explain, or predict, financial and business behavior.

Reality mining - The study and analysis of human interactions and behavior through the use of mobile phone, GPS, and other machine-sensed environmental data.

Relational database - A database in which information is formally described and organized in tables representing relations.

Sentiment analysis (opinion mining) - The use of text analysis and natural language processing to assess the attitudes of a speaker, author, or group.

Structured data - Data arranged in an organized data model, like a spreadsheet or relational database.

Terabyte - 1 thousand gigabytes.

Text analytics - The process of deriving insight from unstructured, text-based data.

Topic modelling - The use of statistical models or algorithms to decipher themes and structure in datasets.

Tweet - A post on the Twitter social networking site, restricted to a string of up to 140 characters.

Unstructured data - Data that cannot be stored in a relational database and can be more challenging to analyze, ranging from documents and tweets to photos and videos.

Visualization - Graphic ways of presenting data that help people make sense of huge amounts of information.

Photo credits: cover page: Mor Naaman / flickr.com; page 2: Martin Sojka / flickr.com; page 7: Uncultured / flickr.com; page 12: CGIAR Climate / flickr.com; page 25: CGIAR Climate / flickr.com; page 35: Linh Nguyen / unsplash.com; page 42: NASA's Marshall Space Flight Center / flickr.com; page 54: unsplash.com; page 56: Wojtek Witkowski / flickr.com
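The glossary's map/reduce entry can be illustrated with a minimal sketch. The example below runs the two phases on a single machine using Python's standard library (in a real system such as Hadoop, the mappers and reducers would run on many computers); the sample text chunks are invented for illustration.

```python
from collections import Counter
from functools import reduce

# Map/reduce sketch: each "mapper" counts words in one chunk of text;
# the "reducer" merges the partial counts into a single answer.

def map_chunk(chunk: str) -> Counter:
    """Mapper: produce a partial word count for one chunk."""
    return Counter(chunk.lower().split())

def reduce_counts(a: Counter, b: Counter) -> Counter:
    """Reducer: merge two partial counts."""
    return a + b

chunks = [
    "big data in action",
    "data for development",
    "big data for development",
]

partial_counts = [map_chunk(c) for c in chunks]           # map phase
total = reduce(reduce_counts, partial_counts, Counter())  # reduce phase
print(total["data"])  # "data" appears once in each of the three chunks
```

The same split-process-merge pattern scales because each mapper needs only its own chunk, so chunks can be processed in parallel before the reduce step combines them.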
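The sentiment analysis (opinion mining) entry can likewise be made concrete with a toy lexicon-based scorer. The word lists below are invented for this sketch; production systems use large lexicons or machine-learned models rather than a handful of hand-picked words.

```python
# Toy lexicon-based sentiment scorer: count positive words minus
# negative words in a text. Hypothetical word lists for illustration.

POSITIVE = {"good", "great", "helpful", "reliable"}
NEGATIVE = {"bad", "poor", "slow", "unreliable"}

def sentiment_score(text: str) -> int:
    """Return (# positive words) - (# negative words) for a text."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

print(sentiment_score("the service was good and reliable"))  # positive score
print(sentiment_score("slow and unreliable network"))        # negative score
```

A positive score suggests favorable sentiment and a negative score unfavorable sentiment; aggregating such scores over many social media posts is the kind of analysis several case studies in this report rely on.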