FINAL REPORT: Digital Jobs for Urban Resilience
Azavea Tree Canopy Analysis

OVERVIEW

The World Bank GFDRR team engaged Azavea to aid them in the creation of a machine learning algorithm that would enable them to predict tree canopy cover in two African cities: Freetown, Sierra Leone and Dar-es-Salaam, Tanzania. As part of the effort, a team of students at a local university participated in a limited labeling campaign using Azavea's labeling application, GroundWork. Simultaneously, Azavea's third-party vendor and labeling partner, CloudFactory, labeled the same imagery. Both labeled datasets were used to produce, test, or validate the tree canopy model that Azavea developed. Overall, we feel the pilot project was successful, produced good results, and could be replicated and scaled with relative ease.

METHODOLOGY

GROUNDWORK

To facilitate labeling, several projects for both the Freetown and Dar-es-Salaam areas of interest (AOIs) were created in Azavea's geospatial labeling application, GroundWork. The Students focused almost exclusively on one such project for the Freetown AOI. Within the project, the tasks (a task is defined in GroundWork as a specific piece of the image available for labeling) were assigned randomly, as is the case with all GroundWork projects. Labelers were allowed to skip tasks or flag them as not fit for labeling.

At the beginning of the labeling process, Azavea staff provided a series of materials to aid the Students, most notably a PowerPoint presentation with both example images and explicit instructions. Once the labelers began, Azavea staff reviewed a sampling of tasks, aiming to look at at least one from each labeler, in order to assess quality and provide constructive feedback. In general, we found that the Students were not labeling the entire task at the start, but they remedied this after receiving the feedback. Over the course of the three weeks, fifty Students labeled 4,681 tasks.

DATA

The modeling process and all the experiments were based on the following data sources:
● 8-band WorldView-2 satellite imagery for Freetown (acquired January 2020; 21,707 by 25,132 pixels covering approximately 117.12 square kilometers) and Dar-es-Salaam (acquired December 2019; 18,130 by 28,364 pixels covering approximately 78.94 square kilometers).
● CloudFactory (CF) labels for Freetown and Dar-es-Salaam (please see Figures F.1 and F.2 for the label and validated-label footprints).
● Student labels for Freetown (the image was almost entirely labeled, but the labels were unvalidated).

Because Freetown was the only city for which we had both sets of labels (Student and CF), and because labeling for Dar-es-Salaam started only after Freetown was finished, most of the experiments discussed below were performed on Freetown imagery only.

It should be noted that, because the labeling tasks available in GroundWork were completed at different rates for the Student labels (97%) and the CF labels (~50%), the number of images in the Student-labeled and the CF-labeled data differed. This turned out to be unimportant, as discussed in the "Modeling: Effect of Data Size" section below.

Figure F.1: The Freetown imagery (left) with the footprint of CloudFactory labels (middle) and the footprint of the subset of CloudFactory labels that were validated (or double-checked) (right). Note that these images do not show the labels themselves; they show the areas that have been labeled.
Figure F.2: The Dar-es-Salaam imagery (left) with the footprint of CloudFactory labels (middle) and the footprint of validated CloudFactory labels (right).

DATA: TRAINING AND VALIDATION SPLITS

The Freetown imagery was split such that the left 80% of the city became the training set while the right 20% became the validation set.[1][2] This was done to ensure a clean separation and to prevent data leakage between the training and validation sets. The models would therefore be validated on a part of the city that they had not seen at all during training, which provides a better measure of their performance.

[1] We completed k-fold validation experiments using the other four possible 20%-wide slices and obtained similar F1 scores for detecting tree canopy using models trained on validated CF labels (0.945 when the right-most 20% is the validation set and 0.948 when all five possible 20% vertical slices are taken in turn as the validation set and the results are averaged).
[2] For purposes of the analysis in this document, it is possible to consider the two partitions of Freetown to respectively be the training and validation splits, with the whole of Dar-es-Salaam the test split (please see the section on Model Transferability).

MODELING

MODELING: OBJECTIVES

The main objectives of the modeling process were understood to be the following:
1. Obtain a "good" model using some combination of imagery and labels;
   ○ We identified the Tree Canopy F1 score[3] to be the most important metric for measuring model performance;
2. Evaluate how the model performance changes with the amount of data it is trained on;
3. Compare the quality of Student labels with that of CF labels; and
4. Evaluate the transferability of a model trained on one city to another city.

[3] We chose F1 score as our metric for evaluating segmentation quality. This is a standard metric for measuring segmentation quality because it achieves its maximum when both recall (percentage of true positive pixels found) and precision (percentage of positive responses which are true positives) are maximized, and it penalizes failure in either of those areas.

MODELING: ESTABLISHING A BASELINE

We first sought to establish baseline results (results produced by a minimal model that should be surpassed by any sophisticated model) to measure the subjective difficulty of the inference problem and to provide a point of comparison for our progress.

BASELINE: THRESHOLDING

This baseline model performs simple thresholding of pixel values. Thresholding of normalized indices is a common baseline technique in GIS, but for our tests we strengthened the baseline by building it directly from the image statistics. We noticed that the respective distributions of tree canopy and background pixel values (Figure B.0) differed significantly and that the distribution of tree pixels was fairly narrow. This indicated that it might be possible to go a long way in identifying trees by simply classifying as tree pixels all pixels that are close enough to the mean of the tree distribution (say, within 3 standard deviations). This can be seen as a simple form of the fundamental technique of thresholding in image segmentation, which involves separating a foreground class (in this case, trees) from a background class by finding a threshold that separates their respective pixel value distributions.

Figure B.0: Histograms showing the distribution of pixel values for the red, green, and blue channels for trees and background. The dashed lines represent the respective means. These plots are based on 50 randomly sampled images from the training set.

To summarize: we calculated the mean and standard deviation of the intensities of the red, green, and blue bands for pixels labeled as "Tree Canopy" and for pixels labeled as "Background". All pixels whose red, green, and blue values were all less than three standard deviations from the respective channel means for "Tree Canopy" were given that label; all others were labeled as "Background".

On the CloudFactory labels, this procedure produced an overall F1 score of 0.873, a Background F1 score of 0.789, and a Tree Canopy F1 score of 0.905. On the Student-produced labels, this procedure produced an overall F1 score of 0.736, a Background F1 score of 0.482, and a Tree Canopy F1 score of 0.780.
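As a minimal illustration of this procedure, the following NumPy sketch fits the per-channel tree-canopy statistics and applies the three-standard-deviation rule; the chip and mask variable names are placeholders for illustration only, not part of any actual codebase.

```python
import numpy as np

def fit_threshold_stats(chips, masks):
    """Per-channel mean and std of R, G, B for pixels labeled as tree canopy."""
    tree_pixels = np.concatenate(
        [chip[mask] for chip, mask in zip(chips, masks)]  # rows of (R, G, B) values
    )
    return tree_pixels.mean(axis=0), tree_pixels.std(axis=0)

def predict_tree_canopy(chip, mean, std, k=3.0):
    """Label a pixel as tree canopy if all three channels lie within
    k standard deviations of the tree-canopy channel means."""
    return np.all(np.abs(chip - mean) < k * std, axis=-1)

# Usage sketch: `train_chips` are (H, W, 3) RGB arrays and `train_masks` are
# boolean (H, W) tree-canopy masks derived from the labels.
# mean, std = fit_threshold_stats(train_chips, train_masks)
# prediction = predict_tree_canopy(validation_chip, mean, std)
```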
Generally speaking, this approach does a fair job of detecting actual tree canopy, but it does not do a good job of distinguishing between tree canopy and other types of foliage. Some examples are below.

Figure B.1: Baseline results produced from Student-made labels. Note that the "Ground truth labels" in the top row show an example of imperfect labeling. The sharp edges likely correspond to task boundaries in GroundWork, where each task might have been labeled by a different labeler.

Figure B.2: Baseline results produced from CF labels.

The predictions shown in Figures B.1 and B.2 reinforce the earlier speculation that the baseline technique does not do a good job of discriminating between tree canopy and other types of foliage. The baseline technique also tends to produce many false positive and false negative pixels (i.e., the dark dots in the yellow regions and vice versa in Figures B.1 and B.2), mainly because it does not take the pixel's spatial context into account. (Also notice that a comparison of the columns labeled "Ground truth labels"[4] in Figures B.1 and B.2 anecdotally shows the difference in quality between the Student-produced and CloudFactory-produced labels.)

[4] Here, the term "ground truth labels" refers to labels that have been generated by human inspection of the imagery, not to data that were generated by physically visiting the site.

BASELINE: NORMALIZED INDICES

As an extension of the baseline work described above, we experimented with a shallow (that is, not deep) architecture that mimics the structure of normalized indices while being easily tunable from data. The purpose of this experiment was to provide an upper bound on the expected performance of a typical normalized index, such as NDVI, on this problem. We took this approach rather than testing NDVI directly because it is not clear that NDVI is the optimal choice among all possible normalized indices.
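The report does not specify the exact form of this shallow architecture, so the following is only a plausible sketch: a normalized index such as NDVI = (NIR - Red) / (NIR + Red) generalized so that the two "bands" are learnable linear combinations of all eight input bands, followed by a per-pixel classifier.

```python
import torch
import torch.nn as nn

class LearnableNormalizedIndex(nn.Module):
    """A shallow, tunable analogue of a normalized index such as NDVI.

    NDVI fixes a = NIR and b = Red and computes (a - b) / (a + b); here a and b
    are learnable linear combinations of all input bands instead. This is one
    plausible parameterization, not necessarily the architecture used in the
    experiments reported above.
    """

    def __init__(self, num_bands: int = 8, eps: float = 1e-6):
        super().__init__()
        self.a = nn.Conv2d(num_bands, 1, kernel_size=1)  # learned "numerator" band
        self.b = nn.Conv2d(num_bands, 1, kernel_size=1)  # learned "denominator" band
        self.classify = nn.Conv2d(1, 2, kernel_size=1)   # index -> {background, tree} logits
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.a(x), self.b(x)
        index = (a - b) / (a + b + self.eps)  # normalized-index structure
        return self.classify(index)           # per-pixel class logits
```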
Using all 8 bands available in the imagery for input and using validated CloudFactory labels on Freetown with an 80/20 training/validation split, this approach produced the following results: an average F1 score of 0.891, a Background F1 score of 0.836, and a Tree Canopy F1 score of 0.917. Using all 8 bands available in the imagery for input and using validated CloudFactory labels on both Freetown and Dar-es-Salaam with an 80/20 training/validation split, this approach produced an average F1 score of 0.846, a Background F1 score of 0.827, and a Tree Canopy F1 score of 0.896.

The similarity of the performance of this model to that of the baseline indicates that one can reasonably expect to achieve a Tree Canopy F1 score in the neighborhood of 0.9 by relying solely on spectral properties, but (as will be discussed later) some higher-level vision capability is seemingly needed to go beyond that.

MODELING: DEEP LEARNING

DEEP LEARNING: MODEL ARCHITECTURES

We used two different model architectures in our experiments:
1. Panoptic FPN with a ResNet-18 backbone (Kirillov et al., 2019), and
2. DeepLab V3 with a ResNet-50 backbone (Chen et al., 2017).

We found Panoptic FPN models to be computationally much more efficient than DeepLab models while producing the same quality of results, so Panoptic FPN is what we chose to use for the majority of our experiments.

DEEP LEARNING: TRAINING

All models used were pre-trained on ImageNet. The models were trained via Azavea's open source machine learning library, Raster Vision, for 20 epochs, using a batch size of 8, a learning rate of 0.0001 with a 1-cycle schedule (Smith & Topin, 2019), and the Adam optimizer (Kingma & Ba, 2015).

DEEP LEARNING: INITIAL RESULTS

Initial results from training on RGB imagery produced a significant improvement over the baseline results (Table B.1). The Student models seemed to do poorly in comparison to the CF models in terms of scores, but visual inspection of the model outputs showed them to be of similar quality to those of the CF models (Figure B.8). This suggested the possibility that the validation sets for both the Student and CF datasets contained poorly labeled images, which were causing the validation scores to suffer even though the models were producing good results.

  Model          Bands   Tree Canopy F1 (CF labels)   Tree Canopy F1 (Student labels)
  Panoptic FPN   RGB     0.932 ± 0.003                0.873 ± 0.002
  DeepLab        RGB     0.930 ± 0.000                0.872 ± 0.001

Table B.1: Initial model validation results. The numbers in this and subsequent tables are averages of the individual scores from 4 parallel training runs of the same model; the error margins represent 95% confidence intervals. Both are rounded to 3 significant figures.

DEEP LEARNING: MORE RELIABLE MODEL VALIDATION

To get a more reliable measure of model performance, we re-computed these scores for both the Student and CF models on only a subset of CF-labeled images (in the validation set), formed by keeping only those images whose labeling had been validated by expert labelers and also manually discarding some other instances of bad labeling. This subset consisted of 538 images in total. This restricted validation set was used to compute all the scores reported in subsequent tables.

  Model          Bands   Trained on       Tree Canopy F1 (on validated CF labels)
  Baseline       RGB     CF labels        0.913
  Panoptic FPN   RGB     CF labels        0.957 ± 0.001
  Baseline       RGB     Student labels   0.854
  Panoptic FPN   RGB     Student labels   0.931 ± 0.000

Table B.2: Validation results on the more authoritative subset of CF labels.

The scores on this more authoritative validation set lead us to conclude that:
● Student models are much better than the original Student-label validation set suggested (see the section "Modeling: Effect of Label Quality" for more discussion), and
● CF models are objectively a little better than Student models, highlighting the importance of having more accurate labels in the training data.
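For orientation, here is a minimal, generic PyTorch sketch of the training configuration described in the Training subsection above (Adam, a 1-cycle learning rate schedule peaking at 0.0001, 20 epochs, batch size 8). The actual experiments were configured through Raster Vision rather than a hand-written loop, and the model and dataset objects below are placeholders.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs=20, batch_size=8, lr=1e-4, device="cuda"):
    """Hedged sketch of the training configuration; not the Raster Vision internals."""
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # 1-cycle learning rate schedule (Smith & Topin, 2019), stepped once per batch.
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=lr, epochs=epochs, steps_per_epoch=len(loader))

    model.to(device).train()
    for epoch in range(epochs):
        for chips, masks in loader:  # chips: (B, C, H, W) floats; masks: (B, H, W) class ids
            optimizer.zero_grad()
            loss = criterion(model(chips.to(device)), masks.to(device))
            loss.backward()
            optimizer.step()
            scheduler.step()
```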
DEEP LEARNING: MAKING USE OF OTHER BANDS

With a reliable validation set established, we focused on approaches to improving the results. One of the things we tried was to make use of the remaining bands (other than RGB) that were available in the 8-band imagery. Another promising possibility was making use of computed bands, such as the Normalized Difference Vegetation Index (NDVI), that are specialized for detecting vegetation. To this end, we trained a model that made use of all 8 bands and another model that worked on just the NDVI band. See Table B.3 for results.

For the 8-band model, we used a custom implementation adapted from the fusion technique described in the FuseNet paper (Hazirbas et al., 2016). This involves adding a new backbone (that takes in 5-channel inputs) to the neural network, parallel to the existing backbone (that takes in 3-channel inputs), with connections at multiple points between the two backbones. The new backbone was initialized with pretrained weights. For the NDVI model, we replaced the first convolutional layer of the backbone, which normally takes in 3-channel inputs, with one that takes in 1-channel inputs. The pretrained weights were retained.

Figure B.3: Sample input (first 3 columns) and output of the 8-band model. The bands shown in this image have been split into multiple sub-images for visualization purposes only.

Figure B.4: Sample input and output of the NDVI model.

DEEP LEARNING: DATA AUGMENTATION

Another attempt involved training an RGB model with heavy color-based data augmentation. This has the effect of jittering the colors of the image, forcing the model not to rely too much on color and instead to learn to recognize other properties of trees, such as texture. See Table B.3 for results.

DEEP LEARNING: MODEL ENSEMBLE

Upon closer inspection of the individual predictions of the RGB, NDVI, and 8-band models, we noticed that their predictions differed significantly in some regions, such that one or two of them would get it right even if the others failed. This suggested the possibility of ensembling the models together so that their predictions complement each other. See Table B.3 for results.

DEEP LEARNING: FINAL RESULTS

  Model                         Bands                Tree Canopy F1 (validated CF labels)
  Panoptic FPN                  RGB                  0.957 ± 0.001
  Panoptic FPN                  NDVI                 0.957 ± 0.001
  Panoptic FPN                  8 bands              0.956 ± 0.001
  Panoptic FPN + augmentation   RGB                  0.959 ± 0.001
  Ensemble: Panoptic FPN x3     RGB, NDVI, 8 bands   0.961

Table B.3: Results from all the approaches described above.

From this we can conclude the following:
● The NDVI band on its own is informative enough to produce a competitive model.
● Using all 8 bands does not get us any improvement over just using RGB or NDVI (which is based on Red and InfraRed). This suggests that, given that you already have the Red and IR bands, there is little to no additional information (relevant to this task) that can be learned from the remaining bands, at least within the constraints of this dataset, model architecture, and training procedure.
● Color-based data augmentation is helpful for this problem.
● Ensembling the models that work on different bands provides a small performance boost (small in magnitude, but outside the margin of error, as seen in Table B.3); a sketch of the averaging approach follows this list.
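As referenced in the final bullet above, the ensembling amounts to averaging the per-pixel softmax probabilities of the individual models (see also the Ensemble Prediction subsection under Tools/Workflows). A minimal sketch, assuming each model produces per-pixel class logits:

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, chip):
    """Average softmax probabilities across models, then take the argmax.

    `models` is a list of trained segmentation models (e.g. the RGB, NDVI, and
    8-band models); `chip` is a (1, C, H, W) input tensor. If the models expect
    different input bands, each would be fed its own view of the chip; that
    detail is omitted here for brevity.
    """
    with torch.no_grad():
        probs = [F.softmax(m(chip), dim=1) for m in models]  # (1, classes, H, W) each
    mean_probs = torch.stack(probs).mean(dim=0)
    return mean_probs.argmax(dim=1)  # (1, H, W) predicted class per pixel
```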
MODELING: EFFECT OF DATA SIZE

Using the model configurations in Table B.2 (i.e., RGB Panoptic FPN models trained on Freetown CF and Student labels), we performed experiments to measure the relationship between the quantity of training data and model quality. We built models with CloudFactory labels and separately with Student-made labels. The CloudFactory-derived models were validated on CloudFactory labels and the Student-derived models were validated on Student labels. Two figures (for CloudFactory labels and Student labels, respectively) are below.

Figure B.5: Performance metrics vs. percentage of training data used (CF labels). In each subplot, the solid blue line is the average of the lines from individual runs (shown in light gray) of the same training configuration. The error bars represent 95% confidence intervals.

Figure B.6: Performance metrics vs. percentage of training data used (Student labels). In each subplot, the solid blue line is the average of the lines from individual runs (shown in light gray) of the same training configuration. The error bars represent 95% confidence intervals.

In the case of both CloudFactory labels and Student labels, it can be seen that near-final performance is achieved when 10% of the available training data is used. Trends in precision, recall, and F1 after the 10% mark, for both sets of labels, are minimal and possibly within the noise when the scale of the graphs is considered. This behavior is consistent with the hypothesis that we gave earlier: it seems to be possible to reliably detect tree canopy with mostly pixel-level statistics (which can be reliably established with a small subset of the data) combined with a small amount of higher-level vision; one can speculate that the higher-level vision component is probably organized around the difference in texture between tree canopy and other foliage, and that that difference can be learned with relatively little data.

These results indicate that a model can internalize the particulars of a scene with relatively little training data. With that said, we caution that one should not conclude that fractional labels for one scene are sufficient to produce a high-quality model that works everywhere. It is highly likely that labels from a variety of locations (and possibly a variety of seasons, if seasonal robustness is desired) will be required for that purpose, albeit with a relatively small amount of label data for each scene.
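For readers wishing to reproduce this experiment, a hedged sketch of the setup is to subsample the training chips at increasing fractions and train a fresh model at each fraction; the chip list and training function named below are placeholders, not part of any actual codebase.

```python
import random

def subsample(chips, fraction, seed=0):
    """Keep a random `fraction` of the available training chips."""
    rng = random.Random(seed)
    k = max(1, round(len(chips) * fraction))
    return rng.sample(chips, k)

# Placeholder names: `all_training_chips` and `train_and_score` are assumed to
# exist; the latter trains a model and returns its validation Tree Canopy F1.
# results = {
#     frac: train_and_score(subsample(all_training_chips, frac))
#     for frac in (0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0)
# }
```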
MODELING: EFFECT OF LABEL QUALITY

We have shown instances above where models made with the Student-made labels seem to underperform the CloudFactory ones. In this section we make that more explicit. The two figures below relate to the CloudFactory labels and the Student labels, respectively. The left subfigure of each figure shows the training loss dropping with each subsequent training epoch, as expected. The right subfigures show the loss on the validation set in each subsequent training epoch. The validation loss of the CloudFactory model is trendless and relatively low, while the Student model has a higher validation loss that trends upward.

Figure B.7: Training and validation loss profiles during training for (a) CF labels and (b) Student labels. In each subplot, the solid blue line is the average of the lines from individual runs (shown in light gray) of the same training configuration. The error bars represent 95% confidence intervals.

The early overfitting[5] seen when using Student labels is consistent with the Student labels containing less information (being more "random") than the CloudFactory labels.[6] In such a scenario the training process would drive the model to memorize more and more of the particular peculiarities of the training data rather than useful patterns that can also be used to understand the validation data. As training proceeds, the training loss decreases, but that is due to more and more sterile memorization rather than learning. Despite this, the output of models trained on Student labels, shown in Figure B.8, still looks good visually, indicating that the low scores have more to do with the quality of the labels in the validation set than with the model itself.

[5] Experiments using all five possible 20% vertical slices as validation sets on Student-produced labels found that the early overfitting is very evident when the two rightmost 20% slices are used as validation sets and becomes less evident as one moves to the left. The phenomenon is most evident when the rightmost 20% slice is used as the validation set.
[6] This is an empirical conclusion based on analysis of validation sets that contain larger percentages of tree canopy.

Figure B.8: Sample output of a model trained on Student labels. The results look good despite the low validation scores.

The qualitative evidence from Figure B.8, together with the quantitative results in Table B.2, shows that these models trained on imperfect labels are still able to recognize trees relatively well. This is in line with prior research (Rolnick et al., 2017), and with our prior experience, indicating that deep learning models tend to be fairly robust to noise in the training dataset.

MODELING: MODEL TRANSFERABILITY

We did experiments to test the transferability of models trained on one location (Freetown) to another (Dar-es-Salaam). (Please note that we have delivered a model trained on both locations; this test is interesting because it might be predictive of how the model we delivered will generalize to different places and/or seasons.) For this, we validated some models trained on Freetown imagery on Dar-es-Salaam imagery.

The Dar-es-Salaam data was restricted to validated labels only, comprising 440 images in total. The models used for this included (from Table B.3) the RGB-with-augmentation model, the 8-band model, and an ensemble of the two, all trained on Freetown imagery only. The results are shown in Figure B.9.

Figure B.9: Comparison of metrics of different models trained on Freetown and evaluated on Dar-es-Salaam.

These results show poor generalization for the baseline model. This is unsurprising, since the statistics used for the baseline model were derived from a single location that was entirely different from the area under test. The various deep learning models exhibit fairly respectable recalls, depressed precisions, and therefore somewhat depressed F1 scores. The ensemble model is again able to eke out a little more performance than the individual models. The output predictions of the ensemble model (Figure B.10 and Figure B.11), however, look much better than the numbers would suggest, indicating that the restricted validation set we used is probably not representative of the city as a whole. This qualitative observation, taken together with the significant geographical distance between the two cities, indicates fairly decent transferability of this model.
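To make the relationship between these metrics concrete, the following sketch computes pixel-wise precision, recall, and F1 for the tree canopy class from a predicted mask and a reference mask; as observed above, a depressed precision pulls the F1 score down even when recall is high.

```python
import numpy as np

def precision_recall_f1(pred, target):
    """Pixel-wise metrics for the tree canopy class.

    `pred` and `target` are boolean arrays of the same shape, True where a
    pixel is (predicted / labeled as) tree canopy.
    """
    tp = np.sum(pred & target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    precision = tp / (tp + fp)  # fraction of positive predictions that are correct
    recall = tp / (tp + fn)     # fraction of labeled canopy that was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```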
Figure B.10: Tree canopy predictions on Dar-es-Salaam of a model that was only trained on Freetown. Darker green indicates higher probability of tree canopy.

Figure B.11: Comparison of the human-expert-defined canopy boundary (top) with the model predictions (bottom). The pink region represents areas with little or no trees. The bright green regions represent pixels predicted by the model to be part of the tree canopy.

Some failure modes of this model (Figure B.12) include failing to detect trees that are under particularly dark shadows from clouds, and a few false positives out in the sea. The former may be mitigated by using satellite images taken on clear, sunny days, while the latter can be mitigated by restricting the AOI to land only.

Figure B.12: (a) Trees (circled in red) in shadow that were missed by the model. (b) False positives in the sea. Darker green indicates higher probability of tree canopy.

PILOT INDICATORS

MAPPING

The Students were able to map the equivalent of tree canopy over ~42.75 km² of Freetown after a total of 2,250 hours of mapping (discussed below). CloudFactory was able to map the equivalent of tree canopy over ~27.37 km² of Freetown and ~3.12 km² of Dar-es-Salaam after a total of 827.5 hours of mapping (also discussed below).

EMPLOYMENT

The Azavea team is not aware of the specific employment terms that were used with the Student labelers for this project, but based on our understanding of the work placement requirements, we assumed each Student labeled for up to three hours per day for fifteen days (45 hours), for a collective total of 2,250 hours. Anecdotally, it seems unlikely that each Student spent 45 hours labeling; however, if we assume that was the case, the average number of tasks labeled per Student was 2.08 per hour, the median was 1.48, the minimum was 0.38, and the maximum was 14.16 (one individual labeled over 600 tasks).

The validation labels provided by CloudFactory were the result of work performed by a rotating team of eleven (three females and eight males) working remotely in Nepal. One of the main reasons Azavea works with CloudFactory is their goal to provide a living wage to remote workers in Nepal and Kenya. Their Workforce Strategy can be found at https://www.cloudfactory.com/managed-workforce-strategy. The CloudFactory team spent 827.5 hours labeling tree canopy, with a median of 5.5 tasks per hour (for each individual). This work was performed as part of Azavea's long-term engagement with CloudFactory, which costs $5,000 per month and covers any client work that may be underway at any given time. CloudFactory ensures ongoing project management and provides laptops and/or desktops as well as reliable internet connectivity to each of their employees. Individual employee salaries are not available.

SKILLS DEVELOPMENT (for the Students)

In order to provide the Students with an introduction to both GroundWork and the tree canopy mapping project, the Azavea team provided a slide deck with instructions, screenshots, and short videos to illustrate the various tools within the application and specific examples of mapped trees. In addition, we created a shared Google spreadsheet to house questions that we answered asynchronously, as well as completion statistics. Azavea staff provided feedback regarding label quality at the start of the project.
A number of Students were able to label more than 100, 200, or (in one case) 600 tasks; however, Azavea is unaware of any skills-development monitoring and evaluation effort that may have been in place.

TOOLS/WORKFLOWS

The use of GroundWork was new to the Students. Beyond that, the tools and workflows we believe to be new to the program were those used to perform the modeling work, as outlined below.

RASTER VISION

We used Azavea's open source machine learning framework, Raster Vision, for generating training data and for training the models. Raster Vision's data generation works by splitting a large geospatial scene into small rasters using a sliding window. These rasters can then be treated like any other image dataset in the machine learning process. For model building and training, Raster Vision internally uses PyTorch and TorchVision. It is able to make use of models provided by TorchVision as well as any external models written in PyTorch. For data augmentation, Raster Vision utilizes the Albumentations library.

MODELS

The DeepLab models (Chen et al., 2017) used were made from TorchVision's implementation. The Panoptic FPN models used were from a custom implementation of the Panoptic Feature Pyramid Networks described in Kirillov et al. (2019); the ResNet (He et al., 2016) backbone was made using TorchVision's implementation. For handling more than 3 channels, we used a custom implementation adapted from the fusion technique described in FuseNet (Hazirbas et al., 2016).

ENSEMBLE PREDICTION

The ensemble predictions were made using a custom script adapted from Raster Vision. It involved averaging the outputs of the individual models after converting them into probabilities using softmax.

REPLICATION & SCALING

These results can be replicated and scaled easily. While the labels provided by CloudFactory resulted in more accurate models, the labels provided by the Students resulted in what could be considered an acceptable model. The ensemble model, as mentioned above, may be somewhat transferable and may perform reasonably well when applied to other, similar geographic areas (during similar seasons); however, to ensure the best results, labels from the target geographic areas should be used to retrain the model.

COSTING FUTURE INITIATIVES

Using the rough numbers of mapped area cited above, we can estimate that the Students would require roughly 526 hours to map 10 km² of tree canopy, whereas CloudFactory could map the same area in about 271 hours.
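These figures follow directly from the mapping rates reported in the Pilot Indicators section; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the hour estimates above.
student_rate = 42.75 / 2250          # ~0.019 km² labeled per hour by the Students
cf_rate = (27.37 + 3.12) / 827.5     # ~0.037 km² labeled per hour by CloudFactory

print(10 / student_rate)  # ~526 hours for the Students to label 10 km²
print(10 / cf_rate)       # ~271 hours for CloudFactory to label 10 km²
```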
As mentioned previously, label quality is more important than quantity. As professional labelers, CloudFactory can provide high-quality labels in a shorter amount of time; however, given the aims of the digital jobs program, it is reasonable to say that with a longer, more intensive training period, student labels are likely to increase in quality and would likely produce a viable model.

Although Azavea cannot speak to the projected costs of managing a student labeling cohort, we can certainly provide insight into the cost of replicating the study. This particular pilot project requires very little technical staff since it uses a publicly available labeling application, GroundWork. If a similar number of students were to participate over a prolonged period of time, the World Bank would need a GroundWork Pro subscription at $10,000/year. If individual projects were to take place over limited periods of time, it might be best to engage Azavea in a custom contract, as was the case for this pilot project. Finally, while the application includes a tutorial upon first login, if the World Bank would also like to replicate the level of training and troubleshooting, and/or labeling by a professional team to use as validation, a subsequent contract with Azavea and/or CloudFactory would be required.

It is our hope that, with the model results and documentation provided, World Bank staff will be able to run the model on any new labeled imagery with relative ease, noting of course the model transferability points mentioned above. That said, Azavea would be happy to retrain one or more models on additional imagery. Each engagement would need to be scoped and priced individually; however, we would anticipate being able to apply lessons learned from this pilot and spend more time on model refinement than model development.

DIGITAL WORKER ENGAGEMENT SURVEY

The Students were managed by university staff, and beyond answering questions and providing completion metrics and guidance regarding label quality, the Azavea team did not interact with the Students. Anecdotally, we heard from the World Bank team that the Students enjoyed the project a great deal and that it was considered one of the better placements for the term.

RESULTS SUMMARY

After a labeling period leveraging GroundWork, we produced deep-learning models based on the ResNet-18 backbone from both Student-produced and CloudFactory label data. When compared to baseline statistical methods, we found that deep-learning models outperform simple models substantially, both in terms of objective validation metrics and in terms of subjective quality. Our experiments show that deep-learning models derived from CloudFactory training data are objectively better than the Student-derived models (based on validation scores), but subjectively there is very little difference between the two. Finally, we found that, in our experiments, training labels from a small percentage of a scene were sufficient to produce a model that performs well on that scene. (The sufficient percentage was found to be approximately 10%, as seen in Figures B.5 and B.6 and the surrounding discussion.)

Based on the discussion in the "Modeling: Effect of Data Size" and "Modeling: Effect of Label Quality" sections, some of the key findings are that:
● We do not need a very large quantity of data to get good models;
● It is possible to get good models even when the labeling quality is not very high; and
● Having high-quality labels helps in getting a more accurate measure of model performance and is therefore useful in validation for comparing different models.

Although we found that good performance can be achieved on a scene with only a small sample of labeled data from that scene, we caution against concluding that a small amount of labeled imagery is sufficient to produce a model that does well on any scene. The nice generalization of models trained only on Freetown to inference on Dar-es-Salaam is a hopeful sign, but our prior experience urges caution. One should not expect the models presented here (or any models produced with the training data used here) to generalize to any superficially similar scene; diversity of the location and season of training scenes is a precondition for that hope. In short, the performance of the model will be directly affected by both the use of imagery specific to the AOI and the quality of the labels used to train the model.
We would recommend that any replication of this pilot project include the procurement of additional location- and season-specific imagery and a period of labeling (with a longer training window than we had here, if using students), followed by model retraining and refinement.

Bibliography

Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv. https://arxiv.org/abs/1706.05587

Hazirbas, C., Ma, L., Domokos, C., & Cremers, D. (2016). FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. ACCV 2016. https://doi.org/10.1007/978-3-319-54181-5_14

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. https://doi.org/10.1109/CVPR.2016.90

Kingma, D., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. arXiv. https://arxiv.org/abs/1412.6980

Kirillov, A., Girshick, R., He, K., & Dollár, P. (2019). Panoptic Feature Pyramid Networks. CVPR 2019. https://doi.org/10.1109/CVPR.2019.00656

Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep Learning is Robust to Massive Label Noise. arXiv. https://arxiv.org/abs/1705.10694

Smith, L., & Topin, N. (2019). Super-convergence: very fast training of neural networks using large learning rates. Defense + Commercial Sensing 2019. https://doi.org/10.1117/12.2520589