FINAL REPORT: Digital Jobs for Urban Resilience
Azavea Tree Canopy Analysis

OVERVIEW

The World Bank GFDRR team engaged Azavea to aid them in the creation of a machine learning algorithm that would enable them to predict tree canopy cover in two African cities: Freetown, Sierra Leone and Dar-es-Salaam, Tanzania. As part of the effort, a team of students at a local university participated in a limited labeling campaign using Azavea's labeling application, GroundWork. Simultaneously, Azavea's third-party vendor and labeling partner, CloudFactory, labeled the same imagery. Both labeled datasets were used to produce, test, or validate the tree canopy model that Azavea developed. Overall, we feel the pilot project was successful, produced good results, and could be replicated and scaled with relative ease.

METHODOLOGY

GROUNDWORK

To facilitate labeling, several projects for both the Freetown and Dar-es-Salaam areas of interest (AOIs) were created in Azavea's geospatial labeling application, GroundWork. The Students focused almost exclusively on one such project for the Freetown AOI. Within the project, the tasks (a task is defined in GroundWork as a specific piece of the image available for labeling) were assigned randomly, as is the case with all GroundWork projects. Labelers were allowed to skip tasks or flag them as not fit for labeling.

At the beginning of the labeling process, Azavea staff provided a series of materials to aid the Students, most notably a PowerPoint presentation with both example images and explicit instructions. Once the labelers began, Azavea staff reviewed a sampling of tasks, aiming to look at at least one from each labeler, in order to assess quality and provide constructive feedback. In general, we found that the Students were not labeling the entire task at the start, but they remedied this after receiving the feedback. Over the course of the three weeks, fifty Students labeled 4,681 tasks.

DATA

The modeling process and all the experiments were based on the following data sources:
● 8-band WorldView-2 satellite imagery for Freetown (acquired January 2020; 21,707 by 25,132 pixels covering approximately 117.12 square kilometers) and Dar-es-Salaam (acquired December 2019; 18,130 by 28,364 pixels covering approximately 78.94 square kilometers).
● CloudFactory (CF) labels for Freetown and Dar-es-Salaam (please see Figures F.1 and F.2 for the label and validated-label footprints).
● Student labels for Freetown (the image was almost entirely labeled, but the labels were unvalidated).

Because Freetown was the only city for which we had both sets of labels (Student and CF), and because labeling for Dar-es-Salaam started only after Freetown was finished, most of the experiments discussed below were performed on Freetown imagery only.

It should be noted that, because the labeling tasks available in GroundWork were completed at different rates for the Student labels (97%) and the CF labels (~50%), the number of images in the Student-labeled and the CF-labeled data differed. This turned out to be unimportant, as discussed in the "Modeling: Effect of Data Size" section below.

Figure F.1: The Freetown imagery (left) with the footprint of CloudFactory labels (middle) and the footprint of the subset of CloudFactory labels that were validated (or double-checked) (right). Note that these images do not show the labels themselves; they show the areas that have been labeled.
Figure F.2: The Dar-es-Salaam imagery (left) with the footprint of CloudFactory labels (middle) and the footprint of validated CloudFactory labels (right).

DATA: TRAINING AND VALIDATION SPLITS

The Freetown imagery was split such that the left 80% of the city became the training set while the right 20% became the validation set.[1][2] This was done to ensure a clean separation and to prevent data leakage between the training and validation sets. The models would therefore be validated on a part of the city that they had not seen at all during training, which provides a better measure of their performance.

[1] We completed k-fold validation experiments using the other four possible 20%-wide slices and obtained similar F1 scores for detecting tree canopy using models trained on validated CF labels (0.945 when the right-most 20% is the validation set and 0.948 when all five possible 20% vertical slices are taken in turn as the validation set and the results are averaged).
[2] For purposes of the analysis in this document, it is possible to consider the two partitions of Freetown to respectively be the training and validation splits, with the whole of Dar-es-Salaam the test split (please see the section on Model Transferability).

MODELING

MODELING: OBJECTIVES

The main objectives of the modeling process were understood to be the following:
1. Obtain a "good" model using some combination of imagery and labels;
   ○ We identified the Tree Canopy F1 score[3] to be the most important metric for measuring model performance;
2. Evaluate how the model performance changes with the amount of data it is trained on;
3. Compare the quality of Student labels with that of CF labels; and
4. Evaluate the transferability of a model trained on one city to another city.

[3] We chose F1 score as our metric for evaluating segmentation quality. This is a standard metric for measuring segmentation quality because it achieves its maximum when both recall (percentage of true positive pixels found) and precision (percentage of positive responses which are true positives) are maximized, and it penalizes failure in either of those areas.

MODELING: ESTABLISHING A BASELINE

We first sought to establish baseline results (results produced by a minimal model that should be surpassed by any sophisticated model) to measure the subjective difficulty of the inference problem and to provide a point of comparison for our progress.

BASELINE: THRESHOLDING

This baseline model performs simple thresholding of pixel values. Thresholding of normalized indices is a common baseline technique in GIS, but for our tests we strengthened the baseline by building it directly from the image statistics. We noticed that the respective distributions of tree canopy and background pixel values (Figure B.0) differed significantly and that the distribution of tree pixels was fairly narrow. This indicated that it might be possible to go a long way in identifying trees by simply classifying as tree pixels all pixels that are close enough to the mean of the tree distribution (say, within 3 standard deviations). This can be seen as a simple form of the fundamental technique of thresholding in image segmentation, which involves separating a foreground class (in this case, trees) from a background class by finding a threshold that separates their respective pixel value distributions.

Figure B.0: Histograms showing the distribution of pixel values for the red, green, and blue channels for trees and background. The dashed lines represent the respective means. These plots are based on 50 randomly sampled images from the training set.

To summarize: we calculated the mean and standard deviation of the intensities of the red, green, and blue bands for pixels labeled as "Tree Canopy" and for pixels labeled as "Background". All pixels whose red, green, and blue values were all less than three standard deviations from the respective channel means for "Tree Canopy" were given that label; all others were labeled as "Background".

On the CloudFactory labels, this procedure produced an overall F1 score of 0.873, a Background F1 score of 0.789, and a Tree Canopy F1 score of 0.905. On the Student-produced labels, this procedure produced an overall F1 score of 0.736, a Background F1 score of 0.482, and a Tree Canopy F1 score of 0.780.
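As a minimal illustration of this procedure, the following NumPy sketch fits the per-channel tree-canopy statistics and applies the three-standard-deviation rule; the chip and mask variable names are placeholders for illustration only, not part of any actual codebase.

```python
import numpy as np

def fit_threshold_stats(chips, masks):
    """Per-channel mean and std of R, G, B for pixels labeled as tree canopy."""
    tree_pixels = np.concatenate(
        [chip[mask] for chip, mask in zip(chips, masks)]  # rows of (R, G, B) values
    )
    return tree_pixels.mean(axis=0), tree_pixels.std(axis=0)

def predict_tree_canopy(chip, mean, std, k=3.0):
    """Label a pixel as tree canopy if all three channels lie within
    k standard deviations of the tree-canopy channel means."""
    return np.all(np.abs(chip - mean) < k * std, axis=-1)

# Usage sketch: `train_chips` are (H, W, 3) RGB arrays and `train_masks` are
# boolean (H, W) tree-canopy masks derived from the labels.
# mean, std = fit_threshold_stats(train_chips, train_masks)
# prediction = predict_tree_canopy(validation_chip, mean, std)
```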
Generally speaking, this approach does a fair job of detecting actual tree canopy, but it does not do a good job of distinguishing between tree canopy and other types of foliage. Some examples are below.

Figure B.1: Baseline results produced from Student-made labels. Note that the "Ground truth labels" in the top row show an example of imperfect labeling. The sharp edges likely correspond to task boundaries in GroundWork, where each task might have been labeled by a different labeler.

Figure B.2: Baseline results produced from CF labels.

The predictions shown in Figures B.1 and B.2 reinforce the earlier speculation that the baseline technique does not do a good job of discriminating between tree canopy and other types of foliage. The baseline technique also tends to produce many false positive and false negative pixels (i.e., the dark dots in the yellow regions and vice versa in Figures B.1 and B.2), mainly because it does not take the pixel's spatial context into account. (Also notice that a comparison of the columns labeled "Ground truth labels"[4] in Figures B.1 and B.2 anecdotally shows the difference in quality between the Student-produced and CloudFactory-produced labels.)

[4] Here, the term "ground truth labels" refers to labels that have been generated by human inspection of the imagery, not to data that were generated by physically visiting the site.

BASELINE: NORMALIZED INDICES

As an extension of the baseline work described above, we experimented with a shallow (that is, not deep) architecture that mimics the structure of normalized indices while being easily tunable from data. The purpose of this experiment was to provide an upper bound on the expected performance of a typical normalized index, such as NDVI, on this problem. We took this approach rather than testing NDVI directly because it is not clear that NDVI is the optimal choice among all possible normalized indices.
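The report does not specify the exact form of this shallow architecture, so the following is only a plausible sketch: a normalized index such as NDVI = (NIR - Red) / (NIR + Red) generalized so that the two "bands" are learnable linear combinations of all eight input bands, followed by a per-pixel classifier.

```python
import torch
import torch.nn as nn

class LearnableNormalizedIndex(nn.Module):
    """A shallow, tunable analogue of a normalized index such as NDVI.

    NDVI fixes a = NIR and b = Red and computes (a - b) / (a + b); here a and b
    are learnable linear combinations of all input bands instead. This is one
    plausible parameterization, not necessarily the architecture used in the
    experiments reported above.
    """

    def __init__(self, num_bands: int = 8, eps: float = 1e-6):
        super().__init__()
        self.a = nn.Conv2d(num_bands, 1, kernel_size=1)  # learned "numerator" band
        self.b = nn.Conv2d(num_bands, 1, kernel_size=1)  # learned "denominator" band
        self.classify = nn.Conv2d(1, 2, kernel_size=1)   # index -> {background, tree} logits
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.a(x), self.b(x)
        index = (a - b) / (a + b + self.eps)  # normalized-index structure
        return self.classify(index)           # per-pixel class logits
```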
Using all 8 bands available in the imagery for input and using validated CloudFactory labels on Freetown with an 80/20 training/validation split, this approach produced the following results: an average F1 score of 0.891, a Background F1 score of 0.836, and a Tree Canopy F1 score of 0.917. Using all 8 bands available in the imagery for input and using validated CloudFactory labels on both Freetown and Dar-es-Salaam with an 80/20 training/validation split, this approach produced an average F1 score of 0.846, a Background F1 score of 0.827, and a Tree Canopy F1 score of 0.896.

The similarity of the performance of this model to that of the baseline indicates that one can reasonably expect to achieve a Tree Canopy F1 score in the neighborhood of 0.9 by relying solely on spectral properties, but (as will be discussed later) some higher-level vision capability is seemingly needed to go beyond that.

MODELING: DEEP LEARNING

DEEP LEARNING: MODEL ARCHITECTURES

We used two different model architectures in our experiments:
1. Panoptic FPN with a ResNet-18 backbone (Kirillov et al., 2019), and
2. DeepLab V3 with a ResNet-50 backbone (Chen et al., 2017).

We found Panoptic FPN models to be computationally much more efficient than DeepLab models while producing the same quality of results, so Panoptic FPN is what we chose to use for the majority of our experiments.

DEEP LEARNING: TRAINING

All models used were pre-trained on ImageNet. The models were trained via Azavea's open source machine learning library, Raster Vision, for 20 epochs, using a batch size of 8, a learning rate of 0.0001 with a 1-cycle schedule (Smith & Topin, 2019), and the Adam optimizer (Kingma & Ba, 2015).

DEEP LEARNING: INITIAL RESULTS

Initial results from training on RGB imagery produced a significant improvement over the baseline results (Table B.1). The Student models seemed to do poorly in comparison to the CF models in terms of scores, but visual inspection of the model outputs showed them to be of similar quality to those of the CF models (Figure B.8). This suggested the possibility that the validation sets for both the Student and CF datasets contained poorly labeled images, which were causing the validation scores to suffer even though the models were producing good results.

  Model          Bands   Tree Canopy F1 (CF labels)   Tree Canopy F1 (Student labels)
  Panoptic FPN   RGB     0.932 ± 0.003                0.873 ± 0.002
  DeepLab        RGB     0.930 ± 0.000                0.872 ± 0.001

Table B.1: Initial model validation results. The numbers in this and subsequent tables are averages of the individual scores from 4 parallel training runs of the same model; the error margins represent 95% confidence intervals. Both are rounded to 3 significant figures.

DEEP LEARNING: MORE RELIABLE MODEL VALIDATION

To get a more reliable measure of model performance, we re-computed these scores for both the Student and CF models on only a subset of CF-labeled images (in the validation set), formed by keeping only those images whose labeling had been validated by expert labelers and also manually discarding some other instances of bad labeling. This subset consisted of 538 images in total. This restricted validation set was used to compute all the scores reported in subsequent tables.

  Model          Bands   Trained on       Tree Canopy F1 (on validated CF labels)
  Baseline       RGB     CF labels        0.913
  Panoptic FPN   RGB     CF labels        0.957 ± 0.001
  Baseline       RGB     Student labels   0.854
  Panoptic FPN   RGB     Student labels   0.931 ± 0.000

Table B.2: Validation results on the more authoritative subset of CF labels.

The scores on this more authoritative validation set lead us to conclude that:
● Student models are much better than the original Student-label validation set suggested (see the section "Modeling: Effect of Label Quality" for more discussion), and
● CF models are objectively a little better than Student models, highlighting the importance of having more accurate labels in the training data.
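For orientation, here is a minimal, generic PyTorch sketch of the training configuration described in the Training subsection above (Adam, a 1-cycle learning rate schedule peaking at 0.0001, 20 epochs, batch size 8). The actual experiments were configured through Raster Vision rather than a hand-written loop, and the model and dataset objects below are placeholders.

```python
import torch
from torch.utils.data import DataLoader

def train(model, train_dataset, epochs=20, batch_size=8, lr=1e-4, device="cuda"):
    """Hedged sketch of the training configuration; not the Raster Vision internals."""
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    criterion = torch.nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    # 1-cycle learning rate schedule (Smith & Topin, 2019), stepped once per batch.
    scheduler = torch.optim.lr_scheduler.OneCycleLR(
        optimizer, max_lr=lr, epochs=epochs, steps_per_epoch=len(loader))

    model.to(device).train()
    for epoch in range(epochs):
        for chips, masks in loader:  # chips: (B, C, H, W) floats; masks: (B, H, W) class ids
            optimizer.zero_grad()
            loss = criterion(model(chips.to(device)), masks.to(device))
            loss.backward()
            optimizer.step()
            scheduler.step()
```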
DEEP LEARNING: MAKING USE OF OTHER BANDS

With a reliable validation set established, we focused on approaches to improving the results. One of the things we tried was to make use of the remaining bands (other than RGB) that were available in the 8-band imagery. Another promising possibility was making use of computed bands, such as the Normalized Difference Vegetation Index (NDVI), that are specialized for detecting vegetation. To this end, we trained a model that made use of all 8 bands and another model that worked on just the NDVI band. See Table B.3 for results.

For the 8-band model, we used a custom implementation adapted from the fusion technique described in the FuseNet paper (Hazirbas et al., 2016). This involves adding a new backbone (that takes in 5-channel inputs) to the neural network, parallel to the existing backbone (that takes in 3-channel inputs), with connections at multiple points between the two backbones. The new backbone was initialized with pretrained weights. For the NDVI model, we replaced the first convolutional layer of the backbone, which normally takes in 3-channel inputs, with one that takes in 1-channel inputs. The pretrained weights were retained.

Figure B.3: Sample input (first 3 columns) and output of the 8-band model. The bands shown in this image have been split into multiple sub-images for visualization purposes only.

Figure B.4: Sample input and output of the NDVI model.

DEEP LEARNING: DATA AUGMENTATION

Another attempt involved training an RGB model with heavy color-based data augmentation. This has the effect of jittering the colors of the image, forcing the model not to rely too much on color and instead to learn to recognize other properties of trees, such as texture. See Table B.3 for results.

DEEP LEARNING: MODEL ENSEMBLE

Upon closer inspection of the individual predictions of the RGB, NDVI, and 8-band models, we noticed that their predictions differed significantly in some regions, such that one or two of them would get it right even if the others failed. This suggested the possibility of ensembling the models together so that their predictions complement each other. See Table B.3 for results.

DEEP LEARNING: FINAL RESULTS

  Model                         Bands                Tree Canopy F1 (validated CF labels)
  Panoptic FPN                  RGB                  0.957 ± 0.001
  Panoptic FPN                  NDVI                 0.957 ± 0.001
  Panoptic FPN                  8 bands              0.956 ± 0.001
  Panoptic FPN + augmentation   RGB                  0.959 ± 0.001
  Ensemble: Panoptic FPN x3     RGB, NDVI, 8 bands   0.961

Table B.3: Results from all the approaches described above.

From this we can conclude the following:
● The NDVI band on its own is informative enough to produce a competitive model.
● Using all 8 bands does not get us any improvement over just using RGB or NDVI (which is based on Red and InfraRed). This suggests that, given that you already have the Red and IR bands, there is little to no additional information (relevant to this task) that can be learned from the remaining bands, at least within the constraints of this dataset, model architecture, and training procedure.
● Color-based data augmentation is helpful for this problem.
● Ensembling the models that work on different bands provides a small performance boost (small in magnitude, but outside the margin of error, as seen in Table B.3); a sketch of the averaging approach follows this list.
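As referenced in the final bullet above, the ensembling amounts to averaging the per-pixel softmax probabilities of the individual models (see also the Ensemble Prediction subsection under Tools/Workflows). A minimal sketch, assuming each model produces per-pixel class logits:

```python
import torch
import torch.nn.functional as F

def ensemble_predict(models, chip):
    """Average softmax probabilities across models, then take the argmax.

    `models` is a list of trained segmentation models (e.g. the RGB, NDVI, and
    8-band models); `chip` is a (1, C, H, W) input tensor. If the models expect
    different input bands, each would be fed its own view of the chip; that
    detail is omitted here for brevity.
    """
    with torch.no_grad():
        probs = [F.softmax(m(chip), dim=1) for m in models]  # (1, classes, H, W) each
    mean_probs = torch.stack(probs).mean(dim=0)
    return mean_probs.argmax(dim=1)  # (1, H, W) predicted class per pixel
```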
MODELING: EFFECT OF DATA SIZE

Using the model configurations in Table B.2 (i.e., RGB Panoptic FPN models trained on Freetown CF and Student labels), we performed experiments to measure the relationship between the quantity of training data and model quality. We built models with CloudFactory labels and separately with Student-made labels. The CloudFactory-derived models were validated on CloudFactory labels and the Student-derived models were validated on Student labels. Two figures (for CloudFactory labels and Student labels, respectively) are below.

Figure B.5: Performance metrics vs. percentage of training data used (CF labels). In each subplot, the solid blue line is the average of the lines from individual runs (shown in light gray) of the same training configuration. The error bars represent 95% confidence intervals.

Figure B.6: Performance metrics vs. percentage of training data used (Student labels). In each subplot, the solid blue line is the average of the lines from individual runs (shown in light gray) of the same training configuration. The error bars represent 95% confidence intervals.

In the case of both CloudFactory labels and Student labels, it can be seen that near-final performance is achieved when 10% of the available training data is used. Trends in precision, recall, and F1 after the 10% mark, for both sets of labels, are minimal and possibly within the noise when the scale of the graphs is considered. This behavior is consistent with the hypothesis that we gave earlier: it seems to be possible to reliably detect tree canopy with mostly pixel-level statistics (which can be reliably established with a small subset of the data) combined with a small amount of higher-level vision; one can speculate that the higher-level vision component is probably organized around the difference in texture between tree canopy and other foliage, and that that difference can be learned with relatively little data.

These results indicate that a model can internalize the particulars of a scene with relatively little training data. With that said, we caution that one should not conclude that fractional labels for one scene are sufficient to produce a high-quality model that works everywhere. It is highly likely that labels from a variety of locations (and possibly a variety of seasons, if seasonal robustness is desired) will be required for that purpose, albeit with a relatively small amount of label data for each scene.
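For readers wishing to reproduce this experiment, a hedged sketch of the setup is to subsample the training chips at increasing fractions and train a fresh model at each fraction; the chip list and training function named below are placeholders, not part of any actual codebase.

```python
import random

def subsample(chips, fraction, seed=0):
    """Keep a random `fraction` of the available training chips."""
    rng = random.Random(seed)
    k = max(1, round(len(chips) * fraction))
    return rng.sample(chips, k)

# Placeholder names: `all_training_chips` and `train_and_score` are assumed to
# exist; the latter trains a model and returns its validation Tree Canopy F1.
# results = {
#     frac: train_and_score(subsample(all_training_chips, frac))
#     for frac in (0.05, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0)
# }
```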
MODELING: EFFECT OF LABEL QUALITY

We have shown instances above where models made with the Student-made labels seem to underperform the CloudFactory ones. In this section we make that more explicit. The two figures below relate to the CloudFactory labels and the Student labels, respectively. The left subfigure of each figure shows the training loss dropping with each subsequent training epoch, as expected. The right subfigures show the loss on the validation set in each subsequent training epoch. The validation loss of the CloudFactory model is trendless and relatively low, while the Student model has a higher validation loss that trends upward.

Figure B.7: Training and validation loss profiles during training for (a) CF labels and (b) Student labels. In each subplot, the solid blue line is the average of the lines from individual runs (shown in light gray) of the same training configuration. The error bars represent 95% confidence intervals.

The early overfitting[5] seen when using Student labels is consistent with the Student labels containing less information (being more "random") than the CloudFactory labels.[6] In such a scenario the training process would drive the model to memorize more and more of the particular peculiarities of the training data rather than useful patterns that can also be used to understand the validation data. As training proceeds, the training loss decreases, but that is due to more and more sterile memorization rather than learning. Despite this, the output of models trained on Student labels, shown in Figure B.8, still looks good visually, indicating that the low scores have more to do with the quality of the labels in the validation set than with the model itself.

[5] Experiments using all five possible 20% vertical slices as validation sets on Student-produced labels found that the early overfitting is very evident when the two rightmost 20% slices are used as validation sets and becomes less evident as one moves to the left. The phenomenon is most evident when the rightmost 20% slice is used as the validation set.
[6] This is an empirical conclusion based on analysis of validation sets that contain larger percentages of tree canopy.

Figure B.8: Sample output of a model trained on Student labels. The results look good despite the low validation scores.

The qualitative evidence from Figure B.8, together with the quantitative results in Table B.2, shows that these models trained on imperfect labels are still able to recognize trees relatively well. This is in line with prior research (Rolnick et al., 2017), and with our prior experience, indicating that deep learning models tend to be fairly robust to noise in the training dataset.

MODELING: MODEL TRANSFERABILITY

We did experiments to test the transferability of models trained on one location (Freetown) to another (Dar-es-Salaam). (Please note that we have delivered a model trained on both locations; this test is interesting because it might be predictive of how the model we delivered will generalize to different places and/or seasons.) For this, we validated some models trained on Freetown imagery on Dar-es-Salaam imagery.

The Dar-es-Salaam data was restricted to validated labels only, comprising 440 images in total. The models used for this included (from Table B.3) the RGB-with-augmentation model, the 8-band model, and an ensemble of the two, all trained on Freetown imagery only. The results are shown in Figure B.9.

Figure B.9: Comparison of metrics of different models trained on Freetown and evaluated on Dar-es-Salaam.

These results show poor generalization for the baseline model. This is unsurprising, since the statistics used for the baseline model were derived from a single location that was entirely different from the area under test. The various deep learning models exhibit fairly respectable recalls, depressed precisions, and therefore somewhat depressed F1 scores. The ensemble model is again able to eke out a little more performance than the individual models. The output predictions of the ensemble model (Figure B.10 and Figure B.11), however, look much better than the numbers would suggest, indicating that the restricted validation set we used is probably not representative of the city as a whole. This qualitative observation, taken together with the significant geographical distance between the two cities, indicates fairly decent transferability of this model.
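To make the relationship between these metrics concrete, the following sketch computes pixel-wise precision, recall, and F1 for the tree canopy class from a predicted mask and a reference mask; as observed above, a depressed precision pulls the F1 score down even when recall is high.

```python
import numpy as np

def precision_recall_f1(pred, target):
    """Pixel-wise metrics for the tree canopy class.

    `pred` and `target` are boolean arrays of the same shape, True where a
    pixel is (predicted / labeled as) tree canopy.
    """
    tp = np.sum(pred & target)
    fp = np.sum(pred & ~target)
    fn = np.sum(~pred & target)
    precision = tp / (tp + fp)  # fraction of positive predictions that are correct
    recall = tp / (tp + fn)     # fraction of labeled canopy that was found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```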
Figure B.10: Tree canopy predictions on Dar-es-Salaam of a model that was only trained on Freetown. Darker green indicates higher probability of tree canopy.

Figure B.11: Comparison of the human-expert-defined canopy boundary (top) with the model predictions (bottom). The pink region represents areas with little or no trees. The bright green regions represent pixels predicted by the model to be part of the tree canopy.

Some failure modes of this model (Figure B.12) include failing to detect trees that are under particularly dark shadows from clouds, and a few false positives out in the sea. The former may be mitigated by using satellite images taken on clear, sunny days, while the latter can be mitigated by restricting the AOI to land only.

Figure B.12: (a) Trees (circled in red) in shadow that were missed by the model. (b) False positives in the sea. Darker green indicates higher probability of tree canopy.

PILOT INDICATORS

MAPPING

The Students were able to map the equivalent of tree canopy over ~42.75 km² of Freetown after a total of 2,250 hours of mapping (discussed below). CloudFactory was able to map the equivalent of tree canopy over ~27.37 km² of Freetown and ~3.12 km² of Dar-es-Salaam after a total of 827.5 hours of mapping (also discussed below).

EMPLOYMENT

The Azavea team is not aware of the specific employment terms that were used with the Student labelers for this project, but based on our understanding of the work placement requirements, we assumed each Student labeled for up to three hours per day for fifteen days (45 hours), for a collective total of 2,250 hours. Anecdotally, it seems unlikely that each Student spent 45 hours labeling; however, if we assume that was the case, the average number of tasks labeled per Student was 2.08 per hour, the median was 1.48, the minimum was 0.38, and the maximum was 14.16 (one individual labeled over 600 tasks).

The validation labels provided by CloudFactory were the result of work performed by a rotating team of eleven (three females and eight males) working remotely in Nepal. One of the main reasons Azavea works with CloudFactory is their goal to provide a living wage to remote workers in Nepal and Kenya. Their Workforce Strategy can be found at https://www.cloudfactory.com/managed-workforce-strategy. The CloudFactory team spent 827.5 hours labeling tree canopy, with a median of 5.5 tasks per hour (for each individual). This work was performed as part of Azavea's long-term engagement with CloudFactory, which costs $5,000 per month and covers any client work that may be underway at any given time. CloudFactory ensures ongoing project management and provides laptops and/or desktops as well as reliable internet connectivity to each of their employees. Individual employee salaries are not available.

SKILLS DEVELOPMENT (for the Students)

In order to provide the Students with an introduction to both GroundWork and the tree canopy mapping project, the Azavea team provided a slide deck with instructions, screenshots, and short videos to illustrate the various tools within the application and specific examples of mapped trees. In addition, we created a shared Google spreadsheet to house questions that we answered asynchronously, as well as completion statistics. Azavea staff provided feedback regarding label quality at the start of the project.
A number of Students were able to label more than 100, 200, or (in one case) 600 tasks; however, Azavea is unaware of any skills-development monitoring and evaluation effort that may have been in place.

TOOLS/WORKFLOWS

The use of GroundWork was new to the Students. Beyond that, the tools and workflows we believe to be new to the program were those used to perform the modeling work, as outlined below.

RASTER VISION

We used Azavea's open source machine learning framework, Raster Vision, for generating training data and for training the models. Raster Vision's data generation works by splitting a large geospatial scene into small rasters using a sliding window. These rasters can then be treated like any other image dataset in the machine learning process. For model building and training, Raster Vision internally uses PyTorch and TorchVision. It is able to make use of models provided by TorchVision as well as any external models written in PyTorch. For data augmentation, Raster Vision utilizes the Albumentations library.

MODELS

The DeepLab models (Chen et al., 2017) used were made from TorchVision's implementation. The Panoptic FPN models used were from a custom implementation of the Panoptic Feature Pyramid Networks described in Kirillov et al. (2019); the ResNet (He et al., 2016) backbone was made using TorchVision's implementation. For handling more than 3 channels, we used a custom implementation adapted from the fusion technique described in FuseNet (Hazirbas et al., 2016).

ENSEMBLE PREDICTION

The ensemble predictions were made using a custom script adapted from Raster Vision. It involved averaging the outputs of the individual models after converting them into probabilities using softmax.

REPLICATION & SCALING

These results can be replicated and scaled easily. While the labels provided by CloudFactory resulted in more accurate models, the labels provided by the Students resulted in what could be considered an acceptable model. The ensemble model, as mentioned above, may be somewhat transferable and may perform reasonably well when applied to other, similar geographic areas (during similar seasons); however, to ensure the best results, labels from the target geographic areas should be used to retrain the model.

COSTING FUTURE INITIATIVES

Using the rough numbers of mapped area cited above, we can estimate that the Students would require roughly 526 hours to map 10 km² of tree canopy, whereas CloudFactory could map the same area in about 271 hours.
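These figures follow directly from the mapping rates reported in the Pilot Indicators section; a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the hour estimates above.
student_rate = 42.75 / 2250          # ~0.019 km² labeled per hour by the Students
cf_rate = (27.37 + 3.12) / 827.5     # ~0.037 km² labeled per hour by CloudFactory

print(10 / student_rate)  # ~526 hours for the Students to label 10 km²
print(10 / cf_rate)       # ~271 hours for CloudFactory to label 10 km²
```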
As mentioned previously, label quality is more important than quantity. As professional labelers, CloudFactory can provide high-quality labels in a shorter amount of time; however, given the aims of the digital jobs program, it is reasonable to say that with a longer, more intensive training period, student labels are likely to increase in quality and would likely produce a viable model.

Although Azavea cannot speak to the projected costs of managing a student labeling cohort, we can certainly provide insight into the cost of replicating the study. This particular pilot project requires very little technical staff since it uses a publicly available labeling application, GroundWork. If a similar number of students were to participate over a prolonged period of time, the World Bank would need a GroundWork Pro subscription at $10,000/year. If individual projects were to take place over limited periods of time, it might be best to engage Azavea in a custom contract, as was the case for this pilot project. Finally, while the application includes a tutorial upon first login, if the World Bank would also like to replicate the level of training and troubleshooting, and/or labeling by a professional team to use as validation, a subsequent contract with Azavea and/or CloudFactory would be required.

It is our hope that, with the model results and documentation provided, World Bank staff will be able to run the model on any new labeled imagery with relative ease, noting of course the model transferability points mentioned above. That said, Azavea would be happy to retrain one or more models on additional imagery. Each engagement would need to be scoped and priced individually; however, we would anticipate being able to apply lessons learned from this pilot and spend more time on model refinement than model development.

DIGITAL WORKER ENGAGEMENT SURVEY

The Students were managed by university staff, and beyond answering questions and providing completion metrics and guidance regarding label quality, the Azavea team did not interact with the Students. Anecdotally, we heard from the World Bank team that the Students enjoyed the project a great deal and that it was considered one of the better placements for the term.

RESULTS SUMMARY

After a labeling period leveraging GroundWork, we produced deep-learning models based on the ResNet-18 backbone from both Student-produced and CloudFactory label data. When compared to baseline statistical methods, we found that deep-learning models outperform simple models substantially, both in terms of objective validation metrics and in terms of subjective quality. Our experiments show that deep-learning models derived from CloudFactory training data are objectively better than the Student-derived models (based on validation scores), but subjectively there is very little difference between the two. Finally, we found that, in our experiments, training labels from a small percentage of a scene were sufficient to produce a model that performs well on that scene. (The sufficient percentage was found to be approximately 10%, as seen in Figures B.5 and B.6 and the surrounding discussion.)

Based on the discussion in the "Modeling: Effect of Data Size" and "Modeling: Effect of Label Quality" sections, some of the key findings are that:
● We do not need a very large quantity of data to get good models;
● It is possible to get good models even when the labeling quality is not very high; and
● Having high-quality labels helps in getting a more accurate measure of model performance and is therefore useful in validation for comparing different models.

Although we found that good performance can be achieved on a scene with only a small sample of labeled data from that scene, we caution against concluding that a small amount of labeled imagery is sufficient to produce a model that does well on any scene. The nice generalization of models trained only on Freetown to inference on Dar-es-Salaam is a hopeful sign, but our prior experience urges caution. One should not expect the models presented here (or any models produced with the training data used here) to generalize to any superficially similar scene; diversity of the location and season of training scenes is a precondition for that hope. In short, the performance of the model will be directly affected by both the use of imagery specific to the AOI and the quality of the labels used to train the model.
We would recommend that any replication of this pilot project include the procurement of additional location- and season-specific imagery and a period of labeling (with a longer training window than we had here, if using students), followed by model retraining and refinement.

Bibliography

Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv. https://arxiv.org/abs/1706.05587

Hazirbas, C., Ma, L., Domokos, C., & Cremers, D. (2016). FuseNet: Incorporating Depth into Semantic Segmentation via Fusion-Based CNN Architecture. ACCV 2016. https://doi.org/10.1007/978-3-319-54181-5_14

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. CVPR 2016. https://doi.org/10.1109/CVPR.2016.90

Kingma, D., & Ba, J. (2015). Adam: A Method for Stochastic Optimization. arXiv. https://arxiv.org/abs/1412.6980

Kirillov, A., Girshick, R., He, K., & Dollár, P. (2019). Panoptic Feature Pyramid Networks. CVPR 2019. https://doi.org/10.1109/CVPR.2019.00656

Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep Learning is Robust to Massive Label Noise. arXiv. https://arxiv.org/abs/1705.10694

Smith, L., & Topin, N. (2019). Super-convergence: very fast training of neural networks using large learning rates. Defense + Commercial Sensing 2019. https://doi.org/10.1117/12.2520589