Abstract:
In the City of Cape Town Metropolitan (CoCT), South Africa, GIS analysts currently delineate building footprints by manually digitizing aerial imagery and stereo-aerial images. This approach requires substantial manual work, which makes it slow, expensive, and inefficient. Recent studies have explored automatic and semi-automatic methods for extracting building footprints from remotely sensed data, which is useful for urban planning, service delivery, and humanitarian efforts. However, no readily available method automatically extracts footprints while accounting for the distinct characteristics of the landscape, such as formal residential areas, industrial zones, and informal settlements. Therefore, the main goal of this research was to identify a suitable and efficient spatial analysis method that accurately extracts building footprints of different sizes and shapes within the City of Cape Town, using high-resolution aerial imagery and a LiDAR-derived normalized Digital Surface Model (nDSM). To achieve this goal, a literature review was conducted to explore building footprint extraction algorithms. The review identified the Mask Region-based Convolutional Neural Network (Mask R-CNN) as an effective algorithm for instance segmentation and object extraction. An experiment was then conducted in which Mask R-CNN models were trained to extract building footprints from aerial imagery and from LiDAR-derived nDSM for each of three area types: formal residential, industrial, and informal settlements. Training focused on the Blaauwberg district, which includes all three area types. Each trained model was tested separately on test datasets for formal residential areas, industrial areas, and informal settlements.
Precision, recall, F1-score, and Average Precision (AP) were calculated for each model to assess its performance in extracting building footprints from aerial imagery and LiDAR-derived nDSM in each area type. In formal residential areas, the Mask R-CNN algorithm proved very effective at extracting building footprints from both high-resolution aerial imagery and LiDAR-derived nDSM, achieving satisfactory precision, recall, F1-score, and AP. In industrial areas, it was highly effective at extracting footprints from LiDAR-derived nDSM. However, when extracting shacks in densely populated informal settlements, it performed inadequately, with AP scores of 0.28 from aerial imagery and 0.31 from LiDAR-derived nDSM. Fusing the footprints extracted from the two data sources improved the AP score to 0.52. This study therefore concludes that the Mask R-CNN algorithm is highly effective at extracting building footprints in formal residential areas from both aerial imagery and LiDAR-derived nDSM, and at extracting industrial building footprints from LiDAR-derived nDSM, whereas informal settlements require the fusion of footprints extracted from both sources for best performance. Overall, the trained Mask R-CNN models demonstrated satisfactory performance. They can be used to supplement the existing 2D building footprint layer with newly extracted footprints, making the layer more comprehensive and current. Various departments within the CoCT can use this layer for infrastructure planning, service delivery planning, land use planning, and change detection. For better performance, it is recommended to add more informal and industrial training datasets with sufficient roof variability.
Fine-tuning the Mask R-CNN models on these additional datasets should improve the accuracy with which shacks and industrial building footprints are extracted.
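
As an illustration of how the object-level precision, recall, and F1-score reported above can be derived, the following is a minimal sketch and not the evaluation code used in this study. It assumes that footprints are represented as sets of pixel coordinates, that a predicted footprint counts as a true positive when its intersection over union (IoU) with an unmatched reference footprint is at least 0.5, and that matching is greedy; the threshold, the matching strategy, and all function names are hypothetical choices made for this example. AP is omitted because it additionally requires a confidence score for each prediction.

```python
from typing import List, Set, Tuple

Pixel = Tuple[int, int]   # (row, col) of one raster cell
Mask = Set[Pixel]         # one footprint as a set of raster cells


def iou(a: Mask, b: Mask) -> float:
    """Intersection over union of two footprint masks."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def match_footprints(predicted: List[Mask], reference: List[Mask],
                     iou_threshold: float = 0.5) -> Tuple[int, int, int]:
    """Greedy one-to-one matching; returns (TP, FP, FN) counts.

    The 0.5 threshold is an assumption made for this sketch.
    """
    unmatched = set(range(len(reference)))
    tp = 0
    for pred in predicted:
        best_idx, best_iou = None, 0.0
        for idx in unmatched:
            score = iou(pred, reference[idx])
            if score > best_iou:
                best_idx, best_iou = idx, score
        if best_idx is not None and best_iou >= iou_threshold:
            unmatched.remove(best_idx)  # each reference is matched at most once
            tp += 1
    fp = len(predicted) - tp   # predictions with no matching reference
    fn = len(unmatched)        # reference footprints that were missed
    return tp, fp, fn


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1-score from the match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def rectangle(r0: int, c0: int, r1: int, c1: int) -> Mask:
    """Hypothetical helper: a rectangular footprint covering [r0, r1) x [c0, c1)."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


# Toy data: two reference buildings, one good prediction, one spurious prediction.
reference = [rectangle(0, 0, 4, 4), rectangle(10, 10, 14, 14)]
predicted = [rectangle(0, 0, 4, 3), rectangle(20, 20, 22, 22)]

tp, fp, fn = match_footprints(predicted, reference)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
```

In this toy case, one prediction overlaps a reference footprint with an IoU of 0.75, one prediction matches nothing, and one reference building is missed, so precision, recall, and F1-score all equal 0.5.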