Abstract:
The volume of big data increases daily. Big data poses challenges in storage, management, processing, analysis and visualisation. One technique for handling big data is to use a subset or sample that is a good representation of the data. To alleviate storage demands, a subset of the big data can be obtained from its metadata. This paper obtains the metadata of a remote sensing image dataset for crop classification and proposes a sampling algorithm that uses multivariate stratification, with the aim of obtaining a sample that best represents the population while minimising the number of images sampled. The proposed sampling algorithm performs effectively on a big spatial image dataset of crop types. The results are assessed by the number of images sampled and by how closely the sample matches the crop percentages of the population. The samples obtained from the proposed algorithm, referred to as the proposed samples, are then used for land cover classification. An ensemble method, random forest, is trained on the different samples and its accuracy is assessed. Precision, recall and F1-score are computed per crop type, together with the overall accuracy. The random forest classifier performed best on the proposed sample with the fewest images, followed by the proposed sample with the second fewest images. The classifier performed better on the proposed samples than on the random samples, as the proposed samples contained the most informative data. This research encourages the use of metadata for classification purposes and presents an effective way of sampling big data for crop classification.
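As an illustration of the idea of matching population proportions from metadata, the following is a minimal sketch of proportional stratified sampling over strata defined by several metadata fields. It is not the sampling algorithm proposed in this paper: it takes a fixed sample size rather than minimising the number of images, and the function name `stratified_sample` and the metadata fields `crop` and `region` are hypothetical names chosen for the example.

```python
import random
from collections import defaultdict


def stratified_sample(records, keys, sample_size, seed=0):
    """Draw a proportionally allocated sample from strata defined by `keys`.

    `records` is a list of metadata dicts (one per image); `keys` lists the
    fields whose joint values define a stratum (multivariate stratification).
    """
    if not 0 <= sample_size <= len(records):
        raise ValueError("sample_size must be between 0 and len(records)")

    rng = random.Random(seed)

    # Group records by the tuple of their stratification values.
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)

    total = len(records)
    # Ideal (fractional) share of the sample for each stratum.
    quotas = {s: sample_size * len(items) / total for s, items in strata.items()}
    alloc = {s: int(q) for s, q in quotas.items()}

    # Largest-remainder rule: give leftover slots to the strata whose
    # fractional quota was truncated the most.
    leftover = sample_size - sum(alloc.values())
    by_remainder = sorted(quotas, key=lambda s: quotas[s] - alloc[s], reverse=True)
    for s in by_remainder[:leftover]:
        alloc[s] += 1

    # Draw without replacement inside each stratum.
    sample = []
    for s, items in strata.items():
        sample.extend(rng.sample(items, alloc[s]))
    return sample


# Hypothetical metadata: 60 maize, 30 wheat and 10 soy image records.
metadata = (
    [{"id": i, "crop": "maize", "region": "A"} for i in range(60)]
    + [{"id": 60 + i, "crop": "wheat", "region": "A"} for i in range(30)]
    + [{"id": 90 + i, "crop": "soy", "region": "B"} for i in range(10)]
)
sample = stratified_sample(metadata, keys=["crop", "region"], sample_size=10)
```

With this toy metadata, a sample of 10 records contains 6 maize, 3 wheat and 1 soy record, mirroring the 60/30/10 split of the population.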