Unsupervised machine learning in air pollution epidemiology in South Africa : artificial intelligence subset application

Mwase, Nandi Sisasenkosi

dc.contributor.advisor	Wichmann, Janine
dc.contributor.coadvisor	Junger, Washington
dc.contributor.postgraduate	Mwase, Nandi Sisasenkosi
dc.date.accessioned	2023-08-24T06:44:25Z
dc.date.available	2023-08-24T06:44:25Z
dc.date.created	2023-09
dc.date.issued	2023
dc.description	Thesis (PhD (Epidemiology))--University of Pretoria, 2023.	en_US
dc.description.abstract	Clean air is a human right and a condition for healthy living, but air pollution remains a global concern. The World Health Organization (WHO) has stated the detrimental health effects of air pollution, equating the effects to other health risks including an unhealthy diet and smoking tobacco. Air pollution is a complex mixture of droplets, solid particles, and gases, such as particulate matter (PM), nitrogen dioxide (NO2), ground-level ozone (O3), and sulphur dioxide (SO2). Air pollution is globally recognised as the most significant environmental threat to human health. Exposure to air pollution is associated with increased risk of respiratory diseases, cardiovascular diseases, and cancers, as well as increased risk of mortality. The global estimation of the number of deaths from air pollution ranges from 6.7 to 7 million deaths. Low- and middle-income countries (LMICs) are reported to account for a substantial proportion of these fatalities, with Africa accounting for approximately one million deaths. Long-term exposure to household air pollution has also contributed 4% of global deaths. There are a number of pollutants that have been associated with negative health effects. As of 2019, in South Africa, the State of Global Air estimated 24 800 premature deaths due to exposure to PM2.5. However, this may be an underestimation as there are only a few studies in South Africa sampling PM2.5 and associating the pollutant with mortality. Ground-level ozone has contributed to approximately 365 000 deaths, equating to 11% of chronic obstructive pulmonary disease (COPD) deaths globally. However, all air pollutant estimations and the associated number of deaths are reliant on exposure-response functions derived from epidemiological studies that are predominantly conducted in developed countries. Currently, there are limited studies conducted in LMICs, such as South Africa, that provide a comprehensive understanding of the impact of air pollution. 
Hence, it is critical for more epidemiological studies on air pollution to be conducted in countries such as South Africa. The epidemiological evidence on the health effects of air pollution mixtures is lacking globally. This could indicate a current underestimation of the health risks from merely adding air pollutants together in statistical models. There are various traditional statistical methods that have been proposed to investigate the health effects of air pollution mixtures, such as multi-linear regression, classification and regression tree analysis (CART), Cox proportional hazards regression, etc. Recently, researchers have also applied Machine Learning (ML) methods, a subset of Artificial Intelligence (AI), to address this topic. The majority of studies have applied unsupervised ML, such as k-means clustering; however, such studies are lacking in Africa. Additionally, there are multiple sources, both man-made and natural, that can lead to different mixtures of air pollutants, such as PM10 and PM2.5. While many epidemiological studies mainly focus on the mass of PM10 and PM2.5, few studies investigate the chemical composition and identification of their sources. Positive Matrix Factorization (PMF) is a well-regarded method for source apportionment. Similar to other research areas, ML methods such as k-means and spectral clustering are being used as alternative source apportionment methods. Even fewer studies in South Africa are investigating the use of ML as a source apportionment method. Therefore, the aim of this PhD thesis was to address some of the research gaps identified above, namely, the lack of studies in Africa on the health effects of air pollution mixtures and PM2.5 source apportionment, whilst also assessing the applicability of AI methods, such as unsupervised ML, in air pollution epidemiology in South Africa. 
The thesis objectives were to: • Assess the perceptions and attitudes regarding AI in public health among postgraduate students registered for the online Postgraduate Diploma in Public Health at the School of Health Systems and Public Health (SHSPH), University of Pretoria (UP). • Determine the joint effects of SO2, NO2, O3, PM2.5, and PM10 on hospital admissions for respiratory disease (RD) and cardiovascular disease (CVD) in Vereeniging and Vanderbijlpark, Gauteng, using traditional statistical analysis, specifically, classification and regression trees. Thereafter, unsupervised Machine Learning methods are utilised to determine the joint effects of the air pollutants on RD and CVD hospital admissions. • Compare two methods of source apportionment of PM2.5 in Pretoria – a traditional method such as Positive Matrix Factorization (PMF) and unsupervised Machine Learning clustering methods. Method: The PhD project was divided into three parts. The first was a cross-sectional survey among students enrolled in the Postgraduate Diploma in Public Health at UP to assess perceptions and attitudes regarding AI in public health. The second part of the project was to determine the joint effects of SO2, NO2, O3, PM2.5, and PM10 on RD and CVD hospital admissions in Vereeniging and Vanderbijlpark, in the Vaal Triangle Airshed Priority Area (VTAPA), South Africa. There was a total of 3 346 observations from 2 January 2011 to 29 February 2020 (before the first recorded COVID-19 case in South Africa). The statistical CART analysis was used to assess the joint effects. Seven air pollution mixtures were created in the analyses, i.e. (mixture 1) PM10, NO2, and SO2, (mixture 2) PM2.5, NO2, and SO2, (mixture 3) PM10, NO2, and O3, (mixture 4) PM2.5, NO2, and O3, (mixture 5) PM10, SO2, and O3, (mixture 6) PM2.5, SO2, and O3, and (mixture 7) O3, NO2, and SO2. 
Thereafter, unsupervised ML clustering methods – k-means, spectral clustering, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) – were applied to the air pollution data to determine their joint effects on RD and CVD hospital admissions. Lastly, source apportionment for PM2.5 in Pretoria was performed using PMF analysis and unsupervised ML clustering methods, i.e. k-means, spectral clustering and principal component analysis (PCA). There was a total of 428 observations collected from 18 April 2017 to 12 February 2021. Gravimetric analysis was used to calculate the concentration levels and species identification was done through X-ray Fluorescence (XRF). The following fifteen identified species were used in the PMF model: PM2.5, BC, UV-PM, S, Cl, K, Ca, Ti, Fe, Ni, Cu, Zn, Br, U, and Si. Results: 618 respondents completed an online survey (81.5% response rate). Generally, respondents thought AI would be capable of performing various tasks that did not provide direct care to individuals. Most (69%) agreed that the introduction of AI could reduce job availability in public health fields. Respondents agreed that AI in public health could raise ethical (84%), social (77%), and health equity (77%) challenges. Only 52% of respondents thought they were being adequately trained to work alongside AI tools and the majority (76%) felt training of AI competencies should begin at an undergraduate level. The air pollution (SO2, NO2, O3, PM2.5, and PM10) and meteorological data (relative humidity and temperature) used was from 1 January 2011 to 29 February 2020 (before the first recorded COVID-19 case in South Africa). Due to the missing air pollution and meteorological data for the VTAPA area, data was imputed using the multiple imputation by chained equations (mice) method. There were 54 822 respiratory disease (RD) hospital admissions in VTAPA from 2 January 2011 to 29 February 2020 (before the first recorded COVID-19 case). 
Generally, the risk of RD hospital admissions increased (relative risk (RR) 1.04; 95% CI 1.01, 1.08) when exposed to mixtures with high levels of NO2 and varying levels of SO2, O3, PM2.5, and PM10. There were 22 205 cardiovascular disease (CVD) hospital admissions in VTAPA during the study period. The risk of CVD hospital admissions increased among those exposed to air pollution mixtures (2), (3), (4), (6), and (7), with RRs of 1.11 (95% CI 1.02, 1.20), 1.15 (95% CI 1.04, 1.29), 1.13 (95% CI 1.05, 1.21), 1.11 (95% CI 1.02, 1.20), and 1.14 (95% CI 1.06, 1.22), respectively. Similar to findings for RD, the highest risk for CVD hospitalisation was found when exposed to high levels of NO2 and varying levels of SO2, O3, PM2.5, and PM10. The unsupervised ML clustering methods used – k-means clustering and spectral clustering – showed that the air pollution data SO2, NO2, O3, PM2.5, and PM10 were best grouped into two clusters. However, a three-cluster spectral clustering model using the normalised Laplacian matrix showed that the risk of RD hospital admission increased (RR 1.04; 95% CI 1.01, 1.08) when exposed to higher concentration levels of SO2, NO2, PM2.5, and PM10 and lower levels of O3. None of the formed cluster mixtures were found to increase the risk of CVD hospital admission. The DBSCAN clustering method did not prove to be an appropriate clustering method, as it greatly reduced the dataset and produced ill-distributed observations within formed clusters. A seven-factor PMF model was applied to PM2.5 data collected over a 46-month period in Pretoria, South Africa. The seven contributing sources identified included mining (43.2%), biomass/coal burning (14.2%), secondary sulphur (12.1%), road traffic (11.3%), industry/base metal (8.7%), resuspended dust (8.5%), and general exhaust emissions (2.0%). PMF analysis was relatively easy to conduct and analyse; however, the process proved to be computationally taxing for medium to large datasets. 
Additionally, the modelled PM2.5 concentration levels were lower than the actual PM2.5 concentration levels; the correlation between modelled PM2.5 and actual PM2.5 data was R2 = 0.6. The seven-cluster spectral clustering model, using the normalised Laplacian matrix, showed feasible sources for the PM2.5 data during the 46-month period in Pretoria, South Africa. The possible identified sources of PM2.5 were coal burning (42.89%), industry (22.0%), resuspended dust (10.4%), base metal (6.7%), road traffic (6.8%), general exhaust emissions (5.8%), and secondary sulphur (5.5%). Spectral clustering was easy to run, not computationally taxing, and utilised the complete dataset within the clustering. This suggests that it was a good dimension reduction tool that can produce plausible results for source apportionment. However, there was an issue of overlapping clusters and a lack of external validation for the formed clusters. This is a cause for concern when using spectral clustering for source apportionment. Conclusion: The study contributes to the limited, but growing, knowledge and application of ML and AI in public health and air pollution epidemiology. The survey yielded a variety of views. There was a general assumption that AI in public health could assist in performing particular tasks at different health levels that did not involve direct care. There was also a general consensus that AI had the potential to raise unemployment and ethical challenges in the public health field in South Africa. SO2, NO2, O3, PM2.5, and PM10 mixtures proved to be associated with RD and CVD hospital admission. The mixtures showed that a higher concentration of NO2 in combination with varying concentrations of SO2, O3, PM2.5, and PM10 can lead to increased risk of both RD and CVD hospitalisation. 
This result contributes epidemiological evidence that can help policy makers to introduce stricter policies for improving the air quality of national priority areas, such as VTAPA in South Africa. Unsupervised ML could be useful in determining joint effects of air pollutants on hospital admission and other health outcomes. K-means and spectral clustering were both relatively easy to run and analyse; they were also less time-consuming in comparison to the CART analyses. The process also showed promise for analysing more than three air pollutants, in spite of the different interactions. However, it is evident that further study is needed before unsupervised ML can be considered a reliable and definitive tool to study the joint effects of air pollution on different health outcomes. PMF modelling suggested that mining and industry were the main contributing factors to PM2.5 in Pretoria. However, there is a great need for more studies that sample PM2.5 in Africa. Source apportionment studies are vital in the evaluation of policies intended to protect communities from the detrimental health effects of PM2.5. The PMF software was relatively easy to use and the data produced was relatively easy to analyse for possible sources of PM2.5. However, the three model runs only showed 0.4 to 0.6 correlation with the original data. Unsupervised ML for source apportionment is still a relatively new concept and needs to be further explored. In comparison with PMF, spectral clustering showed potential as a dimension-reduction tool for source apportionment. Although the sources identified by the spectral clustering model were similar to those identified by the PMF model, there were some noticeable limitations. Extensive studies are needed to continue exploring the potential of clustering for source apportionment studies. Furthermore, there is a need to increase air pollution epidemiology and source apportionment studies in South Africa. 
This will increase African-based evidence of the detrimental effects of air pollution. Unsupervised ML has the potential to be used in air pollution and public health studies. This project provides a baseline of the current perceptions of AI in public health and could lead to more in-depth studies on the topic. The project aims to initiate conversation around including AI in public health, and it provides epidemiological evidence that can be used to advocate for stricter, more effectively enforced air quality standards and management plans in VTAPA. Lastly, the project also provides a baseline framework for including the application of ML in epidemiological and source apportionment studies. Spectral clustering provided plausible results in comparison to the results obtained using statistical and traditional models. As the study used only a limited number of unsupervised ML methods, it is highly recommended that other unsupervised ML methods be used in further public health studies to continue investigating the practical implementation of AI in public health.	en_US
dc.description.availability	Unrestricted	en_US
dc.description.degree	PhD (Epidemiology)	en_US
dc.description.department	School of Health Systems and Public Health (SHSPH)	en_US
dc.description.sponsorship	University of Pretoria- Postgraduate Bursary	en_US
dc.identifier.citation	*	en_US
dc.identifier.doi	https://doi.org/10.25403/UPresearchdata.23937777.v1	en_US
dc.identifier.uri	http://hdl.handle.net/2263/92027
dc.language.iso	en	en_US
dc.publisher	University of Pretoria
dc.rights	© 2023 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject	UCTD	en_US
dc.subject	Machine learning	en_US
dc.subject	Epidemiology	en_US
dc.subject	Artificial intelligence	en_US
dc.subject	South Africa	en_US
dc.subject	Air pollution	en_US
dc.title	Unsupervised machine learning in air pollution epidemiology in South Africa : artificial intelligence subset application	en_US
dc.type	Thesis	en_US