Theses and Dissertations (Statistics)
Permanent URI for this collection: http://hdl.handle.net/2263/32483
Recent Submissions
Now showing 1 - 20 of 142
Item Aspect-based sentiment analysis using topic modelling on student evaluations (University of Pretoria, 2024-11) Mazarura, Jocelyn; Fabris-Rotelli, Inger Nicolette; u18015817@tuks.co.za; Du Toit, Jana
Aspect-Based Sentiment Analysis (ABSA) is a Natural Language Processing (NLP) task that focuses on identifying and extracting sentiment related to specific aspects or components of various subjects, including but not limited to products or services. In ABSA, the process typically involves several steps. First, aspects or features relevant to the product or service are identified from the text. These aspects could encompass specific attributes, functionalities, or components. Next, sentiment analysis is performed to determine the polarity (positive, negative, or neutral) associated with each aspect based on the context within the sentence or document. Finally, the results are aggregated to provide an overall sentiment for each aspect. This mini-dissertation investigates a proposed novel approach for aspect-based sentiment analysis using topic modelling on student evaluation data from the Department of Statistics, provided by the University of Pretoria. Using ABSA in higher education is significant because it provides insights into how students view certain aspects. These insights are useful to lecturers, the Head of the Department and even the Dean, who can base decisions on them. The mini-dissertation utilises topic models for aspect extraction. Among these, the Latent Dirichlet Allocation (LDA) topic model is widely recognised. However, the literature indicates that the LDA model performs better on longer texts, such as newspaper articles or e-books, than on shorter texts like tweets. Since the student evaluations used in this research are short texts, the LDA model may not be the most suitable. Therefore, two alternative topic models designed for short texts, the Biterm Topic Model (BTM) and the Dirichlet Multinomial Mixture model (DMM), are also applied to the data. These three topic models are applied in conjunction with an automatic text summarisation method for aspect extraction. As expected, the LDA topic model did not perform as well as the BTM and DMM models. Analysing the results from the BTM and DMM models, it was evident that the coherence scores from the BTM model were higher than those from the DMM, which indicates that the BTM model better captures the underlying topics and relationships within the data. After the topic modelling was applied, two sentiment analysis methods, the Multinomial Naïve Bayes method, which is a machine learning technique, and the VADER method, which is a lexicon-based approach, were applied to the educational data. It was found that the Multinomial Naïve Bayes approach produced sentiments that were skewed towards the negative, whereas the VADER method produced sentiments that were more evenly spread across positive, neutral and negative. Therefore, the VADER method was the preferred method. These findings underscore the importance of selecting an appropriate topic modelling approach and sentiment analysis method for aspect-based sentiment analysis tasks. Key insights and recommendations from analysing the student evaluation data using the proposed new approach to aspect-based sentiment analysis highlight several improvements that the lecturers could consider.
These include incorporating pre-recorded videos into the curriculum to accommodate various learning preferences, establishing a peer-review system to reduce errors in assignments and tests, and decreasing the number of pre-class and post-class tests for senior students to better manage their workload. Additionally, it is recommended that support and resources be customised to the specific needs of different student groups and that communication channels between students and staff be enhanced to ensure student queries are effectively addressed. These recommendations aim to improve the overall learning experience and meet the diverse needs of students.

Item Maximum likelihood estimation for Cox regression under risk set sampling (University of Pretoria, 2025-02) Nakhaeirad, Najmeh; Nasejje, Justine; u19044438@tuks.co.za; Mashinini, Nontokozo
In certain epidemiological studies, researchers aim to investigate specific events, such as disease outcomes, and their associated risk factors within a cohort. However, in the era of big data, analyzing the entire cohort can be time-consuming due to the large volume of data. To address this challenge, a nested case-control design can be employed, allowing for quicker and more efficient analysis by focusing on a sample of cases and matched controls within the same population. In survival analysis, the cohort dataset is crucial for defining the risk sets in Cox proportional hazards (CPH) model optimization. These risk sets are integral to the Cox partial likelihood function, which is used to fit the model. This research seeks to apply the nested case-control design to these risk sets via a simulation study, specifically exploring various case-control structures such as 1:1, 1:2, 1:4, and 1:8. The study aims to investigate whether the size of the sampled risk sets impacts the time efficiency of the model and the precision of the estimated parameters using two optimization methods: Newton-Raphson (NR) and Stochastic Gradient Descent (SGD). Results from optimizing the four different case-control structures using NR suggest that the CPH model's parameter estimates converge to the true values, with bias decreasing as the number of controls per case decreases, although there are minor fluctuations in some structures (for example, the positive bias values for $\beta_1$ obtained across the case-control structures are 0.041, 0.039, 0.080, 0.133 and 0.002). The CPH model fitted with NR performed well with a complete risk set in large datasets and continued to perform well in small datasets, though not as effectively as in the larger ones. When the CPH model is optimized using SGD across the four different case-control structures, it converges to the true parameter values, particularly when the sample size is large and a complete risk set is used. This study demonstrates how large datasets can be efficiently scaled in survival analysis studies, providing valuable insights relating to parameter precision. The estimates derived from the real datasets using both NR and SGD optimization techniques were generally similar, though with slight differences across the various case-control structures. The full risk set estimates were used as a reference for comparison with those from the different case-control structures.
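For context, the Cox partial likelihood referred to above, written in generic notation with the sampled (nested case-control) risk set in place of the full cohort risk set, takes the standard form

$$L(\beta) \;=\; \prod_{i:\,\delta_i = 1} \frac{\exp(\beta^{\top} x_i)}{\sum_{j \in \tilde{R}(t_i)} \exp(\beta^{\top} x_j)},$$

where the product runs over the observed events and $\tilde{R}(t_i)$ contains the case at event time $t_i$ together with its matched controls sampled from those still at risk; the notation here is illustrative and may differ from the author's.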
We have discovered in this research that, in risk set sampling with a nested case-control design, using fewer controls per case leads to estimates that more closely approximate the true values, providing valuable insights into the trade-offs between time efficiency and precision in parameter estimation.

Item Spatial catchment areas using fuzzy lattice data structures (University of Pretoria, 2024-11) Fabris-Rotelli, Inger Nicolette; dklerkm@gmail.com; De Klerk, Michelle
This thesis presents a comprehensive framework for defining and optimising service catchment areas through innovative approaches, addressing accessibility and resource allocation challenges, particularly in low-resource settings. The first methodology introduces fuzzy lattice catchment areas, using a semi-supervised, probabilistic approach to create overlapping service zones. By enabling communities to access multiple points of interest (POIs) within their range and incorporating drive-time thresholds, this approach ensures a more equitable distribution of demand and supply, minimising spatial imbalances. Building on this, the second methodology extends the fuzzy lattice framework by integrating attribute-based connections, combining structural and contextual attributes to more accurately capture spatial dynamics. This dual consideration allows for a refined propagation of demand across networks, addressing limitations in traditional connectivity-only models. The final methodology applies attribute-based spatial segmentation, creating tailored macro-regions that align with local environmental and socio-economic factors. By leveraging probabilistic clustering, it optimises service placements and identifies both spatially accessible and disjoint regions. Collectively, these approaches advance the field of spatial planning by offering flexible, data-driven solutions that adapt to regional characteristics, enhancing service accessibility and equitable resource distribution. The applications demonstrate significant potential across healthcare, urban planning, and beyond, providing a robust foundation for addressing evolving accessibility challenges.

Item Enhanced point pattern analysis on nonconvex spatial domains (University of Pretoria, 2024-11) Fabris-Rotelli, Inger Nicolette; u14194237@tuks.co.za; Mahloromela, Kabelo
Point pattern analysis is the study of the spatial arrangement of points in space, usually two-dimensional space. The points arise from a stochastic mechanism, termed a point process, whose characteristics are of scientific interest. The properties of point patterns are characterised using statistical measures that are functions of the study area and distance. Consequently, the domain in which points are observed and the distance metric used to quantify proximity between points play an important role. Convex domains with the Euclidean distance are often used. This choice of domain and distance measure, however, makes an implicit assumption that all points are connected in a space without obstacles. In real-world applications, points may be constrained by their environments, so a convex window and the Euclidean distance may not correctly capture spatial proximity relationships and the restrictions imposed by the domain's geometry. This thesis presents methodology that accounts for the nonconvex structure of the spatial domain in point pattern analysis.
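To make the dependence on the window and the distance metric concrete, one common estimator of Ripley's K-function on a convex window $W$ with Euclidean distance is (standard notation, given here only as background)

$$\hat{K}(r) \;=\; \frac{|W|}{n(n-1)} \sum_{i=1}^{n}\sum_{j \neq i} e_{ij}(r)\, \mathbf{1}\{\lVert x_i - x_j \rVert \le r\},$$

where $|W|$ is the area of the window, $e_{ij}(r)$ is an edge-correction weight and $\mathbf{1}\{\cdot\}$ is the indicator function. Replacing $W$ with a nonconvex domain and the Euclidean norm with a domain-aware distance is precisely the adjustment motivated above.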
Firstly, consideration is given to the selection of nonconvex windows (when unknown) for point patterns realised from a process that is governed by a covariate. The proposed algorithm uses a weighted distance-based outlier scoring scheme that considers the distribution of covariates at observed data point locations. The robustness of the algorithm is demonstrated through a simulation study. Subsequently, a framework is developed to quantify proximity relationships using a graph-theoretic approach based on visibility graphs. This characterisation of distance is used to extend first- and second-order point pattern measures for appropriate use on nonconvex domains. Finally, we provide an implementation strategy to efficiently compute summary measures based on queries to the visibility graph.

Item Spatial linear network Voronoi analysis to quantify accessibility of police stations in South Africa (University of Pretoria, 2024-11) Fabris-Rotelli, Inger Nicolette; Stander, Rene; Thiede, Renate; a.antonio@tuks.co.za; Antonio, Arthur
This study quantifies the overlap between existing police precinct boundaries and theoretically optimal boundaries derived from Voronoi diagrams based on Euclidean and network distances. Spatial similarity measures are used to analyse the relationship between boundary overlap and police station accessibility, hypothesising that reduced overlap corresponds to decreased accessibility. Accessibility, in this mini-dissertation, refers to how easily an individual can reach a police station, with closer points being more accessible. The analysis extends to the potential effects of boundary placement on crime rates, suggesting that greater inaccessibility of police stations may correlate with fewer crimes reported in that precinct. By quantifying these relationships, this research evaluates the effectiveness of current precinct boundaries and their potential influence on crimes reported. For precincts with low similarity values, indicating low accessibility, we analyse the proportional change in the number of crimes reported after boundary modifications. A decrease in reported crimes within the new boundaries generated by the Voronoi models would support our hypothesis, suggesting that a significant portion of crimes are being reported to other, nearer, and more accessible police stations.

Item Determining the number of clusters using penalised k-means clustering (University of Pretoria, 2024-11) Millard, Sollie M.; Kanfer, F.H.J. (Frans); robert.w.greyling@gmail.com; Greyling, Robert William
Clustering is an important part of statistics; however, the need to pre-specify the number of clusters remains a persistent issue. In this minor dissertation we consider a procedure that removes the need to pre-specify the number of clusters in the k-means algorithm, automatically determining a suitable value of k and thereby reducing manual effort in clustering tasks. Following the approach of Sinaga and Yang, we modify the traditional k-means objective function by adding two entropy terms as penalty terms. An additional step was added to the algorithm to ensure that the initial clusters are not empty. A simulation study was conducted using multiple datasets with varying true cluster counts k, data dimensionalities D, and sample sizes n.
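As a rough illustration of this idea (generic notation, not the exact objective used in the minor dissertation), an entropy-penalised k-means criterion over cluster means $\mu_k$, mixing proportions $\alpha_k$ and hard assignments $z_{ik} \in \{0,1\}$ can be written as

$$J \;=\; \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\left(\lVert x_i - \mu_k \rVert^{2} - \gamma \ln \alpha_k\right) \;+\; \delta \sum_{k=1}^{K} \alpha_k \ln \alpha_k,$$

where the entropy-type penalty terms shrink the proportions of superfluous clusters towards zero, so that the number of clusters is learned from the data rather than fixed in advance.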
Results indicate that the proposed algorithm performs well in identifying distinct clusters, particularly in lower-dimensional data.

Item Economic recession prediction using modified gradient boosting and principal component neural network algorithms (University of Pretoria, 2025-02) Nakhaeirad, Najmeh; u19098309@tuks.co.za; Krishnannair, Anuroop
In the ever-evolving landscape of global economics, predicting and understanding economic recessions remain paramount challenges for policymakers, researchers, and financial analysts. The outbreak of the COVID-19 pandemic in 2019 introduced unprecedented complexities, reshaping the economic dynamics of nations worldwide. Few economic recessions have been accurately predicted months in advance. To mitigate the growing impact of these downturns, it is essential to develop more effective predictive models that can assist businesses and governments in formulating policies to support millions of people before these periods occur, given the economy's critical role in policy development. Machine learning algorithms have traditionally been widely applied in pattern recognition; however, limited research has explored their use in finance, especially for predicting recessions, and very few studies are available in this area. This research identifies the best-performing models for predicting recession periods in advance and the most important variables for improving overall model performance, addressing the concern that previous studies have shown bias due to imbalanced class ratios. To achieve this, in addition to Artificial Neural Networks, machine learning techniques such as Random Forests and Support Vector Machines are used to provide an efficient prediction model aimed at avoiding greater government deficits, growing inequality, significantly decreased income, and higher unemployment. In this study, an ensemble approach of Logistic Regression and Non-Linear Principal Component Analysis Logistic Regression (NLPCA-LR), together with a Modified Gradient Boosting Neural Network (MGBNN), is proposed and compared to the aforementioned models. A real dataset on historical recession periods in African countries is employed to demonstrate the performance of the proposed algorithms in practice. The performance analysis across the various models highlights the superior capabilities of the MGBNN and NLPCA-LR models. This demonstrates the predictive power of machine learning models in the financial domain and helps alleviate the concern that these models are black boxes.

Item A contaminated generalized t model for cryptocurrency returns (University of Pretoria, 2025-02) Bekker, Andriette, 1958-; Ferreira, Johan; Arashi, Mohammad; thembinkosimanyeruke@gmail.com; Manyeruke, Thembinkosi Johannes
In financial analytics, a key aim of currency data analysis is to determine the distribution of returns. Considering the extensive utilization of cryptocurrencies, it is essential to offer a highly flexible model for distributions with heavier tails to analyze bitcoin data. A recent study by Punzo and Bagnato (2021) demonstrated that cryptocurrency returns exhibit high peakedness, heavy tails, and large excess kurtosis. To improve control over tail behaviour in flexible models, we recommend employing the generalized elliptical family of distributions for cryptocurrency returns.
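As general background (this is the generic contaminated construction, not the specific model developed in the dissertation), a contaminated distribution mixes a 'typical' component with a scale-inflated 'atypical' component,

$$f(x) \;=\; (1-\epsilon)\, g(x;\mu,\sigma) \;+\; \epsilon\, g(x;\mu,\eta\sigma), \qquad 0 < \epsilon < 1,\ \eta > 1,$$

where $\epsilon$ represents the expected proportion of atypical points and $\eta$ inflates the scale of the contaminating component, producing the heavier tails and larger kurtosis described above.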
We systematically construct this family of distributions, obtaining the Bernoulli-Laplace distribution of Punzo and Bagnato (2021) and the contaminated generalized t distribution as constituents of this family. Both distributions have heavy tails, pronounced peaks, and large, adjustable kurtosis. Additionally, the suggested framework allows for the division of the real line into two regions: one containing typical points and the other containing atypical points. We illustrate the effectiveness of the suggested framework on four cryptocurrencies, USDJ USD, Frax USD, Gnosis USD, and Ethereum USD, in comparison to alternative distributions frequently employed in the financial literature. The findings demonstrate that the suggested framework surpasses the other evaluated distributions. Moreover, the contaminated generalized t distribution is optimal for data with significant excess kurtosis, whereas the Bernoulli-Laplace distribution is preferable for data with comparatively lower, though still leptokurtic, kurtosis.

Item Finite mixture of factorization machines (University of Pretoria, 2024-12-13) Kanfer, F.H.J. (Frans); dian.degenaar@gmail.com; Degenaar, Dian
This mini-dissertation introduces a novel mixture model of factorization machines. Factorization machines (FM) are a supervised learning class capable of learning pairwise interactions between predictor variables, which can also be extended to interactions in higher dimensions. They are based on matrix factorization techniques, which contribute to their success in prediction tasks. The FM factorizes interaction terms, obtaining prediction accuracy on par with multiple linear regression (MLR). The FM also achieves this using fewer variables, and its performance exceeds that of MLR under sparsity. Finite Gaussian mixture models (FGMM) are adept at modeling non-homogeneous populations and detecting subgroups; however, they are constructed as a combination of multiple Gaussian linear regression components. The novel model is constructed as a combination of multiple Gaussian factorization machine components to exploit the advantages of FMs with respect to pairwise interaction terms and sparsity. The model is estimated via an expectation-maximization (EM) algorithm, with a coordinate descent (CD) method used to estimate the FM model equation. Compared to the FGMM in a sparse data setting, the novel model achieves a better fit to the data using fewer parameters and a shorter computation time.

Item Explainable Bayesian networks: taxonomy, properties and approximation methods (University of Pretoria, 2024-07-22) De Waal, Alta; inekederks1@gmail.com; Derks, Iena Petronella
Technological advances have integrated artificial intelligence (AI) into various scientific fields, necessitating an understanding of AI-derived decisions. The field of explainable artificial intelligence (XAI) has emerged to address transparency concerns, offering both transparent models and post-hoc explanation techniques. Recent research emphasises the importance of developing transparent models, with a focus on enhancing their interpretability. An example of a transparent model that would benefit from enhanced post-hoc explainability is the Bayesian network. This research investigates the current state of explainability in Bayesian networks. The literature distinguishes three categories of explanation: explanation of the model, of the reasoning, and of the evidence. Drawing upon these categories, we formulate a taxonomy of explainable Bayesian networks.
Following this, we extend the taxonomy to include explanation of decisions, an area recognised as neglected within the broader XAI research field. This includes using the same-decision probability, a threshold-based confidence measure, as a stopping and selection criterion for decision-making. Additionally, acknowledging computational efficiency as a concern in XAI, we introduce an approximate forward-gLasso algorithm for efficiently solving the most relevant explanation. We compare the proposed algorithm with a local, exhaustive forward search. The forward-gLasso algorithm demonstrates accuracy comparable to the forward search while reducing the average neighbourhood size, leading to computationally efficient explanations. All coding was done in R, building on existing packages for Bayesian networks. As a result, we develop an open-source R package capable of generating explanations of evidence for Bayesian networks. Lastly, we demonstrate the practical insights gained from applying post-hoc explanations to real-world data, such as the South African Victims of Crime Survey 2016 - 2017.

Item Hypersphere candidates emanating from the Dirichlet and its extension (University of Pretoria, 2024-07) Makgai, Seitebaleng; Bekker, Andriette, 1958-; u18243020@tuks.co.za; Leshilo, Ramadimetje Lethabo
Compositional datasets consist of observations that are proportional and are subject to non-negativity and unit-sum constraints. These datasets arise naturally in a multiplicity of fields such as agriculture, archaeology, economics, geology, health sciences, and psychology. The Dirichlet distribution has a strong footprint in the literature on modelling compositional datasets, followed by several generalizations of the Dirichlet distribution with more flexible structures. In this study, we consider a transformation of two Dirichlet-type random variables $W_1, W_2, \ldots, W_m$ by applying the square-root transformation $X_i = \sqrt{W_i}$ for $i = 1, 2, \ldots, m$. With this square-root transformation, we propose and develop a new distribution that is defined on the positive orthant of the hypersphere and that accommodates both positive and negative covariance structures. This novel model is a flexible addition to the spherical-Dirichlet models. We perform several simulation studies for the proposed model. Maximum likelihood is used for parameter estimation. Two applications of the model to biological and archaeological compositional datasets are presented to illustrate its flexibility.

Item Essays on estimation strategies addressing label-switching in Gaussian mixtures of semi- and non-parametric regressions (University of Pretoria, 2024-04-30) Millard, Sollie M.; Kanfer, F.H.J. (Frans); spiwe.skhosana@up.ac.za; Skhosana, Sphiwe Bonakele
Gaussian mixtures of non-parametric regressions (GMNRs) are a flexible class of Gaussian mixtures of regressions (GMRs). These models assume that some or all of the parameters of GMRs are non-parametric functions of the covariates. This flexibility gives these models wide applicability for studying the dependence of one variable on one or more covariates when the underlying population is made up of unobserved subpopulations. The predominant approach used to estimate the GMR model is maximum likelihood via the Expectation-Maximisation (EM) algorithm. Due to the presence of non-parametric terms in GMNRs, model estimation poses a computational challenge.
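Schematically (generic notation, not necessarily that used in the thesis), such a model expresses the conditional density of the response as

$$f(y \mid x) \;=\; \sum_{k=1}^{K} \pi_k(x)\, \phi\!\left(y;\, m_k(x),\, \sigma_k^{2}(x)\right),$$

where the mixing proportions $\pi_k(\cdot)$, component mean functions $m_k(\cdot)$ and variance functions $\sigma_k^{2}(\cdot)$ may all be left unspecified and estimated non-parametrically.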
Local-likelihood estimation of the non-parametric functions via the EM algorithm may be subject to label-switching. To estimate the non-parametric functions, we have to define a local-likelihood function for each local grid point on the domain of a covariate. If we separately maximise each local-likelihood function using the EM algorithm, the labels attached to the mixture components may switch from one local grid point to the next. The practical consequence of this label-switching is non-parametric estimates that are non-smooth, exhibiting irregular behaviour at the local points where the switch took place. In this thesis, we propose effective estimation strategies to address label-switching. The common thread that underlies the proposed strategies is the replacement of the separate maximisations of the local-likelihood functions with a simultaneous maximisation. The effectiveness of the proposed methods is demonstrated on finite-sample data using simulations. Furthermore, the practical usefulness of the proposed methods is demonstrated through applications to real data.

Item Multiscale decomposition of spatial lattice data for hotspot prediction (University of Pretoria, 2023-11-27) Fabris-Rotelli, Inger Nicolette; Chen, Ding-Geng (Din); rene.stander@up.ac.za; Stander, René
Being able to identify areas at risk of becoming hotspots of disease cases is important for decision makers. This is especially true in cases such as the recent COVID-19 pandemic, where prevention strategies were needed to restrain the spread of the disease. In this thesis, we first extend the Discrete Pulse Transform (DPT) theory to irregular lattice data and consider its efficient implementation, the Roadmaker's Pavage algorithm (RMPA), and its visualisation. The DPT was derived considering all possible connectivities satisfying the morphological definition of connection. Our implementation allows for any connectivity applicable to regular and irregular lattices. Next, we make use of the DPT to decompose spatial lattice data, along with the multiscale Ht-index and the spatial scan statistic as measures of saliency on the extracted pulses, to detect significant hotspots. In the literature, geostatistical techniques such as Kriging have been used in epidemiology to interpolate disease cases from areal data to a continuous surface. Herein, we extend the estimation of a variogram to spatial lattice data. In order to increase the number of data points beyond only the centroids of each spatial unit (representative points), multiple points are simulated in an appropriate way to represent the continuous nature of the true underlying event occurrences more closely. We thus represent spatial lattice data accurately by a continuous spatial process in order to capture the spatial variability using a variogram. Lastly, we incorporate the geographically and temporally weighted regression spatio-temporal Kriging (GTWR-STK) method to forecast COVID-19 cases to the next time step. The GTWR-STK method is applied to spatial lattice data, with the spatio-temporal variogram estimated by extending the proposed variogram for spatial lattice data.
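For reference, the classical empirical (Matheron) variogram estimator that these extensions build on is, in standard notation,

$$\hat{\gamma}(h) \;=\; \frac{1}{2\,\lvert N(h) \rvert} \sum_{(s_i,\, s_j) \in N(h)} \bigl(Z(s_i) - Z(s_j)\bigr)^{2},$$

where $N(h)$ is the set of location pairs separated by (approximately) lag $h$ and $Z(\cdot)$ is the observed process; this is quoted only as background and is not specific to the thesis.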
Hotspots are predicted by applying the proposed hotspot detection method to the forecasted cases.

Item Enhancing spatial image analysis: modelling perspectives on the usefulness of level-sets (University of Pretoria, 2024-03) Fabris-Rotelli, Inger Nicolette; Loots, Mattheus Theodor; u15002536@tuks.co.za; Stander, Jean-Pierre
This thesis presents a comprehensive exploration of level-sets applied to various stages of image analysis, aiming to enhance understanding, modelling, and interpretability of image data. The research focuses on three critical aspects, namely data cleaning, data modelling, and explainability. In data cleaning, the adaptive median filter is a commonly used technique for removing noise from images, comparing each pixel to an adaptive window around it. Herein, the adaptive median filter is improved by acting on level-sets rather than individual pixels. The proposed level-sets adaptive median filter demonstrates effective noise removal while preserving edges in the images better than the traditional adaptive median filter. Secondly, this work considers representing images as graphical models, with the nodes corresponding to the fuzzy level-sets of the images. This novel representation successfully preserves and maps critical image information required for understanding image context in a binary classification scenario. Further, this representation is used to propose a novel method for modelling images, which enables inference to be applied directly to image content. Finally, within the realm of saliency maps for deep learning object detection, the detector randomised input sampling for explanation (D-RISE) method is extended using informative level-set sampling. A key, yet computationally expensive, component of the former is the generation of a suitable number of masks. The proposed methodology in this work, namely the adaptive D-RISE, harnesses proportional level-set sampling of masks to reduce the required number of masks and improve the convergence of attribution.

Item New characterisations of spatial linear networks for geographical accessibility (University of Pretoria, 2024-02-13) Fabris-Rotelli, Inger Nicolette; Debba, Pravesh; Cleghorn, Christopher W; renate.thiede@up.ac.za; Thiede, Renate Nicole
Target 9.1 of the United Nations Sustainable Development Goals specifies the need for affordable, equitable access for all. In South Africa, where most travel occurs via the road network, apartheid policies designed the historical road network to segregate rather than integrate. Since the end of apartheid, there has been an increased need for integrated urban accessibility. Since government initiatives are typically enacted at a regional level, it is relevant to model accessibility between regions. Very few methods in the literature model road-based inter-regional accessibility, and none account for structural characteristics of the road network. The aim of this thesis is to develop a novel stochastic model that estimates road-based inter-regional accessibility and that is able to take the homogeneity of road networks into account. The accessibility model utilises Markov chain theory. Each region represents a state, and the average inverse distances between regions act as transition probabilities. Transition probabilities between adjacent regions are stored in a 1-step transition probability matrix (TPM). Assuming the Markov property holds, raising the TPM to the power $n$ gives transition probabilities between regions up to $n$ steps away.
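As a toy illustration in R (the numbers are invented purely to show the mechanics; the thesis derives its transition probabilities from average inverse distances between actual regions), the $n$-step probabilities and their limiting behaviour can be computed as follows:

```r
# Illustrative 1-step transition probability matrix (TPM) between three regions;
# each row sums to one (values are hypothetical, for demonstration only)
P <- matrix(c(0.6, 0.3, 0.1,
              0.2, 0.5, 0.3,
              0.1, 0.4, 0.5),
            nrow = 3, byrow = TRUE)

# n-step transition probabilities: the TPM raised to the power n
n_step <- function(P, n) Reduce(`%*%`, replicate(n, P, simplify = FALSE))
n_step(P, 5)

# For large n every row converges to the same limiting vector, which plays
# the role of the prominence index described in this abstract
round(n_step(P, 200), 4)
```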
Letting $n \to \infty$ gives the prominence index, which quantifies the accessibility of a region regardless of the journey's starting point. Road network homogeneity is tested by extending a test for the homogeneity of spatial point patterns to spatial linear networks. An unsupervised clustering method is then developed which subdivides a road network into regions that are as homogeneous as possible. Finally, road-based accessibility is calculated between these regions. The accessibility model was first applied to electoral wards in the City of Tshwane. Based on the wards, the central business district (CBD) was most accessible, but there was poor accessibility to the CBD from outlying townships. The homogeneity test showed that distinct residential neighbourhoods were internally homogeneous, and was thus able to identify neighbourhoods within a road network. The unsupervised clustering method was then used to identify two new regionalisations of the road network within the City of Tshwane at different spatial scales, and the accessibility model was applied to these regionalisations. For one regionalisation, an emerging economic area was most accessible, while for the other, a central educational area was most accessible. Although accessibility was not correlated with road network homogeneity, different spatial scales and regionalisations had a great impact on the accessibility results. This thesis develops a new characterisation of spatial linear networks based on their homogeneity, and uses this to investigate the state of inter-regional road-based accessibility in the City of Tshwane. This is a crucial area of research in the move towards a more equitable and sustainable future.

Item Spatial-temporal topic modelling of COVID-19 tweets in South Africa (University of Pretoria, 2023-12-07) Mazarura, Jocelyn; Fabris-Rotelli, Inger Nicolette; u18073159@tuks.co.za; Jafta, Papama Hlumela Gandhi
In the era of social media, the analysis of Twitter data has become increasingly important for understanding the dynamics of online discourse. This research introduces a novel approach for tracking the spatial and temporal evolution of topics in Twitter data. Leveraging the spatial and temporal labels provided by Twitter for tweets, we propose the Clustered Biterm Topic Model. This model combines the Biterm Topic Model with k-medoids clustering to uncover the intricate topic development patterns over space and time. To enhance the accuracy and applicability of our model, we introduce an innovative element: a covariate-dependent matrix. This matrix incorporates essential covariate information and geographic proximity into the dissimilarity matrix used by k-medoids clustering. By considering the inherent semantic relationships between topics and the contextual information provided by covariates and geographic proximity, our model captures the complex interplay of topics as they emerge and evolve across different regions and timeframes on Twitter. The proposed Clustered Biterm Topic Model offers a robust and versatile tool for researchers, policymakers, and businesses to gain deeper insights into the dynamic landscape of online conversations, which are inherently shaped by space and time.

Item A robust simulation to compare meaningful batting averages in cricket (University of Pretoria, 2023-11-17) Van Staden, Paul J.; Fabris-Rotelli, Inger Nicolette; u17150818@tuks.co.za; Vorster, Johannes S.
In cricket, the traditional batting average is the most common measure of a player's batting performance.
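For readers unfamiliar with the convention, the traditional batting average is defined as (a standard cricket definition, not specific to this work)

$$\text{batting average} \;=\; \frac{\text{total runs scored}}{\text{number of times dismissed}},$$

so runs scored in not-out innings increase the numerator without changing the denominator.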
However, the batting average can easily be inflated by a high number of not-out innings. Therefore, in this research eight alternative methods are used and compared to the traditional batting average to estimate the true batting average. There is also a range of different batters within a cricket team, namely first-order and middle-order batters, tail-enders, and a special class of players who can both bat and bowl, known as all-rounders. There are also different formats of international cricket, namely Test, One-Day International (ODI), and Twenty20 International (T20I) cricket, where Test cricket has unlimited overs compared to the limited overs of ODI and T20I cricket. A method for estimating the batting average should be able to account for this variability. The chosen method should also work for a player's career as well as for a short series or tournament. Using the traditional bootstrap and the smoothed bootstrap, the variability of each estimation method is compared for a player's career and for a series or tournament, respectively. An R Shiny application introduces the alternative batting performance measures, enabling accessible analysis beyond the conventional average for a comprehensive understanding of player capabilities.

Item A mixture model approach to extreme value analysis of heavy tailed processes (University of Pretoria, 2023-12-07) Maribe, Gaonyalelwe; Kanfer, Frans; Millard, Sollie; lizosanqela@gmail.com; Sanqela, Lizo
Extreme value theory (EVT) encompasses statistical tools for modelling extreme events, which are defined in the peaks-over-threshold methodology as excesses over a certain high threshold. The estimation of this threshold is a crucial problem and an ongoing area of research in EVT. This dissertation investigates extreme value mixture models, which bypass threshold selection. In particular, we focus on the Extended Generalised Pareto Distribution (EGPD). This is a model for the full range of data characterised by the presence of extreme values. We consider the non-parametric EGPD based on a Bernstein polynomial approximation. The ability of the EGPD to estimate the extreme value index (EVI) is investigated for distributions in the Fréchet, Gumbel and Weibull domains through a simulation study. Model performance is measured in terms of bias and mean squared error. We also carry out a case study on rainfall data to illustrate how the EGPD fits as a distribution for the full range of data; the case study also includes quantile estimation. We further propose substituting the Pareto distribution for the GPD as the tail model of the EGPD in the case of heavy-tailed data. We give the mathematical background of this new model and show that it is a member of the EGPD family and thus complies with EVT. We compare the new model's bias and mean squared error in EVI estimation to those of the original EGPD through a simulation study. Furthermore, the simulation study is extended to include other estimators for Fréchet-type data. Moreover, a case study is carried out on the Belgian Secura Re data.

Item An interactive R shiny application for learning multivariate data analysis and time series modelling (University of Pretoria, 2024-02-07) Salehi, Mahdi; Bekker, Andriette, 1958-; Arashi, Mohammad; francesmotala@gmail.com; Frances, Motala Charles
Multivariate analysis and time series modelling are essential data analysis techniques that provide a comprehensive approach for understanding complex datasets and supporting data-driven decision-making.
Multivariate analysis involves the simultaneous examination of multiple variables, enabling the exploration of intricate relationships, dependencies, and patterns within the data. Time series modelling, on the other hand, focuses on data evolving over time, facilitating the detection of trends and seasonal patterns and the forecasting of future values. In addition to multivariate and time series analysis techniques, we expand our focus to include machine learning, a field dedicated to developing algorithms and models for data-driven predictions and decisions. The primary contribution of this dissertation is the development of an innovative R Shiny application known as the Advanced Modelling Application (AM application). The AM application revolutionizes multivariate analysis, machine learning, and time series modelling by bridging the gap between complexity and usability. With its intuitive interface and advanced statistical techniques, the application empowers users to explore intricate datasets, discover hidden patterns, and make informed decisions. Interactive visualizations and filtering capabilities enable users to identify correlations, dependencies, and influential factors among multiple variables. Moreover, the integration of machine learning algorithms empowers users to leverage predictive analytics, allowing for the creation of robust models that uncover latent insights within the data and make accurate predictions for informed decision-making. Additionally, the application incorporates state-of-the-art algorithms for time series analysis, simplifying the analysis of temporal patterns, the forecasting of future trends, and the optimization of model parameters. This ground-breaking tool is designed to unlock the full potential of data, enabling users to drive impactful outcomes.

Item Breaking the norm: approaches for symmetric, positive, and skewed data (University of Pretoria, 2023-11-06) Bekker, Andriette, 1958-; Arashi, Mohammad; matthias@dilectum.co.za; Wagener, Matthias
This research contributes to the advancement of flexible and interpretable models within distribution theory, which is a fundamental aspect of numerous academic disciplines. The study investigates and presents the derivative-kernel approach for extending distributions. This method yields new distributions for symmetric, skew, and positive data, making it applicable to a wide range of modelling tasks. The newly derived distributions enhance the normal and gamma distributions by incorporating easily interpretable and identifiable parameters while retaining tractable mathematical properties. Furthermore, these models have a solid statistical foundation for simulation and prediction through stochastic representations, and they demonstrate considerable flexibility and strong modelling performance when applied to real data. The introduced skew distribution presents a new skewing mechanism that combines the best features of current leading methods, leading to improved accuracy and flexibility when modelling skewed data patterns. In today's rapidly evolving data landscape, with increasingly intricate data structures, these advancements provide vital tools for effectively interpreting and analysing diverse data patterns encountered in economics, psychology, engineering, and biology.