Maximum likelihood estimation for Cox regression under risk set sampling

doi:https://doi.org/10.25403/UPresearchdata.28395101

dc.contributor.advisor	Nakhaeirad, Najmeh
dc.contributor.coadvisor	Nasejje, Justine
dc.contributor.email	u19044438@tuks.co.za	en_US
dc.contributor.postgraduate	Mashinini, Nontokozo
dc.date.accessioned	2025-02-11T20:54:38Z
dc.date.available	2025-02-11T20:54:38Z
dc.date.created	2025-05
dc.date.issued	2025-02
dc.description	Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2025.	en_US
dc.description.abstract	In certain epidemiological studies, researchers investigate specific events, such as disease outcomes, and their associated risk factors within a cohort. In the era of big data, however, analyzing the entire cohort can be time-consuming because of the large volume of data. A nested case-control design addresses this challenge by allowing quicker and more efficient analysis that focuses on a sample of cases and matched controls within the same population. In survival analysis, the cohort dataset is crucial for defining the risk sets used in Cox proportional hazards (CPH) model optimization. These risk sets are integral to the Cox partial likelihood function, which is used to fit the model. This research applies the nested case-control design to these risk sets via a simulation study, exploring case-control structures of 1:1, 1:2, 1:4, and 1:8. The study investigates whether the size of the sampled risk sets affects the time efficiency of the model and the precision of the estimated parameters under two optimization methods: Newton-Raphson (NR) and Stochastic Gradient Descent (SGD). Results from optimizing the four case-control structures using NR suggest that the CPH model's parameter estimates converge to the true values, with bias decreasing as the number of controls per case decreases, although there are minor fluctuations for some numbers of controls (for example, the positive bias values for $\beta_1$ obtained via the four different case-control structures are 0.041, 0.039, 0.080, 0.133, and 0.002). The CPH model fitted with NR performed well with a complete risk set in a large dataset and continued to perform well with small datasets, though not as effectively as with the larger one. 
When the CPH model is optimized using SGD across the four case-control structures, it converges to the true parameter values, particularly when the sample size is large and a complete risk set is used. This study demonstrates how large datasets can be efficiently scaled in survival analysis studies, providing valuable insights into parameter precision. The estimates derived from the real datasets using the NR and SGD optimization techniques were generally similar, with slight differences across the various case-control structures. The full risk set estimates were used as a reference for comparison with those from the different case-control structures. This research finds that, in risk set sampling with a nested case-control design, using fewer controls per case leads to a case-control framework that more closely approximates the true values, providing valuable insights into the trade-offs between time efficiency and precision in parameter estimation.	en_US
dc.description.availability	Unrestricted	en_US
dc.description.degree	MSc (Advanced Data Analytics)	en_US
dc.description.department	Statistics	en_US
dc.description.faculty	Faculty of Natural and Agricultural Sciences	en_US
dc.description.sdg	SDG-03: Good health and well-being	en_US
dc.identifier.citation	*	en_US
dc.identifier.doi	https://doi.org/10.25403/UPresearchdata.28395101	en_US
dc.identifier.other	A2025	en_US
dc.identifier.uri	http://hdl.handle.net/2263/100744
dc.language.iso	en	en_US
dc.publisher	University of Pretoria
dc.rights	© 2023 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject	UCTD	en_US
dc.subject	Sustainable Development Goals (SDGs)	en_US
dc.subject	Cox proportional hazard model	en_US
dc.subject	Newton Raphson	en_US
dc.subject	Stochastic gradient descent	en_US
dc.subject	Risk set sampling	en_US
dc.subject	Nested case control sampling	en_US
dc.title	Maximum likelihood estimation for Cox regression under risk set sampling	en_US
dc.type	Dissertation	en_US
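
The abstract describes fitting the Cox partial likelihood on sampled risk sets (one case plus m controls) with Newton-Raphson. A minimal sketch of that idea follows; it is not the dissertation's code. The function names (`simulate_cohort`, `sample_risk_sets`, `newton_raphson`), the exponential simulation model, and the assumption of no tied event times are all illustrative choices made here.

```python
import numpy as np

def simulate_cohort(n, beta, rng):
    """Hypothetical cohort: exponential event times, independent censoring."""
    X = rng.normal(size=(n, beta.size))
    t_event = rng.exponential(1.0 / np.exp(X @ beta))  # hazard = exp(x'beta)
    t_cens = rng.exponential(2.0, size=n)
    return X, np.minimum(t_event, t_cens), (t_event <= t_cens).astype(int)

def sample_risk_sets(time, event, m, rng):
    """For each case, keep the case plus up to m controls drawn at random
    from the other subjects still at risk at the case's event time."""
    idx = np.arange(time.size)
    sets = []
    for i in np.flatnonzero(event == 1):
        at_risk = idx[(time >= time[i]) & (idx != i)]
        k = min(m, at_risk.size)
        controls = rng.choice(at_risk, size=k, replace=False)
        sets.append(np.concatenate(([i], controls)))  # case is always first
    return sets

def newton_raphson(X, sets, n_iter=25, tol=1e-8):
    """Maximize sum over sets of x_case'beta - log(sum_j exp(x_j'beta))."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = np.zeros_like(beta)
        hess = np.zeros((beta.size, beta.size))
        for s in sets:
            Xs = X[s]
            eta = Xs @ beta
            w = np.exp(eta - eta.max())
            w /= w.sum()                      # softmax weights within the set
            mean = w @ Xs                     # weighted covariate mean
            grad += Xs[0] - mean              # score contribution of the case
            hess -= (Xs * w[:, None]).T @ Xs - np.outer(mean, mean)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                    # Newton-Raphson update
        if np.max(np.abs(step)) < tol:
            break
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    true_beta = np.array([0.5, -0.5])         # assumed values for illustration
    X, time, event = simulate_cohort(1000, true_beta, rng)
    for m in (1, 2, 4, 8):                    # the 1:1, 1:2, 1:4, 1:8 structures
        sets = sample_risk_sets(time, event, m, rng)
        print(f"1:{m}", newton_raphson(X, sets))
```

Replacing the sampled sets with the complete risk set at each event time recovers the usual full-cohort partial likelihood, which is the reference the abstract compares against.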

Files

Original bundle

Name: Mashinini_Maximum_2025.pdf
Size: 4.37 MB
Format: Adobe Portable Document Format
Description: Dissertation

Collections

Theses and Dissertations (University of Pretoria)
Theses and Dissertations (Statistics)