Maximum likelihood estimation for Cox regression under risk set sampling

dc.contributor.advisor: Nakhaeirad, Najmeh
dc.contributor.coadvisor: Nasejje, Justine
dc.contributor.email: u19044438@tuks.co.za [en_US]
dc.contributor.postgraduate: Mashinini, Nontokozo
dc.date.accessioned: 2025-02-11T20:54:38Z
dc.date.available: 2025-02-11T20:54:38Z
dc.date.created: 2025-05
dc.date.issued: 2025-02
dc.description: Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2025. [en_US]
dc.description.abstract: In certain epidemiological studies, researchers aim to investigate specific events, such as disease outcomes, and their associated risk factors within a cohort. However, in the era of big data, analyzing the entire cohort can be time-consuming because of the large volume of data. To address this challenge, a nested case-control design can be employed, allowing quicker and more efficient analysis by focusing on a sample of cases and matched controls within the same population. In survival analysis, the cohort dataset is crucial for defining the risk sets in Cox proportional hazards (CPH) model optimization. These risk sets are integral to the Cox partial likelihood function, which is used to fit the model. This research applies the nested case-control design to these risk sets via a simulation study, exploring case-control structures of 1:1, 1:2, 1:4, and 1:8. The study investigates whether the size of the sampled risk sets affects the time efficiency of the model and the precision of the estimated parameters under two optimization methods: Newton-Raphson (NR) and Stochastic Gradient Descent (SGD). Results from optimizing the four case-control structures using NR suggest that the CPH model's parameter estimates converge to the true values, with bias decreasing as the number of controls per case decreases, although there are minor fluctuations for some numbers of controls (for example, the positive bias values for $\beta_1$ obtained via the four different case-control structures are: 0.041, 0.039, 0.080, 0.133, 0.002). The CPH model fitted with NR performed well with a complete risk set on large datasets and continued to perform well on small datasets, though not as effectively as on the larger ones.
When the CPH model is optimized using SGD across the four case-control structures, it converges to the true parameter values, particularly when the sample size is large and a complete risk set is used. This study demonstrates how large datasets can be efficiently scaled in survival analysis studies, providing valuable insights into parameter precision. The estimates derived from the real datasets using the NR and SGD optimization techniques were generally similar, with slight differences across the various case-control structures. The full risk set estimates were used as the reference for comparison with those from the different case-control structures. This research found that, in risk set sampling with a nested case-control design, using fewer controls per case leads to a case-control framework that more closely approximates the true values, which provides valuable insight into the trade-off between time efficiency and precision in parameter estimation. [en_US]
dc.description.availability: Unrestricted [en_US]
dc.description.degree: MSc (Advanced Data Analytics) [en_US]
dc.description.department: Statistics [en_US]
dc.description.faculty: Faculty of Natural and Agricultural Sciences [en_US]
dc.description.sdg: SDG-03: Good health and well-being [en_US]
dc.identifier.citation: * [en_US]
dc.identifier.doi: https://doi.org/10.25403/UPresearchdata.28395101 [en_US]
dc.identifier.other: A2025 [en_US]
dc.identifier.uri: http://hdl.handle.net/2263/100744
dc.language.iso: en [en_US]
dc.publisher: University of Pretoria
dc.rights: © 2023 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject: UCTD [en_US]
dc.subject: Sustainable Development Goals (SDGs) [en_US]
dc.subject: Cox proportional hazard model [en_US]
dc.subject: Newton Raphson [en_US]
dc.subject: Stochastic gradient descent [en_US]
dc.subject: Risk set sampling [en_US]
dc.subject: Nested case control sampling [en_US]
dc.title: Maximum likelihood estimation for Cox regression under risk set sampling [en_US]
dc.type: Dissertation [en_US]
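
Illustrative sketch (not taken from the dissertation): the abstract describes fitting the Cox partial likelihood on risk sets reduced to one case plus m sampled controls. The Python code below is a minimal sketch of that idea under stated assumptions: simulated exponential survival times, independent exponential censoring, no tied event times, simple random sampling of controls from each risk set, and an unweighted partial likelihood over the sampled sets maximized by plain Newton-Raphson. The function names `sample_risk_sets` and `newton_raphson`, the sample size, and the true coefficients are hypothetical choices made for this example only.

```python
import numpy as np

def sample_risk_sets(time, event, m, rng):
    """For each case, return indices [case, sampled controls]; m=None keeps the full risk set."""
    sets = []
    for i in np.flatnonzero(event):
        at_risk = np.flatnonzero(time >= time[i])   # subjects still under observation
        controls = at_risk[at_risk != i]
        if m is not None and len(controls) > m:
            controls = rng.choice(controls, size=m, replace=False)
        sets.append(np.concatenate(([i], controls)))
    return sets

def newton_raphson(X, sets, n_iter=25, tol=1e-8):
    """Maximize the log partial likelihood over the given (sampled) risk sets."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = np.zeros_like(beta)
        hess = np.zeros((beta.size, beta.size))
        for idx in sets:
            Z = X[idx]                      # row 0 is the case
            eta = Z @ beta
            w = np.exp(eta - eta.max())     # numerically stable softmax weights
            w /= w.sum()
            zbar = w @ Z                    # weighted covariate mean in the risk set
            grad += Z[0] - zbar
            hess -= (Z * w[:, None]).T @ Z - np.outer(zbar, zbar)
        step = np.linalg.solve(hess, grad)
        beta = beta - step                  # Newton update: beta - H^{-1} g
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated cohort (assumed setup, for illustration only).
rng = np.random.default_rng(0)
n = 2000
beta_true = np.array([0.5, -0.5])
X = rng.normal(size=(n, 2))
T = rng.exponential(scale=np.exp(-X @ beta_true))   # hazard = exp(x' beta)
C = rng.exponential(scale=2.0, size=n)              # independent censoring
time, event = np.minimum(T, C), T <= C

sets_full = sample_risk_sets(time, event, None, rng)
sets_1to1 = sample_risk_sets(time, event, 1, rng)
beta_full = newton_raphson(X, sets_full)
beta_1to1 = newton_raphson(X, sets_1to1)
print("full risk set:", beta_full)
print("1:1 sampling: ", beta_1to1)
```

Changing the `m` argument to 2, 4, or 8 gives the other case-control structures named in the abstract. A stochastic gradient variant would replace the Newton update with small steps along the gradient computed on random subsets of the sampled risk sets.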

Files

Original bundle

Name: Mashinini_Maximum_2025.pdf
Size: 4.37 MB
Format: Adobe Portable Document Format
Description: Dissertation

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description: