Maximum likelihood estimation for Cox regression under risk set sampling
Loading...
Date
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
University of Pretoria
Abstract
In certain epidemiological studies, researchers aim to investigate specific events, such as disease outcomes, and their associated risk factors within a cohort. However, in the era of big data, analyzing the entire cohort can be time-consuming due to the large volume of data. To address this challenge, a nested case-control design can be employed, allowing for quicker and more efficient analysis by focusing on a sample of cases and matched controls within the same population. In survival analysis, the cohort dataset is crucial for defining the risk sets in Cox proportional hazards (CPH) model optimization. These risk sets are integral to the Cox partial likelihood function, which is used to fit the model. This research seeks to apply the nested case-control design to these risk sets via a simulation study, specifically exploring various case-control structures such as 1:1, 1:2, 1:4, and 1:8.The study aims to investigate whether the size of sampled risk sets impacts the time efficiency of the model and the precision of the estimated parameters using two optimization methods: Newton Raphson (NR) and Stochastic Gradient Descent (SGD). Results from optimizing the four different case-control structures using NR suggest that the CPH model's parameter estimates converge to the true values, with bias decreasing as the number of controls per case decreases although there are minor fluctuations in some controls.(for example the positive bias values for $\beta_1$ obtained via the four different case-control structures are: 0.041, 0.039, 0.080, 0.133, 0.002). The CPH model fitted with NR performed well with a complete risk set in a large-sized datasets and continued to perform well with small-sized datasets, though not as effectively as with the larger one. When the CPH model is optimized using SGD across the four different case-control structures, it converges to the true parameter values, particularly when the sample size is large and a complete risk set is used. This study demonstrates how large datasets can be efficiently scaled in survival analysis studies, providing valuable insights relating to parameter precision.The estimates derived from the real data sets using both NR and SGD optimization techniques were generally similar, though with slight differences across the various case-control structures. The full risk set estimates were used as a reference for comparison with those from the different case-control structures. We have discovered in this research that in risk set sampling with a nested case-control design, using fewer controls per case leads to a case-control framework that more closely approximates the true values providing valuable insights into the trade-offs between time efficiency and precision in parameter estimation.
Description
Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2025.
Keywords
UCTD, Sustainable Development Goals (SDGs), Cox proportional hazard model, Newton Raphson, Stochastic gradient descent, Risk set sampling, Nested case control sampling
Sustainable Development Goals
SDG-03: Good health and well-being
Citation
*