Abstract:
The increasing complexity of software applications can lead to operational failures that have disastrous consequences. To prevent the recurrence of such failures, a thorough post-mortem investigation is required to identify their root causes. This root-cause analysis must be based on reliable digital evidence to ensure its objectivity and accuracy. However, current approaches to software failure analysis do not promote the collection of digital evidence for causal analysis. This leaves the system vulnerable to the recurrence of a similar failure.
A promising alternative is offered by the field of digital forensics. Digital forensics uses proven scientific methods and principles of law to determine the cause of an event based on forensically sound evidence. However, being a reactive process, digital forensics can only be applied after the occurrence of costly failures. This limits its effectiveness, as volatile data that could serve as evidence may be destroyed or corrupted after a system crash.
To address this limitation of digital forensics, it is suggested that evidence collection be started at an earlier stage, before the software failure actually unfolds, so as to detect the high-risk conditions that can lead to a major failure. These forerunners to failures are known as near misses. By alerting system users to an upcoming failure, the detection of near misses provides an opportunity to collect, at runtime, failure-related data that is complete and relevant.
The detection of near misses is usually performed through electronic near-miss management systems (NMS). An NMS that combines near-miss analysis and digital forensics can significantly improve the accuracy of failure analysis. However, such a system is not yet available, and its design still presents several challenges: neither digital forensics nor near-miss analysis is currently used to investigate software failures, and their existing methodologies and processes are not directly applicable to failure analysis.
This research therefore presents the architecture of an NMS specifically designed to address the above challenges in order to facilitate the accurate forensic investigation of software failures. The NMS focuses on the detection of near misses at runtime with a view to maximising the collection of appropriate digital evidence of the failure. The detection process is based on a mathematical model that was developed to formally define a near miss and calculate its risk level. A prototype of the NMS has been implemented and is discussed in the thesis.