Identifying financial risk through natural language processing of company annual reports

dc.contributor.advisor: Marivate, Vukosi
dc.contributor.email: u19333634@tuks.co.za
dc.contributor.postgraduate: Theron, Jacques Lamont
dc.date.accessioned: 2022-01-12T06:00:02Z
dc.date.available: 2022-01-12T06:00:02Z
dc.date.created: 2021/04/13
dc.date.issued: 2020
dc.description: Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2020.
dc.description.abstract: A pipeline was developed to source annual reports of South African banks and convert them into a novel corpus. Plain text was extracted from unstructured reports whilst maintaining lineage to its coordinates in the original Portable Document Format (PDF). Initial experiments with Natural Language Processing (NLP) and machine learning classification aim at exposing financial risk inherent in the text as opposed to analysing the numerical financial values. Failed financial or governance events related to banks in the public domain were used to label annual reports as high risk. The balance of the reports were annotated as low risk to formulate a binary classification problem for machine learning. Bag of words and word embedding techniques were applied and supplemented with linguistic features like tone, uncertainty and causality based on available wordlists. Classifiers were built using traditional logistic regression and Support Vector Machine (SVM), as well as modern Long Short-Term Memory (LSTM) and Convolutional Neural Network (CNN) deep learning models. The corpus and initial findings provide a baseline for further research. Applications include an early warning system for regulators as well as question answering based on the content.
dc.description.availability: Unrestricted
dc.description.degree: MIT (Big Data Science)
dc.description.department: Computer Science
dc.identifier.citation: *
dc.identifier.other: A2021
dc.identifier.uri: http://hdl.handle.net/2263/83181
dc.language.iso: en
dc.publisher: University of Pretoria
dc.rights: © 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject: UCTD
dc.subject: Financial risk
dc.subject: Company annual reports
dc.subject: Natural language processing (NLP)
dc.subject: Machine learning
dc.subject: Classification
dc.subject: Closed domain question answering
dc.title: Identifying financial risk through natural language processing of company annual reports
dc.type: Mini Dissertation
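
The abstract mentions supplementing bag-of-words features with linguistic features such as tone, uncertainty and causality derived from wordlists. As a rough illustration only, and not the dissertation's actual implementation, the sketch below computes the share of tokens in a text that fall in each wordlist. The wordlists, the `WORDLISTS` name and the `linguistic_features` function are hypothetical placeholders; the record does not say which wordlists or tokenisation were used.

```python
import re
from collections import Counter

# Hypothetical miniature wordlists, for illustration only.
# The wordlists actually used in the dissertation are not given in this record.
WORDLISTS = {
    "negative_tone": {"loss", "impairment", "default", "litigation"},
    "uncertainty": {"may", "might", "uncertain", "approximately"},
    "causality": {"because", "therefore", "due", "consequently"},
}


def linguistic_features(text):
    """Return, per wordlist, the fraction of tokens that appear in it."""
    # Simple lowercase alphabetic tokenisation (an assumption, not the source's method).
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    features = {}
    for name, words in WORDLISTS.items():
        hits = sum(counts[w] for w in words)
        # Guard against empty input to avoid division by zero.
        features[name] = hits / total if total else 0.0
    return features


sample = "Impairment losses may increase because credit default risk is uncertain."
print(linguistic_features(sample))
```

Ratios like these could be appended to a bag-of-words vector before it is passed to a classifier. Exact-match lookup misses inflected forms (here "losses" does not match "loss"), so a real pipeline would likely need stemming or a fuller wordlist.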

Files

Original bundle

Name: Theron_Identifying_2020.pdf
Size: 8.31 MB
Format: Adobe Portable Document Format