Identifying Financial Risk through Natural Language Processing of Company Annual Reports
Publisher
University of Pretoria
Abstract
A pipeline was developed to source annual reports of South African banks and convert
them into a novel corpus. Plain text was extracted from unstructured reports whilst
maintaining lineage to its coordinates in the original Portable Document Format (PDF).
Initial experiments with Natural Language Processing (NLP) and machine learning
classification aim at exposing financial risk inherent in the text as opposed to analysing the
numerical financial values. Failed financial or governance events related to banks in the
public domain were used to label annual reports as high risk. The balance of the
reports were annotated as low risk to formulate a binary classification problem for machine
learning. Bag of words and word embedding techniques were applied and supplemented
with linguistic features like tone, uncertainty and causality based on available wordlists.
Classifiers were built using traditional logistic regression and Support Vector Machine
(SVM), as well as modern Long Short-Term Memory (LSTM) and Convolutional Neural
Network (CNN) deep learning models. The corpus and initial findings provide a baseline
for further research. Applications include an early warning system for regulators as well
as question answering based on the content.
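As a rough illustration of the feature step described above, the following minimal sketch computes bag-of-words counts for a passage and supplements them with one ratio per wordlist (tone, uncertainty, causality). The wordlists, the function names and the sample sentence are hypothetical placeholders, not the wordlists or code used in the dissertation, and no classifier is trained here.

```python
import re
from collections import Counter

# Hypothetical mini wordlists for illustration only; the wordlists
# actually used in the dissertation are not reproduced here.
WORDLISTS = {
    "negative_tone": {"loss", "impairment", "default", "fraud"},
    "uncertainty": {"may", "might", "uncertain", "approximately"},
    "causality": {"because", "due", "therefore", "consequently"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def featurize(text):
    """Return bag-of-words counts plus one ratio per wordlist."""
    tokens = tokenize(text)
    bag = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    ratios = {
        name: sum(bag[word] for word in words) / total
        for name, words in WORDLISTS.items()
    }
    return bag, ratios


# Hypothetical sentence in the style of an annual report.
sample = "Impairment losses may rise because of uncertain credit conditions."
bag, ratios = featurize(sample)
print(bag.most_common(3))
print(ratios)
```

In a setup like the one the abstract outlines, feature vectors of this kind would then be passed to a classifier such as logistic regression or an SVM, with each report labelled high risk or low risk.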
Description
Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2020.
Keywords
UCTD, financial risk, company annual reports, natural language processing, machine learning, classification, closed domain question answering