dc.contributor.advisor |
De Waal, Alta |
|
dc.contributor.postgraduate |
Jardim, Claudio |
|
dc.date.accessioned |
2023-02-08T06:50:28Z |
|
dc.date.available |
2023-02-08T06:50:28Z |
|
dc.date.created |
2023-05 |
|
dc.date.issued |
2022 |
|
dc.description |
Mini Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2022. |
en_US |
dc.description.abstract |
The classification of molecules is of particular importance to the drug discovery process and several other use cases. Data in this domain can be partitioned into structural and sequence/text data. Several techniques such as deep learning are able to classify molecules and predict their functions using both types of data. Molecular structure and encoded chemical information are sufficient to classify a characteristic of a molecule. However, the use of a molecule’s structural information typically requires large amounts of computational power with deep learning models that take a long time to train. In this study, we present a different approach to molecule classification that addresses the limitations of other techniques. This approach uses natural language processing techniques in the form of count vectorisation, term frequency-inverse document frequency, word2vec and latent Dirichlet allocation to feature engineer molecular text data. Through this approach we aim to make a robust and explainable embedding that is fast to implement and solely dependent on chemical (text) data such as the sequence of a protein. Further, we investigate the usefulness of these explainable embeddings for machine learning models, for representing a corpus of data in vector space and for protein-protein interaction prediction using embedding similarity. We apply the techniques on three different types of molecular text data: FASTA sequence data, Simplified Molecular Input Line Entry Specification data and Protein Data Bank data. We show that these embeddings provide excellent performance for classification and protein-protein bind prediction. |
en_US |
dc.description.availability |
Unrestricted |
en_US |
dc.description.degree |
MSc (Advanced Data Analytics) |
en_US |
dc.description.department |
Statistics |
en_US |
dc.identifier.citation |
* |
en_US |
dc.identifier.doi |
10.25403/UPresearchdata.22043297 |
en_US |
dc.identifier.other |
A2023 |
|
dc.identifier.uri |
https://repository.up.ac.za/handle/2263/89279 |
|
dc.language.iso |
en |
en_US |
dc.publisher |
University of Pretoria |
|
dc.rights |
© 2022 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria. |
|
dc.subject |
UCTD |
en_US |
dc.subject |
machine learning |
en_US |
dc.subject |
Data science |
en_US |
dc.subject |
Statistics |
en_US |
dc.subject |
Biology |
en_US |
dc.subject |
Molecules |
en_US |
dc.subject |
Embeddings |
en_US |
dc.title |
Feature engineered embeddings for machine learning on molecular data |
en_US |
dc.type |
Mini Dissertation |
en_US |