Feature engineered embeddings for machine learning on molecular data

doi:10.25403/UPresearchdata.22043297

dc.contributor.advisor	De Waal, Alta
dc.contributor.email	u17029008@tuks.co.za	en_US
dc.contributor.postgraduate	Jardim, Claudio
dc.date.accessioned	2023-02-08T06:50:28Z
dc.date.available	2023-02-08T06:50:28Z
dc.date.created	2023-05
dc.date.issued	2022
dc.description	Mini Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2022.	en_US
dc.description.abstract	The classification of molecules is of particular importance to the drug discovery process and several other use cases. Data in this domain can be partitioned into structural data and sequence/text data. Several techniques, such as deep learning, are able to classify molecules and predict their functions using both types of data. Molecular structure and encoded chemical information are sufficient to classify a characteristic of a molecule. However, using a molecule’s structural information typically requires large amounts of computational power, with deep learning models that take a long time to train. In this study, we present a different approach to molecule classification that addresses the limitations of other techniques. This approach uses natural language processing techniques in the form of count vectorisation, term frequency-inverse document frequency, word2vec and latent Dirichlet allocation to feature engineer molecular text data. Through this approach, we aim to make a robust and explainable embedding that is fast to implement and solely dependent on chemical (text) data such as the sequence of a protein. Further, we investigate the usefulness of these explainable embeddings for machine learning models, for representing a corpus of data in vector space and for protein-protein interaction prediction using embedding similarity. We apply the techniques to three different types of molecular text data: FASTA sequence data, Simplified Molecular Input Line Entry System (SMILES) data and Protein Data Bank data. We show that these embeddings provide excellent performance for classification and protein-protein binding prediction.	en_US
dc.description.availability	Unrestricted	en_US
dc.description.degree	MSc (Advanced Data Analytics)	en_US
dc.description.department	Statistics	en_US
dc.identifier.citation	*	en_US
dc.identifier.doi	10.25403/UPresearchdata.22043297	en_US
dc.identifier.other	A2023
dc.identifier.uri	https://repository.up.ac.za/handle/2263/89279
dc.language.iso	en	en_US
dc.publisher	University of Pretoria
dc.rights	© 2022 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject	UCTD	en_US
dc.subject	Machine learning	en_US
dc.subject	Data science	en_US
dc.subject	Statistics	en_US
dc.subject	Biology	en_US
dc.subject	Molecules	en_US
dc.subject	Embeddings	en_US
dc.title	Feature engineered embeddings for machine learning on molecular data	en_US
dc.type	Mini Dissertation	en_US
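
The abstract describes feature engineering molecular text with count vectorisation and term frequency-inverse document frequency, then comparing embeddings by similarity. The following is a minimal sketch of that general idea, not the dissertation's implementation. It assumes sequences are tokenised into overlapping k-mers with k = 3, uses a smoothed inverse document frequency, and compares embeddings with cosine similarity. The function names and the toy sequences are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers, treated as 'words'."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_vectorise(sequences, k=3):
    """Return the sorted k-mer vocabulary and one count vector per sequence."""
    counts = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]


def tfidf(count_vectors):
    """Weight raw counts by a smoothed inverse document frequency."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    # Document frequency: number of sequences that contain each k-mer.
    df = [sum(1 for v in count_vectors if v[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return [[v[j] * idf[j] for j in range(n_terms)] for v in count_vectors]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy sequences: the first two differ only in the last residue.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSGGSGGSG"]
vocab, counts = count_vectorise(toy)
weights = tfidf(counts)
sim_close = cosine(weights[0], weights[1])
sim_far = cosine(weights[0], weights[2])
```

Each column of the resulting vectors corresponds to a named k-mer in `vocab`, which is what makes this kind of embedding inspectable. In this toy case, the two near-identical sequences share most of their k-mers and so score a higher cosine similarity than the unrelated third sequence, which shares none.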

Files

Original bundle

Name: Jardim_Feature_2022.pdf
Size: 3.18 MB
Format: Adobe Portable Document Format
Description: Mini Dissertation

License bundle

Name: license.txt
Size: 1.75 KB
Format: Item-specific license agreed upon to submission

Collections

Theses and Dissertations (University of Pretoria)
Theses and Dissertations (Statistics)
