Feature engineered embeddings for machine learning on molecular data

Jardim, Claudio


URI: https://repository.up.ac.za/handle/2263/89279

Date: 2022

Abstract:

The classification of molecules is of particular importance to the drug discovery process and several other use cases. Data in this domain can be partitioned into structural data and sequence/text data. Several techniques, such as deep learning, are able to classify molecules and predict their functions using both types of data. Molecular structure and encoded chemical information are sufficient to classify a characteristic of a molecule. However, using a molecule's structural information typically requires large amounts of computational power, with deep learning models that take a long time to train. In this study, we present a different approach to molecule classification that addresses the limitations of other techniques. This approach uses natural language processing techniques, in the form of count vectorisation, term frequency-inverse document frequency, word2vec and latent Dirichlet allocation, to feature engineer molecular text data. Through this approach, we aim to make a robust and explainable embedding that is fast to implement and solely dependent on chemical (text) data such as the sequence of a protein. Further, we investigate the usefulness of these explainable embeddings for machine learning models, for representing a corpus of data in vector space, and for protein-protein interaction prediction using embedding similarity. We apply the techniques to three different types of molecular text data: FASTA sequence data, Simplified Molecular Input Line Entry Specification data and Protein Data Bank data. We show that these embeddings provide excellent performance for classification and protein-protein binding prediction.
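
As an illustration of the general idea described in the abstract, the following is a minimal sketch of count vectorisation, term frequency-inverse document frequency weighting and embedding similarity applied to sequence text. It is not the dissertation's implementation: the abstract does not say how sequences are tokenised, so the overlapping k-mer "words", the smoothed IDF formula, the function names and the three toy sequences are all assumptions made for this example.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (an assumed tokenisation)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_vectors(sequences, k=3):
    """Count vectorisation: one k-mer count vector per sequence, shared vocabulary."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    return vocab, [[doc[word] for word in vocab] for doc in docs]


def tfidf(vectors):
    """Weight raw counts by a smoothed inverse document frequency."""
    n_docs = len(vectors)
    n_terms = len(vectors[0])
    doc_freq = [sum(1 for v in vectors if v[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + df)) + 1 for df in doc_freq]
    return [[v[j] * idf[j] for j in range(n_terms)] for v in vectors]


def cosine(a, b):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy sequences, not taken from the dissertation's data.
sequences = ["MKTAYIAKQR", "MKTAYIAKQK", "GSHMLEDPVD"]
vocab, counts = count_vectors(sequences)
weighted = tfidf(counts)

sim_close = cosine(weighted[0], weighted[1])  # sequences sharing most 3-mers
sim_far = cosine(weighted[0], weighted[2])    # sequences sharing no 3-mers
print(f"similar pair: {sim_close:.3f}, dissimilar pair: {sim_far:.3f}")
```

Each dimension of the resulting vector corresponds to a named k-mer in `vocab`, which is one way such an embedding can remain explainable. The same vectors could be passed to a classifier or compared pairwise, as done here with cosine similarity.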

Description:

Mini Dissertation (MSc (Advanced Data Analytics))--University of Pretoria, 2022.


Files in this item

Name: Jardim_Feature_20 ...

Size: 3.177Mb

Format: PDF

Description: Mini Dissertation
