Feature engineered embeddings for classification of molecular data

Jardim, Claudio; De Waal, Alta; Fabris-Rotelli, Inger Nicolette; Rad, Najmeh Nakhaei; Mazarura, Jocelyn; Sherry, Dean

Files

Jardim_Feature_2024.pdf (1.11 MB)

Date

2024-06

Publisher

Elsevier

Abstract

The classification of molecules is of particular importance to the drug discovery process and several other use cases. Data in this domain can be partitioned into structural and sequence/text data. Several techniques such as deep learning are able to classify molecules and predict their functions using both types of data. Molecular structure and encoded chemical information are sufficient to classify a characteristic of a molecule. However, the use of a molecule’s structural information typically requires large amounts of computational power with deep learning models that take a long time to train. In this study, we present an alternative approach to molecule classification that addresses the limitations of other techniques. This approach uses natural language processing techniques in the form of count vectorisation, term frequency-inverse document frequency, word2vec and Latent Dirichlet Allocation to feature engineer molecular text data. Through this approach, we aim to make a robust and easily reproducible embedding that is fast to implement and solely dependent on chemical (text) data such as the sequence of a protein. Further, we investigate the usefulness of these embeddings for machine learning models. We apply the techniques to two different types of molecular text data: FASTA sequence data and Simplified Molecular Input Line Entry Specification data. We show that these embeddings provide excellent performance for classification.
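As a rough illustration of the first two techniques named in the abstract, the sketch below builds count-vector and term frequency-inverse document frequency embeddings from sequence text using only the Python standard library. It is not the authors' implementation. The overlapping k-mer tokenisation, the smoothed IDF formula, the L2 normalisation, the function names (`kmers`, `count_vectors`, `tfidf_vectors`) and the toy sequences are all assumptions made for this example; the abstract does not specify how the molecular text is tokenised or weighted.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (an assumed tokenisation)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_vectors(sequences, k=3):
    """Return the sorted vocabulary and one raw count vector per sequence."""
    tokenised = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*tokenised))
    return vocab, [[counts[t] for t in vocab] for counts in tokenised]


def tfidf_vectors(count_matrix):
    """Weight counts by a smoothed IDF, then L2-normalise each row."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: number of sequences containing each k-mer.
    df = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    # Smoothed IDF (an assumed, commonly used variant).
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in count_matrix:
        weighted = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0
        result.append([v / norm for v in weighted])
    return result


# Toy, made-up sequences for illustration only (not data from the study).
sequences = ["MKTAYIAK", "MKTAYLLK", "GGSGGSGG"]
vocab, counts = count_vectors(sequences, k=3)
tfidf = tfidf_vectors(counts)
```

Each row of `counts` or `tfidf` is a fixed-length vector that could be passed to a standard classifier. The same functions would apply to SMILES strings if character k-mers were an acceptable tokenisation, which is again an assumption rather than something the abstract states.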

Keywords

Property prediction, Latent dirichlet allocation (LDA), Molecular data, Embedding techniques, Text data, Text embedding, Machine learning, Simplified molecular input line entry specification (SMILES), FASTA, SDG-09: Industry, innovation and infrastructure

Sustainable Development Goals

SDG-09: Industry, innovation and infrastructure

Citation

Jardim, C., De Waal, A., Fabris-Rotelli, I. et al. 2024, 'Feature engineered embeddings for classification of molecular data', Computational Biology and Chemistry, vol. 110, art. 108056, pp. 1-10, doi: 10.1016/j.compbiolchem.2024.108056.

URI

http://hdl.handle.net/2263/97903

Collections

Research Articles (Statistics)
Research Articles (University of Pretoria)
