Feature engineered embeddings for classification of molecular data

dc.contributor.author: Jardim, Claudio
dc.contributor.author: De Waal, Alta
dc.contributor.author: Fabris-Rotelli, Inger Nicolette
dc.contributor.author: Rad, Najmeh Nakhaei
dc.contributor.author: Mazarura, Jocelyn
dc.contributor.author: Sherry, Dean
dc.contributor.email: u17029008@tuks.co.za [en_US]
dc.date.accessioned: 2024-08-28T08:11:25Z
dc.date.available: 2024-08-28T08:11:25Z
dc.date.issued: 2024-06
dc.description.abstract: The classification of molecules is of particular importance to the drug discovery process and several other use cases. Data in this domain can be partitioned into structural and sequence/text data. Several techniques such as deep learning are able to classify molecules and predict their functions using both types of data. Molecular structure and encoded chemical information are sufficient to classify a characteristic of a molecule. However, the use of a molecule’s structural information typically requires large amounts of computational power with deep learning models that take a long time to train. In this study, we present an alternative approach to molecule classification that addresses the limitations of other techniques. This approach uses natural language processing techniques in the form of count vectorisation, term frequency-inverse document frequency, word2vec and Latent Dirichlet Allocation to feature engineer molecular text data. Through this approach, we aim to make a robust and easily reproducible embedding that is fast to implement and solely dependent on chemical (text) data such as the sequence of a protein. Further, we investigate the usefulness of these embeddings for machine learning models. We apply the techniques to two different types of molecular text data: FASTA sequence data and Simplified Molecular Input Line Entry Specification data. We show that these embeddings provide excellent performance for classification. [en_US]
dc.description.department: Statistics [en_US]
dc.description.librarian: hj2024 [en_US]
dc.description.sdg: SDG-09: Industry, innovation and infrastructure [en_US]
dc.description.sponsorship: In part by the RDP grant at the University of Pretoria, and the National Research Foundation (NRF) of South Africa. [en_US]
dc.description.uri: https://www.elsevier.com/locate/cbac [en_US]
dc.identifier.citation: Jardim, C., De Waal, A., Fabris-Rotelli, I. et al. 2024, 'Feature engineered embeddings for classification of molecular data', Computational Biology and Chemistry, vol. 110, art. 108056, pp. 1-10, doi: 10.1016/j.compbiolchem.2024.108056. [en_US]
dc.identifier.issn: 1476-9271 (print)
dc.identifier.issn: 1476-928X (online)
dc.identifier.other: 10.1016/j.compbiolchem.2024.108056
dc.identifier.uri: http://hdl.handle.net/2263/97903
dc.language.iso: en [en_US]
dc.publisher: Elsevier [en_US]
dc.rights: © 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC license. [en_US]
dc.subject: Property prediction [en_US]
dc.subject: Latent dirichlet allocation (LDA) [en_US]
dc.subject: Molecular data [en_US]
dc.subject: Embedding techniques [en_US]
dc.subject: Text data [en_US]
dc.subject: Text embedding [en_US]
dc.subject: Machine learning [en_US]
dc.subject: Simplified molecular input line entry specification (SMILES) [en_US]
dc.subject: FASTA [en_US]
dc.subject: SDG-09: Industry, innovation and infrastructure [en_US]
dc.title: Feature engineered embeddings for classification of molecular data [en_US]
dc.type: Article [en_US]

Files

Original bundle

Name: Jardim_Feature_2024.pdf
Size: 1.11 MB
Format: Adobe Portable Document Format
Description: Article

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission