Feature engineered embeddings for classification of molecular data

Jardim, Claudio; De Waal, Alta; Fabris-Rotelli, Inger Nicolette; Rad, Najmeh Nakhaei; Mazarura, Jocelyn; Sherry, Dean

dc.contributor.author	Jardim, Claudio
dc.contributor.author	De Waal, Alta
dc.contributor.author	Fabris-Rotelli, Inger Nicolette
dc.contributor.author	Rad, Najmeh Nakhaei
dc.contributor.author	Mazarura, Jocelyn
dc.contributor.author	Sherry, Dean
dc.date.accessioned	2024-08-28T08:11:25Z
dc.date.available	2024-08-28T08:11:25Z
dc.date.issued	2024-06
dc.description.abstract	The classification of molecules is of particular importance to the drug discovery process and several other use cases. Data in this domain can be partitioned into structural and sequence/text data. Several techniques such as deep learning are able to classify molecules and predict their functions using both types of data. Molecular structure and encoded chemical information are sufficient to classify a characteristic of a molecule. However, the use of a molecule’s structural information typically requires large amounts of computational power with deep learning models that take a long time to train. In this study, we present an alternative approach to molecule classification that addresses the limitations of other techniques. This approach uses natural language processing techniques in the form of count vectorisation, term frequency-inverse document frequency, word2vec and Latent Dirichlet Allocation to feature engineer molecular text data. Through this approach, we aim to make a robust and easily reproducible embedding that is fast to implement and solely dependent on chemical (text) data such as the sequence of a protein. Further, we investigate the usefulness of these embeddings for machine learning models. We apply the techniques to two different types of molecular text data: FASTA sequence data and Simplified Molecular Input Line Entry Specification data. We show that these embeddings provide excellent performance for classification.	en_US
dc.description.department	Statistics	en_US
dc.description.librarian	hj2024	en_US
dc.description.sdg	SDG-09: Industry, innovation and infrastructure	en_US
dc.description.sponsorship	In part by the RDP grant at the University of Pretoria, and the National Research Foundation (NRF) of South Africa.	en_US
dc.description.uri	https://www.elsevier.com/locate/cbac	en_US
dc.identifier.citation	Jardim, C., De Waal, A., Fabris-Rotelli, I. et al. 2024, 'Feature engineered embeddings for classification of molecular data', Computational Biology and Chemistry, vol. 110, art. 108056, pp. 1-10, doi: 10.1016/j.compbiolchem.2024.108056.	en_US
dc.identifier.issn	1476-9271 (print)
dc.identifier.issn	1476-928X (online)
dc.identifier.other	10.1016/j.compbiolchem.2024.108056
dc.identifier.uri	http://hdl.handle.net/2263/97903
dc.language.iso	en	en_US
dc.publisher	Elsevier	en_US
dc.rights	© 2024 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC license.	en_US
dc.subject	Property prediction	en_US
dc.subject	Latent dirichlet allocation (LDA)	en_US
dc.subject	Molecular data	en_US
dc.subject	Embedding techniques	en_US
dc.subject	Text data	en_US
dc.subject	Text embedding	en_US
dc.subject	Machine learning	en_US
dc.subject	Simplified molecular input line entry specification (SMILES)	en_US
dc.subject	FASTA	en_US
dc.subject	SDG-09: Industry, innovation and infrastructure	en_US
dc.title	Feature engineered embeddings for classification of molecular data	en_US
dc.type	Article	en_US
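
The abstract names count vectorisation and term frequency-inverse document frequency as two of the techniques used to turn molecular text into features. The sketch below illustrates that general idea only and is not the authors' implementation. It rests on several assumptions that the record does not state: sequences are tokenised into overlapping k-character "words" (k = 3), the inverse document frequency uses a smoothed form, and the function names (`kmers`, `count_vectorise`, `tfidf`) and toy sequences are hypothetical.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Split a sequence string into overlapping k-character 'words'.

    Assumption: k-mer tokenisation; the record does not say how the
    paper tokenises FASTA or SMILES strings.
    """
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_vectorise(sequences, k=3):
    """Return (vocabulary, count matrix) for a list of sequence strings."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    # A Counter returns 0 for terms that do not occur in a sequence.
    matrix = [[doc[term] for term in vocab] for doc in docs]
    return vocab, matrix


def tfidf(matrix):
    """Weight raw counts by a smoothed inverse document frequency.

    Assumption: idf = ln((1 + n_docs) / (1 + df)) + 1, one common variant.
    """
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]


# Toy, made-up sequences for illustration (not the paper's data).
toy_sequences = ["MKTAYIAK", "MKTAYLLK", "GGAGGA"]
vocab, counts = count_vectorise(toy_sequences, k=3)
weights = tfidf(counts)
```

Each row of `counts` or `weights` is a fixed-length vector for one sequence and could be passed to any standard classifier. Terms shared by many sequences (such as `MKT` here) receive a lower weight per occurrence than terms found in only one.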