Abstract:
The classification of molecules is of particular importance to the drug discovery process and several other
use cases. Data in this domain can be partitioned into structural and sequence/text data. Several
techniques such as deep learning are able to classify molecules and predict their functions using both types of
data. Molecular structure and encoded chemical information are sufficient to classify a characteristic of
a molecule. However, using a molecule’s structural information typically requires large amounts of
computational power, and the deep learning models involved take a long time to train. In this study, we present
a different approach to molecule classification that addresses these limitations. This
approach uses natural language processing techniques, namely count vectorisation, term frequency-inverse
document frequency, word2vec and latent Dirichlet allocation, to engineer features from molecular text
data. Through this approach we aim to make a robust and explainable embedding that is fast to implement
and solely dependent on chemical (text) data such as the sequence of a protein. Further, we
investigate the usefulness of these explainable embeddings for machine learning models, for representing
a corpus of data in vector space and for protein-protein interaction prediction using embedding similarity.
We apply the techniques to three different types of molecular text data: FASTA sequence data, Simplified
Molecular Input Line Entry System (SMILES) data and Protein Data Bank (PDB) data. We show that these
embeddings provide excellent performance for classification and protein-protein binding prediction.
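
To make the idea of a text-only embedding concrete, the following is a minimal sketch of count vectorisation and term frequency-inverse document frequency applied to sequence data, followed by a cosine similarity comparison. It is not the implementation used in this study. Several details are assumptions made only for illustration: sequences are tokenised into overlapping k-mers with k = 3, the inverse document frequency uses a common smoothed form, and the three toy sequences are made up.

```python
import math
from collections import Counter


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (assumed tokenisation)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def count_vectors(sequences, k=3):
    """Count vectorisation: one k-mer count vector per sequence."""
    docs = [Counter(kmers(s, k)) for s in sequences]
    vocab = sorted(set().union(*docs))
    return vocab, [[doc[term] for term in vocab] for doc in docs]


def tfidf_vectors(counts):
    """Re-weight raw counts by a smoothed inverse document frequency."""
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of sequences containing each k-mer.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    return [[c * w for c, w in zip(row, idf)] for row in counts]


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical toy sequences, not taken from any dataset in the study.
sequences = ["MKTAYIAKQR", "MKTAYIAKQK", "GGSGGSGGSG"]

vocab, counts = count_vectors(sequences)
embeddings = tfidf_vectors(counts)

sim_close = cosine(embeddings[0], embeddings[1])  # sequences sharing most k-mers
sim_far = cosine(embeddings[0], embeddings[2])    # sequences sharing no k-mers
```

Each dimension of the resulting vector corresponds to a named k-mer in `vocab`, which is what makes an embedding of this kind inspectable. The same vectors could be passed to a classifier or compared pairwise, as in the similarity calculation above.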