Fine-tuning a sentence transformer for DNA

dc.contributor.author: Mokoatle, Mpho
dc.contributor.author: Marivate, Vukosi
dc.contributor.author: Mapiye, Darlington
dc.contributor.author: Bornman, Maria S. (Riana)
dc.contributor.author: Hayes, Vanessa M.
dc.date.accessioned: 2025-11-14T09:43:15Z
dc.date.available: 2025-11-14T09:43:15Z
dc.date.issued: 2025-10
dc.description: DATA AVAILABILITY: The benchmark datasets can be accessed here [23, 24]. For the other tasks (T1 and T2), the data can be accessed at the host database (The European Genome-phenome Archive at the European Bioinformatics Institute, accession number: EGAD00001004582 Data access). We share the DNA-based model on Hugging Face [36].
dc.description.abstract: BACKGROUND: Sentence-transformers is a library that provides easy methods for generating embeddings for sentences, paragraphs, and images. Sentiment analysis, retrieval, and clustering are among the applications made possible by the embedding of texts in a vector space where similar texts are located close to one another. This study fine-tunes a sentence transformer model designed for natural language on DNA text and subsequently evaluates it across eight benchmark tasks. The objective is to assess the efficacy of this transformer in comparison to domain-specific DNA transformers, like DNABERT and the Nucleotide transformer. RESULTS: The findings indicated that the refined proposed model generated DNA embeddings that exceeded DNABERT in multiple tasks. However, the proposed model was not superior to the nucleotide transformer in terms of raw classification accuracy. The nucleotide transformer excelled in most tasks; however, this superiority incurred significant computing expenses, rendering it impractical for resource-constrained environments such as low- and middle-income countries (LMICs). The nucleotide transformer also performed worse on retrieval tasks and embedding extraction time. Consequently, the proposed model presents a viable option that balances performance and accuracy.
dc.description.department: Computer Science
dc.description.department: School of Health Systems and Public Health (SHSPH)
dc.description.librarian: hj2025
dc.description.sdg: SDG-03: Good health and well-being
dc.description.uri: https://bmcbioinformatics.biomedcentral.com/
dc.identifier.citation: Mokoatle, M., Marivate, V., Mapiye, D. et al. Fine-tuning a sentence transformer for DNA. BMC Bioinformatics 26, 267: 1-13 (2025). https://doi.org/10.1186/s12859-025-06291-1.
dc.identifier.issn: 1471-2105 (online)
dc.identifier.other: 10.1186/s12859-025-06291-1
dc.identifier.uri: http://hdl.handle.net/2263/105294
dc.language.iso: en
dc.publisher: BioMed Central
dc.rights: © The Author(s) 2025. Open Access. This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.
dc.subject: Low- and middle-income countries (LMICs)
dc.subject: Sentence transformers
dc.subject: BERT
dc.subject: DNABERT
dc.subject: SimCSE
dc.subject: nucleotide transformer
dc.title: Fine-tuning a sentence transformer for DNA
dc.type: Article

Files

Original bundle

Name: Mokoatle_FineTuning_2025.pdf
Size: 1.74 MB
Format: Adobe Portable Document Format
Description: Article

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description: