dc.contributor.author |
Mokoatle, Mpho
|
|
dc.contributor.author |
Marivate, Vukosi
|
|
dc.contributor.author |
Mapiye, Darlington
|
|
dc.contributor.author |
Bornman, Maria S. (Riana)
|
|
dc.contributor.author |
Hayes, Vanessa M.
|
|
dc.date.accessioned |
2024-03-13T09:46:50Z |
|
dc.date.available |
2024-03-13T09:46:50Z |
|
dc.date.issued |
2023-03-23 |
|
dc.description |
AVAILABILITY OF DATA AND MATERIALS : The data can be accessed at the host database (the European Genome-phenome Archive at the European Bioinformatics
Institute, accession number EGAD00001004582). |
en_US |
dc.description.abstract |
BACKGROUND : Using visual, biological, and electronic health records data as the sole
input source, pretrained convolutional neural networks and conventional machine
learning methods have been heavily employed for the identification of various malignancies.
Initially, a series of preprocessing steps and image segmentation steps are
performed to extract region of interest features from noisy features. Then, the extracted
features are applied to several machine learning and deep learning methods for the
detection of cancer.
METHODS : In this work, a review of all the methods that have been applied to develop
machine learning algorithms that detect cancer is provided. With more than 100 types
of cancer, this study only examines research on the four most common and prevalent
cancers worldwide: lung, breast, prostate, and colorectal cancer. Next, by using
state-of-the-art sentence transformers, namely SBERT (2019) and the unsupervised
SimCSE (2021), this study proposes a new methodology for detecting cancer. This
method requires raw DNA sequences of matched tumor/normal pairs as the only input.
The learnt DNA representations retrieved from SBERT and SimCSE will then be sent to
machine learning algorithms (XGBoost, Random Forest, LightGBM, and CNNs) for classification.
As far as we are aware, SBERT and SimCSE transformers have not been applied
to represent DNA sequences in cancer detection settings.
RESULTS : The XGBoost model, which had the highest overall accuracy of 73 ± 0.13%
using SBERT embeddings and 75 ± 0.12% using SimCSE embeddings, was the best
performing classifier. In light of these findings, it can be concluded that incorporating
sentence representations from SimCSE’s sentence transformer only marginally
improved the performance of machine learning models. |
en_US |
dc.description.department |
Computer Science |
en_US |
dc.description.department |
School of Health Systems and Public Health (SHSPH) |
en_US |
dc.description.librarian |
am2024 |
en_US |
dc.description.sdg |
None |
en_US |
dc.description.sponsorship |
The South African Medical Research Council (SAMRC) through its Division of Research Capacity Development under the Internship Scholarship Program from funding received from the South African National Treasury. |
en_US |
dc.description.uri |
https://bmcbioinformatics.biomedcentral.com |
en_US |
dc.identifier.citation |
Mokoatle, M., Marivate, V., Mapiye, D. et al. 2023, 'A review and comparative study of cancer detection using machine learning : SBERT and SimCSE application', BMC Bioinformatics, vol. 24, art. 112, pp. 1-25. https://doi.org/10.1186/s12859-023-05235-x. |
en_US |
dc.identifier.issn |
1471-2105 |
|
dc.identifier.other |
10.1186/s12859-023-05235-x |
|
dc.identifier.uri |
http://hdl.handle.net/2263/95182 |
|
dc.language.iso |
en |
en_US |
dc.publisher |
BMC |
en_US |
dc.rights |
© The Author(s) 2023.
This article is licensed under a Creative Commons Attribution 4.0 International License. |
en_US |
dc.subject |
Cancer detection |
en_US |
dc.subject |
Machine learning |
en_US |
dc.subject |
SentenceBert |
en_US |
dc.subject |
SimCSE |
en_US |
dc.subject |
Deoxyribonucleic acid (DNA) |
en_US |
dc.title |
A review and comparative study of cancer detection using machine learning : SBERT and SimCSE application |
en_US |
dc.type |
Article |
en_US |