A review and comparative study of cancer detection using machine learning : SBERT and SimCSE application

dc.contributor.author: Mokoatle, Mpho
dc.contributor.author: Marivate, Vukosi
dc.contributor.author: Mapiye, Darlington
dc.contributor.author: Bornman, Maria S. (Riana)
dc.contributor.author: Hayes, Vanessa M.
dc.contributor.email: u19394277@tuks.co.za
dc.date.accessioned: 2024-03-13T09:46:50Z
dc.date.available: 2024-03-13T09:46:50Z
dc.date.issued: 2023-03-23
dc.description: AVAILABILITY OF DATA AND MATERIALS : The data can be accessed at the host database (The European Genome-phenome Archive at the European Bioinformatics Institute, accession number: EGAD00001004582 Data access).
dc.description.abstract: BACKGROUND : Using visual, biological, and electronic health records data as the sole input source, pretrained convolutional neural networks and conventional machine learning methods have been heavily employed for the identification of various malignancies. Initially, a series of preprocessing steps and image segmentation steps are performed to extract region of interest features from noisy features. Then, the extracted features are applied to several machine learning and deep learning methods for the detection of cancer. METHODS : In this work, a review of all the methods that have been applied to develop machine learning algorithms that detect cancer is provided. With more than 100 types of cancer, this study only examines research on the four most common and prevalent cancers worldwide: lung, breast, prostate, and colorectal cancer. Next, by using state-of-the-art sentence transformers namely: SBERT (2019) and the unsupervised SimCSE (2021), this study proposes a new methodology for detecting cancer. This method requires raw DNA sequences of matched tumor/normal pair as the only input. The learnt DNA representations retrieved from SBERT and SimCSE will then be sent to machine learning algorithms (XGBoost, Random Forest, LightGBM, and CNNs) for classification. As far as we are aware, SBERT and SimCSE transformers have not been applied to represent DNA sequences in cancer detection settings. RESULTS : The XGBoost model, which had the highest overall accuracy of 73 ± 0.13 % using SBERT embeddings and 75 ± 0.12 % using SimCSE embeddings, was the best performing classifier. In light of these findings, it can be concluded that incorporating sentence representations from SimCSE’s sentence transformer only marginally improved the performance of machine learning models.
dc.description.department: Computer Science
dc.description.department: School of Health Systems and Public Health (SHSPH)
dc.description.librarian: am2024
dc.description.sdg: None
dc.description.sponsorship: The South African Medical Research Council (SAMRC) through its Division of Research Capacity Development under the Internship Scholarship Program from funding received from the South African National Treasury.
dc.description.uri: https://bmcbioinformatics.biomedcentral.com
dc.identifier.citation: Mokoatle, M., Marivate, V., Mapiye, D. et al. 2023, 'A review and comparative study of cancer detection using machine learning : SBERT and SimCSE application', BMC Bioinformatics, vol. 24, art. 112, pp. 1-25. https://DOI.org/10.1186/s12859-023-05235-x.
dc.identifier.issn: 1471-2105
dc.identifier.other: 10.1186/s12859-023-05235-x
dc.identifier.uri: http://hdl.handle.net/2263/95182
dc.language.iso: en
dc.publisher: BMC
dc.rights: © The Author(s) 2023. This article is licensed under a Creative Commons Attribution 4.0 International License.
dc.subject: Cancer detection
dc.subject: Machine learning
dc.subject: SentenceBert
dc.subject: SimCSE
dc.subject: Deoxyribonucleic acid (DNA)
dc.title: A review and comparative study of cancer detection using machine learning : SBERT and SimCSE application
dc.type: Article
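
Illustrative note (not part of the record): the abstract states that raw DNA sequences of matched tumor/normal pairs are the only input to the sentence transformers, but it does not say how a sequence is turned into sentence-like text. The sketch below shows one common option, splitting each sequence into overlapping k-mers. The k-mer scheme, the value of `k`, and the helper names `kmer_sentence` and `build_corpus` are assumptions made for illustration and are not taken from the article.

```python
from typing import List, Tuple


def kmer_sentence(sequence: str, k: int = 6, stride: int = 1) -> str:
    """Turn a raw DNA string into a space-separated 'sentence' of k-mers.

    Assumption: overlapping k-mers act as the 'words' of the sentence.
    The record does not specify the tokenization used in the study.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    seq = sequence.strip().upper()
    # Slide a window of width k across the sequence in steps of `stride`.
    kmers = [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]
    return " ".join(kmers)


def build_corpus(
    pairs: List[Tuple[str, str]], k: int = 6
) -> Tuple[List[str], List[int]]:
    """Flatten matched (tumor, normal) pairs into sentences and labels.

    Label 1 marks a tumor sequence and label 0 a normal sequence
    (a hypothetical encoding chosen for this sketch).
    """
    sentences: List[str] = []
    labels: List[int] = []
    for tumor_seq, normal_seq in pairs:
        sentences.append(kmer_sentence(tumor_seq, k))
        labels.append(1)
        sentences.append(kmer_sentence(normal_seq, k))
        labels.append(0)
    return sentences, labels


if __name__ == "__main__":
    # Toy sequences invented for the example; not real study data.
    toy_pairs = [("ACGTAC", "ACGTAA")]
    sentences, labels = build_corpus(toy_pairs, k=3)
    for sentence, label in zip(sentences, labels):
        print(label, sentence)
```

Under these assumptions, the resulting sentences would be passed to a sentence encoder such as SBERT or SimCSE, and the embeddings would then be given to a classifier such as XGBoost, which is the flow the abstract describes.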

Files

Original bundle

Name: Mokoatle_Review_2023.pdf
Size: 2.14 MB
Format: Adobe Portable Document Format
Description: Article

Name: Mokoatle_Review_AddfileSuppl1_2023.docx
Size: 20.19 KB
Format: Microsoft Word XML
Description: AddfileSuppl1

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description: