Comparison of SNP-based subtyping workflows for bacterial isolates using WGS data, applied to Salmonella enterica serotype Typhimurium and serotype 1,4,[5],12:i:-

Saltykova, Assia; Wuyts, Veronique; Mattheus, Wesley; Bertrand, Sophie; Roosens, Nancy H.C.; Marchal, Kathleen; De Keersmaecker, Sigrid C.J.

dc.contributor.author	Saltykova, Assia
dc.contributor.author	Wuyts, Veronique
dc.contributor.author	Mattheus, Wesley
dc.contributor.author	Bertrand, Sophie
dc.contributor.author	Roosens, Nancy H.C.
dc.contributor.author	Marchal, Kathleen
dc.contributor.author	De Keersmaecker, Sigrid C.J.
dc.date.accessioned	2018-03-28T09:39:12Z
dc.date.available	2018-03-28T09:39:12Z
dc.date.issued	2018-02-06
dc.description	S1 Fig. Comparison of the sequencing samples based on the read mapping statistics. Read mapping statistics were obtained from Qualimap reports of the raw reads mapped on LT2 and SL1344 reference genomes, and re-plotted in R to improve visualization.	en_ZA
dc.description	S2 Fig. Original genome coverage plots generated by Qualimap with LT2 and SL1344 reference genomes.	en_ZA
dc.description	S3 Fig. Comparison of CFSAN and PHEnix variant selection procedures.	en_ZA
dc.description	S4 Fig. Phylogenetic trees generated with the tested SNP-based subtyping workflows using high-coverage dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Isolates are coloured according to the MLVA-profile. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.	en_ZA
dc.description	S5 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Isolates are coloured according to the MLVA-profile. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.	en_ZA
dc.description	S6 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.	en_ZA
dc.description	S7 Fig. SNP distance matrices generated with the tested SNP-based subtyping workflows using high-coverage dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.	en_ZA
dc.description	S8 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined. For the CSI-based workflow, the distances between isolates 12-3582 and 12-3583 versus isolates 12-2984, 12-2998, 12-3067 and 12-3558 dropped from 10–12 SNP positions observed with the normal (high-coverage) dataset to 4–6 positions with the down-sampled dataset. For the CFSAN-based workflow, the distances between isolates 12-2984, 12-2998, 12-3067 and 12-3558 increased strongly (in the most extreme case, from 3 to 17 SNPs) with the down-sampled dataset compared to the original data.	en_ZA
dc.description	S9 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.	en_ZA
dc.description	S10 Fig. Phylogenetic trees generated with the tested SNP-based subtyping workflows using high-coverage dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.	en_ZA
dc.description	S11 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.	en_ZA
dc.description	S12 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.	en_ZA
dc.description	S13 Fig. SNP distance matrices generated with the tested SNP-based subtyping workflows using high-coverage dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.	en_ZA
dc.description	S14 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.	en_ZA
dc.description	S15 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.	en_ZA
dc.description	S1 Table. Performance metrics describing the output of tested SNP-based subtyping workflows and combinations thereof assessed using LT2 as a reference genome. Performance metrics of the workflows were measured using original dataset (OD) and dataset down-sampled to a 30X coverage (30X), with LT2 as a reference genome. ad. CFSAN-based workflow: adapted CFSAN-based workflow. PHEnix + CSI, PHEnix + CFSAN, etc.: refer to a combination of the variant calling rules from the first mentioned workflow with the SNP matrix construction rules of the second mentioned workflow. DP: discriminative power.	en_ZA
dc.description	S2 Table. Performance metrics describing the output of tested SNP-based subtyping workflows and combinations thereof assessed using SL1344 as a reference genome. Performance metrics of the workflows were measured using original dataset (OD) and dataset down-sampled to a 30X coverage (30X), with SL1344 as a reference genome. ad. CFSAN-based workflow: adapted CFSAN-based workflow. PHEnix + CSI, PHEnix + CFSAN, etc.: refer to a combination of the variant calling rules from the first mentioned workflow with the SNP matrix construction rules of the second mentioned workflow. DP: discriminative power.	en_ZA
dc.description	S1 File. Perl script used for down-sampling of the sequencing data.	en_ZA
dc.description.abstract	Whole genome sequencing represents a promising new technology for subtyping of bacterial pathogens. Besides the technological advances which have pushed the approach forward, recent years have been marked by considerable evolution of whole genome sequencing data analysis methods. Prior to application of the technology as a routine epidemiological typing tool, however, reliable and efficient data analysis strategies need to be identified among the wide variety of methodologies that have emerged. In this work, we have compared three existing SNP-based subtyping workflows using a benchmark dataset of 32 Salmonella enterica subsp. enterica serovar Typhimurium and serovar 1,4,[5],12:i:- isolates, including five isolates from a confirmed outbreak and three isolates obtained from the same patient at different time points. The analysis was carried out using the original (high-coverage) dataset and a down-sampled (low-coverage) dataset, and two different reference genomes. All three tested workflows, namely the CSI Phylogeny-based workflow, the CFSAN-based workflow and the PHEnix-based workflow, were able to correctly group the confirmed outbreak isolates and the isolates from the same patient with all combinations of reference genomes and datasets. However, the workflows differed strongly with respect to the SNP distances between isolates and their sensitivity to sequencing coverage, which could be linked to the specific data analysis strategies used therein. To demonstrate the effect of particular data analysis steps, several modifications of the existing workflows were also tested. This allowed us to propose the data analysis schemes most suitable for routine SNP-based subtyping applied to S. Typhimurium and S. 1,4,[5],12:i:-. The results presented in this study illustrate the importance of using correct data analysis strategies, and of benchmarking and fine-tuning the parameters applied within routine data analysis pipelines, to obtain optimal results.	en_ZA
dc.description.department	Genetics	en_ZA
dc.description.librarian	am2018	en_ZA
dc.description.sponsorship	RP/PJ WIV-ISP (NeXSplorer.iph), the Federal Public Service of Health, Food Chain Safety and Environment. The National Reference Centre for Salmonella and Shigella is partially supported by the Belgian Ministry of Social Affairs through a fund within the Health Insurance System.	en_ZA
dc.description.uri	http://www.plosone.org	en_ZA
dc.identifier.citation	Saltykova A, Wuyts V, Mattheus W, Bertrand S, Roosens NHC, Marchal K, et al. (2018) Comparison of SNP-based subtyping workflows for bacterial isolates using WGS data, applied to Salmonella enterica serotype Typhimurium and serotype 1,4,[5],12:i:-. PLoS ONE 13(2): e0192504. https://doi.org/10.1371/journal.pone.0192504	en_ZA
dc.identifier.issn	1932-6203 (online)
dc.identifier.other	10.1371/journal.pone.0192504
dc.identifier.uri	http://hdl.handle.net/2263/64330
dc.language.iso	en	en_ZA
dc.publisher	Public Library of Science	en_ZA
dc.rights	© 2018 Saltykova et al. This is an open access article distributed under the terms of the Creative Commons Attribution License.	en_ZA
dc.subject	Bacterial pathogens	en_ZA
dc.subject	Bacterium isolate	en_ZA
dc.subject	Phylogeny	en_ZA
dc.subject	Salmonella enterica serovar Typhimurium	en_ZA
dc.subject	SNP-based subtyping workflow	en_ZA
dc.subject	Whole genome sequencing (WGS)	en_ZA
dc.title	Comparison of SNP-based subtyping workflows for bacterial isolates using WGS data, applied to Salmonella enterica serotype Typhimurium and serotype 1,4,[5],12:i:-	en_ZA
dc.type	Article	en_ZA