Comparison of SNP-based subtyping workflows for bacterial isolates using WGS data, applied to Salmonella enterica serotype Typhimurium and serotype 1,4,[5],12:i:-
Loading...
Date
Authors
Saltykova, Assia
Wuyts, Veronique
Mattheus, Wesley
Bertrand, Sophie
Roosens, Nancy H.C.
Marchal, Kathleen
De Keersmaecker, Sigrid C.J.
Journal Title
Journal ISSN
Volume Title
Publisher
Public Library of Science
Abstract
Whole genome sequencing represents a promising new technology for subtyping of bacterial
pathogens. Besides the technological advances which have pushed the approach forward,
the last years have been marked by considerable evolution of the whole genome
sequencing data analysis methods. Prior to application of the technology as a routine epidemiological
typing tool, however, reliable and efficient data analysis strategies need to be
identified among the wide variety of the emerged methodologies. In this work, we have compared
three existing SNP-based subtyping workflows using a benchmark dataset of 32 Salmonella
enterica subsp. enterica serovar Typhimurium and serovar 1,4,[5],12:i:- isolates
including five isolates from a confirmed outbreak and three isolates obtained from the same
patient at different time points. The analysis was carried out using the original (high-coverage)
and a down-sampled (low-coverage) datasets and two different reference genomes.
All three tested workflows, namely CSI Phylogeny-based workflow, CFSAN-based workflow
and PHEnix-based workflow, were able to correctly group the confirmed outbreak isolates
and isolates from the same patient with all combinations of reference genomes and datasets.
However, the workflows differed strongly with respect to the SNP distances between
isolates and sensitivity towards sequencing coverage, which could be linked to the specific
data analysis strategies used therein. To demonstrate the effect of particular data analysis
steps, several modifications of the existing workflows were also tested. This allowed us to
propose data analysis schemes most suitable for routine SNP-based subtyping applied to
S. Typhimurium and S. 1,4,[5],12:i:-. Results presented in this study illustrate the importance
of using correct data analysis strategies and to define benchmark and fine-tune parameters applied within routine data analysis pipelines to obtain optimal results.
Description
S1 Fig. Comparison of the sequencing samples based on the read mapping statistics. Read
mapping statistics were obtained from Qualimap reports of the raw reads mapped on LT2 and
SL1344 reference genomes, and re-plotted in R to improve visualization.
S2 Fig. Original genome coverage plots generated by Qualimap with LT2 and SL1344 reference genomes.
S3 Fig. Comparison of CFSAN and PHEnix variant selection procedures.
S4 Fig. Phylogenetic trees generated with the tested SNP-based subtyping workflows using high-coverage dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix- based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Isolates are coloured according to the MLVA-profile. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S5 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Isolates are coloured according to the MLVA-profile. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S6 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S7 Fig. SNP distance matrices generated with the tested SNP-based subtyping workflows using high-coverage dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S8 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined. For the CSI-based workflow, the distances between isolates 12±3582 and 12±3583 versus isolates 12±2984, 12±2998, 12±3067 and 12±3558 dropped from 10±12 SNP positions observed with the normal (high-coverage) dataset to 4±6 positions with the down-sampled dataset. For the CFSAN-based workflow, the distances between isolates 12± 2984, 12±2998, 12±3067 and 12±3558 increased strongly (as far as from 3 to 17 SNPs) with the down-sampled dataset compared to the original data.
S9 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S10 Fig. Phylogenetic trees generated with the tested SNP-based subtyping workflows using high-coverage dataset and Sl1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S11 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S12 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S13 Fig. SNP distance matrices generated with the tested SNP-based subtyping workflows using high-coverage dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S14 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S15 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S1 Table. Performance metrics describing the output of tested SNP-based subtyping workflows and combinations thereof assessed using LT2 as a reference genome. Performance metrics of the workflows were measured using original dataset (OD) and dataset down-sampled to a 30X coverage (30X), with LT2 as a reference genome. ad. CFSAN-based workflow: adapted CFSAN-based workflow. PHEnix + CSI, PHEnix + CFSAN, etc.: refer to a combination of the variant calling rules from the first mentioned workflow with the SNP matrix construction rules of the second mentioned workflow. DP: discriminative power.
S2 Table. Performance metrics describing the output of tested SNP-based subtyping workflows and combinations thereof assessed using SL1344 as a reference genome. Performance metrics of the workflows were measured using original dataset (OD) and dataset down-sampled to a 30X coverage (30X), with SL1344 as a reference genome. ad. CFSAN-based workflow: adapted CFSAN-based workflow. PHEnix + CSI, PHEnix + CFSAN, etc.: refer to a combination of the variant calling rules from the first mentioned workflow with the SNP matrix construction rules of the second mentioned workflow. DP: discriminative power.
S1 File. Perl script used for down-sampling of the sequencing data.
S2 Fig. Original genome coverage plots generated by Qualimap with LT2 and SL1344 reference genomes.
S3 Fig. Comparison of CFSAN and PHEnix variant selection procedures.
S4 Fig. Phylogenetic trees generated with the tested SNP-based subtyping workflows using high-coverage dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix- based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Isolates are coloured according to the MLVA-profile. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S5 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Isolates are coloured according to the MLVA-profile. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S6 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S7 Fig. SNP distance matrices generated with the tested SNP-based subtyping workflows using high-coverage dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S8 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined. For the CSI-based workflow, the distances between isolates 12±3582 and 12±3583 versus isolates 12±2984, 12±2998, 12±3067 and 12±3558 dropped from 10±12 SNP positions observed with the normal (high-coverage) dataset to 4±6 positions with the down-sampled dataset. For the CFSAN-based workflow, the distances between isolates 12± 2984, 12±2998, 12±3067 and 12±3558 increased strongly (as far as from 3 to 17 SNPs) with the down-sampled dataset compared to the original data.
S9 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S10 Fig. Phylogenetic trees generated with the tested SNP-based subtyping workflows using high-coverage dataset and Sl1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S11 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S12 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S13 Fig. SNP distance matrices generated with the tested SNP-based subtyping workflows using high-coverage dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S14 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S15 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S1 Table. Performance metrics describing the output of tested SNP-based subtyping workflows and combinations thereof assessed using LT2 as a reference genome. Performance metrics of the workflows were measured using original dataset (OD) and dataset down-sampled to a 30X coverage (30X), with LT2 as a reference genome. ad. CFSAN-based workflow: adapted CFSAN-based workflow. PHEnix + CSI, PHEnix + CFSAN, etc.: refer to a combination of the variant calling rules from the first mentioned workflow with the SNP matrix construction rules of the second mentioned workflow. DP: discriminative power.
S2 Table. Performance metrics describing the output of tested SNP-based subtyping workflows and combinations thereof assessed using SL1344 as a reference genome. Performance metrics of the workflows were measured using original dataset (OD) and dataset down-sampled to a 30X coverage (30X), with SL1344 as a reference genome. ad. CFSAN-based workflow: adapted CFSAN-based workflow. PHEnix + CSI, PHEnix + CFSAN, etc.: refer to a combination of the variant calling rules from the first mentioned workflow with the SNP matrix construction rules of the second mentioned workflow. DP: discriminative power.
S1 File. Perl script used for down-sampling of the sequencing data.
Keywords
Bacterial pathogens, Bacterium isolate, Phylogeny, Salmonella enterica serovar Typhimurium, SNP-based subtyping workflow, Whole genome sequencing (WGS)
Sustainable Development Goals
Citation
Saltykova A, Wuyts V, Mattheus W,
Bertrand S, Roosens NHC, Marchal K, et al. (2018)
Comparison of SNP-based subtyping workflows
for bacterial isolates using WGS data, applied to
Salmonella enterica serotype Typhimurium and
serotype 1,4,[5],12:i:-. PLoS ONE 13(2):
e0192504. https://DOI.org/ 10.1371/journal.pone.0192504.