Comparison of SNP-based subtyping workflows for bacterial isolates using WGS data, applied to Salmonella enterica serotype Typhimurium and serotype 1,4,[5],12:i:-

Whole genome sequencing (WGS) is a promising new technology for subtyping bacterial pathogens. Alongside the technological advances that have pushed the approach forward, recent years have seen considerable evolution in WGS data analysis methods. Before the technology can be applied as a routine epidemiological typing tool, however, reliable and efficient data analysis strategies need to be identified among the wide variety of methodologies that have emerged. In this work, we compared three existing SNP-based subtyping workflows using a benchmark dataset of 32 Salmonella enterica subsp. enterica serovar Typhimurium and serovar 1,4,[5],12:i:- isolates, including five isolates from a confirmed outbreak and three isolates obtained from the same patient at different time points. The analysis was carried out using the original (high-coverage) dataset and a down-sampled (low-coverage) dataset, with two different reference genomes. All three tested workflows, namely the CSI Phylogeny-based, CFSAN-based and PHEnix-based workflows, correctly grouped the confirmed outbreak isolates and the isolates from the same patient with all combinations of reference genomes and datasets. However, the workflows differed strongly in the SNP distances between isolates and in their sensitivity to sequencing coverage, which could be linked to the specific data analysis strategies used in each. To demonstrate the effect of particular data analysis steps, several modifications of the existing workflows were also tested. This allowed us to propose the data analysis schemes most suitable for routine SNP-based subtyping of S. Typhimurium and S. 1,4,[5],12:i:-. The results presented in this study illustrate the importance of using correct data analysis strategies and of defining benchmarks and fine-tuning the parameters applied within routine data analysis pipelines to obtain optimal results.
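The pairwise SNP distances that the abstract compares across workflows can be illustrated with a minimal Python sketch. It is not the method of the CSI Phylogeny, CFSAN or PHEnix workflows. It assumes that each isolate is already represented by a string of bases at the same variable positions, and that positions where either isolate has an ambiguous call (N) or a gap (-) are ignored. The isolate names and sequences are hypothetical.

```python
from itertools import combinations


def snp_distance(seq_a, seq_b, ignore="N-"):
    """Count positions where two aligned sequences differ.

    Positions where either sequence carries a character listed in
    `ignore` (assumed here to be ambiguous calls or gaps) are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1
        for a, b in zip(seq_a, seq_b)
        if a != b and a not in ignore and b not in ignore
    )


def distance_matrix(alignment):
    """Build a symmetric SNP distance matrix as a dict of dicts."""
    names = sorted(alignment)
    matrix = {name: {name: 0} for name in names}
    for a, b in combinations(names, 2):
        d = snp_distance(alignment[a], alignment[b])
        matrix[a][b] = d
        matrix[b][a] = d
    return matrix


# Hypothetical isolates and SNP-site sequences, for illustration only.
alignment = {
    "isolate_A": "ACGTAC",
    "isolate_B": "ACGTAT",
    "isolate_C": "TCGNAT",
}
matrix = distance_matrix(alignment)
for name in sorted(matrix):
    print(name, [matrix[name][other] for other in sorted(matrix)])
```

How ambiguous or missing positions are handled is a design choice that directly affects the reported distances, which is one reason workflows may disagree on the same data.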

Description

S1 Fig. Comparison of the sequencing samples based on the read mapping statistics. Read mapping statistics were obtained from Qualimap reports of the raw reads mapped on LT2 and SL1344 reference genomes, and re-plotted in R to improve visualization.
S2 Fig. Original genome coverage plots generated by Qualimap with LT2 and SL1344 reference genomes.
S3 Fig. Comparison of CFSAN and PHEnix variant selection procedures.
S4 Fig. Phylogenetic trees generated with the tested SNP-based subtyping workflows using high-coverage dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Isolates are coloured according to the MLVA-profile. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S5 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Isolates are coloured according to the MLVA-profile. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S6 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S7 Fig. SNP distance matrices generated with the tested SNP-based subtyping workflows using high-coverage dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S8 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined. For the CSI-based workflow, the distances between isolates 12-3582 and 12-3583 versus isolates 12-2984, 12-2998, 12-3067 and 12-3558 dropped from 10–12 SNP positions observed with the normal (high-coverage) dataset to 4–6 positions with the down-sampled dataset. For the CFSAN-based workflow, the distances between isolates 12-2984, 12-2998, 12-3067 and 12-3558 increased strongly (as far as from 3 to 17 SNPs) with the down-sampled dataset compared to the original data.
S9 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and LT2 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S10 Fig. Phylogenetic trees generated with the tested SNP-based subtyping workflows using high-coverage dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S11 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S12 Fig. Phylogenetic trees generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. The minimal and maximal SNP distances observed between the five outbreak isolates and the three isolates obtained from the same patient are indicated near the clusters. The trees are drawn to scale, with branch lengths measured in the number of substitutions per site. The scale axis is provided below each tree. BS: bootstrap values.
S13 Fig. SNP distance matrices generated with the tested SNP-based subtyping workflows using high-coverage dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) adapted PHEnix-based workflow, (D) CFSAN-based workflow, (E) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S14 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S15 Fig. SNP distance matrices generated with the successful SNP-based subtyping workflows using down-sampled dataset supplemented with replicate data and SL1344 as a reference genome. (A) CSI-based workflow, (B) PHEnix-based workflow, (C) CFSAN-based workflow, (D) adapted CFSAN-based workflow. Values and colour codes in the SNP distance matrices indicate pairwise SNP distances between isolates. Outbreak isolates are shown in bold and isolates obtained from the same patient are underlined.
S1 Table. Performance metrics describing the output of tested SNP-based subtyping workflows and combinations thereof assessed using LT2 as a reference genome. Performance metrics of the workflows were measured using original dataset (OD) and dataset down-sampled to a 30X coverage (30X), with LT2 as a reference genome. ad. CFSAN-based workflow: adapted CFSAN-based workflow. PHEnix + CSI, PHEnix + CFSAN, etc.: refer to a combination of the variant calling rules from the first mentioned workflow with the SNP matrix construction rules of the second mentioned workflow. DP: discriminative power.
S2 Table. Performance metrics describing the output of tested SNP-based subtyping workflows and combinations thereof assessed using SL1344 as a reference genome. Performance metrics of the workflows were measured using original dataset (OD) and dataset down-sampled to a 30X coverage (30X), with SL1344 as a reference genome. ad. CFSAN-based workflow: adapted CFSAN-based workflow. PHEnix + CSI, PHEnix + CFSAN, etc.: refer to a combination of the variant calling rules from the first mentioned workflow with the SNP matrix construction rules of the second mentioned workflow. DP: discriminative power.
S1 File. Perl script used for down-sampling of the sequencing data.
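
The down-sampling to 30X coverage mentioned in the captions above can be sketched in Python as follows. This is not the Perl script from S1 File, whose method is not described on this page. It assumes that coverage is estimated as total sequenced bases divided by genome size and that read pairs are kept at random with a fixed seed. The genome size, base count and read names are hypothetical.

```python
import random


def keep_fraction(total_bases, genome_size, target_coverage=30.0):
    """Fraction of reads to keep so that mean coverage is about the target.

    Coverage is approximated as total_bases / genome_size. The result is
    capped at 1.0 because a sample cannot be sampled up to more coverage.
    """
    if total_bases <= 0 or genome_size <= 0:
        raise ValueError("total_bases and genome_size must be positive")
    current_coverage = total_bases / genome_size
    return min(1.0, target_coverage / current_coverage)


def subsample_pairs(read_pairs, fraction, seed=0):
    """Keep each read pair with probability `fraction`.

    Mates are kept or dropped together, and the seed makes the selection
    reproducible.
    """
    rng = random.Random(seed)
    return [pair for pair in read_pairs if rng.random() < fraction]


# Hypothetical values: 750 Mb of sequence over a 5 Mb genome is about 150X.
fraction = keep_fraction(total_bases=750_000_000, genome_size=5_000_000)

# Placeholder read names standing in for paired FASTQ records.
pairs = [(f"read{i}/1", f"read{i}/2") for i in range(1000)]
subset = subsample_pairs(pairs, fraction, seed=42)
print(f"kept {len(subset)} of {len(pairs)} pairs at fraction {fraction:.2f}")
```

Sampling each pair independently gives a coverage close to, but not exactly, the target. Drawing a fixed number of pairs would be an alternative when an exact read count is required.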

Keywords

Bacterial pathogens, Bacterium isolate, Phylogeny, Salmonella enterica serovar Typhimurium, SNP-based subtyping workflow, Whole genome sequencing (WGS)

Citation

Saltykova A, Wuyts V, Mattheus W, Bertrand S, Roosens NHC, Marchal K, et al. (2018) Comparison of SNP-based subtyping workflows for bacterial isolates using WGS data, applied to Salmonella enterica serotype Typhimurium and serotype 1,4,[5],12:i:-. PLoS ONE 13(2): e0192504. https://doi.org/10.1371/journal.pone.0192504.

URI

http://hdl.handle.net/2263/64330

Collections

Research Articles (Genetics)
Research Articles (University of Pretoria)
