Abstract:
Mycobacterium tuberculosis is a clonal pathogen proposed to have co-evolved with its human
host for millennia, yet our understanding of its genomic diversity and biogeography remains
incomplete. Here we use a combination of phylogenetics and dimensionality reduction to
reevaluate the population structure of M. tuberculosis, providing an in-depth analysis of the
ancient Indo-Oceanic Lineage 1 and the modern Central Asian Lineage 3, and expanding our
understanding of Lineages 2 and 4. We assess sub-lineages using genomic sequences from
4939 pan-susceptible strains, and find 30 new genetically distinct clades that we validate in a
dataset of 4645 independent isolates. For 20 groups, including three groups of Lineage 1, we
find a consistent geographic pattern, either restricted or unrestricted. The distribution of
terminal branch lengths across the M. tuberculosis phylogeny supports the hypothesis that
Lineages 2 and 4 are more transmissible than Lineages 3 and 1 on a global scale. We define
an expanded barcode of 95 single nucleotide substitutions that allows rapid
identification of 69 M. tuberculosis sub-lineages and 26 additional internal groups. Our results
paint a higher-resolution picture of the M. tuberculosis phylogeny and biogeography.
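
To illustrate how a barcode of single nucleotide substitutions could be used for rapid sub-lineage identification, the following is a minimal sketch rather than the published method. The positions, alleles, and labels in `BARCODE` are hypothetical placeholders (not the 95 substitutions defined in this work), and the function names `assign_sublineages` and `most_specific` are assumptions introduced only for this example. The sketch also assumes that dotted label names encode hierarchy depth.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical barcode: genome position -> (defining allele, label).
# These values are placeholders for illustration, NOT the real barcode.
BARCODE: Dict[int, Tuple[str, str]] = {
    1000: ("A", "L1"),
    2000: ("T", "L1.1"),
    3000: ("G", "L2"),
    4000: ("C", "L2.2"),
}


def assign_sublineages(calls: Dict[int, str]) -> List[str]:
    """Return every barcode label whose defining allele appears in `calls`.

    `calls` maps genome positions to the base called in one isolate.
    """
    hits = [
        label
        for pos, (allele, label) in BARCODE.items()
        if calls.get(pos) == allele
    ]
    return sorted(hits)


def most_specific(labels: List[str]) -> Optional[str]:
    """Pick the deepest label, assuming dots in the name encode depth."""
    if not labels:
        return None
    return max(labels, key=lambda name: name.count("."))


if __name__ == "__main__":
    # One isolate carrying the placeholder L1 and L1.1 alleles.
    isolate_calls = {1000: "A", 2000: "T", 3000: "A"}
    labels = assign_sublineages(isolate_calls)
    print(labels, most_specific(labels))
```

Because each marker is checked independently, an isolate can match both a lineage and one of its sub-lineages; reporting the deepest match is one possible design choice for summarizing the result.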