Gene duplication is an important mechanism for generating genomic novelty. Hence, which genes undergo duplication and are preserved afterward is an important question. Gene duplicability, or the ability of genes to be retained following duplication, has been observed to be nonrandom, with certain genes more likely than others to survive duplication events. Primarily, gene essentiality and the type of duplication (small-scale versus large-scale) have been shown in different species to influence the (long-term) survival of novel genes. However, an overarching view of gene duplicability is lacking, mainly because previous studies usually focused on individual species and did not account for the influence of genomic context and the time of duplication. Here, we present a large-scale study in which we investigated duplicate retention for 9178 gene families shared among 37 flowering plant species, referred to as angiosperm core gene families. For most gene families, we observe a strikingly consistent pattern of gene duplicability across species, with gene families being either primarily single-copy or multicopy in all species. An intermediate class contains gene families that are often retained in duplicate for periods extending to tens of millions of years after whole-genome duplication, but ultimately appear to be largely restored to singleton status, suggesting that these genes may be dosage balance sensitive. The distinction between single-copy and multicopy gene families is reflected in their functional annotation, with single-copy genes being mainly involved in the maintenance of genome stability and organelle function and multicopy genes in signaling, transport, and metabolism. Regulatory genes were overrepresented in the intermediate class, further suggesting that its members represent putative dosage-balance-sensitive genes.
SUPPLEMENTAL FIGURE 1. Motivation for the 32 out of 37 species cut-off
to define core gene families.
SUPPLEMENTAL FIGURE 2. The distribution of single-copy percentages
(SCPs) for all core gene families, with SCPs calculated upon removing
the highly duplicated genomes of Glycine max, Linum usitatissimum,
Brassica rapa, and Zea mays.
SUPPLEMENTAL FIGURE 3. Classification of species tree nodes as SSD
SUPPLEMENTAL FIGURE 4. Core gene families mainly duplicate through
SUPPLEMENTAL FIGURE 5. Comparison of the number of duplications for
core and noncore gene families at WGD and SSD nodes on a gene
SUPPLEMENTAL FIGURE 6. Ks distributions of duplicated pairs from core
and noncore gene families in 12 species.
SUPPLEMENTAL FIGURE 7. Duplicate gene retention as a function of time
SUPPLEMENTAL FIGURE 8. Criteria that we used to choose the optimal
number of clusters for k-means clustering of the copy-number matrix.
SUPPLEMENTAL FIGURE 9. Consensus matrices obtained for different
numbers of clusters k.
SUPPLEMENTAL FIGURE 10. Polar diagrams depicting the fraction of
duplication events in each gene family group belonging to either the
“recent,” “K-Pg boundary,” “ancient,” or “SSD” duplication classes.
SUPPLEMENTAL FIGURE 11. Over- and underrepresentation of an
independent set of 2090 nuclear-encoded chloroplast-targeted genes
obtained from The Chloroplast Function Database.
SUPPLEMENTAL FIGURE 12. Over- and underrepresentation of an
independent set of 1795 putative transcription factors.
SUPPLEMENTAL FIGURE 13. Mapping of the whole-genome duplications
and triplications on the species tree.
SUPPLEMENTAL FIGURE 14. Conflicting clades between the species tree
used in this paper, which we inferred from 107 core gene families,
and the APGIII tree.
SUPPLEMENTAL FIGURE 15. Explanation of how duplications were
inferred for gene families with at least two species but no more than
three genes or gene families that are only present in one species.
SUPPLEMENTAL FIGURE 16. The change in the total number of predicted
duplication events in core gene families as a function of the threshold on
the duplication consistency score.
SUPPLEMENTAL FIGURE 17. Gaussian mixture models were fit to the Ks
distribution of each species.
SUPPLEMENTAL FIGURE 18. Comparison of power-law fit and exponential
fit to the data obtained from the Gaussian mixture modeling of Ks-based
SUPPLEMENTAL TABLE 1. Comparison of the numbers of interacting
protein pairs in each group to those obtained from randomized
SUPPLEMENTAL TABLE 2. Description of all identified peaks inferred
from the Ks-based age distributions.
SUPPLEMENTAL TABLE 3. Comparison of the power-law and the
SUPPLEMENTAL DATA SET 1. Concatenated multiple sequence alignment
for 107 genes to reconstruct the species tree.
SUPPLEMENTAL DATA SET 2. Data source and accession numbers of
107 genes used for reconstruction of the species tree.
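
The core gene family cut-off (Supplemental Figure 1) and the single-copy percentage, or SCP (Supplemental Figure 2), can be illustrated with a minimal sketch. The input layout, the function names, and the toy data are hypothetical and are not taken from the study. The sketch also assumes that SCP is computed over the species in which a family is present; the captions above do not state the denominator.

```python
from typing import Dict

# Hypothetical input: copies[family][species] = number of gene copies (0 = absent).
CopyTable = Dict[str, Dict[str, int]]


def is_core(family_counts: Dict[str, int], min_species: int) -> bool:
    """Return True if the family is present in at least min_species species."""
    present = sum(1 for n in family_counts.values() if n > 0)
    return present >= min_species


def single_copy_percentage(family_counts: Dict[str, int]) -> float:
    """Percentage of species carrying the family that hold exactly one copy.

    Assumption: the denominator is the set of species in which the family
    is present, not the full species set.
    """
    present = [n for n in family_counts.values() if n > 0]
    if not present:
        return 0.0
    return 100.0 * sum(1 for n in present if n == 1) / len(present)


# Toy data with 4 species; the study uses 37 species and a cut-off of 32.
copies: CopyTable = {
    "famA": {"sp1": 1, "sp2": 1, "sp3": 1, "sp4": 1},
    "famB": {"sp1": 3, "sp2": 2, "sp3": 1, "sp4": 4},
    "famC": {"sp1": 1, "sp2": 0, "sp3": 0, "sp4": 2},
}

# Keep only families that pass the (toy) presence cut-off, then score them.
core = {fam: counts for fam, counts in copies.items() if is_core(counts, min_species=3)}
scp = {fam: single_copy_percentage(counts) for fam, counts in core.items()}
```

With real data, `min_species` would be set to 32 and the table would hold one entry per species for each of the 37 genomes; families with a high SCP correspond to the primarily single-copy group and families with a low SCP to the multicopy group.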