MethylToSNP: identifying SNPs in Illumina DNA methylation array data
LaBarre, Brenna A.; Goncearenco, Alexander; Petrykowska, Hanna M.; Jaratlerdsiri, Weerachai; Bornman, Maria S. (Riana); Hayes, Vanessa M.; Elnitski, Laura
BACKGROUND: Current array-based methods for the measurement of DNA methylation rely on the process of sodium
bisulfite conversion to differentiate between methylated and unmethylated cytosine bases in DNA. In the absence
of genotype data, this process can lead to ambiguity in data interpretation when a sample has polymorphisms at a
methylation probe site. A common way to minimize this problem is to exclude such potentially problematic sites,
with some methods removing as much as 60% of array probes from consideration before data analysis.
RESULTS: Here, we present an algorithm implemented in an R Bioconductor package, MethylToSNP, which detects
a characteristic data pattern to infer sites likely to be confounded by polymorphisms. Additionally, the tool provides
a stringent reliability score to allow thresholding on SNP predictions. We calibrated parameters and thresholds used
by the algorithm on simulated and real methylation data sets. We illustrate findings using methylation data from YRI
(Yoruba in Ibadan, Nigeria), CEPH (European descent) and KhoeSan (southern African) populations. Our polymorphism
predictions made using MethylToSNP have been validated through SNP databases and bisulfite and genomic
sequencing.
CONCLUSIONS: The benefits of this method are threefold. First, it prevents extensive data loss by considering only SNPs
specific to the individuals in the study. Second, it offers the possibility to identify new polymorphisms in samples for
which little is known about the genetic landscape. Third, it identifies variants as they exist in functional regions
of a genome, such as in CTCF (transcriptional repressor) sites and enhancers, that may be common alleles or personal
mutations with the potential to deleteriously affect genomic regulatory activities. We demonstrate that MethylToSNP is
applicable to Illumina 450K and Illumina 850K EPIC array data and is also backward compatible with the 27K methylation
arrays. Going forward, this kind of nuanced approach can increase the amount of information derived from
precious data sets by considering the samples of a project individually, enabling more informed decisions about data cleaning.
Description:
Additional file 1. Supplemental Methods. Additional materials are
provided for the determination of default thresholds (Figure S1), assessment
of false negative rates (Figure S2), and inverse quantile weighting
(Figure S3).