CpG是胞嘧啶(C,Cytosine),磷酸(p,phosphoric acid),鸟嘌呤(G,Guanine )的缩写,也可以去掉磷酸直接叫CG。在哺乳动物中,基因组中富含GC和CpG的序列区段,叫CpG岛(CpG islands)。Regions of the genome that are enriched for non-methylated CpGs, called CpG islands.
from Cancer Cell: Several reports have suggested that gene body DNA methylation may increase transcriptional activity by blocking the initiation of intragenic promoters or by affecting the activities of repetitive DNAs within the transcriptional unit.
2009:the HumanMethylation27 (27k) array (Bibikova et al. 2009) measured methylation at approximately 27,000 CpGs, primarily in gene promoters. Like bisulfite sequencing, the Infinium assay detects methylation status at single base resolution.
全基因组DNA甲基化测序(Whole Genome Bisulfite Sequencing,WGBS)是 DNA 甲基化研究的金标准,它通过 Bisulfite 处理和全基因组 DNA 测序结合的方式,对整个基因组上的甲基化情况进行分析,具有单碱基分辨率,可精确评估单个 C 碱基的甲基化水平,构建全基因组精细甲基化图谱。数据量非常大。
### 归一化 # 目的是:To minimise the unwanted variation within and between samples # 方案一:preprocessQuantile(): more suited for datasets where you do not expect global differences between your samples, for example a single tissue # 方案二:preprocessFunnorm(): most appropriate for datasets with global methylation differences such as cancer/normal or vastly different tissue types # 我们这里比较的是不同的血液细胞类型,因此用preprocessQuantile即可 mSetSq <- preprocessQuantile(rgSet) mSetRaw <- preprocessRaw(rgSet)
# visualise what the data looks like before and after normalisation par(mfrow=c(1,2)) densityPlot(rgSet, sampGroups=targets$Sample_Group,main='Raw', legend=FALSE) legend('top', legend = levels(factor(targets$Sample_Group)), text.col=brewer.pal(8,'Dark2')) densityPlot(getBeta(mSetSq), sampGroups=targets$Sample_Group, main='Normalized', legend=FALSE) legend('top', legend = levels(factor(targets$Sample_Group)), text.col=brewer.pal(8,'Dark2'))
# 来自:https://www./p/231771/ most people do with these experimental designs: ~ 0 DiseasePhenotype covariate (1 | randomCovariate) OR: y ~ 0 effect1 effect2 (1 | effect 3) # 这里DiseasePhenotype就是cellType,covariate就是individual
# fit the linear model fit <- lmFit(mVals, design) # create a contrast matrix for specific comparisons contMatrix <- makeContrasts(naive-rTreg, naive-act_naive, rTreg-act_rTreg, act_naive-act_rTreg, levels=design) # fit the contrasts fit2 <- contrasts.fit(fit, contMatrix) fit2 <- eBayes(fit2) # look at the numbers of DM CpGs at FDR < 0.05 summary(decideTests(fit2)) ## naive - rTreg naive - act_naive rTreg - act_rTreg act_naive - act_rTreg ## Down 1618 400 0 559 ## NotSig 436895 439291 439918 438440 ## Up 1405 227 0 919
# plot the top 4 most significantly differentially methylated CpGs par(mfrow=c(2,2)) sapply(rownames(DMPs)[1:4], function(cpg){ plotCpg(bVals, cpg=cpg, pheno=targets$Sample_Group, ylab = 'Beta values') })
然后是 Differential methylation analysis of regions
# using the dmrcate in the DMRcate package (Peters et al. 2015) myAnnotation <- cpg.annotate(object = mVals, datatype = 'array', what = 'M', analysis.type = 'differential', design = design, contrasts = TRUE, cont.matrix = contMatrix, coef = 'naive - rTreg', arraytype = '450K') DMRs <- dmrcate(myAnnotation, lambda=1000, C=2) results.ranges <- extractRanges(DMRs) results.ranges
可视化DMR
# set up the grouping variables and colours groups <- pal[1:length(unique(targets$Sample_Group))] names(groups) <- levels(factor(targets$Sample_Group)) cols <- groups[as.character(factor(targets$Sample_Group))] # draw the plot for the top DMR par(mfrow=c(1,1)) DMR.plot(ranges = results.ranges, dmr = 2, CpGs = bVals, phen.col = cols, what = 'Beta', arraytype = '450K', genome = 'hg19')
Gene ontology testing
# Get the significant CpG sites at less than 5% FDR sigCpGs <- DMPs$Name[DMPs$adj.P.Val<0.05] # First 10 significant CpGs sigCpGs[1:10] ## [1] 'cg07499259' 'cg26992245' 'cg09747445' 'cg18808929' 'cg25015733' ## [6] 'cg21179654' 'cg26280976' 'cg16943019' 'cg10898310' 'cg25130381' length(sigCpGs) ## [1] 3023
# Get all the CpG sites used in the analysis to form the background all <- DMPs$Name # Total number of CpG sites tested length(all) ## [1] 439918
# Top 10 GO categories topGSA(gst, number=10) ## ONTOLOGY TERM N DE ## GO:0002376 BP immune system process 2895 383 ## GO:0046649 BP lymphocyte activation 669 137 ## GO:0042110 BP T cell activation 467 103 ## GO:0002682 BP regulation of immune system process 1480 219 ## GO:0001775 BP cell activation 1378 206 ## GO:0002520 BP immune system development 999 169 ## GO:0048534 BP hematopoietic or lymphoid organ development 945 162 ## GO:0022407 BP regulation of cell-cell adhesion 431 93 ## GO:0022409 BP positive regulation of cell-cell adhesion 277 69 ## GO:0007166 BP cell surface receptor signaling pathway 2994 398 ## P.DE FDR ## GO:0002376 1.139863e-21 2.593075e-17 ## GO:0046649 6.147656e-21 6.992652e-17 ## GO:0042110 2.829736e-18 2.145788e-14 ## GO:0002682 5.707044e-18 3.245738e-14 ## GO:0001775 1.476431e-15 6.717465e-12 ## GO:0002520 4.851443e-15 1.839424e-11 ## GO:0048534 1.044579e-14 3.394733e-11 ## GO:0022407 1.379734e-14 3.923445e-11 ## GO:0022409 2.719337e-14 6.873578e-11 ## GO:0007166 4.538082e-14 9.313273e-11
有用的资源
EWAS Data Hub:中国科学院北京基因组研究所国家基因组科学数据中心,开发了人类表观组关联分析数据库。EWAS Data Hub: a resource of DNA methylation array data and metadata.Nucleic Acids Res 2020. EWAS全称是:表观基因组广泛关联研究(Epigenome-wide Association Study,EWAS)https://bigd./ewas/datahub 包括了来自95,783个样本的DNA甲基化数据,数据主要来自GEO、TCGA、ArrayExpress和Encode数据库,采用有效的归一化方法来减轻不同数据集之间的批次效应,其中约91%是来自450K,来自血液样本的数据最多。Associations、Studies、Cohorts、Probe annotations、Trait to trait relationships都可以下载。