【原】使用MethylMix包识别甲基化驱动的癌症基因

健明 2021-07-14

展开全文

真正做生信工具（R包，软件，网页）的很少，但是需要用工具的却超级多，比如：GEPIA2详解（中国智造-肿瘤数据库），主要是看表达量和生存，引用就是（1659+130），再比如GDCRNATools这个R包，早就了ceRNA等数据挖掘套路。我在：惊！3个同样的数据挖掘策略居然同时发表，提到过大家耳熟能详的策略，有：

差异分析+PPI网络+hub基因
WGCNA+hub基因
诊断模型构建
预后模型构建
肿瘤免疫，CIBERSOFT计算的LM22比例分组，以及ESTIMATE算法等等
m6A等生物学功能基因集
药敏信息

（mRNA，lncRNA，miRNA，甲基化，蛋白）均可走上述流程，也就是说33种癌症乘以5种亚型，乘以5种分子，乘以15个策略就已经是过万篇数据挖掘课题了，而且你仔细搜索一下就发现，真的是已经有了过万篇数据挖掘文章了哦！

人民群众的模仿能力是远超出我的想象，这里分享一个策略给大家。最近看到一篇文章，做的是甲基化和表达量联合分析，核心数据集如下所示：

核心数据集

完整的路线图如下：

路线图

文章链接是：https://www./articles/10.3389/fgene.2020.00294/full

我有预感，这样的策略，马上也会应用于33种癌症乘以5种亚型，就是一百多篇文章出来啦。不过，能模仿下来，首先就有一个R语言的拦路虎哦！再怎么强调生物信息学数据分析学习过程的计算机基础知识的打磨都不为过，我把它粗略的分成基于R语言的统计可视化，以及基于Linux的NGS数据处理：

把R的知识点路线图搞定，如下：

了解常量和变量概念
加减乘除等运算（计算器）
多种数据类型（数值，字符，逻辑，因子）
多种数据结构（向量，矩阵，数组，数据框，列表）
文件读取和写出
简单统计可视化
无限量函数学习

如果你学会了R，那么很容易就可以看懂MethylMix包的使用方法啦。去探索你感兴趣的癌症吧，当然了，又不是一定需要癌症数据，只要是有转录组和甲基化数据的项目，都可以使用起来。这个时候学习起来也不晚，虽然你错过了这个数据挖掘策略，但是未来十年仍然是有几十上百个好用的策略，到时候就需要你的R语言水平啦！可以考虑我们生信技能树的官方学习班（下一期是8月31号开始哈）：

数据挖掘学习班第5期（线上直播3周，马拉松式陪伴，带你入门），原价4800的数据挖掘全套课程，疫情期间半价即可抢购。
生信爆款入门-第7期（线上直播4周，马拉松式陪伴，带你入门），原价9600的生信入门全套课程，疫情期间3.3折即可抢购。

MethylMix说明书

简介

用途：识别甲基化驱动的癌症基因，可以识别癌症中高甲基化和低甲基化基因。

数据集

以下是包含在R包中的数据集。

GEcancer

数据集构成：胶质母细胞瘤患者的癌症基因表达矩阵,14行基因和251列样本。强调基因要放在行，样本要放在列。
参考链接：Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008 Oct 23; 455(7216):1061-8. doi: 10.1038/nature07385.
用法：data(GEcancer)

METcancer

数据集构成：胶质母细胞瘤患者的癌细胞甲基化矩阵,14行基因和251列样本。
参考链接：Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008 Oct 23; 455(7216):1061-8. doi: 10.1038/nature07385.
用法：data(METcancer)

METnormal

数据集构成：胶质母细胞瘤患者的正常细胞甲基化矩阵,14行基因和251列样本。
参考链接：Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008 Oct 23; 455(7216):1061-8. doi: 10.1038/nature07385.
用法：data(METnormal)

ProbeAnnotation

数据集构成：注释文件。把CpG位点注释成基因

SNPprobes

数据集构成：含有SNP探针的数据集

函数

ClusterProbes

用途：把每个探针注释到一个基因。一个基因有多个CpG位点，对这些CpG位点聚类，如果同时提供正常组织的样本，那么只对同时出现在正常样本和癌症样本中的探针进行操作。去除了SNP探针。用法:

ClusterProbes(MET_Cancer, MET_Normal, CorThreshold = 0.4)### MET_Cancer: 癌症组织样本### MET_Normal: 正常组织样本### CorThreshold: 划分聚类的相关性阈值

输出结果：

聚类之后的数据集
探针和基因的对应关系

Download_DNAmethylation

用途：从TCGA下载DNA甲基化数据用法:

Download_DNAmethylation(CancerSite, TargetDirectory, downloadData = TRUE)### CancerSite：各种癌症的缩写名称，例如OV，THCA等### TargetDirectory：下载文件存放位置### downloadData (默认：TRUE). 如果是false, 则返回下载链接

输出结果：

Download_GeneExpression

用途：从TCGA下载RNAseq数据用法:

Download_GeneExpression(CancerSite, TargetDirectory, downloadData = TRUE)### CancerSite：同上### TargetDirectory：下载文件存放位置### downloadData (默认：TRUE). 如果是false, 则返回下载链接

输出结果：

Preprocess_DNAmethylation

用途：预处理从TCGA下载DNA甲基化数据用法:

Preprocess_DNAmethylation(CancerSite, METdirectories,MissingValueThreshold = 0.2)### CancerSite: 同上### METdirectories: 由Download_DNAmethylation返回的对象。由下载数据路径组成的向量。### MissingValueThreshold: 当存在缺失值时，移除样本或基因的阈值

输出结果：预处理之后的癌症组织和正常组织矩阵

Preprocess_GeneExpression

用途：预处理从TCGA下载RNAseq数据用法:

Preprocess_GeneExpression(CancerSite, MAdirectories,MissingValueThresholdGene = 0.3, MissingValueThresholdSample = 0.1)### CancerSite: 同上### MAdirectories: 由Download_DNAmethylation返回的对象。由下载数据路径组成的向量。### MissingValueThresholdGene：默认值0.3，当存在缺失值时，移除样本或基因的阈值。### MissingValueThresholdSample：默认值0.1，当存在缺失值时，移除样本或基因的阈值。

输出结果：预处理之后的癌症组织和正常组织矩阵

GetData

用途：打包了Download_DNAmethylation，Download_GeneExpression，Preprocess_DNAmethylation，Preprocess_GeneExpression, ClusterProbes函数。

用法:

GetData(cancerSite, targetDirectory)### cancerSite：同上### targetDirectory：输出文件存放位置

输出结果：

gdac: 存放直接从TCGA下载的数据的文件夹
MET_CancerSite_Processed.rds: 处理过的甲基化矩阵（未聚类，未注释）
GE_CancerSite_Processed.rds: 处理过的基因表达矩阵
data_CancerSite.rds: 甲基化矩阵（已聚类，已从探针注释到基因）和基因表达矩阵

MethylMix

用途：对DNA甲基化数据建立mixture model

用法:

MethylMix(METcancer, GEcancer, METnormal = NULL, listOfGenes = NULL, filter = TRUE, NoNormalMode = FALSE, OutputRoot = "")

### 三个数据集：METcancer， GEcancer，METnormal ### listOfGenes：和rownames(METcancer)一致### filter：默认值TRUE 选择甲基化和基因表达呈负相关的基因### NoNormalMode：默认值FALSE，在癌症组织中的甲基化状态不和正常组织比较### OutputRoot：存储 MethylMix results object的路径

输出结果：

MethylationDrivers：符合以下两个条件的基因被挑选出来：

转录预测基因
差异甲基化

NrComponents：driver基因的甲基化状态数量
MixtureStates：每个driver基因的 Differential Methylation values(DM-values)
MethylationStates: DM-values矩阵，driver genes作为行名sample作为列名
Classifications：Matrix with integers indicating to which mixture component each cancer sample was assigned to, for each gene.
Models：Beta mixture model parameters for each driver gene.

MethylMix_ModelGeneExpression

用途：对DNA甲基化数据建立mixture model

用法:

MethylMix_ModelGeneExpression(METcancer, GEcancer, CovariateData = NULL)### 两个数据集：METcancer和GEcancer### CovariateData：一般不需要设置，如果样本来源于不同的组织类型，则可以设置。

输出结果：输出甲基化和基因表达呈显著负相关的的基因名

MethylMix_PlotModel

用途：根据MethylMix函数结果画图

用法:

MethylMix_PlotModel(GeneName, MixtureModelResults, METcancer, GEcancer = NULL, METnormal = NULL)#GeneName: Name of the gene for which to create a MethylMix plot.#MixtureModelResults：MethylMix函数返回的list### 三个数据集：METcancer，GEcancer，METnormal

输出结果：两张图：

MethylMix plots：a histogram of the methylation data (MixtureModelPlot)
scatterplot between DNA methylation and gene expression

MethylMix_Predict

用途：Given a new data set with methylation data, this function predicts the mixture component for each new sample and driver gene. Predictions are based on posterior probabilities calculated with MethylMix’x fitted mixture model

用法:

MethylMix_Predict(newBetaValuesMatrix, MethylMixResult)

### newBetaValuesMatrix：新矩阵，genes/cpg sites in rows, samples in columns.  genes/cpg sites的名字要和MethylMix函数中用到的矩阵中的genes/cpg sites的名字保持一致，但是genes/cpg sites数量可以不一致。

### MethylMixResult：MethylMix函数输出结果

输出结果：输出一个预测矩阵，driver genes in rows, new samples in columns

predictOneGene

用途：Given a new vector of beta values, this function calculates a matrix with posterior prob of belonging to each mixture commponent (columns) for each new beta value (rows), and return the number of the mixture component with highest posterior probabilit

用法:

predictOneGene(newVector, mixtureModel)### newVector：vector with new beta values### mixtureModel：beta mixture model object for the gene being evaluated.

输出结果：输出一个预测矩阵，driver genes in rows, new samples in columns

代码事例

### Optional register cluster to run in parallellibrary(doParallel)cl <- makeCluster(5)registerDoParallel(cl)### Methylation data for ovarian cancercancerSite <- "OV"targetDirectory <- paste0(getwd(), "/")### GetData可以替代下面的代码GetData(cancerSite, targetDirectory)### Downloading methylation dataMETdirectories <- Download_DNAmethylation(cancerSite, targetDirectory, TRUE)### Processing methylation dataMETProcessedData <- Preprocess_DNAmethylation(cancerSite, METdirectories)### Saving methylation processed datasaveRDS(METProcessedData, file = paste0(targetDirectory, "MET_", cancerSite, "_Processed.rds"))### Clustering methylation datares <- ClusterProbes(METProcessedData[[1]], METProcessedData[[2]])### Saving methylation clustered datatoSave <- list(METcancer = res[[1]], METnormal = res[[2]], ProbeMapping = res$ProbeMapping)saveRDS(toSave, file = paste0(targetDirectory, "MET_", cancerSite, "_Clustered.rds"))stopCluster(cl)

### load the three data sets needed for MethylMixdata(METcancer)data(METnormal)data(GEcancer)### run MethylMix on a small set of example dataMethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)### run in parallellibrary(doParallel)cl <- makeCluster(5)registerDoParallel(cl)MethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)stopCluster(cl)

### load data setsdata(METcancer)data(GEcancer)### model gene expressionMethylMixResults <- MethylMix_ModelGeneExpression(METcancer, GEcancer)

### Load the three data sets needed for MethylMixdata(METcancer)data(METnormal)data(GEcancer)### Run methylmix on a small set of example dataMethylMixResults <- MethylMix(METcancer, GEcancer, METnormal)### Plot the most famous methylated gene for glioblastomaMethylMix_PlotModel("MGMT", MethylMixResults, METcancer)

### Plot MGMT also with its normal methylation variationMethylMix_PlotModel("MGMT", MethylMixResults, METcancer, METnormal = METnormal)### Plot a MethylMix model for another geneMethylMix_PlotModel("ZNF217", MethylMixResults, METcancer, METnormal = METnormal)### Also plot the inverse correlation with gene expression (creates two separate plots)MethylMix_PlotModel("MGMT", MethylMixResults, METcancer, GEcancer, METnormal)### Plot all functional and differential genesfor (gene in MethylMixResults$MethylationDrivers) {MethylMix_PlotModel(gene, MethylMixResults, METcancer, METnormal = METnormal)}

### toy example new data, of same dimension of original METcancer datanewMETData <- matrix(runif(length(METcancer)), nrow = nrow(METcancer))rownames(newMETData) <- rownames(METcancer)colnames(newMETData) <- paste0("sample", 1:ncol(METcancer))predictions <- MethylMix_Predict(newMETData, MethylMixResults)