现如今,由于二代测序的普及化、公共数据库的便利化,越来越多的科研工作者可以将不同的组学数据进行大数据的整合分析。今天,小编就为大家讲解如何进行QTL的分析流程。 首先,我们要理解什么是QTL。数量性状基因座(Quantitative
Trait Loci, QTL)是指染色体上一些能特定调控mRNA(eQTL)、甲基化水平(mQTL)的SNP位点,其mRNA、甲基化的表达水平量与数量性状成比例关系。 有很多文章都用到了QTL的分析,比如小编上一期讲解的“Modulation of long noncoding RNAs by risk SNPs underlying genetic predispositions to prostate cancer”,用到了eQTL的分析;以及“Association and cis-mQTL Analysis of Variants in CHRNA3-A5, CHRNA7, CHRNB2, and CHRNB4 in Relation to Nicotine Dependence in a Chinese Han Population”这篇文章中结合了表达和甲基化数据,做了meQTL的分析。他们做这些分析的目的,是结合DNA、RNA、甲基化的数据,探索一些和疾病显著关联的intron上的SNP的潜在生物学机制。比如在“Association and cis-mQTL Analysis of Variants in CHRNA3-A5, CHRNA7, CHRNB2, and CHRNB4 in Relation to Nicotine Dependence in a Chinese Han Population”文章中,作者发现rs3743075和尼古丁成瘾显著相关,并且,这个位点不仅是cis-eQTL,还可以调节临近位点的甲基化水平。那么,这个位点可能通过增加了附近的甲基化水平,抑制了基因的表达,从而导致了疾病的发生。 这里正式开始给大家如何做QTL的分析。我们用到的软件是MatrixEQTL
package (v 2.1.1),具体所需要的文件格式,大家请查看我们之前的推文“meQTL分析,你分析过吗?”。
library(MatrixEQTL) # Genotype file
name SNP_file_name =
paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/rs1176744.txt",
sep="\t"); snps_location_file_name
= paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/rs1176744loc.txt",
sep="\t"); # Gene expression
file name expression_file_name
= paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/GE_chr11.txt",
sep="\t"); gene_location_file_name
=
paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/geneloc_chr11.txt",
sep="\t"); # Covariates file
name covariates_file_name
= paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/Covariates.txt",
sep="\t"); 此处一共输入5个文件,分别是SNP文件(注意转换为dosage格式)、SNP位置信息、表达\甲基化文件、表达\甲基化位置信息、协变量文件,需要注意的是,所有矩阵的样本顺序要保持一致。 # Output file name output_file_name_cis
= tempfile(); output_file_name_tra
= tempfile(); # Only
associations significant at this level will be saved pvOutputThreshold_cis
= 5e-2; pvOutputThreshold_tra
= 1e-20; 此处是设定大家对cis-eQTL和trans-eQTL阈值的设定; # Error covariance
matrix # Set to numeric()
for identity. errorCovariance =
numeric(); # errorCovariance
= read.table("Sample_Data/errorCovariance.txt"); # Distance for
local gene-SNP pairs cisDist = 2.5e5; 这里是设定cis-eQTL的作用范围,即以target
SNP为中心,上下游250kb范围,查看该范围内哪些基因的表达水平和target SNP显著相关。 ## Load genotype
data snps = SlicedData$new(); snps$fileDelimiter
= "\t"; # the TAB
character snps$fileOmitCharacters
= "NA"; # denote missing values; snps$fileSkipRows
= 1; # one row of column labels snps$fileSkipColumns
= 1; # one column of row labels snps$fileSliceSize
= 1000; # read file in slices of
2,000 rows snps$LoadFile(SNP_file_name); ## Load gene
expression data gene =
SlicedData$new(); gene$fileDelimiter
= "\t"; # the TAB
character gene$fileOmitCharacters
= "NA"; # denote missing values; gene$fileSkipRows
= 1; # one row of column labels gene$fileSkipColumns
= 1; # one column of row labels gene$fileSliceSize
= 2000; # read file in slices of
2,000 rows gene$LoadFile(expression_file_name); ## Load covariates cvrt =
SlicedData$new(); cvrt$fileDelimiter
= "\t"; # the TAB
character cvrt$fileOmitCharacters
= "NA"; # denote missing values; cvrt$fileSkipRows
= 1; # one row of column labels cvrt$fileSkipColumns
= 1; # one column of row labels if(length(covariates_file_name)>0)
{ cvrt$LoadFile(covariates_file_name); } ## Run the
analysis snpspos =
read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE); genepos =
read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE); me = Matrix_eQTL_main( snps = snps, gene = gene, cvrt = cvrt, output_file_name = output_file_name_tra, pvOutputThreshold = pvOutputThreshold_tra, useModel = useModel, errorCovariance = errorCovariance, verbose = TRUE, output_file_name.cis = output_file_name_cis, pvOutputThreshold.cis =
pvOutputThreshold_cis, snpspos = snpspos, genepos = genepos, cisDist = cisDist, pvalue.hist = "qqplot", min.pv.by.genesnp = FALSE, noFDRsaveMemory = FALSE); unlink(output_file_name_tra); unlink(output_file_name_cis); ## Results: cat('Analysis done
in: ', me$time.in.sec, ' seconds', '\n'); cat('Detected
local eQTLs:', '\n'); show(me$cis$eqtls) cat('Detected
distant eQTLs:', '\n'); show(me$trans$eqtls) ## Plot the Q-Q
plot of local and distant p-values plot(me) write.csv(me$cis$eqtls,file='mQTL_250k.csv') |
|