分享

公共数据库挖掘必备-QTL分析

 高六博 2018-05-03

现如今,由于二代测序的普及化、公共数据库的便利化,越来越多的科研工作者可以将不同的组学数据进行大数据的整合分析。今天,小编就为大家讲解如何进行QTL的分析流程。

首先,我们要理解什么是QTL。数量性状基因座(Quantitative Trait Loci, QTL)是指染色体上一些能特定调控mRNAeQTL)、甲基化水平(mQTL)SNP位点,其mRNA、甲基化的表达水平量与数量性状成比例关系。

有很多文章都用到了QTL的分析,比如小编上一期讲解的“Modulation of long noncoding RNAs by risk SNPs underlying genetic predispositions to prostate cancer”,用到了eQTL的分析;以及“Association and cis-mQTL Analysis of Variants in CHRNA3-A5, CHRNA7, CHRNB2, and CHRNB4 in Relation to Nicotine Dependence in a Chinese Han Population”这篇文章中结合了表达和甲基化数据,做了meQTL的分析。他们做这些分析的目的,是结合DNARNA、甲基化的数据,探索一些和疾病显著关联的intron上的SNP的潜在生物学机制。比如在“Association and cis-mQTL Analysis of Variants in CHRNA3-A5, CHRNA7, CHRNB2, and CHRNB4 in Relation to Nicotine Dependence in a Chinese Han Population”文章中,作者发现rs3743075和尼古丁成瘾显著相关,并且,这个位点不仅是cis-eQTL,还可以调节临近位点的甲基化水平。那么,这个位点可能通过增加了附近的甲基化水平,抑制了基因的表达,从而导致了疾病的发生。

 
这里正式开始给大家如何做QTL的分析。我们用到的软件是MatrixEQTL package (v 2.1.1),具体所需要的文件格式,大家请查看我们之前的推文“meQTL分析,你分析过吗?”。

library(MatrixEQTL)

# Genotype file name

SNP_file_name = paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/rs1176744.txt", sep="\t");

snps_location_file_name = paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/rs1176744loc.txt", sep="\t");

# Gene expression file name

expression_file_name = paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/GE_chr11.txt", sep="\t");

gene_location_file_name = paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/geneloc_chr11.txt", sep="\t");

# Covariates file name

covariates_file_name = paste("/data/liuqiang/mQTL/data_set/wgs/mqtl/HHJ/Covariates.txt", sep="\t");

此处一共输入5个文件,分别是SNP文件(注意转换为dosage格式)、SNP位置信息、表达\甲基化文件、表达\甲基化位置信息、协变量文件,需要注意的是,所有矩阵的样本顺序要保持一致。

# Output file name

output_file_name_cis = tempfile();

output_file_name_tra = tempfile();

# Only associations significant at this level will be saved

pvOutputThreshold_cis = 5e-2;

pvOutputThreshold_tra = 1e-20;

此处是设定大家对cis-eQTLtrans-eQTL阈值的设定;

# Error covariance matrix

# Set to numeric() for identity.

errorCovariance = numeric();

# errorCovariance = read.table("Sample_Data/errorCovariance.txt");

# Distance for local gene-SNP pairs

cisDist = 2.5e5;

这里是设定cis-eQTL的作用范围,即以target SNP为中心,上下游250kb范围,查看该范围内哪些基因的表达水平和target SNP显著相关。

## Load genotype data

snps = SlicedData$new();

snps$fileDelimiter = "\t";      # the TAB character

snps$fileOmitCharacters = "NA"; # denote missing values;

snps$fileSkipRows = 1;          # one row of column labels

snps$fileSkipColumns = 1;       # one column of row labels

snps$fileSliceSize = 1000;      # read file in slices of 2,000 rows

snps$LoadFile(SNP_file_name);

## Load gene expression data

gene = SlicedData$new();

gene$fileDelimiter = "\t";      # the TAB character

gene$fileOmitCharacters = "NA"; # denote missing values;

gene$fileSkipRows = 1;          # one row of column labels

gene$fileSkipColumns = 1;       # one column of row labels

gene$fileSliceSize = 2000;      # read file in slices of 2,000 rows

gene$LoadFile(expression_file_name);

## Load covariates

cvrt = SlicedData$new();

cvrt$fileDelimiter = "\t";      # the TAB character

cvrt$fileOmitCharacters = "NA"; # denote missing values;

cvrt$fileSkipRows = 1;          # one row of column labels

cvrt$fileSkipColumns = 1;       # one column of row labels

if(length(covariates_file_name)>0) {

  cvrt$LoadFile(covariates_file_name);

}

## Run the analysis

snpspos = read.table(snps_location_file_name, header = TRUE, stringsAsFactors = FALSE);

genepos = read.table(gene_location_file_name, header = TRUE, stringsAsFactors = FALSE);

me = Matrix_eQTL_main(

  snps = snps,

  gene = gene,

  cvrt = cvrt,

  output_file_name     = output_file_name_tra,

  pvOutputThreshold     = pvOutputThreshold_tra,

  useModel = useModel,

  errorCovariance = errorCovariance,

  verbose = TRUE,

  output_file_name.cis = output_file_name_cis,

  pvOutputThreshold.cis = pvOutputThreshold_cis,

  snpspos = snpspos,

  genepos = genepos,

  cisDist = cisDist,

  pvalue.hist = "qqplot",

  min.pv.by.genesnp = FALSE,

  noFDRsaveMemory = FALSE);

unlink(output_file_name_tra);

unlink(output_file_name_cis);

## Results:

cat('Analysis done in: ', me$time.in.sec, ' seconds', '\n');

cat('Detected local eQTLs:', '\n');

show(me$cis$eqtls)

cat('Detected distant eQTLs:', '\n');

show(me$trans$eqtls)

## Plot the Q-Q plot of local and distant p-values

plot(me)

write.csv(me$cis$eqtls,file='mQTL_250k.csv')


    本站是提供个人知识管理的网络存储空间,所有内容均由用户发布,不代表本站观点。请注意甄别内容中的联系方式、诱导购买等信息,谨防诈骗。如发现有害或侵权内容,请点击一键举报。
    转藏 分享 献花(0

    0条评论

    发表

    请遵守用户 评论公约

    类似文章 更多