是时候来一波RNA-Seq差异表达分析实操了

九斗酒 2017-09-21

展开全文

案例简介

数据来自于发表在Nature commmunication 上的一篇文章 “Temporal dynamics of gene expression and histone marks at the Arabidopsis shoot meristem during flowerin”。原文用RNA-Seq的方式研究在开花阶段,芽分生组织在不同时期的基因表达变化。

原文的流程是： TopHat -> SummarizeOverlaps -> Deseq2 -> AmiGO
其中比对的参考基因组为TAIR10 ver.24 ，并且屏蔽了ribosomal RNA
regions (2:3471–9557; 3:14,197,350–14,203,988)。Deseq2只计算至少在一个时间段的FPKM的count > 1 的基因。

数据存放在http://www./arrayexpress/, ID为E-MTAB-5130。

实验设计： 4个时间段（0,1,2,3），分别有4个生物学重复，一共有16个样品。

数据下载

首先下载数据说明文件：

$ wget http://www./arrayexpress/files/E-MTAB-5130/E-MTAB-5130.sdrf.txt

然后根据数据说明文件提供的FTP链接下载

$ head -n1 E-MTAB-5130.sdrf.txt | tr '\t' '\n' | nl | grep URI
    33  Comment[FASTQ_URI]

$ tail -n  2 E-MTAB-5130.sdrf.txt | cut -f 33 | xargs -i wget {}

根据下载速度，你们可以选择去吃吃饭，还是睡睡觉。

下载完RNA-Seq数据后，我们还需要下载一个拟南芥cDNA序列(记住是转录组，而不是全基因组，好了，你别说了，我记住了)

$ curl ftp://ftp.ensemblgenomes.org/pub/plants/release-28/fasta/arabidopsis_thaliana/cdna/Arabidopsis_thaliana.TAIR10.28.cdna.all.fa.gz -o athal.fa.gz

然后用Salmon建立索引：

salmon index -t athal.fa.gz -i athal_index

数据定量

由于样本一共有16个，不可能一条条输入命令，所以我们写一个脚本：

#! /bin/bash
for fn in ERR1698{194..209};
do
    samp=`basename ${fn}`
    echo 'Processin sample ${sampe}'
    salmon quant -i athal_index -l A         -1 ${samp}_1.fastq.gz         -2 ${samp}_2.fastq.gz         -p 8 -o quants/${samp}_quant
done

根据你电脑的配置，你可以选择吃下午茶，还是选择睡个午觉。

数据导入R

当然你完全不必真的去睡午觉，我们可以程序运行的时候准备一下tximport所需要的输入文件。

tximport可以纠正不同样本基因长度的潜在改变（比如说differential isoform usage）；能够用于导入 (Salmon, Sailfish, kallisto)程序的输出文件；能够避免丢弃那些比对到多个基因的同源序列，从而提高灵敏度

In particular, the tximport pipeline offers the following benefits: (i) this approach corrects for potential changes in gene length across samples (e.g. from differential isoform usage) (Trapnell et al. 2013), (ii) some of the upstream quantification methods (Salmon, Sailfish, kallisto) are substantially faster and require less memory and disk usage compared to alignment-based methods that require creation and storage of BAM files, and (iii) it is possible to avoid discarding those fragments that can align to multiple genes with homologous sequence, thus increasing sensitivity (Robert and Watson 2015)

虽然tximport的参数看起来很多，但其实需要我们准备的就是两个files和tx2gene

tximport(files, type = c('none', 'salmon', 'sailfish', 'kallisto', 'rsem'),
  txIn = TRUE, txOut = FALSE, countsFromAbundance = c('no', 'scaledTPM',
  'lengthScaledTPM'), tx2gene = NULL, varReduce = FALSE,
  dropInfReps = FALSE, ignoreTxVersion = FALSE, geneIdCol, txIdCol,
  abundanceCol, countsCol, lengthCol, importer = NULL)

files存放的是salmon的输出文件，所以我们需要根据文件存放位置，进行声明

dir <- 'C:/Users/Xu/Desktop/'
list.files(dir)
sample <- paste0('ERR1698',c(194,seq(202,209),seq(195,201)),'_quant')
files <- file.path(dir,'quants',sample,'quant.sf')
names(files) <- paste0('sample',seq(1,16))
all(file.exists(files))

吐槽: 原本的我以为sample从1到16会和194到206是一一对应，但是万万没想到，他的对应关系居然是下图这个样子的。看来凡是都不能想当然，一定要仔细看他们的对应关系。

然后我们还要准备一个基因名和转录本名字相关的数据框

library(AnnotationHub)
ah <- AnnotationHub()
ath <- query(ah,'thaliana')
ath_tx <- ath[['AH52247']]
columns(ath_tx)
k <- keys(ath_tx,keytype = 'GENEID')
df <- select(ath_tx, keys=k, keytype = 'GENEID',columns = 'TXNAME')
tx2gene <- df[,2:1] # TXID在前， GENEID在后

如果你电脑跑的够快，基本上这个时候就可以导入数据了。

# install.packages('readr')
# install.packages('rsjon')
library('tximport')
library('readr')
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)

# reading in files with read_tsv
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 
# summarizing abundance
# summarizing counts
# summarizing length

names(txi)
[1] 'abundance'           'counts'              'length'             
[4] 'countsFromAbundance

head(txi$lenght)
head(txi$counts)

由于后续要用DESeq2包做差异表达分析，所以需要用DESeqDataSetFromTximport这个函数，当然你还需要说明你的实验设计

library('DESeq2')
sampleTable <- data.frame(condition = factor(
  rep(c('Day0','Day1','Day2','Day3'),each=4)))
rownames(sampleTable) <- colnames(txi$counts)
dds <- DESeqDataSetFromTximport(txi, sampleTable, ~condition)
dim(dds)

数据预过滤：

dds <- dds[ rwoSums(counts(dds)) > 1 ,]
dim(dds)

差异表达分析

标准的差异表达分析步骤在DESeq2只需要DESeq一个函数就可以完成。然后可以通过results提取结果表，包括log2 change, p values, adjusted p values.通过summary查看描述性结果

# 超过100个样本用
# library('BiocParallel')
# register('MulticoreParam(4))
dds <- DESeq(dds)
res <- results(dds)
summary(res)

results还有许多参数：

contrast用于指定比较的对象，默认是最后一个（实验组）与第一个（对照组）比较。
lfcThreshold: 用于指定log2 fold change的阈值，默认是0. fold change = (group1 - group2)/min(group1,group2).
alpha: 显著性水平阈值
pAdjustMethod：多重实验校正

于是我分别用Day1,2,3与Day0进行比较：

res0vs1 <- results(dds, contrast = c('condition','Day1','Day0'))
res0vs2 <- results(dds, contrast = c('condition','Day2','Day0'))
res0vs3 <- results(dds, contrast = c('condition','Day3','Day0'))

最后找到了208个差异表达基因，105上调，103个下调；而文章是298个，148上调，156下调。

res1SigUp <- subset(res1, padj < 0.01 & log2FoldChange >0 ) 
res1SigDown <- subset(res1, padj < 0.01 & log2FoldChange <0 ) 
res2SigUp <- subset(res2, padj < 0.01 & log2FoldChange >0 ) 
res2SigDown <- subset(res2, padj < 0.01 & log2FoldChange <0 ) 
res3SigUp <- subset(res3, padj < 0.01 & log2FoldChange >0 ) 
res3SigDown <- subset(res3, padj < 0.01 & log2FoldChange <0 ) 

sigUpGene <- unique(c(rownames(res1SigUp),rownames(res2SigUp),rownames(res3SigUp)))
length(sigUpGene)
# 105
sigDownGene <- unique(c(rownames(res1SigDown),rownames(res2SigDown),rownames(res3SigDown)))
length(sigDownGene)
# 103

探索性分析

特殊基因的表达情况

有些基因是分生组织的特异基因，如STM, KNOTTED-LIKE, FROM ARABIDOPSIS THALIANA 1(KNAT1), CLAVATA 1(CLV1)和CLV3, 从理论上应该有一定的表达量

早花同源异形基因（homeotic gene)，如AP1, APETALA 3(AP3), CAULIFLOWER(CAL)和发育中期和后期基因JAGGED(JAG), WUSCHEL RELATED HOMEOBO 1(WOX1) WOX3 从理论上应该不表达。

上面的基因名我们需要先转成TAIR的ID，才能从dds中提取相应的counts.然后使用dplyr和tidyr组织数据, 以meristem_identify_genes为例：

meristem_identify_genes <- select(th, keys = c('STM','KNAT1','CLV1','CLV3'),
       columns = c('TAIR'), keytype = 'SYMBOL')
library(tidyverse)
msData <- dds[which(rownames(dds) %in% meristem_identify_genes$TAIR)]
msData1 <- as.data.frame(assay(msData))
msData1$gene = rownames(msData)
msData2 <- msData1 %>%
  gather(sample,count, -gene) %>%
  mutate(condition = rep(c('Day0','Day1','Day2','Day3'),each=20)) %>%
  arrange(gene)

由于这个画图过程，后续会用大很多次，为了避免重复操作，我写成了一个函数，我建议你根据上面的流程自己写一个函数，不要用我的。

plotGeneCounts <- function(genes) {
  require(ggplot2)
  require(tidyr)
  require(dplyr)
  GeneName <- select(th, keys=genes, columns=c('TAIR'),keytype = 'SYMBOL')
  GeneName <- GeneName[order(GeneName$TAIR), ]
  Data <- dds[which(rownames(dds) %in% GeneName$TAIR)]
  Data1 <- as.data.frame(assay(Data))
  Data1$gene <- GeneName$SYMBOL[GeneName$TAIR == rownames(Data1)]
  Data2 <- Data1 %>%
    gather(sample,count, -gene) %>%
    mutate(condition = rep(c('Day0','Day1','Day2','Day3'), each= length(sample) / 4))
  p <- ggplot(Data2)
  p1 <- p   geom_boxplot(aes(x=gene,y=count, fill=condition))
  p2 <- p1   theme(axis.title.x = element_blank())   ylab('Counts')
  print(p2)
  return(GeneName)
}

那么画图就是plotGeneCounts(c('STM','KNAT1','CLV1','CLV3'))

原文用柱状图展示不同基因在不同处理下的表达量，

但是柱状图的信息太少，所以我们用ggplot2做箱线图。

同样的方法我们可以做出早花同源异形基因（homeotic gene)，如AP1, APETALA 3(AP3), CAULIFLOWER(CAL)和发育中期和后期基因JAGGED(JAG), WUSCHEL RELATED HOMEOBO 1(WOX1) WOX3 的箱线图：

## early floral homeotic genes
plotGeneCounts(c('AP1','AP3','CAL'))
# genes expressed in the centre or abaxial side of developing leaves21
plotGeneCounts(c('JAG','WOX1','WOX3'))

和文章的结论基本一致，早花同源异形基因在我的流程中只能检测到CAL, 而发育中期和后期的基因基本上表达量也特别低。不过文章说CAL3这个基因能被稳定被检测到，实际上表达量与这些基因没有明显区别（< 100)，我也不知道他如何得到这个结论的。

文章在比较了不同时期的基因表达情况后，发现
morning-expressed genes of the central oscillator of the circadian clock: LATE ELONGATED HYPOCOTYL (LHY), CRICADOAM CLOCK ASSPCIATED 1 (CCA1) 在转移到长日照下的第一天表达量显著性提高，而evening-expressed clock gene PSEUDO-RESPONSE REGULATOR 5(PRR5) 则是显著性下调。并且evening复合体编码基因LUX
ARRHYTHMO (LUX), EARLY FLOWERING 4 (ELF4)也和PRR5有相同的趋势。作者认为这些节律基因表达量的变化是因为日长变化导致的，而不是amplitude改变。

首先我根据文章给出的基因，先作图看看是不是真的有这样的趋势：

# morning-expressed genes of the central oscillator of the circadian clock
# evening-expressed clock gene2
mGene <- c('LHY', 'CCA1', 'ELF4', 'PRR5','LUX')
mGenes <-plotGeneCounts(mGene)

看来很类似，下一步就是判断这些基因是不是也在我找到显著性差异表达的基因列表里。

> mGenes[mGenes$TAIR %in% sigUpGene,]
  SYMBOL      TAIR
1    LHY AT1G01060
2   CCA1 AT2G46830
> mGenes[mGenes$TAIR %in% sigDownGene,]
[1] SYMBOL TAIR  
<0 行> (或0-长度的row.names)

很尴尬，显著性上调的基因也都在我的列表中，但是显著性下调的，一个都没有，于是我把水平从0.01提高到0.05，重新检查下

res1SigDown <- subset(res1, padj < 0.05 & log2FoldChange <0 ) 
res2SigDown <- subset(res2, padj < 0.05 & log2FoldChange <0 ) 
res3SigDown <- subset(res3, padj < 0.05 & log2FoldChange <0 )
sigDownGene <- unique(c(rownames(res1SigDown),rownames(res2SigDown),rownames(res3SigDown)))
length(sigDownGene)
mGenes[mGenes$TAIR %in% sigDownGene,]
  SYMBOL      TAIR
4   PRR5 AT5G24470

好吧还是只找到一个，毕竟ELF4这个表达量实在是太低了，除非水平变成是0.1，不然是难以发现的啦。但是这个还是证明作者的猜想是对的。

随后作者又去找了一些flowering prmotation gene SOC1, FD, NF-YA4, 以及flowering-time repressor, 如JMJ30

LDgenes<- c('SOC1','FD','NF-YA4','JMJ30')
plotGeneCounts(LDgenes)

小结：可以找一些特征基因用来验证RNA-Seq是否正确

GO富集分析

文献通过GO富集分析发现：SAM诱导的基因大部分富集在long-day photoperiodism (GO:0048574) and regulation of circadian rhythm (GO:0042754）
下调的基因则是富集在ribosome biogenesis (GO:0042254) and the control of translation (GO:0006412; Fig. 2i).

为了验证这个结果是不是正确的，我使用Y叔良心之作clusterProfiler，进行GO富集分析。

结果在上调中并没有找到富集基因，但是在下调（day1 vs day0 和 day3 vs day0)中却发现了好多富集基因，虽然场面十分尴尬，但是至少说明下调的基因大多属于ribosome

res1DownGO <- enrichGO(gene = rownames(res1SigDown),
                   OrgDb = th,
                   keytype = 'TAIR',
                   pvalueCutoff = 0.1,
                 pAdjustMethod = 'fdr'
                   )

res3DownG0 <- enrichGO(gene = rownames(res3SigDown),
                   OrgDb = th,
                   keytype = 'TAIR',
                   pvalueCutoff = 0.1,
                   )
dotplot(res3DownG0)