RNA-seq differential gene expression analysis and evaluation of pooled-sample library construction

九斗酒 2017-08-01

Literature reading 1

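Paper title: Experimental validation of methods for differential gene expression analysis and sample pooling in RNA-seq
Source: BMC Genomics, 2015
Abstract (translated):
Background: Massively parallel cDNA sequencing (RNA-seq) experiments are gradually replacing microarrays for quantitative gene expression analysis. However, many biologists are unsure which differentially expressed gene (DEG) analysis method to choose, and whether cost-saving sample pooling strategies are valid for RNA-seq experiments. We therefore experimentally validated the DEGs identified by Cuffdiff2, edgeR, DESeq2 and the Two-stage Poisson Model (TSPM) in an RNA-seq experiment on mouse amygdala micropunches, using high-throughput qPCR on independent biological replicate samples. In addition, we sequenced RNA pools and compared the results with those from sequencing the corresponding individual samples.
Results: The false-positive rate of Cuffdiff2 and the false-negative rates of DESeq2 and TSPM were high. Among the four DEG analysis methods investigated, edgeR had relatively high sensitivity and specificity. We documented the pooling bias, and the DEGs identified in pooled samples had low positive predictive values.
Conclusions: Our results show that future RNA-seq experiments need to combine more sensitive DEG analysis methods with high-throughput validation of the identified DEGs. They also indicate that sample pooling has limited utility for RNA-seq in similar setups, and they support increasing the number of biological replicates.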

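Findings consistent with previous studies:

(1). DESeq has low sensitivity

(2). Cuffdiff has a high false-positive rate

(3). edgeR has high sensitivity

(4). The false-positive and false-negative rates of TSPM depend on the number of replicates

For these reasons edgeR and DESeq2 are now the commonly recommended tools, and Cuffdiff is no longer recommended.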
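
The sensitivity, false-positive rate and positive predictive value discussed above all come from comparing a method's DEG calls with the qPCR validation result. The sketch below shows one way to compute them; the function name, argument names and gene labels are hypothetical and not taken from the paper.

```python
def validation_metrics(called, validated, tested):
    """Compare DEG calls with qPCR validation over the genes that were tested.

    called    -- genes the RNA-seq method called differentially expressed
    validated -- genes confirmed as differentially expressed by qPCR
    tested    -- all genes assayed by qPCR (the evaluation universe)
    """
    tested = set(tested)
    called = set(called) & tested
    validated = set(validated) & tested

    tp = len(called & validated)           # called and confirmed
    fp = len(called - validated)           # called but not confirmed
    fn = len(validated - called)           # confirmed but missed
    tn = len(tested - called - validated)  # neither called nor confirmed

    def ratio(numerator, denominator):
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


# Hypothetical example: 10 genes tested by qPCR, 4 confirmed, 4 called.
tested = [f"g{i}" for i in range(1, 11)]
metrics = validation_metrics(
    called=["g1", "g2", "g3", "g5"],
    validated=["g1", "g2", "g3", "g4"],
    tested=tested,
)
print(metrics)
```

A method with many false negatives shows up here as low sensitivity, and a method with many false positives as low PPV and low specificity.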

Differences between the DEG analysis methods:

Method	edgeR	DESeq2	Cuffdiff2	TSPM
Normalization	a model that incorporates normalisation factors as offsets, estimated by the trimmed mean of M values for each contig	a relative log expression method	considers total number of reads, gene length, variability within and between the conditions, and differential isoform expression	accommodates various normalisation procedures, but works without normalisation by default
Distribution	negative binomial distribution	negative binomial distribution	beta negative binomial distribution	Poisson distribution
Dispersion estimation	edgeR moderates its dispersion estimates using their dispersion-mean relationship	DESeq2 is stringent in detecting outliers and excludes genes with extreme read counts by default. It uses maximum a posteriori dispersion estimates	Cuffdiff2 includes covariances between different isoforms	TSPM estimates dispersion per gene without using information across genes
P-value calculation	generalized linear model (GLM) likelihood ratio test	generalized linear model (GLM) likelihood ratio test	generalized linear model (GLM) likelihood ratio test	quasi-likelihood or standard likelihood ratio test, depending on whether a gene is over-dispersed

The main difference between these methods lies in how they estimate dispersion.
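
For orientation, the negative binomial count model is commonly written with the mean-variance relationship below. This is the standard textbook parameterisation, not a formula quoted from the paper, and the symbols $\mu_g$ (mean count of gene $g$) and $\phi_g$ (its dispersion) are notation introduced only for this sketch.

```latex
\operatorname{Var}(Y_g) = \mu_g + \phi_g \, \mu_g^{2},
\qquad
\phi_g = 0 \;\Rightarrow\; \operatorname{Var}(Y_g) = \mu_g \quad \text{(Poisson)}
```

In this notation the methods differ mainly in how $\phi_g$ is obtained: TSPM estimates it gene by gene, whereas edgeR and DESeq2 share information across genes.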

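Panels A and C both show that pooled samples yield markedly more DEGs than the corresponding individual samples. Pooling is effectively averaging, so it discards outliers and information about the size of within-group differences. Pooled library sequencing therefore underestimates within-group variation, which leads to many DEGs with low positive predictive value being called.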
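
The averaging effect can be illustrated with simple arithmetic. The sketch below uses made-up expression values for one gene in nine animals of one group and assumes each pool is an equal-amount mix, so that a pool behaves like the mean of its members; none of these numbers come from the paper.

```python
from statistics import mean, variance


def pool(values, pool_size):
    """Average consecutive groups of `pool_size` samples (idealised equal-amount pooling)."""
    return [mean(values[i:i + pool_size]) for i in range(0, len(values), pool_size)]


# Hypothetical normalised expression of one gene in 9 individual animals.
individual = [10, 30, 20, 14, 28, 24, 9, 29, 16]

# Three pools of three animals each.
pooled = pool(individual, 3)

var_individual = variance(individual)  # sample variance across animals
var_pooled = variance(pooled)          # sample variance across pools

print("individual variance:", var_individual)
print("pooled variance:", var_pooled)
```

The variance across pools is much smaller than the variance across animals, so a test that only sees the pools treats the group as far more homogeneous than it is and calls differences too readily.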

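Comparing panels A and C, fewer DEGs are identified with 8 pooled samples than with 3 pooled samples (18055 vs 15745), and the same holds for individual samples (82 vs 16). Adding biological replicates can therefore reduce the bias that pooling introduces into DEG calling.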

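Comparing panels B and D likewise shows that adding biological replicates improves the ability to estimate population-level variation and lowers both the pooling bias and the false-positive rate.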

RNA-seq analysis

RNA quality check:

NanoDrop 1000:
264 ng RNA/sample
Agilent 2100 Bioanalyzer:
RNA Integrity Number (RIN) 7.53 (SD 0.31)

Total RNA

Check RNA integrity
Determine the RNA concentration
Inspect the rRNA peaks
Compute the rRNA ratio (28S/18S or 23S/16S)
Compute the RNA integrity number (RIN)

Small RNA

Proportion of miRNA (10-40 nt) within small RNA (10-150 nt)
RIN >= 7: good
RIN between 6 and 7: can sometimes still give good results; worth trying if the samples are extremely precious
28S/18S > 0.7
Fluorescent unit > 1
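
The RIN rule of thumb above can be written as a small helper. This is only a sketch: the function name and the returned labels are hypothetical, and the "poor" label for RIN below 6 is an assumption, since the notes do not say what to do in that case.

```python
def assess_rin(rin):
    """Classify an RNA Integrity Number using the rule of thumb from the notes."""
    if rin >= 7:
        return "good"
    if rin >= 6:
        # May still work; worth trying when samples are extremely precious.
        return "borderline"
    # Assumption: the notes give no guidance below 6.
    return "poor"


# Mean RIN reported for the samples in these notes.
print(assess_rin(7.53))
```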

Reference link:
http://www.docin.com/p-769106334.html

DEG analysis pipeline

Quality control > Alignment to mouse genome (TopHat 2.0.6) >
Aligned read counting (HTSeq 0.54) > DEG analysis (edgeR 3.2.4, Cuffdiff 2.1.1, DESeq2 1.0.19 and TSPM)

Genes with adjusted p-values below 0.05 (Benjamini-Hochberg false discovery rate correction) were considered DEGs.
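
To make the cutoff concrete, here is a minimal pure-Python sketch of the Benjamini-Hochberg adjustment followed by the 0.05 threshold. The gene names and p-values are invented for illustration; in practice the DEG tools report adjusted p-values themselves.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
pvals = {"geneA": 0.001, "geneB": 0.02, "geneC": 0.03, "geneD": 0.5}
adj = dict(zip(pvals, benjamini_hochberg(list(pvals.values()))))

# Genes with adjusted p-value below 0.05 are called DEGs.
degs = [gene for gene, q in adj.items() if q < 0.05]
print(adj)
print(degs)
```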

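Criteria for validating DEGs by qPCR

A DEG identified in the RNA-seq analysis is counted as a true-positive DEG if it meets the following criteria:

1. RNA-seq and qPCR show the same direction of differential expression (up-regulated or down-regulated)

2. The fold change estimated by qPCR is either above 1.25 or below 0.8 (log2 fold change, LFC, cutoff of ±0.3219)

3. Spearman correlation coefficients, root-mean-square deviation and kappa statistics were computed with STATA 13.1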
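
Criteria 1 and 2 can be checked mechanically. The sketch below assumes the RNA-seq result is given as a log2 fold change and the qPCR result as a plain fold change; the function and variable names are hypothetical. It also shows why the two fold-change limits correspond to a single symmetric LFC cutoff, since log2(1.25) and log2(0.8) differ only in sign.

```python
import math

# log2(1.25) is about 0.3219; log2(0.8) is the same magnitude with a minus sign.
LFC_CUTOFF = math.log2(1.25)


def is_true_positive(rnaseq_log2fc, qpcr_fold_change):
    """Apply criteria 1 and 2 to one gene called as a DEG by RNA-seq."""
    qpcr_log2fc = math.log2(qpcr_fold_change)
    # Criterion 1: both assays agree on up- or down-regulation.
    same_direction = (rnaseq_log2fc > 0) == (qpcr_log2fc > 0)
    # Criterion 2: qPCR fold change above 1.25 or below 0.8.
    large_enough = abs(qpcr_log2fc) > LFC_CUTOFF
    return same_direction and large_enough


print(round(LFC_CUTOFF, 4))
print(is_true_positive(1.0, 1.5))   # up in both, qPCR change large enough
print(is_true_positive(1.0, 1.1))   # same direction, qPCR change too small
print(is_true_positive(1.0, 0.7))   # opposite directions
```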

Original article link:

https://www.ncbi.nlm./pmc/articles/PMC4515013/