Data Mining Series | mRNA Data in TCGA

ky之路 2020-08-09

When downloading mRNA expression data with TCGAbiolinks, you can get different kinds of files by changing the file.type parameter:

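1. file.type = "results" downloads the *.rsem.genes.results file for each sample, which contains the columns gene_id, raw_count, scaled_estimate, and transcript_id.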

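Of the two columns that quantify mRNA expression, raw_count is the number of reads aligned to the gene. Note that it is not always an integer, because part of it is estimated. The value to look at closely is scaled_estimate, taking all samples together.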

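scaled_estimate values mostly fall between 1.0E-07 and 1.0E-05, and for each sample the column sums to between 0.7 and 1. To quote: "Newer versions of RSEM call this value (multiplied by 1e6) Transcripts Per Million (TPM)." In other words, multiplying scaled_estimate by 10^6 gives the TPM value!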
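The conversion can be sketched in a few lines of R. The gene names and numbers below are made up for illustration and are not real TCGA values.

```r
# Hypothetical scaled_estimate values for one sample (toy numbers, not TCGA data)
scaled_estimate <- c(GeneA = 2.5e-05, GeneB = 1.0e-06, GeneC = 4.0e-07)

# TPM is simply the scaled estimate multiplied by one million
tpm <- scaled_estimate * 1e6

# Per-sample sum of scaled_estimate, useful as a sanity check on real files
total <- sum(scaled_estimate)

print(tpm)
print(total)
```

On a real *.rsem.genes.results file the same multiplication would be applied to the whole scaled_estimate column.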

2. file.type = "normalized_results" downloads the *.rsem.genes.normalized_results file for each sample, which holds a normalized_count value for each gene.

normalized_count is obtained by dividing each raw_count in the sample by the sample's 75th percentile and multiplying by 1000. Reference script:

https://github.com/mozack/ubu/blob/master/src/perl/quartile_norm.pl

The same can be done in R, where exp is the raw_count vector of one sample:

exp/quantile(exp[which(exp>=1)], 0.75)*1000
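A slightly fuller sketch of the same idea, wrapped in a function and run on toy data. The names uq_normalize, min_count, and target are illustrative and not part of any package; the >= 1 cutoff mirrors the one-liner above. R's default quantile interpolation may not match the Perl script exactly, so small differences from the official files are possible.

```r
# Upper-quartile normalization of one sample's raw counts.
# uq_normalize, min_count and target are hypothetical names for illustration.
uq_normalize <- function(raw_count, min_count = 1, target = 1000) {
  # Use only genes with at least min_count reads to compute the quartile
  expressed <- raw_count[raw_count >= min_count]
  # 75th percentile of the expressed genes (R's default interpolation)
  uq <- quantile(expressed, probs = 0.75, names = FALSE)
  # Scale so that the upper quartile maps to the target value
  raw_count / uq * target
}

# Toy raw counts for one sample (not real TCGA data)
raw_count <- c(0, 0, 10, 20, 30, 40, 50)
normalized_count <- uq_normalize(raw_count)
print(normalized_count)
```

Genes with zero counts stay at zero; they are only excluded when the quartile itself is computed.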

Comparison with the files downloaded from Firehose:

1. http://gdac./runs/stddata__2016_01_28/data/COAD/20160128/gdac._COAD.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes__data.Level_3.2016012800.0.0.tar.gz

The values match those obtained by merging the samples downloaded with file.type = "results"!

2. http://gdac./runs/stddata__2016_01_28/data/COAD/20160128/gdac._COAD.Merge_rnaseqv2__illuminahiseq_rnaseqv2__unc_edu__Level_3__RSEM_genes_normalized__data.Level_3.2016012800.0.0.tar.gz

The values match those obtained by merging the samples downloaded with file.type = "normalized_results".
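"Merging the samples" here means combining the per-sample tables into one gene-by-sample matrix. A minimal base R sketch, assuming each per-sample table has a gene_id column and a normalized_count column; the sample names, gene IDs, and values are invented for illustration.

```r
# Hypothetical per-sample tables mimicking *.rsem.genes.normalized_results
sample_tables <- list(
  S1 = data.frame(gene_id = c("GeneA|1", "GeneB|2", "GeneC|3"),
                  normalized_count = c(100.5, 0, 23.1),
                  stringsAsFactors = FALSE),
  S2 = data.frame(gene_id = c("GeneA|1", "GeneB|2", "GeneC|3"),
                  normalized_count = c(80.2, 5.5, 0),
                  stringsAsFactors = FALSE)
)

# Rename the value column to the sample name so columns stay distinct
renamed <- lapply(names(sample_tables), function(s) {
  tab <- sample_tables[[s]]
  names(tab)[names(tab) == "normalized_count"] <- s
  tab
})

# Join all samples on gene_id into a single table
merged <- Reduce(function(x, y) merge(x, y, by = "gene_id"), renamed)

# Convert to a numeric matrix with genes as rows and samples as columns
expr_matrix <- as.matrix(merged[, -1, drop = FALSE])
rownames(expr_matrix) <- merged$gene_id
print(expr_matrix)
```

With real downloads, the list would be filled by reading each sample's file instead of typing the tables in.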

Comparison with the file downloaded from Xena:

https://tcga./download/TCGA.COAD.sampleMap/HiSeqV2.gz

Xena provides only the log2(normalized_count+1) transformed values.

That is, as Xena itself describes the data: log2(x+1) transformed RSEM normalized count.
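The transform and its inverse are easy to check in R. The numbers below are toy values chosen so the results are round; they are not taken from the Xena file.

```r
# Toy normalized_count values (not real data)
normalized_count <- c(0, 1, 3, 1023)

# Forward transform, as used for the Xena HiSeqV2 values
xena_value <- log2(normalized_count + 1)

# Inverse transform to recover normalized_count from Xena values
recovered <- 2^xena_value - 1

print(xena_value)
print(recovered)
```

Applying the inverse to the Xena matrix should give values comparable to the Firehose normalized file.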

In summary, three kinds of values are actually available for TCGA mRNA expression data: raw_count, scaled_estimate, and normalized_count:

http:///forums/showthread.php?t=42911

raw_count: for DESeq2 or edgeR, round to integers

represent the (estimated) number of reads that aligned to a transcript. This value is not an integer because RSEM only reports a guess of how many ambiguously mapping reads belong to a transcript/gene. This number is what the TCGA slightly misleadingly calls raw counts.

scaled_estimate: scaled_estimate * 1.0E6 = transcripts per million (TPM)

The scaled estimate value on the other hand is the estimated frequency of the gene/transcript amongst the total number of transcripts that were sequenced. Newer versions of RSEM call this value (multiplied by 1e6) TPM - Transcripts Per Million. It's closely related to FPKM, as explained on the RSEM website. The important point is that TPM, like FPKM, is independent of transcript length, whereas "raw" counts are not!

normalized_count: upper quartile normalized RSEM count estimates

The *.normalized_results files on the other hand just contain a scaled version of the raw_counts column. The values are divided by the 75-percentile and multiplied by 1000. This should make the values a bit more comparable between experiments. The Perl code for this quantile normalisation can be found here.
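Because raw_count is an RSEM estimate and may be fractional, it needs rounding before it goes into count-based tools such as DESeq2 or edgeR. A minimal sketch on an invented matrix; neither package is called here, and the gene and sample names are hypothetical.

```r
# Toy raw_count matrix with fractional RSEM estimates (not real TCGA data)
raw_count <- matrix(
  c(10.00, 3.49, 0.00,
    7.51, 120.20, 0.75),
  nrow = 3,
  dimnames = list(c("GeneA", "GeneB", "GeneC"), c("S1", "S2"))
)

# Round to the nearest whole number and store as integers
count_matrix <- round(raw_count)
storage.mode(count_matrix) <- "integer"

print(count_matrix)
```

The resulting integer matrix is the kind of input those packages expect as a count table.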

# Usage recommendations

normalized_count is the one most commonly used!

https://www./p/106127/

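In earlier posts I habitually called normalized_count the "RSEM value", because Xena describes the data as "log2(x+1) transformed RSEM normalized count", and "RSEM value" rolls off the tongue more easily than "normalized count". The two are in fact different concepts. RSEM is the software used for RNASeqV2 mRNA expression quantification in TCGA, and the file names also contain the string rsem, so it is easy to mistake it for a normalized value called RSEM!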

Readers familiar with TCGA may know that RNASeqV2 is currently the more commonly used mRNA Level 3 data:

https://wiki.nci./display/TCGA/RNASeq+Version+2

RNASeqV2 and RNASeqV1 use different processing pipelines:

The first approach used at TCGA relies on the RPKM method, while the second method uses MapSplice to do the alignment and RSEM to perform the quantitation.

For details, see:

https://confluence./download/attachments/29790363/DESCRIPTION.txt?version=1&modificationDate=1363806109000

References for the RNASeqV2 expression quantification pipeline:

https://github.com/mozack/ubu

https://github.com/zyxue/MapspliceRSEM

https://webshare.bioinf./public/mRNAseq_TCGA/UNC_mRNAseq_summary.pdf

http:///post/94066296740/what-do-tcgas-rnaseq-files-actually-show