GSEA: principles, running the software, and common errors with their fixes

生物_医药_科研 (Biology_Medicine_Research) 2018-11-19

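Part 1: GSEA principles

Goal: determine whether the members of a predefined gene set S are randomly distributed throughout the ranked gene list L, or concentrated at its top or bottom.

1. Expression profiles: samples are divided into two classes, labeled 1 or 2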

GSEA considers experiments with genomewide expression profiles from samples belonging to two classes, labeled 1 or 2.

2. Genes are ranked by the correlation between their expression and the class label

Genes are ranked based on the correlation between their expression and the class distinction by using any suitable metric.

3. Compute the enrichment score (ES)

Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.

Step 1: Calculation of an Enrichment Score.

We calculate an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L.

The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S.

The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic.
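The running-sum walk described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the GSEA implementation: the function name `enrichment_score`, its arguments, and the weight exponent `p` (with `p=0` giving an unweighted Kolmogorov-Smirnov-style walk) are names chosen here for the example.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Return (ES, running_sum) for gene_set over a ranked gene list.

    ranked_genes: gene names already sorted by the ranking metric.
    scores: the ranking metric of each gene, in the same order.
    p: weight exponent (assumption for this sketch); p=0 ignores the scores.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    # Hits are weighted by |score|^p; misses all get the same step size.
    hit_weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, in_set)]
    total_hit = sum(hit_weights)
    n_miss = len(ranked_genes) - sum(in_set)
    if total_hit == 0 or n_miss == 0:
        raise ValueError("need at least one weighted hit and one miss")

    running = 0.0
    running_sum = []
    for weight, hit in zip(hit_weights, in_set):
        # Step up on a gene in S, step down on a gene not in S.
        running += weight / total_hit if hit else -1.0 / n_miss
        running_sum.append(running)

    # ES is the maximum deviation from zero, keeping its sign.
    es = max(running_sum, key=abs)
    return es, running_sum
```

A set whose genes sit at the top of the list gives a positive ES, and a set concentrated at the bottom gives a negative ES. Because the up-steps and down-steps each sum to 1, the walk always returns to zero at the end of the list.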

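4. Assess the significance of the ES (p-value)

Significance is estimated by permutation; the number of permutations is configurable, for example 1000 or 500.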
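The permutation step can be sketched as follows. This is a simplified sketch under assumptions: `statistic` is a hypothetical callable that recomputes the ES from a shuffled label vector, and the p-value uses only the part of the null distribution with the same sign as the observed ES, which is how the cited paper describes the nominal p-value.

```python
import random


def permutation_null(labels, statistic, n_perm=1000, seed=0):
    """Build a null distribution by shuffling the class labels n_perm times.

    statistic: hypothetical callable mapping a label list to a score (e.g. an ES).
    """
    rng = random.Random(seed)
    shuffled = list(labels)
    null = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        null.append(statistic(shuffled))
    return null


def nominal_p_value(observed_es, null_es):
    """Fraction of same-sign null scores at least as extreme as the observed ES."""
    same_sign = [e for e in null_es if (e >= 0) == (observed_es >= 0)]
    if not same_sign:
        return float("nan")  # no null scores on this side of zero
    extreme = sum(1 for e in same_sign if abs(e) >= abs(observed_es))
    return extreme / len(same_sign)
```

With 1000 permutations the smallest non-zero p-value that can be reported is about 0.001, and with 500 it is about 0.002, which is one reason to prefer the larger number when run time allows.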

5. Correct for multiple hypothesis testing (FDR)

References:

Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles

http://www./content/102/43/15545

https://blog.csdn.net/qq_29300341/article/details/52956052

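Part 2: Running the software

Download link: http://software./gsea/downloads.jsp

Java must be installed beforehand, because the software runs on Java.

1. Software interface

2. File preparation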

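2.1 Expression dataset file (res, gct, pcl, or txt): the sample expression data

Usually the table is saved as a tab-delimited .txt file, and the extension is then changed from .txt to .gct.

Note: the second column of the table (the description column) is mandatory. It can be filled with na, but it must be there. I originally left this column out and spent a long time getting errors without knowing where the problem was.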
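A small Python helper can write the file with the description column always present. This sketch assumes the GCT layout as I understand it from the GSEA documentation: a first line `#1.2`, a second line with the number of genes and the number of samples, then a header row starting with `NAME` and `Description`. The function name `write_gct` and its arguments are made up for the example.

```python
import csv


def write_gct(path, sample_names, rows):
    """Write a tab-delimited GCT file.

    rows: list of (gene_name, description, values) tuples, one per gene.
    An empty description is written as "na" so the column is never missing.
    """
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["#1.2"])                       # format version line
        writer.writerow([len(rows), len(sample_names)])  # genes, samples
        writer.writerow(["NAME", "Description", *sample_names])
        for name, description, values in rows:
            if len(values) != len(sample_names):
                raise ValueError(f"{name}: expected {len(sample_names)} values")
            writer.writerow([name, description or "na", *values])
```

For example, `write_gct("demo.gct", ["s1", "s2"], [("GENE1", "", [1.5, 2.0])])` writes one gene row whose description is `na`.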

2.2 Phenotype labels file (cls): the phenotype class of each sample

Write it as a plain text file with the .cls extension, again tab-delimited.
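The .cls file can also be generated from a list of per-sample labels. The sketch below assumes the categorical CLS layout as I understand it: line 1 holds the number of samples, the number of classes, and the constant 1; line 2 starts with `#` followed by the class names; line 3 lists one label per sample, in the same order as the columns of the expression file. `write_cls` is a name chosen for this example.

```python
def write_cls(path, sample_labels):
    """Write a categorical CLS file from one label per sample."""
    # Class names in order of first appearance, so they line up with line 3.
    class_names = list(dict.fromkeys(sample_labels))
    lines = [
        "\t".join([str(len(sample_labels)), str(len(class_names)), "1"]),
        "\t".join(["#", *class_names]),
        "\t".join(sample_labels),
    ]
    with open(path, "w") as handle:
        handle.write("\n".join(lines) + "\n")
```

For six samples labeled `["tumor", "tumor", "tumor", "normal", "normal", "normal"]`, the first line becomes `6`, `2`, `1` separated by tabs.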

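2.3 Gene sets file (gmx or gmt): predefined gene sets (optional)

You can generate this file yourself in the gmx or gmt format; for example, the earlier KEGG localization work can produce such a file.

Alternatively, choose one of the databases defined in the software.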
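For a self-made gene set file, the gmt variant is the easier one to write: as I understand the format, each line is one gene set, with the set name, a description, and then the member genes, all separated by tabs. The helper name `write_gmt` and the example set below are invented for illustration.

```python
def write_gmt(path, gene_sets):
    """Write gene sets to a GMT file, one tab-delimited line per set.

    gene_sets: dict mapping set name -> (description, list of gene symbols).
    """
    with open(path, "w") as handle:
        for name, (description, genes) in gene_sets.items():
            if not genes:
                raise ValueError(f"gene set {name} is empty")
            fields = [name, description or "na", *genes]
            handle.write("\t".join(fields) + "\n")
```

For example, `write_gmt("custom.gmt", {"MY_PATHWAY": ("hand-curated", ["TP53", "MDM2", "CDKN1A"])})` produces a one-line file that can be loaded as a gene set database.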

2.4 Chip (array) annotation file (chip): array annotation (optional)
This can be selected within the software.

3. Run

3.1 Load data: load the files prepared above

3.2 Choose parameters

1) collapse dataset to gene symbols

true: microarray data
false: a gene expression matrix from sequencing

2) Chip platform

For non-array data, this can be left unselected
For array data, choose according to the array type

3) permutation type

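phenotype: recommended; requires at least 7 samples per group
gene_set: suitable when there are few samples

4) Significance threshold

With phenotype permutation, the FDR cutoff can be set to 0.25
With gene_set permutation, the FDR should be below 0.05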

5) metric for ranking genes

log2_Ratio_of_Classes, that is logFC, is usually a reasonable choice
Other metrics can be chosen as needed

6) gene set database

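You can choose a database provided in the software, such as KEGG, GO, or the GO sub-ontologies cc, bp, and mf
You can also supply your own gmt file

7) You can also choose the directory where results are saved

4. Click the Run button at the bottom

5. Interpreting the results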

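Part 3: Common errors and fixes

1. First error: Java heap space, OutOfMemoryError

This is the most troublesome error I have run into so far, and it took a long time to sort out.

It means that GSEA hit an OutOfMemoryError while running: Java did not have enough memory.

The bottom-right corner of the GSEA window shows the memory figures; in my case it read 84M, with 43M in use.

The fix is to raise the memory available to Java. My own clumsy approach was to download Eclipse: https://www./downloads/

Then I changed the setting by following the tutorials below, after which GSEA ran. On the next run, the 84M figure mentioned above becomes much larger.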

https://jingyan.baidu.com/article/5d6edee2f5efff99ebdeec63.html

https://blog.csdn.net/tomorrow13210073213/article/details/53031818

The memory limit can be set fairly high.
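Another route that does not need Eclipse is to pass the standard JVM option `-Xmx` when starting Java. The sketch below only builds the command line; the jar path is a placeholder, and whether your GSEA download can be started with `java -jar` at all is an assumption that depends on the build you have.

```python
import re


def gsea_java_command(jar_path, max_heap="4g"):
    """Build a java command with a larger maximum heap (-Xmx).

    jar_path: placeholder path to a GSEA jar file (assumption for this sketch).
    max_heap: a number followed by k, m, or g, for example "2g" or "1024m".
    """
    if not re.fullmatch(r"\d+[kKmMgG]", max_heap):
        raise ValueError(f"unexpected heap size: {max_heap!r}")
    return ["java", f"-Xmx{max_heap}", "-jar", jar_path]


# To actually launch it, pass the list to subprocess.run, e.g.:
#   import subprocess
#   subprocess.run(gsea_java_command("path/to/gsea.jar", "4g"), check=True)
```

If the setting takes effect, the memory figure in the bottom-right corner of the window should be much larger than 84M.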

Explanation of the metrics used to rank genes

Metrics for Ranking Genes

For categorical phenotypes, GSEA determines a gene’s mean expression value for each phenotype and then uses one of the following metrics to calculate the gene’s differential expression with respect to the two phenotypes. To use median rather than mean expression values, set the Median for class metrics parameter to True, as described above.

● Signal2Noise (default) uses the difference of means scaled by the standard deviation. Note: You must have at least three samples for each phenotype to use this metric.

where μ is the mean and σ is the standard deviation; σ has a minimum value of .2 * absolute(μ), where μ=0 is adjusted to μ=1. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

● tTest uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric.

where μ is the mean, n is the number of samples, and σ is the standard deviation; σ has a minimum value of .2 * absolute(μ), where μ=0 is adjusted to μ=1. The larger the tTest ratio, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

● Ratio_of_Classes (also referred to as fold change) uses the ratio of class means to calculate fold change for natural scale data:

where μ is the mean. The larger the fold change, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

● Diff_of_Classes uses the difference of class means to calculate fold change for log scale data:

where μ is the mean. The larger the fold change, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

● log2_Ratio_of_Classes uses the log2 ratio of class means to calculate fold change for natural scale data:

where μ is the mean. This is the recommended statistic for calculating fold change for natural scale data.

Source: 丁香园 (DXY), 夏木1220