Part 1: GSEA principles

Goal: determine whether a predefined gene set S is randomly distributed across the ranked gene list L, or concentrated at its top or bottom.

1. Expression profiles, with samples divided into two classes labeled 1 and 2
GSEA considers experiments with genomewide expression profiles from samples belonging to two classes, labeled 1 or 2.

2. Genes are ranked by the correlation between their expression and the class distinction
Genes are ranked based on the correlation between their expression and the class distinction by using any suitable metric.

3. Calculate the enrichment score (ES)
Given an a priori defined set of genes S (e.g., genes encoding products in a metabolic pathway, located in the same cytogenetic band, or sharing the same GO category), the goal of GSEA is to determine whether the members of S are randomly distributed throughout L or primarily found at the top or bottom. We expect that sets related to the phenotypic distinction will tend to show the latter distribution.
Step 1: Calculation of an Enrichment Score. We calculate an enrichment score (ES) that reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of the entire ranked list L. The score is calculated by walking down the list L, increasing a running-sum statistic when we encounter a gene in S and decreasing it when we encounter genes not in S. The magnitude of the increment depends on the correlation of the gene with the phenotype. The enrichment score is the maximum deviation from zero encountered in the random walk; it corresponds to a weighted Kolmogorov–Smirnov-like statistic.

4. Assess the significance of the ES (p value)
Significance is estimated by permutation; the number of permutations can be set to, for example, 1000 or 500.

5. Multiple testing correction (FDR value)

ref:
Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles
http://www./content/102/43/15545
https://blog.csdn.net/qq_29300341/article/details/52956052

Part 2: Running the software

Download link: http://software./gsea/downloads.jsp
Java must be installed first, because the software runs on Java.

1. Software interface

2. File preparation

2.1. Expression dataset file (res, gct, pcl, or txt)
The sample expression file. Typically you save the table as a tab-delimited .txt file and then change the extension from .txt to .gct.
Note: the second column of the table, the description column, is required. It can be filled with na, but it must be present. I originally left this column out and spent a long time chasing errors without knowing where the problem was.

2.2. Phenotype labels file (cls)
The sample phenotype classification file. Write it as a plain text file ending in .cls; it is also tab-delimited.

2.3. Gene sets file (gmx or gmt)
Predefined gene sets (optional). You can generate this file yourself following the format above; for example, the earlier KEGG localization step can produce such a file. You can also choose one of the databases defined in the software.

2.4. Chip (array) annotation file (chip)
Chip annotation file (optional).

3. Run

3.1 Load data: load the files prepared above.

3.2 Select parameters
1) collapse dataset to gene symbols
2) Chip platform
3) permutation type
4) significance parameters
5) metric for ranking genes
6) gene set database
7) users can also choose their own path for saving the results

4. Click the Run button at the bottom

5. Interpreting the results

Part 3: Common errors and solutions

1. First error: Java heap space, OutOfMemoryError
So far this is the most troublesome error I have run into, and it took a long time to sort out. It means that GSEA ran out of memory while running (OutOfMemoryError).
The lower right corner of the GSEA window shows the memory available to the program; in my case the total was 84M, of which 43M was in use.
The fix is to increase the memory available to Java. My own clumsy approach was to download Eclipse (https://www./downloads/) and change the setting by following the tutorials below, after which the analysis ran. When you run it again, the 84M figure mentioned above will be much larger.
https://jingyan.baidu.com/article/5d6edee2f5efff99ebdeec63.html
https://blog.csdn.net/tomorrow13210073213/article/details/53031818
You can set the value somewhat larger.

Explanation of the metrics used for ranking genes

Metrics for Ranking Genes
For categorical phenotypes, GSEA determines a gene’s mean expression value for each phenotype and then uses one of the following metrics to calculate the gene’s differential expression with respect to the two phenotypes. To use median rather than mean expression values, set the Median for class metrics parameter to True, as described above.

● Signal2Noise (default) uses the difference of means scaled by the standard deviation. Note: You must have at least three samples for each phenotype to use this metric.
    (μ_A - μ_B) / (σ_A + σ_B)
where μ is the mean and σ is the standard deviation, with subscripts A and B denoting the two phenotypes; σ has a minimum value of .2 * absolute(μ), where μ=0 is adjusted to μ=1. The larger the signal-to-noise ratio, the larger the differences of the means (scaled by the standard deviations); that is, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”
● tTest uses the difference of means scaled by the standard deviation and number of samples. Note: You must have at least three samples for each phenotype to use this metric.
    (μ_A - μ_B) / sqrt(σ_A^2 / n_A + σ_B^2 / n_B)
where μ is the mean, n is the number of samples, and σ is the standard deviation, with subscripts A and B denoting the two phenotypes; σ has the same minimum value as for Signal2Noise, .2 * absolute(μ), where μ=0 is adjusted to μ=1.

● Ratio_of_Classes (also referred to as fold change) uses the ratio of class means to calculate fold change for natural scale data:
    μ_A / μ_B
where μ is the mean. The larger the fold change, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

● Diff_of_Classes uses the difference of class means to calculate fold change for log scale data:
    μ_A - μ_B
where μ is the mean. The larger the fold change, the more distinct the gene expression is in each phenotype and the more the gene acts as a “class marker.”

● log2_Ratio_of_Classes uses the log2 ratio of class means to calculate fold change for natural scale data:
    log2(μ_A / μ_B)
where μ is the mean. This is the recommended statistic for calculating fold change for natural scale data.

Source: 丁香园 夏木1220
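To make the ranking metrics above concrete, here is a minimal Python sketch that computes all five of them for a single gene. It is only an illustration, not GSEA's own code: the function names (floored_sigma, ranking_metrics) are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is an assumption, since the text above does not say which estimator is used.

```python
import math
import statistics


def floored_sigma(values, mu):
    """Standard deviation with the floor described above:
    at least 0.2 * |mu|, where mu = 0 is treated as mu = 1.
    Assumption: sample standard deviation (n - 1 denominator)."""
    sd = statistics.stdev(values)
    mu_for_floor = 1.0 if mu == 0 else mu
    return max(sd, 0.2 * abs(mu_for_floor))


def ranking_metrics(class_a, class_b):
    """Return the five ranking metrics for one gene.

    class_a, class_b: expression values of the gene in the two phenotypes.
    The ratio-based metrics assume natural scale data with positive means.
    """
    if len(class_a) < 3 or len(class_b) < 3:
        raise ValueError("at least three samples per phenotype are required")
    mu_a, mu_b = statistics.mean(class_a), statistics.mean(class_b)
    sd_a, sd_b = floored_sigma(class_a, mu_a), floored_sigma(class_b, mu_b)
    n_a, n_b = len(class_a), len(class_b)
    return {
        "Signal2Noise": (mu_a - mu_b) / (sd_a + sd_b),
        "tTest": (mu_a - mu_b) / math.sqrt(sd_a**2 / n_a + sd_b**2 / n_b),
        "Ratio_of_Classes": mu_a / mu_b,
        "Diff_of_Classes": mu_a - mu_b,
        "log2_Ratio_of_Classes": math.log2(mu_a / mu_b),
    }


# Toy example: one gene measured in three samples per phenotype.
metrics = ranking_metrics([8.0, 10.0, 12.0], [4.0, 5.0, 6.0])
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Computing the metric for every gene and sorting the genes by it gives the ranked list L that the enrichment score is calculated on.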
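The enrichment score from Part 1 (walking down the ranked list, stepping up at genes in S and down at genes not in S) can also be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the GSEA implementation: the function name enrichment_score and the toy gene names are hypothetical, the up-step is weighted by |score|^p as in the weighted Kolmogorov–Smirnov-like statistic of the cited paper, and the permutation and FDR steps are left out.

```python
def enrichment_score(ranked_genes, scores, gene_set, p=1.0):
    """Running-sum enrichment score for one gene set.

    ranked_genes: gene names, already sorted by the ranking metric
    scores:       ranking metric value for each gene, in the same order
    gene_set:     set of gene names (the set S)
    p:            weight exponent; p = 0 gives the unweighted statistic
    Returns (ES, running_sum), where ES is the maximum deviation from zero.
    """
    in_set = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(in_set)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("the set must contain some, but not all, ranked genes")

    # Up-steps are proportional to |score|^p and normalised to sum to 1.
    weights = [abs(s) ** p if hit else 0.0 for s, hit in zip(scores, in_set)]
    norm = sum(weights)
    if norm == 0:
        raise ValueError("all set members have a zero score")

    running, es, running_sum = 0.0, 0.0, []
    for weight, hit in zip(weights, in_set):
        # Step up at a gene in S, step down by a constant otherwise.
        running += weight / norm if hit else -1.0 / n_miss
        running_sum.append(running)
        if abs(running) > abs(es):
            es = running
    return es, running_sum


# Toy example: both members of S sit at the top of the ranked list.
genes = ["g1", "g2", "g3", "g4", "g5", "g6"]
scores = [3.0, 2.0, 1.0, -1.0, -2.0, -3.0]
es, trace = enrichment_score(genes, scores, {"g1", "g2"})
print(round(es, 3), [round(x, 3) for x in trace])
```

A set whose members cluster at the top gives a positive ES, and one whose members cluster at the bottom gives a negative ES. The running sum always returns to zero at the end of the list, because the up-steps and the down-steps each add up to 1.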