有一些五六年前的学生们都成长为了各个生物信息学相关公司的小领导,而且他们都有了自己的公众号,知乎号,也算是一番人物。最近他们跟我反馈面试找不到或者说很难直接考核筛选到认真干活的生信工程师,挺有意思的。让我想起来了早在生信技能树论坛创立之初我为了引流,而规划的200个生信工程师面试题。值得继续分享: 背景知识超几何分布 超几何分布是统计学上的一种离散概率分布。简单来说就是,从有限个物件中抽出n个物件,在不归还的情况下,功能抽出制定物件的次数的概率,称为超几何分布 富集分析的原理 基于筛选的差异基因,或者其他自己选定的一组基因,采用超级和分布检验,判断选定的目的基因集是否偏好性地分布在GO或者KEGG的特定通路中。 富集分析实例: 假设人类的背景基因中有30000个,其中有40个基因参与p53信号通路。我们有一个基因集,其中有300个基因集在背景基因中,这300个基因中又有3个基因是p53信号通路相关的。在背景基因中,从背景基因中随机抽取一个p53信号通路的基因的概率为40/30000(P0),从目的基因集中随机抽取一个p53信号通路的基因的概率为3/300(P1),如果通过统计检验(Fisher exact test)发现P1显著大于P0,那么就说明目的基因中的p53信号通路的基因是被显著富集的,并不是随机发生的。 当然,一个目的基因集可能富集到多条通路,那么就需要进行多重假设验证计算FDR 超几何分布的问题解析 使用超几何分布检验进行富集分析的关键在于4个值的获取 得到这4个值之后,然后分别计算每个通路的pValue,将得到的pValue进行多重假设验证,在进行筛选 - 背景基因(m),以KEGG富集为例就是KEGG通路注释到的所有基因的数目
R 语言实现超几何分布检验R语言实现
#设置镜像并安装需要的包 options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/") rm(list=ls()) library(BiocManager) BiocManager::install("KEGG.db") library(org.Hs.eg.db) library(KEGG.db) tmp = toTable(org.Hs.egPATH) #导出全部的KEGG通路 write.table(tmp,'KEGG2gene.list.txt',quote = F,row.names = F)
结果文件如下 gene_id path_id 2 04610 9 00232 9 00983 9 01100 10 00232 10 00983 10 01100 15 00380 15 01100
BiocManager::install("hgu95av2.db") library("hgu95av2.db") ls('package:hgu95av2.db') universeGeneIds <- unique(mappedRkeys(hgu95av2ENTREZID)) write.table(universeGeneIds,'universeGeneIds.txt',quote = F, row.names = F,col.names = F)
set.seed('123456.789') diff_gene = sample(universeGeneIds,300) #随机取出300个基因 write.table(diff_gene,'diff_gene.txt',quote = F,row.names = F,col.names = F)
结果文件如下: 79441 1968 11070 5770 23016 2993 284001 3257 5971 10512
library(GOstats) annotationPKG='org.Hs.eg.db' hyperG.params = new("KEGGHyperGParams",geneIds=diff_gene, universeGeneIds=universeGeneIds, annotation=annotationPKG, categoryName="KEGG", pvalueCutoff=1, testDirection='over')
KEGG.hyperG.results = hyperGTest(hyperG.params) htmlReport(KEGG.hyperG.results,file = "kegg.enrichment.html", summary.args = list("htmlLinks"=TRUE))
结果文件如下 KEGGID Pvalue OddsRatio ExpCount Count Size Term 04640 0.005 3.284 3 8 83 Hematopoietic cell lineage 04940 0.007 4.746 1 5 37 Type I diabetes mellitus 05332 0.012 5.032 1 4 28 Graft-versus-host disease 05143 0.016 4.642 1 4 30 African trypanosomiasis 04210 0.017 2.849 3 7 82 Apoptosis 05200 0.020 1.898 9 16 280 Pathways in cancer
library(org.Hs.eg.db) library(KEGG.db)
tmp = toTable(org.Hs.egPATH) GeneID2kegg<<-tapply(tmp[,2], as.factor(tmp[,1]), function(x) x) kegg2GeneID<<-tapply(tmp[,1], as.factor(tmp[,2]), function(x) x)
GeneID2Path = GeneID2kegg Path2geneID = kegg2GeneID
diff_gene_has_path = intersect(diff_gene,names(GeneID2Path)) n = length(diff_gene) N = length(universeGeneIds) results = c()
library(dplyr) for (i in names(Path2geneID)){ M = length(intersect(Path2geneID[[i]],universeGeneIds)) if(M<5) next exp_count = n*M/N k=0 for (j in diff_gene_has_path){ if(i %in% GeneID2Path[[j]]) k=k+1 } Oddratio = k/exp_count if (k==0) next p=phyper(k-1,M,N-M,n,lower.tail = F) results = rbind(results,c(i,p,Oddratio,exp_count,k,M)) }
colnames(results) = c("PathwayID","Pvalue","OddsRatio","ExpCount","Count","Size") results=as.data.frame(results,stringsAsFactors=F) results$p.adjust = p.adjust(results$Pvalue, method = 'BH') results = results[order(results$Pvalue),] rownames(results) = 1:nrow(results)
结果文件如下 PathwayID Pvalue OddsRatio ExpCount Count Size p.adjust 04640 0.00824952317771826 2.75726907630522 2.90142158005127 8 83 0.6687091 04940 0.00878895975632959 3.86576576576577 1.29340480074575 5 37 0.6687091 05332 0.0154543396579935 4.08666666666667 0.978792822185971 4 28 0.6687091 05143 0.0195908646239614 3.81422222222222 1.04870659519925 4 30 0.6687091 04210 0.0241899557307567 2.4420325203252 2.86646469354463 7 82 0.6687091 04970 0.0330057644830707 2.48753623188406 2.41202516895829 6 69 0.6687091
Python实现超几何分布检验输入数据为以上R代码生成的两个数据集
import os import re from collections import OrderedDict import numpy as np import pandas as pd from scipy.stats import hypergeom from rpy2.robjects.packages import importr from rpy2.robjects.vectors import FloatVector
kegg = OrderedDict() #以KEGG通路为键,基因集为值的字典 pop = [] #所有KEGG通路的背景基因 universe_gene=[] with open('universeGeneIds.txt','r') as f: for line in f: universe_gene.append(line.strip())
with open('KEGG2gene.list.txt','r') as f: for line in f: lst = line.strip().split(' ') key = lst[1] if key not in kegg: kegg[key] = [] if lst[0] in universe_gene: kegg[key].append(lst[0]) else: if lst[0] in universe_gene: kegg[key].append(lst[0])
for k in kegg.keys(): pop = pop + [gene.strip() for gene in kegg[k] if gene not in pop] DEGs = [] with open('diff_gene.txt','r') as f: for line in f: if line.strip() in pop: DEGs.append(line.strip())
popTotal = len(pop) listTotal = len(DEGs)
print('Number of genes in backgraound list: %d' % popTotal) print('Number of DE genes involved in any pathways: %d' % listTotal keggEnrich = OrderedDict() for k,v in kegg.items(): hits = [gene for gene in v if gene in DEGs] hitCount = len(hits) popHits = len(v) if hitCount == 0: print(k) else: pVal = hypergeom.sf(hitCount-1, popTotal, popHits, listTotal) keggEnrich[k] = [hitCount, listTotal, popHits, popTotal,pVal,';'.join(hits)]
keggOutput = pd.DataFrame.from_dict(keggEnrich,orient='columns',dtype=None) keggOutput = pd.DataFrame.transpose(keggOutput) keggOutput.columns = ['Count','List Total','pop Hits','pop Total','pval','Genes'] keggOutput = keggOutput.sort_values(by='pval',axis=0)
stats = importr('stats') pVal = keggOutput['pval'] fdr = stats.p_adjust(FloatVector(pVal),method='fdr') keggOutput['FDR']=fdr keggOutput.to_excel('kegg_pathway_enrichment.xlsx',sheet_name='kegg')
结果文件如下: Count List Total pop Hits pop Total pval Genes FDR 04640 8 125 83 3802 0.005438322 930;952;2322;2811;2993;3553;3674;3815 0.506946309 04940 5 125 37 3802 0.00664396 355;356;3112;3329;3553 0.506946309 05332 4 125 28 3802 0.012366912 355;356;3112;3553 0.506946309 05143 4 125 30 3802 0.015738957 355;356;3553;7064 0.506946309 04210 7 125 82 3802 0.01727475 355;356;1439;3553;4792;5567;8837 0.506946309
三种方式计算出的结果类似 文末友情推荐
|