基础工具-HMMER用法

ABCmedic 2017-08-23

展开全文

一、简介

HMMER通常与已构建好的HMM数据集一起使用，例如Pfam以及Interpro收纳的一些数据库。同时HMMER也可以像BLAST一样使用查询序列，对序列数据库（非HMM数据库）进行检索。例如，您可以使用phmmer将一条蛋白质序列检索序列数据库（类似于BLASTP了），也可以用jackhmmer进行迭代检索。

HMMER依靠其潜在概率模型的强度，旨在尽可能敏感地检测远程同源序列。在过去，这种能力需要消耗很大的计算资源，但是在新的HMMER3中，HMMER现在基本上和BLAST一样快。

HMMER可以作为命令行工具进行本地下载和安装，现在也可以通过EBI的服务器在线访问。

本地下载地址：http:///

在线版本：http://www./Tools/hmmer/

常用Pfam在线数据库：http://pfam./

HMMER包含下面几个主要的程序:

phmmer：与Blastp类似，使用一个蛋白质序列搜索蛋白质序列库；

jackhmmer：与psiBlast类似，蛋白质序列迭代搜索蛋白质序列库；

hmmbuild：用多重比对序列构建HMM模型；

hmmsearch：使用HMM模型搜索序列库；

hmmscan：使用序列搜索HMM库；

hmmalign：使用HMM为线索，构建多重比对序列；

hmmconvert：转换HMM格式

hmmemit：从HMM模型中，得到一个模式序列；

hmmfetch：通过名字或者接受号从HMM库中取回一个HMM模型；

hmmpress：格式化HMM数据库，以便于hmmscan搜索使用；

hmmstat：显示HMM数据库的统计信息；

二、常用的两大功能

（一）使用HMM数据集搜索全基因组蛋白（核酸）序列数据库

1、hmmbuild, 训练给定多序列比对结果，构建HMM 数据集。举个例子像在基因家族分析中，用所有已知的某基因家族成员做多序列比对，然后利用下面命令构建HMM数据集，最后使用HMM数据集扫描你要鉴定的物种所有基因序列数据库即可获得获得该物种候选的该基因家族成员。

示例命令：

hmmbuild [-options]

输入文件多序列比对后的序列格式如:CLUSTALW, SELEX, GCG MSF。

输出文件一般命名为.hmm 后缀, 该结果为HMM 数据库。

2)hmmsearch, 寻找相似序列

hmmsearch [options]

（二）使用蛋白质（核酸）序列搜索已构建HMM数据库

该方法为常用的功能注释方法。

构建HMM数据库。使用多序列比对文件，同上述命令即可完成构建。同时可以从Pfam、SMART等网站下载现成额HMM。举个例子，假如我有一批蛋白质序列，想做Pfam注释，看看有什么结构域，那么我可以去Pfam下载下述文件：

ftp://ftp./pub/databases/Pfam/releases/Pfam31.0/Pfam-A.hmm.gz

使用hmmscan搜索HMM数据库，命令如下：

hmmscan -E 0.00001 --domE 0.00001 --cpu 2 --noali --acc --notextw --domtblout pfam.tab Pfam-A.hmm test.pep.fa

三、输出结果介绍

主要介绍两种格式

--domtblout

--tblout

输出结果中分为两类一类是针对序列的（full sequence），另一类是针对domain的（主要基于一条序列存在多个domain）。这两种格式涉及到的每一列信息解释如下（英文原文大家看的可能更明白！）

(1) target name: The name of the target sequence or profile.

(2) accession: The accession of the target sequence or profile, or ’-’ if none.

(3) query name: The name of the query sequence or profile.

(4) accession: The accession of the query sequence or profile, or ’-’ if none.

(5) hmmfrom: The position in the hmm at which the hit starts.

(6) hmm to: The position in the hmm at which the hit ends.

(7) alifrom: The position in the target sequence at which the hit starts.

(8) ali to: The position in the target sequence at which the hit ends.

(9) envfrom: The position in the target sequence at which the surrounding envelope starts. 结构域的起始位置。

(10) env to: The position in the target sequence at which the surrounding envelope ends. 结构域的终止位置。

(11) sq len: The length of the target sequence..

(12) strand: The strand on which the hit was found (“-” when alifrom¿ali to).

(13) E-value: The expectation value (statistical significance) of the target, as above.

(14) score (full sequence): The score (in bits) for this hit. It includes the biased-composition correction.

(15) Bias (full sequence): The biased-composition correction, as above

(16) description of target: The remainder of the line is the target’s description line, as free text.

(17) c-Evalue: The “conditional E-value”, a permissive measure of how reliable this particular domain may be. The conditional E-value is calculated on a smaller search space than the independent Evalue. The conditional E-value uses the number of targets that pass the reporting thresholds. The null hypothesis test posed by the conditional E-value is as follows. Suppose that we believe that there is already sufficient evidence (from other domains) to identify the set of reported sequences as homologs of our query; now, how many additional domains would we expect to find with at least this particular domain’s bit score, if the rest of those reported sequences were random nonhomologous sequence (i.e. outside the other domain(s) that were sufficient to identified them as homologs in the first place)?

(18) i-Evalue: The “independent E-value”, the E-value that the sequence/profile comparison would have received if this were the only domain envelope found in it, excluding any others. This is a stringent measure of how reliable this particular domain may be. The independent E-value uses the total number of targets in the target database.

附：大伙可能对输出结果的envelope的含义比较陌生，但是它确是我们序列结构域所在的位置，不要误用成align的位置了。关于envelope这里涉及到一些算法，小编贴了一段原文，大伙可以用通俗的话说出自己的理解。

Envelope定义：The envelope defines a subsequence for which their is substantial probability mass supporting a homologous domain, whether or not a single discrete alignment can be identified. The envelope may extend beyond the endpoints of the MEA（maximum expected accuracy ） alignment, and in fact often does, for weakly scoring domains.

Envelope鉴定：Now, within each region, we will attempt to identify envelopes. An envelope is a subsequence of the target sequence that appears to contain alignment probability mass for a likely domain (one local alignment to the profile).When the region contains '1 expected domain, envelope identification is already done: the region’s start and end points are converted directly to the envelope coordinates of a putative domain.

There are a few cases where the region appears to contain more than one expected domain – where more than one domain is closely spaced on the target sequence and/or the domain scores are weak and the probability masses are ill-resolved from each other. These “multidomain regions”, when they occur, are passed off to an even more ad hoc resolution algorithm called stochastic traceback clustering. In stochastic traceback clustering, we sample many alignments from the posterior alignment ensemble, cluster those alignments according to their overlap in start/end coordinates, and pick clusters that sum up to sufficiently high probability. Consensus start and end points are chosen for each cluster of sampled alignments. These start/end points define envelopes.These envelopes identified by stochastic traceback clustering are not guaranteed to be nonoverlapping.It’s possible that there are alternative “solutions” for parsing the sequence into domains, when the correct parsing is ambiguous. HMMER will report all high-likelihood solutions, not just a single nonoverlapping parse.

It’s also possible (though rare) for stochastic clustering to identify no envelopes in the region.In a tabular output (--tblout) file, the number of regions that had to be subjected to stochastic traceback clustering is given in the column labeled clu. This ought to be a small number (often it’s zero). The number of envelopes identified by stochastic traceback clustering that overlap with other envelopes is in the column labeled ov. If this number is non-zero, you need to be careful when you interpret the details of alignments in the output, because HMMER is going to be showing overlapping alternative solutions. The total number of domain envelopes identified (either by the simple method or by stochastic traceback clustering) is in the column labeled env. It ought to be almost the same as the expectation and the number of regions