[Help/Discussion] Building and analysing phylogenetic trees

谁能拯救我110 2015-04-27


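Source: http://blog./user1/5569/archives/2009/232876.shtml
1. Preparation before building a tree
1.1 Obtaining similar sequences: BLAST
BLAST (Basic Local Alignment Search Tool) is a widely used database search program (Altschul et al., 1990; 1997). The major international bioinformatics centres all provide web-based BLAST servers. The basic idea of the algorithm is to first find the most similar segments shared by the query sequence and a target sequence, then use them as seeds and extend in both directions to obtain the longest possible similar segment.
First go to one of the common sites that offer a BLAST service, such as CBI in China, NCBI in the United States, EBI in Europe or DDBJ in Japan. Their BLAST interfaces look much the same, although the programs behind them differ somewhat. Each has a large text box for pasting the sequence to be searched. Paste the sequence in FASTA format (the first line is a description line that begins with the ">" symbol, followed by the sequence name, description and so on; the ">" is required, while the name and description can take any form; the sequence starts after the line break), choose a suitable BLAST program and database, and start the search. For a DNA sequence, BLASTN against a DNA database is the usual choice.
NCBI is used as the example here: open the NCBI home page, click BLAST, click Nucleotide-nucleotide BLAST (blastn), paste the query sequence into the Search text box, click BLAST!, click Format, and the result of BLAST is displayed.
How to read a BLASTN result (meaning of the parameters):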
>gi|28171832|gb|AY155203.1| Nocardia sp. ATCC 49872 16S ribosomal RNA gene, complete sequence
Score = 2020 bits (1019), Expect = 0.0
Identities = 1382/1497 (92%), Gaps = 8/1497 (0%)
Strand = Plus / Plus
Query: 1 gacgaacgctggcggcgtgcttaacacatgcaagtcgagcggaaaggccctttcgggggt 60
         |||||||||||||||||||||||||||||||||||||||||| |||||||||   |||||
Sbjct: 1 gacgaacgctggcggcgtgcttaacacatgcaagtcgagcggtaaggcccttc--ggggt 58
Query: 61 actcgagcggcgaacgggtgagtaacacgtgggtaacctgccttcagctctgggataagc 120
          || ||||||||||||||||||||||||||||||| | ||||||    |||||||||||||
Sbjct: 59 acacgagcggcgaacgggtgagtaacacgtgggtgatctgcctcgtactctgggataagc 118
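Score: the alignment score between the submitted sequence and the database hit; the higher the score, the more similar the two sequences.
Expect: the expected number of alignments scoring at least this well by chance (the E-value). The better the match, the smaller the value. For nucleotide searches an Expect below 1e-10 generally indicates a very good match, and in most cases it is 0.
Identities: the number of identical positions between the submitted sequence and the reference sequence; in the example above, 1382 of the 1497 aligned nucleotides are identical.
Gaps: the number of alignment positions where a gap had to be inserted in one of the two sequences (insertions or deletions), here 8 of 1497. Mismatched bases are not counted as gaps.
Strand: the orientation of the two strands. Plus / Minus means the submitted sequence and the reference sequence are reverse complements of each other; Plus / Plus means both are in the forward orientation.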
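As a small illustration of how these fields relate, the sketch below pulls the numbers out of the two summary lines shown above and recomputes the percent identity. It is only a sketch: the function name `parse_blastn_summary` and the regular expressions are assumptions written for this classic text layout, not part of any BLAST tool, and other output formats would need different patterns.

```python
import re

def parse_blastn_summary(text):
    """Extract score, E-value, identities and gaps from a blastn hit summary."""
    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits", text)
    expect = re.search(r"Expect\s*=\s*([\d.eE+-]+)", text)
    ident = re.search(r"Identities\s*=\s*(\d+)/(\d+)", text)
    gaps = re.search(r"Gaps\s*=\s*(\d+)/(\d+)", text)
    if not (score and expect and ident and gaps):
        raise ValueError("text does not look like a blastn hit summary")
    identical, aligned = int(ident.group(1)), int(ident.group(2))
    return {
        "bit_score": float(score.group(1)),
        "expect": float(expect.group(1)),
        "identical": identical,
        "aligned_length": aligned,
        # Identities is a count; the percentage is identical / aligned length.
        "percent_identity": 100.0 * identical / aligned,
        "gaps": int(gaps.group(1)),
    }

# The two summary lines from the example hit above.
hit = """Score = 2020 bits (1019), Expect = 0.0
Identities = 1382/1497 (92%), Gaps = 8/1497 (0%)"""

summary = parse_blastn_summary(hit)
print(f"{summary['percent_identity']:.1f}% identity over "
      f"{summary['aligned_length']} positions, {summary['gaps']} gaps")
```

The 92% printed by BLAST is the same ratio, rounded to a whole number.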
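1.2 Sequence format: FASTA
Because the EMBL and GenBank data formats are fairly complex, the much simpler FASTA format came into use to make analysis easier. FASTA format, also called Pearson format, requires the title line of each sequence to begin with a greater-than sign ">", with the sequence itself starting on the next line. It is generally recommended to keep each line to no more than 60 or 80 characters so that programs can handle it easily. Several nucleotide or protein sequences are simply listed one after another in this format, as shown below: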
>E.coli
1 aaattgaaga gtttgatcat ggctcagatt gaacgctggc ggcaggccta acacatgcaa
61 gtcgaacggt aacaggaaga agcttgcttc tttgctgacg agtggcggac ……
>AY631071 Jiangella gansuensis YIM 002
1 gacgaacgct ggcggcgtgc ttaacacatg caagtcgagc ggaaaggccc tttcgggggt
61 actcgagcgg cgaacgggtg agtaacacgt gggtaacctg ccttcagctc tgggataagc
……
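The ">" at the start of each title line is required by the default sequence input format of Clustal X and cannot be omitted. It may be followed by a species or genus name, by the GenBank accession number (Accession No.) of the sequence, or by a label of your own. Keep names short and made up of English letters and digits, and avoid names that share the same first few characters, because Clustal X sometimes takes only the first few characters as the sequence name. The sequence follows after a line break. Save the query sequence and the homologous sequences found by the search as one FASTA-format text file (for example C:\temp\jc.txt); it can then be loaded into Clustal X or a similar program for alignment and tree building.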
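The naming advice above can be checked before loading a file. The following is a minimal sketch, not a Clustal X feature: `read_fasta` and `clashing_names` are hypothetical helpers, the 10-character cut-off is an assumed value (the text only says "the first few characters"), and the two Nocardia labels are made up for the demonstration.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (name, sequence) pairs."""
    records, name, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].strip(), []
        elif name is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            # Keep letters and gap symbols; position numbers and spaces are dropped.
            chunks.append("".join(c for c in line if c.isalpha() or c == "-"))
    if name is not None:
        records.append((name, "".join(chunks)))
    return records

def clashing_names(records, prefix_len=10):
    """Group names that become identical when cut to prefix_len characters."""
    groups = {}
    for name, _ in records:
        groups.setdefault(name[:prefix_len], []).append(name)
    return [names for names in groups.values() if len(names) > 1]

example = """>Nocardia_sp_strain1
1 gacgaacgct ggcggcgtgc
>Nocardia_sp_strain2
1 gacgaacgct ggcggcgtgc
>E.coli
1 aaattgaaga gtttgatcat
"""
records = read_fasta(example)
print(clashing_names(records))  # the two Nocardia labels share their first 10 characters
```

Renaming the clashing entries (for instance to accession numbers) before alignment avoids two sequences ending up with the same label in the tree.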
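2. Software and procedures for building phylogenetic trees
The main steps in building a phylogenetic tree are alignment, choosing a substitution model, tree construction and tree evaluation. In view of the assessment of tree-building methods above and the practical situation in our laboratory, the following mainly describes the software and procedures for building N-J (neighbor-joining) trees.
2.1 Building an N-J tree with Clustal X
(1) Start Clustal X and load the source file.
File-Load sequences- C:\temp\jc.txt.
(2) Align the sequences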
Alignment - Output format options - √ Clustal format； CLUSTALW sequence numbers: ON
Alignment - Do complete alignment
(Output Guide Tree file, C:\temp\jc.dnd；Output Alignment file, C:\temp\jc.aln；)
Align → waiting……
The waiting time depends on the length and number of the sequences and on the configuration of the computer.
(3) Trim the ends
File-Save Sequence as…
Format: ⊙ CLUSTAL
GDE output case: Lower
CLUSTALW sequence numbers: ON
Save from residue: 39 to 1504 (the range is set by the shortest sequence at each end)
Save sequence as: C:\temp\jc-a.aln
OK
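This cuts the sequences to the same start and end. Because the sequencing primers were not all the same, the aligned sequences are ragged at both ends. In general the ends should be trimmed so that the ragged ends do not inflate the differences between sequences. The trimmed file is saved in ALN format.
(4) File-Load sequences-Replace existing sequences?-Yes- C:\temp\jc-a.aln
Reload the trimmed sequences.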
(5) Trees-Output Format Options
Output Files : √ CLUSTAL format tree √ Phylip format tree √ Phylip distance matrix
Bootstrap labels on: NODE
CLOSE
Trees-Exclude positions with gaps
Trees-Bootstrap N-J Tree ：
Random number generator seed(1-1000) : 111
Number of bootstrap trials(1-1000): 1000
SAVE CLUSTAL TREE AS: C:\temp\jc-a.njb
SAVE PHYLIP TREE AS: C:\temp\jc-a.njbphb
OK → waiting……
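The waiting time depends on the length and number of the sequences and on the configuration of the computer. This step produces the tree file *.njbphb, which can be opened and viewed with TreeView.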
(6) Trees-Draw N-J Trees
SAVE CLUSTAL TREE AS: C:\temp\jc-a.nj
SAVE PHYLIP TREE AS: C:\temp\jc-a.njph
SAVE DISTANCE MATRIX AS: C:\temp\jc-a.njphdst
OK
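The report file *.nj produced in this step is quite useful: it lists the pairwise similarity between the aligned sequences and how much of the difference is due to transitions and to transversions.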
(7) TreeView
File-Open-C:\temp\jc-a.njbphb
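Tree- phylogram (other tree styles are also available: unrooted, slanted cladogram, rectangular cladogram)
Tree- Show internal edge labels (Bootstrap value) (displays the values)
Tree- Define outgroup… → ingroup >> outgroup → OK (defines the outgroup)
Tree- Root with outgroup
The tree usually needs some editing. To do this, first use Edit-Copy and paste it into PowerPoint, then copy it from there into Word and edit the picture. If it is pasted directly into Word, the tree is not displayed correctly and garbled characters appear instead.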
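The "Save from residue" range used for trimming in step (3) is the block of alignment columns that every sequence covers. The sketch below shows one way to find it. It is an illustration only, not what Clustal X does internally: `common_range` and `trim` are hypothetical helpers, and the three short sequences are made-up toy data.

```python
def common_range(aligned, gap_chars="-."):
    """Return (start, end), 1-based and inclusive, of the columns covered by every sequence."""
    starts, ends = [], []
    for seq in aligned.values():
        residues = [i for i, c in enumerate(seq) if c not in gap_chars]
        if not residues:
            raise ValueError("a sequence contains only gaps")
        starts.append(residues[0])   # first real residue of this sequence
        ends.append(residues[-1])    # last real residue of this sequence
    start, end = max(starts), min(ends)
    if start > end:
        raise ValueError("the sequences share no common columns")
    return start + 1, end + 1

def trim(aligned, start, end):
    """Cut every aligned sequence to columns start..end (1-based, inclusive)."""
    return {name: seq[start - 1:end] for name, seq in aligned.items()}

# Toy alignment with ragged ends (made-up data).
aligned = {
    "A": "--ACGTACGT--",
    "B": "ACACGTAC----",
    "C": "---CGTACGTAA",
}
start, end = common_range(aligned)
print(start, end)
print(trim(aligned, start, end))
```

Gaps inside the retained block are left alone; only the leading and trailing overhang is removed, which is what the values 39 and 1504 do in the example above.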
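2.2 Building a tree with MEGA
Although Clustal X can build phylogenetic trees, the result is rather crude and it is now seldom used for that purpose. Many researchers choose MEGA instead because it is easy to use and produces attractive trees.
(1) First align the sequences with Clustal X and trim them to produce C:\temp\jc-a.aln (as above);
(2) Open BioEdit and convert the file to FASTA format:
File-Open- C:\temp\jc-a.aln,
File-Save As- C:\temp\jc-b.fas;
(3) Open MEGA, convert the file to MEGA format and activate it:
File-Convert To MEGA Format- C:\temp\jc-b.fas → C:\temp\jc-b.meg,
close the Text Editor window (Do you want to save your changes before closing? - Yes);
Click me to activate a data file- C:\temp\jc-b.meg-OK-
(Protein-coding nucleotide sequence data?-No);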
Phylogeny-Neighbor-Joining(NJ)
Distance Options-Models-Nucleotide: Kimura 2-parameter;
√d: Transitions+Transversions;
Include Sites-⊙Pairwise Deletion
Test of Phylogeny-⊙Bootstrap; Replications 1000; Random Seed 64238
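OK; the calculation starts and the result is displayed;
(4) Image-Copy to Clipboard, then paste into a Word document for editing.
In addition, the Subtree menu provides several commands for editing the resulting tree, the left side of the MEGA window offers shortcut buttons for them, and the View menu provides several tree display modes. Only the most commonly used are listed here:
Subtree-Swap: swaps the positions of two adjacent branches;
-Flip: flips the selected branch by 180 degrees;
-Compress/Expand: collapses/expands a group of branches;
-Root: defines the outgroup;
View-Topology: shows only the topology of the tree;
-Tree/Branch Style: switches among several tree styles;
-Options: changes many other aspects of the tree.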
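To make the settings in step (3) more concrete, here is a minimal sketch of the Kimura 2-parameter distance with pairwise deletion. With P the proportion of transitions and Q the proportion of transversions among the compared sites, the distance is d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q). The function name `k2p_distance` is an assumption for illustration; this is not MEGA's code and it ignores refinements such as ambiguity codes.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def k2p_distance(seq1, seq2):
    """Kimura 2-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        # Pairwise deletion: skip a column only for this pair when either
        # sequence has a gap or a non-ACGT symbol there.
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # math.log raises ValueError if the sequences are too divergent
    # for the formula to be defined.
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)

# One transition (A/G) in ten comparable sites; the gap column is skipped.
print(round(k2p_distance("AAAAAAAAAA-", "GAAAAAAAAAC"), 4))
```

Under complete deletion, by contrast, a column with a gap in any sequence would be dropped for all pairs, which is what "Exclude positions with gaps" does in the Clustal X procedure above.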
2.3 TREECON
Open Clustal X, File-Load sequences-jc-a.aln, File-Save Sequence as… (Format-PHYLIP; Save from residue-1 to the end; Save sequence as: C:\temp\jc.phy);
Open TREECON.
(1) Distance estimation
Click Distance estimation-Start distance estimation, open the jc.phy file saved above, Sequence Type-Nucleic Acid Sequence, Sequence format-PHYLIP interleaved, Select ALL, OK;
Distance Estimation-Jukes&Cantor (or Kimura), Alignment positions-All, Bootstrap analysis-Yes, Insertions&Deletions-Not taken into account, OK;
Bootstrap samples-1000, OK; wait while the calculation runs……
Finished-OK.
(2) Infer tree topology
Click Infer tree topology-Start inferring tree topology, Method-Neighbor-joining, Bootstrap analysis-Yes, OK; wait while the calculation runs……
Finished-OK.
(3) Root unrooted trees
Click Root unrooted trees-Start rooting unrooted trees, Outgroup option-single sequence (forced), Bootstrap analysis-Yes, OK;
Select Root-X89947, OK; wait while the calculation runs……
Finished-OK.
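(4) Draw phylogenetic tree
Click Draw phylogenetic tree, File-Open-(new) tree, Show-Bootstrap values/ Distance scale.
File-Copy, paste into a Word document and edit.
The TREECON procedure looks more cumbersome than MEGA's, and it runs noticeably slower. With the same parameter settings, however, the tree it builds is almost identical to the one from MEGA; the two differ only in details, such as slightly different bootstrap values on some branches. The parameter choices offered by TREECON and MEGA also differ somewhat, but overall the differences are small.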
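One reason the two programs agree so closely is that neighbor joining itself is deterministic: the same distance matrix gives the same tree, while bootstrap values depend on random resampling. The sketch below is a bare-bones version of the neighbor-joining method selected in step (2), written for illustration only. `neighbor_joining` is a hypothetical function, not the code of TREECON, MEGA or any other program, and the five-taxon matrix is a small made-up example.

```python
from itertools import combinations

def neighbor_joining(names, matrix):
    """Build an unrooted NJ tree and return it as a Newick string.

    names  -- list of taxon labels
    matrix -- symmetric list of lists of pairwise distances, in the order of names
    """
    if len(names) < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    # Dict of dicts so that new internal nodes can be added as we go.
    d = {a: {b: float(matrix[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    newick = {name: name for name in names}  # subtree text of each active node
    nodes = list(names)
    counter = 0
    while len(nodes) > 3:
        n = len(nodes)
        # Net divergence: sum of distances from each node to all the others.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # Join the pair with the smallest Q = (n - 2) * d(a, b) - r(a) - r(b).
        a, b = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        len_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        counter += 1
        u = f"#internal{counter}"  # hypothetical label for the new internal node
        d[u] = {u: 0.0}
        for k in nodes:
            if k not in (a, b):
                d[u][k] = d[k][u] = 0.5 * (d[a][k] + d[b][k] - d[a][b])
        newick[u] = f"({newick[a]}:{len_a:g},{newick[b]}:{len_b:g})"
        nodes = [k for k in nodes if k not in (a, b)] + [u]
    # Three nodes remain: attach them to one central node.
    x, y, z = nodes
    len_x = 0.5 * (d[x][y] + d[x][z] - d[y][z])
    len_y = 0.5 * (d[x][y] + d[y][z] - d[x][z])
    len_z = 0.5 * (d[x][z] + d[y][z] - d[x][y])
    return f"({newick[x]}:{len_x:g},{newick[y]}:{len_y:g},{newick[z]}:{len_z:g});"

taxa = ["a", "b", "c", "d", "e"]
distances = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(taxa, distances))
```

The resulting Newick string can be saved to a text file and opened in a tree viewer such as TreeView. Where two pairs tie for the smallest Q, this sketch simply takes the first one it meets; real programs may break ties differently.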
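2.4 PHYLIP
PHYLIP is distributed as a compressed archive containing many programs; double-click it after downloading and it extracts itself. Once it is extracted you will find that PHYLIP is extremely powerful. Its programs fall into six groups: i, programs for analysing DNA and protein sequence data; ii, programs that analyse distance data after the sequence data have been converted to distances; iii, programs for analysing gene frequencies and continuous characters; iv, programs that analyse the data with each base/amino acid treated independently as a character with only the states 0 and 1; v, programs that analyse the data by the Dollo parsimony algorithm; vi, programs for drawing and editing trees. Here we mainly describe the programs for analysing DNA sequences and building phylogenetic trees.
(1) Create the PHY-format file
First open the trimmed sequence file C:\temp\jc-a.aln in Clustal X or a similar program and save it as C:\temp\jc.phy (use File-Save Sequences As and choose PHYLIP under Format). The file can be opened with BioEdit or Notepad.
(2) Open SEQBOOT from the PHYLIP package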
seqboot.exe: can't find input file "infile"
Please enter a new file name> C:\temp\jc.phy
Enter the path of the *.PHY file just created; the following is displayed:
Bootstrapping algorithm, version 3.6a3
Settings for this run:
D Sequence, Morph, Rest., Gene Freqs? Molecular sequences
J Bootstrap, Jackknife, Permute, Rewrite? Bootstrap
B Block size for block-bootstrapping? 1
R How many replicates? 100
W Read weights of characters? No
C Read categories of sites? No
F Write out data sets or just weights? Data sets
I Input sequences interleaved? Yes
0 Terminal type none
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Y to accept these or type the letter for one to change
R
Number of replicates?
1000
0
Settings for this run:
D Sequence, Morph, Rest., Gene Freqs? Molecular sequences
J Bootstrap, Jackknife, Permute, Rewrite? Bootstrap
B Block size for block-bootstrapping? 1
R How many replicates? 1000
W Read weights of characters? No
C Read categories of sites? No
F Write out data sets or just weights? Data sets
I Input sequences interleaved? Yes
0 Terminal type IBM PC
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Y to accept these or type the letter for one to change
Y
Random number seed (must be odd)?
5(any odd number)
completed replicate number 100
completed replicate number 200
completed replicate number 300
completed replicate number 400
completed replicate number 500
completed replicate number 600
completed replicate number 700
completed replicate number 800
completed replicate number 900
completed replicate number 1000
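The D, J, R, I, 0, 1 and 2 above are the selectable options; type the character and press Enter to change the corresponding setting. Option D does not need to be changed. Option J selects the resampling method; the menu lists Bootstrap, Jackknife, Permute and Rewrite. Option R sets the number of replicates. A replicate is one multiple-sequence data set generated by the bootstrap method, and the number of replicates can be chosen according to how many sequences the data set contains. When the settings are as required, type Y and press Enter. This produces a file named outfile: C:\Program Files\Phylip\exe\outfile.
Rename outfile → infile.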
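What a bootstrap replicate is can be shown in a few lines: alignment columns are drawn at random, with replacement, until a new alignment of the original length has been built. The sketch below is an illustration only, not SEQBOOT's code; `bootstrap_replicates` is a hypothetical helper, the alignment is made-up toy data, and Python's own random generator is used, so the seed here has nothing to do with SEQBOOT's odd-number seed.

```python
import random

def bootstrap_replicates(aligned, n_replicates, seed):
    """Yield n_replicates resampled copies of an alignment (dict name -> sequence)."""
    rng = random.Random(seed)
    names = list(aligned)
    length = len(aligned[names[0]])
    if any(len(aligned[name]) != length for name in names):
        raise ValueError("all sequences must have the same aligned length")
    for _ in range(n_replicates):
        # Draw column indices with replacement; the same columns are used for
        # every sequence so that each column stays intact.
        columns = [rng.randrange(length) for _ in range(length)]
        yield {name: "".join(aligned[name][c] for c in columns) for name in names}

# Toy alignment (made-up data).
aligned = {
    "s1": "ACGTACGTAC",
    "s2": "ACGTTCGTAC",
    "s3": "ACCTACGAAC",
}
for replicate in bootstrap_replicates(aligned, 3, seed=5):
    print(replicate)
```

Each replicate is then analysed like the original data, which is why DNADIST and NEIGHBOR below are told to expect 1000 data sets.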
(3) Open dnadist.exe
Nucleic acid sequence Distance Matrix program, version 3.6a3
Settings for this run:
D Distance ? F84
G Gamma distributed rates across sites? No
T Transition/transversion ratio? 2.0
C One category of substitution rates? Yes
W Use weights for sites? No
F Use empirical base frequencies? Yes
L Form of distance matrix? Square
M Analyze multiple data sets? No
I Input sequences interleaved? Yes
0 Terminal type ?
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Y to accept these or type the letter for one to change
d
D Distance ? Kimura 2-parameter
m
Multiple data sets or multiple weights? (type D or W)
d
How many data sets?
1000
0
Settings for this run:
D Distance ? Kimura 2-parameter
G Gamma distributed rates across sites? No
T Transition/transversion ratio? 2.0
C One category of substitution rates? Yes
W Use weights for sites? No
F Use empirical base frequencies? Yes
L Form of distance matrix? Square
M Analyze multiple data sets? Yes, 1000 data sets
I Input sequences interleaved? Yes
0 Terminal type ? IBM PC
1 Print out the data at start of run No
2 Print indications of progress of run Yes
Y to accept these or type the letter for one to change
Y
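Option D offers four distance models: Kimura 2-parameter, Jin/Nei, Maximum-likelihood and Jukes-Cantor. For option T, a value between 1.5 and 3.0 is generally entered. For option M, enter 1000 as the number of data sets. The run generates the file C:\Program Files\Phylip\exe\outfile.
Rename outfile → infile.
(4) Open neighbor.exe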
Neighbor-Joining/UPGMA method version 3.6a3
Settings for this run:
N Neighbor-Joining or UPGMA tree? Neighbor-Joining
O Outgroup root? No, Use as outgroup species 1
L Lower-triangular data matrix? No
R Upper-triangular data matrix? No
S Subreplication? No
J Randomize input order of species? No, Use input order
M Analyze multiple data sets? No
0 Terminal type ?
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Y to accept these or type the letter for one to change
m
How many data sets?
1000
Random number seed (must be odd)?
5
Settings for this run:
N Neighbor-Joining or UPGMA tree? Neighbor-Joining
O Outgroup root? No, Use as outgroup species 1
L Lower-triangular data matrix? No
R Upper-triangular data matrix? No
S Subreplication? No
J Randomize input order of species? Yes
M Analyze multiple data sets? Yes, 1000 sets
0 Terminal type ? IBM PC
1 Print out the data at start of run No
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Y to accept these or type the letter for one to change
Y
This generates the files outtree and outfile in C:\Program Files\Phylip\exe\.
Rename outtree → intree and outfile → infile.
(5) Open consense.exe
Consensus tree program, version 3.6a3
Settings for this run:
C Consensus type ? Majority rule (extended)
O Outgroup root? No, use as outgroup species 1
R Trees to be treated as Rooted? No
T Terminal type ?
1 Print out the sets of the species Yes
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Are these settings correct?
R
T
Settings for this run:
C Consensus type ? Majority rule (extended)
R Trees to be treated as Rooted? Yes
T Terminal type ? IBM PC
1 Print out the sets of the species Yes
2 Print indications of progress of run Yes
3 Print out tree Yes
4 Write out trees onto tree file? Yes
Y
This generates the file C:\Program Files\Phylip\exe\outtree.
Rename outtree → jc.tre.
(6) Open TreeView
Open C:\Program Files\Phylip\exe\jc.tre. The remaining steps are the same as those described in detail for TreeView above.
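The bootstrap values shown on the final tree come from the consensus in step (5): for each group of taxa, they give the percentage of replicate trees in which that group appears. The sketch below counts this for rooted trees written as nested Python tuples. It is a simplified illustration under those assumptions, not the algorithm CONSENSE implements (which reads Newick files and applies extended majority rule); `clades` and `clade_support` are hypothetical names and the four trees are made up.

```python
from collections import Counter

def clades(tree):
    """Return (leaves of this subtree, list of multi-taxon clades found in it)."""
    if isinstance(tree, str):          # a single taxon
        return frozenset([tree]), []
    leaves, found = frozenset(), []
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found.extend(child_clades)
    found.append(leaves)
    return leaves, found

def clade_support(trees):
    """Percentage of trees in which each clade occurs (whole-tree clade excluded)."""
    counts = Counter()
    for tree in trees:
        all_leaves, found = clades(tree)
        counts.update(c for c in set(found) if c != all_leaves)
    return {clade: 100.0 * k / len(trees) for clade, k in counts.items()}

# Four made-up replicate trees on taxa A-D.
replicate_trees = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    ((("A", "B"), "C"), "D"),
]
support = clade_support(replicate_trees)
for clade, percent in sorted(support.items(), key=lambda item: -item[1]):
    print(sorted(clade), f"{percent:.0f}%")
```

A strict majority-rule consensus keeps only the groups found in more than half of the replicates, here the pair A and B.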