Bacterial community structure analysis: alpha diversity (Shannon index, Simpson, Chao1) and rarefaction, quoted from 陈冠舟's blog

Irene_2017 2018-03-27


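Biodiversity is measured at three main spatial scales: α diversity, β diversity, and γ diversity.

α diversity mainly concerns the number of species within a local, homogeneous habitat, so it is also called within-habitat diversity.

β diversity is the dissimilarity in species composition between communities in different habitats along an environmental gradient, or the rate of species turnover along that gradient; it is also called between-habitat diversity. The main ecological factors controlling β diversity are soil, topography, and disturbance.

The fewer species shared between different communities, or between different points along an environmental gradient, the greater the β diversity. Measuring β diversity accurately matters for three reasons: (1) it indicates how strongly habitats are partitioned among species; (2) measured β diversity values can be used to compare the habitat diversity of different sites; (3) together with α diversity, β diversity makes up the overall diversity, or biological heterogeneity, of a given area.

γ diversity describes diversity at the regional or continental scale, that is, the number of species at that scale; it is also called regional diversity. The ecological processes controlling γ diversity are mainly water and heat dynamics, climate, and the history of speciation and evolution. Its main indicator is species number (S). Along elevation gradients, γ diversity shows two distribution patterns: a skewed unimodal pattern and a significant negative correlation.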

alpha_rarefaction.py (QIIME)

via 铁汉1990

This script runs the following steps: generate rarefied OTU tables; compute alpha diversity metrics for each rarefied OTU table; collate alpha diversity results; and generate alpha rarefaction plots.

alpha_rarefaction.py

-i, input biom file

-m, mapping file

-o, output directory

-p, parameter file, specifying which metrics to compute

-n, --num_steps

Number of steps (or rarefied OTU table sizes) to make between min and max counts [default: 10]

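-f, force overwrite of an existing output directory with the same name

-w, print the commands that would be called without running them (useful for debugging)

-a, run in parallel

-t, phylogenetic tree file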

--min_rare_depth

The lower limit of rarefaction depths [default: 10]

-e, --max_rare_depth

The upper limit of rarefaction depths [default: median sequence/sample count]

-O, --jobs_to_start

Number of jobs to start. NOTE: you must also pass -a to run in parallel, this defines the number of jobs to be started if and only if -a is passed [default: 2]

--retain_intermediate_files

Retain intermediate files: rarefied OTU tables (rarefaction) and alpha diversity results (alpha_div). By default these will be erased [default: False]

Example:

(1) First, write the diversity metrics you want to compute to a text file:

echo "alpha_diversity:metrics shannon,PD_whole_tree,chao1,observed_species,goods_coverage,simpson" > alpha_params.txt

(2) Then run the script (this may take several hours):

alpha_rarefaction.py -i otu_table/otu_table.biom -m map.txt -o div_alpha/ -p alpha_params.txt -t rep_phylo.tre

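# Input files: otu_table.biom, rep_phylo.tre

# Output is written to div_alpha/

Open div_alpha/alpha_rarefaction_plots/rarefaction_plots.html in a web browser and choose the plots you want to display.

The log file shows the commands that were called: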

python /usr/lib/qiime/bin//multiple_rarefactions.py -i otu_table/otu_table.biom -m 10 -x 16544 -s 1653 -o div_alpha//rarefaction/

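Sequences are subsampled at random: here the minimum depth is 10 sequences, the maximum is 16544 sequences, each step adds 1653 sequences, and the subsampling at each depth is repeated 10 times.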
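To make the arithmetic concrete, here is a minimal Python sketch of such a rarefaction schedule. It assumes the depths run from the minimum to the maximum (inclusive) in fixed steps. `rarefaction_depths` and `rarefy` are hypothetical helper names, not QIIME functions, and the toy sample is made up for illustration.

```python
import random

def rarefaction_depths(min_depth, max_depth, step):
    """Depths from min_depth up to max_depth (inclusive) in fixed steps."""
    return list(range(min_depth, max_depth + 1, step))

def rarefy(counts, depth, rng):
    """Draw `depth` sequences without replacement from one sample.

    `counts` maps an OTU id to its number of sequences (toy data).
    """
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of sequences in the sample")
    rarefied = {}
    for otu in rng.sample(pool, depth):
        rarefied[otu] = rarefied.get(otu, 0) + 1
    return rarefied

# Values taken from the logged command: -m 10 -x 16544 -s 1653
depths = rarefaction_depths(10, 16544, 1653)
print(len(depths), depths[:3], depths[-1])

# Hypothetical sample, subsampled once at the smallest depth
sample = {"OTU_1": 60, "OTU_2": 30, "OTU_3": 10}
print(rarefy(sample, 10, random.Random(0)))
```

Under that assumption the schedule has 11 depths (10, 1663, ..., 16540), so 10 repetitions per depth would give 110 rarefied tables.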

# Alpha diversity on rarefied OTU tables command

python /usr/lib/qiime/bin//alpha_diversity.py -i div_alpha//rarefaction/ -o div_alpha//alpha_div/ --metrics shannon,PD_whole_tree,chao1,observed_species,goods_coverage,simpson -t rep_phylo.tre

sam@sam-Precision-WorkStation-T7500[mtt3] alpha_diversity.py -s

Known metrics are: ACE, berger_parker_d, brillouin_d, chao1, chao1_confidence, dominance, doubles, equitability, esty_ci, fisher_alpha, gini_index, goods_coverage, heip_e, kempton_taylor_q, margalef, mcintosh_d, mcintosh_e, menhinick, michaelis_menten_fit, observed_species, osd, simpson_reciprocal, robbins, shannon, simpson, simpson_e, singles, strong, PD_whole_tree

This lists all the alpha diversity metrics that are available.

# Collate alpha command

python /usr/lib/qiime/bin//collate_alpha.py -i div_alpha//alpha_div/ -o div_alpha//alpha_div_collated/

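# The previous step produces a folder of result files, each holding values for many alpha diversity metrics. This step extracts the values for each metric from all of those files and writes them to a file named after that metric in a new folder.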

# Rarefaction plot: All metrics command

python /usr/lib/qiime/bin//make_rarefaction_plots.py -i div_alpha//alpha_div_collated/ -m map.txt -o div_alpha//alpha_rarefaction_plots/

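This draws the plots. Open div_alpha/alpha_rarefaction_plots/rarefaction_plots.html in a web browser to explore the results.

The metrics mentioned above:

shannon: community diversity index

The Shannon-Wiener index is: H = −Σ Pi ln(Pi)

Pi is the proportion of individuals in the sample that belong to species i: if the sample has N individuals in total and ni of them belong to species i, then Pi = ni/N. Other logarithm bases, such as log₂, only rescale H.

The more evenly individuals are spread across species, the larger H is. H is largest when every individual belongs to a different species and smallest when all individuals belong to the same species.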

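dominance: the probability that two sequences drawn at random, without replacement, belong to the same OTU: Σ ni(ni−1) / (N(N−1))

simpson: community diversity index

Simpson's diversity index = the probability that two randomly sampled individuals belong to different species

= 1 − the probability that two randomly sampled individuals belong to the same species

The more even the community, the larger the value.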
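The two definitions above can be sketched in a few lines of Python. This is only an illustration: it uses the finite-sample formula given in the text (some implementations use the large-sample form Σ Pi² instead), and the function names and toy counts are assumptions.

```python
def dominance(counts):
    """Probability that two sequences drawn without replacement share an OTU."""
    total = sum(counts)
    if total < 2:
        raise ValueError("need at least two sequences")
    return sum(n * (n - 1) for n in counts) / (total * (total - 1))

def simpson(counts):
    """Probability that the two sequences belong to different OTUs."""
    return 1.0 - dominance(counts)

# Hypothetical OTU counts: same richness, different evenness
even = [25, 25, 25, 25]
skewed = [97, 1, 1, 1]
print(dominance(even), simpson(even))
print(dominance(skewed), simpson(skewed))
```

Both toy communities have four OTUs, yet the even one scores a much higher Simpson value, which matches the statement that the index grows with evenness.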

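PD_whole_tree:

Phylogenetic alpha diversity (phylogenetic diversity, Faith 1992): explores how much evolutionary history is preserved; applied to populations, communities, biogeography, and conservation biology.

Phylogenetic beta diversity (phylobetadiversity, Webb 2002): explores the phylogenetic distance between communities or regions and its causes.

Phylogenetic signal and phylogenetic structure: explores the mechanisms by which species coexist in communities and regions.

Phylogenetic diversity (PD): the sum of the shortest evolutionary branch lengths connecting all species at a site, as a proportion of the total branch length over all nodes (Faith, 1992).

Community phylogenetic distance: the mean of the phylogenetic branch lengths between all pairs of species drawn from community I and community II (Webb, 2002).

PD_whole_tree: Faith's phylogenetic diversity, the sum of the branch lengths of the tree connecting all OTU representatives present in the sample.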

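chao1: species richness index; estimates the number of OTUs in the community.

S_chao1 = S_obs + n1(n1−1) / (2(n2+1)), where S_chao1 is the estimated number of OTUs, S_obs is the number of OTUs observed, n1 is the number of OTUs with exactly one sequence, and n2 is the number of OTUs with exactly two sequences.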
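As a quick check of the formula, the following Python sketch computes the Chao1 estimate from a list of per-OTU sequence counts. The function name and the toy counts are illustrative assumptions; this is the form of the estimator written above, not QIIME's own code.

```python
def chao1(counts):
    """Chao1 richness estimate: S_obs + n1(n1-1) / (2(n2+1))."""
    observed = sum(1 for n in counts if n > 0)   # S_obs
    singles = sum(1 for n in counts if n == 1)   # n1
    doubles = sum(1 for n in counts if n == 2)   # n2
    return observed + singles * (singles - 1) / (2 * (doubles + 1))

# Hypothetical sample: 7 OTUs, 3 seen once, 2 seen twice
counts = [1, 1, 1, 2, 2, 5, 10]
print(chao1(counts))  # 7 + 3*2 / (2*3) = 8.0
```

With no OTUs seen exactly once, the correction term vanishes and the estimate equals the observed OTU count.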

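observed_species:

The number of OTUs observed.

goods_coverage: sequencing depth (coverage) index

Coverage: C = 1 − n1/N, where n1 is the number of OTUs containing exactly one sequence and N is the total number of sequences in the subsample.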
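Both of these metrics reduce to simple counting, as this minimal Python sketch shows. The function names mirror the metric names for readability but are hypothetical stand-ins, and the counts are made up.

```python
def observed_species(counts):
    """Number of OTUs with at least one sequence."""
    return sum(1 for n in counts if n > 0)

def goods_coverage(counts):
    """Good's coverage: C = 1 - n1/N."""
    total = sum(counts)                          # N
    if total == 0:
        raise ValueError("sample has no sequences")
    singles = sum(1 for n in counts if n == 1)   # n1
    return 1.0 - singles / total

# Hypothetical sample: 22 sequences in 7 OTUs, 3 of them singletons
counts = [1, 1, 1, 2, 2, 5, 10]
print(observed_species(counts))  # 7
print(goods_coverage(counts))    # 1 - 3/22
```

A coverage close to 1 means few OTUs were seen only once, which suggests that sequencing more deeply would turn up few new OTUs.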

References:

http:///scripts/alpha_rarefaction.html

multiple_rarefactions.py documentation http:///scripts/multiple_rarefactions.html

alpha_diversity.py documentation http:///scripts/alpha_diversity.html

collate_alpha.py documentation http:///scripts/collate_alpha.html

make_rarefaction_plots.py documentation http:///scripts/make_rarefaction_plots.html

http://blog.sina.com.cn/s/blog_670445240102uw6s.html

——————

Diversity index

A diversity index is a quantitative measure that reflects how many different types (such as species) there are in a dataset, and simultaneously takes into account how evenly the basic entities (such as individuals) are distributed among those types. The value of a diversity index increases both when the number of types increases and when evenness increases. For a given number of types, the value of a diversity index is maximized when all types are equally abundant.

When diversity indices are used in ecology, the types of interest are usually species, but they can also be other categories, such as genera, families, functional types or haplotypes. The entities of interest are usually individual plants or animals, and the measure of abundance can be, for example, number of individuals, biomass or coverage. In demography, the entities of interest can be people, and the types of interest various demographic groups. In information science, the entities can be characters and the types the different letters of the alphabet. The most commonly used diversity indices are simple transformations of the effective number of types (also known as 'true diversity'), but each diversity index can also be interpreted in its own right as a measure corresponding to some real phenomenon (but a different one for each diversity index).

Shannon index

The Shannon index has been a popular diversity index in the ecological literature, where it is also known as Shannon's diversity index, the Shannon–Wiener index, the Shannon–Weaver index and the Shannon entropy. The measure was originally proposed by Claude Shannon to quantify the entropy (uncertainty or information content) in strings of text. The idea is that the more different letters there are, and the more equal their proportional abundances in the string of interest, the more difficult it is to correctly predict which letter will be the next one in the string. The Shannon entropy quantifies the uncertainty (entropy or degree of surprise) associated with this prediction.

Simpson index

The Simpson index was introduced in 1949 by Edward H. Simpson to measure the degree of concentration when individuals are classified into types. The same index was rediscovered by Orris C. Herfindahl in 1950. The square root of the index had already been introduced in 1945 by the economist Albert O. Hirschman. As a result, the same measure is usually known as the Simpson index in ecology, and as the Herfindahl index or the Herfindahl–Hirschman index (HHI) in economics.

The measure equals the probability that two entities taken at random from the dataset of interest represent the same type.

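To show microbial diversity more intuitively, the Shannon-Wiener index and Simpson's diversity index are also used.

First, note that a diversity index is a composite measure of richness and evenness. When using one, bear in mind that a community with low richness and high evenness can yield the same value as a community with high richness and low evenness.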

Shannon-Wiener Index

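The diversity indices of Fisher and Preston capture only one aspect, the number of species. The Shannon-Wiener index and the Simpson index also measure the heterogeneity of a community. The Shannon-Wiener index borrows from information theory, whose main object of measurement is the amount of order or disorder in a system. In communication engineering, one wants to predict which letter comes next in a message and how uncertain that prediction is. For a message such as b b b b b b b, in which every letter is the same, predicting the next letter involves no uncertainty, so the uncertainty content of the message is zero. For a, b, c, d, e, f, g, in which every letter differs, the uncertainty is large. Community diversity measurement borrows this measure of uncertainty: the task becomes predicting which species the next collected individual belongs to, and the more diverse the community, the greater the uncertainty.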

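The Shannon-Wiener index is: H = −Σ Pi log₂(Pi)

where H = information content of the sample (bits/individual) = diversity index of the community, S = number of species, and Pi = the proportion of individuals in the sample that belong to species i: if the sample has N individuals in total and ni of them belong to species i, then Pi = ni/N.
    A simple hypothetical example illustrates what the Shannon-Wiener index means. Take three communities A, B, and C, each made up of two species, with the following numbers of individuals:
    Community    Species 1     Species 2
    A            100 (1.0)     0 (0)
    B            50 (0.5)      50 (0.5)
    C            99 (0.99)     1 (0.01)
    The numbers in parentheses are Pi. Because all individuals in community A belong to species 1, there is no uncertainty, and in theory H should be zero. Its Shannon-Wiener index is:
    H = −[(1.0 × log₂ 1.0) + 0] = 0
    In community B each of the two species has 50 individuals, so the distribution is even. Its Shannon index is:
    H = −[0.50 (log₂ 0.50) + 0.50 (log₂ 0.50)] = 1
    In community C the two species have 99 individuals and 1 individual, so:
    H = −[0.99 (log₂ 0.99) + 0.01 (log₂ 0.01)] = 0.081
    Clearly, the value of H agrees with intuition: community B is more diverse than community C, and the diversity of community A is zero.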
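The three-community example can be reproduced with a short Python sketch. The function name is an illustrative assumption; species with zero individuals are skipped, which corresponds to treating the 0 × log₂ 0 term as 0, as the calculation for community A does.

```python
import math

def shannon(counts):
    """Shannon-Wiener index in bits: H = -sum(Pi * log2(Pi))."""
    total = sum(counts)
    h = 0.0
    for n in counts:
        if n > 0:                 # a species with no individuals adds nothing
            p = n / total
            h -= p * math.log2(p)
    return h

# Individuals per species for communities A, B, and C from the example
communities = {"A": [100, 0], "B": [50, 50], "C": [99, 1]}
for name, counts in communities.items():
    print(name, round(shannon(counts), 3))
```

Rounded to three decimals this gives 0, 1, and 0.081 for A, B, and C, matching the hand calculation.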

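The Shannon-Wiener index has two components: (1) the number of species and (2) the evenness (equitability) with which individuals are distributed among species. The more evenly individuals are spread across species, the larger H is. H is largest when every individual belongs to a different species and smallest when all individuals belong to the same species. How, then, is evenness measured? Estimate the theoretical maximum diversity of the community (Hmax), then take the ratio of the observed diversity to Hmax. The steps are as follows:

Hmax = −S × (1/S) log₂(1/S) = log₂ S, where Hmax = species diversity under maximum evenness and S = number of species in the community.
With S species under maximum evenness, each species accounts for a proportion 1/S of the individuals, so Pi = 1/S. For example, when the community has only two species, Hmax = log₂ 2 = 1.
This agrees with the earlier calculation. We can therefore define the evenness index as E = H / Hmax, where E = evenness index, H = observed diversity, and Hmax = maximum diversity = log₂ S.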
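A minimal Python sketch of the evenness index follows, under one assumption that the text does not spell out: S is taken as the number of species that actually occur in the sample. The function names are illustrative.

```python
import math

def shannon(counts):
    """Shannon-Wiener index in bits, skipping species with no individuals."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def evenness(counts):
    """E = H / Hmax, with Hmax = log2(S)."""
    species = sum(1 for n in counts if n > 0)    # S, assumed to count present species
    if species < 2:
        raise ValueError("evenness is undefined for fewer than two species")
    return shannon(counts) / math.log2(species)

print(evenness([50, 50]))            # perfectly even two-species community
print(round(evenness([99, 1]), 3))   # strongly uneven two-species community
```

For two species Hmax is 1, so E equals H: the even community scores 1 and the 99:1 community scores about 0.081.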

Simpson's diversity Index

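In 1949 Simpson posed this question: in a community of infinite size, what is the probability that two specimens drawn at random belong to the same species? In the forests of northern Canada, for example, two randomly sampled trees are very likely to be the same species. In a tropical rainforest, by contrast, that probability is low. From this idea he derived a diversity index, expressed as:
    Simpson's diversity index = the probability that two randomly sampled individuals belong to different species
    = 1 − the probability that two randomly sampled individuals belong to the same species
    Let Pi be the proportion of the community's individuals that belong to species i. The joint probability of randomly drawing two individuals of species i is then Pi × Pi = Pi². Summing these probabilities over all species in the community gives the Simpson index D, that is: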