Gene Family Clustering with OrthoFinder

微笑如酒 2019-02-15

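One sentence from an expert can send a beginner running around for half a year. I'm no expert, but I can shorten the half year you would otherwise spend on detours~

Like the song goes, if you don't know where to go next, why not stay here and learn some bioinformatics~

Here you'll find the learning journey of 豆豆 and 花花, from beginner to advanced. On the bioinformatics road, we're in it together!

Written by 豆豆 on 2019-02-11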

What is it for?

It is used for evolutionary analysis, gene family analysis, and comparative genomics.

OrthoFinder is simple to use and all you need to run it is a set of protein sequence files (one per species) in FASTA format.

The latest release at the time of writing is version 2: Emms, D.M. and Kelly, S. (2018) OrthoFinder2: fast and accurate phylogenomic orthology analysis from gene sequences. bioRxiv

OrthoFinder

Orthologs: pairs of genes that descended from a single gene in the last common ancestor (LCA) of two species
Orthogroup (an extension of the concept of orthology): the group of genes descended from a single gene in the LCA of a group of species
[All the genes in an orthogroup started out with the same sequence and function]

[Figure: Orthologues, Orthogroups & Paralogues]

Usage

Installation

https://github.com/davidemms/OrthoFinder

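A Python 2 environment is required (for the version used in this post). Create a dedicated conda environment and activate it:

# "conda install -n" only works on an existing environment, so create it first
conda create -n orthofinder -c bioconda python=2 orthofinder
source activate orthofinder

OrthoFinder depends on a sequence search tool (diamond, mmseqs2, or blast), together with mcl and fastme.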
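Before a long run it can help to confirm that these programs are actually on the PATH. The following is only a minimal sketch: `check_tools` is a hypothetical helper written for this post, not something OrthoFinder provides, and the executable names (`mmseqs` for mmseqs2, `blastp` for blast) are assumed to be the usual ones from a conda install.

```shell
# check_tools TOOL...: report which of the named programs are on PATH.
# Hypothetical helper for illustration; not part of OrthoFinder.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Return non-zero if at least one tool was not found
    return "$missing"
}

# Only one of diamond / mmseqs / blastp is needed for the search step,
# so a "missing" line for the other two is not necessarily a problem.
check_tools diamond mmseqs blastp mcl fastme || echo "some tools are missing"
```

If everything needed shows up as "found", the environment is probably ready; OrthoFinder also performs its own check at the start of a run.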

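Parameter settings

# -f: directory holding the protein FASTA files (one per species)
# -S: sequence search program: diamond, blast, mmseqs, blast_gz
# -M: gene tree inference method: dendroblast, msa (recommended)
# -T: tree inference program (requires -M msa): iqtree, fasttree, raxml (recommended)
# -t: number of threads
orthofinder -f data -S diamond -M msa -T fasttree -t 5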
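Since OrthoFinder expects one protein FASTA file per species in the input directory, a quick look at how many sequences each file contains can catch an empty or misplaced file early. This is just a sketch: `count_proteins` is a hypothetical helper, and the list of file extensions it scans (`.fa`, `.faa`, `.fasta`) is an assumption that should be adjusted to match the actual files.

```shell
# count_proteins DIR: print "<file><TAB><number of sequences>" for each
# FASTA file in DIR. Hypothetical helper; the extensions are an assumption.
count_proteins() {
    local dir="$1" f
    for f in "$dir"/*.fa "$dir"/*.faa "$dir"/*.fasta; do
        # Skip patterns that matched nothing
        [ -e "$f" ] || continue
        # Each FASTA record starts with a ">" header line
        printf '%s\t%s\n' "$(basename "$f")" "$(grep -c '^>' "$f")"
    done
}

# "data" is the input directory passed to orthofinder -f
if [ -d data ]; then
    count_proteins data
fi
```

Each species should appear exactly once in the output, with a plausible number of proteins.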

You can test with the example dataset bundled with the software: https://github.com/davidemms/OrthoFinder/tree/master/orthofinder/ExampleDataset

Run process

Because this is a small test dataset, every stage of the run can be followed as it happens:

1. Checking required programs are installed
2. Dividing up work for BLAST for parallel processing
3. Running diamond all-versus-all
4. Running OrthoFinder algorithm
5. Writing orthogroups to file
6. Analysing Orthogroups
7. Best outgroup(s) for species tree
8. Multiple potential species tree roots were identified, only one will be analyed.
9. Reconciling gene trees and species tree
10. Writing results files

The run produces the following files, stored in the results directory:

Orthogroups.csv
Orthogroups.GeneCount.csv
Orthogroups_SpeciesOverlaps.csv
Orthogroups.txt
Orthogroups_UnassignedGenes.csv
Orthologues_Feb11
SingleCopyOrthogroups.txt
Statistics_Overall.csv
Statistics_PerSpecies.csv
WorkingDirectory

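In Orthogroups.GeneCount.csv, each row is one gene family (orthogroup) and each column gives the number of genes that one species contributes to that family. For example, gene family OG0000000 has no genes in species 1, 1 gene in species 2, and 8 genes in species 3.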
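As an illustration of reading this table, the sketch below lists the orthogroups in which every species has exactly one gene, which should correspond to what SingleCopyOrthogroups.txt reports. It assumes the file is tab-separated despite its .csv extension, that the first line is a header, and that a trailing column named `Total` (if present) should be ignored; `single_copy` is a hypothetical helper name.

```shell
# single_copy FILE: print IDs of orthogroups with exactly one gene in
# every species column. Hypothetical helper; assumes a tab-separated
# table with a header line and an optional "Total" column.
single_copy() {
    awk -F'\t' '
        NR == 1 {
            # Remember where the Total column is so it can be skipped
            for (i = 2; i <= NF; i++) if ($i == "Total") total = i
            next
        }
        {
            ok = 1
            for (i = 2; i <= NF; i++)
                if (i != total && $i != 1) ok = 0
            if (ok) print $1
        }
    ' "$1"
}

if [ -f Orthogroups.GeneCount.csv ]; then
    single_copy Orthogroups.GeneCount.csv
fi
```

Comparing this list with SingleCopyOrthogroups.txt is a simple way to check that the column layout is being interpreted correctly.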

[Figure: Orthogroups]

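Next, for each species we select the gene families that contain more than 0 genes. Start with species 1.

Drop the first (header) line, look at the species 1 column ($2), keep the rows where it is greater than 0, and print the gene family ID, which is the first column:

sed '1d' Orthogroups.GeneCount.csv | awk '$2 > 0 {print $1}' > 1.txt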
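The same selection can be repeated for every species column in one pass instead of editing the column number by hand. The sketch below is one possible way to do it, under the assumptions that the table is tab-separated, that the header line holds the species names, and that a column named `Total` should be skipped; `split_by_species` and the output directory name `per_species` are made up for this example.

```shell
# split_by_species FILE OUTDIR: for every species column, write the IDs
# of orthogroups with at least one gene to OUTDIR/<species name>.txt.
# Hypothetical helper; assumes a tab-separated table with a header line.
split_by_species() {
    local counts="$1" outdir="$2"
    mkdir -p "$outdir"
    awk -F'\t' -v outdir="$outdir" '
        NR == 1 {
            # Column names come from the header line
            for (i = 2; i <= NF; i++) name[i] = $i
            next
        }
        {
            for (i = 2; i <= NF; i++)
                if (name[i] != "Total" && $i > 0)
                    print $1 > (outdir "/" name[i] ".txt")
        }
    ' "$counts"
}

if [ -f Orthogroups.GeneCount.csv ]; then
    split_by_species Orthogroups.GeneCount.csv per_species
fi
```

Each resulting file plays the same role as 1.txt above, one per species, so the lists can be compared directly, for example to draw a Venn diagram of shared gene families.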
