Microbiome: Installing the metaWRAP metagenomic binning pipeline and setting up its databases

生物_医药_科研 2018-12-04

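Introduction

metaWRAP is a powerful metagenomic analysis pipeline focused on metagenomic binning. The paper was published in Microbiome on September 15, 2018. For a summary of the paper, see the reference link.

The software is open source; the code and tutorials are available at: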

https://github.com/bxlab/metaWRAP

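How it works

The metaWRAP workflow

In the figure, red marks analysis modules, green marks metagenomic data, orange marks intermediate files, and blue marks result figures and tables.

The pipeline covers quality control of raw reads, taxonomic annotation and visualization, metagenome assembly, binning with three mainstream methods followed by filtering and visualization of the results, reassembly of bins, and taxonomic and functional annotation of bins. It handles the great majority of bin-related analysis and visualization needs.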

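Advantages

Figure 2. Completeness and contamination of the results from six binning tools, evaluated on the CAMI synthetic datasets at high, medium, and low data volumes. The results show that metaWRAP performed better on both completeness and contamination in every case.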

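Functional modules

Metagenomic data preprocessing modules

1) Read_QC (quality control): read trimming and removal of human host reads
2) Assembly: quality control, then assembly with megahit or metaSPAdes
3) Kraken (taxonomic annotation): taxonomic profiling and visualization at the read and contig level

Bin processing modules

1) Binning: bin separately with three tools, MaxBin2, metaBAT2, and CONCOCT
2) Bin_refinement: evaluate and combine the bin sets from the different tools to obtain better results
3) Reassemble_bins: reassemble bins using the raw reads and assessment software to improve bin N50 and completeness
4) Quant_bins: estimate the abundance of each bin in the samples and display it as a heatmap
5) Blobology: blobplots that visualize the taxonomy and bin membership of the community's contigs
6) Classify_bins: taxonomic annotation of bins
7) Annotate_bins: gene prediction in bins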

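Software installation

System requirements

System requirements depend on the amount of data being processed. Some of the tools, such as KRAKEN and metaSPAdes, need a lot of memory. A server with at least 8 cores and 64 GB of RAM is recommended, and only 64-bit Linux is supported. For users with more than 300 GB of data, 48 cores and 512 GB of RAM or more is recommended.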
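A quick way to compare a server against the minimum figures above is sketched below. It assumes a Linux host with `nproc` and `/proc/meminfo`; the function name `meets_minimum` is made up for illustration and is not part of metaWRAP.

```shell
# Hypothetical helper: succeed if the given core count and memory (GB)
# reach the recommended minimum of 8 cores and 64 GB.
meets_minimum() {
    local cores="$1" mem_gb="$2"
    [ "$cores" -ge 8 ] && [ "$mem_gb" -ge 64 ]
}

# Read this machine's values (Linux only).
cores=$(nproc)
# MemTotal is reported in kB; convert to whole GB.
# Note: a nominal 64 GB machine may report slightly less than 64 here.
mem_gb=$(awk '/^MemTotal:/ {print int($2 / 1024 / 1024)}' /proc/meminfo)
mem_gb=${mem_gb:-0}

if meets_minimum "$cores" "$mem_gb"; then
    echo "OK: ${cores} cores, ${mem_gb} GB RAM"
else
    echo "Below the recommended minimum: ${cores} cores, ${mem_gb} GB RAM"
fi
```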

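The original authors' tutorial uses 96 threads and 900 GB of memory in its parameters, which suggests that the server used for development and testing had at least 96 threads and about 1 TB of memory.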

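Installing conda

(Skip this step if conda is already installed; for details see "Nature Methods: Bioconda takes the pain out of installing bioinformatics software")

wget https://repo./miniconda/Miniconda2-latest-Linux-x86_64.sh
bash Miniconda2-latest-Linux-x86_64.sh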

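Direct installation (did not work for me, not recommended)

This method is convenient, but the installation may fail, the environment may not meet the requirements, or it may interfere with other programs that are already installed.

# ORDER IS IMPORTANT!!!
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels ursky
conda install -c ursky metawrap-mg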

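Installation in a virtual environment (recommended)

metaWRAP depends on more than 140 software packages, which can easily conflict with software that is already installed. Installing it in a conda virtual environment is therefore strongly recommended.

You have to enter the virtual environment each time you use metaWRAP and leave it when you are done. That costs two extra lines of commands, but it is safer.

conda create -n metawrap python=2.7
source activate metawrap
# ORDER IS IMPORTANT!!!
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda config --add channels ursky
conda install -c ursky metawrap-mg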
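Because every session has to start inside the environment, a small guard at the top of an analysis script can catch the case where activation was forgotten. The sketch below relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation; the function name `in_conda_env` is hypothetical.

```shell
# Hypothetical guard: succeed only if the named conda environment is active.
in_conda_env() {
    [ "${CONDA_DEFAULT_ENV:-}" = "$1" ]
}

if in_conda_env metawrap; then
    echo "metawrap environment is active"
else
    echo "Run 'source activate metawrap' first" >&2
fi
```

When the work is finished, `source deactivate` (or `conda deactivate` on newer conda releases) leaves the environment again.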

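Manual installation (not recommended)

Of course, if you do not like conda, the software can also be installed manually, which gives you finer control over your environment variables. The list of dependencies is at https://github.com/bxlab/metaWRAP/blob/master/installation/dependancies.md

This is not recommended: even an expert may need 3-7 days, and for someone unfamiliar with Linux it is practically an impossible task.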

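Database configuration

Installing the software with conda does not include the databases. You need to download them manually and then set their locations.

For details on downloading the databases, see https://github.com/bxlab/metaWRAP/blob/master/installation/database_installation.md

The main databases, their sizes, and the modules that depend on them are as follows: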

Database	Size	Used in module
Checkm	1.4GB	binning, bin_refinement, reassemble_bins
KRAKEN	192GB	kraken
NCBI_nt	99GB	blobology, classify_bins
NCBI_tax	283MB	blobology, classify_bins
Indexed hg38	34GB	read_qc
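Adding up the table gives a rough idea of the final footprint. The snippet below is only arithmetic on the sizes listed above, with 283 MB written as 0.283 GB, done in awk.

```shell
# Sum the database sizes from the table, in GB.
total_gb=$(awk 'BEGIN { printf "%.1f", 1.4 + 192 + 99 + 0.283 + 34 }')
echo "Total database size: about ${total_gb} GB"
```

The finished databases come to roughly 327 GB. Extra headroom is needed mainly for temporary files during download and indexing.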

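Here we install the databases into the ~/db directory. Make sure you have write permission there and at least 500 GB of free space. Adjust the path to a location of your own where you have permission and enough space.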

mkdir -p ~/db
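Before starting the downloads, it may be worth confirming that the target filesystem really has the space. A minimal sketch using `df` follows; the helper name `free_gb` and the fallback to `$HOME` when `~/db` does not exist yet are assumptions made for this example.

```shell
# Hypothetical helper: print free space (whole GB) on the filesystem holding a path.
free_gb() {
    # -P forces one line per filesystem, -k reports 1024-byte blocks;
    # column 4 of the second line is the available space.
    df -Pk "$1" | awk 'NR == 2 {print int($4 / 1024 / 1024)}'
}

need_gb=500
target="${HOME:-.}"
# Prefer the database directory itself if it already exists.
if [ -d "${target}/db" ]; then
    target="${target}/db"
fi

have_gb=$(free_gb "$target")
have_gb=${have_gb:-0}
if [ "$have_gb" -ge "$need_gb" ]; then
    echo "${have_gb} GB free under ${target}: enough"
else
    echo "Only ${have_gb} GB free under ${target}; about ${need_gb} GB is needed" >&2
fi
```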

CheckM database

The download is 276 MB; it takes 1.4 GB after extraction.

cd ~/db
mkdir checkm
checkm data setRoot
# CheckM will prompt you to choose your storage location...
# Now manually download the database:
cd checkm
wget https://data.ace./public/CheckM_databases/checkm_data_2015_01_16.tar.gz
tar -xvf *.tar.gz
rm *.gz

KRAKEN database

Downloading and building the index needs more than 300 GB of space; the finished database occupies 192 GB.

cd ~/db
mkdir kraken
kraken-build --standard --threads 24 --db kraken
kraken-build --db kraken --clean

NCBI_nt

The download is 41 GB and took me about 12 hours; it takes 99 GB after extraction.

cd ~/db
mkdir NCBI_nt && cd NCBI_nt
wget -c 'ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.*.tar.gz'
for a in nt.*.tar.gz; do tar xzf "$a"; done

NCBI taxonomy

The compressed file is 45 MB; it takes 351 MB after extraction.

cd ~/db
mkdir NCBI_tax
cd NCBI_tax
wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -xvf taxdump.tar.gz

Human genome BMTagger index

The human genome download is 942 MB; it is 3.2 GB after extraction and concatenation, and 34 GB once the index is built.

cd ~/db
mkdir BMTAGGER_INDEX
cd BMTAGGER_INDEX
wget ftp://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/*fa.gz
gunzip *fa.gz
cat *fa > hg38.fa
rm chr*.fa
bmtool -d hg38.fa -o hg38.bitmask
srprism mkindex -i hg38.fa -o hg38.srprism -M 100000
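Once all the downloads have finished, a quick check that each database directory exists and is not empty can save a failed run later. The sketch below uses the directory names from the commands above; the helper name `check_dbs` is hypothetical, and the check says nothing about whether the contents are complete.

```shell
# Hypothetical helper: report which database directories under a root
# are missing or empty. Returns non-zero if any of them is.
check_dbs() {
    local root="$1" missing=0 d
    for d in checkm kraken NCBI_nt NCBI_tax BMTAGGER_INDEX; do
        if [ -d "$root/$d" ] && [ -n "$(ls -A "$root/$d")" ]; then
            echo "ok       $d"
        else
            echo "MISSING  $d"
            missing=1
        fi
    done
    return "$missing"
}

check_dbs "${HOME}/db" || echo "Some databases still need to be set up" >&2
```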

Setting the database locations

The configuration file is config-metawrap. Use the following command to find where it is:

which config-metawrap

Then use a text editor such as vi, vim, or gedit to change the database paths in that file.
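If you would rather script the change than edit by hand, something like the following could work. It is a sketch that assumes the file holds plain `VAR=path` lines; the variable names `KRAKEN_DB` and `BLASTDB` are assumptions about the file's contents, so check your own copy first. The helper name `set_db_path` is hypothetical, and `sed -i` as used here is the GNU sed form.

```shell
# Hypothetical helper: rewrite a VAR=path line in a config file in place.
set_db_path() {
    local file="$1" var="$2" path="$3"
    # Use | as the sed delimiter because paths contain slashes.
    sed -i "s|^${var}=.*|${var}=${path}|" "$file"
}

# Demonstrate on a scratch file rather than the real config file.
cfg=$(mktemp)
printf 'KRAKEN_DB=/old/kraken\nBLASTDB=/old/nt\n' > "$cfg"
set_db_path "$cfg" KRAKEN_DB "${HOME}/db/kraken"
set_db_path "$cfg" BLASTDB "${HOME}/db/NCBI_nt"
cat "$cfg"
rm -f "$cfg"
```

For the real file, back it up first and pass `"$(which config-metawrap)"` as the first argument.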

Parameter overview

The metawrap program bundles all the functional modules, and each module can be run on its own. Run metawrap -h to list the module names:

Usage: metawrap [module] --help
Options:
read_qc            Raw read QC module
assembly           Assembly module
binning            Binning module
bin_refinement     Refinement of bins from binning module
reassemble_bins    Reassemble bins using metagenomic reads
quant_bins         Quantify the abundance of each bin across samples
blobology          Blobology module (visualization)
kraken             KRAKEN module (taxonomic annotation)

To see the detailed parameters of a module, for example assembly, run metawrap assembly -h:

Usage: metawrap assembly [options] -1 reads_1.fastq -2 reads_2.fastq -o output_dir
Options:
-1 STR              forward fastq reads
-2 STR              reverse fastq reads
-o STR              output directory
-m INT              memory in GB (default=10)
-t INT              number of threads (default=1)
--use-megahit       assemble with megahit (default)
--use-metaspades    assemble with metaspades instead of megahit
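Putting the options from the usage text together, an assembly call might be built as below. The sketch only prints the command line so it can be inspected before running. The helper name `assembly_cmd` is hypothetical, and the file names, output directory, memory, and thread count are placeholders.

```shell
# Hypothetical helper: build a 'metawrap assembly' command line from the
# options listed in the usage text, using megahit as the assembler.
assembly_cmd() {
    local r1="$1" r2="$2" out="$3" mem_gb="$4" threads="$5"
    echo "metawrap assembly -1 $r1 -2 $r2 -m $mem_gb -t $threads --use-megahit -o $out"
}

# Print the command for inspection; run it by hand inside the metawrap environment.
assembly_cmd reads_1.fastq reads_2.fastq ASSEMBLY 64 8
```

Omitting `-m` and `-t` falls back to the defaults shown in the usage text (10 GB and 1 thread), which is likely too little for real datasets.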

For detailed usage, see tomorrow's hands-on walkthrough.