【原】ChIP-seq数据分析课程学习笔记之合并bam以及使用macs找peaks

健明 2021-08-11

展开全文

咱们《生信技能树》的B站有一个ChIP-seq数据分析实战视频课程，缺乏配套笔记。恰好前些天的求贤令：曾经我给你带来了十万用户，但现在祝你倒闭，走大运结识了几位优秀小伙伴！

首先视频免费共享在B站：【生信技能树】Chip-seq测序数据分析ChIP-SEQ实战演练的素材：链接：https://share./53CwQ8B 密码：ju3rrh，包括一些公司PPT，综述以及文献 ChIP-SEQ 实战演练的思维导图：文档链接：https:///doc/11taEb9ZYg 密码：wk29

接下来是根据生信技能树课程、思维导图、PPT和综述文献整理的笔记。

今天的分享从conda安装的bowtie2一直这样报错说起。

(chipseq) vip13t16@bw-X11DAi-N:~$ bowtie2 --help
/home/data/vip13t16/biosoft/miniconda/yes/envs/chipseq/bin/bowtie2-align-s: error while loading shared libraries: libtbb.so.2: cannot open shared object file: No such file or directory
(ERR): Description of arguments failed!
Exiting now ...

然后我发现群里这样报错的不止我一个人！

bowtie2报错

经过一番探索在简书上找到这样的帖子：Linux下bowtie2安装(非conda)和配置

养成习惯，安装在固定目录下比如家目录下，建biosoft文件夹，下面建各个软件的文件夹
mkdir biosoft/bowtie2 && cd biosoft/bowtie2 wget https:///projects/bowtie-bio/files/bowtie2/2.3.5.1/bowtie2-2.3.5.1-linux-x86_64.zip/download# unzip bowtie2-2.3.5.1-linux-x86_64.zip #下面把bowtie2写入bashrc,以便以后随便哪个目录都可以调用 vim ~/.bashrc #最后一行加入 export PATH="$PATH:/home/data/vip13t16/biosoft/bowtie2/bowtie2-2.3.5.1-linux-x86_64/"
按Esc键退出编辑:wq保存并退出，最后source ~/.bashrc

然后再构建索引进行比对

合并bam文件

因为一个样品分成了多个lane进行测序，所以在进行peaks calling的时候，需要把bam进行合并。

如果不用循环：

samtools merge  control.merge.bam Control_1_trimmed.bam Control_2_trimmed.bam

通常我们用批处理：

cd ~/project/epi/ 
mkdir mergeBam
source activate chipseq
cd ~/project/epi/align
ls *.bam|sed 's/_[0-9]_trimmed.bam//g' |sort -u |while read id;do samtools merge ../mergeBam/$id.merge.bam $id*.bam ;done

得到全新的bam文件如下：一共是9个。

merge.bam

是否做PCR重复（要根据文章）

PCR重复

如果存在PCR重复，起始点、终止点都一样，但是有两条带。在http://www. 有两篇可以学习的帖子。

两篇推文

cd  ~/project/epi/mergeBam 
source activate chipseq
ls  *merge.bam  | while read id ;do (nohup samtools markdup -r $id $(basename $id ".bam").rmdup.bam & );done
# 针对新生成的.rmdup.bam文件构建索引、统计
ls  *.rmdup.bam  |xargs -i samtools index {} 
ls  *.rmdup.bam  | while read id ;do (nohup samtools flagstat $id > $(basename $id ".bam").stat & );done

使用macs2进行找peaks

macs2包含一系列的子命令，其中最主要的就是callpeak，官方提供了使用实例

macs2 callpeak -t ChIP.bam -c Control.bam -f BAM -g hs -n test -B -q 0.01

一般而言，我们照葫芦画瓢，按照这个实例替换对应部分就行了，介绍一下各个参数的意义
-t: 实验组的输出结果
-c: 对照组的输出结果
-f: -t和-c提供文件的格式，可以是”ELAND”, “BED”, “ELANDMULTI”, “ELANDEXPORT”, “ELANDMULTIPET” (for pair-end tags), “SAM”, “BAM”, “BOWTIE”, “BAMPE” “BEDPE” 任意一个。如果不提供这项，就是自动检测选择。
-g: 基因组大小，默认提供了hs, mm, ce, dm选项，不在其中的话，比如说拟南芥，就需要自己提供了。
-n: 输出文件的前缀名
-B: 会保存更多的信息在bedGraph文件中，如fragment pileup, control lambda, -log10pvalue and -log10qvalue scores
-q: q值，也就是最小的PDR阈值，默认是0.05。q值是根据p值利用BH计算，也就是多重试验矫正后的结果。
-p：这个是p值，指定p值后MACS2就不会用q值了。
-m: 和MFOLD有关，而MFOLD和MACS预构建模型有关，默认是5：50，MACS会先寻找100多个peak区构建模型，一般不用改，因为你很大概率上不会懂。

cd  ~/project/epi/mergeBam 
source activate chipseq
#如果不存在 ${id}_summits.bed这个文件
ls  *merge.bam |cut -d"." -f 1 |while read id;
do 
 if [ ! -s ${id}_summits.bed ];
 then 
 echo $id 
 nohup macs2 callpeak -c  Control.merge.bam -t $id.merge.bam -f BAM -B -g mm -n $id --outdir ../peaks  2> $id.log &  
 fi 
done  

#对去PCR重复的再做一次
mkdir dup
mv *rmdup* dup/
cd dup/

ls  *.merge.rmdup.bam |cut -d"." -f 1 |while read id;
do 
 if [ ! -s ${id}_rmdup_summits.bed ];
 then 
echo $id 
nohup macs2 callpeak -c  Control.merge.rmdup.bam  -t $id.merge.rmdup.bam  -f BAM -B -g mm -n ${id}_rmdup --outdir ../peaks 2> $id.log &  
 fi 
done

jimmy大神的

我的

但是我对比了一下我和大神的结果，我的peak文件夹里空空如也。是的，小菜狗又有哪里出问题了(T~T)

打开了我的log

于是然后去找了MACS的官方文档，https://github.com/macs3-project/MACS，都已经MACS3了啊，看了一眼前几天的推文：ChIP-Seq数据分析上下游打通，人家也是用的MACS3，那就来下载吧~

MACS (3.0.0a6)

Usage

Example for regular peak calling on TF ChIP-seq:

macs3 callpeak -t ChIP.bam -c Control.bam -f BAM -g hs -n test -B -q 0.01

Example for broad peak calling on Histone Mark ChIP-seq:

macs3 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1

Example for peak calling on ATAC-seq (paired-end mode):

macs3 callpeak -f BAMPE -t ATAC.bam -g hs -n test -B -q 0.01

There are currently twelve functions available in MAC3 serving as sub-commands.

Subcommand	Description
`callpeak`	Main MACS3 Function to call peaks from alignment results.
`bdgpeakcall`	Call peaks from bedGraph output.
`bdgbroadcall`	Call broad peaks from bedGraph output.
`bdgcmp`	Comparing two signal tracks in bedGraph format.
`bdgopt`	Operate the score column of bedGraph file.
`cmbreps`	Combine BEDGraphs of scores from replicates.
`bdgdiff`	Differential peak detection based on paired four bedGraph files.
`filterdup`	Remove duplicate reads, then save in BED/BEDPE format.
`predictd`	Predict d or fragment size from alignment results.
`pileup`	Pileup aligned reads (single-end) or fragments (paired-end)
`randsample`	Randomly choose a number/percentage of total reads.
`refinepeak`	Take raw reads alignment, refine peak summits.
`callvar`	Call variants in given peak regions from the alignment BAM files.

安装&使用：（此处省略Python环境配置）

pip install MACS3
#下载成功会有：
#Successfully built MACS3
#Installing collected packages: cykhash, MACS3
#Successfully installed MACS3-3.0.0a4 cykhash-1.0.2
ls  *merge.bam |cut -d"." -f 1 |while read id;
do 
 if [ ! -s ${id}_summits.bed ];
 then 
 echo $id 
 nohup macs3 callpeak -c  Control.merge.bam -t $id.merge.bam -f BAM -B -g mm -n $id --outdir ../peaks  2> $id.log &  
 fi 
done

现在的

文件在~/project/epi/peaks里面

ls -lh *.bed |cut -d " " -f 5-
#以下是结果
   0 8月   8 14:30 Control_summits.bed
 69K 8月   8 14:30 H2Aub1_summits.bed
1.1M 8月   8 14:31 H3K36me3_summits.bed
1.2M 8月   8 14:31 Ring1B_summits.bed
1.9M 8月   8 14:30 RNAPII_8WG16_summits.bed
814K 8月   8 14:31 RNAPII_S2P_summits.bed
2.0M 8月   8 14:29 RNAPII_S5PRepeat_summits.bed
2.7M 8月   8 14:30 RNAPII_S5P_summits.bed
3.1M 8月   8 14:30 RNAPII_S7P_summits.bed

下一次，将带来激动人心的 可视化 部分的分享

题外话：真的是 大神一句话，菜鸟跑半年。这两节加起来20分钟的课程，感觉自己踩的坑大概比这篇笔记的字数还多。经验教训：多读文献，多做实战，shell脚本的知识点还要好好看。