【原】(11)仿写bowtie-生信菜鸟团博客2周年精选文章集

健明 2021-07-14

展开全文

然后仿写了bowtie，对我的编程技术提高非常有帮助。目录如下：

自己动手写bowtie第一讲：BWT算法详解并建立索引
自己动手写bowtie第二讲：优化索引
自己动手写bowtie第三讲：序列查询。
自己动手写bowtie第4讲：笨方法字符串搜索
Bowtie算法第五讲-index2tally
Bowtie算法第六讲-tally法对bwt索引进行搜索

首先，什么是BWT，可以参考博客

http://www.cnblogs.com/xudong-bupt/p/3763814.html

他讲的非常好。

一个长度为n的串A1A2A3…An经过旋转可以得到

A1A2A3…An

A2A3…AnA1

A3…AnA1A2

…

AnA1A2A3…

n个串，每个字符串的长度都是n。

对这些字符串进行排序，这样它们之前的顺序就被打乱了，打乱的那个顺序就是index，需要输出。

首先我们测试一个简单的字符串acaacg$,总共六个字符，加上一个$符号，下次再讲$符号的意义。

实现以上功能是比较简单的，代码如下

但是这是对于6个字符串等小片段字符串，如果是是几千万个字符的字符串，这样转换就会输出千万的平方个字符串组成的正方形数组，是很恐怖的数据量。所以在转换的同时就不能把整个千万字符储存在内存里面。

在生物学领域，是这样的，这千万个千万个碱基的方阵，我们取每个字符串的前20个字符串就足以对它们进行排序，当然这只是近视的，我后面会讲精确排序，而且绕过内存的方法。

Perl程序如下

[perl]

while (<>){

next if />/;

chomp;

$a.=$_;

}

$a.=’$';

$len=length $a;

$i=0;

print "first we transform it !!!\n";

foreach (0..$len-1){

$up=substr($a,0,$_);

$down=substr($a,$_);

#print "$down$up\n";

#$hash{"$down$up"}=$i;

$key=substr("$down$up",0,20);

$key=$key.”\t”.substr("$down$up",$len-1);

$hash{$key}=$i;

$i++;

}

print "then we sort it\n";

foreach (sort keys %hash){

$first=substr($_,0,1);

$len=length;

$last=substr($_,$len-1,1);

#print "$first\t$last\t$hash{$_}\n";

print "$_\t$hash{$_}\n";

}

[/perl]

运行的结果如下

个人觉得这样排序是极好的，但是暂时还没想到如何解决不够精确的问题！！！

参考：

http://tieba.baidu.com/p/1504205984

http://www.cnblogs.com/xudong-bupt/p/3763814.html

其中第一讲我提到了一个简单的索引产生方式，因为是课堂就半个小时想的，很多细节没有考虑到，对病毒那种几K大小的基因组来说是很简单的，速度也非常快，但是我测试了一下酵母，却发现好几个小时都没有结果，我只好kill掉重新改写算法，我发现之前的测序最大的问题在于没有立即substr函数的实现方式，把一个5M的字符串不停的截取首尾字符串好像是一个非常慢的方式。

所以我优化了那个字符串的函数，虽然代码量变多了，实现过程也繁琐了一点，但是速度提升了几千倍。

time perl bwt_new_index.pl e-coli.fa >e-coli.index

测试了一下我的脚本，对酵母这样的5M的基因组，索引耗费时间是43秒

real 0m43.071s

user 0m41.277s

sys 0m1.779s

输出的index矩阵如下，我简单的截取头尾各10行给大家看，一点要看懂这个index。

首先第一列就是我们的BWT向量，也就是BWT变换后的尾字符

第二列是之前的顺序被BWT变换后的首字符排序后的打乱的顺序。

第三，四，五，六列分别是A,C,G,T的计数，就是在当行之前累积出现的A,C,G,T的数量，是对第一列的统计。

这个索引文件将会用于下一步的查询，这里贴上我新的索引代码，查询见下一篇文章

[perl]

while (<>){

next if />/;

chomp;

$a.=$_;

}

$len=length $a;

open FH_F,">tmp_forward.txt";

open FH_R,">tmp_reverse.txt";

for(my $i=0;$i<=$len-1;$i+=20){

print FH_F substr($a,$i,20);

print FH_F "\n";

}

$rev_a=reverse $a;

for(my $i=0;$i<=$len-1;$i+=20){

print FH_R substr($rev_a,$i,20);

print FH_R "\n";

}

close FH_F;

close FH_R;

$a=”;

open FH_F,"tmp_forward.txt";

open FH_R,"tmp_reverse.txt";

#把前一行的所有20bp碱基当做后一行的头部信息

$residue_F=<FH_F>;

$residue_R=<FH_R>;

$i=0;

while ($F_reads=<FH_F>){

$R_reads=<FH_R>;

$F_merge=$residue_F.$F_reads;

$R_merge=$residue_R.$R_reads;

#这样每次就需要处理20个碱基

foreach (0..19) {

$up =substr($F_merge,$_,20);

$down=substr($R_merge,$_,1);

$hash{"$up\t$down"}=$i;

$i++;

}

#处理完毕之后再保存当行的20bp碱基做下一行的头部信息

$residue_F=$F_reads;

$residue_R=$R_reads;

}

#print "then we sort it\n";

$count_a=0;

$count_c=0;

$count_g=0;

$count_t=0;

foreach (sort keys %hash){

$first=substr($_,0,1);

$len=length;

$last=substr($_,$len-1,1);

#print "$first\t$last\t$hash{$_}\n";

$count_a++ if $last eq 'A';

$count_c++ if $last eq 'C';

$count_g++ if $last eq 'G';

$count_t++ if $last eq 'T';

print "$last\t$hash{$_}\t$count_a\t$count_c\t$count_g\t$count_t\n";

}

unlink("tmp_forward.txt");

unlink("tmp_reverse.txt");

[/perl]

查询需要根据前面建立的索引来做。

这是一个比较复杂的过程，我也是看了bowtie的作者的ppt才慢慢弄懂的，感觉自己也不可能三言两语就说清楚，一般都是辅助图片，动画，再经过多方交流才能慢慢理解。

所以大家呢，就自己去看ppt，看懂那个查询算法。（ppt及代码在我的群里面有共享，欢迎大家加群交流）

这里我简单讲讲我的程序

首先读取索引文件，统计好A,C,G,T的总数

然后把查询序列从最后一个字符往前面回溯。

我创建了一个子函数，专门来处理回溯的问题

每次接受四个参数（左右两端的碱基，上下的阈值），并返回两个参数（新的上下两个阈值）

大家要看懂阈值是如何更新迭代，这样动态的一个个回溯字符串，一个个迭代阈值。

直到四种临界情况的出现。

第一是上下阈值已经相等了，但是我们还没有回溯完全，那就说明字符串只能查找后几个字符，前面还有字符是无法匹配的

第二种情况是上下阈值已经相等了，正巧我们也回溯到了最后一个字符串，那么我们就找到了精确匹配。

第三种情况是已经进行到了最后一个字符串，但是上下阈值还有差值，那么就找到了多个精确匹配点。

最后一种情况是各种非法字符。

然后我简单的测序了一下在病毒的5K基因组里面的精确匹配情况，好像效果还挺好的

但是在酵母里面还有一个问题没有解决，就是取前二十个字符串排序的问题，不够精确，需要重新审视排序结果进行局部优化，可能是需要用堆排序发，具体我还得考虑一个星期，只能等下周上课再看看了，平时太忙了，基本没时间码代码。

这里贴上我的代码给大家看看，

[perl]

$a=’CGCTATGTACTGGATGCGCTGGCAAACGAGCCTGCCGTAAG';

while(<>){

chomp;

@F=split;

$hash_count_atcg{$F[0]}++;

$hash{$.}=$_;

}

$all_a=$hash_count_atcg{'A’};

$all_c=$hash_count_atcg{'C’};

$all_g=$hash_count_atcg{'G’};

$all_t=$hash_count_atcg{'T’};

#print "$all_a\t$all_c\t$all_g\t$all_t\n";

$len_a=length $a;

$end_a=$len_a-1;

print "your query is $a\n";

print "and the length of your query is $len_a \n";

foreach (reverse (0..$end_a)){

$after=substr($a,$_,1);

$before=substr($a,$_-1,1);

#对第一个字符进行找阈值的时候，我们需要人为的定义起始点！

if($_ == $end_a){

if ($after eq 'A’) {

$start=1;

$end=$all_a;

}

elsif ($after eq 'C’) {

$start=$all_a+1;

$end=$all_a+$all_c;

}

elsif ($after eq 'G’) {

$start=$all_a+$all_c+1;

$end=$all_a+$all_c+$all_g;

}

elsif ($after eq 'T’){

$start=$all_a+$all_c+$all_g+1;

$end=$all_a+$all_c+$all_g+$all_t;

}

else {print "error !!! we just need A T C G !!!\n";exit;}

}

#如果阈值已经无法继续分割，但是字符串还未查询完

if ($_ > 0 && $start == $end) {

$find_char=substr($a,$_);

$find_len=length $find_char;

#这里需要修改，但是不影响完全匹配了

print "we can just find the last $find_len char ,and it is $find_char \n";

exit;

}

#如果进行到了最后一个字符

if ($_ == 0) {

if ($start == $end) {

print "It is just one perfect match ! \n";

my @F_start=split/\s+/,$hash{$start};

print "The index is $F_start[1]\n";

exit;

}

else {

print "we find more than one perfect match!!!\n";

#print "$start\t$end\n";

foreach ($start..$end) {

my @F_start=split/\s+/,$hash{$_};

print "One of the index is $F_start[1]\n";

}

exit;

}

($start,$end)=&find_level($after,$before,$start,$end);

}

sub find_level{

my($after,$before,$start,$end)=@_;

my @F_start=split/\s+/,$hash{$start};

my @F_end=split/\s+/,$hash{$end};

if ($before eq 'A’) {

return ($F_start[2],$F_end[2]);

}

elsif ($before eq 'C’) {

return ($all_a+$F_start[3],$all_a+$F_end[3]);

}

elsif ($before eq 'G’) {

return ($all_a+$all_c+$F_start[4],$all_a+$all_c+$F_end[4]);

}

elsif ($before eq 'T’) {

return ($all_a+$all_c+$all_g+$F_start[5],$all_a+$all_c+$all_g+$F_end[5]);

}

else {print "sorry , I can’t find the right match!!!\n";}

}

#perl -alne '{next if />/;$all.=$_;}END{print substr($all,308,10)}’ lambda_virus.fa

[/perl]

由于之前就简单的看了看bowtie作者的ppt，没有完全吃透就开始敲代码了，写了十几个程序最后我自己都搞不清楚进展到哪一步了，所以我现在整理一下，从新开始！！！

首先，bowtie的作用就是在一个大字符串里面搜索一个小字符串！那么本身就有一个非常笨的复杂方法来搜索，比如，大字符串长度为100万，小字符串为10，那么就依次取出大字符串的10个字符来跟小字符串比较即可，这样的算法是非常不经济的，我简单用perl代码实现一下。

[perl]

#首先读取大字符串的fasta文件

open FH ,"<$ARGV[0]";

$i=0;

while (<FH>) {

next if /^>/;

chomp;

$a.=(uc);

}

#print "$a\n";

#然后接受我们的小的查询字符串

$query=uc $ARGV[1];

$len=length $a;

$len_query=length $query;

$a=$a.’$’.$a;

#然后依次循环取大字符串来精确比较！

foreach (0..$len-1){

if (substr($a,$_,$len_query) eq $query){

print "$_\n";

#last;

}

[/perl]

这样在时间复杂度非常恐怖，尤其是对人的30亿碱基。

正是因为这样的查询效率非常低，所以我们才需要用bwt算法来构建索引，然后根据tally来进行查询

其中构建索引有三种方式，我首先讲最效率最低的那种索引构造算法，就是依次取字符串进行旋转，然后排序即可。

[perl]

$a=uc $ARGV[0];

$len=length $a;

$a=$a.’$’.$a;

foreach (0..$len){

$hash{substr($a,$_,$len+1)}=$_;

}

#print "$_\t$hash{$_}\n" foreach sort keys %hash;

print substr($_,-1),"\t$hash{$_}\n" foreach sort keys %hash;

[/perl]

这个算法从时间复杂度来讲是非常经济的，对小字符串都是瞬间搞定！！！

perl rotation_one_by_one.pl atgcgtanngtc 这个字符串的BWT矩阵索引如下！

C 12

T 6

$ 0

T 11

G 3

T 2

C 4

N 9

N 8

A 7

G 5

G 10

A 1

但同样的，它也有一个无法避免的弊端，就是内存消耗太恐怖。对于30亿的人类碱基来说，这样旋转会生成30亿乘以30亿的大矩阵，一般的服务器根本hold不住的。

最后我讲一下，这个BWT矩阵索引如何还原成原字符串，这个没有算法的差别，因为就是很简单的原理。

[perl]

#first read the tally !!!

#首先读取上面输出的BWT矩阵索引文件。

open FH,"<$ARGV[0]";

$hash_count{'A’}=0;

$hash_count{'C’}=0;

$hash_count{'G’}=0;

$hash_count{'T’}=0;

while(<FH>){

chomp;

@F=split;

$hash_count{$F[0]}++;

$hash{$.}="$F[0]\t$F[1]\t$hash_count{$F[0]}";

#print "$hash{$.}\n";

}

$all_a=$hash_count{'A’};

$all_c=$hash_count{'C’};

$all_g=$hash_count{'G’};

$all_t=$hash_count{'T’};

$all_n=$hash_count{'N’};

#start from the first char !

$raw=”;

&restore(1);

sub restore{

my($num)=@_;

my @F=split/\t/,$hash{$num};

$raw.=$F[0];

my $before=$F[0];

if ($before eq 'A’) {

$new=$F[2]+1;

}

elsif ($before eq 'C’) {

$new=1+$all_a+$F[2];

}

elsif ($before eq 'G’) {

$new=1+$all_a+$all_c+$F[2];

}

elsif ($before eq 'N’) {

$new =1+$all_a+$all_c+$all_g+$F[2];

}

elsif ($before eq 'T’) {

$new=1+$all_a+$all_c+$all_g+$all_n+$F[2];

}

elsif ($before eq '$’) {

chop $raw;

$raw = reverse $raw;

print "$raw\n";

exit;

}

else {die "error !!! we just need A T C N G !!!\n"}

#print "$F[0]\t$new\n";

&restore($new);

}

[/perl]

前面讲到了如何用笨方法进行字符串搜索，也讲了如何构建bwt索引，和把bwt索引还原成字符串！

原始字符串是ATGCGTANNGTC

排序过程是下面的

$ATGCGTANNGTC 12

ANNGTC$ATGCGT 6

ATGCGTANNGTC$ 0

C$ATGCGTANNGT 11

CGTANNGTC$ATG 3

GCGTANNGTC$AT 2

GTANNGTC$ATGC 4

GTC$ATGCGTANN 9

NGTC$ATGCGTAN 8

NNGTC$ATGCGTA 7

TANNGTC$ATGCG 5

TC$ATGCGTANNG 10

TGCGTANNGTC$A 1

现在讲讲如何根据bwt索引构建tally，并且用tally搜索方法来搜索字符串！

首先是bwt索引转换为tally

C 12

T 6

$ 0

T 11

G 3

T 2

C 4

N 9

N 8

A 7

G 5

G 10

A 1

这个其实非常简单的，tally就是增加四列计数的列即可

[perl]

$hash_count{'A’}=0;

$hash_count{'C’}=0;

$hash_count{'G’}=0;

$hash_count{'T’}=0;

open FH ,"<$ARGV[0]";

while(<FH>){

chomp;

@F=split;

$last=$F[0]; # 读取上面的tally文件，分列，判断第一列，并计数

$hash_count{$last}++;

print "$_\t$hash_count{'A’}\t$hash_count{'C’}\t$hash_count{'G’}\t$hash_count{'T’}\n";

}

[/perl]

输出的tally如下

C 12 0 1 0 0

T 6 0 1 0 1

$ 0 0 1 0 1

T 11 0 1 0 2

G 3 0 1 1 2

T 2 0 1 1 3

C 4 0 2 1 3

N 9 0 2 1 3

N 8 0 2 1 3

A 7 1 2 1 3

G 5 1 2 2 3

G 10 1 2 3 3

A 1 2 2 3 3

接下来就是针对这个tally的查询函数了

因为要讲搜索，所以我选择了一个长一点的字符串来演示多种情况的搜索

perl rotation_one_by_one.pl atgtgtcgtagctcgtnncgt

程序运行的结果如下

$ATGTGTCGTAGCTCGTNNCGT 21

AGCTCGTNNCGT$ATGTGTCGT 9

ATGTGTCGTAGCTCGTNNCGT$ 0

CGT$ATGTGTCGTAGCTCGTNN 18

CGTAGCTCGTNNCGT$ATGTGT 6

CGTNNCGT$ATGTGTCGTAGCT 13

CTCGTNNCGT$ATGTGTCGTAG 11

GCTCGTNNCGT$ATGTGTCGTA 10

GT$ATGTGTCGTAGCTCGTNNC 19

GTAGCTCGTNNCGT$ATGTGTC 7

GTCGTAGCTCGTNNCGT$ATGT 4

GTGTCGTAGCTCGTNNCGT$AT 2

GTNNCGT$ATGTGTCGTAGCTC 14

NCGT$ATGTGTCGTAGCTCGTN 17

NNCGT$ATGTGTCGTAGCTCGT 16

T$ATGTGTCGTAGCTCGTNNCG 20

TAGCTCGTNNCGT$ATGTGTCG 8

TCGTAGCTCGTNNCGT$ATGTG 5

TCGTNNCGT$ATGTGTCGTAGC 12

TGTCGTAGCTCGTNNCGT$ATG 3

TGTGTCGTAGCTCGTNNCGT$A 1

TNNCGT$ATGTGTCGTAGCTCG 15

它的BWT及索引是

T 21

T 9

$ 0

N 18

T 6

T 13

G 11

A 10

C 19

C 7

T 4

T 2

C 14

N 17

T 16

G 20

G 8

G 5

C 12

G 3

A 1

G 15

然后得到它的tally文件如下

接下来用我们的perl程序在里面找字符串

第一次我测试 GTGTCG 这个字符串，程序可以很清楚的看到它的查找过程。

perl search_char.pl GTGTCG tm.tally

your last char is G

start is 7 ; and end is 13

now it is number 5 and the char is C

start is 3 ; and end is 6

now it is number 4 and the char is T

start is 17 ; and end is 19

now it is number 3 and the char is G

start is 10 ; and end is 11

now it is number 2 and the char is T

start is 19 ; and end is 20

now it is number 1 and the char is G

start is 11 ; and end is 12

It is just one perfect match !

The index is 2

第二次我测试一个多重匹配的字符串GT，在原字符串出现了五次的

perl search_char.pl GT tm.tally

your last char is T

start is 15 ; and end is 22

now it is number 1 and the char is G

start is 8 ; and end is 13

we find more than one perfect match!!!

8 13

One of the index is 11

One of the index is 10

One of the index is 19

One of the index is 7

One of the index is 4

One of the index is 2

One of the index is 14

惨了，这个是很严重的bug，不知道为什么，对于多个匹配总是会多出那么一点点的结果。

去转换矩阵里面查看，可知，前面两个结果11和10是错误的。

CTCGTNNCGT$ATGTGTCGTAG 11

GCTCGTNNCGT$ATGTGTCGTA 10

GT$ATGTGTCGTAGCTCGTNNC 19

GTAGCTCGTNNCGT$ATGTGTC 7

GTCGTAGCTCGTNNCGT$ATGT 4

GTGTCGTAGCTCGTNNCGT$AT 2

GTNNCGT$ATGTGTCGTAGCTC 14

最后我们测试未知字符串的查找。

perl search_char.pl ACATGTGT tm.tally

your last char is T

start is 15 ; and end is 22

now it is number 7 and the char is G

start is 8 ; and end is 13

now it is number 6 and the char is T

start is 19 ; and end is 21

now it is number 5 and the char is G

start is 11 ; and end is 12

now it is number 4 and the char is T

start is 20 ; and end is 21

now it is number 3 and the char is A

start is 2 ; and end is 3

now it is number 2 and the char is C

start is 3 ; and end is 3

we can just find the last 6 char ,and it is ATGTGT

原始字符串是ATGTGTCGTAGCTCGTNNCGT，所以查找的挺对的！！！

[perl]

$a=$ARGV[0];

$a=uc $a;

open FH,"<$ARGV[1]";

while(<FH>){

chomp;

@F=split;

$hash_count_atcg{$F[0]}++;

$hash{$.}=$_;

# the first line is $ and the last char and the last index !

}

$all_a=$hash_count_atcg{'A’};

$all_c=$hash_count_atcg{'C’};

$all_g=$hash_count_atcg{'G’};

$all_n=$hash_count_atcg{'N’};

$all_t=$hash_count_atcg{'T’};

#print "$all_a\t$all_c\t$all_g\t$all_t\n";

$len_a=length $a;

$end_a=$len_a-1;

#print "your query is $a\n";

#print "and the length of your query is $len_a \n";

$after=substr($a,$end_a,1);

#we fill search your query from the last char !

if ($after eq 'A’) {

$start=2;

$end=$all_a+1;

}

elsif ($after eq 'C’) {

$start=$all_a+1;

$end=$all_a+$all_c+1;

}

elsif ($after eq 'G’) {

$start=$all_a+$all_c+1;

$end=$all_a+$all_c+$all_g+1;

}

elsif ($after eq 'T’){

$start=$all_a+$all_c+$all_g+$all_n+1;

$end=$all_a+$all_c+$all_g+$all_t+$all_n+1;

}

else {die "error !!! we just need A T C G !!!\n"}

print "your last char is $after\n ";

print "start is $start ; and end is $end \n";

foreach (reverse (1..$end_a)){

$after=substr($a,$_,1);

$before=substr($a,$_-1,1);

($start,$end)=&find_level($after,$before,$start,$end);

print "now it is number $_ and the char is $before \n ";

print "start is $start ; and end is $end \n";

if ($_ > 1 && $start == $end) {

$find_char=substr($a,$_);

$find_len=length $find_char;

print "we can just find the last $find_len char ,and it is $find_char \n";

#return "miss";

last;

}

if ($_ == 1) {

if (($end-$start)==1) {

print "It is just one perfect match ! \n";

my @F_start=split/\s+/,$hash{$end};

print "The index is $F_start[1]\n";

#return $F_start[1];

last;

}

else {

print "we find more than one perfect match!!!\n";

print "$start\t$end\n";

foreach (($start-1)..$end) {

my @F_start=split/\s+/,$hash{$_};

print "One of the index is $F_start[1]\n";

}

#return "multiple";

last;

}

sub find_level{

my($after,$before,$start,$end)=@_;

my @F_start=split/\s+/,$hash{$start};

my @F_end=split/\s+/,$hash{$end};

if ($before eq 'A’) {

return ($F_start[2]+1,$F_end[2]+1);

}

elsif ($before eq 'C’) {

return ($all_a+$F_start[3]+1,$all_a+$F_end[3]+1);

}

elsif ($before eq 'G’) {

return ($all_a+$all_c+1+$F_start[4],$all_a+$all_c+1+$F_end[4]);

}

elsif ($before eq 'T’) {

return ($all_a+$all_c+$all_g+$all_n+1+$F_start[5],$all_a+$all_c+$all_g+1+$all_n+$F_end[5]);

}

else {die "error !!! we just need A T C G !!!\n"}

}

[/perl]

原始字符串是atgtgtcgtagctcgtnncgt

转藏分享

QQ空间 QQ好友新浪微博微信

献花（0） +1

来自：健明 > 《待分类》

举报/认领

0条评论

发表

请遵守用户评论公约

类似文章 更多

健明

关注对话

TA的最新馆藏

ncount_RNA 和nFeature_RNA辅助过滤
蛋白质组学数据分析一本通
人海绵体单细胞转录组图谱
如何整合10x和smart-seq2技术平台的单细胞转录组数据
类器官培养物真的只有恶性肿瘤细胞吗（去卷积推测细胞亚群比例）
非模式生物构建10x单细胞转录组CellRanger参考文件

喜欢该文的人也喜欢更多

热门阅读换一换