genbank 提取 cdna PRO 及基因组序列的脚本

xiaohua1314 2013-08-30

展开全文

从NCBI上下载病毒的基因组序列，genbank格式。通过perl脚本生成三个文件

1、蛋白序列

2、基因组序列

3、cDNA序列。

#!/usr/bin/perl -w

#usage : perl test-gb.pl xx.gb xx.pep xx.genome xx.cnda

my $input_file=$ARGV[0];

open INPUT,$input_file or die "we can not open this file";

my $flag=0;

my $seq;

my $genome_file=$ARGV[1];

open OUTPUT_GEN,">$genome_file" or die "!!!!!!!!!!!!!!!!!!";

my $seq_pro;

my $output_pro=$ARGV[2];

open OUTPUT_PRO,">$output_pro" or die "@@@@@@@@@@@@@";

my $genome_name;

my @product_name;

while (<INPUT>){

if(/ORGANISM\s+(.*)/){

$genome_name=$1;

}

###寻找基因组序列的其实位置###########

if(/ORIGIN/){

$_=<INPUT>;

$flag=1;

}

####寻找基因组序列的终止位置##########

if(/\/\//){

last;

}

########从标记开始进行序列截取##########

if($flag==1){

$_=~s/ |[\d+]//g;#one line one seq 60

chomp $_;

#print $_;

$_=~tr/a-zA-Z/A-Za-z/;

$seq.=$_;

}

#######寻找基因名字#################

if(/gene\s+.*\d/){

my $gene_name=<INPUT>;

$gene_name=~/locus_tag="(.*)"/;

print OUTPUT_PRO ">$1\t";

}

if(/\/product="(.*)"/){

push @product_name,$1;

print OUTPUT_PRO "$1\n";

}

##################寻找翻译后的蛋白质序列############

if(/trans/){

$_=~s/\/translation="//;

my $seq_pro=$_;

while(<INPUT>){

chomp $_;

$seq_pro.=$_;

if(/"/){

last;

}

$seq_pro=~s/\s+|"//g;

print OUTPUT_PRO $seq_pro."\n";

}

close INPUT;

###################################输出基因组序列######################2######################################

my $i=1;

print OUTPUT_GEN ">Genome\t$genome_name\n";

while (length($seq)>70*($i-1)){

my $seq_line=substr($seq,($i-1)*70,70);

$i++;

print OUTPUT_GEN $seq_line."\n";

}

###################################输出CDNA序列####################333################################

my $output_cdna=$ARGV[3];

open OUTPUT_CDNA ,">$output_cdna" or die "We can not open this file3";

open INPUT,$input_file or die "we can not open this file";

while (<INPUT>){

if(/gene/){

if(/(\d+)..(\d+)/){

#print $1."\t".$2."\n";

my $start=$1-1;

my $end=$2-1;

my $len=$end-$start+1;

my $gene_pos=$_;

if(/gene\s+.*\d/){

my $gene_locus_tag=<INPUT>;

$gene_locus_tag=~/locus_tag="(.*)"/;

print OUTPUT_CDNA ">$1\t";

my $pro_productname=pop @product_name;

print OUTPUT_CDNA $pro_productname."\n";

}

###########不同情况的处理截取不同的片段##################

if($gene_pos=~/com/){

my $seq_cdna=substr($seq,$start,$len);

$seq_cdna=~tr/ACGT/TGCA/;

$seq_cdna=reverse $seq_cdna;

#print $seq_cdna."\n";

my $i=1;

while (length($seq_cdna)>70*($i-1)){

my $seq_line=substr($seq_cdna,($i-1)*70,70);

$i++;

print OUTPUT_CDNA $seq_line."\n";

}

else {

my $seq_cdna=substr($seq,$start,$len);

my $i=1;

while (length($seq_cdna)>70*($i-1)){

my $seq_line=substr($seq_cdna,($i-1)*70,70);

$i++;

print OUTPUT_CDNA $seq_line."\n";

}

本站是提供个人知识管理的网络存储空间，所有内容均由用户发布，不代表本站观点。请注意甄别内容中的联系方式、诱导购买等信息，谨防诈骗。如发现有害或侵权内容，请点击一键举报。

转藏分享

QQ空间 QQ好友新浪微博微信

献花（0） +1

来自： xiaohua1314 > 《Perl》

举报/认领

0条评论

发表

请遵守用户评论公约

类似文章 更多

xiaohua1314

关注对话

TA的最新馆藏

[转] 最全柑橘常见病虫害防治年历（立马收藏）
[转] 都2021了仍然有人转录组走tophat加cufflinks流程
[转] lncRNA-seq数据分析之新lncRNA鉴定和注释视频课程众筹
[转] 中国荷斯坦奶牛新的lncRNA全基因组鉴定
[转] lncRNA数据分析传送门
[转] lncRNA实战项目-第四步-得到表达矩阵的流程

喜欢该文的人也喜欢更多

热门阅读换一换