有一个需求,是把peaks注释前后的文件合并起来,不想写脚本,想就用shell命令来做到。 我可以很肯定,文件都是以tab符号分割的列。 sed '1d' 57030_sort_peaks.peakAnn.xls |cut -f 2-8,1 |sort -t $'\t' -k 8 |head -12
cut -f 4,7-9 57030_sort_peaks.narrowPeak.bed |sort -t $'\t' -k 1 |head -12 排序结果如下: sed '1d' 57030_sort_peaks.peakAnn.xls |sort -t$'\t' -k1 |cut -f 1-8 |head -12
peak1000 chr16 2513739 2513891 + 4.00941 NA promoter-TSS (NM_001198569)
peak1001 chr16 2777421 2777567 + 4.54262 NA promoter-TSS (NM_007108)
peak1002 chr16 2904949 2905145 + 4.40547 NA Intergenic
peak1003 chr16 3059613 3059759 + 4.06547 NA promoter-TSS (NR_123723)
peak1004 chr16 3282982 3283229 + 6.19764 NA promoter-TSS (NM_005741)
peak1005 chr16 3457701 3457880 + 4.26274 NA promoter-TSS (NM_001317097)
peak1006 chr16 4412132 4412322 + 4.11112 NA intron (NR_145128, intron 3 of 27)
peak1007 chr16 4425589 4425781 + 6.69260 NA promoter-TSS (NM_001135110)
peak1008 chr16 4538610 4538771 + 4.44393 NA promoter-TSS (NM_001199054)
peak1009 chr16 9091684 9092031 + 4.99539 NA 5' UTR (NM_014117, exon 1 of 4)
peak100 chr1 66930370 66930598 + 5.48491 NA intron (NM_001077704, intron 1 of 13)
peak1010 chr16 14928161 14928337 + 6.77548 NA promoter-TSS (NR_103772)
cut -f 4,7-9 57030_sort_peaks.narrowPeak.bed |sort -t$'\t' -k1 |head -12
peak1000 4.00941 10.78589 6.44162
peak1001 4.54262 12.95468 8.39192
peak1002 4.40547 9.16804 5.03032
peak1003 4.06547 8.97515 4.87603
peak1004 6.19764 16.08099 11.28833
peak1005 4.26274 9.14970 5.01487
peak100 5.48491 9.93370 5.68875
peak1006 4.11112 10.51290 6.19959
peak1007 6.69260 18.52489 13.59905
peak1008 4.44393 9.86713 5.64038
peak1009 4.99539 12.14135 7.65193
peak1010 6.77548 13.81271 9.17838 我发现那个 我还特意抓取第一列测试了一下,代码如下; sed '1d' 57030_sort_peaks.peakAnn.xls |cut -f 1| sort -k1 |head
cat 57030_sort_peaks.narrowPeak.bed |cut -f 4 | sort -k1 |head 下面是正常的排序。 sed '1d' 57030_sort_peaks.peakAnn.xls |cut -f 1| sort -k1 |head
peak1
peak10
peak100
peak1000
peak1001
peak1002
peak1003
peak1004
peak1005
peak1006
cat 57030_sort_peaks.narrowPeak.bed |cut -f 4 | sort -k1 |head
peak1
peak10
peak100
peak1000
peak1001
peak1002
peak1003
peak1004
peak1005
peak1006 查了一些资料,发现搞不明白为什么,就算了,反正我喜欢perl,就干脆用perl排序吧! sed '1d' 57030_sort_peaks.peakAnn.xls |perl -alne '{$h{$F[0]}=$_}END{print $h{$_} foreach sort keys %h}' |cut -f 1-8 |head -12
peak1 chr1 191368 191561 + 6.45284 NA Intergenic
peak10 chr1 2652248 2652508 + 4.24389 NA intron (NM_001242672, intron 4 of 6)
peak100 chr1 66930370 66930598 + 5.48491 NA intron (NM_001077704, intron 1 of 13)
peak1000 chr16 2513739 2513891 + 4.00941 NA promoter-TSS (NM_001198569)
peak1001 chr16 2777421 2777567 + 4.54262 NA promoter-TSS (NM_007108)
peak1002 chr16 2904949 2905145 + 4.40547 NA Intergenic
peak1003 chr16 3059613 3059759 + 4.06547 NA promoter-TSS (NR_123723)
peak1004 chr16 3282982 3283229 + 6.19764 NA promoter-TSS (NM_005741)
peak1005 chr16 3457701 3457880 + 4.26274 NA promoter-TSS (NM_001317097)
peak1006 chr16 4412132 4412322 + 4.11112 NA intron (NR_145128, intron 3 of 27)
peak1007 chr16 4425589 4425781 + 6.69260 NA promoter-TSS (NM_001135110)
peak1008 chr16 4538610 4538771 + 4.44393 NA promoter-TSS (NM_001199054)
cut -f 4,7-9 57030_sort_peaks.narrowPeak.bed |perl -alne '{$h{$F[0]}=$_}END{print $h{$_} foreach sort keys %h}' |head -12
peak1 6.45284 12.81044 8.25841
peak10 4.24389 20.70743 15.66803
peak100 5.48491 9.93370 5.68875
peak1000 4.00941 10.78589 6.44162
peak1001 4.54262 12.95468 8.39192
peak1002 4.40547 9.16804 5.03032
peak1003 4.06547 8.97515 4.87603
peak1004 6.19764 16.08099 11.28833
peak1005 4.26274 9.14970 5.01487
peak1006 4.11112 10.51290 6.19959
peak1007 6.69260 18.52489 13.59905
peak1008 4.44393 9.86713 5.64038 虽然代码有点长,但也比耽搁大半个小时要划算。 虽然我看了下面这个sort命令的解释,但是仍然没搞明白为什么~ sort命令的k选项大讨论 我还是觉得这些文件是有问题的,因为cut命令也没办法按照tab分隔符正常的取第几列。 |
|