A Look at Transcript-Level Evidence for Chinese Spring Genes

洋溢九洲 2021-01-04

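Going from an assembled genome sequence to a gene annotation is simple in one sense and hard in another. The hard part is reaching better than 95% accuracy at the transcript level, which remains fairly difficult. We have covered some aspects of gene annotation in earlier posts.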

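Gene annotation generally means using bioinformatic methods to determine the positions, structures, and functions of genes in an assembled genome. It usually comprises ab initio annotation, homology-based annotation, and annotation based on transcriptome and proteome data. Annotation based on transcriptome and proteome data is currently the most accurate approach, but because it is impossible to obtain transcriptomes for every tissue and time point, the results of homology-based and ab initio annotation are needed as a supplement. Gene annotation underpins molecular biology research: if the annotation is incorrect or incomplete, downstream studies built on it are affected as well.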

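Many algorithms and software tools have been developed for gene annotation. Ab initio tools include SNAP (Korf 2004), TwinScan (Korf et al. 2001), FGENESH (Salamov and Solovyev 2000), Augustus (Stanke et al. 2006), Genscan (Burge and Karlin 1997), and GAZE (Howe et al. 2002). These tools usually need a set of known genes as a training set and then predict genes with the trained model. In theory, given a training set with enough genes, this approach can predict every gene locus, but it cannot accurately delimit the exon-intron structure of those genes.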

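Homology-based annotation maps transcript or protein sequences from closely related species onto the genome to be annotated. Commonly used tools include BLAST, BLAT (Kent 2002), Splign (Kapustin et al. 2008), Spidey (Wheelan et al. 2001), sim4 (Florea et al. 1998), Exonerate (Slater and Birney 2005), gmap (Wu and Watanabe 2005), Magic-BLAST (Boratyn et al. 2019), and minimap2 (Li 2018). Of these, gmap, Magic-BLAST, and minimap2 are newer-generation aligners that can quickly align large numbers of transcript sequences to a genome. Homology-based annotation helps discover gene loci, but because genomes differ between species, gene structure and whether a gene is expressed at all still need support from transcript-level evidence in the species itself.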
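As a small illustration of how a spliced alignment carries exon-intron structure, the sketch below converts one alignment record into exon blocks. It assumes the aligner writes SAM output, where the CIGAR operation `N` marks a skipped reference region (an intron). The function names `exons_from_cigar` and `introns_from_exons` are hypothetical helpers for this post and do not belong to any of the tools listed above.

```python
import re

# One CIGAR element: a length followed by a single operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def exons_from_cigar(start, cigar):
    """Return exon blocks as 1-based inclusive (start, end) reference intervals.

    `start` is the 1-based leftmost reference position (the SAM POS field).
    """
    exons = []
    block_start = start
    pos = start  # next reference position to be consumed
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "MD=X":
            # These operations consume reference bases inside an exon.
            pos += length
        elif op == "N":
            # An intron: close the current exon block and jump over the gap.
            if pos > block_start:
                exons.append((block_start, pos - 1))
            pos += length
            block_start = pos
        # I, S, H and P do not consume reference bases, so nothing to do.
    if pos > block_start:
        exons.append((block_start, pos - 1))
    return exons


def introns_from_exons(exons):
    """Return the introns lying between consecutive exon blocks."""
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    ]
```

Small indels inside an exon (`I` and `D`) are kept within the same block here, which is a design choice: only `N` is treated as a splice junction.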

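Transcriptome-based annotation aligns transcript sequences from different sources to the genome and then annotates genes according to where the transcripts map. The aligners commonly used are the same as those listed above for homologous transcripts. Transcript sequences typically come from ESTs, full-length cDNAs, transcripts obtained by second-generation sequencing, and transcripts obtained by third-generation sequencing. Because second-generation sequencing works on fragmented molecules, assembling the reads into full-length transcripts produces some false-positive transcripts. Transcripts from third-generation sequencing avoid this problem, but because of their high error rate and relatively high cost, they have been used in only a few studies. In addition, to obtain gene orientation and more precise transcription start and end sites, data generated on second-generation platforms, such as strand-specific RNA-seq, Cap Analysis Gene Expression (CAGE-seq), and PolyA-seq, have also been incorporated into genome annotation pipelines (Wang et al. 2019). Compared with transcriptomics, high-throughput proteomics has not yet seen a decisive breakthrough. Ribosome profiling (Ribo-seq) can substitute for high-throughput proteomics to some extent. It captures mRNA fragments that are in the process of being translated, but I have not yet seen any report of such data being used in an annotation pipeline. Normal transcription of a genome can also produce transcriptional noise that does not correspond to real genes, so annotation should take expression level into account and exclude likely noise.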
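A minimal sketch of the expression filter mentioned above, assuming expression has already been quantified as TPM per transcript per sample. The function name `expressed_transcripts` and both thresholds (`min_tpm`, `min_samples`) are illustrative assumptions, not values taken from any published pipeline. Suitable cutoffs depend on sequencing depth and the number of samples.

```python
def expressed_transcripts(tpm_by_transcript, min_tpm=1.0, min_samples=2):
    """Keep transcripts that reach `min_tpm` in at least `min_samples` samples.

    `tpm_by_transcript` maps a transcript ID to a list of TPM values,
    one value per RNA-seq sample. Both thresholds are hypothetical defaults.
    """
    kept = []
    for transcript_id, tpms in tpm_by_transcript.items():
        # Count the samples in which this transcript is clearly expressed.
        n_expressed = sum(1 for value in tpms if value >= min_tpm)
        if n_expressed >= min_samples:
            kept.append(transcript_id)
    return sorted(kept)


# Toy data: tx2 passes the TPM cutoff in one sample only, tx3 in none.
toy = {
    "tx1": [5.2, 3.1, 0.0],
    "tx2": [0.2, 0.0, 1.5],
    "tx3": [0.0, 0.0, 0.0],
}
print(expressed_transcripts(toy))
```

Requiring support in more than one sample is one simple way to avoid keeping a transcript that appears only once at low level. It will also discard genuinely tissue-specific genes if the relevant tissue was sampled only once, so the filter is best treated as a flag for manual review rather than a hard rule.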

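To improve the accuracy and completeness of gene annotation, the three approaches above can be combined. Some software integrates all three into a single pipeline, for example MAKER (Cantarel et al. 2008), MAKER-P (Campbell et al. 2014), PASA (Haas et al. 2003), and Funannotate[1]. Some comprehensive biological database sites have also developed annotation pipelines of their own, such as the Gramene pipeline (Liang et al. 2009), the Ensembl gene annotation system (Aken et al. 2016), the NCBI Eukaryotic Genome Annotation Pipeline[2], and PGSB[3]. As transcripts obtained by third-generation sequencing become more abundant, annotation software built on third-generation transcriptome data has also appeared, such as LoReAn (Cook et al. 2019) and mikado (Venturini et al. 2018). Moreover, as sequencing costs fall and genome assembly methods improve, assembling a new genome de novo has become easier. For species that already have a gene annotation, the existing annotation can be transferred to the new genome, and several bioinformatics tools now make this straightforward (Konig et al. 2016; Song et al. 2019). In short, software that integrates different annotation methods greatly simplifies the annotation process, and manual curation can then be applied on top of it to correct the gene models that are still wrong. Apollo, the tool developed by Dunn et al. (2019), makes such manual curation much more convenient.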

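The same annotation methods apply to wheat genomes. The annotations of Triticum urartu, Aegilops tauschii, wild emmer wheat, cultivated emmer wheat, and Chinese Spring all used the three approaches and related software described above. The International Wheat Genome Sequencing Consortium (IWGSC) used several methods when annotating the Chinese Spring genome: besides the PGSB and PASA pipelines, it also used TriAnnot, a pipeline developed specifically for annotating the wheat genome (Leroy et al. 2012). TriAnnot covers transposon annotation, gene annotation, and subsequent functional annotation of genes. Even so, practical use has shown that the current Chinese Spring gene annotation still contains errors. For example, the wheat male sterility gene Ms2 is absent from the current annotation release (Ni et al. 2017).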

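The sections above focused on methods and tools. In practice, I have also worked on improving the gene annotation of Chinese Spring and of barley myself, and some of the results are already available on the wheat multi-omics website. The improvements rely mainly on transcript-level evidence, such as wheat EST sequences, RNA-seq data, and PacBio data. The RNA-seq data alone amount to more than 2,000 samples.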

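After trying all sorts of tools and methods, however, I found that reaching high accuracy at the transcript level requires a large volume of deeply sequenced full-length transcripts (PacBio data). Relying on second-generation RNA-seq data alone is not realistic. Manual curation with Apollo is also needed before the accuracy becomes reasonably high. I once set up Apollo for barley and made a rough time estimate: working on it full time and going through the genes one by one would take about half a year. More to the point, the work becomes extremely tedious after a while. At one stage I put my spare time into it and checked roughly 110 Mb over a few days, but I did not keep it up. We often talk about time management, but I think that framing is wrong. What really needs managing is a person's energy and attention. Often the time is there but the energy is not, especially for those of us who do not exercise regularly.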
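One way to put a number on transcript-level support is to ask what fraction of multi-exon gene models have an intron chain that is reproduced exactly by at least one full-length transcript. This definition is an assumption made for illustration, not the accuracy measure used in this post, and the function names `intron_chain` and `intron_chain_support` are hypothetical. Coordinates are 1-based inclusive exon intervals.

```python
def intron_chain(exons):
    """Return the ordered tuple of introns between consecutive exons."""
    exons = sorted(exons)
    return tuple(
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(exons, exons[1:])
    )


def intron_chain_support(models, full_length_reads):
    """Compare annotated models with full-length transcript alignments.

    `models` maps a transcript ID to (chrom, strand, exons).
    `full_length_reads` is a list of (chrom, strand, exons).
    Returns (supported_ids, fraction) over multi-exon models only,
    because single-exon models have no intron chain to compare.
    """
    evidence = {
        (chrom, strand, intron_chain(exons))
        for chrom, strand, exons in full_length_reads
    }
    multi_exon = {
        tid: (chrom, strand, intron_chain(exons))
        for tid, (chrom, strand, exons) in models.items()
        if len(exons) > 1
    }
    supported = sorted(tid for tid, key in multi_exon.items() if key in evidence)
    fraction = len(supported) / len(multi_exon) if multi_exon else 0.0
    return supported, fraction
```

Matching on the intron chain alone tolerates differences in transcript start and end positions, which are usually less reliable than splice junctions. Models without exact support are candidates for manual inspection, not necessarily errors, since the supporting tissue or condition may simply be missing from the data.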

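Last year I heard that the IWGSC was working on version 2.0 of the annotation, but it has not been released so far. I would rather they do it thoroughly, even if that means it comes out later. Whether or not it has been released, when you are interested in the genes in a particular interval, it is worth consulting this transcript-level evidence.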

Aken BL, Ayling S, Barrell D, Clarke L, Curwen V, Fairley S, Fernandez Banet J, Billis K, Garcia Giron C, Hourlier T, Howe K, Kahari A, Kokocinski F, Martin FJ, Murphy DN, Nag R, Ruffier M, Schuster M, Tang YA, Vogel JH, White S, Zadissa A, Flicek P, Searle SM (2016) The Ensembl gene annotation system. Database (Oxford) 2016

Boratyn GM, Thierry-Mieg J, Thierry-Mieg D, Busby B, Madden TL (2019) Magic-BLAST, an accurate RNA-seq aligner for long and short reads. BMC Bioinformatics 20:405

Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78-94

Campbell MS, Law M, Holt C, Stein JC, Moghe GD, Hufnagel DE, Lei J, Achawanantakun R, Jiao D, Lawrence CJ, Ware D, Shiu SH, Childs KL, Sun Y, Jiang N, Yandell M (2014) MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations. Plant Physiol 164:513-524

Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Sanchez Alvarado A, Yandell M (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18:188-196

Cook DE, Valle-Inclan JE, Pajoro A, Rovenich H, Thomma B, Faino L (2019) Long-Read Annotation: automated eukaryotic genome annotation based on long-read cDNA sequencing. Plant Physiol 179:38-54

Dunn NA, Unni DR, Diesh C, Munoz-Torres M, Harris NL, Yao E, Rasche H, Holmes IH, Elsik CG, Lewis SE (2019) Apollo: Democratizing genome annotation. PLoS Comput Biol 15:e1006790

Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 8:967-974

Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK, Jr., Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, Salzberg SL, White O (2003) Improving the *Arabidopsis* genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31:5654-5666

Howe KL, Chothia T, Durbin R (2002) GAZE: a generic framework for the integration of gene-prediction data by dynamic programming. Genome Res 12:1418-1427

Kapustin Y, Souvorov A, Tatusova T, Lipman D (2008) Splign: algorithms for computing spliced alignments with identification of paralogs. Biol Direct 3:20

Kent WJ (2002) BLAT--the BLAST-like alignment tool. Genome Res 12:656-664

Konig S, Romoth LW, Gerischer L, Stanke M (2016) Simultaneous gene finding in multiple genomes. Bioinformatics 32:3388-3395

Korf I (2004) Gene finding in novel genomes. BMC Bioinformatics 5:59

Korf I, Flicek P, Duan D, Brent MR (2001) Integrating genomic homology into gene structure prediction. Bioinformatics 17 Suppl 1:S140-148

Leroy P, Guilhot N, Sakai H, Bernard A, Choulet F, Theil S, Reboux S, Amano N, Flutre T, Pelegrin C, Ohyanagi H, Seidel M, Giacomoni F, Reichstadt M, Alaux M, Gicquello E, Legeai F, Cerutti L, Numa H, Tanaka T, Mayer K, Itoh T, Quesneville H, Feuillet C (2012) TriAnnot: a versatile and high performance pipeline for the automated annotation of plant genomes. Front Plant Sci 3:5

Li H (2018) Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34:3094-3100

Liang C, Mao L, Ware D, Stein L (2009) Evidence-based gene predictions in plant genomes. Genome Res 19:1912-1923

Ni F, Qi J, Hao Q, Lyu B, Luo MC, Wang Y, Chen F, Wang S, Zhang C, Epstein L, Zhao X, Wang H, Zhang X, Chen C, Sun L, Fu D (2017) Wheat *Ms2* encodes for an orphan protein that confers male sterility in grass species. Nat Commun 8:15121

Salamov AA, Solovyev VV (2000) Ab initio gene finding in *Drosophila* genomic DNA. Genome Res 10:516-522

Slater GS, Birney E (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6:31

Song B, Sang Q, Wang H, Pei H, Wang F, Gan X (2019) A weighted sequence alignment strategy for gene structure annotation lift over from reference genome to a newly sequenced individual. bioRxiv

Stanke M, Schoffmann O, Morgenstern B, Waack S (2006) Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7:62

Venturini L, Caim S, Kaithakottil GG, Mapleson DL, Swarbreck D (2018) Leveraging multiple transcriptome assembly methods for improved gene structure annotation. Gigascience 7

Wang K, Wang D, Zheng X, Qin A, Zhou J, Guo B, Chen Y, Wen X, Ye W, Zhou Y, Zhu Y (2019) Multi-strategic RNA-seq analysis reveals a high-resolution transcriptional landscape in cotton. Nat Commun 10:4714

Wheelan SJ, Church DM, Ostell JM (2001) Spidey: a tool for mRNA-to-genomic alignments. Genome Res 11:1952-1957

Wu TD, Watanabe CK (2005) GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21:1859-1875

------

[1] https://funannotate.

[2] https://www.ncbi.nlm./books/NBK169439

[3] http://pgsb./plant