Fresh Picks | Ten Noteworthy Bioinformatics Papers

Biology_Medicine_Research 2019-01-04

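People often ask: with so many preprints on bioRxiv, how can you tell which ones have already been published? The simplest way is to look just below the author list of each preprint. If the following line of red text appears there (see the figure below), the paper has been published.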

 

If it has not been published yet, the page shows this instead:

 

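Of course, a small number of papers do not carry the published notice on bioRxiv even after formal publication, possibly because the authors did not link the journal submission to the original bioRxiv preprint. So the safest approach is still to search online.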
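For readers who want to automate the check, here is a minimal Python sketch. It assumes that the public bioRxiv API exposes a `details` endpoint at the URL pattern shown, and that each returned record carries a `published` field holding either the journal DOI or the string "NA". Both points are assumptions to verify against the current API documentation, and the helper names are made up for this post. As noted above, a missing link does not prove that a paper is unpublished.

```python
import json
import urllib.request

# Assumed URL layout of the bioRxiv "details" endpoint; verify before relying on it.
API = "https://api.biorxiv.org/details/biorxiv/{doi}"


def published_doi(record):
    """Return the journal DOI stored in one API record, or None if no link exists.

    Assumes the record is a dict whose "published" field is either a DOI
    string or the placeholder "NA".
    """
    value = (record.get("published") or "").strip()
    return None if value in ("", "NA") else value


def fetch_record(preprint_doi):
    """Fetch the most recent version record for a preprint DOI (needs network)."""
    with urllib.request.urlopen(API.format(doi=preprint_doi), timeout=30) as resp:
        payload = json.load(resp)
    collection = payload.get("collection", [])
    return collection[-1] if collection else None


# Offline illustration with hand-written records (not real API output).
linked = {"doi": "10.1101/000000", "published": "10.1000/example"}
unlinked = {"doi": "10.1101/000001", "published": "NA"}
print(published_doi(linked), published_doi(unlinked))
```

With network access, `published_doi(fetch_record(doi))` would give the linked journal DOI for a real preprint DOI, or `None` when bioRxiv has no link on file.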

 

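Since launching the bioRxiv bioinformatics quick-read column last June, we have put out one issue a month and, over the past half year, brought you 68 selected preprints. To date, 12 of them have been published in peer-reviewed journals, most of them influential journals in their fields, including Cell, Nature Genetics, PNAS, eLife, Genome Biology, GigaScience, and Bioinformatics. The full list is given at the end of this post. Without further ado, here are ten noteworthy bioinformatics papers from bioRxiv for December.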

 

1. [Genomics] PacBio shows its power: a genome assembled from a single mosquito

A High-Quality De Novo Genome Assembly from a Single Mosquito using PacBio Sequencing (CC-BY-NC-ND 4.0)

A high-quality reference genome is a fundamental resource for functional genetics, comparative genomics, and population genomics, and is increasingly important for conservation biology. PacBio Single Molecule, Real-Time (SMRT) sequencing generates long reads with uniform coverage and high consensus accuracy, making it a powerful technology for de novo genome assembly. Improvements in throughput and concomitant reductions in cost have made PacBio an attractive core technology for many large genome initiatives, however, relatively high DNA input requirements (~5 μg for standard library protocol) have placed PacBio out of reach for many projects on small organisms that have lower DNA content, or on projects with limited input DNA for other reasons. Here we present a high-quality de novo genome assembly from a single Anopheles coluzzii mosquito. A modified SMRTbell library construction protocol without DNA shearing and size selection was used to generate a SMRTbell library from just 100 ng of starting genomic DNA. The sample was run on the Sequel System with chemistry 3.0 and software v6.0, generating, on average, 25 Gb of sequence per SMRT Cell with 20 hour movies, followed by diploid de novo genome assembly with FALCON-Unzip. The resulting curated assembly had high contiguity (contig N50 3.5 Mb) and completeness (more than 98% of conserved genes are present and full-length). In addition, this single-insect assembly now places 667 (>90%) of formerly unplaced genes into their appropriate chromosomal contexts in the AgamP4 PEST reference. We were also able to resolve maternal and paternal haplotypes for over 1/3 of the genome. By sequencing and assembling material from a single diploid individual, only two haplotypes are present, simplifying the assembly process compared to samples from multiple pooled individuals. The method presented here can be applied to samples with starting DNA amounts as low as 100 ng per 1 Gb genome size. 
This new low-input approach puts PacBio-based assemblies in reach for small highly heterozygous organisms that comprise much of the diversity of life.
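The abstract reports contiguity as a contig N50 of 3.5 Mb. For readers less familiar with the metric, the sketch below shows how N50 is usually computed: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the assembly size. The contig lengths are toy numbers, not data from the paper.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # reached half of the assembly
            return length


# Toy assembly with a total size of 300.
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_contigs))
```

For the toy list, the two longest contigs (80 + 70) already cover half of the total of 300, so the N50 is 70.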

 

2. [Genomics] Nanopore keeps pace: re-sequencing the Trypanosoma cruzi genome helps decipher the mechanisms of disease

Nanopore sequencing significantly improves genome assembly of the eukaryotic protozoan parasite Trypanosoma cruzi (CC-BY-NC 4.0)

Chagas disease was described by Carlos Chagas, who first identified the parasite Trypanosoma cruzi from a two-year-old girl called Berenice. Many T. cruzi sequencing projects based on short reads have demonstrated that genome assembly and downstream comparative analyses are extremely challenging in this species, given that half of its genome is composed of repetitive sequences. Here, we report de novo assemblies, annotation and comparative analyses of the Berenice strain using a combination of Illumina short reads and MinION long reads. Our work demonstrates that Nanopore sequencing improves T. cruzi assembly contiguity and increases the assembly size in ~16 Mb. Specifically, we found that assembly improvement also refines the completeness of coding regions for both single copy genes and repetitive transposable elements. Beyond its historical and epidemiological importance, Berenice constitutes a fundamental resource since it now represents the best-quality assembly available for TcII, a highly prevalent lineage causing human infections in South America. The availability of Berenice genome expands the known genetic diversity of T. cruzi and facilitates more comprehensive evolutionary inferences. Our work represents the first report of Nanopore technology used to resolve complex protozoan genomes, supporting its subsequent application for improving trypanosomatid and other highly repetitive genomes.

 

Figure 1A-C from the original paper

 

3. [Bioinformatics] New T-COFFEE version: root-to-leave regressive computation makes multiple sequence alignment faster and more accurate

Fast and accurate large multiple sequence alignments using root-to-leave regressive computation (CC-BY-NC-ND 4.0)

Inferences derived from large multiple alignments of biological sequences are critical to many areas of biology, including evolution, genomics, biochemistry, and structural biology. However, the complexity of the alignment problem imposes the use of approximate solutions. The most common is the progressive algorithm, which starts by aligning the most similar sequences, incorporating the remaining ones following the order imposed by a guide-tree. We developed and validated on protein sequences a regressive algorithm that works the other way around, aligning first the most dissimilar sequences. Our algorithm produces more accurate alignments than non-regressive methods, especially on datasets larger than 10,000 sequences. By design, it can run any existing alignment method in linear time thus allowing the scale-up required for extremely large genomic analyses.

 

Figure 1 from the original paper: Regressive algorithm overview

 

4. [Transcriptomics] Cai Jun of the Institute of Genomics releases scCapsNet, a new tool for single-cell sequencing analysis

scCapsNet: a deep learning classifier with the capability of interpretable feature extraction, applicable for single cell RNA data analysis (CC-BY-NC-ND 4.0)

Recent advances in single cell RNA sequencing (scRNA-seq) call more computational analysis methods. As the data for non-characterized cells accumulates quickly, supervised learning model is an ideal tool to classify the non-characterized cells based on the previously well characterized cells. However, deep learning model is an appropriate tool to deal with vast and complex data such as RNA-seq data, but lacks of interpretability. Here for the first time, we present scCapsNet, a deep learning model adapted from CapsNet. The scCapsNet model retains the capsule parts of CapsNet and replaces the part of convolutional neural networks with several parallel fully connected neural networks. We apply scCapsNet to scRNA-seq data of mouse retinal bipolar cells and human peripheral blood mononuclear cells (PBMC). The results show that scCapsNet performs well as a classifier. Meanwhile, the results also demonstrate that the parallel fully connected neural networks function like feature detectors as we supposed. The scCapsNet model provides precise contribution of each extracted feature to the cell type recognition. Furthermore, we mix the RNA expression of two cells with different cell types and then use the scCapsNet model trained with non-mixed data to predict the cell types in the mixed data. Our scCapsNet model could predict cell types in a cell mixture with high accuracy.

 

5. [Genomics] Washington University in St. Louis: structural variation analysis of nearly 20,000 human genomes

Mapping and characterization of structural variation in 17,795 deeply sequenced human genomes (CC-BY-NC-ND 4.0)

A key goal of whole genome sequencing (WGS) for human genetics studies is to interrogate all forms of variation, including single nucleotide variants (SNV), small insertion/deletion (indel) variants and structural variants (SV). However, tools and resources for the study of SV have lagged behind those for smaller variants. Here, we used a cloud-based pipeline to map and characterize SV in 17,795 deeply sequenced human genomes from common disease trait mapping studies. We publicly release site-frequency information to create the largest WGS-based SV resource to date. On average, individuals carry 2.9 rare SVs that alter coding regions, which affect the dosage or structure of 4.2 genes and account for 4.0-11.2% of rare high-impact coding alleles. Based on a computational model, we estimate that SVs account for 17.2% of rare alleles genome-wide whose predicted deleterious effects are equivalent to loss-of-function (LoF) coding alleles; ~90% of such SVs are non-coding deletions (mean 19.1 per genome). We report 158,991 ultra-rare SVs and show that ~2% of individuals carry ultra-rare megabase-scale SVs, nearly half of which are balanced and/or complex rearrangements. Finally, we exploit this resource to infer the dosage sensitivity of genes and non-coding elements, revealing strong trends related to regulatory element class, conservation and cell-type specificity. This work will help guide SV analysis and interpretation in the era of WGS.

 

6. [Bioinformatics] Dashing, a new tool for estimating genome similarity

Dashing: Fast and Accurate Genomic Distances with HyperLogLog (CC-BY 4.0)

Dashing is a fast and accurate software tool for estimating similarities of genomes or sequencing datasets. It uses the HyperLogLog sketch together with cardinality estimation methods that specialize in set unions and intersections. Dashing sketches genomes more rapidly than previous MinHash-based methods while providing greater accuracy across a wide range of input sizes and sketch sizes. It can sketch and calculate pairwise distances for over 87K genomes in under 6 minutes. Dashing is open source and available at https://github.com/dnbaker/dashing.
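To give a feel for the idea in the abstract, here is a small Python sketch of a textbook HyperLogLog: each set of k-mers is summarized in a fixed number of registers, the union of two sketches is the register-wise maximum, and the intersection size follows from inclusion-exclusion. This is not Dashing's code and it does not use Dashing's specialized estimators. The register count, the hash function, and the synthetic inputs are all choices made for this illustration.

```python
import hashlib
import math

P = 12       # number of index bits (illustrative choice)
M = 1 << P   # 4096 registers


def kmers(seq, k):
    """Return the set of all k-mers in a sequence string."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sketch(items):
    """Build HyperLogLog registers from an iterable of strings."""
    regs = [0] * M
    for item in items:
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - P)                      # top P bits pick the register
        rest = h & ((1 << (64 - P)) - 1)         # remaining 52 bits
        rank = (64 - P) - rest.bit_length() + 1  # leading zeros plus one
        if rank > regs[idx]:
            regs[idx] = rank
    return regs


def cardinality(regs):
    """Classic HyperLogLog estimate with the small-range correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    est = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if est <= 2.5 * M and zeros:
        est = M * math.log(M / zeros)  # linear counting for small sets
    return est


def union(a, b):
    """The sketch of a union is the register-wise maximum."""
    return [max(x, y) for x, y in zip(a, b)]


def jaccard(a, b):
    """Estimate the Jaccard index of two sketches by inclusion-exclusion."""
    u = cardinality(union(a, b))
    inter = cardinality(a) + cardinality(b) - u
    return max(0.0, inter / u)


# Synthetic "k-mer" sets: 20,000 items each, 10,000 shared (true Jaccard = 1/3).
set_a = {f"kmer{i}" for i in range(20000)}
set_b = {f"kmer{i}" for i in range(10000, 30000)}
print(round(jaccard(sketch(set_a), sketch(set_b)), 3))
```

In real use the inputs would be k-mer sets from genomes, for example `sketch(kmers(genome, 21))`. The estimate should land close to the true value of 1/3 for the synthetic sets, with an error that shrinks as the number of registers grows.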

 

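7. [Genomics] Qian Wenfeng's lab at the CAS Institute of Genetics and Developmental Biology reports synchronized DNA replication of genes encoding subunits of the same protein complex during tumorigenesis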

Selection for synchronized replication of genes encoding the same protein complex during tumorigenesis

DNA replication alters the dosage balance among genes; at the mid-S phase, early-replicating genes have doubled their copies while late-replicating genes have not. Dosage imbalance among proteins, especially within members of a protein complex, is toxic to cells. Here, we propose the synchronized replication hypothesis: genes sensitive to stoichiometric relationships will be replicated simultaneously to maintain stoichiometry. In support of this hypothesis, we observe that genes encoding the same protein complex have similar replication timing, but surprisingly, only in fast-proliferating cells such as embryonic stem cells and cancer cells. The synchronized replication observed in cancer cells, but not in slow-proliferating differentiated cells, is due to convergent evolution during tumorigenesis that restores synchronized replication timing within protein complexes. Collectively, our study reveals that the selection for dosage balance during S phase plays an important role in the optimization of the replication-timing program; that this selection is relaxed during differentiation as the cell cycle is elongated, and restored as the cell cycle shortens during tumorigenesis.

 

8. [Epigenomics] Improved version of MACS2 released, fixing multiple bugs

Improved peak-calling with MACS2 (CC-BY 4.0)

The computational analyses of genome-enrichment assays, such as ChIP-seq and ATAC-seq, are typically concluded with a peak-calling program that identifies genomic regions that are significantly enriched. The most popular peak-caller, MACS2, assumes that the input alignment files are for single-end sequence reads by default, yet those with paired-end Illumina sequence data frequently use this default setting. This leads to erroneous coverage values and suboptimal peak identification. However, using the correct paired-end mode can introduce another set of artifacts. After thoroughly reviewing the MACS2 source code, we have modified it to limit these and other problems. Our updated version is freely available (https://github.com/jsh58/MACS).

 

9. [Evolution] New measures as an alternative to the bootstrap in conventional phylogenetic tree analysis

New methods to calculate concordance factors for phylogenomic datasets (CC-BY 4.0)

We introduce and implement two measures for quantifying genealogical concordance in phylogenomic datasets: the gene concordance factor (gCF) and the site concordance factor (sCF). For every branch of a reference tree, gCF is defined as the percentage of decisive gene trees containing that branch. This measure is already in wide usage, but here we introduce a package that calculates it while accounting for variable taxon coverage among gene trees. sCF is a new measure defined as the percentage of decisive sites supporting a branch in the reference tree. gCF and sCF complement classical measures of branch support in phylogenetics by providing a full description of underlying disagreement among loci and sites. Availability and Implementation: An easy to use implementation and tutorial is freely available in the IQ-TREE software (http://www.). Supplementary information: Data are available at https:///10.5281/zenodo.1949290
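The gCF definition in the abstract can be turned into a few lines of Python. The sketch below is not IQ-TREE's implementation. It represents every branch as a split of the taxa into two sides, and it uses a simplified decisiveness rule that is an assumption of this post: a gene tree counts as decisive for a branch only if it contains at least two taxa from each side. The function names and the toy trees are made up for illustration.

```python
def split(side_a, side_b):
    """Represent a branch as an unordered pair of taxon sets."""
    return frozenset({frozenset(side_a), frozenset(side_b)})


def gcf(ref_a, ref_b, gene_trees):
    """Percentage of decisive gene trees that contain the reference branch.

    ref_a, ref_b: the taxa on each side of the branch in the reference tree.
    gene_trees:   list of (taxa, splits) pairs, where splits is a set of
                  split(...) objects describing that gene tree's branches.
    """
    decisive = concordant = 0
    for taxa, splits in gene_trees:
        # Restrict the reference branch to the taxa present in this gene tree.
        a = set(ref_a) & set(taxa)
        b = set(ref_b) & set(taxa)
        if len(a) < 2 or len(b) < 2:
            continue  # simplified decisiveness rule (assumption of this sketch)
        decisive += 1
        if split(a, b) in splits:
            concordant += 1
    return 100.0 * concordant / decisive if decisive else float("nan")


# Reference branch AB|CDE and four toy gene trees.
toy_trees = [
    ("ABCDE", {split("AB", "CDE"), split("ABC", "DE")}),  # concordant
    ("ABCDE", {split("AC", "BDE"), split("ABC", "DE")}),  # discordant
    ("ABCD", {split("AB", "CD")}),                        # concordant, E missing
    ("ACDE", {split("AC", "DE")}),                        # not decisive, B missing
]
print(round(gcf("AB", "CDE", toy_trees), 1))
```

In the toy data, three gene trees are decisive and two of them contain the branch, so the gCF is about 66.7. The fourth tree is skipped rather than counted against the branch, which is the point of accounting for variable taxon coverage.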

 

10. [Genomics] The impact of reference genome bias on human population genetics studies

The presence and impact of reference bias on population genomic studies of prehistoric human populations (CC-BY 4.0)

High quality reference genomes are an important resource in genomic research projects. A consequence is that DNA fragments carrying the reference allele will be more likely to map successfully, or receive higher quality scores. This reference bias can have effects on downstream population genomic analysis when heterozygous sites are falsely considered homozygous for the reference allele. In palaeogenomic studies of human populations, mapping against the human reference genome is used to identify endogenous human sequences. Ancient DNA studies usually operate with low sequencing coverages and fragmentation of DNA molecules causes a large proportion of the sequenced fragments to be shorter than 50bp -- reducing the amount of accepted mismatches, and increasing the probability of multiple matching sites in the genome. These ancient DNA specific properties are potentially exacerbating the impact of reference bias on downstream analyses, especially since most studies of ancient human populations use pseudo-haploid data, i.e. they randomly sample only one sequencing read per site. We show that reference bias is pervasive in published ancient DNA sequence data of prehistoric humans with some differences between individual genomic regions. We illustrate that the strength of reference bias is negatively correlated with fragment length. Reference bias can cause differences in the results of downstream analyses such as population affinities, heterozygosity estimates and estimates of archaic ancestry. These spurious results highlight how important it is to be aware of these technical artifacts and that we need strategies to mitigate the effect. Therefore, we suggest some post-mapping filtering strategies to resolve reference bias which help to reduce its impact substantially.
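A back-of-the-envelope model makes the effect on pseudo-haploid data concrete. Suppose that, at a heterozygous site, reads carrying the reference allele map with probability `map_ref` and reads carrying the alternative allele with probability `map_alt`. Both rates are hypothetical parameters introduced here, not values from the paper. If one mapped read is then sampled at random, the chance that it carries the reference allele is `map_ref / (map_ref + map_alt)`. The Python sketch below computes this and checks it with a small simulation.

```python
import random


def ref_call_probability(map_ref, map_alt):
    """Chance that one randomly sampled mapped read at a heterozygous
    site carries the reference allele, given per-allele mapping rates."""
    if min(map_ref, map_alt) < 0 or max(map_ref, map_alt) > 1:
        raise ValueError("mapping rates must lie in [0, 1]")
    if map_ref + map_alt == 0:
        raise ValueError("at least one allele must be mappable")
    return map_ref / (map_ref + map_alt)


def simulate(map_ref, map_alt, sites=20000, seed=1):
    """Simulate pseudo-haploid calls at heterozygous sites.

    For each site, reads are drawn with a 50/50 allele ratio until one
    maps; that read provides the call. Returns the fraction of sites
    called as the reference allele.
    """
    rng = random.Random(seed)
    ref_calls = 0
    for _ in range(sites):
        while True:
            is_ref = rng.random() < 0.5
            if rng.random() < (map_ref if is_ref else map_alt):
                break  # this read mapped and is the sampled read
        ref_calls += is_ref
    return ref_calls / sites


# Hypothetical mapping rates: alternative-allele reads map slightly less often.
print(round(ref_call_probability(0.95, 0.85), 3), round(simulate(0.95, 0.85), 3))
```

With equal mapping rates the expected reference fraction is exactly 0.5. Any gap between the two rates pushes it above 0.5, which is how heterozygous sites end up over-represented as reference calls.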

 

Appendix: bioRxiv papers featured in 2018 that have since been published


May: Black bear genome

Srivastava A, Kumar Sarsani V, Fiddes I, et al. Genome assembly and gene expression in the American black bear provides new insights into the renal response to hibernation. DNA Res. 2018

May: Online version of phastCons

Ramani R, Krumholz K, Huang YF, Siepel A. PhastWeb: a web interface for evolutionary conservation scoring of multiple sequence alignments using phastCons and phyloP. Bioinformatics. 2018

May: Origin of insect odorant receptor genes

Brand P, Robertson HM, Lin W, et al. The origin of the odorant receptor gene family in insects. Elife. 2018

June: Convergent evolution in rainforest hunter-gatherers

Bergey CM, Lopez M, Harrison GF, et al. Polygenic adaptation and convergent evolution on growth and cardiac genetic pathways in African and Asian rainforest hunter-gatherers. Proc Natl Acad Sci U S A. 2018;115(48):E11256-E11263

June: Re-annotation of eight Drosophila genomes

Yang H, Jaime M, Polihronakis M, et al. Re-annotation of eight Drosophila genomes. Life Sci Alliance. 2018;1(6):e201800156

June: Whole-genome sequencing reveals mechanisms of cold adaptation in Drosophila montana

Parker DJ, Wiberg RAW, Trivedi U, et al. Inter and Intraspecific Genomic Divergence in Drosophila montana Shows Evidence for Cold Adaptation. Genome Biol Evol. 2018;10(8):2086-2101

July: Potential effects of RNAlater sample storage on RNAseq

Passow CN, Kono TJY, Stahl BA, Jaggard JB, Keene AC, McGaugh SE. Nonrandom RNAseq gene expression associated with RNAlater and flash freezing storage methods. Mol Ecol Resour. 2018

July: Australian and Chinese researchers jointly resolve key sequences in the wheat genome

Keeble-Gagnère G, Rigault P, Tibbits J, et al. Optical and physical mapping with local finishing enables megabase-scale resolution of agronomically important regions in the wheat genome. Genome Biol. 2018;19(1):112

July: GWAS analysis of osteoporosis

Morris JA, Kemp JP, Youlten SE, et al. An atlas of genetic influences on osteoporosis in humans and mice. Nat Genet. 2018

August: Which metagenome binning software performs best?

Meyer F, Hofmann P, Belmann P, et al. AMBER: Assessment of Metagenome BinnERs. Gigascience. 2018;7(6)

September: Re-assembly and annotation of transcriptomes from 678 microbial eukaryotes

Johnson LK, Alexander H, Brown CT. Re-assembly, quality evaluation, and annotation of 678 microbial eukaryotic reference transcriptomes. Gigascience. 2018

September: Perturb-ATAC: CRISPR plus ATAC-seq in single-cell sequencing

Rubin AJ, Parker KR, Satpathy AT, et al. Coupled Single-Cell CRISPR Screening and Epigenomic Profiling Reveals Causal Gene Regulatory Networks. Cell. 2018


In 2019, meet a better you.



 

