Whole-Genome Sequencing Guide (Part 2)

Biology_Medicine_Research 2018-12-11

I. Common NGS file formats

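FASTA: a raw sequence file. Each record is a header line beginning with ">" followed by the sequence, so a read typically occupies two lines. It carries no base quality information (common extensions are .fasta, .fa, or .fas).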
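As a rough illustration of the FASTA layout, the sketch below reads records into a dictionary. The function name `parse_fasta` and the sample reads are made up for this example; it also assumes that a sequence may wrap over several lines, which FASTA allows.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect sequences keyed by the first word after '>' in each header."""
    records: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = ""
        elif name is not None:
            # Sequence lines are concatenated until the next header.
            records[name] += line
    return records


# Hypothetical input: two reads, the first wrapped over two lines.
example_fasta = [">read1 sample", "ACGTAC", "GTTA", ">read2", "TTGCA"]
fasta_records = parse_fasta(example_fasta)
```

For real data, an established parser (for example from Biopython) is likely a better choice than hand-rolled code like this.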

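FASTQ: currently the most widely used raw sequence format. Each read takes four lines: line 1 is the identifier (starting with "@"), line 2 is the sequence (A, C, G, T, plus N for uncalled bases), line 3 is a "+" separator (optionally followed by the identifier again, otherwise carrying no information), and line 4 is the per-base quality string.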
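The four-line structure can be sketched as a small generator. This is a minimal example, not a production parser: `parse_fastq` is a hypothetical name, the read is invented, and the quality decoding assumes the Phred+33 encoding (older Illumina data may use Phred+64).

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (identifier, sequence, quality scores) for each four-line record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\n")
        if not header:
            continue  # skip blank lines between records
        seq = next(it, "").rstrip("\n")
        plus = next(it, "").rstrip("\n")
        qual = next(it, "").rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Assumption: Phred+33, so the score is the ASCII code minus 33.
        yield header[1:], seq, [ord(ch) - 33 for ch in qual]


# Hypothetical single read.
fastq_records = list(parse_fastq(["@read1\n", "ACGT\n", "+\n", "II5!\n"]))
```

Here `fastq_records[0]` holds the identifier, the bases, and one integer score per base.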

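SAM: a sequence file that carries alignment information (that is, it tells you where each read maps on the chromosomes, among other things). It is the result of aligning FASTQ reads to a reference with an aligner such as bowtie2 or bwa.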
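To make "alignment information" concrete, the following sketch splits one SAM alignment line into a few of its 11 mandatory tab-separated fields and decodes two FLAG bits. The function name `parse_sam_line` and the example line are assumptions for illustration; header lines (starting with "@") and optional tags are ignored.

```python
from typing import Dict


def parse_sam_line(line: str) -> Dict[str, object]:
    """Extract a few mandatory SAM fields from one alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    flag = int(fields[1])
    return {
        "qname": fields[0],               # read name
        "flag": flag,                     # bitwise flag
        "rname": fields[2],               # reference (chromosome) name
        "pos": int(fields[3]),            # 1-based leftmost mapping position
        "mapq": int(fields[4]),           # mapping quality
        "cigar": fields[5],               # alignment description
        "is_unmapped": bool(flag & 0x4),  # bit 0x4: read is unmapped
        "is_reverse": bool(flag & 0x10),  # bit 0x10: reverse strand
    }


# Hypothetical alignment: a 4 bp read on the reverse strand of chr1.
sam_record = parse_sam_line("read1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII")
```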

BAM: the binary version of the SAM format. samtools can convert between SAM and BAM files (for example with samtools view).

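BED: an annotation file describing genomic intervals. The first three columns (chromosome, start, end) are required; 6-column and 12-column layouts are the most common, but the format is fairly flexible.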
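A short sketch of reading one BED line follows. It relies on the BED convention that start is 0-based and end is exclusive, so the interval length is simply end minus start. The name `parse_bed_line` and the sample interval are hypothetical.

```python
from typing import Dict


def parse_bed_line(line: str) -> Dict[str, object]:
    """Parse the three required BED columns and keep the rest as-is."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("BED needs at least chrom, start and end")
    start, end = int(fields[1]), int(fields[2])
    return {
        "chrom": fields[0],
        "start": start,         # 0-based, inclusive
        "end": end,             # exclusive
        "length": end - start,  # half-open interval length
        "extra": fields[3:],    # optional columns (name, score, strand, ...)
    }


# Hypothetical 6-column record.
bed_record = parse_bed_line("chr1\t100\t200\tgeneA\t0\t+")
```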

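GTF: closely related to GFF, a format dedicated to annotating gene features. It has 9 columns, and the content of the last (attribute) column is free-form. GTF files are generally larger than BED files.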
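The free-form last column is usually a list of `key "value";` pairs. The sketch below, with the hypothetical name `parse_gtf_line` and an invented exon record, splits the 9 columns and turns the attribute column into a dictionary. It assumes every attribute value is double-quoted, which is not true of all files. Unlike BED, GTF coordinates are 1-based with an inclusive end.

```python
import re
from typing import Dict

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_gtf_line(line: str) -> Dict[str, object]:
    """Split a GTF line into its 9 columns and parse the attribute pairs."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF line has exactly 9 tab-separated columns")
    record: Dict[str, object] = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(fields[3])  # 1-based, inclusive
    record["end"] = int(fields[4])    # inclusive
    # Assumption: attributes look like: key "value"; key "value";
    record["attributes"] = dict(re.findall(r'(\w+) "([^"]*)"', fields[8]))
    return record


# Hypothetical exon record.
gtf_record = parse_gtf_line(
    'chr1\tsrc\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";'
)
```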

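VCF: Variant Call Format, a text format for storing sequence variant information. It can represent single-nucleotide variants, insertions/deletions, copy number variants, and structural variants.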
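As a simplified illustration of how these variant types appear in a VCF data line, the sketch below reads the fixed leading columns (CHROM, POS, ID, REF, ALT, ...) and labels each ALT allele by comparing its length with REF. The names `classify_allele` and `parse_vcf_line` and both example lines are hypothetical; symbolic alleles such as `<DEL>` are only flagged, not interpreted.

```python
from typing import Dict

BASES = set("ACGTN")


def classify_allele(ref: str, alt: str) -> str:
    """Label one ALT allele relative to REF (a deliberate simplification)."""
    if not alt or not set(alt.upper()) <= BASES:
        return "symbolic_or_other"  # e.g. <DEL>, '*', '.', breakends
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(alt) > len(ref):
        return "insertion"
    if len(alt) < len(ref):
        return "deletion"
    return "other"  # same length > 1, e.g. a multi-nucleotide variant


def parse_vcf_line(line: str) -> Dict[str, object]:
    """Parse the fixed columns of one VCF data line (not a '#' header line)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line has at least 8 fixed columns")
    ref, alts = fields[3], fields[4].split(",")
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),  # 1-based position
        "ref": ref,
        "alts": alts,
        "types": [classify_allele(ref, alt) for alt in alts],
    }


# Hypothetical records: one SNV and one single-base deletion.
vcf_snv = parse_vcf_line("chr1\t12345\trs1\tA\tG\t50\tPASS\tDP=20")
vcf_del = parse_vcf_line("chr1\t200\t.\tAT\tA\t.\tPASS\t.")
```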

II. Useful resources

1. Forums on genome sequencing

• http:/// (registration required for access)

• http://www./ (openly accessible)

2. Entry-point articles on genome sequencing, assembly, and downstream analysis

If a particular topic interests you, it is worth reading the recommended articles below closely.

Topic	Related articles
Library preparation and sequencing	1. Next-generation DNA sequencing methods 2. Next-generation sequencing platforms
Quality control and preprocessing	1. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data 2. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies 3. ConDeTri–a content dependent read trimmer for Illumina data
Genome assembly	1. Sequence assembly demystified. 2. Genome assembly reborn: recent computational challenges 3. Sense from sequence reads: methods for alignment and assembly
Genome assembly quality assessment	1. Assemblathon 1: a competitive assessment of de novo short read assembly methods 2. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. 3. Evaluation of next-generation sequencing software in mapping and assembly.
Genome annotation	A beginner’s guide to eukaryotic genome annotation
Genome alignment	1. Inference of human population history from individual whole-genome sequences. 2. How to map billions of short reads onto genomes. 3. Evaluation of next-generation sequencing software in mapping and assembly
Data processing	1. The Sequence Alignment/Map format and SAMtools. 2. BEDTools: a flexible suite of utilities for comparing genomic features.
Haplotype-based methods	1. Haplotype phasing: existing methods and new developments. 2. The importance of phase information for human genomics. 3. Inference of population structure using dense haplotype data.
Variant calling	1. SNP calling, genotype calling, and sample allele frequency estimation from New-Generation Sequencing data. 2. A framework for variation discovery and genotyping using next-generation DNA sequencing data 3. From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline
Population genomic summary statistics	1. SNP calling, genotype calling, and sample allele frequency estimation from New-Generation Sequencing data. 2. The variant call format and VCFtools