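- FASTA: a raw sequence file in which two lines represent one read: a header line beginning with ">" and the sequence itself. It carries no sequencing quality information. Common extensions are .fas and .fa.
- FASTQ: currently the most widely used raw sequence format. Four lines represent one read: the first line is the identifier, the second is the sequence (A, G, C, T), the third is a "+" separator that carries no information, and the fourth is the quality information.
- SAM: a sequence file that carries alignment information, such as where each read maps on a chromosome. It is the result of aligning a FASTQ file with an aligner such as bowtie2 or bwa.
- BAM: the binary form of SAM. samtools converts between SAM and BAM files.
- BED: an annotation file describing genomic regions. The 6-column and 12-column layouts are the most common, but the format is fairly flexible.
- GTF (together with GFF and related formats): a file dedicated to annotating gene regions. It has 9 columns, and the last column is free-form. These files are generally larger than BED files.
- VCF: Variant Call Format, a text format for storing sequence variation. It can represent single-nucleotide variants, insertions/deletions, copy number variation, and structural variation.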
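The four-line FASTQ layout described above can be illustrated with a small reader. This is a minimal sketch, not a replacement for an established parser: the names `FastqRecord`, `read_fastq`, and `phred33_scores` are hypothetical, and the quality decoding assumes the Phred+33 encoding, which the text above does not specify.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # header text without the leading "@"
    sequence: str    # bases, e.g. A/C/G/T
    quality: str     # one quality character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four lines: identifier, sequence, '+', quality."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        if not header.strip():
            continue  # tolerate blank lines between records
        header = header.rstrip("\n")
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


def phred33_scores(quality: str) -> List[int]:
    """Decode quality characters, ASSUMING the Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality]


if __name__ == "__main__":
    # Two made-up reads, used only to exercise the reader.
    example = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n"
    for record in read_fastq(io.StringIO(example)):
        print(record.identifier, record.sequence, phred33_scores(record.quality))
```

For a real file, the same function can be given an open file handle, for example `open("reads.fastq")`, where the file name is a placeholder. Files that wrap a sequence across several lines would need a more tolerant reader.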
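1. Forums on genome sequencing

- http:/// (accessible only after registration)
- http://www./ (openly accessible)
- http://www./ (openly accessible)

2. Entry-point articles on genome sequencing, assembly, and downstream analysis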
If a particular topic interests you, the articles recommended below are worth reading closely.

| Topic | Related articles |
| --- | --- |
| Library preparation and sequencing | 1. Next-generation DNA sequencing methods 2. Next-generation sequencing platforms |
| Quality control and preprocessing | 1. NGS QC Toolkit: a toolkit for quality control of next generation sequencing data 2. Prevention, diagnosis and treatment of high-throughput sequencing data pathologies 3. ConDeTri - a content dependent read trimmer for Illumina data |
| Genome assembly | 1. Sequence assembly demystified 2. Genome assembly reborn: recent computational challenges 3. Sense from sequence reads: methods for alignment and assembly |
| Genome assembly quality assessment | 1. Assemblathon 1: a competitive assessment of de novo short read assembly methods 2. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species 3. Evaluation of next-generation sequencing software in mapping and assembly |
| Genome annotation | 1. A beginner's guide to eukaryotic genome annotation |
| Genome alignment | 1. Inference of human population history from individual whole-genome sequences 2. How to map billions of short reads onto genomes 3. Evaluation of next-generation sequencing software in mapping and assembly |
| Data processing | 1. The Sequence Alignment/Map format and SAMtools 2. BEDTools: a flexible suite of utilities for comparing genomic features |
| Haplotype-based methods | 1. Haplotype phasing: existing methods and new developments 2. The importance of phase information for human genomics 3. Inference of population structure using dense haplotype data |
| Variant calling | 1. SNP calling, genotype calling, and sample allele frequency estimation from New-Generation Sequencing data 2. A framework for variation discovery and genotyping using next-generation DNA sequencing data 3. From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline |
| Population genomic summary statistics | 1. SNP calling, genotype calling, and sample allele frequency estimation from New-Generation Sequencing data 2. The variant call format and VCFtools |
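3. Databases and analysis platforms

- Galaxy, an integrated bioinformatics analysis platform (http:///) (https:///)
- Web Apollo, a web-based genome annotation platform (http:///)
- Genomes OnLine Database, GOLD (http:///cgi-bin/GOLD/index.cgi)
- ENSEMBL database (http://www./index.html)
- UCSC Genome Browser (http://genomebrowser./)
- Plant genome size lookup: http://data./cvalues/
- Animal genome size lookup: http://www./
- Genome assembly information lookup: https://www.ncbi.nlm./assembly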