
Data Mining | Transcription Factors and Motifs (Basic Concepts)

 link171 2019-09-18

Note: this post covers many concepts, so pay particular attention to the passages highlighted in blue.

TF

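A transcription factor (TF) is a protein that regulates gene transcription by binding specifically to DNA sequences in regulatory regions. A single transcription factor can regulate many genes at once: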

In molecular biology, a transcription factor (TF) (or sequence-specific DNA-binding factor) is a protein that controls the rate of transcription of genetic information from DNA to messenger RNA, by binding to a specific DNA sequence.

TFs are key regulators of biological processes that function by binding to transcriptional regulatory regions (e.g., promoters, enhancers) to control the expression of their target genes.

The human genome is estimated to encode more than 2,000 TFs.

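A transcription factor binding site (TFBS) is the DNA sequence to which a transcription factor binds, typically 5–20 bp long. The binding sites of one transcription factor at different genes are conserved to some degree but are not identical: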

Transcription factor binding motifs (TFBMs) are genomic sequences that specifically bind to transcription factors. The consensus sequence of a TFBM is variable, and there are a number of possible bases at certain positions in the motif, whereas other positions have a fixed base.

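Transcription factor binding motif (TFBM): the terms "binding site" and "binding motif" are often used interchangeably. For the distinction between them, see the following paper:

The paper describes it as follows: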

A single TF can recognize dozens to hundreds of DNA binding site sequences over a range of binding affinities. Hence, the TF binding specificity (i.e., preferential binding of specific sequences) cannot be adequately represented using any one DNA sequence. Instead, TF binding specificities are often represented as binding site motifs, which summarize the collection of preferentially bound sequences. These motifs can be used to scan sequences of interest (e.g., genomic regions) to predict TF binding sites.

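In other words, a motif summarizes all the possible binding sites (TFBSs) of a TF and is used to describe the specificity of those binding sites.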

motif

motifs are a more practical representation of consensus elements in biological sequences, allowing for a more detailed description of the variability at each site.

Common types of motifs that are responsible for binding to DNA can be found in different transcription factors.

Each TF typically recognizes a collection of similar DNA sequences, which can be represented as binding site motifs using models such as position weight matrices (PWMs)

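A motif can be represented with a variety of methods and models. As an example, suppose the binding site sequences of some transcription factor are as follows:

The most basic representation is the consensus sequence: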

A collection of DNA binding sites, typically referred to as a DNA binding motif, can be represented by a consensus sequence.

Given a set of sequences, a consensus sequence (also called canonical sequence) is the sequence obtained by taking the most frequent residues of nucleic acids / amino acids at each position.

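That is, from a given set of sequences, take the most frequent base at each position and join them into a single sequence; in this example, AAGAAA.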

https://www./discussion/912b207972304bf3a337e5473eca32ac
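The consensus rule described above can be sketched in a few lines of Python. The original figure listing the six binding sites is not available, so `SITES` below is a hypothetical set chosen only to agree with the counts quoted in this post (for example, A at every first position, and A or T at the second); the function name is also illustrative.

```python
from collections import Counter

# Hypothetical binding sites: NOT the original data, only consistent with
# the per-position counts quoted in the text.
SITES = ["AAGAAA", "ATCATA", "AAGTAA", "AACAAT", "ATTAAA", "AAGAAT"]

def consensus(sites):
    """Return the most frequent base at each position of equal-length sites."""
    columns = zip(*sites)  # one tuple of bases per position
    return "".join(Counter(col).most_common(1)[0][0] for col in columns)

print(consensus(SITES))  # AAGAAA
```

Ties, if any, are resolved by whichever base appears first in the column, which is one more reason the consensus is a lossy summary.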

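This is simple, but it clearly comes at the cost of accuracy: one sequence is made to stand in for the whole set.

The consensus sequence alone does not tell you which other bases can occur at a position. You can use IUPAC codes to denote two or more possible bases; for example, the second position can be A or T, which IUPAC writes as W. Even so, this still cannot express how often each base occurs.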

http://www./sms2/iupac.html
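As a rough illustration of the IUPAC approach, the sketch below maps the set of bases seen at each position to its standard IUPAC nucleotide code. `SITES` is again a hypothetical stand-in for the missing figure, and `iupac_consensus` is an illustrative name.

```python
# Standard IUPAC nucleotide codes, keyed by the set of bases they denote.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}

# Hypothetical binding sites (the original figure is unavailable).
SITES = ["AAGAAA", "ATCATA", "AAGTAA", "AACAAT", "ATTAAA", "AAGAAT"]

def iupac_consensus(sites):
    """Encode every base observed at each position as one IUPAC letter."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*sites))

print(iupac_consensus(SITES))  # AWBWWW
```

Note that W at the second position records that both A and T occur, but not that A occurs twice as often as T.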

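A more precise model is therefore needed to represent a motif.

1. Position Frequency Matrix (PFM), also called Position Count Matrix (PCM): each entry is the number of times a given base occurs at a given position across all sequences:

The number of columns equals the sequence length, and each column sums to 6 (there are 6 sequences in total). For example, the first base of every sequence is A, so in the first column A is 6 and every other base is 0.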
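A minimal sketch of building such a count matrix, assuming the hypothetical `SITES` below (chosen to match the counts quoted in the text, since the original table is not available):

```python
BASES = "ACGT"

# Hypothetical binding sites (the original figure is unavailable).
SITES = ["AAGAAA", "ATCATA", "AAGTAA", "AACAAT", "ATTAAA", "AAGAAT"]

def pfm(sites):
    """Return {base: [count at position 1, count at position 2, ...]}."""
    return {b: [col.count(b) for col in zip(*sites)] for b in BASES}

counts = pfm(SITES)
for base in BASES:
    print(base, counts[base])

# Every column sums to the number of sequences.
width = len(SITES[0])
assert all(sum(counts[b][i] for b in BASES) == len(SITES) for i in range(width))
```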

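2. Position Probability Matrix (PPM): each entry is the relative frequency of a base at a position (base count / column total):

Each column sums to 1, and the positions are treated as independent of one another. From the per-position base probabilities you can compute the probability of a particular sequence; for example, AAGAAA has a probability of about 15% (= 1 × 0.67 × 0.5 × 0.83 × 0.83 × 0.67). If only a few starting sequences are available, the PPM will contain many zeros, which can be corrected by adding pseudocounts.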
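The sequence probability above is just a product of per-position entries. A small sketch, again using a hypothetical `SITES` consistent with the quoted numbers (requires Python 3.8+ for `math.prod`):

```python
import math

BASES = "ACGT"

# Hypothetical binding sites (the original figure is unavailable).
SITES = ["AAGAAA", "ATCATA", "AAGTAA", "AACAAT", "ATTAAA", "AAGAAT"]

def ppm(sites):
    """Return {base: [relative frequency at each position]}."""
    n = len(sites)
    return {b: [col.count(b) / n for col in zip(*sites)] for b in BASES}

def sequence_probability(matrix, seq):
    """Multiply the per-position probabilities, assuming independent positions."""
    return math.prod(matrix[base][i] for i, base in enumerate(seq))

probs = ppm(SITES)
print(round(sequence_probability(probs, "AAGAAA"), 3))  # 0.154
print(sequence_probability(probs, "AAAAAA"))            # 0.0, A never seen at position 3
```

The second call shows the zero-count problem: one unobserved base drives the whole product to zero.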

3. Position Weight Matrix (PWM), also called position-specific weight matrix (PSWM), position-specific scoring matrix (PSSM), or log-odds scoring matrix (LSM). A PWM is made up of scores:

Each column provides a score per nucleotide representing the relative preference for the given base at that position in the binding site.

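The most common way to compute the score is to normalize the observed base frequency by the background frequency (the frequency expected by chance) and take the logarithm: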
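The formula itself was shown as an image that is not reproduced here. Judging from the worked example for T at the second position, it is presumably the usual log-odds score; the symbols $p_{b,i}$ (PPM entry for base $b$ at position $i$) and $q_b$ (background probability of base $b$) are notation assumed for this sketch, not taken from the original.

```latex
\mathrm{Score}_{b,i} = \log_2 \frac{p_{b,i}}{q_b},
\qquad
p_{b,i} = \frac{\text{count of base } b \text{ at position } i}{\text{number of sequences}}
```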

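From this formula, the score is positive when a base is more likely than background at that position and negative when it is less likely. Assuming a background probability of 0.25 for every base, the PWM for this example is:

Taking the score of T at the second position as an example: Score = log2((2/6) / 0.25) ≈ 0.415

In the same way, you can compute the score of a particular sequence by adding up the scores at each position: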

In order to score a sequence, add up the score for the letters at the specific positions

For example, for the sequence AAGAAA:

Score = 2 + 1.415 + 1 + 1.737 + 1.737 + 1.415 = 9.304
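The PWM and the sequence score can be reproduced with a short sketch. It assumes a uniform background of 0.25 as stated above and uses a hypothetical `SITES` that matches the counts quoted in the text; zero counts are mapped to negative infinity rather than raising an error.

```python
import math

BASES = "ACGT"

# Hypothetical binding sites (the original figure is unavailable).
SITES = ["AAGAAA", "ATCATA", "AAGTAA", "AACAAT", "ATTAAA", "AAGAAT"]

def pwm(sites, background=0.25):
    """Return {base: [log2(p / background) per position]}, -inf where p == 0."""
    n = len(sites)
    matrix = {}
    for base in BASES:
        row = []
        for col in zip(*sites):
            p = col.count(base) / n
            row.append(math.log2(p / background) if p > 0 else float("-inf"))
        matrix[base] = row
    return matrix

def score(matrix, seq):
    """Sum the position-specific scores of a sequence."""
    return sum(matrix[base][i] for i, base in enumerate(seq))

weights = pwm(SITES)
print(round(weights["T"][1], 3))           # 0.415
print(round(score(weights, "AAGAAA"), 3))  # 9.304
print(score(weights, "AAAAAA"))            # -inf
```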

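As with the PPM, the matrix obviously contains many negative-infinity values (-Inf), so certain sequences (such as AAAAAA) also end up with a score of negative infinity. That rules those sequences out entirely and may throw away important information. A pseudocount correction can therefore be applied to ProbN in the same way:

As this shows, the matrix models above can easily be converted into one another.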
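The correction formula used in the original is not shown in this text. As one common possibility, the sketch below adds the same pseudocount to every base count before converting to probabilities, i.e. (count + pseudocount) / (N + 4 × pseudocount); this particular formula, the default value of 1, and the hypothetical `SITES` are all assumptions for illustration.

```python
import math

BASES = "ACGT"

# Hypothetical binding sites (the original figure is unavailable).
SITES = ["AAGAAA", "ATCATA", "AAGTAA", "AACAAT", "ATTAAA", "AAGAAT"]

def pwm_with_pseudocount(sites, pseudocount=1.0, background=0.25):
    """Log-odds PWM in which every base count is increased by `pseudocount`."""
    n = len(sites)
    total = n + len(BASES) * pseudocount  # column total after smoothing
    return {
        base: [
            math.log2((col.count(base) + pseudocount) / total / background)
            for col in zip(*sites)
        ]
        for base in BASES
    }

def score(matrix, seq):
    """Sum the position-specific scores of a sequence."""
    return sum(matrix[base][i] for i, base in enumerate(seq))

smoothed = pwm_with_pseudocount(SITES)
print(round(score(smoothed, "AAGAAA"), 3))
print(round(score(smoothed, "AAAAAA"), 3))  # finite, but lower than the consensus
```

With smoothing, unobserved bases receive a negative but finite score, so a sequence such as AAAAAA is penalized rather than excluded.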

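Gene regulation by TFs

Once a TF's motif has been determined and represented as a PWM, one usually also wants to identify the genes regulated by that TF. Putative target genes can be found by checking whether a gene's promoter region contains the motif bound by the TF: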

In addition to determine the sequence specificities of a TF and represent this specificities as a PWM, one usually wants to identify genes being regulated by this TF. Putative targets of a TF can be determined by finding genes whose promoter region contains the motif bound by that TF.
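Finding promoters that contain the motif amounts to sliding the PWM along the sequence and keeping windows whose score passes a cutoff. The sketch below is a simplified illustration: the binding sites, the promoter fragment, the pseudocount, and the threshold of 5.0 are all hypothetical, only the forward strand is scanned, and the sequence is assumed to contain only A, C, G, and T.

```python
import math

BASES = "ACGT"

# Hypothetical binding sites (the original figure is unavailable).
SITES = ["AAGAAA", "ATCATA", "AAGTAA", "AACAAT", "ATTAAA", "AAGAAT"]

def smoothed_pwm(sites, pseudocount=1.0, background=0.25):
    """Log-odds PWM with an add-one style pseudocount (an assumed correction)."""
    n = len(sites)
    total = n + len(BASES) * pseudocount
    return {
        b: [math.log2((col.count(b) + pseudocount) / total / background)
            for col in zip(*sites)]
        for b in BASES
    }

def scan(matrix, sequence, threshold):
    """Return (start, window, score) for forward-strand windows scoring >= threshold."""
    width = len(matrix["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        total = sum(matrix[base][i] for i, base in enumerate(window))
        if total >= threshold:
            hits.append((start, window, total))
    return hits

# Hypothetical promoter fragment and threshold, for illustration only.
PROMOTER = "GCGTAAGAAACGTTATCATAGG"
HITS = scan(smoothed_pwm(SITES), PROMOTER, threshold=5.0)
for start, window, total in HITS:
    print(start, window, round(total, 2))
```

A real analysis would normally also scan the reverse complement and choose the threshold from a score distribution rather than a fixed number.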

Schematic of the promoter region:

In genetics, a promoter is a region of DNA that initiates transcription of a particular gene. Promoters are located near the transcription start sites (TSS) of genes, on the same strand and upstream on the DNA.

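Positions in the promoter are given relative to the transcription start site (TSS). In practice, the promoter region is commonly defined as the 2 kb upstream of the TSS: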

As promoters are typically immediately adjacent to the gene in question, positions in the promoter are designated relative to the transcriptional start site, where transcription of DNA begins for a particular gene (i.e., positions upstream are negative numbers counting back from -1, for example -100 is a position 100 base pairs upstream).

Promoters can be about 100–1000 base pairs long.

https://en./wiki/Promoter_(genetics)
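To make the coordinate conventions concrete, here is a small sketch that derives a 2 kb upstream window and a TSS-relative position. It assumes 0-based genomic coordinates, half-open intervals, and the common convention that the TSS itself is +1 and the base just upstream is -1; the function names are illustrative.

```python
def promoter_interval(tss, strand, upstream=2000):
    """Return the 0-based half-open [start, end) interval upstream of the TSS."""
    if strand == "+":
        # Upstream means smaller coordinates; clip at the start of the chromosome.
        return max(0, tss - upstream), tss
    if strand == "-":
        # On the minus strand, upstream means larger coordinates.
        return tss + 1, tss + 1 + upstream
    raise ValueError("strand must be '+' or '-'")

def relative_position(pos, tss, strand):
    """Position relative to the TSS: TSS is +1, the base just upstream is -1."""
    offset = pos - tss if strand == "+" else tss - pos
    return offset + 1 if offset >= 0 else offset

print(promoter_interval(5000, "+"))        # (3000, 5000)
print(promoter_interval(5000, "-"))        # (5001, 7001)
print(relative_position(4900, 5000, "+"))  # -100
```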
