
De Bruijn and OLC

Hi there

I think there are various issues here:

1. First, something that is not as pedantic as it first sounds. De Bruijn and overlap graphs are not algorithms; they are data structures. To give a broad analogy, they are different ways of filing and summarising your data, but say little about what you do with the data once it is stored. The reviewer's statement that they have the same mathematical characteristics is reasonable, although there is a lot of devil in the detail (see below). In principle one might apply the same algorithm to both data structures; a toy sketch of the two structures follows.
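
To make the data-structure point concrete, here is a minimal sketch (my own toy illustration in Python, not any assembler's actual code) that files the same reads both ways: the de Bruijn graph stores (k-1)-mers linked by k-mers, while the overlap graph keeps whole reads as nodes and records suffix-prefix matches.

```python
from collections import defaultdict
from itertools import permutations

reads = ["ATGGC", "TGGCA", "GGCAT"]  # toy reads; real data would be millions

def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; every k-mer seen in a read adds one edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

def overlap_graph(reads, min_overlap):
    """Nodes are whole reads; an edge records the longest suffix-prefix match."""
    graph = defaultdict(list)
    for a, b in permutations(reads, 2):
        for olen in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
            if a.endswith(b[:olen]):
                graph[a].append((b, olen))
                break  # keep only the longest overlap per read pair
    return graph

print(dict(de_bruijn_graph(reads, k=3)))
print(dict(overlap_graph(reads, min_overlap=3)))
```

Either structure is just storage; the assembly algorithm (walking paths, resolving branches) is a separate choice layered on top.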

2. In the special case of infinite coverage, if you choose the right parameters (de Bruijn kmer = overlap = read-length - 1) then the overlap and de Bruijn graphs are the same, and because of this people tend to think of them as equivalent. However, with finite coverage it is unknown whether the two formulations are equivalent. If you need a reference for that, I think the end of Richard Durbin and Jared Simpson's FM index paper will do. For a given depth of coverage, a given genome (which implies something about repeat structure), and a given read length, it's not clear that you would necessarily make the same choices of kmer/overlap parameter for the two approaches, and therefore it's not clear you get equivalent results (see the toy check below). HOWEVER....
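
Here is a toy check of that special case (my own illustration; I set k to the read length L, one common way to state the equivalence, so de Bruijn nodes are (L-1)-mers). With exhaustive sampling, the (L-1)-overlap links between reads correspond exactly to pairs of consecutive de Bruijn edges:

```python
genome = "ATGGCATTGGC"
L = 5
reads = {genome[i:i + L] for i in range(len(genome) - L + 1)}  # every read exists

# Overlap graph: link read a to read b when they overlap by L-1 characters.
overlap_links = {(a, b) for a in reads for b in reads
                 if a != b and a[1:] == b[:-1]}

# De Bruijn graph with k = L: each read r is an edge r[:-1] -> r[1:].
dbg_edges = {(r[:-1], r[1:]) for r in reads}

# Two dBG edges are consecutive when the head of one equals the tail of the
# next; rebuilding the reads from such pairs recovers exactly the overlap links.
consecutive = {(p1 + s1[-1], p2 + s2[-1])
               for (p1, s1) in dbg_edges for (p2, s2) in dbg_edges
               if s1 == p2 and p1 + s1[-1] != p2 + s2[-1]}

assert consecutive == overlap_links
print(len(reads), "reads;", len(overlap_links), "links present in both graphs")
```

With finite coverage you would typically choose k well below the read length, and then this correspondence no longer holds by construction.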

3. The real issue is one of experimental design, cost, and data properties. Overlap graphs do not scale so well (in general) with volumes of data, and so tend to be used with longer reads and lower coverage. That said, look up the SGA paper (again Simpson and Durbin), recently out in Genome Research. 454 data is expensive but the reads are longer, so you sequence to lower depth (which means it is harder to deal with errors). De Bruijn graphs should scale better with coverage, but then your choice of kmer requires a trade-off between repeat resolution and coverage (illustrated below). In short: de Bruijn assemblers and overlap assemblers tend to be used on different TYPES of data with implicitly different experimental design (read length and coverage). This implies a difference in assembly properties.
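
To illustrate that trade-off with a hedged toy example (my own, not from the SGA paper): a small k lets a repeat collapse into branching nodes, a larger k resolves it, but each read of length L contributes only L - k + 1 k-mers, so a large k costs you effective coverage.

```python
from collections import defaultdict

genome = "AAGTCGTACGTCCA"  # contains 3-bp repeats ("GTC", "CGT")

def branching_nodes(genome, k):
    """Build the genome's de Bruijn graph and count nodes with more than
    one distinct successor, i.e. repeats the graph cannot resolve."""
    out = defaultdict(set)
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        out[kmer[:-1]].add(kmer[1:])
    return sum(1 for succ in out.values() if len(succ) > 1)

for k in (3, 4, 5):
    print(f"k={k}: {branching_nodes(genome, k)} branching node(s)")
# k=3 and k=4 leave branching nodes (collapsed repeats); k=5 resolves them,
# but only if coverage is deep enough that every 5-mer is actually sampled.
```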

4. Generally, all assembler papers have an introduction where they describe a general data structure and some algorithms, and then, deep in the details, a bunch of heuristics. These heuristics also have a big effect on the differences between the results of specific assemblers.

5. Transcriptome assembly is hard, and I would expect the major differences in assembly properties to be due not to the data structure but to how much work has gone into the actual assembler itself.

So returning to your original question, I'm not sure I understand the sentence as you typed it: "..between heterogeneous 454-transcriptome reads between de Bruijn..etc". Do you mean that, given a bunch of 454 transcriptome reads, you'd expect de Bruijn and overlap assemblers to perform differently? I think that's not a very helpful thing to think about. Different assembly tools will perform differently depending on how much work has gone in. Depending on your depth of coverage, whether the specific assembler can cope with the 454 error model, and whether it is implemented/tested/designed well, you'll get better or worse results.

I'd recommend Mihai Pop's excellent 2009 article (Genome assembly reborn) as a good introduction to the various issues.

best

Zam
