cs - Text difference algorithm - Stack Overflow

daomucun 2010-12-18

展开全文

up vote 25 down vote favorite

Hi, I need an algorithm that can compare two text files and highlight their difference and ( even better!) can compute their difference in a meaningful way ( meaning, two similar files should have a similarity score higher than two dissimilar files, with the word "similar" defined in the normal terms). It sounds easy to implement, but it's not.

The implementation can be in c# or python, thanks.

edited Sep 28 '08 at 11:00

aku
21.9k255106

asked Sep 28 '08 at 10:12

Ngu Soon Hui
13.2k654144

67% accept rate

	Just to be clear, are you asking for textual similarity or semantic similarity? – Torsten Marek Sep 28 '08 at 10:43
	Textual similarity. I supposed that semantic similarity still has a long way to go :) – Ngu Soon Hui Sep 28 '08 at 10:53
	It's not that difficult. A simple bag-of-words model goes a long way. – Torsten Marek Sep 28 '08 at 11:02

10 Answers

active newest votes

up vote 10 down vote accepted

In Python, there is difflib, as also others have suggested.

difflib offers the SequenceMatcher class, which can be used to give you a similarity ratio. Example function:

def text_compare(text1, text2, isjunk=None):
    return difflib.SequenceMatcher(isjunk, text1, text2).ratio()

answered Sep 28 '08 at 23:02

ΤΖΩΤΖΙΟΥ
10.6k31841

Thanks, this is the exact answer I need. – Ngu Soon Hui Sep 29 '08 at 2:03

up vote 21 down vote

Look at difflib. (Python)

That will calculate the diffs in various formats. You could then use the size of the context diff as a measure of how different two documents are?

answered Sep 28 '08 at 10:14

Douglas Leeder
17.3k21643

up vote 15 down vote

I can recommend to take a look at Neil Fraser's code and articles:

google-diff-match-patch

Currently available in Java, JavaScript, C++ and Python. Regardless of language, each library features the same API and the same functionality. All versions also have comprehensive test harnesses.

Neil Fraser: Diff Strategies - for theory and implementation notes

answered Sep 28 '08 at 11:04

aku
21.9k255106

up vote 6 down vote

Bazaar contains an alternative difference algorithm, called patience diff (there's more info in the comments on that page) which is claimed to be better than the traditional diff algorithm. The file 'patiencediff.py' in the bazaar distribution is a simple command line front end.

answered Sep 28 '08 at 10:35

Daniel James
1,444311

up vote 5 down vote

If you need a finer granularity than lines, you can use Levenshtein distance. Levenshtein distance is a straight-forward measure on how to similar two texts are.
You can also use it to extract the edit logs and can a very fine-grained diff, similar to that on the edit history pages of SO. Be warned though that Levenshtein distance can be quite CPU- and memory-intensive to calculate, so using difflib,as Douglas Leder suggested, is most likely going to be faster.

Cf. also this answer.

edited Sep 28 '08 at 10:36

answered Sep 28 '08 at 10:27

Torsten Marek
9,93612149

	I think you need to put in a link to that Cf. at the end. :-) – Douglas Leeder Sep 28 '08 at 10:31
	Thanks, I screwed up the markup. – Torsten Marek Sep 28 '08 at 10:36

up vote 4 down vote

My current understanding is that the best solution to the Shortest Edit Script (SES) problem is Myers "middle-snake" method with the Hirschberg linear space refinement.

The Myers algorithm is described in:

E. Myers, ``An O(ND) Difference Algorithm and Its Variations,''
Algorithmica 1, 2 (1986), 251-266.

The GNU diff utility uses the Myers algorighm.

The "similarity score" you speak of is called the "edit distance" in the literature which is the number of inserts or deletes necessary to transform one sequence into the other.

Note that a number of people have cited the Levenshtein distance algorithm but that is, albeit easy to implement, not the optimal solution as it is inefficient (requires the use of a possibly huge n*m matrix) and does not provide the "edit script" which is the sequence of edits that could be used to transform one sequence into the other and visa versa.

For a good Myers / Hirschberg implementation look at:

http://www./~miallen/libmba/dl/src/diff.c

The particular library that it is contained within is not longer maintained but to my knowledge the diff.c module itself is still correct.

Mike

answered Jan 26 '09 at 0:44

user8134
628146

up vote 2 down vote

As stated, use difflib. Once you have the diffed output, you may find the Levenshtein distance of the different strings as to give a "value" of how different they are.

answered Sep 28 '08 at 10:33

paradoja
713414

up vote 2 down vote

There are a number of distance metrics, as paradoja mentioned there is the Levenshtein distance, but there is also NYSIIS and Soundex. In terms of Python implementations, I have used py-editdist and ADVAS before. Both are nice in the sense that you get a single number back as a score. Check out ADVAS first, it implements a bunch of algorithms.

answered Sep 28 '08 at 19:21

johnp
1012

up vote 0 down vote

Can you searching on 'LCS' Algorithm

answered Dec 21 '09 at 1:31

zeadqunes
1

up vote 0 down vote

One method I've employed for a different functionality, to calculate how much data was new in a modified file, could perhaps work for you as well.

I have a diff/patch implementation C# that allows me to take two files, presumably old and new version of the same file, and calculate the "difference", but not in the usual sense of the word. Basically I calculate a set of operations that I can perform on the old version to update it to have the same contents as the new version.

To use this for the functionality initially described, to see how much data was new, I simple ran through the operations, and for every operation that copied from the old file verbatim, that had a 0-factor, and every operation that inserted new text (distributed as part of the patch, since it didn't occur in the old file) had a 1-factor. All characters was given this factory, which gave me basically a long list of 0's and 1's.

All I then had to do was to tally up the 0's and 1's. In your case, with my implementation, a low number of 1's compared to 0's would mean the files are very similar.

This implementation would also handle cases where the modified file had inserted copies from the old file out of order, or even duplicates (ie. you copy a part from the start of the file and paste it near the bottom), since they would both be copies of the same original part from the old file.

I experimented with weighing copies, so that the first copy counted as 0, and subsequent copies of the same characters had progressively higher factors, in order to give a copy/paste operation some "new-factor", but I never finished it as the project was scrapped.

If you're interested, my diff/patch code is available from my Subversion repository.