preface:最近业务上涉及一些文本匹配计算的东西,包括以往也涉及到,用到模糊匹配,但之前并没有深究原理。这次详细看了下模糊计算的得分怎么计算的。编辑距离计算略。
thefuzzy:
difflib:计算两个字符串差异的包。有主要的SequenceMatcher类。
SequenceMatcher类:主要的ratio计算相似得分。get_opcodes计算四种(增删改等)操作。
一、difflib
ratio计算关键原理:
Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T.
源代码:/opt/anaconda3/lib/python3.8/difflib.py
解释:T分别是两个序列的元素长度和,M是匹配到的个数。
abc、a:匹配了a,故结果是(2*1)/(3+1) = 0.5
aabc、a:也只匹配了a,虽然前面有两个a,但可以理解为后面匹配到了的a用掉了,故(2*1)/(4+1)=0.4
aabc、aa:匹配到了第一个a、第二个a,共两个,故(2*2)/(4+2)=2/3
def _calculate_ratio(matches, length): return 2.0 * matches / length """Return a measure of the sequences' similarity (float in [0,1]). Where T is the total number of elements in both sequences, and M is the number of matches, this is 2.0*M / T. Note that this is 1 if the sequences are identical, and 0 if they have nothing in common.""" matches = sum(triple[-1] for triple in self.get_matching_blocks()) return _calculate_ratio(matches, len(self.a) + len(self.b))
要想计算ratio,需要先计算get_matching_blocks:
get_matching_blocks关键原理:依赖最长子序列find_longest_match
def find_longest_match(self, alo, ahi, blo, bhi): a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__ besti, bestj, bestsize = alo, blo, 0 for i in range(alo, ahi): for j in b2j.get(a[i], nothing): k = newj2len[j] = j2lenget(j-1, 0) + 1 besti, bestj, bestsize = i-k+1, j-k+1, k while besti > alo and bestj > blo and \ not isbjunk(b[bestj-1]) and \ a[besti-1] == b[bestj-1]: besti, bestj, bestsize = besti-1, bestj-1, bestsize+1 while besti+bestsize < ahi and bestj+bestsize < bhi and \ not isbjunk(b[bestj+bestsize]) and \ a[besti+bestsize] == b[bestj+bestsize]: while besti > alo and bestj > blo and \ isbjunk(b[bestj-1]) and \ a[besti-1] == b[bestj-1]: besti, bestj, bestsize = besti-1, bestj-1, bestsize+1 while besti+bestsize < ahi and bestj+bestsize < bhi and \ isbjunk(b[bestj+bestsize]) and \ a[besti+bestsize] == b[bestj+bestsize]: return Match(besti, bestj, bestsize) def get_matching_blocks(self): if self.matching_blocks is not None: return self.matching_blocks la, lb = len(self.a), len(self.b) alo, ahi, blo, bhi = queue.pop() i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi) if k: # if k is 0, there was no matching block matching_blocks.append(x) queue.append((alo, i, blo, j)) if i+k < ahi and j+k < bhi: queue.append((i+k, ahi, j+k, bhi)) for i2, j2, k2 in matching_blocks: # Is this block adjacent to i1, j1, k1? if i1 + k1 == i2 and j1 + k1 == j2: non_adjacent.append((i1, j1, k1)) non_adjacent.append((i1, j1, k1)) non_adjacent.append( (la, lb, 0) ) self.matching_blocks = list(map(Match._make, non_adjacent)) return self.matching_blocks
get_opcodes关键原理:
- 把一个序列变成另外一个序列所需要增加、删除、改动、等于四种操作的情况。
二、thefuzz
四种模糊匹配计算方式:
- 简单匹配(Ratio)、非完全匹配(Partial Ratio)、忽略顺序匹配(Token Sort Ratio)和去重子集匹配(Token Set Ratio)
Ratio:直接调用了difflib的SequenceMatcher
token_sort_ratio、token_set_ratio:
- 做了简单的处理。再调用Ratio
- 处理:将两个字符串空格分割开来,得到两个集合a、b。其中
- a&b排序拼接在一起得到sorted_sect,交集
- (a-b、a&b)排序拼接一起得到combined_1to2。差集+交集
- (b-a、a&b)排序拼接一起combined_2to1。另外一个差集+交集
- 计算ratio(sorted_sect)、ratio(combined_1to2)、ratio(combined_2to1)三者之间的最大值。
源代码:/opt/anaconda3/lib/python3.8/site-packages/thefuzz.py
def _token_set(s1, s2, partial=True, force_ascii=True, full_process=True): """Find all alphanumeric tokens in each string... - construct two strings of the form: <sorted_intersection><sorted_remainder> - take ratios of those two strings - controls for unordered partial matches""" if not full_process and s1 == s2: p1 = utils.full_process(s1, force_ascii=force_ascii) if full_process else s1 p2 = utils.full_process(s2, force_ascii=force_ascii) if full_process else s2 if not utils.validate_string(p1): if not utils.validate_string(p2): tokens1 = set(p1.split()) tokens2 = set(p2.split()) intersection = tokens1.intersection(tokens2) diff1to2 = tokens1.difference(tokens2) diff2to1 = tokens2.difference(tokens1) sorted_sect = " ".join(sorted(intersection)) sorted_1to2 = " ".join(sorted(diff1to2)) sorted_2to1 = " ".join(sorted(diff2to1)) combined_1to2 = sorted_sect + " " + sorted_1to2 combined_2to1 = sorted_sect + " " + sorted_2to1 sorted_sect = sorted_sect.strip() combined_1to2 = combined_1to2.strip() combined_2to1 = combined_2to1.strip() ratio_func = partial_ratio ratio_func(sorted_sect, combined_1to2), ratio_func(sorted_sect, combined_2to1), ratio_func(combined_1to2, combined_2to1)
其他:略。如:from fuzz import process
process.extractOne
process.extract
注意:中文使用fuzz.token_set_ratio(t1, t2),t1、t2需要用空格隔开一个个字
参考:
|