python——thefuzzy、difflib详解

laq6 2023-04-11 发布于江西

展开全文

preface：最近业务上涉及一些文本匹配计算的东西，包括以往也涉及到，用到模糊匹配，但之前并没有深究原理。这次详细看了下模糊计算的得分怎么计算的。编辑距离计算略。

thefuzzy：

python的模糊匹配包，java也有实现。里面基本上基于difflib来实现的。
地址：https://github.com/seatgeek/thefuzz
安装：pip install thefuzz

difflib：计算两个字符串差异的包。有主要的SequenceMatcher类。

SequenceMatcher类：主要的ratio计算相似得分。get_opcodes计算四种(增删改等)操作。

一、difflib

ratio计算关键原理：

Where T is the total number of elements in both sequences, and
M is the number of matches, this is 2.0*M / T.

源代码：/opt/anaconda3/lib/python3.8/difflib.py

解释：T分别是两个序列的元素长度和，M是匹配到的个数。

abc、a：匹配了a，故结果是(2*1)/(3+1) = 0.5

aabc、a：也只匹配了a，虽然前面有两个a，但可以理解为后面匹配到了的a用掉了，故(2*1)/(4+1)=0.4

aabc、aa：匹配到了第一个a、第二个a，共两个，故(2*2)/(4+2)=2/3

def _calculate_ratio(matches, length):
    if length:
        return 2.0 * matches / length
    return 1.0
def ratio(self):
    """Return a measure of the sequences' similarity (float in [0,1]).
    Where T is the total number of elements in both sequences, and
    M is the number of matches, this is 2.0*M / T.
    Note that this is 1 if the sequences are identical, and 0 if
    they have nothing in common."""
    matches = sum(triple[-1] for triple in self.get_matching_blocks())
    return _calculate_ratio(matches, len(self.a) + len(self.b))

要想计算ratio，需要先计算get_matching_blocks：

get_matching_blocks关键原理：依赖最长子序列find_longest_match

    def find_longest_match(self, alo, ahi, blo, bhi):
        a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__
        besti, bestj, bestsize = alo, blo, 0
        j2len = {}
        nothing = []
        for i in range(alo, ahi):
            j2lenget = j2len.get
            newj2len = {}
            for j in b2j.get(a[i], nothing):
                # a[i] matches b[j]
                if j < blo:
                    continue
                if j >= bhi:
                    break
                k = newj2len[j] = j2lenget(j-1, 0) + 1
                if k > bestsize:
                    besti, bestj, bestsize = i-k+1, j-k+1, k
            j2len = newj2len
        while besti > alo and bestj > blo and \
              not isbjunk(b[bestj-1]) and \
              a[besti-1] == b[bestj-1]:
            besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
        while besti+bestsize < ahi and bestj+bestsize < bhi and \
              not isbjunk(b[bestj+bestsize]) and \
              a[besti+bestsize] == b[bestj+bestsize]:
            bestsize += 1
        while besti > alo and bestj > blo and \
              isbjunk(b[bestj-1]) and \
              a[besti-1] == b[bestj-1]:
            besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
        while besti+bestsize < ahi and bestj+bestsize < bhi and \
              isbjunk(b[bestj+bestsize]) and \
              a[besti+bestsize] == b[bestj+bestsize]:
            bestsize = bestsize + 1
        return Match(besti, bestj, bestsize)
    def get_matching_blocks(self):
        if self.matching_blocks is not None:
            return self.matching_blocks
        la, lb = len(self.a), len(self.b)
        queue = [(0, la, 0, lb)]
        matching_blocks = []
        while queue:
            alo, ahi, blo, bhi = queue.pop()
            i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
            if k:   # if k is 0, there was no matching block
                matching_blocks.append(x)
                if alo < i and blo < j:
                    queue.append((alo, i, blo, j))
                if i+k < ahi and j+k < bhi:
                    queue.append((i+k, ahi, j+k, bhi))
        matching_blocks.sort()
        i1 = j1 = k1 = 0
        non_adjacent = []
        for i2, j2, k2 in matching_blocks:
            # Is this block adjacent to i1, j1, k1?
            if i1 + k1 == i2 and j1 + k1 == j2:
                k1 += k2
            else:
                if k1:
                    non_adjacent.append((i1, j1, k1))
                i1, j1, k1 = i2, j2, k2
        if k1:
            non_adjacent.append((i1, j1, k1))
        non_adjacent.append( (la, lb, 0) )
        self.matching_blocks = list(map(Match._make, non_adjacent))
        return self.matching_blocks

get_opcodes关键原理：

把一个序列变成另外一个序列所需要增加、删除、改动、等于四种操作的情况。

二、thefuzz

四种模糊匹配计算方式：

简单匹配（Ratio）、非完全匹配（Partial Ratio）、忽略顺序匹配（Token Sort Ratio）和去重子集匹配（Token Set Ratio）

Ratio：直接调用了difflib的SequenceMatcher

token_sort_ratio、token_set_ratio：

做了简单的处理。再调用Ratio
处理：将两个字符串空格分割开来，得到两个集合a、b。其中
- a&b排序拼接在一起得到sorted_sect，交集
- (a-b、a&b)排序拼接一起得到combined_1to2。差集+交集
- (b-a、a&b)排序拼接一起combined_2to1。另外一个差集+交集
- 计算ratio(sorted_sect)、ratio(combined_1to2)、ratio(combined_2to1)三者之间的最大值。

源代码：/opt/anaconda3/lib/python3.8/site-packages/thefuzz.py

@utils.check_for_none
def _token_set(s1, s2, partial=True, force_ascii=True, full_process=True):
    """Find all alphanumeric tokens in each string...
        - treat them as a set
        - construct two strings of the form:
            <sorted_intersection><sorted_remainder>
        - take ratios of those two strings
        - controls for unordered partial matches"""
    if not full_process and s1 == s2:
        return 100
    p1 = utils.full_process(s1, force_ascii=force_ascii) if full_process else s1
    p2 = utils.full_process(s2, force_ascii=force_ascii) if full_process else s2
    if not utils.validate_string(p1):
        return 0
    if not utils.validate_string(p2):
        return 0
    # pull tokens
    tokens1 = set(p1.split())
    tokens2 = set(p2.split())
    intersection = tokens1.intersection(tokens2)
    diff1to2 = tokens1.difference(tokens2)
    diff2to1 = tokens2.difference(tokens1)
    sorted_sect = " ".join(sorted(intersection))
    sorted_1to2 = " ".join(sorted(diff1to2))
    sorted_2to1 = " ".join(sorted(diff2to1))
    combined_1to2 = sorted_sect + " " + sorted_1to2
    combined_2to1 = sorted_sect + " " + sorted_2to1
    # strip
    sorted_sect = sorted_sect.strip()
    combined_1to2 = combined_1to2.strip()
    combined_2to1 = combined_2to1.strip()
    if partial:
        ratio_func = partial_ratio
    else:
        ratio_func = ratio
    pairwise = [
        ratio_func(sorted_sect, combined_1to2),
        ratio_func(sorted_sect, combined_2to1),
        ratio_func(combined_1to2, combined_2to1)
    ]
    return max(pairwise)