分享

python——thefuzzy、difflib详解

 laq6 2023-04-11 发布于江西

preface:最近业务上涉及一些文本匹配计算的东西,包括以往也涉及到,用到模糊匹配,但之前并没有深究原理。这次详细看了下模糊计算的得分怎么计算的。编辑距离计算略。

thefuzzy:

difflib:计算两个字符串差异的包。有主要的SequenceMatcher类。

SequenceMatcher类:主要的ratio计算相似得分。get_opcodes计算四种(增删改等)操作。

一、difflib

ratio计算关键原理:

  1. Where T is the total number of elements in both sequences, and
  2. M is the number of matches, this is 2.0*M / T.

源代码:/opt/anaconda3/lib/python3.8/difflib.py

解释:T分别是两个序列的元素长度和,M是匹配到的个数。

abc、a:匹配了a,故结果是(2*1)/(3+1) = 0.5

aabc、a:也只匹配了a,虽然前面有两个a,但可以理解为后面匹配到了的a用掉了,故(2*1)/(4+1)=0.4

aabc、aa:匹配到了第一个a、第二个a,共两个,故(2*2)/(4+2)=2/3

  1. def _calculate_ratio(matches, length):
  2. if length:
  3. return 2.0 * matches / length
  4. return 1.0
  5. def ratio(self):
  6. """Return a measure of the sequences' similarity (float in [0,1]).
  7. Where T is the total number of elements in both sequences, and
  8. M is the number of matches, this is 2.0*M / T.
  9. Note that this is 1 if the sequences are identical, and 0 if
  10. they have nothing in common."""
  11. matches = sum(triple[-1] for triple in self.get_matching_blocks())
  12. return _calculate_ratio(matches, len(self.a) + len(self.b))

要想计算ratio,需要先计算get_matching_blocks:

get_matching_blocks关键原理:依赖最长子序列find_longest_match

  1. def find_longest_match(self, alo, ahi, blo, bhi):
  2. a, b, b2j, isbjunk = self.a, self.b, self.b2j, self.bjunk.__contains__
  3. besti, bestj, bestsize = alo, blo, 0
  4. j2len = {}
  5. nothing = []
  6. for i in range(alo, ahi):
  7. j2lenget = j2len.get
  8. newj2len = {}
  9. for j in b2j.get(a[i], nothing):
  10. # a[i] matches b[j]
  11. if j < blo:
  12. continue
  13. if j >= bhi:
  14. break
  15. k = newj2len[j] = j2lenget(j-1, 0) + 1
  16. if k > bestsize:
  17. besti, bestj, bestsize = i-k+1, j-k+1, k
  18. j2len = newj2len
  19. while besti > alo and bestj > blo and \
  20. not isbjunk(b[bestj-1]) and \
  21. a[besti-1] == b[bestj-1]:
  22. besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
  23. while besti+bestsize < ahi and bestj+bestsize < bhi and \
  24. not isbjunk(b[bestj+bestsize]) and \
  25. a[besti+bestsize] == b[bestj+bestsize]:
  26. bestsize += 1
  27. while besti > alo and bestj > blo and \
  28. isbjunk(b[bestj-1]) and \
  29. a[besti-1] == b[bestj-1]:
  30. besti, bestj, bestsize = besti-1, bestj-1, bestsize+1
  31. while besti+bestsize < ahi and bestj+bestsize < bhi and \
  32. isbjunk(b[bestj+bestsize]) and \
  33. a[besti+bestsize] == b[bestj+bestsize]:
  34. bestsize = bestsize + 1
  35. return Match(besti, bestj, bestsize)
  36. def get_matching_blocks(self):
  37. if self.matching_blocks is not None:
  38. return self.matching_blocks
  39. la, lb = len(self.a), len(self.b)
  40. queue = [(0, la, 0, lb)]
  41. matching_blocks = []
  42. while queue:
  43. alo, ahi, blo, bhi = queue.pop()
  44. i, j, k = x = self.find_longest_match(alo, ahi, blo, bhi)
  45. if k: # if k is 0, there was no matching block
  46. matching_blocks.append(x)
  47. if alo < i and blo < j:
  48. queue.append((alo, i, blo, j))
  49. if i+k < ahi and j+k < bhi:
  50. queue.append((i+k, ahi, j+k, bhi))
  51. matching_blocks.sort()
  52. i1 = j1 = k1 = 0
  53. non_adjacent = []
  54. for i2, j2, k2 in matching_blocks:
  55. # Is this block adjacent to i1, j1, k1?
  56. if i1 + k1 == i2 and j1 + k1 == j2:
  57. k1 += k2
  58. else:
  59. if k1:
  60. non_adjacent.append((i1, j1, k1))
  61. i1, j1, k1 = i2, j2, k2
  62. if k1:
  63. non_adjacent.append((i1, j1, k1))
  64. non_adjacent.append( (la, lb, 0) )
  65. self.matching_blocks = list(map(Match._make, non_adjacent))
  66. return self.matching_blocks

get_opcodes关键原理:

  • 把一个序列变成另外一个序列所需要增加、删除、改动、等于四种操作的情况。

二、thefuzz

四种模糊匹配计算方式:

  • 简单匹配(Ratio)、非完全匹配(Partial Ratio)、忽略顺序匹配(Token Sort Ratio)和去重子集匹配(Token Set Ratio)

Ratio:直接调用了difflib的SequenceMatcher

token_sort_ratio、token_set_ratio:

  • 做了简单的处理。再调用Ratio
  • 处理:将两个字符串空格分割开来,得到两个集合a、b。其中
    • a&b排序拼接在一起得到sorted_sect,交集
    • (a-b、a&b)排序拼接一起得到combined_1to2。差集+交集
    • (b-a、a&b)排序拼接一起combined_2to1。另外一个差集+交集
    • 计算ratio(sorted_sect)、ratio(combined_1to2)、ratio(combined_2to1)三者之间的最大值。

源代码:/opt/anaconda3/lib/python3.8/site-packages/thefuzz.py

  1. @utils.check_for_none
  2. def _token_set(s1, s2, partial=True, force_ascii=True, full_process=True):
  3. """Find all alphanumeric tokens in each string...
  4. - treat them as a set
  5. - construct two strings of the form:
  6. <sorted_intersection><sorted_remainder>
  7. - take ratios of those two strings
  8. - controls for unordered partial matches"""
  9. if not full_process and s1 == s2:
  10. return 100
  11. p1 = utils.full_process(s1, force_ascii=force_ascii) if full_process else s1
  12. p2 = utils.full_process(s2, force_ascii=force_ascii) if full_process else s2
  13. if not utils.validate_string(p1):
  14. return 0
  15. if not utils.validate_string(p2):
  16. return 0
  17. # pull tokens
  18. tokens1 = set(p1.split())
  19. tokens2 = set(p2.split())
  20. intersection = tokens1.intersection(tokens2)
  21. diff1to2 = tokens1.difference(tokens2)
  22. diff2to1 = tokens2.difference(tokens1)
  23. sorted_sect = " ".join(sorted(intersection))
  24. sorted_1to2 = " ".join(sorted(diff1to2))
  25. sorted_2to1 = " ".join(sorted(diff2to1))
  26. combined_1to2 = sorted_sect + " " + sorted_1to2
  27. combined_2to1 = sorted_sect + " " + sorted_2to1
  28. # strip
  29. sorted_sect = sorted_sect.strip()
  30. combined_1to2 = combined_1to2.strip()
  31. combined_2to1 = combined_2to1.strip()
  32. if partial:
  33. ratio_func = partial_ratio
  34. else:
  35. ratio_func = ratio
  36. pairwise = [
  37. ratio_func(sorted_sect, combined_1to2),
  38. ratio_func(sorted_sect, combined_2to1),
  39. ratio_func(combined_1to2, combined_2to1)
  40. ]
  41. return max(pairwise)

其他:略。如:from fuzz import process

process.extractOne

process.extract

注意:中文使用fuzz.token_set_ratio(t1, t2),t1、t2需要用空格隔开一个个字

参考:

    本站是提供个人知识管理的网络存储空间,所有内容均由用户发布,不代表本站观点。请注意甄别内容中的联系方式、诱导购买等信息,谨防诈骗。如发现有害或侵权内容,请点击一键举报。
    转藏 分享 献花(0

    0条评论

    发表

    请遵守用户 评论公约

    类似文章 更多