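[Summary] Differences and connections between re.search and re.findall in Python, plus how re.findall behaves with four kinds of patterns: named groups, unnamed groups, non-capturing groups, and no groups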

戴维图书馆 2018-09-05

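These two functions used to confuse me again and again.

After using them many times, I finally worked out how they differ and how they relate to each other, including a few details that are easy to get wrong.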

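Before reading the summary and code below, make sure you are already familiar with these basics:

[Tutorial] Python regular expressions explained

[Tutorial] Python regular expressions explained: (…) group

[Tutorial] Python regular expressions explained: (?P<name>…) named group

Here is a brief summary: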

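Differences and connections between re.search and re.findall

re.search

Returns: a single Match object (or None if nothing matches).

Usual way to get the values: call group() on the Match object with a group number or a group name.

Common question:
Q: Why does search match only one item instead of all matching items?
A: Because that is exactly what search does: it scans from left to right and returns as soon as it finds a match. So it yields at most one match, never several. To get every match, use re.findall.

re.findall

Returns: a list. The type of each element depends on how the regular expression is written:
- a tuple: when the pattern has two or more capturing groups (roughly, more than one pair of capturing parentheses)
  - the tuple holds the value of each group, in order
- a string: when the pattern has no capturing group (no groups at all, or only non-capturing groups)
  - the string is the complete text matched by the pattern
- a string: when the pattern has exactly one capturing group
  - the string is the value of that single group, not the complete match

Usual way to get the values: use the returned list directly; each element is usually already the value you want.

Common question: see the detailed explanation below, and note how the results differ across the four kinds of patterns.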
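
The summary above can be sketched in a few lines. This is a minimal illustration written for Python 3; the sample text and the pattern are made up for the example and do not come from the original post.

```python
import re

# Hypothetical sample text and pattern, chosen only for illustration
text = "id=12, id=345, id=6"
pattern = r"id=(?P<num>\d+)"

# re.search returns one Match object for the first match, or None
first = re.search(pattern, text)
if first:
    first_by_name = first.group("num")   # access by group name
    first_by_index = first.group(1)      # access by group number
    whole_match = first.group(0)         # the complete matched text

# re.findall returns a list covering every non-overlapping match.
# The pattern has exactly one capturing group, so each item is that group's string.
all_nums = re.findall(pattern, text)

# With no match, re.search gives None and re.findall gives an empty list
no_match = re.search(pattern, "nothing here")
no_list = re.findall(pattern, "nothing here")

print(first_by_name, first_by_index, whole_match)  # 12 12 id=12
print(all_nums)                                    # ['12', '345', '6']
print(no_match, no_list)                           # None []
```

Checking the result of re.search before calling group() matters, because calling group() on None raises an AttributeError.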

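For re.findall in particular, pay special attention to the four kinds of patterns, because their results differ:

Pattern type used with re.findall	What the return values have in common	How the return values differ	Use
No group	Always a list	Each element is the complete matched string	Get the complete matched strings first, then extract the required fields (groups) from each string separately
Non-capturing group	Always a list	Each element is the complete matched string	Same as above; the pattern's shape lines up one-to-one with the grouped versions (unnamed or named), which makes the later processing easier to reason about
Unnamed group	Always a list	Each element is a tuple holding the value of each group (with two or more groups; with a single group, each element is that group's string)	Get specific groups of every match directly from findall, with no need to run re.search again on each string
Named group	Always a list	Each element is a tuple holding the value of each group (same single-group exception)	Same as above, but the pattern itself shows the meaning of each group more clearly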
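
To make the table concrete, here is a small sketch written for Python 3. The sample text and the patterns are hypothetical, shortened stand-ins rather than the ones used in the full listing; the last line shows the single-group case, where findall returns strings instead of tuples.

```python
import re

# Hypothetical sample text, chosen only for illustration
text = "a.jpg b.png"

no_group = re.findall(r"\w+\.\w+", text)                   # no group
non_capturing = re.findall(r"(?:\w+)\.(?:\w+)", text)      # non-capturing groups
unnamed = re.findall(r"(\w+)\.(\w+)", text)                # two unnamed groups
named = re.findall(r"(?P<name>\w+)\.(?P<ext>\w+)", text)   # two named groups
single = re.findall(r"(\w+)\.\w+", text)                   # exactly one capturing group

print(no_group)       # ['a.jpg', 'b.png']
print(non_capturing)  # ['a.jpg', 'b.png']
print(unnamed)        # [('a', 'jpg'), ('b', 'png')]
print(named)          # [('a', 'jpg'), ('b', 'png')]
print(single)         # ['a', 'b']
```

Note that the named pattern still yields plain tuples: the group names help when reading the pattern, but they are not attached to the values findall returns.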

The code below (Python 2, as its print statements show) demonstrates in detail what all of this means:

#!/usr/bin/python
# -*- coding: utf-8 -*-
"""
[Summary] Differences and connections between re.search and re.findall in Python,
plus how re.findall behaves with named groups, unnamed groups, non-capturing groups, and no groups
 
http://www./python_re_search_vs_re_findall
 
Version:    2012-11-16
Author:     Crifan
"""
 
import re

# Note:
# Before reading this tutorial, make sure you already understand the following:
# [Tutorial] Python regular expressions explained
# http://www./detailed_explanation_about_python_regular_express/
# [Tutorial] Python regular expressions explained: (…) group
# http://www./detailed_explanation_about_python_regular_express_about_group/
# [Tutorial] Python regular expressions explained: (?P<name>…) named group
# http://www./detailed_explanation_about_python_regular_express_named_group/
 
searchVsFindallStr = """
pic url test 1http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg
pic url test 2http://1881.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35ee46g213.jpg
pic url test 2http://1802.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae361ac6g213.jpg
"""
 
# Raw strings keep the backslashes intact; the dot before "jpg" is escaped so it matches a literal "."
singlePicUrlP_noGroup = r"http://\w+\.\w+\.\w+.+?/\w+?\.jpg" # no parentheses, i.e. no group
singlePicUrlP_nonCapturingGroup = r"http://(?:\w+)\.(?:\w+)\.(?:\w+).+?/(?:\w+?)\.jpg" # non-capturing group
singlePicUrlP_namedGroup = r"http://(?P<field1>\w+)\.(?P<field2>\w+)\.(?P<field3>\w+).+?/(?P<filename>\w+?)\.jpg" # named group
singlePicUrlP_unnamedGroup = r"http://(\w+)\.(\w+)\.(\w+).+?/(\w+?)\.jpg" # unnamed group
 
# 1. re.search
# search yields only a single match,
# because, unlike findall, it does not look for every match
foundSinglePicUrl = re.search(singlePicUrlP_namedGroup, searchVsFindallStr)
# search stops as soon as it finds the first match
print "foundSinglePicUrl=",foundSinglePicUrl #foundSinglePicUrl= <_sre.SRE_Match object at 0x01F75230>
# and returns the corresponding Match object
print "type(foundSinglePicUrl)=",type(foundSinglePicUrl) #type(foundSinglePicUrl)= <type '_sre.SRE_Match'>
if foundSinglePicUrl:
    # when the pattern has parentheses, i.e. groups, each value can be read through group()
    field1 = foundSinglePicUrl.group("field1")
    field2 = foundSinglePicUrl.group("field2")
    field3 = foundSinglePicUrl.group("field3")
    filename = foundSinglePicUrl.group("filename")

    group1 = foundSinglePicUrl.group(1)
    group2 = foundSinglePicUrl.group(2)
    group3 = foundSinglePicUrl.group(3)
    group4 = foundSinglePicUrl.group(4)

    #field1=1821, field2=img, field3=pp, filename=u121516081_136ae35f9d5g213
    print "field1=%s, field2=%s, field3=%s, filename=%s"%(field1, field2, field3, filename)

    # as shown here, named groups still map to the group indexes 1, 2, 3, 4
    # the two are equivalent; reading a group by name is simply more readable and avoids mixing up group numbers
    #group1=1821, group2=img, group3=pp, group4=u121516081_136ae35f9d5g213
    print "group1=%s, group2=%s, group3=%s, group4=%s"%(group1, group2, group3, group4)
 
# 2. re.findall - no group
# to get the complete matched strings from findall, use a pattern without parentheses, i.e. without groups
foundAllPicUrl = re.findall(singlePicUrlP_noGroup, searchVsFindallStr)
# findall finds every matching string
print "foundAllPicUrl=",foundAllPicUrl #foundAllPicUrl= ['http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg', 'http://1881.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35ee46g213.jpg', 'http://1802.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae361ac6g213.jpg']
# and returns them as a list
print "type(foundAllPicUrl)=",type(foundAllPicUrl) #type(foundAllPicUrl)= <type 'list'>
if foundAllPicUrl:
    for eachPicUrl in foundAllPicUrl:
        print "eachPicUrl=",eachPicUrl #eachPicUrl= http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg

        # a common approach here: for each complete matched string,
        # run re.search on it to extract the values we need
        foundEachPicUrl = re.search(singlePicUrlP_namedGroup, eachPicUrl)
        print "type(foundEachPicUrl)=",type(foundEachPicUrl) #type(foundEachPicUrl)= <type '_sre.SRE_Match'>
        print "foundEachPicUrl=",foundEachPicUrl #foundEachPicUrl= <_sre.SRE_Match object at 0x025D45F8>
        if foundEachPicUrl:
            field1 = foundEachPicUrl.group("field1")
            field2 = foundEachPicUrl.group("field2")
            field3 = foundEachPicUrl.group("field3")
            filename = foundEachPicUrl.group("filename")

            #field1=1821, field2=img, field3=pp, filename=u121516081_136ae35f9d5g213
            print "field1=%s, field2=%s, field3=%s, filename=%s"%(field1, field2, field3, filename)
 
# 3. re.findall - non-capturing group
# with non-capturing groups, findall behaves like the no-group case above:
foundAllPicUrlNonCapturing = re.findall(singlePicUrlP_nonCapturingGroup, searchVsFindallStr)
# findall again finds every complete matched string
print "foundAllPicUrlNonCapturing=",foundAllPicUrlNonCapturing #foundAllPicUrlNonCapturing= ['http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg', 'http://1881.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35ee46g213.jpg', 'http://1802.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae361ac6g213.jpg']
# and again returns them as a list
print "type(foundAllPicUrlNonCapturing)=",type(foundAllPicUrlNonCapturing) #type(foundAllPicUrlNonCapturing)= <type 'list'>
if foundAllPicUrlNonCapturing:
    for eachPicUrlNonCapturing in foundAllPicUrlNonCapturing:
        print "eachPicUrlNonCapturing=",eachPicUrlNonCapturing #eachPicUrlNonCapturing= http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg

        # as in the no-group example above, each string can now be processed further to extract the values you need
 
# 4. re.findall - named group
# next, the result of findall with named groups:
foundAllPicGroups = re.findall(singlePicUrlP_namedGroup, searchVsFindallStr)
# this also finds every match
# and returns a list
print "type(foundAllPicGroups)=",type(foundAllPicGroups) #type(foundAllPicGroups)= <type 'list'>
# but each element of the list now holds the values of the groups
print "foundAllPicGroups=",foundAllPicGroups #foundAllPicGroups= [('1821', 'img', 'pp', 'u121516081_136ae35f9d5g213'), ('1881', 'img', 'pp', 'u121516081_136ae35ee46g213'), ('1802', 'img', 'pp', 'u121516081_136ae361ac6g213')]
if foundAllPicGroups:
    for eachPicGroups in foundAllPicGroups:
        # since the groups are named, the values correspond in order to the
        # (?P<field1>\w+) (?P<field2>\w+) (?P<field3>\w+) (?P<filename>\w+?) parts of the pattern
        print "eachPicGroups=",eachPicGroups #eachPicGroups= ('1821', 'img', 'pp', 'u121516081_136ae35f9d5g213')
        # with several groups, each element is a tuple made up of the four group values above
        print "type(eachPicGroups)=",type(eachPicGroups) #type(eachPicGroups)= <type 'tuple'>

        # each tuple can now be processed further as needed, to pick out the values you want
 
# 5. re.findall - unnamed group
# finally, the result of findall with groups that are not named (unnamed groups):
foundAllPicGroupsUnnamed = re.findall(singlePicUrlP_unnamedGroup, searchVsFindallStr)
# the return value is again a list
print "type(foundAllPicGroupsUnnamed)=",type(foundAllPicGroupsUnnamed) #type(foundAllPicGroupsUnnamed)= <type 'list'>
# and each element again combines the values of the groups
print "foundAllPicGroupsUnnamed=",foundAllPicGroupsUnnamed #foundAllPicGroupsUnnamed= [('1821', 'img', 'pp', 'u121516081_136ae35f9d5g213'), ('1881', 'img', 'pp', 'u121516081_136ae35ee46g213'), ('1802', 'img', 'pp', 'u121516081_136ae361ac6g213')]
if foundAllPicGroupsUnnamed:
    for eachPicGroupsUnnamed in foundAllPicGroupsUnnamed:
        # as before, each element is a tuple
        print "type(eachPicGroupsUnnamed)=",type(eachPicGroupsUnnamed) #type(eachPicGroupsUnnamed)= <type 'tuple'>
        # and each tuple holds the values of the unnamed groups
        print "eachPicGroupsUnnamed=",eachPicGroupsUnnamed #eachPicGroupsUnnamed= ('1821', 'img', 'pp', 'u121516081_136ae35f9d5g213')

        # each tuple can now be processed further as needed, to pick out the values you want
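
One related option that the listing above does not use is re.finditer from the standard library. It yields a Match object for every match, so the group names stay available for all matches without a second re.search pass. The following is only a sketch written for Python 3; it reuses two of the sample lines and the named-group pattern from the listing, with the dot before "jpg" escaped.

```python
import re

# Two of the sample lines from the listing above
text = """
pic url test 1http://1821.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35f9d5g213.jpg
pic url test 2http://1881.img.pp.sohu.com.cn/images/blog/2012/3/7/23/28/u121516081_136ae35ee46g213.jpg
"""

pattern = r"http://(?P<field1>\w+)\.(?P<field2>\w+)\.(?P<field3>\w+).+?/(?P<filename>\w+?)\.jpg"

# groupdict() maps each group name to the text that group matched
rows = [m.groupdict() for m in re.finditer(pattern, text)]

for row in rows:
    print(row["field1"], row["field2"], row["field3"], row["filename"])
```

Whether this is preferable to findall depends on the case: findall is shorter when plain tuples are enough, while finditer keeps names and match positions.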