"Automate the Boring Stuff with Python": Reading Notes 6

Four兄 2019-08-24


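Chapter 7: Pattern Matching with Regular Expressions

It feels like I've been slacking off for far too long, so here I am, back with an update. This time it's Chapter 7, regular expressions. If I got anything wrong, please leave me a comment. Thanks a lot.

Before getting into the notes for this chapter, let me first say what a regular expression is.

Baidu defines it as follows: a regular expression is a logical formula for operating on strings. It uses certain predefined special characters, and combinations of them, to build a "rule string" that expresses filtering logic for strings. (I think that's already pretty clear. Put more simply: it's a kind of shorthand logic, a specific notation for expressing information.)

Finding Patterns of Text Without Regular Expressions

First, the book's example is finding phone numbers in some strings. The phone number format is xxx-xxx-xxxx. I'll assume readers of these notes already know Python or have a background in another language, so think about it for a moment: how would you implement this?

The simplest approach, with no packaging at all, is to read a string from the keyboard or a file and then check it with if statements in the "main" part of the program. If you still have questions about strings, tuples, or lists at this point, please flip back to the earlier notes; I won't repeat that here.

Here is the code from the book (I don't remember whether I ever uploaded the code bundle; if not, I'll upload it later).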


def isPhoneNumber(text):
    if len(text) != 12:
        return False  # not phone number-sized
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False  # not an area code
    if text[3] != '-':
        return False  # does not have first hyphen
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False  # does not have first 3 digits
    if text[7] != '-':
        return False  # does not have second hyphen
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False  # does not have last 4 digits
    return True  # 'text' is a phone number!
print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))

(Output)

415-555-4242 is a phone number:

True

Moshi moshi is a phone number:

False

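A few notes:

1. The isdecimal() method checks whether a string contains only decimal characters. In Python 2 this method exists only on unicode objects; in Python 3 every str is Unicode, so it is always available.

Note: in Python 2, you make a string literal a unicode string by adding the 'u' prefix. Python 3 needs no prefix.

Syntax of isdecimal():

str.isdecimal()

It returns True if the string is non-empty and contains only decimal characters, and False otherwise.

2. Calling a function works much the same as in other languages.

3. Pay close attention to indentation. I hadn't written Python in so long that I kept getting errors for ages (I really should dig out my vernier caliper, sob).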
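As a quick check of note 1, here is a minimal sketch of how str.isdecimal() behaves in Python 3. The sample strings are made up for illustration and are not from the book.

```python
# str.isdecimal() is True only for non-empty strings made up entirely
# of decimal digit characters. Signs and decimal points do not count.
samples = ['415', '4a5', '', '-12', '4.5']
results = {s: s.isdecimal() for s in samples}

for s, ok in results.items():
    print(repr(s), ok)
```

Only '415' passes; the empty string, the signed number, and the number with a decimal point all return False, which is why isPhoneNumber() can rely on it for single characters.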

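The code in isPhoneNumber() runs several checks to see whether the string in text is a valid phone number. If any one of the checks fails, the function returns False. First the code checks that the string is exactly 12 characters long. Then it checks that the area code (the first 3 characters of text) consists only of digits. The rest of the function checks that the string follows the phone number pattern: the first hyphen must come after the area code, followed by 3 digits, then another hyphen, and finally 4 digits. If execution gets past all of the checks, the function returns True.

Next, using the slicing described earlier, we can also pull phone numbers out of a longer string (instead of only judging whether one short string is a phone number, as above). The code is as follows: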


def isPhoneNumber(text):
    if len(text) != 12:
        return False  # not phone number-sized
    for i in range(0, 3):
        if not text[i].isdecimal():
            return False  # not an area code
    if text[3] != '-':
        return False  # does not have first hyphen
    for i in range(4, 7):
        if not text[i].isdecimal():
            return False  # does not have first 3 digits
    if text[7] != '-':
        return False  # does not have second hyphen
    for i in range(8, 12):
        if not text[i].isdecimal():
            return False  # does not have last 4 digits
    return True  # 'text' is a phone number!
'''print('415-555-4242 is a phone number:')
print(isPhoneNumber('415-555-4242'))
print('Moshi moshi is a phone number:')
print(isPhoneNumber('Moshi moshi'))'''
message='Call me at 415-555-1011 tomorrow. 415-555-9999 is my office'
for i in range(len(message)):
    chunk = message[i:i+12]
    if isPhoneNumber(chunk):
        print('Phone number found: '+ chunk)
print('Done')

(Output)

Phone number found: 415-555-1011
Phone number found: 415-555-9999
Done

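"In each iteration of the for loop, a new 12-character slice of message is assigned to the variable chunk. For example, in the first iteration i is 0 and chunk is assigned message[0:12] (the string 'Call me at 4'). In the next iteration i is 1 and chunk is assigned message[1:13] (the string 'all me at 41').
chunk is passed to isPhoneNumber() to see whether it matches the phone number pattern. If it does, that slice is printed.
As the loop continues through message, the 12 characters in chunk will eventually be a phone number. The loop goes through the entire string, testing each 12-character piece and printing every chunk that satisfies isPhoneNumber(). Once we have gone through all of message, we print Done.
Although the string in message is short in this example, it could be millions of characters long and the program would still run in less than a second. A similar program that finds phone numbers with regular expressions would also run in less than a second, but regular expressions make such programs much quicker to write."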
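The sliding-window idea described in the quoted passage can also be wrapped in a function that returns the matches instead of printing them. This is only a sketch: the names is_phone_number and find_phone_numbers are my own, not the book's, and the check is a compact rewrite of isPhoneNumber().

```python
def is_phone_number(text):
    """Return True if text has the form xxx-xxx-xxxx (digits and hyphens)."""
    if len(text) != 12:
        return False
    for i, ch in enumerate(text):
        if i in (3, 7):
            # Positions 3 and 7 must hold the two hyphens.
            if ch != '-':
                return False
        elif not ch.isdecimal():
            # Every other position must be a digit.
            return False
    return True


def find_phone_numbers(message):
    """Slide a 12-character window over message and collect every match."""
    found = []
    # Stop at the last position where a full 12-character chunk still fits.
    for i in range(len(message) - 11):
        chunk = message[i:i + 12]
        if is_phone_number(chunk):
            found.append(chunk)
    return found


message = 'Call me at 415-555-1011 tomorrow. 415-555-9999 is my office'
print(find_phone_numbers(message))
```

Returning a list makes the result reusable, and stopping the loop 11 characters before the end skips the short trailing chunks that could never match anyway.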

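Finding Patterns of Text with Regular Expressions

Back to the phone number problem above. The book was written by an American, so we follow the US convention: phone numbers look like xxx-xxx-xxxx. What does the regular expression look like? Simply replace the x I used informally above with the conventional symbol \d: \d\d\d-\d\d\d-\d\d\d\d. Because people are lazy, and also to reduce mistakes, there is a shorter version too: \d\d\d-\d\d\d-\d\d\d\d becomes \d{3}-\d{3}-\d{4}, where a number in braces says how many times to repeat the item before it.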
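To convince yourself that the long and short forms describe the same pattern, here is a small sketch using Python's re module (introduced in the next section). The variable names and sample strings are made up for illustration.

```python
import re

long_form = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
short_form = re.compile(r'\d{3}-\d{3}-\d{4}')

samples = ['415-555-4242', '415-555-424', 'Moshi moshi']
agreement = []
for s in samples:
    # fullmatch() succeeds only if the whole string fits the pattern.
    a = long_form.fullmatch(s) is not None
    b = short_form.fullmatch(s) is not None
    agreement.append((s, a, b))
    print(s, a, b)
```

For every sample the two patterns give the same answer, and only the complete 12-character number matches.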

Creating Regex Objects

All of Python's regular expression functions are in the re module.

import re

If you don't import it, you get an error: NameError: blah blah blah...

To create a Regex object that matches the phone number pattern (so that phoneNumRegex holds a Regex object):

phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')

Matching Regex Objects

Use the search() method to search a string.

So the def part plus the slicing loop from before is replaced by search().


import re
phoneNumRegex = re.compile(r'\d\d\d-\d\d\d-\d\d\d\d')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print('phone number found: '+ mo.group())

(Output)

phone number found: 415-555-4242

A few notes:

1. search(): http://www.cnblogs.com/aaronthon/p/9435967.html

2. group(): https://www.cnblogs.com/erichuo/p/7909180.html

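More Pattern Matching with Regular Expressions

You can group with parentheses (used together with group()).

For example, the pattern above, \d\d\d-\d\d\d-\d\d\d\d, becomes (\d\d\d)-(\d\d\d-\d\d\d\d).

The code above changes to: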


import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
print(mo.group(1))
'''
print(mo.group(2))
print(mo.group(0))
print(mo.group())
print(mo.group(1)+mo.group(2))
'''

(Output)

415

If you remove the triple quotes that comment out the remaining lines, the output is:

415
555-4242
415-555-4242
415-555-4242
415555-4242


import re
phoneNumRegex = re.compile(r'(\d\d\d)-(\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is 415-555-4242.')
'''
print(mo.group(1))
print(mo.group(2))
print(mo.group(0))
print(mo.group())
print(mo.group(1)+mo.group(2))
'''
areaCode,mainNumber= mo.groups()
print(areaCode)
print(mainNumber)

(Output)

415
555-4242

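Parentheses have a special meaning in regular expressions, but what if you need to match a parenthesis in your text? For instance, the phone numbers you want to match might have the area code in parentheses. In that case, escape the ( and ) characters with a backslash.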


import re
phoneNumRegex = re.compile(r'(\(\d\d\d\)) (\d\d\d-\d\d\d\d)')
mo = phoneNumRegex.search('My number is (415) 555-4242.')
print(mo.group(1))
print(mo.group(2))
print(mo.group(1)+' '+mo.group(2))

(Output)

(415)
555-4242
(415) 555-4242

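Matching Multiple Groups with the Pipe

So what is a "pipe"? The book calls the | character a "pipe". Use it when you want to match any one of several expressions. For example: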


import re
heroRegex=re.compile(r'Batman|Tina Fey')
mo1=heroRegex.search('Batman and Tina Fey.')
print(mo1.group())
mo2=heroRegex.search('Tina Fey and Batman.')
print(mo2.group())

(Output)

Batman
Tina Fey

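If both Batman and Tina Fey occur in the searched string, the first occurrence of matching text is returned.

(The findall() method, which comes up later, can find "all" of the matching occurrences.)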
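As a small preview of findall(), here is a sketch that reuses the heroRegex pattern from the example above. For a pattern without groups, findall() returns a list of all matched strings.

```python
import re

heroRegex = re.compile(r'Batman|Tina Fey')

# search() stops at the first match; findall() collects every match.
first_match = heroRegex.search('Batman and Tina Fey.').group()
all_matches = heroRegex.findall('Batman and Tina Fey.')

print(first_match)
print(all_matches)
```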

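You can also use the pipe to match one of several patterns as part of a regex. For example, the book wants to match any of 'Batman', 'Batmobile', 'Batcopter', or 'Batbat'. Since they all start with 'Bat', the pattern can be simplified: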


import re
batRegex=re.compile(r'Bat(man|mobile|copter|bat)')
mo=batRegex.search('Batmobile lost a wheel.')
print(mo.group())
print(mo.group(1))

(Output)

Batmobile
mobile

The call mo.group() returns the full matched text 'Batmobile', while mo.group(1) returns only the part matched inside the first parenthesized group, 'mobile'.

If you need to match an actual pipe character, escape it with a backslash: \| (I think it means this:


import re
batRegex=re.compile(r'\||Batman|bat')
mo=batRegex.search('| Batman lost a \\.')
print(mo.group())

)

Optional Matching with the Question Mark

Let's go straight to an example.


import re
batRegex=re.compile(r'Bat(wo)?man')
mo1=batRegex.search('The adventures of Batman.')
print(mo1.group())
mo2=batRegex.search('The adventures of Batwoman')
print(mo2.group())

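Here '(wo)?' is an optional part: it may be present or absent, that is, it matches zero or one time.

If you really need to match a literal question mark, do the same as before and escape it with a backslash.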
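Two more short illustrations of the question mark, written as a sketch: an optional area code in the phone number pattern, and an escaped question mark that matches a literal '?'. The variable names and the sample sentences are my own choices.

```python
import re

# The group (\d\d\d-) may appear zero or one time.
phoneRegex = re.compile(r'(\d\d\d-)?\d\d\d-\d\d\d\d')
with_area = phoneRegex.search('My number is 415-555-4242').group()
without_area = phoneRegex.search('My number is 555-4242').group()

# \? matches an actual question mark instead of making 'n' optional.
questionRegex = re.compile(r'Batman\?')
literal = questionRegex.search('Is that Batman?').group()

print(with_area)
print(without_area)
print(literal)
```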

Matching Zero or More with the Star


import re
batRegex=re.compile(r'Bat(wo)*man')
mo1=batRegex.search('The adventures of Batman.')
print(mo1.group())
mo2=batRegex.search('The adventures of Batwoman')
print(mo2.group())
mo3=batRegex.search('The adventures of Batwowowowowowowoman')
print(mo3.group())

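It's pretty much the same as ?, just changing "zero or one time" to "zero or more times" (which suddenly reminds me of the saying that cross-dressing only ever happens zero times or countless times~).

Matching One or More with the Plus

First, a version that raises an error: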


import re
batRegex=re.compile(r'Bat(wo)+man')
mo2=batRegex.search('The adventures of Batwoman')
print(mo2.group())
mo3=batRegex.search('The adventures of Batwowowowowowowoman')
print(mo3.group())
mo1=batRegex.search('The adventures of Batman.')
print(mo1.group())

Here is the error output:

Batwoman
Traceback (most recent call last):
Batwowowowowowowoman
File 'xxxxxxxxxx (file path)', line 10, in <module>
print(mo1.group())
AttributeError: 'NoneType' object has no attribute 'group'

Process finished with exit code 1

And now one that doesn't raise an error:


import re
batRegex=re.compile(r'Bat(wo)+man')
mo2=batRegex.search('The adventures of Batwoman')
print(mo2.group())
mo3=batRegex.search('The adventures of Batwowowowowowowoman')
print(mo3.group())
mo1=batRegex.search('The adventures of Batman.')
print(mo1)

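This is easy to understand: the plus sign requires at least one occurrence, so search() finds no match in 'Batman' and returns None, and calling group() on None raises the AttributeError.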
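One way to avoid the AttributeError is to check the result of search() before calling group(). A minimal sketch, where describe is a hypothetical helper name of my own:

```python
import re

batRegex = re.compile(r'Bat(wo)+man')


def describe(text):
    """Return the matched text, or 'no match' when search() returns None."""
    mo = batRegex.search(text)
    if mo is None:
        return 'no match'
    return mo.group()


print(describe('The adventures of Batwoman'))
print(describe('The adventures of Batman.'))
```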

Matching Specific Repetitions with Curly Brackets


import re
haRegex=re.compile(r'(Ha){3}')
mo1=haRegex.search('HaHaHa.')
print(mo1.group())
mo2=haRegex.search('Ha')
print(mo2)

(Output)

HaHaHa
None
In regular expressions:

(Ha){3,5} matches the same strings as ((Ha)(Ha)(Ha)|(Ha)(Ha)(Ha)(Ha)|(Ha)(Ha)(Ha)(Ha)(Ha)), like so.
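A quick sketch to check that equivalence: both patterns should accept exactly 3, 4, or 5 repetitions of 'Ha' when the whole string has to match. The names rangeRegex and expandedRegex are mine.

```python
import re

rangeRegex = re.compile(r'(Ha){3,5}')
expandedRegex = re.compile(
    r'((Ha)(Ha)(Ha)|(Ha)(Ha)(Ha)(Ha)|(Ha)(Ha)(Ha)(Ha)(Ha))'
)

# Try 1 to 6 repetitions and record which counts each pattern accepts.
range_counts = [n for n in range(1, 7) if rangeRegex.fullmatch('Ha' * n)]
expanded_counts = [n for n in range(1, 7) if expandedRegex.fullmatch('Ha' * n)]

print(range_counts)
print(expanded_counts)
```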

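Greedy and Nongreedy Matching

Speaking of greedy, that reminds me of the days when everything looked greedy to me (DFS, BFS, linear programming, and so on: everything looked like a greedy algorithm).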
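As a minimal sketch of what greedy versus nongreedy means for the (Ha){3,5} pattern above: by default Python's regular expressions are greedy and take the longest possible match, while a question mark after the closing brace asks for the shortest one. The variable names are my own.

```python
import re

greedyRegex = re.compile(r'(Ha){3,5}')
nongreedyRegex = re.compile(r'(Ha){3,5}?')

text = 'HaHaHaHaHa'
greedy = greedyRegex.search(text).group()        # takes as many as allowed
nongreedy = nongreedyRegex.search(text).group()  # stops at the minimum

print(greedy)
print(nongreedy)
```

Note that this question mark is a different use of the symbol from the optional-group question mark earlier in these notes.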

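---------------------------------Unfinished; I'll fill this in when I find the time-------------------------------

A quick aside: while filling in these gaps I've been waiting for my teacher to cover web scraping, and later in this book there's a module that ships with Python, webbrowser. What it does is really trivial (though it did give me a way to open a web page without <a></a>).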


import webbrowser
webbrowser.open('https://www.csdn.net/')

References:

http://www.runoob.com/python/att-string-isdecimal.html

http://www.cnblogs.com/aaronthon/p/9435967.html

https://www.cnblogs.com/erichuo/p/7909180.html
