[Original] A Regular Expression Problem in a Python Web Scraper

Python进阶者, published 2023-09-23 from Guangdong


I shall lead the three armies north to pacify the Central Plains.

Hello everyone, I'm Pipi.

1. Preface

A few days ago, a member named 【空】 asked a Python web scraping question in the Python Diamond group. Let's take a look. Here is his code.

import re

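html = '''
<img src="//www.chinadaily.com.cn/image_e/2017/logo.png" alt="chinadaily" />
        </a>
      </div>
      <div class="fl-right">
        <a href="//cn.chinadaily.com.cn" target="_top" shape="rect"><img src="//www.chinadaily.com.cn/image_e/2017/cnbut.png" />


'''

# print(html)
# Find the first "<img ... src=" and note where the match ends.
reg = r"<img.+src="
m = re.search(reg, html)
print(m)
a = m.end()
# print(a)
# Keep everything that follows "src=".
s = html[a:]
# print(s)
# The greedy .+ runs to the last double quote on the line,
# so the match also swallows alt="chinadaily".
n = re.search(r"\".+\"", s)
# print(n)
b = n.end()
# print(b)
# Only the first image is handled, and src is longer than the address itself.
src = s[:b]
print(src)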

2. Implementation

Later, 【瑜亮老师】 offered a suggestion. The code is as follows:

import re

html = '''
<img src="//www.chinadaily.com.cn/image_e/2017/logo.png" alt="chinadaily" />
        </a>
      </div>
      <div class="fl-right">
        <a href="//cn.chinadaily.com.cn" target="_top" shape="rect"><img src="//www.chinadaily.com.cn/image_e/2017/cnbut.png" />


'''
# The non-greedy group (.*?) stops at the first closing double quote.
reg = r'<img src="//(.*?)"'
# findall returns the captured group for every match, without the leading "//".
m = re.findall(reg, html)
print(m)

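This returns every image address in the sample, because the non-greedy (.*?) stops at the first closing double quote. That resolved the reader's problem.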
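To make the difference concrete, here is a minimal sketch that runs a greedy and a non-greedy pattern against a single img tag modeled on the sample HTML. The variable names (tag, greedy, lazy) are illustrative and not part of the original code.

```python
import re

# A single tag modeled on the sample HTML (assumed for illustration).
tag = '<img src="//www.chinadaily.com.cn/image_e/2017/logo.png" alt="chinadaily" />'

# Greedy: .+ extends to the LAST double quote in the string,
# so the alt attribute is captured along with the address.
greedy = re.search(r'src="(.+)"', tag).group(1)

# Non-greedy: .*? stops at the FIRST closing double quote,
# which is where the src value ends.
lazy = re.search(r'src="(.*?)"', tag).group(1)

print(greedy)
print(lazy)
```

The first line printed still contains the text of the alt attribute, while the second is the bare address.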

He later shared a complete scraper as well, shown below:

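import re

import requests

url = "http://www.chinadaily.com.cn/"
html = requests.get(url).text
# Capture every protocol-relative image address (the part after "//").
reg = r'img src="//(.*?)"'
img = re.findall(reg, html)

for i in img:
    # Put the scheme back so requests can fetch the address.
    i = "http://" + i
    # Drop the "_w642" suffix so the URL ends in plain ".jpeg".
    if 'jpeg_w642' in i:
        i = i.replace('.jpeg_w642', '.jpeg')
    resp = requests.get(i).content
    # Save the image under the last segment of its URL path.
    with open(i.split('/')[-1], 'wb') as f:
        f.write(resp)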
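The loop above rebuilds each URL by string concatenation and names the file after the last path segment. As a hedged variation on the same idea, the sketch below resolves the protocol-relative addresses with urllib.parse.urljoin and strips any query string before choosing a filename. The helper names extract_image_urls and local_filename, and the sample string, are hypothetical and not from the original post. No network request is made here.

```python
import posixpath
import re
from urllib.parse import urljoin, urlsplit

# Matches the src value of an img tag even when other attributes come first.
IMG_SRC = re.compile(r'<img[^>]*?\ssrc="(.*?)"')


def extract_image_urls(html, base_url):
    """Return absolute image URLs found in html, resolved against base_url."""
    # urljoin fills in the scheme for addresses that start with "//".
    return [urljoin(base_url, src) for src in IMG_SRC.findall(html)]


def local_filename(url):
    """Return the last path segment of url, ignoring any query string."""
    return posixpath.basename(urlsplit(url).path)


# Sample markup modeled on the HTML in this post (assumed for illustration).
sample = ('<a href="//cn.chinadaily.com.cn">'
          '<img src="//www.chinadaily.com.cn/image_e/2017/cnbut.png" /></a>')

urls = extract_image_urls(sample, "http://www.chinadaily.com.cn/")
print(urls)
print(local_filename(urls[0]))
```

Each URL returned this way could then be passed to requests.get as in the loop above.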