Douban Movie Top 250 is a great first project for learning web scraping. Before writing a crawler, we should first be clear about its overall workflow.

一、Workflow analysis

二、Implementation and walkthrough

1. Fetching the page
Set the request headers and the response encoding, then fetch the page with the get method from requests.
How to find the headers: open the page in your browser, press F12 (or open the developer tools), then go to Network → Headers → User-Agent.
def get_response(url):
    # Pretend to be a normal browser so Douban does not reject the request
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    response = requests.get(url, headers=headers)
    response.encoding = 'UTF-8'   # make sure the Chinese text is decoded correctly
    return response.text
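One thing this function does not handle is a failed request. A slightly more defensive variant is sketched below; the timeout value and the raise_for_status() call are additions of mine, not part of the original walkthrough:

def get_response(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    # timeout keeps the spider from hanging on a dead connection;
    # raise_for_status() turns 4xx/5xx responses into exceptions instead of silently returning a bad page
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
    response.encoding = 'UTF-8'
    return response.text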
2. Parsing the page source
Looking at the page source, we can see that every movie sits in its own li tag (inside a div with class="item"), which makes the structure very easy to target.
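To get a feel for the XPath expressions before reading the full parser, here is a minimal experiment against a simplified, made-up fragment of the page (the real markup has more nesting and attributes):

from lxml import etree

# A stripped-down, hypothetical version of one entry on the Top 250 page
sample = '''
<ol>
  <li>
    <div class="item">
      <span class="title">肖申克的救赎</span>
      <div class="bd">
        <p>导演: 弗兰克·德拉邦特   主演: 蒂姆·罗宾斯<br/>1994 / 美国 / 犯罪 剧情</p>
      </div>
    </div>
  </li>
</ol>
'''

tree = etree.HTML(sample)
for node in tree.xpath('//li/div[@class="item"]'):
    # The leading "." makes the expression relative to the current movie node
    print(node.xpath('.//span[@class="title"]/text()'))    # ['肖申克的救赎']
    print(node.xpath('.//div[@class="bd"]/p/text()'))      # two text pieces, split by the <br/>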
def get_nodes(html):
    text = etree.HTML(html)
    nodes = text.xpath('//li/div[@class="item"]')   # one node per movie entry
    infos = []
    for node in nodes:
        try:
            key = {}
            # xpath() returns a list such as ['肖申克的救赎']; str() plus strip("[']") peels off the brackets and quotes
            key['movieName'] = str(node.xpath('.//span[@class="title"][1]/text()')).strip("[']")
            print(key['movieName'])   # simple progress output
            # The <p> under div.bd holds two text nodes: director/actors on the first, year/country/genre on the second
            firstInfo = node.xpath('.//div[@class="bd"]/p/text()')[0]
            secondInfo = node.xpath('.//div[@class="bd"]/p/text()')[1]
            key['director'] = str(firstInfo.split("主演:")[0]).strip().strip('导演:')
            key['actors'] = firstInfo.split("主演:")[1]
            key['time'] = secondInfo.split('/')[0]
            key['country'] = secondInfo.split('/')[1]
            infos.append(key)
        except IndexError:
            # A few entries have no "主演:" part; they are skipped here, with actors marked as None
            key['actors'] = None
    return infos
This gives us the movie name, director, main actors, release year, country, and so on for each film. (If you are unsure about XPath syntax, the w3school site has a reference; the syntax is quite simple to pick up.)

3. Writing the data to a CSV file

def save_file(infos):
    headers = ['电影名称','导演','主演','上映时间','国家']
    with open('DouBanMovieT250.csv', 'a+', encoding='UTF-8', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(headers)
        for key in infos:
            writer.writerow([key['movieName'], key['director'], key['actors'], key['time'], key['country']])
Here csv.writer(fp) creates a CSV writer object, and we then call its writerow method to write rows to the file. One thing to note: because the file is opened in append mode ('a+') and the header row is written on every call, the header line will appear once for every page we scrape.
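One possible way to write the header only once (a sketch of my own, not the original code) is to check whether the file already exists before opening it:

import csv
import os

def save_file(infos):
    headers = ['电影名称','导演','主演','上映时间','国家']
    # Only write the header row the first time the file is created
    write_header = not os.path.exists('DouBanMovieT250.csv')
    with open('DouBanMovieT250.csv', 'a+', encoding='UTF-8', newline='') as fp:
        writer = csv.writer(fp)
        if write_header:
            writer.writerow(headers)
        for key in infos:
            writer.writerow([key['movieName'], key['director'], key['actors'], key['time'], key['country']])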
At this point the spider for the first page is done. Of course, our ambitions don't stop at one page, so next let's look at paging through the whole list.

if __name__ == '__main__':
    urls = ['https://movie.douban.com/top250?start={}'.format(i) for i in range(0, 226, 25)]
    for url in urls:
        html = get_response(url)
        infos = get_nodes(html)
        save_file(infos)
Looking at the GET requests the site makes as you page through the list, you can see that only the start parameter changes, and it changes in fixed steps of 25 (0, 25, 50, ..., 225). That means we can generate every request URL up front and simply loop over them.
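A quick way to convince yourself that the list comprehension covers all ten pages is to print what it generates (a throwaway check, not part of the spider itself):

urls = ['https://movie.douban.com/top250?start={}'.format(i) for i in range(0, 226, 25)]
print(len(urls))    # 10 -> ten pages of 25 movies each
print(urls[0])      # https://movie.douban.com/top250?start=0
print(urls[-1])     # https://movie.douban.com/top250?start=225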
And with that, our Douban Movie Top 250 spider is complete. Let's take a look at the result!
Complete code

import requests
from lxml import etree
import csv
def get_response(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}
    response = requests.get(url, headers=headers)
    response.encoding = 'UTF-8'
    return response.text
def get_nodes(html):
    text = etree.HTML(html)
    nodes = text.xpath('//li/div[@class="item"]')
    infos = []
    for node in nodes:
        try:
            key = {}
            key['movieName'] = str(node.xpath('.//span[@class="title"][1]/text()')).strip("[']")
            print(key['movieName'])
            firstInfo = node.xpath('.//div[@class="bd"]/p/text()')[0]
            secondInfo = node.xpath('.//div[@class="bd"]/p/text()')[1]
            key['director'] = str(firstInfo.split("主演:")[0]).strip().strip('导演:')
            key['actors'] = firstInfo.split("主演:")[1]
            key['time'] = secondInfo.split('/')[0]
            key['country'] = secondInfo.split('/')[1]
            infos.append(key)
        except IndexError:
            key['actors'] = None
    return infos
def save_file(infos):
    headers = ['电影名称','导演','主演','上映时间','国家']
    with open('DouBanMovieT250.csv', 'a+', encoding='UTF-8', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(headers)
        for key in infos:
            writer.writerow([key['movieName'], key['director'], key['actors'], key['time'], key['country']])
if __name__ == '__main__':
    urls = ['https://movie.douban.com/top250?start={}'.format(i) for i in range(0, 226, 25)]
    for url in urls:
        html = get_response(url)
        infos = get_nodes(html)
        save_file(infos)
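If you want to sanity-check the output, you can read the CSV back with the same csv module (just a quick verification snippet of my own, not part of the tutorial code):

import csv

with open('DouBanMovieT250.csv', encoding='UTF-8') as fp:
    for row in csv.reader(fp):
        print(row)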
A little progress every day. Keep going!