Hi everyone. We have already covered a lot about Requests-based crawlers, so today let's go through the detailed configuration of each Scrapy component, to prepare for the Scrapy crawler case studies we will publish later. Scrapy is a crawler framework written in pure Python; its main strengths are simplicity, ease of use, and high extensibility. This article does not dwell on Scrapy basics and instead focuses on how to configure each major component to take advantage of that extensibility. It is not exhaustive, but it should cover most needs :) For anything beyond that, the official documentation is worth a careful read. As usual, let's start with a diagram of Scrapy's data flow for review and reference. Now to the main topic; the concrete examples use a Douban spider. Create a project and a spider with:

scrapy startproject <Project_name>
scrapy genspider <spider_name> <domains>

To create the convenient CrawlSpider template for site-wide crawling, use:

scrapy genspider -t crawl <spider_name> <domains>
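For reference, the Douban project used throughout the rest of this article would presumably have been created with commands along these lines (the names simply match the code shown below):

scrapy startproject Douban
cd Douban
scrapy genspider douban douban.com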
Let's start with the core component, spider.py. Without further ado, here is the code; see the comments.

import scrapy   # standard imports; anyone with basic Python will recognize these
import json
# the item is needed for persistence, so import it from the package
# (adjust the relative path to your own project layout)
from ..items import DoubanItem
class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['douban.com']
    # set request headers for this single spider
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
        }
    }

    # most of the time there is no need to override this method; do so only to
    # customize the start URLs or to set headers on individual requests
    def start_requests(self):
        page = 18
        base_url = 'https://xxxx'   # placeholder; the real API URL contains a {} for the offset
        for i in range(page):
            url = base_url.format(i * 20)
            req = scrapy.Request(url=url, callback=self.parse)
            # to add headers to a single request; later requests work the same way
            # req.headers['User-Agent'] = ''
            yield req

    # nothing special: regular page parsing that hands detail URLs on to the next
    # callback (the data-flow diagram makes this clear)
    def parse(self, response):
        json_str = response.body.decode('utf-8')
        res_dict = json.loads(json_str)
        for i in res_dict['subjects']:
            url = i['url']
            yield scrapy.Request(url=url, callback=self.parse_detailed_page)

    # a Scrapy response can be parsed with XPath directly; basic stuff, no elaboration
    def parse_detailed_page(self, response):
        title = response.xpath('//h1/span[1]/text()').extract_first()
        year = response.xpath('//h1/span[2]/text()').extract()[0]
        image = response.xpath('//img[@rel="v:image"]/@src').extract_first()
        item = DoubanItem()
        item['title'] = title
        item['year'] = year
        item['image'] = image
        # downloading images requires a separate ImagesPipeline, with matching
        # changes in settings.py and pipelines.py
        item['image_urls'] = [image]
        yield item
For site-wide crawling, the beginning of the spider looks slightly different (a fuller skeleton is sketched just below):

rules = (
    Rule(LinkExtractor(allow=r'http:///digimon/.*/index.html'),
         callback='parse_item', follow=False),
)
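For context, a minimal CrawlSpider skeleton around that rules tuple might look roughly like the following; the class name, domain, and parse_item body are illustrative, not taken from the original project:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class DigimonSpider(CrawlSpider):
    name = 'digimon'
    allowed_domains = ['example.com']        # assumed domain, adjust to the real site
    start_urls = ['http://example.com/']

    rules = (
        Rule(LinkExtractor(allow=r'/digimon/.*/index\.html'),
             callback='parse_item', follow=False),
    )

    def parse_item(self, response):
        # CrawlSpider reserves parse() for its own use, so callbacks use other names
        yield {'url': response.url}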
The key is the follow setting; whether the crawl reaches the intended depth and pages is for you to judge. One more note: request headers can be set in three places, and where you set them decides their scope:
in settings.py: the widest scope, affecting every spider in the project
as a spider class attribute: affecting all requests of that spider
on an individual request: affecting only that request
The three scopes run from the whole project, to a single spider, to a single request; if they coexist, the headers set on the individual request take the highest priority!

Next up is items.py:

import scrapy
class DoubanItem(scrapy.Item):
    title = scrapy.Field()
    year = scrapy.Field()
    image = scrapy.Field()
    # the ImagesPipeline used for downloading images also needs its own field
    image_urls = scrapy.Field()

    # I use MySQL for persistence; not expanded on here
    def get_insert_sql_and_data(self):
        # CREATE TABLE douban(
        #     id int not null auto_increment primary key,
        #     title text, `year` int, image text) ENGINE=INNODB DEFAULT CHARSET=UTF8mb4;
        # reserved words such as year must be wrapped in backticks
        insert_sql = 'INSERT INTO douban(title,`year`,image) ' \
                     'VALUES(%s,%s,%s)'
        data = (self['title'], self['year'], self['image'])
        return (insert_sql, data)
The middlewares are where things get interesting. Many of you may never need them, but they matter a great deal when configuring proxies. For ordinary needs the SpiderMiddleware is left alone; the changes go into the DownloaderMiddleware.

# signals: this term matters a lot for custom Scrapy extensions
from scrapy import signals
# a locally written class (code follows below); it can sit on top of your own IP pool
# or simply wrap a paid proxy service (as in my case)
from proxyhelper import Proxyhelper
# multiple threads operating on the same object need a lock:
# instantiate it once, then acquire and release around each access
from twisted.internet.defer import DeferredLock
class DoubanSpiderMiddleware(object):
    # the spider middleware is left unconfigured
    pass
class DoubanDownloaderMiddleware(object):
    def __init__(self):
        # instantiate the proxy helper and the lock
        self.helper = Proxyhelper()
        self.lock = DeferredLock()
    @classmethod
    def from_crawler(cls, crawler):
        # left unchanged from the template
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s
    def process_request(self, request, spider):
        # triggered when a request reaches the downloader middleware
        self.lock.acquire()
        # note: Scrapy reads the proxy from request.meta['proxy'] (lowercase key)
        request.meta['proxy'] = self.helper.get_proxy()
        self.lock.release()
        return None
    def process_response(self, request, response, spider):
        # inspect the response; if it is not acceptable, switch the proxy and retry
        if response.status != 200:
            self.lock.acquire()
            self.helper.update_proxy(request.meta['proxy'])
            self.lock.release()
            return request
        return response
    def process_exception(self, request, exception, spider):
        # on an exception, switch the proxy and retry the request
        self.lock.acquire()
        self.helper.update_proxy(request.meta['proxy'])
        self.lock.release()
        return request
    def spider_opened(self, spider):
        # left unchanged from the template
        spider.logger.info('Spider opened: %s' % spider.name)
And the Proxyhelper class it relies on (proxyhelper.py):

import requests


class Proxyhelper(object):
    def __init__(self):
        self.proxy = self._get_proxy_from_xxx()

    def get_proxy(self):
        return self.proxy

    def update_proxy(self, proxy):
        # only refresh if the failing proxy is still the current one,
        # so concurrent callers do not refresh twice
        if proxy == self.proxy:
            print('Updating a proxy')
            self.proxy = self._get_proxy_from_xxx()

    def _get_proxy_from_xxx(self):
        url = ''  # fill in your proxy provider's API here, ideally one that returns a single IP per call
        response = requests.get(url)
        return 'http://' + response.text.strip()
Next, pipelines.py:

# load the local MySQL persistence class; write your own as needed
from mysqlhelper import Mysqlhelper
# import ImagesPipeline so it can be subclassed and customized
from scrapy.pipelines.images import ImagesPipeline
import hashlib
from scrapy.utils.python import to_bytes
from scrapy.http import Request
class DoubanImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        request_lst = []
        for x in item.get(self.images_urls_field, []):
            req = Request(x)
            req.meta['movie_name'] = item['title']  # carry the movie title along with the request
            request_lst.append(req)
        return request_lst

    # overridden to rename the downloaded image
    def file_path(self, request, response=None, info=None):
        image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()  # the default hash name, kept for reference but unused
        return 'full/%s.jpg' % (request.meta['movie_name'])
# nothing special: part of the work was already done in items.py,
# keeping pipelines and items functionally separate
class DoubanPipeline(object):
    def __init__(self):
        self.mysqlhelper = Mysqlhelper()
    def process_item(self, item, spider):
        if 'get_insert_sql_and_data' in dir(item):
            (insert_sql, data) = item.get_insert_sql_and_data()
            self.mysqlhelper.execute_sql(insert_sql, data)
        return item
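The Mysqlhelper class is left to the reader above. As a rough idea, a minimal sketch using pymysql could look like this; the connection parameters and the execute_sql signature are assumptions chosen to match the call above, not the author's actual code:

import pymysql


class Mysqlhelper(object):
    def __init__(self):
        # assumed connection parameters; adjust to your own database
        self.conn = pymysql.connect(host='localhost', user='root',
                                    password='', db='spider', charset='utf8mb4')

    def execute_sql(self, sql, data):
        # run one parameterized statement and commit
        with self.conn.cursor() as cursor:
            cursor.execute(sql, data)
        self.conn.commit()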
Now for the most critical component, settings.py; the explanations are annotated directly in the code.

# name of the crawler project
BOT_NAME = 'Douban'
SPIDER_MODULES = ['Douban.spiders']
NEWSPIDER_MODULE = 'Douban.spiders'
# client request headers
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Douban (+http://www.)'
# Obey robots.txt rules (the robots exclusion protocol)
ROBOTSTXT_OBEY = False
# number of concurrent requests
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32
# download delay
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# per-domain and per-IP concurrency, which override the setting above
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# used to monitor the running crawler
#TELNETCONSOLE_ENABLED = False
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]
# to connect: open cmd -> telnet 127.0.0.1 6023 -> est()
# Override the default request headers:
# default request headers, effective for every spider in the project
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
#     'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
# }
# spider middlewares
# SPIDER_MIDDLEWARES = {
#     # 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': None,
#     'Douban.middlewares.DoubanSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See https://docs./en/latest/topics/downloader-middleware.html
# downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    # 560 is chosen because the built-in downloader middlewares are split into many
    # sub-components, and their priority numbers decide the order in which requests
    # and responses pass through them; see the official docs for details
    # (a quick way to inspect the built-in priorities is sketched after this settings block)
    'Douban.middlewares.DoubanDownloaderMiddleware': 560,
}
# time limit (in seconds) allowed for fetching a URL
# (the actual Scrapy setting is DOWNLOAD_TIMEOUT; a bare TIMEOUT has no effect)
DOWNLOAD_TIMEOUT = 10
# depth limit
# DEPTH_LIMIT = 1
# custom extensions
EXTENSIONS = {
    'Douban.extends.MyExtension': 500,
}
# item pipelines
ITEM_PIPELINES = {
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    # the image downloader pipeline has to be registered here
    'Douban.pipelines.DoubanImagesPipeline': 300,
}
# automatic throttling based on observed latency
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs./en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# HTTP caching, rarely used
# Enable and configure HTTP caching (disabled by default)
# See https://docs./en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# storage location for the ImagesPipeline image downloader; enable as needed
IMAGES_STORE = 'download'
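One last note on the DOWNLOADER_MIDDLEWARES priority mentioned above: to see where a number like 560 falls among the built-in downloader middlewares, a quick sketch like the following, relying on Scrapy's default_settings module, prints the base priority map:

from scrapy.settings import default_settings

# each entry maps a built-in downloader middleware to its priority;
# lower numbers run process_request earlier and process_response later
for mw, priority in sorted(default_settings.DOWNLOADER_MIDDLEWARES_BASE.items(),
                           key=lambda kv: kv[1]):
    print(priority, mw)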
Custom extensions. Before configuring this component you should understand signals, that is, which signals Scrapy fires at which points while it runs, which ultimately comes back to a solid grasp of the data flow. In the code I use a class I wrote myself; essentially it uses the 喵提醒 push service to send a reminder at certain moments (no, 喵提醒 is not paying me). You could equally strengthen the extension with logging or other features, hooking the appropriate signals for the moments you care about. The file does not exist by default; create it yourself (here Douban/extends.py, matching the EXTENSIONS entry above):
from scrapy import signals
from message import Message


class MyExtension(object):
    def __init__(self, value):
        self.value = value

    @classmethod
    def from_crawler(cls, crawler):
        val = crawler.settings.getint('MMMM')
        ext = cls(val)

        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)

        return ext

    def spider_opened(self, spider):
        print('spider running')

    def spider_closed(self, spider):
        message = Message('spider运行结束')
        message.push()
        print('spider closed')
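The Message class above is the author's own wrapper around the 喵提醒 push service and is not shown in the article. As a stand-in, a minimal sketch might simply POST the text to whatever notification endpoint you use; the NOTIFY_URL and the payload field here are placeholders, not the service's real API:

import requests


class Message(object):
    # placeholder: point this at your own push/webhook endpoint
    NOTIFY_URL = ''

    def __init__(self, text):
        self.text = text

    def push(self):
        if not self.NOTIFY_URL:
            print(self.text)   # fall back to printing when no endpoint is configured
            return
        requests.post(self.NOTIFY_URL, data={'text': self.text})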
Finally, a quick word on runnings.py: it is simply a way to run the command-line crawl from inside Python.

from scrapy.cmdline import execute

execute('scrapy crawl douban'.split())
That covers a Scrapy component configuration that should satisfy most basic needs; refer back to it whenever something is still unfamiliar. We will follow up with some hands-on Scrapy crawler case studies.