ERROR: Error downloading <GET URL_HERE>: User timeout caused connection failure.
我在使用刮刀的时候偶尔会遇到这个问题.有没有办法可以解决这个问题并在它发生时运行一个函数?我无法在任何地方找到如何在线进行. 解决方法: 您可以做的是在Request实例中定义一个errback :
errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives 07001 as first parameter.
以下是您可以使用的一些示例代码(对于scrapy 1.0):
# -*- coding: utf-8 -*-
# errbacks.py
import scrapy
# from scrapy.contrib.spidermiddleware.httperror import HttpError
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError
class ErrbackSpider(scrapy.Spider):
name = "errbacks"
start_urls = [
"http://www./", # HTTP 200 expected
"http://www./status/404", # Not found error
"http://www./status/500", # server issue
"http://www.:12345/", # non-responding host, timeout expected
"http://www./", # DNS error expected
]
def start_requests(self):
for u in self.start_urls:
yield scrapy.Request(u, callback=self.parse_httpbin,
errback=self.errback_httpbin,
dont_filter=True)
def parse_httpbin(self, response):
self.logger.error('Got successful response from {}'.format(response.url))
# do something useful now
def errback_httpbin(self, failure):
# log all errback failures,
# in case you want to do something special for some errors,
# you may need the failure's type
self.logger.error(repr(failure))
#if isinstance(failure.value, HttpError):
if failure.check(HttpError):
# you can get the response
response = failure.value.response
self.logger.error('HttpError on %s', response.url)
#elif isinstance(failure.value, DNSLookupError):
elif failure.check(DNSLookupError):
# this is the original request
request = failure.request
self.logger.error('DNSLookupError on %s', request.url)
#elif isinstance(failure.value, TimeoutError):
elif failure.check(TimeoutError):
request = failure.request
self.logger.error('TimeoutError on %s', request.url)
和scrapy shell中的输出(只有1次重试和5次下载超时):
$scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1
2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11
2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'}
2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines:
2015-06-30 23:45:56 [scrapy] INFO: Spider opened
2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www./> (failed 1 times): DNS lookup failed: address 'www.' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www./> (failed 2 times): DNS lookup failed: address 'www.' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>>
2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www./
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www./> (referer: None)
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www./status/404> (referer: None)
2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www./
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www./status/404
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www./status/500> (failed 1 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www./status/500> (failed 2 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www./status/500> (referer: None)
2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www./status/500
2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.:12345/> (failed 1 times): User timeout caused connection failure.
2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.:12345/> (failed 2 times): User timeout caused connection failure.
2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>>
2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.:12345/
2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished)
2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
'downloader/request_bytes': 1748,
'downloader/request_count': 8,
'downloader/request_method_count/GET': 8,
'downloader/response_bytes': 12506,
'downloader/response_count': 4,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'downloader/response_status_count/500': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191),
'log_count/DEBUG': 10,
'log_count/ERROR': 9,
'log_count/INFO': 7,
'response_received_count': 3,
'scheduler/dequeued': 8,
'scheduler/dequeued/memory': 8,
'scheduler/enqueued': 8,
'scheduler/enqueued/memory': 8,
'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)}
2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished)
请注意scrapy如何在其统计信息中记录异常:
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
来源:https://www./content-1-485301.html
|