python – How can I catch errors from scrapy so that I can do something when I get a User Timeout error?

 印度阿三17 2019-10-04

ERROR: Error downloading <GET URL_HERE>: User timeout caused connection failure.

I run into this error occasionally while using my scraper. Is there a way I can catch it and run a function when it happens? I can't find anything online about how to do this.

Solution:

What you can do is define an errback in your Request instances:

errback (callable) – a function that will be called if any exception was raised while processing the request. This includes pages that failed with 404 HTTP errors and such. It receives a Failure as its first parameter.

Here is some sample code you can use (for Scrapy 1.0):

# -*- coding: utf-8 -*-
# errbacks.py
import scrapy

# from scrapy.contrib.spidermiddleware.httperror import HttpError
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError


class ErrbackSpider(scrapy.Spider):
    name = "errbacks"
    start_urls = [
        "http://www./",              # HTTP 200 expected
        "http://www./status/404",    # Not found error
        "http://www./status/500",    # server issue
        "http://www.:12345/",        # non-responding host, timeout expected
        "http://www./",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.error('Got successful response from {}'.format(response.url))
        # do something useful now

    def errback_httpbin(self, failure):
        # log all errback failures,
        # in case you want to do something special for some errors,
        # you may need the failure's type
        self.logger.error(repr(failure))

        #if isinstance(failure.value, HttpError):
        if failure.check(HttpError):
            # you can get the response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        #elif isinstance(failure.value, DNSLookupError):
        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        #elif isinstance(failure.value, TimeoutError):
        elif failure.check(TimeoutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

And the output on the console (with only 1 retry and a 5-second download timeout):

$scrapy runspider errbacks.py --set DOWNLOAD_TIMEOUT=5 --set RETRY_TIMES=1
2015-06-30 23:45:55 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-06-30 23:45:55 [scrapy] INFO: Optional features available: ssl, http11
2015-06-30 23:45:55 [scrapy] INFO: Overridden settings: {'DOWNLOAD_TIMEOUT': '5', 'RETRY_TIMES': '1'}
2015-06-30 23:45:56 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-06-30 23:45:56 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-06-30 23:45:56 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-06-30 23:45:56 [scrapy] INFO: Enabled item pipelines: 
2015-06-30 23:45:56 [scrapy] INFO: Spider opened
2015-06-30 23:45:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-06-30 23:45:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www./> (failed 1 times): DNS lookup failed: address 'www.' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [scrapy] DEBUG: Gave up retrying <GET http://www./> (failed 2 times): DNS lookup failed: address 'www.' not found: [Errno -5] No address associated with hostname.
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.DNSLookupError'>>
2015-06-30 23:45:56 [errbacks] ERROR: DNSLookupError on http://www./
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (200) <GET http://www.httpbin.org/> (referer: None)
2015-06-30 23:45:56 [scrapy] DEBUG: Crawled (404) <GET http://www.httpbin.org/status/404> (referer: None)
2015-06-30 23:45:56 [errbacks] ERROR: Got successful response from http://www.httpbin.org/
2015-06-30 23:45:56 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:56 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/404
2015-06-30 23:45:56 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org/status/500> (failed 1 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org/status/500> (failed 2 times): 500 Internal Server Error
2015-06-30 23:45:57 [scrapy] DEBUG: Crawled (500) <GET http://www.httpbin.org/status/500> (referer: None)
2015-06-30 23:45:57 [errbacks] ERROR: <twisted.python.failure.Failure <class 'scrapy.spidermiddlewares.httperror.HttpError'>>
2015-06-30 23:45:57 [errbacks] ERROR: HttpError on http://www.httpbin.org/status/500
2015-06-30 23:46:01 [scrapy] DEBUG: Retrying <GET http://www.httpbin.org:12345/> (failed 1 times): User timeout caused connection failure.
2015-06-30 23:46:06 [scrapy] DEBUG: Gave up retrying <GET http://www.httpbin.org:12345/> (failed 2 times): User timeout caused connection failure.
2015-06-30 23:46:06 [errbacks] ERROR: <twisted.python.failure.Failure <class 'twisted.internet.error.TimeoutError'>>
2015-06-30 23:46:06 [errbacks] ERROR: TimeoutError on http://www.httpbin.org:12345/
2015-06-30 23:46:06 [scrapy] INFO: Closing spider (finished)
2015-06-30 23:46:06 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
 'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
 'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
 'downloader/request_bytes': 1748,
 'downloader/request_count': 8,
 'downloader/request_method_count/GET': 8,
 'downloader/response_bytes': 12506,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'downloader/response_status_count/500': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 6, 30, 21, 46, 6, 537191),
 'log_count/DEBUG': 10,
 'log_count/ERROR': 9,
 'log_count/INFO': 7,
 'response_received_count': 3,
 'scheduler/dequeued': 8,
 'scheduler/dequeued/memory': 8,
 'scheduler/enqueued': 8,
 'scheduler/enqueued/memory': 8,
 'start_time': datetime.datetime(2015, 6, 30, 21, 45, 56, 322235)}
2015-06-30 23:46:06 [scrapy] INFO: Spider closed (finished)

Note how Scrapy records the exceptions in its stats:

'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 2,
'downloader/exception_type_count/twisted.internet.error.TimeoutError': 2,
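
Once the errback is in place, "doing something" about the timeout is up to you. Below is a minimal sketch of one option: remember which requests timed out and deal with them when the spider finishes. The spider name, the on_error/failed_urls names and the decision to simply re-log the URLs at the end are assumptions made for this illustration; failure.check() and the closed() hook are standard Scrapy/Twisted.

# -*- coding: utf-8 -*-
# timeout_aware.py -- illustrative sketch only, not part of the original answer
import scrapy

from twisted.internet.error import TimeoutError


class TimeoutAwareSpider(scrapy.Spider):
    name = "timeout_aware"
    # a non-responding port, so a timeout is expected (same trick as above)
    start_urls = ["http://www.httpbin.org:12345/"]

    def __init__(self, *args, **kwargs):
        super(TimeoutAwareSpider, self).__init__(*args, **kwargs)
        self.failed_urls = []

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_page,
                                    errback=self.on_error,
                                    dont_filter=True)

    def parse_page(self, response):
        self.logger.info('Got %s', response.url)

    def on_error(self, failure):
        # react only to download timeouts; anything else is just logged
        if failure.check(TimeoutError):
            url = failure.request.url
            self.logger.warning('Timeout on %s', url)
            # remember the URL so we can act on it once the crawl is over
            self.failed_urls.append(url)
        else:
            self.logger.error(repr(failure))

    def closed(self, reason):
        # called when the spider closes: write the URLs to a file,
        # push them to a queue, schedule a re-run, etc.
        for url in self.failed_urls:
            self.logger.info('Never fetched because of a timeout: %s', url)

The exception counters shown above are also available programmatically through the stats collector (for example self.crawler.stats.get_value('downloader/exception_type_count/twisted.internet.error.TimeoutError')), so you could base your follow-up logic on those counters instead of collecting the URLs yourself.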
Source: https://www./content-1-485301.html
