Risks of Web Crawlers

laical1 2020-09-27

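As the internet has grown, web crawlers have become more and more common. Crawling is a general-purpose network technique, so the technique itself is not illegal. Using it for illegal projects, such as pornography or gambling operations, is a different matter: once discovered, that use falls under what the law prohibits.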

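Types of crawlers:

1. For small amounts of data where crawl speed does not matter, the requests library is enough to build a page crawler.

2. For larger data volumes where crawl speed matters, the Scrapy framework can be used for page collection.

3. Large-scale data collection, such as e-commerce or search engine crawlers, needs a dedicated development team.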

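Problems crawlers raise:

Performance harassment: a crawler hits the server far faster than a human visitor would, which amounts to harassment for the site's operators.

Legal risk: the data on every website has an owner. Profiting from data obtained through a crawler carries a degree of legal risk.

Privacy leaks: a crawler may get past a site's access restrictions to obtain data, which leaks private information held by the site.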

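How to avoid these risks:

1. Follow the site's robots protocol (robots.txt).

2. Optimize the crawler program, for example by lowering its request rate so it puts less load on the target site.

3. Do not collect personal or private information.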
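The first two points can be sketched with the standard library alone. This is a minimal illustration, not a complete crawler: the robots.txt content, the `demo-crawler` agent name, the example.com URLs, and the helpers `build_parser`, `allowed_urls`, and `crawl` are all assumptions made for the example. The robots.txt text is inlined so the sketch runs without network access; a real crawler would load it from the target site.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def allowed_urls(parser, user_agent, urls):
    """Keep only the URLs that robots.txt allows this user agent to fetch."""
    return [u for u in urls if parser.can_fetch(user_agent, u)]


def crawl(urls, fetch, delay_seconds):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


parser = build_parser(ROBOTS_TXT)
candidates = [
    "https://example.com/index.html",
    "https://example.com/private/profile",
]
to_fetch = allowed_urls(parser, "demo-crawler", candidates)

# Use the site's Crawl-delay when it declares one, else an assumed 1 second.
delay = parser.crawl_delay("demo-crawler") or 1.0
```

In practice `fetch` could be a small wrapper around `requests.get`, called as `crawl(to_fetch, fetch, delay)`. Keeping the delay outside the fetch function makes it easy to slow the crawler down without touching the request code.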

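Anti-crawler restrictions:

Most sites have some form of anti-crawler restriction. When a request arrives, the target site checks the HTTP User-Agent header, because the UA identifies the browser. Requests with no UA, or traffic that shows too few distinct UA values, will be flagged as anomalous in the site operators' statistics. In that case, add a User-Agent header so the request identifies itself as coming from a browser.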

import requests

url = 'https://example.com/'  # placeholder: the page to fetch

# Browser-like User-Agent plus a Referer header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.1276.73 Safari/537.36',
    'Referer': 'https://www.baidu.com/s?ie=utf-8&f=8&rsv_bp=1&rsv_idx=1&tn=baidu&wd=nike',
}
response = requests.get(url=url, headers=headers)
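The request above always sends the same User-Agent. Since traffic with too few distinct UA values can also stand out, one option is to pick the UA from a small pool for each request. The sketch below is only an illustration: the `USER_AGENTS` list and the `build_headers` helper are hypothetical, and the UA strings are sample values rather than a maintained list.

```python
import random

# Illustrative User-Agent strings; a real pool would be maintained separately.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/85.0.4183.121 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/14.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:81.0) Gecko/20100101 Firefox/81.0",
]


def build_headers(referer=None):
    """Return request headers with a randomly chosen User-Agent."""
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    if referer:
        # Only set Referer when the caller supplies one.
        headers["Referer"] = referer
    return headers
```

The result can be passed straight to requests, for example `requests.get(url, headers=build_headers())`.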

#! -*- encoding:utf-8 -*-
import requests
import random

# Target page to fetch
targetUrl = "http:///ip"
# Target HTTPS page to fetch
# targetUrl = "https:///ip"

# Proxy server (product site: www.16yun.cn)
proxyHost = "t.16yun.cn"
proxyPort = "31111"

# Proxy tunnel credentials
proxyUser = "username"
proxyPass = "password"

proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass,
}

# Use the HTTP proxy for both http and https requests
proxies = {
    "http": proxyMeta,
    "https": proxyMeta,
}

# Set the IP-switching header
tunnel = random.randint(1, 10000)
headers = {"Proxy-Tunnel": str(tunnel)}

# Send the request through the proxy and print the result
resp = requests.get(targetUrl, proxies=proxies, headers=headers)
print(resp.status_code)
print(resp.text)