https://www./post/109470.html

Step 1: Change the User-Agent on every request

First install fake_useragent through pip with: pip install fake_useragent. After it is installed you will of course want to check that it works and see its basic usage:

import fake_useragent

def UserAgent():
    user = fake_useragent.UserAgent()
    headers = {"User-Agent": "{}".format(user.random)}
    return headers

print(UserAgent())
# {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36'}

The output is random on every call, and that is all it takes!

Step 2: Randomize the delay between requests (within a range)

Put time.sleep(random.randint(1, 3)) inside the request loop. Note that randint is inclusive at both ends, so (1, 3) sleeps 1, 2, or 3 seconds before each run:

import random
import time

import requests

def RequestSleep(url):
    for i in range(30):
        time.sleep(random.randint(1, 3))
        html_file = requests.get(url)
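Combining the two steps, every request can get both a fresh User-Agent and a random pause; here is a minimal sketch under those assumptions (polite_get is a hypothetical helper, not part of the original script):

import random
import time

import requests
from fake_useragent import UserAgent

ua = UserAgent()

def polite_get(url):
    # Step 1: fresh random User-Agent for this request
    headers = {"User-Agent": ua.random}
    # Step 2: sleep a random 1-3 seconds before hitting the server
    time.sleep(random.randint(1, 3))
    return requests.get(url, headers=headers, timeout=5)

print(polite_get("https://www.baidu.com/").status_code)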
Step 3: Send every request through a proxy IP

Firing a large number of requests at a site within a short window can get your IP temporarily banned, and switching to proxy IPs is a good way around that. Searching Baidu for proxy IPs turns up many sites that publish them; you can scrape those free proxies into a text file for later use, though they are often slow to respond or already dead, so renting a paid proxy pool gives better results.

By how much a proxy reveals about the client, proxies fall into three classes:

Transparent proxy: the target server knows you are using a proxy and also knows your real IP.
Anonymous proxy: the target server knows you are using a proxy but does not know your real IP.
Elite (high-anonymity) proxy: the target server knows neither that you are using a proxy nor your real IP.

For the deepest disguise, my crawler therefore uses elite proxy IPs. A quick look at how a proxy is actually attached to a request comes first (see the sketch below), followed by the code I wrote:
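In requests, a proxy is passed through the proxies argument as a {scheme: url} mapping rather than through the request headers; a minimal sketch, assuming a made-up proxy address 1.2.3.4:8080 and httpbin.org as a test target:

import requests

# Hypothetical elite proxy, keyed by lowercase scheme as requests expects
proxies = {"http": "http://1.2.3.4:8080"}

resp = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=5)
print(resp.text)  # should report the proxy's address, not your own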
""" 1、请求头的随机生成 2、返回西刺的ip池列表,并进行选择出 ip地址、端口,高匿,加密方式 3、再次筛选,通过初次proxie填入筛选出能够使用的proxie并存于列表 4、前半部可以循环page次数,后者用于proxies 访问百度,status_code =200 则存入列,先暂且以txt文本保存 """
from random import randint, choice
from fake_useragent import UserAgent
from lxml import etree
import requests
def RandomRequestHeader():
    """Build request headers with a random User-Agent. Call directly."""
    usa = UserAgent()
    header = {"User-Agent": "{}".format(usa.random)}
    return header
def RequestWebFile(XICIurl):
    """Scrape one XiCi listing page given its url.
    Returns a flat list of IP address, port, anonymity level and HTTP(S) scheme."""
    headers = RandomRequestHeader()
    web_url = requests.get("{}".format(XICIurl), headers=headers, timeout=randint(1, 3))
    file = web_url.text
    html = etree.HTML(file)
    RoughScreen = html.xpath("//tr[@class='odd' or @class='']/td/text()")
    FirstArrangement = []
    for strs in RoughScreen:
        if strs.isdigit():        # port numbers
            FirstArrangement.append(strs)
        elif strs.isalpha():      # anonymity level and HTTP/HTTPS scheme
            FirstArrangement.append(strs)
        elif "." in strs:         # IP addresses
            FirstArrangement.append(strs)
    return FirstArrangement  # flat list: ip, port, anonymity, scheme, ip, port, ...
def ScreeningTest(*ListTable):
    """First pass: try each proxy against a test site and keep the responsive ones.
    Call: print(ScreeningTest(*RequestWebFile("https://www./nn/")))"""
    MayUseIp = []
    headers = RandomRequestHeader()
    FileList = ListTable
    for x in range(0, len(FileList) - 3, 4):
        # requests expects a {scheme: "scheme://ip:port"} mapping with a lowercase key
        Proxies = {FileList[x + 3].lower(): "{2}://{0}:{1}".format(
            FileList[x], FileList[x + 1], FileList[x + 3].lower())}
        try:
            requests.get("http://wenshu.court.gov.cn/", headers=headers,
                         proxies=Proxies, timeout=5)
        except requests.RequestException:
            pass  # dead or slow proxy, skip it
        else:
            MayUseIp.append(Proxies)
    return MayUseIp
def VerficationProxies():
    """Second pass: hit Baidu through every surviving proxy and save the ones
    that return 200 to proxies.txt. Call directly."""
    headers = RandomRequestHeader()
    ProxiesList = []
    for page in range(1, 5):
        # Listing pages are numbered .../nn/1, .../nn/2, ...
        ProxiesL = ScreeningTest(*RequestWebFile("https://www./nn/{}".format(page)))
        for proxies in ProxiesL:
            try:
                web_url = requests.get("https://www.baidu.com/", headers=headers,
                                       proxies=proxies, timeout=5)
            except requests.RequestException:
                continue
            web_url.encoding = "utf-8"
            if web_url.status_code == 200:
                print(proxies)
                ProxiesList.append(proxies)
    with open("proxies.txt", "w+", encoding="utf-8") as file:
        file.write(str(ProxiesList))
    print(len(ProxiesList))
    return ProxiesList
VerficationProxies()
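Because the script serializes the list with str(), proxies.txt ends up holding a Python literal; a minimal sketch of loading it back for a later crawl, assuming the file was produced by the script above:

import ast
from random import choice

with open("proxies.txt", encoding="utf-8") as f:
    proxies_list = ast.literal_eval(f.read())  # back to a list of {scheme: url} dicts

# Pick one verified proxy at random for the next request
if proxies_list:
    print(choice(proxies_list))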