
A must-have for Python crawlers: build your own IP proxy pool and stop fearing IP bans.

 小米VIP 2020-08-21

Why use proxy IPs

Many websites deploy anti-crawling measures, and the most common one is limiting how often a single IP may visit. Once a site bans your local IP address, you need to switch to a proxy to keep crawling.
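For context, routing a single requests call through a proxy looks like this (a throwaway sketch: 1.2.3.4:8080 is a placeholder address, and httpbin.org is just a convenient echo service):

  import requests

  # 1.2.3.4:8080 is a placeholder proxy, not a live one
  proxies = {"http": "http://1.2.3.4:8080", "https": "https://1.2.3.4:8080"}
  response = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=5)
  print(response.text)  # the reported origin IP is now the proxy's, not yours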

Development approach

1. Use the local IP to scrape the first batch of seed proxy IPs

Scraping proxy IPs from a proxy-listing site is itself a crawl, and firing too many requests in a short time will get us blocked. So we use the local IP to grab the first batch of proxy IPs, then use those proxies to scrape new ones.
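A minimal sketch of that bootstrap logic might look like the following; fetch_proxy_page uses a crude regex in place of the project's real page parsing, and the validation check is passed in as a callable:

  import random
  import re
  import requests

  IP_PORT = re.compile(r"(\d{1,3}(?:\.\d{1,3}){3})\D+(\d{2,5})")

  def fetch_proxy_page(url, proxy=None):
      # Fetch one listing page, through `proxy` ("ip:port") when given,
      # and pull out ip:port pairs with a crude regex (stand-in parser).
      proxies = {"https": "https://" + proxy} if proxy else None
      html = requests.get(url, proxies=proxies, timeout=5).text
      return ["%s:%s" % pair for pair in IP_PORT.findall(html)]

  def bootstrap(urls, validate, target=20):
      # The first page is fetched with the local IP (proxy=None);
      # once the pool has members, later pages go through a random proxy.
      pool = []
      for url in urls:
          proxy = random.choice(pool) if pool else None
          try:
              candidates = fetch_proxy_page(url, proxy)
          except requests.RequestException:
              continue  # blocked or dead proxy; just move on
          pool.extend(p for p in candidates if validate(p))
          if len(pool) >= target:
              break
      return pool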

2. Validate the first batch of seed proxy IPs and store them in the database

We create two tables in the IP.db database: proxy_ip_table (stores every scraped IP, used to check that the scraping itself works) and validation_ip_table (stores every IP that passed validation, used to check IP validity).

The proxy IPs obtained in step 1 are stored into validation_ip_table once they pass the check, which is implemented as follows:

  def ip_validation(self, ip):
      # Anonymity check: a non-elite ("高匿") proxy still gives away your real IP
      anonymity_flag = False
      if "高匿" in str(ip):
          anonymity_flag = True
      IP = str(ip[0]) + ":" + str(ip[1])
      url = "http:///get"  # site used to test whether the proxy IP works
      proxies = {"https": "https://" + IP}  # not sure why https works here and http doesn't
      headers = FakeHeaders().random_headers_for_validation()
      # Availability check
      validation_flag = True
      response = None
      try:
          response = requests.get(url=url, headers=headers, proxies=proxies, timeout=5)
      except:
          validation_flag = False
      if response is None:
          validation_flag = False
      if anonymity_flag and validation_flag:
          return True
      else:
          return False
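Called on a raw scraped record, the check slots in like this (a hypothetical record: the five fields mirror the IP/PORT/ADDRESS/TYPE/PROTOCOL schema from DatabaseTable.py below, and ip_validation is assumed to be a method of the Crawl class introduced later):

  # hypothetical record in the crawler's row format: (IP, PORT, ADDRESS, TYPE, PROTOCOL)
  record = ("1.2.3.4", 8080, "Beijing", "高匿", "HTTP")
  if Crawl().ip_validation(record):
      IPPool("validation_ip_table").insert([record])  # only validated IPs go in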

3. Build the list of URLs to visit and crawl them in a loop; each scraped ip_list is validated and then stored in the database tables

We build the list of URLs to visit:

  self.URLs = ["https://www./nn/%d" % (index + 1) for index in range(100)]
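Each pass over this list then repeats one pattern: fetch a page through a proxy from the pool, parse it, validate, store. A schematic version of the loop body, using the FakeHeaders and IPPool helpers introduced below (parse_page is a hypothetical stand-in for the real listing-page parser):

  # schematic crawl loop; parse_page is a hypothetical stand-in parser
  for url in self.URLs:
      headers = FakeHeaders().random_headers_for_xici()
      proxy = IPPool("validation_ip_table").select(random_flag=True)
      proxies = {"https": "https://%s:%s" % (proxy[0], proxy[1])} if proxy else None
      try:
          html = requests.get(url, headers=headers, proxies=proxies, timeout=5).text
      except requests.RequestException:
          continue
      ip_list = parse_page(html)                   # hypothetical parser
      IPPool("proxy_ip_table").insert(ip_list)     # everything scraped
      good = [ip for ip in ip_list if self.ip_validation(ip)]
      IPPool("validation_ip_table").insert(good)   # only the validated IPs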

Modules included

1. RandomHeaders.py

Constructs random request headers to mimic different web browsers. Usage:

  from RandomHeaders import FakeHeaders
  # returns request headers for the xici proxy site
  xici_headers = FakeHeaders().random_headers_for_xici()

2. DatabaseTable.py

Provides table creation plus insert, delete, and select operations on the database. Usage:

  from DatabaseTable import IPPool
  tablename = "proxy_ip_table"
  # tablename can also be "validation_ip_table"
  IPPool(tablename).create()  # create the table
  IPPool(tablename).select(random_flag=False)
  # random_flag=True returns one random record; otherwise all records
  IPPool(tablename).delete(delete_all=True)  # delete all records

3. GetProxyIP.py

The core code; several functions implement the different pieces of functionality:

  • Starting from scratch: create the tables, scrape IPs, and store them in the database.

  from GetProxyIP import Crawl
  Crawl().original_run()

  • When there are not enough proxy IPs left, scrape more from a url_list of other proxy sites and store the usable IPs (a combined usage sketch follows this list).

  from GetProxyIP import Crawl
  # another site that publishes free proxy IPs
  url_kuaidaili = ["https://www./free/inha/%d" % (index + 1) for index in range(10, 20)]
  Crawl().get_more_run(url_kuaidaili)


  • When the pool has gone unused for too long, the IPs need to be re-validated and the ones that fail removed.

  from GetProxyIP import Crawl
  Crawl().proxy_ip_validation()
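Putting the last two together, a caller might top up and re-validate the pool like this (a usage sketch; the threshold of 50 is an arbitrary choice for illustration):

  from DatabaseTable import IPPool
  from GetProxyIP import Crawl

  pool = IPPool("validation_ip_table")
  # Top up only when the validated pool runs low; 50 is an arbitrary threshold.
  if len(pool.select()) < 50:
      url_list = ["https://www./free/inha/%d" % (index + 1) for index in range(10, 20)]
      Crawl().get_more_run(url_list)
  # After a long idle stretch, re-check every stored IP and drop the dead ones.
  Crawl().proxy_ip_validation()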

 

 

Selected code

1. RandomHeaders.py

Provides random request headers that imitate browser visits, to cope with anti-crawling measures.

  # -*- coding: utf-8 -*-
  """
  Created on Tue Jan 29 10:36:28 2019
  @author: YANG
  Purpose: generate random request headers to simulate visits from different browsers
  """
  import random
  from fake_useragent import UserAgent

  class FakeHeaders(object):
      """
      Generates random request headers
      """
      def __init__(self):
          # Built-in User-Agent pool (the header builders below actually draw
          # from fake_useragent's UserAgent().random)
          self.__UA = [
              "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
              "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;",
              "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070803 Firefox/1.5.0.12",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.11 TaoBrowser/2.0 Safari/536.11",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
              "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/17.17134",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
              "Opera/9.27 (Windows NT 5.2; U; zh-cn)",
              "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
              "Opera/9.80 (Windows NT 5.1; U; zh-cn) Presto/2.9.168 Version/11.50",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Maxthon/4.9.2.1000 Chrome/39.0.2146.0Safari/537.36",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36",
              "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
              "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E)",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 BIDUBrowser/8.3 Safari/537.36",
              "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
              "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; 360SE)",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36",
              "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:46.0) Gecko/20100101 Firefox/46.0",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.154 Safari/537.36 LBBROWSER",
              "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.89 Safari/537.1",
              "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0",
              "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2486.0 Safari/537.36 Edge/13.10586",
              "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.87 Safari/537.36 OPR/37.0.2178.32",
              "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; InfoPath.3; rv:11.0) like Gecko",
              "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
              "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
              "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
              "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.84 Safari/535.11 SE 2.X MetaSr 1.0",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
              "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36 Core/1.47.277.400 QQBrowser/9.4.7658.400",
              "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 UBrowser/5.6.12150.8 Safari/537.36",
              "Mozilla/5.0 (Windows; U; Windows NT 5.1) Gecko/20070309 Firefox/2.0.0.3",
              "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SV1; QQDownload 732; .NET4.0C; .NET4.0E; SE 2.X MetaSr 1.0)",
              "Mozilla/4.0 (compatible; MSIE 12.0",
              "Mozilla/5.0 (Windows; U; Windows NT 5.2) Gecko/2008070208 Firefox/3.0.1",
              "Mozilla/5.0 (Windows NT 5.2) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.122 Safari/534.30",
              "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36",
              "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 10.0; WOW64; Trident/7.0; Touch; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; Tablet PC 2.0)",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/21.0.1180.71 Safari/537.1 LBBROWSER",
              "Mozilla/5.0 (Windows NT 10.0; WOW64; rv:38.0) Gecko/20100101 Firefox/38.0",
              "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0)",
              "Mozilla/5.0 (Windows NT 5.1; rv:44.0) Gecko/20100101 Firefox/44.0",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36 TheWorld 7",
              "Mozilla/5.0 (Windows NT 6.1; rv,2.0.1) Gecko/20100101 Firefox/4.0.1",
              "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 Safari/537.36 SE2.X MetaSr 1.0",
          ]

      # A User-Agent mainly identifies the browser type and version,
      # the operating system and version, and the browser engine
      def random_headers_for_xici(self):
          headers = {
              "User-Agent": UserAgent().random,  # pick a random UA
              "Accept-Language": "zh-CN,zh;q=0.9,en-US;q=0.8,en;q=0.7",
              "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
              "Accept-Encoding": "gzip, deflate, br",
              "Cache-Control": "max-age=0",
              "Connection": "keep-alive",
              "Host": "www.",
              "Upgrade-Insecure-Requests": "1"
          }
          return headers

      def random_headers_for_validation(self):
          headers = {
              "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
              "Accept-Encoding": "gzip, deflate",
              "Accept-Language": "zh-CN,zh;q=0.9",
              "Connection": "close",
              "Host": "",
              "Upgrade-Insecure-Requests": "1",
              "User-Agent": UserAgent().random}
          return headers

  if __name__ == "__main__":
      print("20 randomly drawn headers:")
      for i in range(20):
          print(FakeHeaders().random_headers_for_xici())

2. DatabaseTable.py

Provides the database functionality; IP.db is the database that stores the IPs.

  import sqlite3  # lets Python programs use SQLite databases
  import time

  class IPPool(object):
      # Wraps the database that stores the IPs, which holds the two tables
      # proxy_ip_table and validation_ip_table;
      # the insert statement mirrors the schema defined in create()
      def __init__(self, table_name):
          # initialize the class with the table_name to operate on
          self.__table_name = table_name
          self.__database_name = "IP.db"  # IPPool works on the IP.db database

      def create(self):
          conn = sqlite3.connect(self.__database_name, isolation_level=None)
          conn.execute(
              "create table if not exists %s(IP CHAR(20) UNIQUE, PORT INTEGER, ADDRESS CHAR(50), TYPE CHAR(50), PROTOCOL CHAR(50))"
              % self.__table_name)
          print("Table %s created in database IP.db" % self.__table_name)
          conn.close()

      def insert(self, ip):
          conn = sqlite3.connect(self.__database_name, isolation_level=None)
          # isolation_level is the transaction isolation level; by default you must
          # commit() changes yourself, while None makes every statement auto-commit
          for one in ip:
              conn.execute(
                  "insert or ignore into %s(IP, PORT, ADDRESS, TYPE, PROTOCOL) values (?,?,?,?,?)"
                  % (self.__table_name),
                  (one[0], one[1], one[2], one[3], one[4]))
          conn.commit()  # explicit commit; redundant since isolation_level is None
          conn.close()

      def select(self, random_flag=False):
          conn = sqlite3.connect(self.__database_name, isolation_level=None)
          cur = conn.cursor()  # the cursor receives the query results
          if random_flag:
              # with random_flag=True, pick one random record and return it
              cur.execute(
                  "select * from %s order by random() limit 1"
                  % self.__table_name)
              result = cur.fetchone()
          else:
              cur.execute("select * from %s" % self.__table_name)
              result = cur.fetchall()
          cur.close()
          conn.close()
          return result

      def delete(self, IP=('1', 1, '1', '1', '1'), delete_all=False):
          conn = sqlite3.connect(self.__database_name, isolation_level=None)
          if not delete_all:
              n = conn.execute("delete from %s where IP=?" % self.__table_name,
                               (IP[0],))
              # the trailing comma is required for a one-element tuple
              print("Deleted", n.rowcount, "row(s)")
          else:
              n = conn.execute("delete from %s" % self.__table_name)
              print("Deleted all records,", n.rowcount, "row(s) in total")
          conn.close()
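As a quick sanity check, a round trip through the class might look like this (the record values are made up for illustration):

  from DatabaseTable import IPPool

  pool = IPPool("proxy_ip_table")
  pool.create()
  # insert() expects an iterable of (IP, PORT, ADDRESS, TYPE, PROTOCOL) tuples
  pool.insert([("1.2.3.4", 8080, "Beijing", "高匿", "HTTP")])  # made-up record
  print(pool.select(random_flag=True))  # one random record, or None if empty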

