
Python 3 + Selenium: scraping several tables from a web page and exporting them to Excel

 老三的休闲书屋 2020-12-05

1. First, the URL I need to scrape data from is:

'https://mtj.baidu.com/data/mobile/device'

2. Opening that URL in a browser shows the page to be scraped.

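3. I want to scrape the tables on the right-hand side for five categories: brand, device model, operating system, screen resolution, and network type.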

4. Using the PyCharm IDE, install the selenium module:

    pip install selenium    (tick the pip option when installing Python so that pip is available)
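
Before running the script, it can help to confirm that the modules it imports are installed in the interpreter PyCharm is using. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and `xlwt` is listed because the script in step 5 imports it to write the .xls file.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # selenium drives the browser; xlwt writes the .xls file
    missing = missing_modules(["selenium", "xlwt"])
    if missing:
        print("Install first: pip install " + " ".join(missing))
    else:
        print("All required modules are available")
```

If anything is reported missing, install it with pip in the same interpreter and run the check again.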

5. The code is as follows:

    #!/usr/bin/env python
    # -*- coding: UTF-8 -*-
    import time

    import xlwt
    from selenium import webdriver
    # find_element(By...) replaces find_element_by_class_name, which newer
    # Selenium releases no longer provide
    from selenium.webdriver.common.by import By

    url = 'https://mtj.baidu.com/data/mobile/device'


    def wait(class_name):
        """Click the first element with this class, retrying for about 5 seconds."""
        for _ in range(10):
            # noinspection PyBroadException
            try:
                browser.find_element(By.CLASS_NAME, class_name).click()
                break
            except Exception:
                time.sleep(0.5)


    def waits(class_name):
        """Return all elements with this class, retrying while none are found."""
        elements = []
        for _ in range(10):
            # find_elements returns an empty list (it does not raise) when nothing matches
            elements = browser.find_elements(By.CLASS_NAME, class_name)
            if elements:
                break
            time.sleep(0.5)
        return elements


    def save_data(data):
        """Write one sheet per category into a new .xls workbook."""
        file_name = '百度研究学院移动平台.xls'
        # Create a new Excel workbook
        wb = xlwt.Workbook(encoding='utf-8')
        for key in data:
            sheet = wb.add_sheet(key, cell_overwrite_ok=True)
            # Header row: category name and '占比' (share)
            for col, head in enumerate([key, '占比']):
                sheet.write(0, col, head)
            # data[key] is a flat list [name, share, name, share, ...]:
            # item i goes to row i // 2 + 1, column i % 2
            for i, value in enumerate(data[key]):
                sheet.write(i // 2 + 1, i % 2, value)
        wb.save(file_name)


    def wait_refresh():
        """Reload the page and report whether the refresh worked."""
        try:
            browser.refresh()
            print('test pass: refresh successful')
            time.sleep(1)
        except Exception as e:
            print('Exception found', format(e))


    def get_data():
        """Click each category button and collect the table shown for it."""
        # class_name of the five category buttons:
        # brand, device model, OS, screen resolution, network
        list_button = ['icon-brand', 'icon-device', 'icon-os', 'icon-screen', 'icon-network']
        # Dictionary holding all the data, one flat list per category
        data = {button: [] for button in list_button}
        for button in list_button:
            print('************', button, '********************')
            wait(button)
            time.sleep(2)  # give the table time to redraw after the click
            element_name = waits('dtd1')
            element_rank = browser.find_elements(By.CLASS_NAME, 'dtd3')
            name_list = [name.get_attribute('textContent') for name in element_name]
            rank_list = [rank.get_attribute('textContent') for rank in element_rank]
            # zip stops at the shorter list, so a length mismatch cannot raise IndexError
            for name, rank in zip(name_list, rank_list):
                data[button].append(name)
                data[button].append(rank)
        print(data)
        return data


    ######################################################################################
    # Open the browser (Chrome must be installed)
    browser = webdriver.Chrome()
    # Maximize the window
    browser.maximize_window()
    # Load the page
    browser.get(url)
    # Collect the data
    dict_data = get_data()
    # Write it to the spreadsheet
    save_data(dict_data)
    # Close the browser when done
    browser.quit()
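
The `wait` helper in the script is a polling loop: try an action, and if it raises, sleep briefly and try again a fixed number of times. The same pattern can be sketched without a browser, which makes it easy to test in isolation. The names `retry`, `attempts`, and `delay` are hypothetical and not part of Selenium. Unlike the script's loop, this sketch re-raises the last error when every attempt fails.

```python
import time


def retry(action, attempts=10, delay=0.5):
    """Call `action` until it succeeds and return its result.

    Re-raises the last exception if every attempt fails, so a missing
    element is reported instead of being silently skipped.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as error:  # broad on purpose, like the script's loop
            last_error = error
            time.sleep(delay)
    raise last_error


if __name__ == "__main__":
    calls = []

    def flaky():
        # Fails twice, then succeeds, like a button that is slow to render
        calls.append(1)
        if len(calls) < 3:
            raise RuntimeError("not ready")
        return "clicked"

    print(retry(flaky, delay=0.01))
```

With Selenium, the click could be wrapped as `retry(lambda: browser.find_element(By.CLASS_NAME, 'icon-brand').click())`, assuming `By` is imported from `selenium.webdriver.common.by`. Selenium also ships `WebDriverWait` for waiting on page conditions, which may be a better fit than a hand-written loop.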

6. The generated workbook has one sheet per category, each with two columns: the name and its share (占比).
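
Because `get_data` stores every category as a flat list `[name, share, name, share, ...]`, `save_data` has to turn a list index into a row and a column. The following is a minimal sketch of that mapping. `flat_to_cells` is a hypothetical helper, and the sample values are made up for illustration rather than scraped.

```python
def flat_to_cells(flat, header):
    """Map a flat [name, share, name, share, ...] list to (row, col, value) cells.

    Row 0 holds the header; item i of the flat list lands in
    row i // 2 + 1, column i % 2.
    """
    cells = [(0, col, title) for col, title in enumerate(header)]
    for i, value in enumerate(flat):
        cells.append((i // 2 + 1, i % 2, value))
    return cells


if __name__ == "__main__":
    # Made-up sample values, not real scraped data
    sample = ["BrandA", "40%", "BrandB", "25%"]
    for cell in flat_to_cells(sample, ["icon-brand", "占比"]):
        print(cell)
```

Each tuple has the same argument order as xlwt's `sheet.write(row, col, value)`, so the layout can be checked without opening Excel.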
