Task and goal: automatically fetch the titles and hot-search indices from Baidu's real-time hot search list.

![]()

Title:

    <div class="c-single-text-ellipsis"> 东部战区台岛战巡演练模拟动画 <!--48--></div>

![]()

Hot-search index:

    <div class="hot-index_1Bl1a"> 4946724 </div>

Step 1: enter the following prompt in deepseek:

    You are a Python web-scraping expert. Write a Python script that completes the following scraping task:

    Create a new Excel file, topbaidu.xlsx, in the folder F:\aivideo
    Set the chromedriver path to: "D:\Program Files\chromedriver125\chromedriver.exe"
    Use selenium to open the page: https://top.baidu.com/board?tab=realtime

    The request headers are:
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7
    Accept-Encoding: gzip, deflate, br, zstd
    Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
    Cache-Control: max-age=0
    Connection: keep-alive
    Host: top.baidu.com
    Referer: https://top.baidu.com/board?platform=pc&tab=homepage&sa=pc_index_homepage_all
    Sec-Ch-Ua: "Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"
    Sec-Ch-Ua-Mobile: ?0
    Sec-Ch-Ua-Platform: "Windows"
    Sec-Fetch-Dest: document
    Sec-Fetch-Mode: navigate
    Sec-Fetch-Site: same-origin
    Sec-Fetch-User: ?1
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36

    Parse the page source and print it.
    Locate the div tags with class="c-single-text-ellipsis", extract their text as the hot-search title, and save it to column 1 of topbaidu.xlsx.
    Locate the div tags with class="hot-index_1Bl1a", extract their text as the hot-search index, and save it to column 2 of topbaidu.xlsx.

    Notes:
    Print progress information to the screen at every step.
    Pause for a random 1-10 seconds after each page is parsed.
    Set the request headers to cope with anti-scraping measures.
    DataFrame.append was deprecated in pandas 1.4.0 and removed in pandas 2.0. To avoid the problem, use the concat function instead of append.

![]()

Step 2: review the Python code that deepseek generated:

    import os
    import time
    import random
    import pandas as pd
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.chrome.options import Options

    # Path to chromedriver
    chromedriver_path = "D:\\Program Files\\chromedriver125\\chromedriver.exe"

    # Request headers. Only the user-agent entry is used below, passed to
    # Chrome as a command-line argument; the other entries are never sent.
    headers = {
        'authority': 'top.baidu.com',
        'method': 'GET',
        'path': '/board?tab=realtime',
        'scheme': 'https',
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
        'accept-encoding': 'gzip, deflate, br, zstd',
        'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',
        'cache-control': 'max-age=0',
        'cookie': '__root_domain_v=.baidu.com; _qddaz=QD.484716194472545; _ntes_origin_from=sogou; _ntes_nuid=af525edefbd3d363cd5876dbe902d85b; s_n_f_l_n3=01cb466d244b7db51716517702302; _antanalysis_s_id=1716517702995; UserProvince=%u5168%u56FD; ne_analysis_trace_id=1716518441938; vinfo_n_f_l_n3=01cb466d244b7db5.1.0.1716517702302.0.1716518672850',
        'if-modified-since': 'Fri, 24 May 2024 02:40:05 GMT',
        'priority': 'u=0, i',
        'referer': 'https://www.sogou.com/link?url=hedJjaC291M4oaTBlXc5yCiioa5eDVgw',
        'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',
        'sec-ch-ua-mobile': '?0',
        'sec-ch-ua-platform': '"Windows"',
        'sec-fetch-dest': 'document',
        'sec-fetch-mode': 'navigate',
        'sec-fetch-site': 'cross-site',
        'sec-fetch-user': '?1',
        'upgrade-insecure-requests': '1',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'
    }

    # Chrome options
    chrome_options = Options()
    chrome_options.add_argument(f'user-agent={headers["user-agent"]}')

    # Chrome service
    service = Service(chromedriver_path)

    # Start the Chrome browser
    driver = webdriver.Chrome(service=service, options=chrome_options)

    # Open the page
    url = 'https://top.baidu.com/board?tab=realtime'
    driver.get(url)

    # Wait for the page to finish loading
    time.sleep(5)

    # Get the page source
    html = driver.page_source
    print("Page source:")
    print(html)

    # Create the Excel file with the two column headers
    excel_path = os.path.join('F:\\aivideo', 'topbaidu.xlsx')
    df = pd.DataFrame(columns=['Title', 'Hot index'])
    df.to_excel(excel_path, index=False)

    # Locate the hot-search titles and indices
    titles = driver.find_elements(By.CLASS_NAME, 'c-single-text-ellipsis')
    indices = driver.find_elements(By.CLASS_NAME, 'hot-index_1Bl1a')

    # Collect the titles and indices row by row
    data = []
    for title, index in zip(titles, indices):
        title_text = title.text
        index_text = index.text

        # Print progress information
        print(f"Title: {title_text}")
        print(f"Hot index: {index_text}")

        # Add the row to the list
        data.append({'Title': title_text, 'Hot index': index_text})

    # Random pause, once per parsed page as the prompt requests
    time.sleep(random.randint(1, 10))

    # Use concat to add the rows to the DataFrame, then write the file
    df = pd.concat([df, pd.DataFrame(data)], ignore_index=True)
    df.to_excel(excel_path, index=False)

    # Close the browser
    driver.quit()
    print("Hot-search data saved to the Excel file.")

Step 3: open Visual Studio Code, create a new .py file, paste the Python code into it, and press F5 to run the program.

Program output:

![]()

![]()
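The extraction step can also be tried without a browser, against the two div snippets quoted at the top of the article. The following is a minimal sketch using only the standard library's `html.parser`; the names `HotSearchParser` and `parse_hot_search` are made up for this illustration, and it assumes every hot index is a plain integer, as in the sample.

```python
from html.parser import HTMLParser

# Fragment copied from the page markup quoted in this article.
SAMPLE_HTML = """
<div class="c-single-text-ellipsis"> 东部战区台岛战巡演练模拟动画 <!--48--></div>
<div class="hot-index_1Bl1a"> 4946724 </div>
"""

TITLE_CLASS = "c-single-text-ellipsis"
INDEX_CLASS = "hot-index_1Bl1a"


class HotSearchParser(HTMLParser):
    """Collect the text of divs whose class list contains a target class."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self.indices = []
        self._target = None  # list currently being filled, or None
        self._buffer = []    # text pieces seen inside the current div
        self._depth = 0      # nesting depth of divs inside the target

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._target is not None:
            self._depth += 1  # nested div inside a div we are collecting
            return
        classes = (dict(attrs).get("class") or "").split()
        if TITLE_CLASS in classes:
            self._target = self.titles
        elif INDEX_CLASS in classes:
            self._target = self.indices
        if self._target is not None:
            self._buffer = []
            self._depth = 1

    def handle_data(self, data):
        if self._target is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or self._target is None:
            return
        self._depth -= 1
        if self._depth == 0:
            # Target div closed: store its stripped text and reset.
            self._target.append("".join(self._buffer).strip())
            self._target = None


def parse_hot_search(html):
    """Return (title, hot_index) pairs; assumes indices are plain digits."""
    parser = HotSearchParser()
    parser.feed(html)
    parser.close()
    return list(zip(parser.titles, (int(text) for text in parser.indices)))


print(parse_hot_search(SAMPLE_HTML))
```

The same function could be fed `driver.page_source` from the script above, which would make it possible to check the class names against a saved copy of the page before launching Chrome.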
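One detail worth checking when reviewing the generated script: it pairs titles and indices with `zip`, which stops at the shorter list without any warning. Whether the two lists can actually differ in length on this page is not shown in the article, so the following is only a defensive sketch. The helper name `pair_rows` and the dictionary keys are hypothetical.

```python
def pair_rows(titles, indices):
    """Pair titles with indices, refusing to silently drop unmatched items."""
    if len(titles) != len(indices):
        raise ValueError(
            f"found {len(titles)} titles but {len(indices)} indices; "
            "the page layout may have changed"
        )
    return [{"title": t, "hot_index": i} for t, i in zip(titles, indices)]


# Example with plain strings standing in for the scraped element texts.
rows = pair_rows(["title A", "title B"], ["4946724", "4801122"])
for row in rows:
    print(row["title"], row["hot_index"])
```

In the script, the element texts (`title.text`, `index.text`) would be collected into two lists first and then passed to this helper, so a mismatch surfaces as an error instead of a shorter Excel file.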