【原】AI网络爬虫-自动获取百度实时热搜榜

AIGC部落 2024-05-26 发布于广东

展开全文

工作任务和目标：自动获取百度实时热搜榜的标题和热搜指数

标题：<div class="c-single-text-ellipsis"> 东部战区台岛战巡演练模拟动画 </div>

第一步，在deepseek中输入如下提示词：

你是一个Python爬虫专家，完成以下网页爬取的Python脚本任务：

在F:\aivideo文件夹里面新建一个Excel文件：topbaidu.xlsx

设置chromedriver的路径为："D:\Program Files\chromedriver125\chromedriver.exe"

用selenium打开网页：https://top.baidu.com/board?tab=realtime；

请求标头为：

Accept:

text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7

Accept-Encoding:

gzip, deflate, br, zstd

Accept-Language:

zh-CN,zh;q=0.9,en;q=0.8

Cache-Control:

max-age=0

Connection:

keep-alive

Host:

http://top.baidu.com

Referer:

https://top.baidu.com/board?platform=pc&tab=homepage&sa=pc_index_homepage_all

Sec-Ch-Ua:

"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"

Sec-Ch-Ua-Mobile:

Sec-Ch-Ua-Platform:

"Windows"

Sec-Fetch-Dest:

document

Sec-Fetch-Mode:

navigate

Sec-Fetch-Site:

same-origin

Sec-Fetch-User:

Upgrade-Insecure-Requests:

User-Agent:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36

解析源代码，并打印输出；

定位class="c-single-text-ellipsis"的div标签，提取其文本内容作为热搜标题，保存到topbaidu.xlsx的第1列；

定位class="hot-index_1Bl1a"的div标签，提取其文本内容作为热搜指数，保存到topbaidu.xlsx的第2列；

注意：

每一步都要输出相关信息到屏幕；

每解析一个网页随机暂停1-10秒；

设置请求头，来应对反爬虫机制；

DataFrame.append 方法在 pandas 1.4.0 版本中已经被弃用，并且在后续版本中被移除。为了解决这个问题，我们可以使用 concat 函数来代替 append

第二步，查看审阅deepseek生成的Python代码：

import os

import time

import random

import pandas as pd

from selenium import webdriver

from selenium.webdriver.chrome.service import Service

from selenium.webdriver.common.by import By

from selenium.webdriver.chrome.options import Options

# 设置chromedriver的路径

chromedriver_path = "D:\\Program Files\\chromedriver125\\chromedriver.exe"

# 设置请求头

headers = {

'authority': 'top.baidu.com',

'method': 'GET',

'path': '/board?tab=realtime',

'scheme': 'https',

'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',

'accept-encoding': 'gzip, deflate, br, zstd',

'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8',

'cache-control': 'max-age=0',

'cookie': '__root_domain_v=.baidu.com; _qddaz=QD.484716194472545; _ntes_origin_from=sogou; _ntes_nuid=af525edefbd3d363cd5876dbe902d85b; s_n_f_l_n3=01cb466d244b7db51716517702302; _antanalysis_s_id=1716517702995; UserProvince=%u5168%u56FD; ne_analysis_trace_id=1716518441938; vinfo_n_f_l_n3=01cb466d244b7db5.1.0.1716517702302.0.1716518672850',

'if-modified-since': 'Fri, 24 May 2024 02:40:05 GMT',

'priority': 'u=0, i',

'referer': 'https://www.sogou.com/link?url=hedJjaC291M4oaTBlXc5yCiioa5eDVgw',

'sec-ch-ua': '"Google Chrome";v="125", "Chromium";v="125", "Not.A/Brand";v="24"',

'sec-ch-ua-mobile': '?0',

'sec-ch-ua-platform': '"Windows"',

'sec-fetch-dest': 'document',

'sec-fetch-mode': 'navigate',

'sec-fetch-site': 'cross-site',

'sec-fetch-user': '?1',

'upgrade-insecure-requests': '1',

'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36'