Python code running a bit slow? Speed it up with multithreading

python芊 · published 2023-01-08 in Hunan


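Introduction

Hi everyone~ This is Qianqian, who loves looking at pretty girls.

Time for another Python lesson~

When we scrape data, the script can sometimes run very slowly.

So how do we fix that?

Here is one way to do it: multithreading.

We'll use collecting second-hand housing listings as the example.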

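Environment:

Python 3.8
PyCharm

Modules:

requests: sends HTTP requests (third-party)
parsel: parses HTML (third-party)
re: regular expressions (built-in)
csv: reads and writes CSV files (built-in)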

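I. Implementation steps:

1. Send the request: imitate a browser and request the url.
2. Get the data: read the response the server returns.
   In the browser's developer tools, this is the Response tab.
3. Parse the data: extract the content we want.
   Here: the basic information for each listing.
4. Save the data: write it to a spreadsheet (CSV) file.
5. Collect multiple pages.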
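
As a rough outline, the steps above map onto three small functions plus a loop. This sketch runs offline: `fetch` returns canned HTML instead of calling a real site, and the markup, field names, and the regular expression are all made up for illustration. The real code below uses requests and parsel instead.

```python
import csv
import io
import re


def fetch(page):
    """Steps 1-2: stand-in for sending the request and reading the response text.

    Returns canned HTML (hypothetical markup) so the sketch needs no network.
    """
    return (
        '<ul class="sellListContent">'
        f'<li><a class="title">Listing {page}-A</a><span class="price">100</span></li>'
        f'<li><a class="title">Listing {page}-B</a><span class="price">200</span></li>'
        '</ul>'
    )


# Hypothetical pattern that matches the canned markup above.
ITEM = re.compile(r'<a class="title">(.*?)</a><span class="price">(.*?)</span>')


def parse(html):
    """Step 3: extract the fields we want from the page source."""
    return [{'title': title, 'price': price} for title, price in ITEM.findall(html)]


def save(rows, writer):
    """Step 4: append the rows to the table."""
    writer.writerows(rows)


buffer = io.StringIO()  # in-memory stand-in for data.csv
writer = csv.DictWriter(buffer, fieldnames=['title', 'price'])
writer.writeheader()

for page in range(1, 3):  # Step 5: repeat for each page
    save(parse(fetch(page)), writer)

print(buffer.getvalue())
```

Keeping each step in its own function is also what makes the later move to a thread pool easy, because one page becomes one independent task.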

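II. The code

Basic version

Import the modules

# HTTP request module --> third-party, install with: pip install requests
import requests
# HTML parsing module --> third-party, install with: pip install parsel
import parsel
# csv module (built-in)
import csv
# time module (built-in)
import time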


time_1 = time.time()

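Create the file object

f = open('data.csv', mode='a', encoding='utf-8', newline='')
csv_writer = csv.DictWriter(f, fieldnames=[
    '标题',  # title
    '小区',  # residential community
    '总价',  # total price
    '单价',  # unit price
    '户型',  # layout
    '面积',  # area
    '朝向',  # orientation
    '装修',  # renovation
    '楼层',  # floor
    '建筑日期',  # build date
    '建筑类型',  # building type
    '详情页',  # detail page
])

Write the header row

csv_writer.writeheader()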
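
A `DictWriter` rejects any row whose keys are not listed in `fieldnames`, so the list has to match the row dictionaries exactly. The quick offline check below (sample values are made up, and only three of the columns are used) also shows what a missing comma between two names does: Python joins adjacent string literals into one.

```python
import csv
import io

# A missing comma silently merges two names into a single column name.
broken = ['标题' '小区', '总价']
fixed = ['标题', '小区', '总价']
print(broken)  # ['标题小区', '总价']

row = {'标题': 'sample title', '小区': 'sample community', '总价': '100'}

# With the correct list, the header and the row are written as expected.
buffer = io.StringIO()  # in-memory stand-in for data.csv
writer = csv.DictWriter(buffer, fieldnames=fixed)
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())

# With the merged name, the row contains keys the writer does not know.
rejected = False
try:
    csv.DictWriter(io.StringIO(), fieldnames=broken).writerow(row)
except ValueError as err:
    rejected = True
    print('rejected:', err)
```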

"""

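Send the request: imitate a browser and request the url.

Disguise: build the request headers as a dictionary of complete key-value pairs.
The headers can be copied straight from the developer tools.
<Response [200]> is the response object.
Status code 200 means the request succeeded.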

"""

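for page in range(1, 101):
    try:
        print(f'==================Collecting page {page}==================')

Request URL

( The URL cannot be shown in this post; the original used an image in its place, so type it in yourself and assign it to url here. )

Browser disguise

        headers = {
            # User-Agent identifies the browser to the server
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36'
        }

Send the request

        # Pass headers by keyword: the second positional argument of requests.get is params
        response = requests.get(url=url, headers=headers)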

"""

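Get the data: read the response the server returns.
Developer tools: Response tab.
Get the page source.

response.text returns the response body as a string (here, the HTML source)
response.json() returns the response as a dictionary; the body must be complete, valid JSON

Parse the data: extract the content we want.
Extract: the basic information for each listing.
Parsing methods:

xpath
re (regular expressions)
css
JSON handling

CSS selectors: extract data by tag and attribute.

1. Find which tag the data is in.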

"""

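        # link must be assigned before the loop; the basic version never defines it
        # (the multithreaded version further down sets it in its main block)
        html_data = requests.get(link).text
        select = parsel.Selector(html_data)

Turn the HTML string from response.text into a parseable object

        selector = parsel.Selector(response.text)

First pass: select every li tag that holds listing data

        # .sellListContent is a class selector, so it needs the leading dot
        lis = selector.css('.sellListContent li')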

The for loop takes the elements out of the list one at a time

        for li in lis:
            title = li.css('.title a::text').get()  # title
            href = li.css('.title a::attr(href)').get()  # detail page
            totalPrice = li.css('.totalPrice span::text').get()  # total price
            unitPrice = li.css('.unitPrice span::text').get()  # unit price
            string = select.css('.comments div:nth-child(7) .comment_text::text').get()

join merges a list into one string

            area = '-'.join(li.css('.info .flood .positionInfo a::text').getall())  # residential community
            houseInfo = li.css('.info .address .houseInfo::text').get()

split breaks a string into a list

            houseType = houseInfo.split(' | ')[0]  # layout
            houseArea = houseInfo.split(' | ')[1]  # area
            orientation = houseInfo.split(' | ')[2]  # orientation
            renovation = houseInfo.split(' | ')[3]  # renovation
            floor = houseInfo.split(' | ')[4]  # floor

Check how many elements houseInfo.split(' | ') has; six elements means there is no build date

            if len(houseInfo.split(' | ')) == 6:
                date = ''
            else:
                date = houseInfo.split(' | ')[5]
            buildingType = houseInfo.split(' | ')[-1]  # building type
            dit = {
                '标题': title,
                '小区': area,
                '总价': totalPrice,
                '单价': unitPrice,
                '户型': houseType,
                '面积': houseArea,
                '朝向': orientation,
                '装修': renovation,
                '楼层': floor,
                '建筑日期': date,
                '建筑类型': buildingType,
                '详情页': href,
            }
            csv_writer.writerow(dit)
            print(string)
    except Exception as e:
        # Report the page that failed instead of hiding the error
        print(f'Failed to collect page {page}: {e}')
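
The houseInfo handling in the loop above can also be written as one small helper that splits the string once. This is only a sketch: `parse_house_info` is a hypothetical name, and the two sample strings are made up to mimic the ' | ' separated format the code expects.

```python
def parse_house_info(house_info):
    """Split a houseInfo string into named fields (hypothetical helper)."""
    parts = house_info.split(' | ')
    # Six parts means the build date is missing; with seven it sits at index 5.
    date = '' if len(parts) == 6 else parts[5]
    return {
        'houseType': parts[0],
        'houseArea': parts[1],
        'orientation': parts[2],
        'renovation': parts[3],
        'floor': parts[4],
        'date': date,
        'buildingType': parts[-1],  # always the last element
    }


# Made-up sample strings, one with a build date and one without.
with_date = parse_house_info('2室1厅 | 89平米 | 南 北 | 精装 | 中楼层(共18层) | 2010年建 | 板楼')
without_date = parse_house_info('2室1厅 | 89平米 | 南 北 | 精装 | 中楼层(共18层) | 板楼')

print(with_date['date'], with_date['buildingType'])
print(repr(without_date['date']), without_date['buildingType'])
```

Splitting once and indexing into `parts` avoids repeating the same `split(' | ')` call on every line, and the length check sits right next to the field it affects.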

Multithreaded version

import requests
import parsel
import re
import csv
# thread pool module
import concurrent.futures
import time

def get_response(html_url):
    """
    Send the request.
    :param html_url: URL of the page to request
    :return: the response object
    """
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def get_content(html_url):
    """
    Fetch one page and extract its data.
    :param html_url: URL of the listing page
    :return: list of dictionaries, one per listing
    """
    response = get_response(html_url)
    html_data = get_response(link).text  # link is the global set in the main block
    selector = parsel.Selector(response.text)
    select = parsel.Selector(html_data)
    lis = selector.css('.sellListContent li')
    content_list = []
    for li in lis:

        title = li.css('.title a::text').get()  # title
        area = '-'.join(li.css('.positionInfo a::text').getall())  # residential community
        Price = li.css('.totalPrice span::text').get()  # total price
        Price_1 = li.css('.unitPrice span::text').get().replace('元/平', '')  # unit price
        houseInfo = li.css('.houseInfo::text').get()  # listing info
        HouseType = houseInfo.split(' | ')[0]  # layout
        HouseArea = houseInfo.split(' | ')[1].replace('平米', '')  # area
        direction = houseInfo.split(' | ')[2].replace(' ', '')  # orientation
        renovation = houseInfo.split(' | ')[3]  # renovation
        floor_info = houseInfo.split(' | ')[4]
        floor = floor_info[:3]  # floor
        floor_num = re.findall(r'(\d+)层', floor_info)[0]  # number of floors
        BuildingType = houseInfo.split(' | ')[-1]  # building type
        string = select.css('.comments div:nth-child(7) .comment_text::text').get()
        href = li.css('.title a::attr(href)').get()  # detail page
        if len(houseInfo.split(' | ')) == 6:
            date = 'None'
        else:
            date = houseInfo.split(' | ')[5].replace('年建', '')  # build date
        print(string)
        dit = {
            '标题': title,
            '内容': string,
            '小区': area,
            '总价': Price,
            '单价': Price_1,
            '户型': HouseType,
            '面积': HouseArea,
            '朝向': direction,
            '装修': renovation,
            '楼层': floor,
            '层数': floor_num,
            '建筑日期': date,
            '建筑类型': BuildingType,
            '详情页': href,
        }
        content_list.append(dit)
    return content_list


def main(page):
    """
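    Main function: collect one page and write its rows.
    :param page: page number
    :return:
    """
    print(f'===============Collecting page {page}===============')

    # url is the address of this page; the post omits it, so it has to be defined before this call
    content_list = get_content(html_url=url)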
    for content in content_list:
        csv_writer.writerow(content)
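
In the multithreaded version every worker thread calls `writerow` on the same writer. The csv documentation does not promise that a shared writer is safe to use from several threads, so one cautious option is to guard the writes with a lock. The sketch below is an assumption-laden illustration rather than the author's code: `save_page`, `write_lock`, and the fake rows are hypothetical, and an in-memory buffer stands in for the file.

```python
import csv
import io
import threading
from concurrent.futures import ThreadPoolExecutor

buffer = io.StringIO()  # in-memory stand-in for the CSV file
writer = csv.DictWriter(buffer, fieldnames=['page', 'row'])
writer.writeheader()
write_lock = threading.Lock()  # only one thread writes at a time


def save_page(page):
    """Pretend to collect three rows for a page, then write them under the lock."""
    rows = [{'page': page, 'row': i} for i in range(3)]
    with write_lock:
        writer.writerows(rows)


with ThreadPoolExecutor(max_workers=10) as pool:
    # list() drains the results, so an exception raised in a worker surfaces here
    list(pool.map(save_page, range(1, 11)))

lines = buffer.getvalue().splitlines()
print(len(lines))  # 31: one header line plus 10 pages x 3 rows
```

Reading the results also matters for debugging: with `submit`, an exception raised inside a task stays hidden in its future unless the result is requested.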

if __name__ == '__main__':
    time_1 = time.time()
    link = 'http:// *******.com/article/149'
    # create the file
    f = open('data多线程.csv', mode='a', encoding='utf-8', newline='')
    csv_writer = csv.DictWriter(f, fieldnames=[
        '标题',
        '内容',
        '小区',
        '总价',
        '单价',
        '户型',
        '面积',
        '朝向',
        '装修',
        '楼层',
        '层数',
        '建筑日期',
        '建筑类型',
        '详情页',
    ])
    csv_writer.writeheader()

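    # thread pool executor; max_workers is the maximum number of threads
    exe = concurrent.futures.ThreadPoolExecutor(max_workers=10)
    for page in range(1, 11):
        exe.submit(main, page)
    # wait for every submitted task to finish
    exe.shutdown()
    # close the file so buffered rows are flushed to disk
    f.close()
    time_2 = time.time()
    use_time = int(time_2 - time_1)
    # Total time: 9
    print('Total time (s):', use_time)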
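
To see why a thread pool helps with this kind of work, here is a small offline comparison. It is a sketch under stated assumptions: `fake_download` is a hypothetical stand-in that just sleeps for 0.1 seconds to imitate waiting on a server, so the numbers say nothing about any real site.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(page):
    """Imitate a network request: the thread mostly waits."""
    time.sleep(0.1)
    return page


pages = range(1, 11)

# One page after another: the waits add up.
start = time.perf_counter()
sequential = [fake_download(page) for page in pages]
sequential_time = time.perf_counter() - start

# Ten worker threads: the waits overlap.
start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    threaded = list(pool.map(fake_download, pages))
threaded_time = time.perf_counter() - start

print(f'sequential: {sequential_time:.2f}s, threaded: {threaded_time:.2f}s')
```

The speedup comes from overlapping the time spent waiting for responses. Threads do not make CPU-heavy Python code faster in the same way, so the gain applies to I/O-bound tasks such as page requests.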