Target site: https://tianqi.2345.com/
Goal: collect daily weather data for Beijing, Shanghai, Guangzhou, and Shenzhen for 2020–2022, including the daily high, daily low, weather conditions, wind force/direction, and air quality index, and save it to a CSV file.
Historical weather pages:
- Beijing: https://tianqi.2345.com/wea_history/54511.htm
- Shanghai: https://tianqi.2345.com/wea_history/58362.htm
- Guangzhou: https://tianqi.2345.com/wea_history/59287.htm
- Shenzhen: https://tianqi.2345.com/wea_history/59493.htm
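The number at the end of each page URL is the station id, and the monthly history itself is served by a JSON endpoint, https://tianqi.2345.com/Pc/GetHistory, that takes bracketed keys such as areaInfo[areaId]. Rather than hand-writing the percent-encoded query string (brackets become %5B/%5D), it can be built with the standard library; a minimal sketch, with build_history_url being a hypothetical helper name:

```python
from urllib.parse import urlencode


def build_history_url(area_id: int, year: int, month: int) -> str:
    """Build the GetHistory query URL; urlencode percent-encodes the bracketed keys."""
    params = {
        'areaInfo[areaId]': area_id,    # station id, e.g. 54511 for Beijing
        'areaInfo[areaType]': 2,
        'date[year]': year,
        'date[month]': month,
    }
    return 'https://tianqi.2345.com/Pc/GetHistory?' + urlencode(params)


print(build_history_url(54511, 2020, 1))
# https://tianqi.2345.com/Pc/GetHistory?areaInfo%5BareaId%5D=54511&areaInfo%5BareaType%5D=2&date%5Byear%5D=2020&date%5Bmonth%5D=1
```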
Scraper code (the original slept 1000 seconds per city, which stalls the run for nearly 17 minutes at a time; a 1-second pause per request is enough to throttle politely):

import csv
import time

import requests
from bs4 import BeautifulSoup

city_dict = {'北京': 54511, '上海': 58362, '广州': 59287, '深圳': 59493}

with open(r'.\北上广深历史天气.csv', mode='w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['城市', '日期', '最高温', '最低温', '天气', '风力风向', '空气质量指数'])
    for city, area_id in city_dict.items():
        for year in range(2020, 2023):
            for month in range(1, 13):
                url = (f'https://tianqi.2345.com/Pc/GetHistory'
                       f'?areaInfo%5BareaId%5D={area_id}'
                       f'&areaInfo%5BareaType%5D=2'
                       f'&date%5Byear%5D={year}'
                       f'&date%5Bmonth%5D={month}')
                response = requests.get(url=url)
                response.raise_for_status()
                html_data = response.json()['data']  # the JSON payload wraps an HTML table fragment
                page = BeautifulSoup(html_data, 'html.parser')
                table = page.find('table', attrs={'class': 'history-table'})
                for tr in table.find_all('tr')[1:]:  # skip the header row
                    td = tr.find_all('td')
                    row = [city,
                           td[0].text.strip(),  # date
                           td[1].text.strip(),  # daily high
                           td[2].text.strip(),  # daily low
                           td[3].text.strip(),  # weather
                           td[4].text.strip(),  # wind force/direction
                           td[5].text.strip()]  # air quality index
                    print(row)
                    csv_writer.writerow(row)
                time.sleep(1)  # throttle: pause between requests
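The parsing step can be exercised without hitting the network by running the same BeautifulSoup logic on a hand-written fragment shaped like the table the endpoint returns. The sample values below are made up for illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the 'data' field of the GetHistory response.
sample_html = """
<table class="history-table">
  <tr><th>日期</th><th>最高温</th><th>最低温</th><th>天气</th><th>风力风向</th><th>空气质量指数</th></tr>
  <tr><td>2020-01-01 周三</td><td>2°</td><td>-7°</td><td>晴</td><td>西北风1级</td><td>42 优</td></tr>
</table>
"""

page = BeautifulSoup(sample_html, 'html.parser')
table = page.find('table', attrs={'class': 'history-table'})
rows = []
for tr in table.find_all('tr')[1:]:  # skip the header row
    rows.append([td.text.strip() for td in tr.find_all('td')])

print(rows)  # [['2020-01-01 周三', '2°', '-7°', '晴', '西北风1级', '42 优']]
```

Each parsed row has six cells, matching the six data columns written after the city name in the CSV.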
Final result: (screenshot of the generated CSV not included here)