Target site: https://tianqi.2345.com/
Goal: collect daily weather data for Beijing, Shanghai, Guangzhou, and Shenzhen for 2020–2022, including the daily high, daily low, weather conditions, wind force/direction, and air quality index, and save it to a CSV file.
Historical weather pages:
- Beijing: https://tianqi.2345.com/wea_history/54511.htm
- Shanghai: https://tianqi.2345.com/wea_history/58362.htm
- Guangzhou: https://tianqi.2345.com/wea_history/59287.htm
- Shenzhen: https://tianqi.2345.com/wea_history/59493.htm
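The number at the end of each page URL is the station id, and the monthly history itself is served by a JSON endpoint, https://tianqi.2345.com/Pc/GetHistory, that takes bracketed keys such as areaInfo[areaId]. Rather than hand-writing the percent-encoded query string (brackets become %5B/%5D), it can be built with the standard library; a minimal sketch, with build_history_url being a hypothetical helper name:

```python
from urllib.parse import urlencode


def build_history_url(area_id: int, year: int, month: int) -> str:
    """Build the GetHistory query URL; urlencode percent-encodes the bracketed keys."""
    params = {
        'areaInfo[areaId]': area_id,    # station id, e.g. 54511 for Beijing
        'areaInfo[areaType]': 2,
        'date[year]': year,
        'date[month]': month,
    }
    return 'https://tianqi.2345.com/Pc/GetHistory?' + urlencode(params)


print(build_history_url(54511, 2020, 1))
# https://tianqi.2345.com/Pc/GetHistory?areaInfo%5BareaId%5D=54511&areaInfo%5BareaType%5D=2&date%5Byear%5D=2020&date%5Bmonth%5D=1
```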
Scraper code (the original slept 1000 seconds per city, which stalls the run for nearly 17 minutes at a time; a 1-second pause per request is enough to throttle politely):

import csv
import time

import requests
from bs4 import BeautifulSoup

city_dict = {'北京': 54511, '上海': 58362, '广州': 59287, '深圳': 59493}

with open(r'.\北上广深历史天气.csv', mode='w', newline='', encoding='utf-8') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow(['城市', '日期', '最高温', '最低温', '天气', '风力风向', '空气质量指数'])
    for city, area_id in city_dict.items():
        for year in range(2020, 2023):
            for month in range(1, 13):
                url = (f'https://tianqi.2345.com/Pc/GetHistory'
                       f'?areaInfo%5BareaId%5D={area_id}'
                       f'&areaInfo%5BareaType%5D=2'
                       f'&date%5Byear%5D={year}'
                       f'&date%5Bmonth%5D={month}')
                response = requests.get(url=url)
                response.raise_for_status()
                html_data = response.json()['data']  # the JSON payload wraps an HTML table fragment
                page = BeautifulSoup(html_data, 'html.parser')
                table = page.find('table', attrs={'class': 'history-table'})
                for tr in table.find_all('tr')[1:]:  # skip the header row
                    td = tr.find_all('td')
                    row = [city,
                           td[0].text.strip(),  # date
                           td[1].text.strip(),  # daily high
                           td[2].text.strip(),  # daily low
                           td[3].text.strip(),  # weather
                           td[4].text.strip(),  # wind force/direction
                           td[5].text.strip()]  # air quality index
                    print(row)
                    csv_writer.writerow(row)
                time.sleep(1)  # throttle: pause between requests
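The parsing step can be exercised without hitting the network by running the same BeautifulSoup logic on a hand-written fragment shaped like the table the endpoint returns. The sample values below are made up for illustration only:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the 'data' field of the GetHistory response.
sample_html = """
<table class="history-table">
  <tr><th>日期</th><th>最高温</th><th>最低温</th><th>天气</th><th>风力风向</th><th>空气质量指数</th></tr>
  <tr><td>2020-01-01 周三</td><td>2°</td><td>-7°</td><td>晴</td><td>西北风1级</td><td>42 优</td></tr>
</table>
"""

page = BeautifulSoup(sample_html, 'html.parser')
table = page.find('table', attrs={'class': 'history-table'})
rows = []
for tr in table.find_all('tr')[1:]:  # skip the header row
    rows.append([td.text.strip() for td in tr.find_all('td')])

print(rows)  # [['2020-01-01 周三', '2°', '-7°', '晴', '西北风1级', '42 优']]
```

Each parsed row has six cells, matching the six data columns written after the city name in the CSV.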
Final result: (screenshot of the generated CSV not included here)