The previous post, Scraping Notes — Eastmoney STAR Market data (Selenium method), covered scraping Eastmoney's STAR Market (科创板) data with Selenium. This post covers the same scrape with requests.

Observing the page
==================

Inspecting the network traffic in the F12 developer tools did not reveal where the data was loaded from, so I turned to the page source instead and found the following URL there: the Eastmoney STAR Market data page (from the page source).

Analyzing the page source
=========================

Opening that URL shows a page like the following:

(screenshot of the raw API response omitted)

The response is a JavaScript assignment whose payload contains JSON-style `"key":"value"` pairs, so each field can be pulled out with a regular expression.

The code
========

```python
# Scraping Eastmoney STAR Market data (requests method)
import re

import pandas as pd
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}


def main():
    url = ('http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get'
           '?type=KCB_YSB&token=70f12f2f4f091e459a279469fe49eca5'
           '&st={sortType}&sr={sortRule}&p=1&ps=149'
           '&js=var%20{jsname}={pages:(tp),data:(x),font:(font)}{param}')
    # Send the request with a browser User-Agent so it is not rejected
    response = requests.get(url, headers=HEADERS)
    text = response.text

    # Extract each field from the response with a regular expression
    company_name = re.findall(r'"issue_name":"(.*?)"', text)
    code = re.findall(r'"issue_code":"(.*?)"', text)
    detail_url = list(map(lambda x: 'http://data.eastmoney.com/kcb/detail/' + str(x) + '.html', code))
    state = re.findall(r'"latest_check_status":"(.*?)"', text)
    reg_address = re.findall(r'"reg_address":"(.*?)"', text)
    industry = re.findall(r'"csrc_industry":"(.*?)"', text)
    sponsor_org = re.findall(r'"sponsor_org":"(.*?)"', text)
    law_firm = re.findall(r'"law_firm":"(.*?)"', text)
    account_firm = re.findall(r'"account_firm":"(.*?)"', text)
    update_date = re.findall(r'"update_date":"([\d-]{10})', text)
    accept_date = re.findall(r'"accept_date":"([\d-]{10})', text)
    ssname = re.findall(r'"ssname":"(.*?)"', text)

    # Assemble the columns into a DataFrame
    kcb_data = pd.DataFrame()
    kcb_data['公司名称'] = company_name    # company name
    kcb_data['公司简称'] = ssname          # short name
    kcb_data['公司详情网址'] = detail_url  # detail-page URL
    kcb_data['审核状态'] = state           # review status
    kcb_data['注册地'] = reg_address       # registered address
    kcb_data['行业'] = industry            # CSRC industry
    kcb_data['保荐机构'] = sponsor_org     # sponsor institution
    kcb_data['律师事务所'] = law_firm      # law firm
    kcb_data['会计师事务所'] = account_firm  # accounting firm
    kcb_data['更新日期'] = update_date     # update date
    kcb_data['受理日期'] = accept_date     # acceptance date

    # Save the data (note: to_excel's encoding argument was removed in pandas 2.0)
    kcb_data.to_excel('./data/kcb_data_spider_requests.xlsx', index=False)
    print('Scrape finished!')


if __name__ == '__main__':
    main()
```
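As a minimal, self-contained sketch of the regex extraction step, the snippet below runs the same patterns against a hypothetical fragment that imitates the response format (the company name, code, and date in `text` are made up for illustration; the patterns assume the payload uses JSON-style double-quoted keys):

```python
import re

# Hypothetical fragment imitating the JS payload returned by the API
text = ('var abc={pages:1,data:[{"issue_name":"某某科技股份有限公司",'
        '"issue_code":"688001","accept_date":"2019-03-22T00:00:00"}]}')

# Non-greedy (.*?) stops at the first closing quote of each value
company_name = re.findall(r'"issue_name":"(.*?)"', text)   # → ['某某科技股份有限公司']
code = re.findall(r'"issue_code":"(.*?)"', text)           # → ['688001']

# [\d-]{10} keeps only the YYYY-MM-DD part of the timestamp
accept_date = re.findall(r'"accept_date":"([\d-]{10})', text)  # → ['2019-03-22']

# Build the detail-page URL from each code
detail_url = ['http://data.eastmoney.com/kcb/detail/' + c + '.html' for c in code]
print(detail_url)  # → ['http://data.eastmoney.com/kcb/detail/688001.html']
```

Because every field is extracted independently, the lists stay aligned only as long as every record carries every key; that is why the date pattern anchors on a fixed `[\d-]{10}` shape rather than a bare `.*?`.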
As you can see, the requests method pulls all the data from a single page, so it is much faster than the Selenium approach and needs less code.