Crawler Notes: Scraping Eastmoney STAR Market Data (requests Method)

傑克h7x 2019-08-28


Inspecting the web page
Analyzing the page source
Full code

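The previous post, "Crawler Notes: Scraping Eastmoney STAR Market Data (selenium Method)", showed how to scrape Eastmoney's STAR Market data with selenium. This post covers the same task with requests.
The requests method:
Pros: fast
Cons: you have to locate the data source URL yourself, either in the browser's developer tools (F12) or in the page source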

===================================================

Inspecting the web page

Eastmoney STAR Market data link

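The F12 developer tools did not reveal where the data was loaded from, so I turned to the page source and found the following address there: Eastmoney STAR Market data, page source view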
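As a rough illustration of this step, the sketch below scans a page's source text for URLs containing a given keyword. The function name `find_urls`, the sample HTML, and the `example.com` addresses are all made up for illustration; the post itself located the address by reading the source by hand.

```python
import re


def find_urls(html, keyword):
    """Return the distinct URLs in `html` that contain `keyword`, in order of appearance."""
    # Match http(s) URLs up to the next quote, whitespace, or angle bracket.
    candidates = re.findall(r'https?://[^\s"\'<>]+', html)
    found = []
    for url in candidates:
        if keyword in url and url not in found:
            found.append(url)
    return found


# Hypothetical page source, used only to show the idea.
sample_html = '''
<script src="http://example.com/js/list.js"></script>
<a href="http://example.com/about.html">about</a>
<script src="http://example.com/js/list.js"></script>
'''

print(find_urls(sample_html, '.js'))
```

On a real page, the text returned by `requests.get(...).text` would take the place of `sample_html`, and the keyword would be whatever fragment the data address is expected to contain.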


======================================================

Analyzing the page source

Opening that address gives the following page:
Address: page where the data source function is defined

======================================================
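The function definition has the same form as the one in the displayed page, and it contains the address of the data source, so the data can be scraped through that address: data source URL. Note that the function-definition page does not set the paging parameters. Here the page number is set to 1 (the first page, `p=1` in the URL) and the page size to 149 (`ps=149`, that is, 149 records per page, because there are only 149 records in total; with more data you can request additional pages).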
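The paging parameters can be made explicit with a small helper. This is only a sketch: `build_url`, `page_urls`, and the `<PAGE>`/`<PAGESIZE>` markers are hypothetical names, and the sketch assumes the server accepts other values of `p` and `ps` in the same address used in the code section. Everything else in the query string is left as it is.

```python
import math

# The address from the code section, with the two paging values replaced by
# markers. str.format is avoided because the URL itself contains literal braces.
URL_TEMPLATE = (
    'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get'
    '?type=KCB_YSB&token=70f12f2f4f091e459a279469fe49eca5'
    '&st={sortType}&sr={sortRule}&p=<PAGE>&ps=<PAGESIZE>'
    '&js=var%20{jsname}={pages:(tp),data:(x),font:(font)}{param}'
)


def build_url(page, page_size):
    """Return the data URL for one page (pages are numbered from 1)."""
    return (URL_TEMPLATE
            .replace('<PAGE>', str(page))
            .replace('<PAGESIZE>', str(page_size)))


def page_urls(total, page_size):
    """Return one URL per page needed to cover `total` records."""
    pages = math.ceil(total / page_size)
    return [build_url(p, page_size) for p in range(1, pages + 1)]


# 149 records at 50 per page need three requests.
for url in page_urls(149, 50):
    print(url)
```

With `page_urls(149, 149)` the list has a single entry, which matches the single request made in this post.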

===========================================================
With that URL, the fields we want can be extracted with regular expressions (the re module). The full code follows:

Full code

# Scraping Eastmoney STAR Market data (requests method)
import os
import re

import pandas as pd
import requests

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}


def main():
    url = 'http://dcfm.eastmoney.com/em_mutisvcexpandinterface/api/js/get?type=KCB_YSB&token=70f12f2f4f091e459a279469fe49eca5&st={sortType}&sr={sortRule}&p=1&ps=149&js=var%20{jsname}={pages:(tp),data:(x),font:(font)}{param}'
    # Send the browser-like headers defined above with the request
    response = requests.get(url, headers=HEADERS)
    text = response.text

    # Extract each field with a regular expression
    # (keys and values are assumed to be double-quoted in the response)
    company_name = re.findall(r'"issue_name":"(.*?)"', text)
    code = re.findall(r'"issue_code":"(.*?)"', text)
    detail_url = ['http://data.eastmoney.com/kcb/detail/' + str(x) + '.html' for x in code]
    state = re.findall(r'"latest_check_status":"(.*?)"', text)
    reg_address = re.findall(r'"reg_address":"(.*?)"', text)
    industry = re.findall(r'"csrc_industry":"(.*?)"', text)
    sponsor_org = re.findall(r'"sponsor_org":"(.*?)"', text)
    law_firm = re.findall(r'"law_firm":"(.*?)"', text)
    account_firm = re.findall(r'"account_firm":"(.*?)"', text)
    # Keep only the date part (first 10 characters, e.g. YYYY-MM-DD)
    update_date = re.findall(r'"update_date":"([\d-]{10})', text)
    accept_date = re.findall(r'"accept_date":"([\d-]{10})', text)
    ssname = re.findall(r'"ssname":"(.*?)"', text)

    # Combine the fields into one table
    kcb_data = pd.DataFrame()
    kcb_data['Company name'] = company_name
    kcb_data['Short name'] = ssname
    kcb_data['Detail page URL'] = detail_url
    kcb_data['Review status'] = state
    kcb_data['Registered address'] = reg_address
    kcb_data['Industry'] = industry
    kcb_data['Sponsor'] = sponsor_org
    kcb_data['Law firm'] = law_firm
    kcb_data['Accounting firm'] = account_firm
    kcb_data['Update date'] = update_date
    kcb_data['Acceptance date'] = accept_date

    # Save the result; the encoding argument is dropped because
    # to_excel no longer accepts it in pandas 2.0 and later
    os.makedirs('./data', exist_ok=True)
    kcb_data.to_excel('./data/kcb_data_spider_requests.xlsx', index=False)
    print('Scraping finished!')


if __name__ == '__main__':
    main()
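To see how the non-greedy patterns in the code above behave without touching the network, the sketch below runs them on a made-up response fragment. The sample text, its values, and the helper names `extract` and `same_length` are invented for illustration, and the double-quoted key/value layout is an assumption about the response format. The sketch also checks that every field yields the same number of matches, since building the table column by column misaligns rows if some record lacks a field.

```python
import re

# Made-up fragment in the assumed format; not real data.
sample = (
    'var x={pages:1,data:['
    '{"issue_name":"Company A","issue_code":"001","update_date":"2019-08-01T00:00:00"},'
    '{"issue_name":"Company B","issue_code":"002","update_date":"2019-08-02T00:00:00"}'
    ']}'
)


def extract(field, text, value_pattern=r'(.*?)"'):
    """Return every captured value for a "field":"value" pair in `text`."""
    return re.findall('"' + re.escape(field) + '":"' + value_pattern, text)


def same_length(columns):
    """Return True if every list in the dict `columns` has the same length."""
    return len({len(values) for values in columns.values()}) <= 1


columns = {
    'name': extract('issue_name', sample),
    'code': extract('issue_code', sample),
    # Keep only the first 10 characters of the timestamp (the date part).
    'date': extract('update_date', sample, r'([\d-]{10})'),
}

print(columns)
print(same_length(columns))
```

If `same_length` returns False on real data, the columns should not be assembled into a DataFrame directly; extracting the fields record by record would be safer in that case.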