R Web Scraping in Practice: Fetching NSFC Grant Information (Part 1)

生物_医药_科研 2020-09-24

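I first started scraping National Natural Science Foundation of China (NSFC) grant data because of a work requirement: I had to compile NSFC grant information to help identify research hotspots. Early on I did this mechanically: search by keyword, pick out the relevant entries by eye, then copy and paste, which took a great deal of time. Many WeChat public accounts also compile and share this information, but most ask you to share a post to your Moments or something similar first. An R scraper retrieves the same information quickly: a job that used to take 2-3 days can be done in a few minutes.

Sources of NSFC information

A scraper extracts the information that is visible on a web page; in principle, anything you can see, a scraper can fetch. To scrape NSFC grant information we therefore need to know which websites publish it. A search turns up the following main candidates: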

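1-National Natural Science Foundation of China (NSFC)

URL:

https://isisn.nsfc.gov.cn/egrantindex/funcindex/prjsearch-list

This is the official source. Its search form is very fine-grained, so a single query cannot return a broad result set, and it asks for a captcha at several steps, which makes it unfriendly to scrapers. It is fine for simple lookups but not recommended for scraping. For scraping experts, of course, none of this is an obstacle.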

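2-National Natural Science Foundation Result Query - NSFC

URL: http://nsfc.biomart.cn/

A search platform provided by 丁香通. Its records are not updated promptly and currently stop at 2015, so skip it. The site does host a large collection of NSFC grant proposals, though; I don't know whether those can be scraped.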

3-medsci

URL: https://www.medsci.cn/sci/nsfc.do

This site can be ignored; its coverage is very incomplete.

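4-科学网 (ScienceNet)

URL: http://fund.sciencenet.cn/

The search form is simple, a single keyword returns plenty of results, and there is no captcha, so the site is scraper-friendly. However, it displays only 200 records for any keyword search, so the results are incomplete. More importantly, it now charges a fee. Skip this one as well.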

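5-National Natural Science Fund Project Query (V2.0 official release)

URL: http://fund./

This site is strongly recommended for scraping NSFC information. The search form is simple, a single keyword returns plenty of results, and there is no captcha. It also displays complete information, which makes it the best choice for this task.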

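How do we scrape the NSFC information?

Here is the task: how do we scrape the lncRNA-related grant projects approved between 2016 and 2019?

First, open http://fund./ and enter the search terms (here the keyword lncRNA with the year range 2016 to 2019).

The search returns more than 1,200 records in total. Each record includes the title, principal investigator, applicant institution, research type, funding amount, and other fields.

Looking at the result URLs, each one consists of three parts: [http://fund./Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/]+[page number]+[.html]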

http://fund./Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/1.html
http://fund./Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/2.html
http://fund./Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/3.html
http://fund./Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/4.html
.......
http://fund./Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/lastpage_number.html
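The three-part pattern can be sketched as a small helper. This is only an illustration: `build_page_urls` is a hypothetical function name that does not appear in the original script, and the host in the address is left exactly as it is written in this article.

```r
# Fixed prefix of the search URL, copied from the pattern above
# (keyword lncRNA, years 2016-2019).
site <- 'http://fund./Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/'

# Hypothetical helper: join prefix + page number + '.html' for each page.
build_page_urls <- function(site, pages) {
  paste0(site, pages, '.html')
}

# URLs for the first three result pages
urls <- build_page_urls(site, 1:3)
print(urls)
```

Because `paste0` is vectorized, passing `1:lastpage_number` would produce every page URL in one call.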

Now let's start scraping.

Install and load the required R packages

rm(list = ls())
library(rvest)
library(stringr)
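The code above only loads the packages. For the install step, one possible sketch is shown below; `missing_packages` is a hypothetical helper, and the actual install call is guarded by `interactive()` so that nothing is downloaded when the script runs unattended.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_packages(c('rvest', 'stringr'))

# Install only what is missing, and only in an interactive session.
if (length(to_install) > 0 && interactive()) {
  install.packages(to_install)
}
```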

Scrape the information from the first page

url1 <- c('http://fund./Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/1.html')
web <- read_html(url1)
# --- Grant titles ---
Title <- web %>% html_nodes('ul a li h3') %>% html_text()
Title
# --- All labelled fields (PI, institution, type, number, year, amount) ---
Information <- web %>% html_nodes('ul a li span') %>% html_text()
Information
# --- Principal investigator ---
Author <- Information[grep('负责人', Information)]
Author
# --- Applicant institution ---
Department <- Information[grep('申请单位', Information)]
Department
# --- Research type ---
jijintype <- Information[grep('研究类型', Information)]
jijintype
# --- Project approval number ---
Project <- Information[grep('项目批准号', Information)]
Project
# --- Approval year ---
Date <- Information[grep('批准年度', Information)]
Date
# --- Funding amount ---
Money <- Information[grep('金额', Information)]
Money
# Combine the fields into one data frame (one row per project)
result <- data.frame(Title=Title,Author=Author,Department=Department,
                     Type=jijintype,Project=Project,Date=Date,Money=Money)
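The `grep` step keeps each field together with its label. A minimal offline sketch of how the labels could be stripped, and how the field vectors could be checked before building the data frame, is given below. The sample `Information` values, the colon separator, and the helper name `extract_field` are all assumptions made for illustration; the real page text may differ.

```r
# Hypothetical sample of what `Information` might hold for two records;
# the names, institutions, and the full-width colon separator are made up.
Information <- c('负责人：张三', '申请单位：某某大学', '研究类型：面上项目',
                 '负责人：李四', '申请单位：某某医院', '研究类型：青年科学基金项目')

# Hypothetical helper: keep the entries carrying `label`, then drop the
# label and the colon (full- or half-width) that follows it.
extract_field <- function(info, label) {
  hits <- grep(label, info, fixed = TRUE, value = TRUE)
  sub(paste0('^.*', label, '[：:]\\s*'), '', hits)
}

Author     <- extract_field(Information, '负责人')
Department <- extract_field(Information, '申请单位')
Type       <- extract_field(Information, '研究类型')

# data.frame() stops with an error if the columns differ in length, so an
# explicit check fails earlier and closer to the cause.
stopifnot(length(Author) == length(Department),
          length(Author) == length(Type))

result <- data.frame(Author = Author, Department = Department, Type = Type)
print(result)
```

The length check matters because the approach relies on every record exposing every field; if one project lacked a field, the columns would no longer line up row by row.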

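To fetch every page we need the largest page number for this search. Get it first (you could also simply click through to the last page and read the number off).

# Open the first page of the search results
url1 <- 'http://fund./Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/1.html'
# Read the page content
web <- read_html(url1, encoding = 'utf-8')
# Collect the pagination links; the last one points to the last page
lastpage_link <- web %>% html_nodes('div.layui-box a') %>% html_attr('href')
lastpage_link <- paste0('http://fund./', lastpage_link[length(lastpage_link)])
# Open the last page and read the highlighted (current) page number
lastpage_web <- read_html(lastpage_link, encoding = 'utf-8')
lastpage_number <- lastpage_web %>% html_nodes('div.layui-box span.current') %>% html_text() %>% as.integer()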
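As an alternative to loading the last page, the page number could be read directly from the link itself. The sketch below assumes the last pagination link follows the same `/p/<number>.html` pattern as the page URLs shown earlier; the sample href and the helper name `page_from_href` are hypothetical.

```r
# Hypothetical href of the "last page" link; the page number 62 is invented
# for illustration and only the trailing '/p/<number>.html' part matters.
last_href <- '/Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/62.html'

# Hypothetical helper: pull the digits between '/p/' and '.html'.
page_from_href <- function(href) {
  m <- regmatches(href, regexec('/p/([0-9]+)\\.html$', href))[[1]]
  if (length(m) < 2) return(NA_integer_)  # href does not follow the pattern
  as.integer(m[2])
}

lastpage_number <- page_from_href(last_href)
print(lastpage_number)
```

This saves one request, at the cost of depending on the link format rather than on the page content.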

Use a loop to fetch all pages

i = 1
site <- 'http://fund./Index/index/title/lncRNA/start_year/2016/end_year/2019/xmid/0/search/1/p/'

# Create the data frame that accumulates the scraped data.
# Note: it is not empty; its first row holds the Chinese column labels.
results <- data.frame(Title='题目',Author='负责人',Department='申请单位',
                      Type='研究类型',Project='项目批准号',Date='批准年度',Money='金额')

for(i in 1:lastpage_number){
  url <- paste0(site,i,'.html')
  web <- read_html(url,encoding = 'utf-8')
  # --- Grant titles ---
  Title <- web %>% html_nodes('ul a li h3') %>% html_text()
  # --- All labelled fields ---
  Information <- web %>% html_nodes('ul a li span') %>% html_text()
  # --- Principal investigator ---
  Author <- Information[grep('负责人', Information)]
  # --- Applicant institution ---
  Department <- Information[grep('申请单位', Information)]
  # --- Research type ---
  jijintype <- Information[grep('研究类型', Information)]
  # --- Project approval number ---
  Project <- Information[grep('项目批准号', Information)]
  # --- Approval year ---
  Date <- Information[grep('批准年度', Information)]
  # --- Funding amount ---
  Money <- Information[grep('金额', Information)]
  result <- data.frame(Title=Title,Author=Author,Department=Department,
                       Type=jijintype,Project=Project,Date=Date,Money=Money)
  # Append this page's rows to the accumulated data frame
  results <- rbind(results,result)
}
# Save everything to a CSV file
write.csv(results,file = '2016-2019_lncRNA国自然.csv')

The scrape yields more than 1,200 records in total.
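With this many pages, a single failed request would abort the whole loop. One possible variation, sketched below, skips failed pages, pauses between requests, and binds the per-page data frames once at the end. The names `scrape_pages`, `fetch_page`, and `fake_fetch` are hypothetical, and the fake fetcher only stands in for the real scraping code so the example runs offline.

```r
# Hypothetical helper: call `fetch_page(i)` for every page, skip pages that
# fail, pause `delay` seconds between requests, and bind the results once.
scrape_pages <- function(pages, fetch_page, delay = 0) {
  collected <- vector('list', length(pages))
  for (k in seq_along(pages)) {
    page_result <- tryCatch(
      fetch_page(pages[k]),
      error = function(e) {
        message('Page ', pages[k], ' failed: ', conditionMessage(e))
        NULL
      }
    )
    # Assign only on success; assigning NULL would delete the list slot
    if (!is.null(page_result)) collected[[k]] <- page_result
    Sys.sleep(delay)
  }
  # Drop the slots of failed pages, then bind all rows together
  collected <- Filter(Negate(is.null), collected)
  do.call(rbind, collected)
}

# Stand-in for the real page scraper so the sketch runs offline:
# page 2 fails on purpose, the others return one made-up row each.
fake_fetch <- function(i) {
  if (i == 2) stop('simulated network error')
  data.frame(Page = i, Title = paste('project on page', i))
}

results <- scrape_pages(1:3, fake_fetch)
print(results)
```

In real use, `fetch_page` would wrap the `read_html` and `html_nodes` steps from the loop, and a `delay` of a second or so would keep the request rate modest.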
