Earlier we ran into a classic scraping problem: the pages a scraper dreads most are dynamic pages that depend on JavaScript. We handled the setup in [configuring a Selenium scraping environment in R](), but installing and configuring Selenium for use from R, and opening a JavaScript-driven page, is only the beginning; next we need to interact with that page in all sorts of ways. First, some links for further study:

- https:///tutorials/rselenium_tutorial/
- http:///2019/01/22/tutorial-web-scraping-rselenium/
If you open the page http://www./plasmid/template/plasmid/plasmid_list.html yourself in Chrome, you will find that its source code contains none of the "next page" and similar buttons at all; they are hidden behind the JavaScript loaders shown below:

```html
<script type="text/javascript">
document.write('<script src="../../js/common/base.js?v='+new Date().getTime()+'" type="text/javascript" charset="utf-8"><\/script>');
document.write('<script src="../../js/util/jump.js?v='+new Date().getTime()+'" type="text/javascript" charset="utf-8"><\/script>');
document.write('<script src="../../js/common/common_methods.js?v='+new Date().getTime()+'" type="text/javascript" charset="utf-8"><\/script>');
document.write('<script src="../../js/common/parts.js?v='+new Date().getTime()+'" type="text/javascript" charset="utf-8"><\/script>');
document.write('<script src="../../js/plasmid/plasmid_list.js?v='+new Date().getTime()+'" type="text/javascript" charset="utf-8"><\/script>');
document.write('<script src="../../js/plasmid/plasmid_list_mobile.js?v='+new Date().getTime()+'" type="text/javascript" charset="utf-8"><\/script>');
</script>
```

The site's developers do not publish these scripts, so we cannot inspect the functions inside them. What we can do is open the page in a Selenium-driven browser. First load the packages:

```r
############## Load packages ##############
library(RSelenium)  # drive a real browser so the JavaScript actually runs
library(rvest)      # read_html() and friends for parsing the rendered page
library(stringr)    # string helpers
```
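As a quick sanity check, you can confirm with base R alone that the static page carries nothing but these script loaders: a simple regular expression pulls the `src` paths out of the `<script>` block. The HTML string below is a simplified stand-in for the snippet above (the real page appends a timestamp via `new Date().getTime()`; here it is hardcoded as `v=123` for illustration).

```r
# A simplified copy of the <script> block from the static page
html <- '<script src="../../js/common/base.js?v=123" type="text/javascript"></script>
<script src="../../js/plasmid/plasmid_list.js?v=123" type="text/javascript"></script>'

# Extract every src="..." attribute with a regular expression
m    <- regmatches(html, gregexpr('src="[^"]+"', html))[[1]]
srcs <- sub('^src="', '', sub('"$', '', m))
print(srcs)  # only script paths -- no buttons, no table rows
```

Nothing in the static HTML is worth parsing directly, which is exactly why a rendered-browser approach is needed.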
```r
############## Connect to the Selenium server and open a browser ##############
remDr <- remoteDriver(remoteServerAddr = "127.0.0.1",
                      port = 4444,
                      browserName = "chrome")  # connect to the server
remDr$open()  # open the browser
remDr$navigate("http://www./plasmid/template/plasmid/plasmid_list.html")  # open the page
```

Now the page source does contain content. What matters most to us is the "next page" (下一页) button, and finding it takes a little HTML knowledge: you need to know the difference between id, class, css, and name, since these are the locator strategies `findElement()` accepts:

```html
<div id="pages"><div class="layui-box layui-laypage layui-laypage-default" id="layui-laypage-1">
  <a href="javascript:;" class="layui-laypage-prev layui-disabled" data-page="0">上一页</a>
  <span class="layui-laypage-curr"><em class="layui-laypage-em"></em><em>1</em></span>
  <a href="javascript:;" data-page="2">2</a><span class="layui-laypage-spr">…</span>
  <a href="javascript:;" class="layui-laypage-last" title="尾页" data-page="3396">3396</a>
  <a href="javascript:;" class="layui-laypage-next" data-page="2">下一页</a></div></div>
```

```r
# Locate the "next page" link by its class and click it
webElem <- remDr$findElement(using = 'class', value = "layui-laypage-next")
webElem$clickElement()
```

The page only displays one batch of plasmids at a time, so we click through all 3396 pages in turn to collect every plasmid's entry and grab its hyperlink URL:

```r
# Extract the links on the current page
links <- remDr$getPageSource()[[1]] %>%
  read_html() %>%
  html_nodes("a") %>%
  html_attr("href")

i <- 1
while (i < 3396) {
  i <- i + 1
  webElem <- remDr$findElement(using = 'class', value = "layui-laypage-next")
  webElem$clickElement()
  lks <- remDr$getPageSource()[[1]] %>%
    read_html() %>%
    html_nodes("a") %>%
    html_attr("href")
  print(lks)
  links <- c(links, lks)
}
links <- unique(links)
save(links, file = 'plasmid_detail_links.Rdata')
```

Of course, the collected links themselves need a second round of visits to extract the information on each page. The `links` variable above stores the URL of every plasmid's description page, so next we visit each one in a loop:

```r
load(file = 'plasmid_detail_links.Rdata')
kp    <- grepl('plasmid_detail.html', links)  # keep only the detail pages
links <- links[kp]
length(links)

remDr <- remoteDriver(remoteServerAddr = "127.0.0.1",
                      port = 4444,
                      browserName = "chrome")  # connect to the server
remDr$open()  # open the browser

for (i in 1:5000) {    # the first 5000 detail pages; use seq_along(links) for all
  print(i)
  url <- links[i]
  remDr$navigate(url)  # open the page
  Sys.sleep(0.5)       # give the JavaScript time to render
  print(remDr$getCurrentUrl())

  htmls <- remDr$getPageSource()[[1]]
  page  <- read_html(htmls)  # parse once, query several times
  # The three <div class="panel-body"> blocks hold the descriptive text
  bodys <- page %>% html_nodes('.panel-body')
  c1 <- bodys[[1]] %>% html_text()
  c2 <- bodys[[2]] %>% html_text()
  c3 <- bodys[[3]] %>% html_text()
  # Strip the tabs and newlines padding the text
  c1 <- gsub('[\t\n]', '', c1)
  c2 <- gsub('[\t\n]', '', c2)
  c3 <- gsub('[\t\n]', '', c3)
  # id="plasmidName"
  plasmidName <- page %>% html_nodes('#plasmidName') %>% html_text()
  # id="plasmid_identification"
  plasmid_identification <- page %>% html_nodes('#plasmid_identification') %>% html_text()
  info <- data.frame(plasmidName, plasmid_identification, c1, c2, c3)
  rm(htmls)
  write.table(info, file = 'info1.txt',
              col.names = FALSE, row.names = FALSE, append = TRUE)
}
```
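The link-collection, filtering, and text-cleaning steps above can be tried without a running Selenium session. The vectors below are made-up stand-ins for what `html_attr("href")` and `html_text()` return; the sketch mirrors the `unique()`/`grepl()`/`gsub()` logic from the loops.

```r
# Mock hrefs as html_attr("href") would return them across two pages:
# detail links, "javascript:;" stubs from the pager, and repeats
page1 <- c("plasmid_detail.html?id=1", "javascript:;", "plasmid_detail.html?id=2")
page2 <- c("plasmid_detail.html?id=2", "javascript:;", "plasmid_detail.html?id=3")

# Accumulate across pages, then deduplicate, as the while-loop does
links <- unique(c(page1, page2))

# Keep only the plasmid detail pages, as the grepl() filter does
links <- links[grepl("plasmid_detail.html", links, fixed = TRUE)]
print(links)  # three unique detail URLs remain

# Strip the tab/newline padding around scraped panel text, as gsub() does
txt <- "\t\tAmpicillin resistance\n"
txt <- gsub("[\t\n]", "", txt)
print(txt)
```

Prototyping these pure-R steps on mock input first keeps the slow, stateful browser session out of the debug loop.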
More advanced tutorials on scraping dynamic or login-required pages with RSelenium + rvest:

- https://www.jianshu.com/p/e5b252c90e0d
- https://blog.csdn.net/qq_33291559/article/details/80028119
- https://www.jianshu.com/p/1fc6a6817160
- CSS selector reference: https://www.runoob.com/cssref/css-selectors.html
- XPath selector reference: https://www.runoob.com/xpath/xpath-tutorial.html
- Finding nodes with XPath: https://www.cnblogs.com/txwen/p/7999485.html
- An entertaining explanation of GET vs POST: https://zhuanlan.zhihu.com/p/22536382
- RCurl explained: https://blog.csdn.net/kMD8d5R/article/details/78933384
- HTML tags and basic HTTP knowledge: https://www.w3school.com.cn/tags/html_ref_byfunc.asp
- Simulating browsing behaviour with rvest: https://blog.csdn.net/weixu22/article/details/79237512
- Simulating page clicks with rvest: https://www./html/224799.html
A friendly plug to finish: we strongly encourage you to recommend our 生信技能树 (Biotrainee) community to the postdocs and young biology PIs around you, to help them build a bit more data literacy and take their research up a level.