Scraping the Douban Movie Top 250

萌小芊 2018-01-19


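I lost the paper I was writing tonight because I forgot to save it, which made me want to cry at my own stupidity~ So I started scraping the Douban Movie Top 250 instead: 250 films across 10 pages. On pages 5 and 10, however, one film's short introduction is missing, so that variable ends up a different length from the others and they cannot go into the same data.frame. Converting them to a list did not help either, because rbind then fails. So I took the clumsy route and wrote separate functions for pages 5 and 10~ I hope someone more expert can write better code and share it with me~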

library(magrittr)

library(rvest)

library(xml2)

library(stringr)

site1<-'https://movie.douban.com/top250?start='

site2<-'&filter='

movie<-data.frame()

for(i in 1:4){

fun<-function(i){

site<-paste(site1,25*(i-1),site2,sep='')

web<-read_html(site)

name<-web%>%html_nodes('.title:nth-child(1)')%>%html_text()%>%str_trim()

introduction<-web%>%html_nodes('.inq')%>%html_text()

remark<-web%>%html_nodes('.rating_num')%>%html_text()%>%as.numeric()

number1<-web%>%html_nodes('.rating_num~span')%>%html_text()%>%str_trim()

no<-seq(2,50,2)

number<-number1[no]

movie<-data.frame(name,introduction,remark,number)

}

movie<-rbind(movie,fun(i))

}

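###### On page 5 the introduction variable has only 24 entries. After checking, it turns out that the fourth film on page 5, 《摔跤吧爸爸》 (Dangal), has no introduction. With my limited skills, I could only scrape page 5 separately.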

introduction<-NULL

fun<-function(i){

site<-paste(site1,25*(i-1),site2,sep='')

web<-read_html(site)

name<-web%>%html_nodes('.title:nth-child(1)')%>%html_text()%>%str_trim()

introduction1<-web%>%html_nodes('.inq')%>%html_text()

introduction<-c(introduction1[1:3],'无介绍',introduction1[4:24])

remark<-web%>%html_nodes('.rating_num')%>%html_text()%>%as.numeric()

number1<-web%>%html_nodes('.rating_num~span')%>%html_text()%>%str_trim()

no<-seq(2,50,2)

number<-number1[no]

movie<-data.frame(name,introduction,remark,number)

}

movie<-rbind(movie,fun(5))
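
The length mismatch on page 5 arises because a page-wide `html_nodes('.inq')` simply skips any film that has no introduction. The sketch below illustrates one possible way around it on a made-up two-film HTML fragment: select each film's container first, then call `html_node()` (singular) inside it, which returns one result per container and yields `NA` where nothing matches. The fragment and its `item` class are assumptions for illustration only, not Douban's actual markup.

```r
library(rvest)

# Made-up fragment for illustration: the second film has no .inq element
html <- '<ol>
  <li><div class="item"><span class="title">Film A</span><span class="inq">Quote A</span></div></li>
  <li><div class="item"><span class="title">Film B</span></div></li>
</ol>'
web <- read_html(html)

# Page-wide selection: the film without an introduction is silently skipped
flat <- web %>% html_nodes('.inq') %>% html_text()

# Per-film selection: html_node() keeps one entry per film, NA when missing
items <- web %>% html_nodes('.item')
name <- items %>% html_node('.title') %>% html_text()
introduction <- items %>% html_node('.inq') %>% html_text()

# Both columns now have the same length, so data.frame() works
demo <- data.frame(name, introduction, stringsAsFactors = FALSE)
print(demo)
```

With every column the same length, the missing entry no longer has to be patched in by position.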

######## Continue with pages 6-9

for(i in 6:9){

fun<-function(i){

site<-paste(site1,25*(i-1),site2,sep='')

web<-read_html(site)

name<-web%>%html_nodes('.title:nth-child(1)')%>%html_text()%>%str_trim()

introduction<-web%>%html_nodes('.inq')%>%html_text()

remark<-web%>%html_nodes('.rating_num')%>%html_text()%>%as.numeric()

number1<-web%>%html_nodes('.rating_num~span')%>%html_text()%>%str_trim()

no<-seq(2,50,2)

number<-number1[no]

movie<-data.frame(name,introduction,remark,number)

}

movie<-rbind(movie,fun(i))

}

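###### On page 10 the introduction variable again has only 24 entries. After checking, it turns out that the 20th film on page 10, 《你的名字》 (Your Name), has no introduction. It is handled the same way as page 5.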

introduction<-NULL

fun<-function(i){

site<-paste(site1,25*(10-1),site2,sep='')

web<-read_html(site)

name<-web%>%html_nodes('.title:nth-child(1)')%>%html_text()%>%str_trim()

introduction1<-web%>%html_nodes('.inq')%>%html_text()

introduction<-c(introduction1[1:19],'无介绍',introduction1[20:24])

remark<-web%>%html_nodes('.rating_num')%>%html_text()%>%as.numeric()

number1<-web%>%html_nodes('.rating_num~span')%>%html_text()%>%str_trim()

no<-seq(2,50,2)

number<-number1[no]

movie<-data.frame(name,introduction,remark,number)

}

movie<-rbind(movie,fun(10))

write.csv(movie,'C:\\Users\\Administrator\\Desktop\\movie.csv')
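
As a possible alternative to the separate functions for pages 5 and 10, here is a minimal sketch that handles all ten pages with one function. It assumes that each film on the page sits inside an element with class `item` (an assumption about Douban's markup that should be checked against the live page), and the helper names `page_url`, `parse_page` and `scrape_top250` are hypothetical. Films without an introduction get `NA` instead of a hand-inserted placeholder.

```r
library(rvest)

site1 <- 'https://movie.douban.com/top250?start='
site2 <- '&filter='

# 25 films per page, so page i starts at offset 25*(i-1)
page_url <- function(i) paste0(site1, 25 * (i - 1), site2)

# Turn one parsed page into a data frame with one row per film
parse_page <- function(web) {
  items <- web %>% html_nodes('.item')   # assumed per-film container
  name <- items %>% html_node('.title') %>% html_text(trim = TRUE)
  # NA when a film has no introduction, so the column length always matches
  introduction <- items %>% html_node('.inq') %>% html_text()
  remark <- items %>% html_node('.rating_num') %>% html_text() %>% as.numeric()
  # Same idea as seq(2,50,2): keep every second span after the rating
  number1 <- web %>% html_nodes('.rating_num~span') %>% html_text(trim = TRUE)
  number <- number1[c(FALSE, TRUE)]
  data.frame(name, introduction, remark, number, stringsAsFactors = FALSE)
}

# Download and parse the requested pages, then stack them into one data frame
scrape_top250 <- function(pages = 1:10) {
  per_page <- lapply(pages, function(i) parse_page(read_html(page_url(i))))
  do.call(rbind, per_page)
}

# Usage (needs network access):
# movie <- scrape_top250()
# write.csv(movie, 'movie.csv', row.names = FALSE)
```

Collecting the per-page data frames in a list and combining them once with `do.call(rbind, ...)` also avoids growing `movie` inside a loop.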

Once you wade into web scraping, copy and paste becomes a stranger~

If you are interested, give it a try: you can get the Douban Top 250 films without any copying and pasting.
