Python novel-scraping tutorial: scraping novels with urllib2 and BeautifulSoup

东西二王 2019-05-20

Libraries

urllib2

Sends HTTP requests and retrieves the HTML of a page.
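As a rough sketch of this step, here is how a page fetch might look. It assumes Python 3, where the functionality of urllib2 lives in urllib.request. The helper names build_request and fetch_html and the User-Agent string are illustrative assumptions, not part of the original script.

```python
import urllib.request

# Hypothetical User-Agent value, chosen only for this sketch.
USER_AGENT = "Mozilla/5.0 (tutorial sketch)"


def build_request(url):
    """Build a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch_html(url, timeout=10):
    """Send the request and return the response body decoded as text."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Building the request does not touch the network; only fetch_html() does.
request = build_request("https://www.qidian.com/free/all")
print(request.full_url)
```

Calling fetch_html(url) returns the HTML as a string, which can then be handed to BeautifulSoup.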

BeautifulSoup

Selects DOM nodes by selector; see the CSS selector reference for the syntax.

Scraping logic

1. Open Qidian's list of free novels: https://www.qidian.com/free/all

2. First work out how to scrape a single book

2.1 Use a selector to get the book's link and title

bookCover = book.select('div[class="book-mid-info"] h4 > a')[0]

The CSS selector takes us straight to the link inside the div we need.
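To see the selector at work without touching the network, here is a small sketch run against made-up markup. The HTML fragment, the book title and the href are invented for illustration; only the class name and the selector come from this tutorial.

```python
from bs4 import BeautifulSoup

# Made-up markup that mirrors the class names used by the selector.
SAMPLE_HTML = """
<ul class="all-img-list cf">
  <li>
    <div class="book-mid-info">
      <h4><a href="//book.qidian.com/info/1">Sample Book</a></h4>
    </div>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
book = soup.select("li")[0]

# select() returns a list of matching tags; take the first match.
book_cover = book.select('div[class="book-mid-info"] h4 > a')[0]
title = book_cover.string  # text inside the <a> tag
href = book_cover["href"]  # link to the book page
print(title, href)
```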

2.2 Create and open the file

 bookFile = open('crawler/books/' + bookCover.string + '.txt', 'a+')

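Opening the file in 'a+' mode creates it if it does not exist and appends to it if it does. The .txt file is named after the text of the DOM node we just selected.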
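The append behaviour is easy to check in isolation. The sketch below assumes Python 3 and writes to a temporary directory; safe_filename and open_book_file are hypothetical helpers, and replacing characters such as "/" in the title is an extra precaution that the original script does not take.

```python
import os
import re
import tempfile


def safe_filename(title):
    """Replace characters that are not allowed in file names with '_'."""
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()


def open_book_file(directory, title):
    """Open <directory>/<title>.txt for appending, creating it if needed."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, safe_filename(title) + ".txt")
    return open(path, "a+", encoding="utf-8")


# Two separate open() calls: the second one appends instead of overwriting.
demo_dir = tempfile.mkdtemp()
for line in ("chapter 1\n", "chapter 2\n"):
    with open_book_file(demo_dir, "Sample/Book") as book_file:
        book_file.write(line)

with open(os.path.join(demo_dir, "Sample_Book.txt"), encoding="utf-8") as f:
    contents = f.read()
print(contents)
```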

2.3 Jump to the body text

First take the href of the 'div[class="book-mid-info"] h4 > a' node and fetch the page it points to.

Then take the href of the 免费试读 (free trial read) node and fetch the page it points to.
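The full script treats these hrefs as protocol-relative (they start with //) and prepends 'https:' by hand. As an alternative sketch, assuming Python 3, urljoin from the standard library resolves such links against the page they were found on. The helper name absolute_url and the sample hrefs are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.qidian.com/free/all"


def absolute_url(href, base=BASE_URL):
    """Resolve an href against the URL of the page it was found on."""
    return urljoin(base, href)


# A protocol-relative href inherits the scheme of the base URL.
print(absolute_url("//book.qidian.com/info/1"))
```

The same call also handles hrefs that are already absolute or that start with a single slash, so the caller does not have to check the prefix.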

2.4 Recursively fetch the content of each chapter and write it to the file

Get the content nodes by class, then take the href of the next chapter and fetch each chapter recursively.
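Here is a sketch of parsing a single chapter page. The markup is invented; only the class names j_chapterName and j_readContent and the id j_chapterNext are taken from the full script. It uses get_text() where the original reads p.next, and parse_chapter is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Made-up chapter page that reuses the class names and id from the full script.
CHAPTER_HTML = """
<h3 class="j_chapterName">Chapter 1</h3>
<div class="j_readContent"><p>First paragraph.</p><p>Second paragraph.</p></div>
<a id="j_chapterNext" href="//read.qidian.com/chapter/2">下一章</a>
"""


def parse_chapter(html):
    """Return (title, paragraphs, next_label, next_href) for one chapter page."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select("h3.j_chapterName")[0].get_text(strip=True)
    paragraphs = [p.get_text(strip=True) for p in soup.select(".j_readContent p")]
    next_link = soup.select("a#j_chapterNext")[0]
    return title, paragraphs, next_link.get_text(strip=True), next_link["href"]


title, paragraphs, next_label, next_href = parse_chapter(CHAPTER_HTML)
print(title, paragraphs, next_label, next_href)
```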

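If the link reads 书末页 (end of book) rather than pointing to a next chapter, the current chapter is the last one; the recursion stops and the whole book has been fetched.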
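The same stop condition can also be written as a plain loop. This is a swapped-in alternative to the recursion described above, not what the original script does; a loop avoids running into Python's recursion limit on books with very many chapters. The in-memory FAKE_PAGES table and the collect_chapters helper are hypothetical stand-ins for real page fetches.

```python
END_OF_BOOK = "书末页"  # label that marks the last chapter

# Hypothetical in-memory "site": url -> (chapter text, next label, next url).
FAKE_PAGES = {
    "ch1": ("Chapter 1 text", "下一章", "ch2"),
    "ch2": ("Chapter 2 text", "下一章", "ch3"),
    "ch3": ("Chapter 3 text", END_OF_BOOK, ""),
}


def collect_chapters(start_url, fetch):
    """Follow next-chapter links from start_url until the end-of-book label."""
    chapters = []
    url = start_url
    while True:
        text, next_label, next_url = fetch(url)
        chapters.append(text)
        if next_label == END_OF_BOOK:
            break  # last chapter reached, nothing more to follow
        url = next_url
    return chapters


chapters = collect_chapters("ch1", FAKE_PAGES.__getitem__)
print(chapters)
```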

Loop over every book on the current page

Each book is an li element. Get all the li elements first, then process each one as described in step 2.
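Here is a sketch of that loop over invented markup, including the start_index slice that the full script uses to resume part-way through a page. The list contents and the list_books helper are made up for illustration; the selectors are the ones from this tutorial.

```python
from bs4 import BeautifulSoup

# Made-up list page with three books.
LIST_HTML = """
<ul class="all-img-list cf">
  <li><div class="book-mid-info"><h4><a href="//book.qidian.com/info/1">Book A</a></h4></div></li>
  <li><div class="book-mid-info"><h4><a href="//book.qidian.com/info/2">Book B</a></h4></div></li>
  <li><div class="book-mid-info"><h4><a href="//book.qidian.com/info/3">Book C</a></h4></div></li>
</ul>
"""


def list_books(html, start_index=0):
    """Return (title, href) for each book on the page, skipping the first start_index."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.select('ul[class="all-img-list cf"] > li')
    books = []
    for item in items[start_index:]:
        link = item.select('div[class="book-mid-info"] h4 > a')[0]
        books.append((link.get_text(strip=True), link["href"]))
    return books


print(list_books(LIST_HTML, start_index=1))
```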

Loop over the books on every page

Once every book on the current page has been scraped, take the href of the > (next page) link, fetch the page it points to, and continue the loop.

This continues until the last page, where the > DOM node gains the additional class lbf-pagination-disabled; that class can be used to detect the last page.
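The class check might be sketched as follows. The two anchor fragments, including their href values, are invented; only the class names lbf-pagination-next and lbf-pagination-disabled come from this tutorial, and next_page_href is a hypothetical helper.

```python
from bs4 import BeautifulSoup

# Made-up pagination links for a middle page and for the last page.
MIDDLE_PAGE = '<a class="lbf-pagination-next" href="//www.qidian.com/free/all?page=3">&gt;</a>'
LAST_PAGE = '<a class="lbf-pagination-next lbf-pagination-disabled" href="javascript:;">&gt;</a>'


def next_page_href(html):
    """Return the href of the next page, or None when the > link is missing or disabled."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("a.lbf-pagination-next")
    if not links:
        return None
    link = links[0]
    # BeautifulSoup exposes the class attribute as a list of class names.
    if "lbf-pagination-disabled" in link.get("class", []):
        return None
    return link.get("href")


print(next_page_href(MIDDLE_PAGE))
print(next_page_href(LAST_PAGE))
```

Returning None for the last page gives the outer loop an explicit stop signal instead of relying on the shape of the href.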

Full code

# coding=utf-8
import urllib2
import sys
from bs4 import BeautifulSoup

# Set the default encoding (Python 2 only)
reload(sys)
sys.setdefaultencoding('utf-8')

startIndex = 0  # start from book 0 by default
startPage = 0   # start from page 0 by default


# Fetch the content of one chapter, then follow the link to the next one
def getChapterContent(file, url):
    try:
        bookContentRes = urllib2.urlopen(url)
        bookContentSoup = BeautifulSoup(bookContentRes.read(), 'html.parser')
        file.write(bookContentSoup.select('h3[class="j_chapterName"]')[0].string + '\n')
        for p in bookContentSoup.select('.j_readContent p'):
            file.write(p.next + '\n')
    except Exception as e:
        # If something went wrong, print the error and run the same chapter again
        print(e)
        getChapterContent(file, url)
    else:
        chapterNext = bookContentSoup.select('a#j_chapterNext')[0]
        if chapterNext.string != '书末页':
            nextUrl = 'https:' + chapterNext['href']
            getChapterContent(file, nextUrl)


# Fetch the content of every book on the current page
def getCurrentUrlBooks(url):
    response = urllib2.urlopen(url)
    the_page = response.read()
    soup = BeautifulSoup(the_page, 'html.parser')
    bookArr = soup.select('ul[class="all-img-list cf"] > li')
    global startIndex
    if startIndex > 0:
        bookArr = bookArr[startIndex:]
        startIndex = 0
    for book in bookArr:
        bookCover = book.select('div[class="book-mid-info"] h4 > a')[0]
        print('Title: ' + bookCover.string)
        # Create the .txt file first, then fetch the text and write it
        bookFile = open('crawler/books/' + bookCover.string + '.txt', 'a+')
        bRes = urllib2.urlopen('https:' + bookCover['href'])
        bSoup = BeautifulSoup(bRes.read(), 'html.parser')
        bookContentHref = bSoup.select('a[class="red-btn J-getJumpUrl "]')[0]['href']
        getChapterContent(bookFile, 'https:' + bookContentHref)
        bookFile.close()
    nextPage = soup.select('a.lbf-pagination-next')[0]
    return nextPage['href']


if len(sys.argv) == 1:
    pass
elif len(sys.argv) == 2:
    startPage = int(sys.argv[1]) // 20  # page to start downloading from
    startIndex = int(sys.argv[1]) % 20  # book to start downloading from
elif len(sys.argv) > 2:
    startPage = int(sys.argv[1])
    startIndex = int(sys.argv[2])

# Set where to start downloading based on the arguments passed in
url = '//www.qidian.com/free/all?orderId=&vip=hidden&style=1&pageSize=20&siteid=1&pubflag=0&hiddenField=1&page=' + str(startPage + 1)

# Loop until there is no next page
while True:
    if url.startswith('//'):
        url = getCurrentUrlBooks('https:' + url)
    else:
        break
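The command-line handling at the end of the script can be isolated into a small function. This is only a sketch: parse_start and the script name crawler.py are hypothetical, while the rule itself follows the script. A single argument is an overall book number split into page and position at 20 books per page, and two arguments give the page and the position directly.

```python
BOOKS_PER_PAGE = 20  # matches pageSize=20 in the list URL


def parse_start(argv):
    """Return (start_page, start_index), both zero-based, from sys.argv-style input."""
    if len(argv) == 2:
        # One argument: overall book number, split into page and position.
        return divmod(int(argv[1]), BOOKS_PER_PAGE)
    if len(argv) > 2:
        # Two arguments: page and position given directly.
        return int(argv[1]), int(argv[2])
    return 0, 0


print(parse_start(["crawler.py", "45"]))  # (2, 5)
```

So a run such as "python crawler.py 45" would resume at the sixth book of the third page, since both values count from zero.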