Developing a Sina Blog Crawler in Python

quasiceo 2014-06-05

(2012-10-17 22:37:24)

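Having some spare time, I would like to share a small Sina blog crawler I wrote a while ago. Given a list of web pages, it extracts the Sina blog addresses found on those pages and saves them to a file. The program runs multi-threaded, with 5 threads by default.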
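
QueueTask.py itself is not listed in this post, so the following is only a minimal sketch of how a fixed pool of 5 worker threads could drain a queue of target URLs. The names run_tasks, handler and num_threads are hypothetical and are not taken from the original program.

```python
import threading

try:
    import queue            # Python 3
except ImportError:
    import Queue as queue   # Python 2

def run_tasks(urls, handler, num_threads=5):
    """Call handler(url) for every URL using a fixed pool of worker threads."""
    tasks = queue.Queue()
    for url in urls:
        tasks.put(url)

    def worker():
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return              # no work left, let this thread exit
            try:
                handler(url)
            finally:
                tasks.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                    # wait until every worker has finished
```

A call such as run_tasks(urls, lambda u: getBlogUrl(u, savefile)) would then crawl each target page in one of the 5 threads. Because every thread appends to the same output file, a real implementation would probably want a lock around the file write.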

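I. Program components:
1. BlogSpider
The crawler itself. It mainly uses regular expressions to filter out the page links that meet the criteria.

2. QueueTask
The crawler's task queue, which controls the multi-threaded execution of the crawler.

3. logging.conf
The logging configuration file, used to configure the crawler's run-time log.

4. blogspider.log
The log file written while the crawler runs.

5. testlist.txt
The list of page URLs that the crawler should search.

6. bloglist.txt
Stores the new blog links that the crawler found and that passed the filter.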
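
The contents of logging.conf are not shown in this post. As a rough sketch of an equivalent setup done in code, the function below attaches a file handler to a parent logger named 'bloglog', so that messages from the 'bloglog.spider' logger used by BlogSpider end up in blogspider.log. The function name setup_logging and the message format are assumptions.

```python
import logging

def setup_logging(logfile='blogspider.log'):
    """Send messages from the 'bloglog' logger hierarchy to a log file."""
    parent = logging.getLogger('bloglog')
    parent.setLevel(logging.INFO)
    handler = logging.FileHandler(logfile)
    # Assumed format: timestamp, logger name, level, message.
    handler.setFormatter(logging.Formatter(
        '%(asctime)s %(name)s %(levelname)s %(message)s'))
    parent.addHandler(handler)
    return parent
```

Child loggers such as 'bloglog.spider' propagate their records to 'bloglog', so a single handler on the parent is enough.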

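II. Running it:
(Test environment: Windows 7 + Python 2.7.3)
1. Add the target page URLs to testlist.txt. Here I added the Sina Blog home page (http://blog.sina.com.cn/).

2. Run the command: python QueueTask.py
The crawler finds a large number of blog URLs (a little over 1,700) and saves them in bloglist.txt.

3. The blog list it found can be fed back in: copy the contents of bloglist.txt into testlist.txt and, like nuclear fission, the crawler finds even more blog addresses. Fun, isn't it? :-)

With this crawler you can collect a large list of Sina blogs. What you use that information for is up to you.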
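
Step 3 above is done by hand in the original workflow. As a sketch of how it might be automated, the hypothetical helper below appends the URLs from bloglist.txt to testlist.txt and skips any that are already listed. The function name and the one-URL-per-line file layout are assumptions.

```python
def merge_url_lists(found_path, targets_path):
    """Append URLs from found_path to targets_path, skipping known ones.

    Returns the number of URLs that were added.
    """
    try:
        with open(targets_path) as f:
            existing = f.read()
    except IOError:
        existing = ''               # no target list yet
    known = set(line.strip() for line in existing.splitlines() if line.strip())

    added = 0
    with open(found_path) as src, open(targets_path, 'a') as dst:
        # Make sure the first appended URL starts on its own line.
        if existing and not existing.endswith('\n'):
            dst.write('\n')
        for line in src:
            url = line.strip()
            if url and url not in known:
                dst.write(url + '\n')
                known.add(url)
                added += 1
    return added
```

Calling merge_url_lists('bloglist.txt', 'testlist.txt') before each run would grow the target list without repeating pages that are already queued.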
III. Core code:
-------------------------------BlogSpider----------------------------------------------
#-*- encoding: utf-8 -*-
#author : rayment
#CreateDate : 2012-07-19
#version 2.0
import re
import urllib2
import logging

logger = logging.getLogger('bloglog.spider')

def getBlogUrl(websize, savefile):
    '''
    Parse all Sina blog URLs from the given page.
    A Sina blog URL takes one of two forms:
    1) http://blog.sina.com.cn/xxxxx
    2) http://blog.sina.com.cn/x/xxxxxx
    re = http://blog.sina.com.cn/[\w]+[/\d]*
    '''
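    # websize must start with a scheme and host, e.g. http://blog.sina.com.cn
    urlre = re.compile(r'(http://[^/\\]+)', re.I)
    # every anchor element, from <a href=" through the closing </a>
    hrefre = re.compile(r'<a href=".*?<\/a>', re.I)
    # a Sina blog home page; the literal dots are escaped
    blogre = re.compile(r'http://blog\.sina\.com\.cn/[\w]+[/\d]*', re.I)
    # reject links to .htm/.html/.xml pages, markup fragments and deeper paths
    filterre = re.compile(r'\.html?|\.xml|</p>|http://blog\.sina\.com\.cn/[\w]+/[\w]+/', re.I)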
    urlmatch = urlre.match(websize)
    if not urlmatch:
        #print '%s is not a correct url.'%websize
        logger.info('%s is not a correct url.'%websize)
    else:
        try:
            urllist = []
            fd = urllib2.urlopen(websize)
            content = fd.read()
            fd.close()
            #print '\nConnected to %s successfully.'%(websize)
            logger.info('Connected to %s successfully.'%(websize))
            hrefs = hrefre.findall(content)
            for href in hrefs:
                splits = href.split(' ')
                if len(splits) != 1:
                    href = splits[1]
                #get text of href tag
                matches = re.match('href="(.*)"', href)
                if matches is not None:
                    url = matches.group(1)
                    if blogre.match(url) is not None:
                        if filterre.findall(url):
                            pass
                        else:
                            urllist.append(url)
            saveFile(filterDuplicateData(urllist), savefile)
        except Exception as error:
            #print error
            logger.error(error)

def filterDuplicateData(ls):
    '''
    filter duplicate data
    '''
    newls = []
    for data in ls:
        if data not in newls:
            newls.append(data)
    '''
    newls = list(set(ls))
    newls.sort(key = ls.index)
    '''
    #print 'Search blog url------>'
    logger.info('Search blog url------>')
    num = 1
    for item in newls:
        #print '%d: %s'%(num, item)
        logger.info('%d: %s'%(num, item))
        num = num + 1
    return newls

def saveFile(bloglist, savefile):
    '''
    save urllist in a text file
    '''
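    if bloglist:
        # Load the URLs already saved as a set of whole lines, so the
        # membership test below is an exact match rather than a substring
        # match. A missing file simply means nothing has been saved yet.
        try:
            with open(savefile, 'rb') as fr:
                temp = set(fr.read().splitlines())
        except IOError:
            temp = set()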
        with open(savefile, 'ab') as fw:
            logger.info('Add blog url------>')
            #print 'Add blog url------>'
            for it in bloglist:
                it = it.strip()
                if it not in temp:
                    #print it
                    logger.info(it)
                    fw.write(it+'\n')
    else:
        #print 'No blog URLs found.'
        logger.info('No blog URLs found.')


if __name__ == '__main__':
    websize ='http://blog.sina.com.cn/raymentblog'
    savefile = r'D:\BlogSpider\bloglist.txt'
    getBlogUrl(websize, savefile)
---------------------------------------------------------------------------------------
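
To make the filtering rule easier to check in isolation, here is a small sketch that applies the same two-step test as getBlogUrl (match the blog pattern, then reject anything the filter pattern hits) to a single URL. The helper name is_blog_url and the constant names are hypothetical, and the patterns use escaped dots, which is an assumption about what the listing intends.

```python
import re

# A Sina blog home page: http://blog.sina.com.cn/xxxxx or .../x/xxxxxx
BLOG_RE = re.compile(r'http://blog\.sina\.com\.cn/[\w]+[/\d]*', re.I)
# Anything that looks like an article page, a feed, or a deeper path
FILTER_RE = re.compile(
    r'\.html?|\.xml|</p>|http://blog\.sina\.com\.cn/[\w]+/[\w]+/', re.I)

def is_blog_url(url):
    """Return True if url looks like a blog home page rather than a sub-page."""
    return BLOG_RE.match(url) is not None and FILTER_RE.search(url) is None
```

With these patterns, http://blog.sina.com.cn/raymentblog and http://blog.sina.com.cn/u/1234567890 are accepted, while an article link ending in .html, a path such as http://blog.sina.com.cn/lm/rank/, and the bare home page http://blog.sina.com.cn/ are rejected.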
