For a project I needed a simple spider that extracts domain names from directory-style pages (hao123, 9991, and the like) — first-level domains only. For a quick, efficient little program, Python's rapid development was the obvious choice, so: google, and google again. There is an open-source spider (Chilkat) with cleanly written code and fairly complete functionality; just call its API and you're done. A few quick notes for myself:

Download (hosted on Google Sites): http://sites.google.com/site/wangyanming111/ddd/ChilkatPython.zip

1) For convenience, copy chilkat.py and _chilkat.pyd into the Lib folder of your Python installation directory.
2) Verify the install: run "import chilkat" — if no exception is raised, you're set.
3) A simple application to cover the project's needs — a single hao123 run pulled in hundreds of thousands of domains.

code:

#coding=utf-8
import chilkat

spider = chilkat.CkSpider()
spider.Initialize("www.hao123.com")

# Add the first URL to crawl:
spider.AddUnspidered("http://www.hao123.com/")

# Avoid URLs matching these patterns:
spider.AddAvoidPattern("*更多*")  # the "更多" (More) pages

urllistMore = []
for i in range(0, 50):
    success = spider.CrawlNext()
    if success == True:
        # Record the URL of the page just spidered.
        urllistMore.append(spider.lastUrl())
        # The HTML is available in the LastHtml property.
    else:
        # Did we get an error, or are there simply no more URLs to crawl?
        if spider.get_NumUnspidered() == 0:
            print "No more URLs to spider"
        else:
            print spider.lastErrorText()
    # Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000)

# Collect the unique domains of all outbound links.
# (The string array is created once, outside the loop, so that
# domains accumulate across all pages instead of being reset.)
domainList = chilkat.CkStringArray()
domainList.put_Unique(True)
for urlName in urllistMore:
    spider.AddUnspidered(urlName)
    success = spider.CrawlNext()
    for i in range(0, spider.get_NumOutboundLinks()):
        url = spider.getOutboundLink(i)
        domainList.Append(spider.getDomain(url))

for i in range(0, domainList.get_Count()):
    print domainList.getString(i)
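One caveat: getDomain() hands back the host portion of the URL, while the project only wants the first-level (registrable) domain — www.image.baidu.com should collapse to baidu.com. A minimal stdlib sketch of that post-processing step (first_level_domain is my own hypothetical helper, and its naive "last two labels" rule is an assumption that breaks on multi-part suffixes such as .com.cn; a real project should consult a public-suffix list):

```python
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def first_level_domain(url):
    """Reduce a URL's host to its last two labels,
    e.g. http://www.image.baidu.com/x -> baidu.com.
    Naive rule: wrong for suffixes like .com.cn."""
    host = urlparse(url).netloc.split(':')[0]  # drop any :port
    labels = host.split('.')
    if len(labels) >= 2:
        return '.'.join(labels[-2:])
    return host
```

Running each collected URL through first_level_domain() before appending to domainList (with put_Unique(True) still on) would then de-duplicate at the registrable-domain level instead of the host level.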