For a project I needed a simple spider that extracts domain names from directory-style pages (hao123, 9991, and the like) — first-level domains only. For a quick, efficient little program, Python's rapid development was the obvious choice, so: google, and google again. There is an open-source spider (Chilkat) with cleanly written code and fairly complete functionality; just call its API and you're done. A few quick notes for myself:

Download (hosted on Google Sites): http://sites.google.com/site/wangyanming111/ddd/ChilkatPython.zip

1) For convenience, copy chilkat.py and _chilkat.pyd into the Lib folder of your Python installation directory.
2) Verify the install: run "import chilkat" — if no exception is raised, you're set.
3) A simple application to cover the project's needs — a single hao123 run pulled in hundreds of thousands of domains.

code:

#coding=utf-8
import chilkat

spider = chilkat.CkSpider()
spider.Initialize("www.hao123.com")

# Add the first URL to crawl:
spider.AddUnspidered("http://www.hao123.com/")

# Avoid URLs matching these patterns:
spider.AddAvoidPattern("*更多*")  # the "更多" (More) pages

urllistMore = []
for i in range(0, 50):
    success = spider.CrawlNext()
    if success == True:
        # Record the URL of the page just spidered.
        urllistMore.append(spider.lastUrl())
        # The HTML is available in the LastHtml property.
    else:
        # Did we get an error, or are there simply no more URLs to crawl?
        if spider.get_NumUnspidered() == 0:
            print "No more URLs to spider"
        else:
            print spider.lastErrorText()
    # Sleep 1 second before spidering the next URL.
    spider.SleepMs(1000)

# Collect the unique domains of all outbound links.
# (The string array is created once, outside the loop, so that
# domains accumulate across all pages instead of being reset.)
domainList = chilkat.CkStringArray()
domainList.put_Unique(True)
for urlName in urllistMore:
    spider.AddUnspidered(urlName)
    success = spider.CrawlNext()
    for i in range(0, spider.get_NumOutboundLinks()):
        url = spider.getOutboundLink(i)
        domainList.Append(spider.getDomain(url))

for i in range(0, domainList.get_Count()):
    print domainList.getString(i)
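One caveat: getDomain() hands back the host portion of the URL, while the project only wants the first-level (registrable) domain — www.image.baidu.com should collapse to baidu.com. A minimal stdlib sketch of that post-processing step (first_level_domain is my own hypothetical helper, and its naive "last two labels" rule is an assumption that breaks on multi-part suffixes such as .com.cn; a real project should consult a public-suffix list):

```python
try:
    from urllib.parse import urlparse   # Python 3
except ImportError:
    from urlparse import urlparse       # Python 2

def first_level_domain(url):
    """Reduce a URL's host to its last two labels,
    e.g. http://www.image.baidu.com/x -> baidu.com.
    Naive rule: wrong for suffixes like .com.cn."""
    host = urlparse(url).netloc.split(':')[0]  # drop any :port
    labels = host.split('.')
    if len(labels) >= 2:
        return '.'.join(labels[-2:])
    return host
```

Running each collected URL through first_level_domain() before appending to domainList (with put_Unique(True) still on) would then de-duplicate at the registrable-domain level instead of the host level.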