Learn These Python Libraries and Writing a Crawler Quickly Is No Problem (Detailed Steps, Source Code Included)

长沙7喜 2017-12-16


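As a convenient way to collect information from the web and extract the useful parts, web crawling keeps getting more valuable. With a simple language like Python, a little programming skill is enough to crawl complex websites.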


Python can do a great many things. I won't go over them here and will get straight to the essentials.

First, an introduction to the Scrapy command line.

Creating a crawler project

scrapy startproject project_name

For example:

    localhost:spider zhaofan$ scrapy startproject test1
    New Scrapy project 'test1', using template directory '/Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages/scrapy/templates/project', created in:
        /Users/zhaofan/Documents/python_project/spider/test1

    You can start your first spider with:
        cd test1
        scrapy genspider example
    localhost:spider zhaofan$
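If you script project creation, it helps to check the name before calling the command. The sketch below is my own helper, not part of Scrapy: startproject_argv is a hypothetical name, and the name rule (start with a letter or underscore, then only letters, digits and underscores) is my assumption about what Scrapy accepts.

```python
import re

# Assumed rule for acceptable project names (not taken from this article).
_NAME_RE = re.compile(r"[_a-zA-Z]\w*")


def startproject_argv(project_name):
    """Return the argv list for `scrapy startproject`, or raise ValueError."""
    if not _NAME_RE.fullmatch(project_name):
        raise ValueError(
            "project names should begin with a letter or underscore and "
            "contain only letters, numbers and underscores: %r" % project_name
        )
    return ["scrapy", "startproject", project_name]


print(" ".join(startproject_argv("test1")))
```

The returned list can be passed to subprocess.run(argv, check=True) on a machine where Scrapy is installed.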

At this point the project's directory structure has been created. It looks like this:

    |____scrapy.cfg
    |____test1
    | |______init__.py
    | |____items.py
    | |____middlewares.py
    | |____pipelines.py
    | |____settings.py
    | |____spiders
    | | |______init__.py
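A quick way to confirm that a generated project matches the listing above is to check for each file. This is only a sketch using the standard library; missing_project_files is a hypothetical helper name, and the file list is copied from the tree shown here, so it may differ for other Scrapy versions.

```python
from pathlib import Path

# Files that `scrapy startproject` created in the listing above.
EXPECTED = [
    "scrapy.cfg",
    "{name}/__init__.py",
    "{name}/items.py",
    "{name}/middlewares.py",
    "{name}/pipelines.py",
    "{name}/settings.py",
    "{name}/spiders/__init__.py",
]


def missing_project_files(project_dir, name):
    """Return the expected relative paths that do not exist under project_dir."""
    root = Path(project_dir)
    wanted = [entry.format(name=name) for entry in EXPECTED]
    return [rel for rel in wanted if not (root / rel).is_file()]
```

For the project above, calling missing_project_files("test1", "test1") from the directory that contains the outer test1 folder should return an empty list.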

Next, following the hint, we can generate a spider. Taking Baidu as the example, the command format is:

scrapy genspider spider_name domain

    localhost:test1 zhaofan$ scrapy genspider baiduSpider baidu.com
    Created spider 'baiduSpider' using template 'basic' in module:
      test1.spiders.baiduSpider
    localhost:test1 zhaofan$

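Commands in detail

Where commands can be used

Commands are divided into global commands and project commands. Global commands can be run anywhere, while project commands only work inside a project directory.

The global commands are: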

startproject

genspider

settings

runspider

shell

fetch

view

version

The project commands are:

crawl

check

list

edit

parse

bench
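The split above can be sketched in a few lines. I am assuming here that "inside a project" means the current directory or one of its parents contains scrapy.cfg; find_project_root and usable_commands are hypothetical helper names, not Scrapy functions.

```python
from pathlib import Path

GLOBAL_COMMANDS = ["startproject", "genspider", "settings", "runspider",
                   "shell", "fetch", "view", "version"]
PROJECT_COMMANDS = ["crawl", "check", "list", "edit", "parse", "bench"]


def find_project_root(start="."):
    """Walk upwards from `start` and return the first directory that
    contains scrapy.cfg, or None if there is none."""
    current = Path(start).resolve()
    for directory in [current, *current.parents]:
        if (directory / "scrapy.cfg").is_file():
            return directory
    return None


def usable_commands(start="."):
    """Global commands always work; project commands need a project."""
    if find_project_root(start) is None:
        return list(GLOBAL_COMMANDS)
    return GLOBAL_COMMANDS + PROJECT_COMMANDS
```

Running usable_commands() from inside test1 should include crawl, while running it from an unrelated directory should not.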

startproject

There is not much to this command: it is used to create a crawler project.

genspider

Generates a spider. Scrapy offers several templates for this, and the default is basic. All available templates can be listed with:

    localhost:test1 zhaofan$ scrapy genspider -l
    Available templates:
      basic
      crawl
      csvfeed
      xmlfeed
    localhost:test1 zhaofan$

A template can be specified when creating the spider. If none is given, basic is used. To specify one, use:

scrapy genspider -t template_name spider_name domain

    localhost:test1 zhaofan$ scrapy genspider -t crawl zhihuspider zhihu.com
    Created spider 'zhihuspider' using template 'crawl' in module:
      test1.spiders.zhihuspider
    localhost:test1 zhaofan$

crawl

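This starts a spider. The format is:

scrapy crawl spider_name

Note that the spider name here must match the name used when the spider was generated with scrapy genspider (the spider's name attribute).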

check

Checks the spiders for errors by running their contract checks: scrapy check

list

scrapy list lists all available spiders.

fetch

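scrapy fetch url

This command downloads the page source through the Scrapy downloader and prints it.

It takes a few options:

--nolog  do not print the log

--headers  print the response headers

--no-redirect  do not follow redirects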

view

scrapy view url

This command downloads the page and opens it in the browser.


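Many sites now load their data through AJAX requests, in which case a plain request (for example with requests) does not return the data we want. The view command shows the page as Scrapy sees it, which makes it easy to tell whether that is the case.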

shell

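This is an interactive command-line mode.

Enter it with scrapy shell url.

Here we can use CSS selectors and XPath selectors to get the content we want (how to use them is explained in detail later), for example with scrapy shell http://www.baidu.com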


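At the end the shell hands us a response object. It holds the same data we would get by requesting the page with requests.

view(response) opens the result directly in the browser

response.text returns the text of the page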

[Figure: screenshot of a simple CSS selector query in the Scrapy shell]

settings

Gets the current configuration.

scrapy settings -h shows all the help for this command:

    localhost:jobboleSpider zhaofan$ scrapy settings -h
    Usage
    =====
      scrapy settings [options]

    Get settings values

    Options
    =======
    --help, -h              show this help message and exit
    --get=SETTING           print raw setting value
    --getbool=SETTING       print setting value, interpreted as a boolean
    --getint=SETTING        print setting value, interpreted as an integer
    --getfloat=SETTING      print setting value, interpreted as a float
    --getlist=SETTING       print setting value, interpreted as a list

    Global Options
    --------------
    --logfile=FILE          log file. if omitted stderr will be used
    --loglevel=LEVEL, -L LEVEL
                            log level (default: DEBUG)
    --nolog                 disable logging completely
    --profile=FILE          write python cProfile stats to FILE
    --pidfile=FILE          write process ID to FILE
    --set=NAME=VALUE, -s NAME=VALUE
                            set/override setting (may be repeated)
    --pdb                   enable pdb on failure

A simple demonstration (my project's settings file contains the database configuration, which can be read this way; a setting that is not defined comes back as None):

    localhost:jobboleSpider zhaofan$ scrapy settings --get=MYSQL_HOST
    192.168.1.18
    localhost:jobboleSpider zhaofan$

runspider

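This differs from starting a spider with crawl. The format here is scrapy runspider spider_file_name

All spider files live in the spiders folder under the project directory.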

version

Shows the version and, with -v, the versions of the dependencies as well.

    localhost:~ zhaofan$ scrapy version
    Scrapy 1.3.2
    localhost:~ zhaofan$ scrapy version -v
    Scrapy    : 1.3.2
    lxml      : 3.7.3.0
    libxml2   : 2.9.4
    cssselect : 1.0.1
    parsel    : 1.1.0
    w3lib     : 1.17.0
    Twisted   : 17.1.0
    Python    : 3.5.2 (v3.5.2:4def2a2901a5, Jun 26 2016, 10:47:25) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]
    pyOpenSSL : 16.2.0 (OpenSSL 1.0.2k 26 Jan 2017)
    Platform  : Darwin-16.6.0-x86_64-i386-64bit

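Scrapy has its own mechanism for extracting data, called selectors, which pick out parts of an HTML document with XPath or CSS expressions.

XPath is a language for selecting nodes in XML documents. It can also be used on HTML.

CSS is a language for styling HTML documents. It defines selectors that associate styles with specific HTML elements.

XPath selectors

Below are some commonly used path expressions. XPath is very powerful and includes more than 100 built-in functions.

Commonly used expressions:

    Expression   Description
    nodename     Selects all nodes with the name nodename
    /            Selects from the root node
    //           Selects matching nodes anywhere below the current node, regardless of their position
    .            Selects the current node
    ..           Selects the parent of the current node
    @            Selects attributes
    *            Matches any element node
    @*           Matches any attribute node
    node()       Matches any node of any kind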
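Several of these expressions can be tried with nothing but the standard library. Note the assumptions: xml.etree.ElementTree implements only a subset of XPath (no @attr step, no text(), no node()), whereas Scrapy's selectors support full XPath 1.0, and the markup below is a made-up, well-formed snippet rather than a real page.

```python
import xml.etree.ElementTree as ET

# A small, well-formed page with a few links (hypothetical content).
PAGE = """
<html>
  <body>
    <div id="images">
      <a href="image1.html">one</a>
      <a href="image2.html">two</a>
    </div>
    <p class="note">text</p>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

body = root.find("body")                         # nodename: child named body
links = root.findall(".//a")                     # //: any depth below this node
parents = root.findall(".//a/..")                # ..: parent of each matched node
by_attr = root.findall(".//div[@id='images']")   # predicate on an attribute
all_children = body.findall("*")                 # *: any element child
hrefs = [a.get("href") for a in links]           # attributes are read with .get()
```

In Scrapy the last line would simply be response.xpath('//a/@href').extract(), since its selectors understand the @ step directly.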

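CSS selectors

CSS (Cascading Style Sheets) syntax has two main parts: a selector and one or more declarations.

selector {declaration1; declaration2; ...}

Commonly used selectors:

    Selector            Example          Description
    .class              .color           Selects all elements with class="color"
    #id                 #info            Selects all elements with id="info"
    *                   *                Selects all elements
    element             p                Selects all p elements
    element,element     div,p            Selects all div elements and all p elements
    element element     div p            Selects all p elements inside div elements
    [attribute]         [target]         Selects all elements that have a target attribute
    [attribute=value]   [target=_blank]  Selects all elements with target="_blank"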

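Selector examples

Having listed the common forms of both selector types, let's demonstrate them on a page provided by the Scrapy documentation.

URL: http://doc./en/latest/_static/selectors-sample1.html

The page source is: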

<html>

<head>

<base href='http:///' />

<title>Example website</title>

</head>

<body>

<div id='images'>

<a href='image1.html'>Name: My image 1 <br /><img src='image1_thumb.jpg' /></a>

<a href='image2.html'>Name: My image 2 <br /><img src='image2_thumb.jpg' /></a>

<a href='image3.html'>Name: My image 3 <br /><img src='image3_thumb.jpg' /></a>

<a href='image4.html'>Name: My image 4 <br /><img src='image4_thumb.jpg' /></a>

<a href='image5.html'>Name: My image 5 <br /><img src='image5_thumb.jpg' /></a>

</div>

</body>

</html>

We use scrapy shell http://doc./en/latest/_static/selectors-sample1.html to demonstrate both selector types.

Getting the title

extract_first() returns the text of the title tag. The xpath() call returns a list of selectors, so extract() also returns a list, while extract_first() returns the first value directly. extract_first() takes a default argument: for example, extract_first(default="") returns an empty string when nothing matches.

    In [1]: response.xpath('//title/text()')
    Out[1]: [<Selector xpath='//title/text()' data='Example website'>]

    In [2]: response.xpath('//title/text()').extract_first()
    Out[2]: 'Example website'

    In [6]: response.xpath('//title/text()').extract()
    Out[6]: ['Example website']

The same can be done with a CSS selector:

    In [7]: response.css('title::text')
    Out[7]: [<Selector xpath='descendant-or-self::title/text()' data='Example website'>]

    In [8]: response.css('title::text').extract_first()
    Out[8]: 'Example website'

Finding image information

Here XPath and CSS are combined to get the src of the images:

    In [13]: response.xpath('//div[@id="images"]').css('img')
    Out[13]:
    [<Selector xpath='descendant-or-self::img' data='<img src="image1_thumb.jpg">'>,
     <Selector xpath='descendant-or-self::img' data='<img src="image2_thumb.jpg">'>,
     <Selector xpath='descendant-or-self::img' data='<img src="image3_thumb.jpg">'>,
     <Selector xpath='descendant-or-self::img' data='<img src="image4_thumb.jpg">'>,
     <Selector xpath='descendant-or-self::img' data='<img src="image5_thumb.jpg">'>]

    In [14]: response.xpath('//div[@id="images"]').css('img::attr(src)').extract()
    Out[14]: ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

Finding a-tag information

Here the href and the text of the a tags are retrieved with both XPath and CSS selectors. CSS reads an attribute with ::attr(name), and XPath uses @name.

    In [15]: response.xpath('//a/@href')
    Out[15]:
    [<Selector xpath='//a/@href' data='image1.html'>,
     <Selector xpath='//a/@href' data='image2.html'>,
     <Selector xpath='//a/@href' data='image3.html'>,
     <Selector xpath='//a/@href' data='image4.html'>,
     <Selector xpath='//a/@href' data='image5.html'>]

    In [16]: response.xpath('//a/@href').extract()
    Out[16]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

    In [17]: response.css('a::attr(href)')
    Out[17]:
    [<Selector xpath='descendant-or-self::a/@href' data='image1.html'>,
     <Selector xpath='descendant-or-self::a/@href' data='image2.html'>,
     <Selector xpath='descendant-or-self::a/@href' data='image3.html'>,
     <Selector xpath='descendant-or-self::a/@href' data='image4.html'>,
     <Selector xpath='descendant-or-self::a/@href' data='image5.html'>]

    In [18]: response.css('a::attr(href)').extract()
    Out[18]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

    In [27]: response.css('a::text').extract()
    Out[27]: ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

    In [28]: response.xpath('//a/text()').extract()
    Out[28]: ['Name: My image 1 ', 'Name: My image 2 ', 'Name: My image 3 ', 'Name: My image 4 ', 'Name: My image 5 ']

Advanced usage

Find all links whose href attribute contains image, using contains in XPath (the CSS equivalent is [href*=image]):

    In [36]: response.xpath('//a[contains(@href,"image")]/@href').extract()
    Out[36]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

    In [37]: response.css('a[href*=image]::attr(href)').extract()
    Out[37]: ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html']

Getting the src attribute of the img elements

    In [41]: response.xpath('//a[contains(@href,"image")]/img/@src').extract()
    Out[41]: ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

    In [42]: response.css('a[href*=image] img::attr(src)').extract()
    Out[42]: ['image1_thumb.jpg', 'image2_thumb.jpg', 'image3_thumb.jpg', 'image4_thumb.jpg', 'image5_thumb.jpg']

Extract the part after Name: from the a-tag text. Selectors provide the regex methods re and re_first for this:

    In [43]: response.css('a::text').re('Name:(.*)')
    Out[43]: [' My image 1 ', ' My image 2 ', ' My image 3 ', ' My image 4 ', ' My image 5 ']

    In [44]: response.css('a::text').re_first('Name:(.*)')
    Out[44]: ' My image 1 '
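What re and re_first do can be approximated with the standard re module applied to the extracted strings. This is my own simplified sketch, not Scrapy's implementation: re_all and re_first below are hypothetical helpers, and they only handle patterns with at most one capture group.

```python
import re

# Some of the strings that response.css('a::text').extract() returned above.
texts = ["Name: My image 1 ", "Name: My image 2 ", "Name: My image 3 "]


def re_all(pattern, strings):
    """Collect every match of `pattern` across all strings, like .re()."""
    found = []
    for s in strings:
        found.extend(re.findall(pattern, s))
    return found


def re_first(pattern, strings, default=None):
    """Return only the first match, like .re_first()."""
    matches = re_all(pattern, strings)
    return matches[0] if matches else default


names = re_all(r"Name:(.*)", texts)
first = re_first(r"Name:(.*)", texts)
```

Because the pattern has a capture group, only the captured part is returned, which is why the results keep the space after the colon.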

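That is everything. I simply enjoy sharing, so please forgive any shortcomings! The basic principle of a crawler is to fetch the page source and then extract the content from it. In general, given one entry point, analysing the pages leads to any number of other related resources you need, which can then be crawled in turn.

I have also written many other very simple, beginner-level crawler tutorials. Follow me and click my avatar to find them.

Comments and discussion are welcome. Thank you!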