First, you need to get a copy of the Nutch code. You can download a release from http://lucene.apache.org/nutch/release/. Unpack the release and change into its top-level directory. Alternatively, check out the latest source code from Subversion and build it with Ant.
Try the following command:
bin/nutch
This will display the documentation for the Nutch command script.
This part of the work consists of the following steps:
1. Run Cygwin
After installing Cygwin, start it and execute the commands:
cd d:/nutch
cd nutch-0.9
Cygwin now shows the current directory as:
/cygdrive/d/nutch/nutch-0.9
In this directory, run the command bin/nutch. If everything is set up correctly, a "Usage: nutch COMMAND" message is displayed.
Intranet: Configuration
To configure things for intranet crawling you must:
- Create a directory with a flat file of root URLs. For example, to crawl the Nutch site you might start with a file named urls/nutch containing just the URL of the Nutch home page. All other Nutch pages should be reachable from this page. The urls/nutch file would thus contain:
http://lucene.apache.org/nutch/
- Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
+^http://([a-z0-9]*\.)*apache.org/
This will include any URL in the domain apache.org.
- Edit the file conf/nutch-site.xml and insert, at minimum, the following properties, filling in proper values for each (a filled-in example follows the list):
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot; this text is used in
  the User-Agent header. It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parentheses after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
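For illustration, a complete conf/nutch-site.xml filled in along those lines might look like the sketch below. All values here are placeholders invented for this example; substitute details describing your own crawler.

<?xml version="1.0"?>
<configuration>
  <!-- Placeholder values; replace with details for your own crawler. -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestSpider</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler for a local Nutch experiment</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/spider.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>spider at example dot com</value>
  </property>
</configuration>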
For step 1, create a folder urls under d:\nutch\nutch-0.9 and, inside it, a text file nutch.txt containing: http://lucene.apache.org/nutch/
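From the Cygwin shell reached above, this can be done with, for example:
mkdir urls
echo 'http://lucene.apache.org/nutch/' > urls/nutch.txt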
For step 2, open conf/crawl-urlfilter.txt, find MY.DOMAIN.NAME, and change the line to:
+^http://([a-z0-9]*\.)*apache.org/
For step 3, this experiment edits nutch-default.xml directly and modifies the following properties:
http.agent.name
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
For example:
<property>
  <name>http.agent.name</name>
  <value>NutchCVS</value>
  <description>Our HTTP 'User-Agent' request header.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch</value>
  <description>Further description of our bot; this text is used in
  the User-Agent header. It appears in parentheses after the agent name.</description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parentheses after the agent name.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch-agent@lucene.apache.org</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-0.9</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>
Save the file when these changes are complete.
Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:
- -dir dir names the directory to put the crawl in.
- -threads threads determines the number of threads that will fetch in parallel.
- -depth depth indicates the link depth from the root page that should be crawled.
- -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
For example, a typical call might be:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can range from tens of thousands to millions, depending on your resources.
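Under those guidelines, a full crawl might then be launched with something like the following (the depth and -topN values here are illustrative, not from the original text):
bin/nutch crawl urls -dir crawl -depth 10 -topN 100000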
Once crawling has completed, one can skip to the Searching section below.
Here it suffices to run the following command:
bin/nutch crawl urls -dir crawled -depth 3 -topN 50 >& crawl.log
When the run completes, a crawled folder and a crawl.log log file are generated.
In the log file you will find errors thrown for PDF files. That is because PDF files are not indexed by default; to index PDF files correctly, find the plugin.includes property in nutch-default.xml and add pdf to the parse plugin list, i.e. parse-(text|html|js|pdf).
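In the property value this is a one-word change. The sketch below abridges the default value of plugin.includes (your copy will list further plugins after the query group; leave those untouched and only add pdf inside the parse-(...) group):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
</property>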
The crawled folder contains the segments, linkdb, indexes, index, and crawldb directories.
At this point, the index data is ready.
The next part is how to run it under Tomcat.
Copy nutch-0.9.war into Tomcat's webapps directory and rename it to nutch.war;
Go into Tomcat's conf\Catalina\localhost directory and create a file nutch.xml with content along the following lines;
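A minimal sketch of such a context descriptor, assuming the renamed war stays under webapps (the attribute values are placeholders, not preserved from the original):

<?xml version="1.0" encoding="UTF-8"?>
<Context path="/nutch" docBase="nutch.war" reloadable="true">
</Context>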
Start Tomcat;
Go into the unpacked webapps\nutch\WEB-INF\classes directory and set the searcher.dir property in nutch-default.xml to D:\nutch\nutch-0.9\crawled;
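That is, the property entry would read (a sketch using the crawl output directory produced above):

<property>
  <name>searcher.dir</name>
  <value>D:\nutch\nutch-0.9\crawled</value>
</property>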
Open a browser and visit http://localhost:8080/ (with the context path above, the Nutch search page is at http://localhost:8080/nutch/);
You can now search: enter apache, and the relevant results are returned.