First, you need to get a copy of the Nutch code. You can download a release from http://lucene.apache.org/nutch/release/. Unpack the release and change into its top-level directory. Alternatively, check out the latest source code from Subversion and build it with Ant.
Try the following command:
bin/nutch
This will display the documentation for the Nutch command script.
This part of the work consists of the following steps:
1. Run Cygwin
After installing Cygwin, start it and execute the commands:
cd d:/nutch
cd nutch-0.9
Cygwin now shows the current directory as:
/cygdrive/d/nutch/nutch-0.9
In this directory, run the command bin/nutch. If everything is set up correctly, a "Usage: nutch COMMAND" message is displayed.
Intranet: Configuration
To configure things for intranet crawling you must:
- Create a directory with a flat file of root URLs. For example, to crawl the Nutch site you might start with a file named urls/nutch containing just the URL of the Nutch home page. All other Nutch pages should be reachable from this page. The urls/nutch file would thus contain:
http://lucene.apache.org/nutch/
- Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the apache.org domain, the line should read:
+^http://([a-z0-9]*\.)*apache.org/
This will include any URL in the domain apache.org.
- Edit the file conf/nutch-site.xml and insert, at minimum, the following properties, filling in proper values for each (a filled-in example follows the list):
<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty -
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

  and set their values appropriately.
  </description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot; this text is used in
  the User-Agent header. It appears in parentheses after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parentheses after the agent name. Custom dictates that this
  should be a URL of a page explaining the purpose and behavior of this
  crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header. A good practice is to mangle this
  address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>
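For illustration, a complete conf/nutch-site.xml filled in along those lines might look like the sketch below. All values here are placeholders invented for this example; substitute details describing your own crawler.

<?xml version="1.0"?>
<configuration>
  <!-- Placeholder values; replace with details for your own crawler. -->
  <property>
    <name>http.agent.name</name>
    <value>MyTestSpider</value>
  </property>
  <property>
    <name>http.agent.description</name>
    <value>Test crawler for a local Nutch experiment</value>
  </property>
  <property>
    <name>http.agent.url</name>
    <value>http://www.example.com/spider.html</value>
  </property>
  <property>
    <name>http.agent.email</name>
    <value>spider at example dot com</value>
  </property>
</configuration>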
For step 1, create a folder urls under d:\nutch\nutch-0.9 and, inside it, a text file nutch.txt containing: http://lucene.apache.org/nutch/
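From the Cygwin shell reached above, this can be done with, for example:
mkdir urls
echo 'http://lucene.apache.org/nutch/' > urls/nutch.txt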
For step 2, open conf/crawl-urlfilter.txt, find MY.DOMAIN.NAME, and change the line to:
+^http://([a-z0-9]*\.)*apache.org/
For step 3, this experiment edits nutch-default.xml directly and modifies the following properties:
http.agent.name
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
For example:
<property>
  <name>http.agent.name</name>
  <value>NutchCVS</value>
  <description>Our HTTP 'User-Agent' request header.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch</value>
  <description>Further description of our bot; this text is used in
  the User-Agent header. It appears in parentheses after the agent name.</description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/</value>
  <description>A URL to advertise in the User-Agent header. This will
  appear in parentheses after the agent name.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch-agent@lucene.apache.org</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-0.9</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>
Save the file when these changes are complete.
Intranet: Running the Crawl
Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:
- -dir dir names the directory to put the crawl in.
- -threads threads determines the number of threads that will fetch in parallel.
- -depth depth indicates the link depth from the root page that should be crawled.
- -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
For example, a typical call might be:
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can range from tens of thousands to millions, depending on your resources.
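Under those guidelines, a full crawl might then be launched with something like the following (the depth and -topN values here are illustrative, not from the original text):
bin/nutch crawl urls -dir crawl -depth 10 -topN 100000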
Once crawling has completed, one can skip to the Searching section below.
Here it suffices to run the following command:
bin/nutch crawl urls -dir crawled -depth 3 -topN 50 >& crawl.log
When the run completes, a crawled folder and a crawl.log log file are generated.
In the log file you will find errors thrown for PDF files. That is because PDF files are not indexed by default; to index PDF files correctly, find the plugin.includes property in nutch-default.xml and add pdf to the parse plugin list, i.e. parse-(text|html|js|pdf).
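In the property value this is a one-word change. The sketch below abridges the default value of plugin.includes (your copy will list further plugins after the query group; leave those untouched and only add pdf inside the parse-(...) group):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
</property>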
The crawled folder contains the segments, linkdb, indexes, index, and crawldb directories.
At this point, the index data is ready.
The next part is how to run it under Tomcat.
Copy nutch-0.9.war into Tomcat's webapps directory and rename it to nutch.war;
Go into Tomcat's conf\Catalina\localhost directory and create a file nutch.xml with content along the following lines;
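A minimal sketch of such a context descriptor, assuming the renamed war stays under webapps (the attribute values are placeholders, not preserved from the original):

<?xml version="1.0" encoding="UTF-8"?>
<Context path="/nutch" docBase="nutch.war" reloadable="true">
</Context>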
Start Tomcat;
Go into the unpacked webapps\nutch\WEB-INF\classes directory and set the searcher.dir property in nutch-default.xml to D:\nutch\nutch-0.9\crawled;
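That is, the property entry would read (a sketch using the crawl output directory produced above):

<property>
  <name>searcher.dir</name>
  <value>D:\nutch\nutch-0.9\crawled</value>
</property>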
Open a browser and visit http://localhost:8080/ (with the context path above, the Nutch search page is at http://localhost:8080/nutch/);
You can now search: enter apache, and the relevant results are returned.