Nutch Project Configuration

 funson 2007-05-31

The page http://lucene./nutch/tutorial8.html gives the following introduction:

Requirements

  1. Java 1.4.x, either from Sun or IBM on Linux is preferred. Set NUTCH_JAVA_HOME to the root of your JVM installation.
  2. Apache's Tomcat 4.x.
  3. On Win32, cygwin, for shell support. (If you plan to use Subversion on Win32, be sure to select the subversion package when you install, in the "Devel" category.)
  4. Up to a gigabyte of free disk space, a high-speed connection, and an hour or so.

So the preparation work is as follows:

1. Download Nutch; use the latest 0.9 release and unpack it to D:\nutch\nutch-0.9;

2. Set the NUTCH_JAVA_HOME environment variable to the JDK installation path;

3. Install a Tomcat server (not covered here);

4. Since we are working on Windows, download and install cygwin to run shell commands.

Preparation complete.

Getting Started

First, you need to get a copy of the Nutch code. You can download a release from http://lucene./nutch/release/. Unpack the release and connect to its top-level directory. Or, check out the latest source code from subversion and build it with Ant.

Try the following command:

bin/nutch

This will display the documentation for the Nutch command script.

This part involves the following steps:

1. Run cygwin

After installing cygwin, launch it and execute:

cd /cygdrive/d/nutch

cd nutch-0.9

cygwin now shows the current directory as:

/cygdrive/d/nutch/nutch-0.9

In this directory, run the command bin/nutch. If everything is set up correctly, a "Usage: nutch COMMAND" message is printed.

Intranet: Configuration

To configure things for intranet crawling you must:

  1. Create a directory with a flat file of root urls. For example, to crawl the nutch site you might start with a file named urls/nutch containing the url of just the Nutch home page. All other Nutch pages should be reachable from this page. The urls/nutch file would thus contain:
    http://lucene./nutch/
                    
  2. Edit the file conf/crawl-urlfilter.txt and replace MY.DOMAIN.NAME with the name of the domain you wish to crawl. For example, if you wished to limit the crawl to the domain, the line should read:
    +^http://([a-z0-9]*\.)*/
                    
    This will include any url in the domain .
  3. Edit the file conf/nutch-site.xml, insert at minimum following properties into it and edit in proper values for the properties:

    <property>
      <name>http.agent.name</name>
      <value></value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty -
      please set this to a single word uniquely related to your organization.
      NOTE: You should also check other related properties:
      http.robots.agents
      http.agent.description
      http.agent.url
      http.agent.email
      http.agent.version
      and set their values appropriately.
      </description>
    </property>

    <property>
      <name>http.agent.description</name>
      <value></value>
      <description>Further description of our bot- this text is used in
      the User-Agent header.  It appears in parenthesis after the agent name.
      </description>
    </property>

    <property>
      <name>http.agent.url</name>
      <value></value>
      <description>A URL to advertise in the User-Agent header.  This will
      appear in parenthesis after the agent name. Custom dictates that this
      should be a URL of a page explaining the purpose and behavior of this
      crawler.
      </description>
    </property>

    <property>
      <name>http.agent.email</name>
      <value></value>
      <description>An email address to advertise in the HTTP 'From' request
      header and User-Agent header. A good practice is to mangle this
      address (e.g. 'info at example dot com') to avoid spamming.
      </description>
    </property>

For step 1, create a folder urls under d:\nutch\nutch-0.9, and inside it create a text file nutch.txt whose content is: http://lucene./nutch/
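From the cygwin shell, step 1 can be sketched as follows (directory layout as used in this post; the URL is copied verbatim from the post, including its truncated host):

```shell
# Run from the Nutch top-level directory
# (D:\nutch\nutch-0.9, i.e. /cygdrive/d/nutch/nutch-0.9 in cygwin).
mkdir -p urls
# Write the single root URL exactly as given in the post.
printf '%s\n' 'http://lucene./nutch/' > urls/nutch.txt
cat urls/nutch.txt
```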

For step 2, open conf/crawl-urlfilter.txt, find MY.DOMAIN.NAME, and change the line to:

+^http://([a-z0-9]*\.)*/

For step 3, this walkthrough modifies nutch-default.xml instead, changing the following properties:

http.agent.name

http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version

For example:

<property>
  <name>http.agent.name</name>
  <value>NutchCVS</value>
  <description>Our HTTP 'User-Agent' request header.</description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value>Nutch</value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.</description>
</property>

<property>
  <name>http.agent.url</name>
  <value>http://lucene./nutch/</value>
  <description>A URL to advertise in the User-Agent header.  This will
  appear in parenthesis after the agent name.</description>
</property>

<property>
  <name>http.agent.email</name>
  <value>nutch-agent@lucene.</value>
  <description>An email address to advertise in the HTTP 'From' request
  header and User-Agent header.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-0.9</value>
  <description>A version string to advertise in the User-Agent
  header.</description>
</property>
Save the file after making these changes.

Intranet: Running the Crawl

Once things are configured, running the crawl is easy. Just use the crawl command. Its options include:

  • -dir dir names the directory to put the crawl in.
  • -threads threads determines the number of threads that will fetch in parallel.
  • -depth depth indicates the link depth from the root page that should be crawled.
  • -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.

For example, a typical call might be:

bin/nutch crawl urls -dir crawl -depth 3 -topN 50
            

Typically one starts testing one's configuration by crawling at shallow depths, sharply limiting the number of pages fetched at each level (-topN), and watching the output to check that desired pages are fetched and undesirable pages are not. Once one is confident of the configuration, then an appropriate depth for a full crawl is around 10. The number of pages per level (-topN) for a full crawl can be from tens of thousands to millions, depending on your resources.

Once crawling has completed, one can skip to the Searching section below.

Here you only need to run the following command:

bin/nutch crawl urls -dir crawled -depth 3 -topN 50 >& crawl.log

When it finishes, a crawled folder and a crawl.log log file will have been generated.

In the log file you may find errors thrown for pdf files. That is because pdf files are not indexed by default. To index pdf files correctly, find the plugin.includes property in nutch-default.xml and add pdf to it, i.e. parse-(text|html|js|pdf).
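The edited property can look like the fragment below. The full default value of plugin.includes differs between Nutch releases, so treat everything except the parse-(text|html|js|pdf) part as illustrative; the actual change is only adding |pdf inside parse-(...).

```xml
<!-- Illustrative fragment for conf/nutch-default.xml; only |pdf is the real change. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to include.</description>
</property>
```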

The crawled folder contains the segments, linkdb, indexes, index, and crawldb folders.

At this point, the index data is ready.

Next is how to run the search front end in Tomcat.

Copy nutch-0.9.war into Tomcat's webapps directory and rename it to nutch.war;

Go into the conf\Catalina\localhost directory and create a file nutch.xml;
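The original post's content for nutch.xml did not survive here. A minimal sketch of what such a Tomcat context descriptor typically contains (assuming the war was copied to webapps as above; path and docBase values are examples, not taken from the post):

```xml
<!-- Hypothetical conf/Catalina/localhost/nutch.xml; attribute values are examples -->
<Context path="/nutch" docBase="nutch.war" reloadable="true"/>
```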

Start Tomcat;

Go into the unpacked webapps\nutch\WEB-INF\classes directory and set the searcher.dir property in nutch-default.xml to D:\nutch\nutch-0.9\crawled;
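A sketch of the edited property (the path is the one used in this post; the description text is paraphrased, not copied from Nutch):

```xml
<property>
  <name>searcher.dir</name>
  <value>D:\nutch\nutch-0.9\crawled</value>
  <description>Path to the crawl directory that the search webapp reads.</description>
</property>
```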

Open a browser and go to http://localhost:8080/

You can now search; enter apache and the relevant results are returned.
