Nutch version 0.8 Installation Guide

漂在北方的狼 2006-11-05

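1. Requirements
1.1 Java 1.4 or later (either Sun's or IBM's JVM works). Linux is the recommended operating system. Remember to set the environment variable NUTCH_JAVA_HOME to the location of your JVM. For example, I installed jdk1.5 in the c:\jdk1.5 folder, so my setting is NUTCH_JAVA_HOME=c:\jdk1.5 (this is how it is set on win32).
1.2 On the server side, Apache's Tomcat 4.x or later is recommended.
1.3 To install Nutch on win32, install cygwin to provide a Linux-style shell.
1.4 Installing Nutch takes disk space on the order of gigabytes, a high-speed connection, about an hour of time, and so on.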
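As a sketch of the setup in 1.1, the variable can be exported from a Cygwin or Linux shell before running Nutch. The path below is an assumption: it is the c:\jdk1.5 example as Cygwin exposes it (/cygdrive/c/jdk1.5), so adjust it to your own install.

```shell
# Assumed JDK location: c:\jdk1.5 as seen from Cygwin. Change it to match your machine.
NUTCH_JAVA_HOME=/cygdrive/c/jdk1.5
export NUTCH_JAVA_HOME

# Warn early if the path does not contain a java executable.
if [ -x "$NUTCH_JAVA_HOME/bin/java" ]; then
    echo "Using JVM at $NUTCH_JAVA_HOME"
else
    echo "Warning: no java found under $NUTCH_JAVA_HOME/bin" >&2
fi
```

Putting these lines in your shell startup file saves retyping them in every session.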
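2. Getting started
2.1 First, get a copy of the Nutch source. You can download a Nutch release from http://lucene./nutch/release/ and unpack the downloaded archive, or check out the latest source with Subversion and build Nutch with Ant.
2.2 Once that is done, you can check whether the installation works with the following command.
In the Nutch directory, type  bin/nutch
If it prints documentation for the Nutch command script, congratulations: you have taken an important first step.
2.3 Now we can get ready to crawl material for our search engine. There are two approaches to crawling:
2.3.1 Intranet crawling, with the crawl command
2.3.2 Whole-web crawling. Beyond the crawl command above, this needs lower-level commands that give more control, such as inject, generate, fetch, and updatedb.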
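3. Intranet crawling (did not pass my test)
Intranet crawling is suited to sites with up to around a million pages.
3.1 Intranet: configuration
To configure an intranet crawl, you must do the following:
3.1.1 In the Nutch directory, create a directory named urls containing a plain text file of root URLs. For example, to crawl the Nutch site, create a text file named nutch that contains only the Nutch home page. All other Nutch pages will be discovered from that page. The file urls/nutch then contains: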
http://lucene./nutch/
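3.1.2 Next, edit the file conf/crawl-urlfilter.txt in the Nutch directory and replace MY.DOMAIN.NAME with the domain you want to crawl. For example, to limit the crawl to a single domain, replace MY.DOMAIN.NAME in that file with that domain name. After the replacement the line looks like this: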
+^http://([a-z0-9]*\.)*/
This line accepts any URL in that domain.
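The two configuration steps can be sketched in shell as follows. This is only an illustration: example.com and the seed URL are hypothetical stand-ins for your own domain, and the grep check merely approximates how the filter line would treat each seed URL.

```shell
# Hypothetical domain and seed URL, used only for illustration.
DOMAIN="example.com"
SEED_URL="http://www.example.com/"

# Step 3.1.1: a urls directory holding one plain text file of root URLs.
mkdir -p urls
echo "$SEED_URL" > urls/nutch

# Step 3.1.2: the filter pattern with MY.DOMAIN.NAME replaced (dots escaped).
# The leading "+" in crawl-urlfilter.txt marks an include rule; it is not part of the regex.
PATTERN="^http://([a-z0-9]*\.)*$(echo "$DOMAIN" | sed 's/\./\\./g')/"

# Report which seed URLs the pattern would accept.
while read -r url; do
    if echo "$url" | grep -Eq "$PATTERN"; then
        echo "accepted: $url"
    else
        echo "rejected: $url"
    fi
done < urls/nutch
```

A seed URL reported as rejected here would most likely be filtered out by the crawl as well, so it is worth fixing the pattern before starting.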
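3.2 Intranet: running the crawl
Once configured, running the crawl is simple: just use the crawl command. It takes the following options:
-dir  dir names the directory where the crawled data is stored
-threads threads sets the number of threads that fetch in parallel
-depth depth sets the link depth to crawl from the root page
-topN topN sets the maximum number of pages fetched at each level of depth
For example, a typical command looks like this: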
bin/nutch crawl urls -dir crawl -depth 3 -topN 50
Once the command finishes, skip ahead to the searching section (see 5).
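To show how the four options fit together, here is a small wrapper sketch. The variable names and the value chosen for -threads are illustrative assumptions; only the option names come from the list above.

```shell
# NUTCH and the values below are illustrative; adjust them for your own crawl.
NUTCH=${NUTCH:-bin/nutch}
CRAWL_DIR=crawl   # -dir: where the crawl data is stored
THREADS=10        # -threads: number of parallel fetch threads (assumed value)
DEPTH=3           # -depth: link depth from the root pages
TOPN=50           # -topN: maximum pages fetched per level

run_crawl() {
    "$NUTCH" crawl urls -dir "$CRAWL_DIR" -threads "$THREADS" -depth "$DEPTH" -topN "$TOPN"
}
```

Call run_crawl from the Nutch directory. Setting NUTCH=echo beforehand prints the arguments instead of starting a crawl, which is a cheap way to check them.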
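4. Whole-web crawling
Whole-web crawling is designed to handle very large crawls, which may take several weeks to complete and need multiple machines to run.
4.1 Download http://rdf./rdf/content.rdf.u8.gz and decompress it: gunzip content.rdf.u8.gz
4.2 Create a directory: mkdir dmoz
4.3 Select one out of every 5000 URL records and write it to the urls file: bin/nutch org.apache.nutch.tools.DmozParser content.rdf.u8 -subset 5000 > dmoz/urls
4.4 Initialize the crawldb: bin/nutch inject crawl/crawldb dmoz
4.5 Generate a fetchlist from the crawldb: bin/nutch generate crawl/crawldb crawl/segments
4.6 The fetchlist is placed in a newly created segment directory, which is named after the time it was created. Save the segment name in the variable s1: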
s1=`ls -d crawl/segments/2* | tail -1`
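echo $s1
The output looks like crawl/segments/2006*******, where the asterisks stand for the month, day, and time digits, for example 20060703150028.
4.7 Run the fetcher on this segment: bin/nutch fetch $s1
4.8 When it finishes, update the database with the results: bin/nutch updatedb crawl/crawldb $s1
4.9 The database now has entries for the pages referenced by the initial set. Next, fetch a new segment of 1000 pages: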
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
bin/nutch fetch $s2
bin/nutch updatedb crawl/crawldb $s2
4.10 Let's fetch one more round:
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
bin/nutch fetch $s3
bin/nutch updatedb crawl/crawldb $s3
4.11 Before indexing, invert the links to build the linkdb:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
4.12 Run the index command: bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb crawl/segments/*
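Steps 4.9 and 4.10 repeat the same generate, fetch, updatedb cycle, so they can be written as a loop. The following is a sketch rather than part of Nutch: NUTCH, ROUNDS, TOPN, and both function names are illustrative, and it assumes the crawldb has already been injected (step 4.4).

```shell
# Illustrative settings; none of these variables are read by Nutch itself.
NUTCH=${NUTCH:-bin/nutch}
ROUNDS=${ROUNDS:-2}
TOPN=${TOPN:-1000}

# Segment directories are named by creation timestamp, so the last name
# in sorted order is the newest segment.
latest_segment() {
    ls -d crawl/segments/2* | tail -1
}

# Run ROUNDS cycles of generate, fetch, updatedb, stopping on the first failure.
crawl_rounds() {
    round=1
    while [ "$round" -le "$ROUNDS" ]; do
        "$NUTCH" generate crawl/crawldb crawl/segments -topN "$TOPN" || return 1
        segment=$(latest_segment)
        echo "round $round: fetching $segment"
        "$NUTCH" fetch "$segment" || return 1
        "$NUTCH" updatedb crawl/crawldb "$segment" || return 1
        round=$((round + 1))
    done
}
```

Call crawl_rounds from the Nutch directory, then continue with the invertlinks and index steps as before.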
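5. Searching
5.1 Remove the existing ROOT web app (Tomcat automatically unpacks a .war file placed under webapps): rm -rf ~/local/tomcat/webapps/ROOT*
5.2 Copy the war file: cp nutch*.war ~/local/tomcat/webapps/ROOT.war
5.3 Edit the nutch-site.xml file under tomcat/webapps/ROOT/WEB-INF/classes as follows: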
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
<property>
   <name>searcher.dir</name>
   <value>/home/crawl/nutch-0.8-dev/crawl</value> <!-- directory that holds the index -->
</property>
</configuration>

P.S. The steps above leave out one step:
3.1.2
Edit the file conf/nutch-site.xml, insert at minimum the following properties into it, and fill in proper values for them:

<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.

NOTE: You should also check other related properties:

    http.robots.agents
    http.agent.description
    http.agent.url
    http.agent.email
    http.agent.version

and set their values appropriately.

</description>
</property>

<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parenthesis after the agent name.
</description>
</property>

<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
</description>
</property>

<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
Only then does it work. Otherwise you just get a NullPointerException
and nothing is crawled.
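Since an empty http.agent.name is easy to overlook, a quick check before crawling can help. The function below is a hypothetical helper, not a Nutch command. It assumes the one-tag-per-line layout shown above and does a plain text match, not real XML parsing.

```shell
# Succeeds only if the <value> on the line after
# <name>http.agent.name</name> is non-empty.
# Assumes <name> and <value> sit on consecutive lines, as in the snippet above.
agent_name_is_set() {
    conf=${1:-conf/nutch-site.xml}
    grep -A1 '<name>http.agent.name</name>' "$conf" |
        grep -q '<value>[^<][^<]*</value>'
}
```

For example, running agent_name_is_set || echo "set http.agent.name first" from the Nutch directory flags the missing value before a crawl is started.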