[Original] Study Notes on the Lucene Full-Text Search Engine (Page 1) - "Programming and Design" - 青韶论坛湘...

chanvy 2008-12-12

Source: http://www./bbs/viewthread.php?tid=810119

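I downloaded the latest Lucene release, 2.0.0, from Apache and started learning Lucene:

First set up the runtime environment: the JDK, Tomcat, and the downloaded Lucene package. (The Lucene documentation says to download Ant and JavaCC as well. Ant is used to build Lucene, but the downloaded package is already built, and JavaCC is optional.)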

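Next, test the demo that ships with Lucene:
Following the documentation, add the Lucene jars lucene-core-2.0.0.jar and lucene-demo-2.0.0.jar to the classpath. For convenience I let Eclipse do this: create a new project and import the two jars. Then run IndexHTML from the demo directly. As I understand it, this class builds an index of HTML files. The package also contains IndexFiles, which presumably builds an index of ordinary files.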
Usage: IndexHTML [-create] [-index <index>] <root_directory>
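-index <index> is the folder where the index is stored; root_directory is the directory of files to be indexed.
Here I used Lucene's API HTML files as root_directory and created a separate index directory to hold the index.
Run IndexHTML. If it succeeds, three files appear in the index directory: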
segments
deletable
_?.cfs

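Once the index has been built, you can start querying it.
To use the JSP application bundled with Lucene, put Luceneweb.war into the tomcat\webapps directory, restart Tomcat, and set the indexLocation parameter in configuration.jsp to the index directory created above.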

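The bundled JSP application has an error; Apache apparently forgot to update the demo when Lucene itself was updated. results.jsp contains the line Query query = QueryParser.parse(...), which fails because that static parse method is outdated. The fix is to create a QueryParser instance and call its parse method: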
QueryParser qp = new QueryParser("contents", analyzer);
query = qp.parse(queryString);

After that the web application can be run from a browser.

You can also check the index with a standalone program:
import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.*;

public class Search {
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("Usage: Search <index_directory> <query>");
            return;
        }
        String indexPath = args[0], queryString = args[1];
        // Searcher pointing at the index directory
        Searcher searcher = new IndexSearcher(indexPath);
        try {
            // Query parser: use the same analyzer that was used to build the index
            QueryParser qp = new QueryParser("contents", new SimpleAnalyzer());
            Query query = qp.parse(queryString);
            // Search results are stored in a Hits object
            Hits hits = searcher.search(query);
            // Hits gives access to each document's fields and to its match score
            System.out.println(hits.length());
            for (int i = 0; i < hits.length(); i++) {
                System.out.println(hits.doc(i).get("path") + "; Score: " + hits.score(i));
            }
        } finally {
            searcher.close();
        }
    }
}

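Some thoughts on search engines, Lucene, and the Lucene API:
Searching is performed on an index that has already been built. Database indexes are poorly suited to full-text indexing (they are expensive and give poor results), which is why full-text search engines such as Lucene exist. If you need full-text search over database content, for example post bodies stored in a database, you can first have the database produce an index file, then let the search engine build its own index on top of that file and search it.
Lucene:
To use Lucene as a search engine, you first have to build an index. The details are still to be worked out by reading the IndexHTML implementation. To search an existing index, first create a Searcher that points at the index directory, then create a QueryParser, passing the field to search and the Analyzer as arguments. Calling the QueryParser's parse method yields a Query instance. Pass this instance to the Searcher's search() method, which returns a result set, Hits. What remains is to iterate over that result set.
Note that the Hits object has a score() method, which returns how well a given result matches the query. The Hits returned by search(query) are already ordered by descending score, so the best matches come first.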
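
To make the index-then-search flow above more concrete, here is a minimal sketch of a toy in-memory inverted index. It is only an illustration of the idea, not how Lucene stores or scores anything: the class name ToyIndex, its methods, and the "occurrence count as score" rule are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy in-memory inverted index (hypothetical, for illustration only). */
final class ToyIndex {
    // term -> (document id -> number of times the term occurs in that document)
    private final Map<String, Map<Integer, Integer>> postings = new HashMap<>();
    // document id -> path of the document
    private final List<String> paths = new ArrayList<>();

    /** Indexing step: split the text into lowercase words and record each one. */
    int addDocument(String path, String contents) {
        int docId = paths.size();
        paths.add(path);
        for (String term : contents.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new HashMap<>()).merge(docId, 1, Integer::sum);
        }
        return docId;
    }

    /** Search step: look up one term and list matching paths, highest count first. */
    List<String> search(String term) {
        Map<Integer, Integer> hits = postings.getOrDefault(term.toLowerCase(), new HashMap<>());
        List<Integer> docIds = new ArrayList<>(hits.keySet());
        docIds.sort((a, b) -> {
            int byScore = Integer.compare(hits.get(b), hits.get(a));
            return byScore != 0 ? byScore : Integer.compare(a, b);
        });
        List<String> result = new ArrayList<>();
        for (int id : docIds) {
            result.add(paths.get(id) + "; Score: " + hits.get(id));
        }
        return result;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addDocument("a.html", "Lucene search engine");
        index.addDocument("b.html", "search search and more search");
        for (String line : index.search("search")) {
            System.out.println(line);
        }
    }
}
```

The lookup never scans the document text at query time; it only reads the postings built during indexing, which is the point of building the index first.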

大饼先生 2006-9-8 14:50

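Using Chinese search in Lucene

It turns out to be simple: Lucene 2.0.0 provides Chinese search directly.
Add Lucene's extension package analyzers to the project. It contains ChineseAnalyzer and CJKAnalyzer, which segment Chinese text directly.
When building the index, IndexWriter writer = new IndexWriter(INDEX_DIR, new ChineseAnalyzer(), true) creates a Lucene index that supports Chinese search.
To search this Chinese index, just pass a ChineseAnalyzer as the analyzer argument of the QueryParser constructor and convert the query string to "GBK" accordingly.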
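
As far as I understand the two analyzers, ChineseAnalyzer emits one token per Chinese character, while CJKAnalyzer emits overlapping two-character tokens. The sketch below only imitates those two strategies on a plain run of characters so the difference is visible. ToyCjkTokens and its methods are hypothetical names, and the real tokenizers handle mixed text, punctuation, and token offsets that this ignores. The sample string is written with Unicode escapes and spells 搜索引擎.

```java
import java.util.ArrayList;
import java.util.List;

/** Imitates single-character and two-character segmentation (illustration only). */
final class ToyCjkTokens {
    /** One token per character. */
    static List<String> unigrams(String text) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            tokens.add(text.substring(i, i + 1));
        }
        return tokens;
    }

    /** Overlapping two-character tokens; a single character is kept as is. */
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<>();
        if (text.length() == 1) {
            tokens.add(text);
        }
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String word = "\u641c\u7d22\u5f15\u64ce"; // four Chinese characters
        System.out.println(unigrams(word)); // four one-character tokens
        System.out.println(bigrams(word));  // three overlapping two-character tokens
    }
}
```

Whichever strategy is chosen, the query has to be split the same way as the indexed text, which is why the same analyzer is passed to both IndexWriter and QueryParser.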

大饼先生 2006-9-9 12:42

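When Chinese text is submitted from an HTML page, the parameter comes back decoded as ISO8859-1!
So the encoding has to be converted before querying: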
queryString = new String(request.getParameter("query").getBytes("iso8859_1"),"GB2312");
Also pass a ChineseAnalyzer as the analyzer argument of the QueryParser constructor.
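
The conversion above works because ISO-8859-1 maps every byte value to exactly one character, so the original bytes can be recovered and decoded again with the page's real charset. Here is a small self-contained sketch of that round trip. ParamEncoding and its method names are made up for this example, and decodeAsLatin1 only simulates what the servlet container does. The demo in main assumes the JDK provides the GB2312 charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Demonstrates recovering text that was wrongly decoded as ISO-8859-1. */
final class ParamEncoding {
    /** Simulates the container: bytes sent in pageCharset are decoded as ISO-8859-1. */
    static String decodeAsLatin1(String original, Charset pageCharset) {
        return new String(original.getBytes(pageCharset), StandardCharsets.ISO_8859_1);
    }

    /** Reverses the mistake: get the raw bytes back, then decode with the real charset. */
    static String recover(String garbled, Charset pageCharset) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), pageCharset);
    }

    public static void main(String[] args) {
        // Charset of the submitting page; GB2312 as in the post, or pass another name
        Charset pageCharset = Charset.forName(args.length > 0 ? args[0] : "GB2312");
        String query = "\u641c\u7d22"; // a two-character Chinese query
        String garbled = decodeAsLatin1(query, pageCharset);
        System.out.println("garbled equals original: " + garbled.equals(query));
        System.out.println("recovered equals original: " + recover(garbled, pageCharset).equals(query));
    }
}
```

The second argument must be the charset the page was actually submitted in; using a different one silently produces the wrong characters.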

大饼先生 2006-9-10 15:51

How to highlight keywords

Lucene includes a highlight package for highlighting keywords and similar features. Usage:
Highlighter highlighter = new Highlighter(new SimpleHTMLFormatter("<font color=red><B>","</B></font>"), new QueryScorer(query));
................

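String text = doc.get("contents"); // returns null unless the contents field was stored
// The first argument of tokenStream is the field name, here "contents"
TokenStream tokenStream = analyzer.tokenStream("contents", new StringReader(text));
String result = highlighter.getBestFragments(tokenStream, text, 3, "...."); // the fragments that best match the query

Printing result gives:
up to three best-matching fragments from one result, separated by "....", with the terms of the submitted query condition queryCondition highlighted in each fragment!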
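
To show what the highlighted output amounts to, here is a rough stand-in that wraps every occurrence of one term in a pair of tags using plain string matching. It is not Lucene's Highlighter: it does no analysis, scoring, or fragment selection, and ToyHighlighter is a hypothetical name used only for this sketch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Wraps each occurrence of a term in tags (illustration only, no HTML escaping). */
final class ToyHighlighter {
    static String highlight(String text, String term, String preTag, String postTag) {
        if (term.isEmpty()) {
            return text;
        }
        // quote() makes the term a literal; matching ignores ASCII letter case
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        // $0 re-inserts the matched text, so its original casing is preserved
        String replacement = Matcher.quoteReplacement(preTag) + "$0" + Matcher.quoteReplacement(postTag);
        return pattern.matcher(text).replaceAll(replacement);
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene is a search library",
                "search", "<font color=red><B>", "</B></font>"));
    }
}
```

The real Highlighter works on the analyzer's token stream rather than on raw substrings, so it highlights the same terms the query actually matched.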

大饼先生 2007-3-21 14:28

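A simple application, tested and running correctly under JDK 1.5 and Lucene 2.0.
There are three files in total:
Constants.java holds the constants
LuceneIndex.java builds the index
LuceneSearch.java performs the search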

package testlucene;

public class Constants {
    // Directory containing the files to be indexed
    public final static String INDEX_FILE_PATH = "c:\\test";

    // Directory where the index is stored
    public final static String INDEX_STORE_PATH = "c:\\index";
}

package testlucene;
import java.io.*;
import java.util.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;

public class LuceneIndex {
    private IndexWriter writer = null;

    // Opens the index directory; "true" creates a new index and overwrites an existing one.
    // The exception is propagated so that a failed open does not leave writer as null.
    public LuceneIndex() throws IOException {
        writer = new IndexWriter(Constants.INDEX_STORE_PATH,
                new StandardAnalyzer(), true);
    }

    // Builds a Document with two fields: the file contents and the file path
    private Document getDocument(File f) throws Exception {
        Document doc = new Document();
        FileInputStream is = new FileInputStream(f);
        // The file is read with the platform default encoding
        Reader reader = new BufferedReader(new InputStreamReader(is));
        // A Reader-based field is tokenized and indexed but not stored
        doc.add(new Field("contents", reader));
        doc.add(new Field("path", f.getAbsolutePath(), Field.Store.YES, Field.Index.TOKENIZED));
        return doc;
    }

    // Adds every regular file directly under INDEX_FILE_PATH to the index
    public void writeToIndex() throws Exception {
        File folder = new File(Constants.INDEX_FILE_PATH);
        if (folder.isDirectory()) {
            String[] files = folder.list();
            for (int i = 0; i < files.length; i++) {
                File file = new File(folder, files[i]);
                // Skip subdirectories; opening one as a stream would fail
                if (!file.isFile()) {
                    continue;
                }
                Document doc = getDocument(file);
                System.out.println("Indexing: " + file + " ");
                writer.addDocument(doc);
            }
        }
    }

    public void close() throws Exception {
        writer.close();
    }

    public static void main(String[] args) throws Exception {
        LuceneIndex indexer = new LuceneIndex();
        try {
            Date start = new Date();
            indexer.writeToIndex();
            Date end = new Date();
            System.out.println("Indexing took " + (end.getTime() - start.getTime()) + " ms");
        } finally {
            // Always close the writer so the index is flushed to disk
            indexer.close();
        }
    }
}

package testlucene;
import java.io.IOException;
import java.util.*;
import org.apache.lucene.analysis.standard.*;
import org.apache.lucene.index.*;
import org.apache.lucene.document.*;
import org.apache.lucene.queryParser.*;
import org.apache.lucene.search.*;

public class LuceneSearch {
    private IndexSearcher searcher = null;
    private Query query = null;

    // Opens the index; the exception is propagated so searcher is never left as null
    public LuceneSearch() throws IOException {
        searcher = new IndexSearcher(IndexReader.open(Constants.INDEX_STORE_PATH));
    }

    // Parses the keyword against the "contents" field and runs the search.
    // Returns null if parsing or searching fails.
    public final Hits search(String keyword) {
        System.out.println("Searching for keyword " + keyword);
        try {
            // Use the same analyzer that was used when the index was built
            query = new QueryParser("contents", new StandardAnalyzer()).parse(keyword);
            Date start = new Date();
            Hits hits = searcher.search(query);
            Date end = new Date();
            System.out.println("Search finished in " + (end.getTime() - start.getTime()) + " ms");
            return hits;
        } catch (Exception e) {
            e.printStackTrace();
            return null;
        }
    }

    public void printResult(Hits h) {
        if (h == null || h.length() == 0) {
            System.out.println("Sorry, no results were found.");
        }
        else {
            for (int i = 0; i < h.length(); i++) {
                try {
                    Document doc = h.doc(i);
                    System.out.print("Result " + (i + 1) + ", file: ");
                    System.out.println(doc.get("path"));
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        }
        System.out.println("---------------------------");
    }

    public void close() throws IOException {
        searcher.close();
    }

    public static void main(String[] args) throws Exception {
        LuceneSearch test = new LuceneSearch();
        try {
            Hits h = null;
            h = test.search("测试");
            test.printResult(h);

            h = test.search("搜索");
            test.printResult(h);

            h = test.search("引擎");
            test.printResult(h);
        } finally {
            test.close();
        }
    }
}