I have recently become interested in search engines.
My search engine is built on Nutch combined with ICTCLAS, and both the word segmentation and the speed are good. Below is a script for calling Nutch on Windows NT, so you do not need Cygwin to emulate Linux. You can adapt it yourself, which makes it easy to set up automated runs. Anyone interested is welcome to try it; it makes operation much more convenient.

nutch.bat

    @cmd /V:on /c %~dp0nutch1.bat %*

nutch1.bat

    @echo on
    rem *********************************************************************
    rem * A script to launch nutch on Windows 2000/XP System.
    rem *
    rem * Written by babatu
    rem * babatu@gmail.com blog: blog.babatu.com
    rem *
    rem * Because delayed environment is used, cmd /V:on should be used to
    rem * run this script.
    rem ******************************
    if "%OS%"=="Windows_NT" @setlocal
    if "%OS%"=="WINNT" @setlocal
    if "%1" == "" goto :msg
    goto :begin

    :msg
    echo "Usage: nutch COMMAND"
    echo "where COMMAND is one of:"
    echo "  crawl        one-step crawler for intranets"
    echo "  readdb       read / dump crawl db"
    echo "  readlinkdb   read / dump link db"
    echo "  inject       inject new urls into the database"
    echo "  generate     generate new segments to fetch"
    echo "  fetch        fetch a segment's pages"
    echo "  parse        parse a segment's pages"
    echo "  segread      read / dump segment data"
    echo "  updatedb     update crawl db from segments after fetching"
    echo "  invertlinks  create a linkdb from parsed segments"
    echo "  index        run the indexer on parsed segments and linkdb"
    echo "  merge        merge several segment indexes"
    echo "  dedup        remove duplicates from a set of segment indexes"
    echo "  plugin       load a plugin and run one of its classes main()"
    echo "  server       run a search server"
    echo " or"
    echo "  CLASSNAME    run the class named CLASSNAME"
    echo "Most commands print help when invoked w/o parameters."
    pause
    goto :end

    :begin
    rem %~dp0 is the expanded pathname of the current script under NT
    set DEFAULT_NUTCH_HOME=%~dp0..
    rem set DEFAULT_NUTCH_HOME=..
    if "%NUTCH_HOME%"=="" set NUTCH_HOME=%DEFAULT_NUTCH_HOME%
    set DEFAULT_NUTCH_HOME=""
    rem set the default DEFAULT_NUTCH_HOME
    echo %NUTCH_HOME%
    rem set _USE_CLASSPATH=yes
    if "%CLASSPATH%"=="" ( set CLASSPATH=%JAVA_HOME%\lib
    CLASSPATH=%CLASSPATH%;%JAVA
    set CLASSPATH=%CLASSPATH%;%NUTCH
    echo %CLASSPATH%
    echo before other

    rem for developers, add plugins, job & test code to CLASSPATH
    if exist %NUTCH_HOME%\build\plugins set CLASSPATH=%CLASSPATH%;%NUTCH
    for /R %NUTCH_HOME%\build %%i in (nutch*.job) do set CLASSPATH=!CLASSPATH!;%%i
    if exist %NUTCH_HOME%\build\test CLASSPATH=%CLASSPATH%;%NUTCH

    rem for releases, add Nutch job to CLASSPATH
    for /R %NUTCH_HOME% %%i in (nutch*.job) do set CLASSPATH=!CLASSPATH!;%%i

    rem add plugins to classpath
    if exist %NUTCH_HOME%\plugins set CLASSPATH=%CLASSPATH%;%NUTCH

    rem add libs to CLASSPATH
    for /R %NUTCH_HOME%\lib %%f in (*.jar) do set CLASSPATH=!CLASSPATH!;%%f
    echo %CLASSPATH%

    rem translate command
    if "%1"=="crawl" set CLASS=org.apache.nutch.crawl.Crawl
    if "%1"=="inject" set CLASS=org.apache.nutch.crawl.Injecto
    if "%1"=="generate" set CLASS=org.apache.nutch.crawl.Generat
    if "%1"=="fetch" set CLASS=org.apache.nutch.fetcher
    if "%1"=="parse" set CLASS=org.apache.nutch.parse.ParseSe
    if "%1"=="readdb" set CLASS=org.apache.nutch.crawl.CrawlDb
    if "%1"=="readlinkdb" set CLASS=org.apache.nutch.crawl.LinkDbR
    if "%1"=="segread" set CLASS=org.apache.nutch.segment
    if "%1"=="updatedb" set CLASS=org.apache.nutch.crawl.CrawlDb
    if "%1"=="invertlinks" set CLASS=org.apache.nutch.crawl.LinkDb
    if "%1"=="index" set CLASS=org.apache.nutch.indexer
    if "%1"=="dedup" set CLASS=org.apache.nutch.indexer
    if "%1"=="merge" set CLASS=org.apache.nutch.indexer
    if "%1"=="plugin" set CLASS=org.apache.nutch.plugin
    if "%1"=="server" set CLASS=org.apache.nutch.searcher
    if "%CLASS%"=="" set CLASS=%1

    %JAVA_HOME%\bin\java -cp %CLASSPATH% %CLASS% %*
    if "%OS%"=="Windows_NT" @endlocal
    if "%OS%"=="WINNT" @endlocal
    :end
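The script header says it must be started through cmd /V:on because it relies on delayed environment expansion. The following is a minimal sketch of why that matters. It is separate from the Nutch launcher and assumes a stock cmd.exe with command extensions enabled. The variable names LIST_DELAYED and LIST_EARLY and the jar names are invented for illustration. It uses setlocal EnableDelayedExpansion, which enables the same exclamation-mark syntax from inside the script instead of through the /V:on switch.

```bat
@echo off
rem Illustrative sketch only; not part of the Nutch launcher.
rem Enable exclamation-mark variable syntax for this script,
rem the same feature that cmd /V:on switches on.
setlocal EnableDelayedExpansion

set LIST_DELAYED=
set LIST_EARLY=

rem Delayed form: the variable is re-read on every iteration,
rem so the items accumulate.
for %%i in (a.jar b.jar c.jar) do set LIST_DELAYED=!LIST_DELAYED!;%%i

rem Percent form: the variable is expanded once, when the for line
rem is parsed and it is still empty, so each pass overwrites the last.
for %%i in (a.jar b.jar c.jar) do set LIST_EARLY=%LIST_EARLY%;%%i

echo delayed: %LIST_DELAYED%
echo early:   %LIST_EARLY%
endlocal
```

Run as a .bat file, the first echo should show all three entries and the second only the last one. This is why the for /R loops in the launcher build CLASSPATH with !CLASSPATH! rather than %CLASSPATH%.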
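Search itself is not the goal. Getting a basic search working is very easy: download any ASP or other search source code and you have one. There are four key points. 1. Never mind hundreds of millions of records: when your data reaches 10 million, how fast is your search? Because Nutch is an Apache open-source project, its performance is good.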