author:http://hi.baidu.com/jrckkyy
author:http://blog.csdn.net/jrckkyy
Sorry to have kept everyone waiting. I was busy with exams for a while, and they are finally over. Enough chatter, let's get started!
TSE puts all of the crawled web pages into one large file and then builds the index over that single file as a whole. The process consists of several steps.
- 1. The document index (Doc.idx) keeps information about each document.
-
- It is a fixed-width ISAM (indexed sequential access method) index, ordered by docID.
-
- The information stored in each entry includes a pointer into the repository,
-
- a document length, and a document checksum.
-
-
-
-
-
- 0 0 bc9ce846d7987c4534f53d423380ba70
-
- 1 76760 4f47a3cad91f7d35f4bb6b2a638420e5
-
- 2 141624 d019433008538f65329ae8e39b86026c
-
- 3 142350 5705b8f58110f9ad61b1321c52605795
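To make the Doc.idx layout concrete, here is a minimal Python sketch that parses the sample entries. It assumes the three columns are the docID, a byte offset into the raw page file, and the document checksum. That reading is inferred from the sample, not confirmed by the TSE source. The names `DocEntry`, `parse_doc_idx` and `doc_lengths` are made up for this illustration. Under the same assumption, a document's length can be estimated as the difference between consecutive offsets.

```python
from typing import List, NamedTuple


class DocEntry(NamedTuple):
    doc_id: int
    offset: int    # assumed: byte offset of the document in the raw file
    checksum: str  # assumed: 32-hex-digit checksum of the document


def parse_doc_idx(text: str) -> List[DocEntry]:
    """Parse whitespace-separated 'docID offset checksum' lines."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        entries.append(DocEntry(int(fields[0]), int(fields[1]), fields[2]))
    return entries


def doc_lengths(entries: List[DocEntry]) -> List[int]:
    """Length of every document except the last, from consecutive offsets."""
    return [b.offset - a.offset for a, b in zip(entries, entries[1:])]


SAMPLE = """\
0 0 bc9ce846d7987c4534f53d423380ba70
1 76760 4f47a3cad91f7d35f4bb6b2a638420e5
2 141624 d019433008538f65329ae8e39b86026c
3 142350 5705b8f58110f9ad61b1321c52605795
"""

entries = parse_doc_idx(SAMPLE)
print(doc_lengths(entries))  # [76760, 64864, 726]
```

Because the entries are ordered by docID, entry number n can be found directly at position n, which is the point of an index ordered by docID.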
-
-
-
-
-
- The URL index (Url.idx) is used to convert URLs into docIDs.
-
-
-
-
-
- 5c36868a9c5117eadbda747cbdb0725f 0
-
- 3272e136dd90263ee306a835c6c70d77 1
-
- 6b8601bb3bb9ab80f868d549b5c5a5f3 2
-
- 3f9eba99fa788954b5ff7f35a5db6e1f 3
-
-
-
-
-
- It is a list of URL checksums with their corresponding docIDs and is sorted by
-
- checksum. In order to find the docID of a particular URL, the URL's checksum
-
- is computed and a binary search is performed on the checksums file to find its
-
- docID.
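The lookup described above can be sketched in Python as follows. This is an illustration only: it assumes the checksum is the MD5 hex digest of the URL string (the 32-hex-digit values in the sample are consistent with MD5, but the README does not say so), and the URLs and the helper names `url_checksum`, `build_url_index` and `lookup_doc_id` are hypothetical.

```python
import bisect
import hashlib
from typing import List, Optional, Tuple


def url_checksum(url: str) -> str:
    # Assumption: the checksum is the MD5 hex digest of the URL.
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def build_url_index(urls: List[str]) -> List[Tuple[str, int]]:
    """Assign docIDs in input order, then sort the entries by checksum."""
    return sorted((url_checksum(u), doc_id) for doc_id, u in enumerate(urls))


def lookup_doc_id(index: List[Tuple[str, int]], url: str) -> Optional[int]:
    """Binary-search the sorted (checksum, docID) list for the URL."""
    target = url_checksum(url)
    # (target,) sorts before any (target, doc_id), so this finds the
    # first entry whose checksum is >= target.
    i = bisect.bisect_left(index, (target,))
    if i < len(index) and index[i][0] == target:
        return index[i][1]
    return None


# Hypothetical URLs, used only to exercise the lookup.
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]
index = build_url_index(urls)
print(lookup_doc_id(index, "http://example.com/c"))
print(lookup_doc_id(index, "http://example.com/missing"))
```

The binary search needs O(log n) comparisons, which is why the file has to be sorted by checksum before it is used.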
-
-
-
- ./DocIndex
-
- got Doc.idx, Url.idx, DocId2Url.idx
-
-
-
-
-
- 0 http:
-
- 1 http:
-
- 2 http:
-
- 3 http:
-
-
-
-
-
- 2. sort Url.idx|uniq > Url.idx.sort_uniq
-
-
-
-
-
-
-
- 000bfdfd8b2dedd926b58ba00d40986b 1111
-
- 000c7e34b653b5135a2361c6818e48dc 1831
-
- 0019d12f438eec910a06a606f570fde8 366
-
- 0033f7c005ec776f67f496cd8bc4ae0d 2103
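For readers who want to see what the `sort | uniq` step does to the data, here is a rough Python equivalent. It assumes byte-wise collation (as with `LC_ALL=C`), which matches Python's string ordering for ASCII lines like these. The function name `sort_uniq` is made up for this sketch.

```python
from typing import Iterable, List


def sort_uniq(lines: Iterable[str]) -> List[str]:
    """Drop duplicate lines and return the rest in ascending order."""
    return sorted(set(lines))


# The sample lines from above, shuffled and with one duplicate added.
unsorted_lines = [
    "0033f7c005ec776f67f496cd8bc4ae0d 2103",
    "000bfdfd8b2dedd926b58ba00d40986b 1111",
    "0019d12f438eec910a06a606f570fde8 366",
    "000bfdfd8b2dedd926b58ba00d40986b 1111",
    "000c7e34b653b5135a2361c6818e48dc 1831",
]

for line in sort_uniq(unsorted_lines):
    print(line)
```

Each line begins with the checksum, so sorting whole lines also sorts by checksum, which is the order a binary search over the file requires.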
-
-
-
- 3. Segment documents into terms (locating each document by its URL)
-
- ./DocSegment Tianwang.raw.2559638448
-
- got Tianwang.raw.2559638448.seg
-
-
-
-
-
- version: 1.0
-
- url: http:
-
- origin: http:
-
- date: Fri, 23 May 2008 20:01:36 GMT
-
- ip: 162.105.138.175
-
- length: 38413
-
-
-
- HTTP/1.1 200 OK
-
- Server: Microsoft-IIS/5.0
-
- Date: Fri, 23 May 2008 11:17:49 GMT
-
- Connection: keep-alive
-
- Connection: Keep-Alive
-
- Content-Length: 38088
-
- Content-Type: text/html; Charset=gb2312
-
- Expires: Fri, 23 May 2008 11:17:49 GMT
-
- Set-Cookie: ASPSESSIONIDSSTRDCAB=IMEOMBIAIPDFCKPAEDJFHOIH; path=/
-
- Cache-control: private
-
-
-
-
-
-
-
- <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
-
- "http://www./TR/html4/loose.dtd">
-
- <html>
-
- <head>
-
- <title>Apabi数字资源平台</title>
-
- <meta http-equiv="Content-Type" content="text/html; charset=gb2312">
-
- <META NAME="ROBOTS" CONTENT="INDEX,NOFOLLOW">
-
- <META NAME="DESCRIPTION" CONTENT="数字图书馆 方正数字图书馆 电子图书 电子书 ebook e书 Apabi 数字资源平台">
-
- <link rel="stylesheet" type="text/css" href="css/common.css">
-
-
-
- <style type="text/css">
-
- <!--
-
- .style4 {color: #666666}
-
- -->
-
- </style>
-
-
-
- <script LANGUAGE="vbscript">
-
- ...
-
- </script>
-
-
-
- <Script Language="javascript">
-
- ...
-
- </Script>
-
- </head>
-
- <body leftmargin="0" topmargin="0">
-
- </body>
-
- </html>
-
-
-
-
-
-
-
- 1
-
- ...
-
- ...
-
- ...
-
- 2
-
- ...
-
- ...
-
- ...
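The record header in the sample above (version, url, origin, date, ip, length) is a run of `key: value` lines ended by a blank line. A minimal Python sketch for reading such a header is shown below. It assumes exactly that layout. `parse_record_header` is a hypothetical helper, not part of TSE, and the URLs in the sample record are placeholders because the originals are truncated in the text.

```python
from typing import Dict


def parse_record_header(record: str) -> Dict[str, str]:
    """Read 'key: value' lines up to the first blank line."""
    header = {}
    for line in record.splitlines():
        if not line.strip():
            break  # a blank line ends the header block
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip()] = value.strip()
    return header


# Placeholder URLs: the real ones are truncated in the sample.
RECORD = """\
version: 1.0
url: http://example.com/
origin: http://example.com/
date: Fri, 23 May 2008 20:01:36 GMT
ip: 162.105.138.175
length: 38413

HTTP/1.1 200 OK
Server: Microsoft-IIS/5.0
"""

header = parse_record_header(RECORD)
print(header["ip"], header["length"])
```

Splitting on the first colon only keeps values such as the date and the URL intact, even though they contain colons themselves.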
-
-
-
-
-
-
-
- 4. Create forward index (docid-->termid)
-
- ./CrtForwardIdx Tianwang.raw.2559638448.seg > moon.fidx
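As a rough illustration of what a forward index (docID to term IDs) contains, here is a small Python sketch. It is not the CrtForwardIdx implementation: it assumes each segmented document is a string of whitespace-separated terms, and it hands out term IDs in first-seen order. The input data and the name `build_forward_index` are hypothetical.

```python
from typing import Dict, List, Tuple


def build_forward_index(
    segmented: Dict[int, str],
) -> Tuple[Dict[int, List[int]], Dict[str, int]]:
    """Map each docID to the list of term IDs that occur in it."""
    vocab: Dict[str, int] = {}          # term -> term ID
    forward: Dict[int, List[int]] = {}  # docID -> term IDs in order
    for doc_id in sorted(segmented):
        term_ids = []
        for term in segmented[doc_id].split():
            if term not in vocab:
                vocab[term] = len(vocab)  # next unused ID
            term_ids.append(vocab[term])
        forward[doc_id] = term_ids
    return forward, vocab


# Hypothetical segmented documents keyed by docID.
docs = {1: "search engine index", 2: "index search"}
forward, vocab = build_forward_index(docs)
print(forward)  # {1: [0, 1, 2], 2: [2, 0]}
```

A forward index lists the terms of each document. An inverted index, which maps each term back to the documents containing it, can be derived from the same data by regrouping the pairs by term.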
-
-
-
-