 HiLinz 2010-12-22

A Complete Analysis of Developing a Search Engine with PHP

When it comes to web search engines, many people think of Yahoo. Indeed, Yahoo ushered in an era of Internet search. However, the technology Yahoo currently uses to search the web was not originally developed by the company itself. In August 2000, Yahoo adopted the technology of Google, a company founded by Stanford University students. The reason is simple: Google's search engine finds the information users need faster and more accurately than the technology Yahoo used before.

Designing and building a powerful, efficient search engine and database ourselves in a short time is probably out of reach in terms of technology, funding, and so on. But since even Yahoo uses someone else's technology, why can't we make use of someone else's existing search engine site?

Analysis of the programming approach

We can do it like this: simulate a query by sending a search engine site the same command its search form would send; the site returns the search results; we then analyze the HTML of the results, strip the unneeded characters and code, and display what remains in the required format on a page of our own site.

So the key is to choose a search site that is accurate (so that our search is meaningful), fast (because analyzing and displaying the results takes additional time), and simple in its output (so that the HTML source is easy to analyze and strip). Google, the new-generation search engine, has all of these qualities, so we use it here as an example of how to query Google from PHP in the background and display the results in a personalized way in the front end.

Let's first look at Google's query form. Go to the Google site, type "abcd" in the query box, and click the search button. The browser's address bar changes to "http://www.google.com/search?q=abcd&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=". This shows that Google submits its form with the GET method and passes the query parameters in the URL. We can use PHP's file() function to simulate this query.
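As a minimal sketch of how such a GET query URL is put together, the helper below assembles it with http_build_query(). The function name build_search_url is an assumption made for illustration, and the btnG parameter seen in the address bar is left out here on the assumption that it only carries the label of the search button.

```php
<?php
// Hypothetical helper (not from the article): builds the GET URL described above.
function build_search_url(string $keywords): string
{
    // http_build_query() URL-encodes each value and joins the pairs with "&".
    $query = http_build_query([
        'q'  => $keywords, // the search terms
        'hl' => 'zh-CN',   // interface language, as in the address bar above
        'lr' => '',        // left empty, as in the address bar above
    ], '', '&');

    return 'http://www.google.com/search?' . $query;
}

echo build_search_url('abcd'), "\n";
```

Building the query string from an array keeps the encoding of each parameter in one place, so spaces and other special characters in the keywords cannot break the URL.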

The file() function

Syntax: array file(string filename);

The return value is an array: the whole file is read into the array variable, one line per element. The file can be local or remote; a remote file must specify the protocol to use. For example, $result = file("http://www.google.com/search?q=a ... mp;hl=zh-CN&lr="); simulates querying Google for the word "abcd" and returns the search results, line by line, in the array variable $result. Because the file being read is remote, the protocol name "http://" cannot be omitted.
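The behaviour of file() can be tried without any network access. The sketch below writes a small temporary file and reads it back; the file content is made up for the demonstration. Reading a remote URL works the same way, provided PHP's allow_url_fopen setting is enabled.

```php
<?php
// Write a small local file so the example runs without network access.
$path = tempnam(sys_get_temp_dir(), 'demo');
file_put_contents($path, "first line\nsecond line\n");

// file() returns the content as an array, one element per line;
// each element still ends with its newline character.
$lines = file($path);
echo count($lines), "\n";

// Joining the elements gives back the whole content as one string.
$content = implode('', $lines);

unlink($path); // remove the temporary file
```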

If you want users to be able to search for any keywords, add a text input box and a submit button, and replace the search string "abcd" above with a variable:

<?php
echo '<form>'; // a form with no attributes: the default method is GET and it submits to this page itself
echo '<input type="text" name="keywords">'; // a text input box
echo '<input type="submit" value="Search">'; // a submit button
echo '</form>';

// After submission the input arrives in $_GET['keywords'] (old PHP versions with
// register_globals exposed it as $keywords), so the code below runs only after a submit.
if (isset($_GET['keywords'])) {
    $keywords = urlencode($_GET['keywords']); // URL-encode the user's input
    // Substitute the variable into the query and store the result lines in the array $result.
    $result = file("http://www.google.com/search?q=" . $keywords . "&btnG=Google%CB%D1%CB%F7&hl=zh-CN&lr=");
    $result_string = join(" ", $result); // join the elements of $result into one string, separated by spaces
    ... ... // further processing
}
?>

The program above runs a query based on the user's input and returns the result in a single string variable, $result_string. Note the use of the urlencode() function to URL-encode the user's input, so that Chinese characters, spaces, and other special characters can be entered normally. This simulates Google's own query command as faithfully as possible and helps ensure that the search results are correct.
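To see what urlencode() does to such input, here is a short illustration. The sample strings are made up, and the last result assumes the script file is saved as UTF-8, since urlencode() works on the raw bytes of the string.

```php
<?php
// Spaces become "+", which is how an HTML form encodes them in a GET request.
$with_spaces = urlencode('php search engine');

// Characters that have a meaning inside a URL are percent-encoded.
$with_symbols = urlencode('a&b=c');

// Non-ASCII characters are encoded byte by byte (UTF-8 bytes assumed here).
$with_chinese = urlencode('搜索');

echo $with_spaces, "\n", $with_symbols, "\n", $with_chinese, "\n";
```

Without this step, a keyword such as "a&b=c" would be read by the server as extra query parameters rather than as part of the search terms.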

Analyzing Google's results

For ease of understanding, assume that what we really need is the title, URL, and summary of each search result. This is a simple and typical requirement. So what we have to do is: remove the header and footer of Google's results page, including the Google logo, the repeat-search input box, and the description of the search results; then, in the remaining result entries, strip the original HTML formatting tags and replace them with the format we want.

To do this, we must carefully analyze the HTML source of Google's results page and find its regularities. It is not hard to see that the body of the results always lies between the first […] tag and the penultimate […] tag in the source, and that the penultimate […] tag is immediately followed by the characters of a table tag […].

All of the code that follows goes at the "further processing" point in the program above.

$result_string = strstr($result_string, ""); // keep $result_string from the first […] tag onward, removing Google's header
$position = strpos($result_string, " … // position of the table tag
$result_string = substr($result_string, 0, $position); // keep only the string before that table tag, removing the footer
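The header and footer removal can be sketched as a small self-contained function. The helper name trim_results, the sample page, and the markers '<p>' and '<table' are all assumptions chosen for the demonstration; the real markers have to be read from the actual results page.

```php
<?php
// Hypothetical helper: keeps only the part of $html between two markers.
function trim_results(string $html, string $startMarker, string $endMarker): string
{
    // strstr() returns the string from the first occurrence of the marker onward,
    // or false if the marker is missing.
    $body = strstr($html, $startMarker);
    if ($body === false) {
        return '';
    }

    // strpos() gives the offset of the end marker; everything from there on is dropped.
    $position = strpos($body, $endMarker);
    if ($position === false) {
        return $body;
    }

    return substr($body, 0, $position);
}

// Made-up page: a logo as header, two entries, and a table as footer.
$page = '<html><img src="logo.gif"><p>first entry<p>second entry'
      . '<table><tr><td>footer</td></tr></table></html>';
echo trim_results($page, '<p>', '<table'), "\n";
```

Checking both return values against false matters because strstr() and strpos() return false when a marker is missing, which would otherwise silently produce an empty or truncated result.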

Application and implementation

Now that we have the useful core of the HTML source, the remaining question is how to display its content on our own terms. Analyzing the result entries, we find that they are separated in a regular way: each entry forms its own paragraph. Based on this, we use the explode() function to split out the entries:

Syntax: array explode(string separator, string string);

It returns an array containing the pieces of string obtained by cutting it at every occurrence of separator.

Then:

$result_array = explode("", $result_string); // cut the results at every occurrence of the string ""
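The splitting step can be sketched as follows. The helper name split_entries and the separator '<p>' are assumptions for illustration; in practice the separator is whatever string is found to divide the entries in the page source.

```php
<?php
// Hypothetical helper: splits the result body into entries and drops empty pieces.
function split_entries(string $body, string $separator): array
{
    // explode() cuts the string at every occurrence of the separator.
    $pieces = explode($separator, $body);

    // A body that starts with the separator yields an empty first piece,
    // so trim every piece and keep only the non-empty ones.
    $pieces = array_map('trim', $pieces);
    $pieces = array_filter($pieces, function ($piece) {
        return $piece !== '';
    });

    return array_values($pieces); // re-index from 0
}

print_r(split_entries('<p>first entry<p>second entry', '<p>'));
```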

We get an array $result_array in which each element is one search result entry. All that remains is to study the display format and HTML code of each entry and replace them as required. The following loop processes each entry in $result_array.

for ($i = 0; $i < count($result_array); $i++) {
    ... ... // handle each entry
}

In each entry we can easily find some regularities: an entry consists of a title, a summary, a description, a category, and a URL, and each of these parts includes the […] tag, so we split once more on it (the following code goes inside the loop above):

$every_item = explode("", $result_array[$i]);

This gives us an array $every_item in which $every_item[0] is the title, and $every_item[1] and $every_item[2] are the two lines of the summary. $every_item[3], $every_item[4], and so on are the description or the category if they begin with "Description:" or "Category:" (not every result entry has them); an element that begins with "" is certainly the URL. For these checks we usually use regular expressions (omitted here), which are also very convenient for replacement. For example, $every_item[0], which contains the title, carries its own link, and we want to modify the link's attributes so that it opens in a new window:

echo eregi_replace('( …
    ... ... // handle every item of the entry after the first (the first is the title, which has already been displayed)
    ... ... // further format changes
}
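The link rewrite can be sketched with preg_replace(), used here in place of eregi_replace() because the latter was removed in PHP 7. The helper name open_links_in_new_window and the regular expression are assumptions for illustration, not the article's original pattern, and the sketch assumes the link tag has at least one attribute and no target attribute yet.

```php
<?php
// Hypothetical helper: adds target="_blank" to every opening <a ...> tag.
function open_links_in_new_window(string $html): string
{
    // "<a", whitespace, then the existing attributes up to the closing ">".
    // The "i" modifier makes the match case-insensitive, like eregi_replace().
    return preg_replace('/<a\s+([^>]*)>/i', '<a $1 target="_blank">', $html);
}

echo open_links_in_new_window('<a href="http://example.com/">Example title</a>'), "\n";
```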

This changes the link attribute. All of the remaining display-format changes, stripping, and replacements can likewise be done with the regular-expression replacement function eregi_replace(). (eregi_replace() was removed in PHP 7; preg_replace() with the i modifier does the same job.)
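Putting the per-entry steps together, the sketch below splits one entry into its parts, strips the remaining tags, and sorts the parts by their leading label. The helper name classify_entry, the separator '<br>', and the sample entry are assumptions made for the demonstration; only the labels "Description:" and "Category:" come from the text above.

```php
<?php
// Hypothetical helper: turns the HTML of one entry into named fields.
function classify_entry(string $entryHtml, string $separator = '<br>'): array
{
    $fields = ['title' => '', 'summary' => [], 'description' => '', 'category' => ''];

    foreach (explode($separator, $entryHtml) as $index => $part) {
        $text = trim(strip_tags($part)); // drop the original formatting tags
        if ($text === '') {
            continue;
        }
        if ($index === 0) {
            $fields['title'] = $text; // the first part is the title
        } elseif (stripos($text, 'Description:') === 0) {
            $fields['description'] = trim(substr($text, strlen('Description:')));
        } elseif (stripos($text, 'Category:') === 0) {
            $fields['category'] = trim(substr($text, strlen('Category:')));
        } else {
            $fields['summary'][] = $text; // everything else is treated as summary text
        }
    }

    return $fields;
}

$entry = '<a href="http://example.com/">Example</a><br>A short summary'
       . '<br>Description: demo site<br>Category: Test';
print_r(classify_entry($entry));
```

Returning named fields instead of echoing HTML directly keeps the parsing separate from the presentation, so the display format can be changed later without touching the extraction code.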

So far we can extract every search entry and modify its format however we like, even dress it up in an attractive layout. But a good program should adapt to all kinds of operating conditions, and this one is no exception. What we have discussed is only a framework for stripping the HTML of the search results. To make it truly complete, many more things need to be considered, such as showing the total number of results and how many pages they span, and perhaps even removing Google-related code such as "Category" and "Description" so that visitors cannot see the original site. All of this, however, can be achieved by analyzing and dissecting the HTML. Now you can go ahead and build a highly personalized search engine of your own.
