CN101055587A - Search engine retrieving result reordering method based on user behavior information - Google Patents


Info

Publication number
CN101055587A
Authority
CN
Grant status
Application
Patent type
Prior art keywords
user
query
search engine
information
page
Prior art date
Application number
CN 200710099594
Other languages
Chinese (zh)
Other versions
CN100507920C (en )
Inventor
岑荣伟
刘奕群
张敏
金奕江
马少平
Original Assignee
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date


Abstract

The invention belongs to the field of Internet information processing. First, using the logs of one or more search engines, a set of common queries that users care about is screened out according to the number of users who issued each query. Next, for each query in the common query set, the user click-through rate of each page that users clicked is calculated; if logs from multiple search engines are used, the click-through rates are merged. The clicked pages are then filtered by click-through rate, and each query and its corresponding result page addresses are stored in a database. Finally, when a user submits a query, the results derived from user behavior information are merged with the results retrieved by the search engine, reordered, and returned to the user. The method runs automatically on a computer and improves search engine retrieval performance efficiently, objectively, and in a timely manner.

Description

A method for reordering search engine retrieval results based on user behavior information

Technical Field

The present invention belongs to the field of Internet information processing and relates in particular to information retrieval systems. Specifically, it is a processing method that uses the behavior information of the user population to reorder search engine retrieval results and thereby improve ranking performance.

Background

A search engine is a computer system that collects information on the Internet according to a certain strategy, organizes and processes it, and then provides network information services to users. It comprises three parts: a computer network, computer hardware, and the software programs running on that hardware. Its main role is to help users quickly and efficiently obtain high-quality information from the Internet that meets their needs.

At present, a general-purpose Web search engine consists of three main parts: information collection, information processing, and user query service. The search engine collects information with a tool called a web spider, which follows the URLs of sites or pages and the link relationships between web pages. A link analyzer, a text analyzer, and an indexer organize the crawled page information. A query server then handles interaction with the user: it retrieves according to the query keywords the user submits and returns a list of relevant results to satisfy the user's query need.

From the user's point of view, the search engine provides a page containing a search box. The user types query keywords that reflect their need into the search box and submits them through the browser. The search engine returns a list of search results related to the input, and the user clicks the relevant result pages to find the information needed.

A key technique in the query server is ranking the relevant documents effectively, so that the pages the user wants appear near the top of the returned results and the relevant information is easier to reach. Since the mid-1960s a large number of text similarity models have been proposed. After the 1990s, with the large-scale emergence of Web pages, some of these similarity models were applied to Web search engines; their main idea is "TF*IDF". The models in wide use today are mainly the Boolean model, the statistical model, and the linguistic and knowledge-based model. These similarity models were essentially proposed for ordinary text retrieval, which differs somewhat from the Web text found in a real network environment. In addition, to raise their ranking in search results, Web sites use spam techniques, such as stuffing their pages with keywords, to deceive search engines, lift their ranking, and increase site visibility.

Because of the particular nature of Web search, when a user submits keywords, the target page is not necessarily content that matches the query keywords exactly, and users generally find it hard to describe the page they want with a few simple keywords. Many real retrieval tasks therefore turn into finding the relevant key resource page. A key resource page usually means the entry page to a series of related information pages, that is, a page through which the user can easily find the desired information. This is very different from traditional text information retrieval.

In the mid-to-late 1990s, in response to the existing network environment and the needs of Web users, researchers began to study the quality of Web pages, relying chiefly on the hyperlink structure of pages. Hyperlink structure is one of the biggest differences between the Web information environment and traditional information media. A hyperlink is a pointing relationship between two pages, or between two different parts of a page, and involves a source page and a target page. The main techniques based on link structure are algorithms such as PageRank and HITS. PageRank is a hyperlink analysis algorithm that Brin and colleagues at Google built on a model of how Internet users browse. It uses hyperlink relationships to assign quality ratings to pages and uses those ratings to improve the search engine's results, so that pages of high quality and good relevance are returned to the user at the top of the results, which greatly improves the satisfaction of real users. The basic architecture and implementation ideas of PageRank have accordingly achieved great success in commercial search engines.

In fact, although search engines use a variety of new models and techniques, they still cannot fully satisfy users' information needs. To improve search performance as far as possible, many real search engine sites use manual selection to improve retrieval for some query terms, choosing those that users issue often and that have a single target (usually navigational queries, such as "Sohu: www.sohu.com"). When a user searches, the manually selected target pages are merged into the automatically retrieved pages, usually by placing them in the first few positions of the returned results. But manual selection is too costly, and it is hard to apply to a larger set of queries.

In a real commercial search engine, users click the returned results according to their own understanding and degree of satisfaction, and this click behavior is easy to record; the record is usually called the search engine log. The query and click information in the log reflects not only the user's query interest but also the user's judgment and selection among the query results, so it carries a great deal of knowledge and information from the user population. Screening relevant queries and result pages from users' query click information is therefore feasible. Existing statistical studies show that in everyday user searches the most frequent 1% of query terms account for more than 70% of all queries. So by compiling statistics on user clicks and finding the commonly used queries, one can represent the query needs of most users; analyzing the click behavior for those queries and merging it into the search engine results then automatically uses the knowledge of the user population to improve retrieval performance.

Summary of the Invention

To address the shortcomings of existing retrieval ranking algorithms, and the heavy labor and poor timeliness of manual selection, the present invention proposes a method for reordering search engine retrieval results based on user behavior information. The method takes existing user query and click behavior information from one or more search engines, analyzes it statistically at the aggregate level, selects for each recently common query the few target pages that users care about most, and merges them into the original results returned by the search engine to improve the result page, which is returned to the user as the final query result. Using query click information from several search engines avoids the bias and gaps that a single search engine's index size and ranking strategy introduce into user clicks, although collecting log information from several search engines is somewhat difficult. Using recent information preserves timeliness. Result pages derived from user clicks carry a degree of user endorsement, so including them as part of the result page can improve the accuracy of the results and user satisfaction. The whole process is completed automatically by computer, so besides improving performance it greatly reduces the manual labor of assisted retrieval and returns result pages that meet user needs in a timely, accurate, effective, and objective way.

The method comprises the following:

1. Using information such as the number of users per query, automatically screen out queries that are timely, cover most users' query needs, and can be labeled fairly accurately.
2. From the user behavior information, compute the click-through rate of each page that users clicked for each query; if behavior logs from multiple search engines are used, merge them to obtain the merged click-through rate for each query.
3. From the click-through rates of each query's pages, select the result pages for that query.
4. When a user submits a query, merge the selected query results into the results returned by the search engine and display the final list to the user.

The invention is characterized in that it is carried out automatically on a computer and comprises the following steps in order:

Step 1: Screening the set of common user queries

Step 1.1: Data preprocessing

The query set, the result pages corresponding to each query, and the related information used in page screening all come from the user logs of one or more search engines. To be usable for the reordering method, a search engine user log must contain at least the following items:

Table 1. Contents that a user log must contain to be used for reordering search engine retrieval results based on user behavior information

Search engine service providers can generally obtain the above information easily from the search engine's web servers, which ensures the feasibility of the method. Because search engines differ in the storage format and presentation of their user logs, the detailed processing varies slightly, but the user logs essentially need to be preprocessed with the following steps:

Step 1.1.1: Convert the encoding of the user log, transforming the encoding recorded by the server into GBK, the national standard Chinese character encoding.

Step 1.1.2: Organize the user log by the content items listed in Table 1, remove information outside those items, and arrange each log record as a string of those items.

Step 1.1.3: Use string matching (for example the KMP algorithm) to filter noise out of the user queries, including prohibited query terms and query terms used to promote certain online products, keeping only the content items that directly reflect the query needs and behavior of ordinary search engine users.

After data preprocessing, the contents listed in Table 1 can be extracted and used in the following steps of the method.

Step 1.2: Extracting the number of users per query

The number of users of each query is counted by the following rule: for a query Q submitted by users in the log, count the number of users who have submitted Q.

For every submitted query, the number of users is a value greater than or equal to 1. The number of users of a query indicates how much attention people pay to that query. Because this number depends heavily on the period the log covers, the logs of the most recent month or half month are chosen as the data source to preserve timeliness.

Step 1.3: Screening the common query set

The common query set S is selected by the following rule: if the number of users of a query Q in the search engine log is less than 20, Q is excluded from S; otherwise Q is added to S.
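Steps 1.2 and 1.3 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the preprocessed log has already been reduced to `(user_id, query)` pairs, and the function names `count_users_per_query` and `select_common_queries` are hypothetical. Only the threshold of 20 users comes from the text.

```python
from collections import defaultdict

MIN_USERS = 20  # threshold stated in step 1.3


def count_users_per_query(records):
    """Step 1.2: count the distinct users who submitted each query.

    `records` is an iterable of (user_id, query) pairs; this record
    layout is an assumption made for the sketch.
    """
    users = defaultdict(set)
    for user_id, query in records:
        users[query].add(user_id)
    return {query: len(ids) for query, ids in users.items()}


def select_common_queries(user_counts, min_users=MIN_USERS):
    """Step 1.3: keep queries whose user count is not below the threshold."""
    return {query for query, n in user_counts.items() if n >= min_users}
```

Counting distinct users rather than raw submissions keeps one user who repeats a query many times from pushing it into S.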

Screening queries by number of users selects those with many users because these queries account for a large share of user queries. A small set of query terms thus covers the query needs of most users, and the selected queries reflect what users currently follow and find topical, which ensures timeliness, attention, and representativeness. In addition, selecting queries with many users makes the click-through rates computed in step 2 more reliable and stable, reducing the large fluctuations that the click behavior of individual users would otherwise cause.

Step 2: Extracting user click-through rate information

Step 2.1: Extracting the user click-through rate

Each query Q in the query set S has a series of clicked result pages. From the user query and click information provided by Table 1, the URLs of these clicked pages can be obtained, and for the query the "user click-through rate" of each page URL, that is, the probability that the page is clicked, is computed. The user click-through rate of a result page URL for query Q is:

$$\mathrm{CTR}(\mathrm{URL} \mid Q) = \frac{N_{\mathrm{click}}(Q, \mathrm{URL})}{N_{\mathrm{click}}(Q)} \tag{1}$$

Here $N_{\mathrm{click}}(Q, \mathrm{URL})$, the number of times users of query Q clicked the result URL, is obtained by counting the user actions that clicked that URL for Q, and $N_{\mathrm{click}}(Q)$, the total number of clicks by users of query Q, is obtained by counting all user click actions for Q.

By definition, the number of clicks on a result URL for query Q is necessarily less than or equal to the total number of clicks for Q, so the user click-through rate lies between 0 and 1. For a query Q, the click-through rates of all clicked result page URLs sum to 1.

The user click-through rate describes how strongly users endorse the result page URL for query Q. The larger the value, the more users recognize the relationship between the URL and Q, so it can serve as an effective measure of relevance between the query and the URL.
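The per-query click-through rate of step 2.1 is a ratio of two counts, which the following sketch computes. It assumes click events are available as `(query, url)` pairs; that layout and the function name `click_through_rates` are illustrative choices, not part of the patent.

```python
from collections import Counter, defaultdict


def click_through_rates(clicks):
    """Return {query: {url: rate}} from an iterable of (query, url) clicks.

    For each query, rate = clicks on that URL / all clicks for the query,
    so the rates of one query sum to 1.
    """
    per_query = defaultdict(Counter)
    for query, url in clicks:
        per_query[query][url] += 1

    rates = {}
    for query, counter in per_query.items():
        total = sum(counter.values())
        rates[query] = {url: n / total for url, n in counter.items()}
    return rates
```

For a query with three clicks on one page and one click on another, the two rates are 0.75 and 0.25.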

Step 2.2: Merging click-through rates from multiple search engines

Log data from a single search engine is relatively easy to obtain, but it suffers from limited retrieval data and from the bias induced by that engine's ranking. If user logs from multiple search engines can be used, the performance gain is better.

Let the probability P(URL | Q) denote the merged user click-through rate of a result page URL for query Q. By the law of total probability for conditional distributions:

$$P(\mathrm{URL} \mid Q) = \sum_{i} P(\mathrm{URL} \mid SE_i, Q)\, P(SE_i \mid Q) \tag{2}$$

From basic probability, P(URL | Q) necessarily lies between 0 and 1. Here P(URL | SE_i, Q) is the click-through rate of the result page URL for query Q in search engine log SE_i, obtained by applying formula (1) to the log of SE_i. P(SE_i | Q) is the support that user log SE_i gives for query Q, obtained as the "SE_i query confidence" computed by formula (3). By formula (3), the query confidence lies between 0 and 1, and the query confidences of all search engines for a given query sum to 1; this value is the weight with which each search engine log's click-through rate for the query enters the merge.
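The merge in formula (2) is a weighted sum over search engines, sketched below. The text states only that the query confidence P(SE_i | Q) lies in [0, 1] and sums to 1 across engines. As an assumption for this sketch, it is taken to be each engine's share of the total clicks recorded for the query, which may differ from the patent's formula (3). The function name and input layout are hypothetical.

```python
def merge_ctr(per_engine_ctr, per_engine_clicks):
    """Merge one query's click-through rates across search engines.

    per_engine_ctr:    {engine: {url: rate}}, rates from formula (1)
    per_engine_clicks: {engine: total clicks for the query in that log}

    ASSUMPTION: P(SE_i | Q) is the engine's share of all clicks for the
    query. The source does not reproduce its formula (3).
    """
    total = sum(per_engine_clicks.values())
    if total == 0:
        return {}

    merged = {}
    for engine, rates in per_engine_ctr.items():
        weight = per_engine_clicks[engine] / total  # assumed P(SE_i | Q)
        for url, rate in rates.items():
            merged[url] = merged.get(url, 0.0) + weight * rate
    return merged
```

Because the weights sum to 1 and each engine's rates sum to 1, the merged rates also sum to 1, which matches the range the text gives for P(URL | Q).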

The merged user click-through rate offsets the shortcomings of a click-through rate obtained from a single search engine log and improves ranking performance more (see the validation experiments).

Step 3: Improving search engine retrieval results with user behavior information

Step 3.1: Screening query result pages

For the result page set of a query Q, the relevant result pages are determined by one of the following two methods.

Fixed click-through-rate sum method: for query Q, the first M pages in descending order of user click-through rate are the result pages obtained from the search engine user logs, where M satisfies: starting from the page with the largest merged click-through rate, the sum of the merged click-through rates of the first M pages is greater than 0.8, the sum for the first M-1 pages is less than 0.8, and each of the M pages has a click-through rate greater than 0.1.

The number of results this method selects can differ from query to query. For queries whose clicks are highly concentrated, it generally returns only a very few pages; for queries whose clicks are relatively dispersed, it may return more pages.
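A possible reading of the fixed click-through-rate sum method is sketched below. The thresholds 0.8 and 0.1 come from the text. The text does not say what happens when a page at or below 0.1 is reached before the cumulative sum exceeds 0.8; this sketch assumes selection simply stops there. The function name `select_by_ctr_sum` is hypothetical.

```python
def select_by_ctr_sum(ctr, sum_threshold=0.8, min_ctr=0.1):
    """Select pages in descending CTR order until their sum exceeds 0.8.

    ctr: {url: rate} for one query.
    ASSUMPTION: if the next page's rate is not above `min_ctr`, selection
    stops even though the cumulative sum has not yet passed the threshold.
    """
    ranked = sorted(ctr.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for url, rate in ranked:
        if rate <= min_ctr:
            break
        selected.append(url)
        cumulative += rate
        if cumulative > sum_threshold:
            break
    return selected
```

A query with rates 0.9 and 0.1 yields a single page, while rates 0.4, 0.3, 0.2, 0.1 yield three, which illustrates the concentrated and dispersed cases described above.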

Fixed result page count method: for query Q, the first n pages in descending order of user click-through rate are the required result pages, where n satisfies: each of the n pages has a click-through rate greater than 0.1, and n is the largest integer less than or equal to N, where N is a constant, usually 3.

This method fixes the number of result pages: no query has more than N result pages, and every query result has a certain level of confidence.
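The fixed result page count method reduces to a filter followed by a cut-off, as in this short sketch. The defaults N = 3 and 0.1 are the values given in the text; the function name `select_top_n` is hypothetical.

```python
def select_top_n(ctr, max_pages=3, min_ctr=0.1):
    """Return at most `max_pages` URLs whose rate exceeds `min_ctr`,
    in descending order of click-through rate.

    ctr: {url: rate} for one query.
    """
    ranked = sorted(ctr.items(), key=lambda item: item[1], reverse=True)
    return [url for url, rate in ranked if rate > min_ctr][:max_pages]
```

For a navigational query where one page holds almost all clicks, the result is that single page, since the remaining pages fall below the 0.1 floor.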

In practice, navigational queries have result pages that are essentially unique and definite, so the fixed result page count method suits them; informational queries have diverse result pages of uncertain number, so the fixed click-through-rate sum method suits them.

The steps above determine the queries for which user behavior information can be merged into the ranking, together with the result pages obtained for them from the logs. These are stored in some form, for example in a database.

Step 3.2: Obtaining the original retrieval results of the designated search engine

When a user submits query keywords to the designated search engine, the query is passed to that search engine, which returns a ranked set of result pages relevant to the query, along with the computed page relevance information.

If no independent search engine is available, the result pages of a particular search engine can be crawled (the experiments used this approach), as follows. First, choose a web page download tool, such as wget or FlashGet, to fetch the page at a given URL. Second, for each query Q, generate the URL of the search engine result page for Q by pattern substitution. Different search engines record Q in the result page URL in different ways, but all of them must record Q in the URL in order to pass it to the server. For the Baidu search engine, for example, the result page URL for Q is http://www.baidu.com/baidu?wd=Q. Finally, invoke the download tool to fetch the page at that URL automatically and obtain the query result page for Q.
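The pattern substitution step can be illustrated as below. The Baidu URL pattern is the one quoted in the text. Percent-encoding the query as GBK is an assumption, chosen because the method stores its logs in GBK; the actual encoding an engine expects may differ. The function name `result_page_url` is hypothetical, and fetching the page (with wget or a similar tool) is left out.

```python
from urllib.parse import quote

# URL pattern quoted in the text, with Q replaced by a placeholder.
BAIDU_PATTERN = "http://www.baidu.com/baidu?wd={query}"


def result_page_url(query, pattern=BAIDU_PATTERN, encoding="gbk"):
    """Build the result page URL for `query` by pattern substitution.

    ASSUMPTION: the query is percent-encoded from its GBK bytes.
    """
    return pattern.format(query=quote(query, safe="", encoding=encoding))
```

Supporting another search engine only requires a different pattern string, which is why the pattern is a parameter rather than hard-coded.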

Step 3.3: Merging retrieval results based on user behavior information

When a user submits a query Q, it is sent both to the search engine (step 3.2) and to the database of query result sets produced in step 3.1, yielding two result page sequences for the query, named sequence SEQ and sequence LOG respectively. If the query set database derived from the user logs does not contain Q, the processing below is skipped and the search engine result sequence SEQ is returned directly. Otherwise the two sequences are merged as follows, and the merged set is returned to the user as the final result page set. First, take each result page from sequence LOG in descending order of user click-through rate and add it to the final result page set, until LOG is exhausted. Second, take each result page from sequence SEQ in its existing order, until SEQ is exhausted; a page that already appears in the final result page set is not added again.
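The merge of the two sequences is a stable, duplicate-free concatenation, as this sketch shows. It assumes `log_pages` is already sorted by descending click-through rate and is empty when the query is not in the database; the function name `merge_results` is hypothetical.

```python
def merge_results(log_pages, seq_pages):
    """Merge sequence LOG and sequence SEQ as described in step 3.3.

    log_pages: URLs from the user-log database, highest CTR first
               (empty if the query is not in the database)
    seq_pages: URLs returned by the search engine, in ranked order
    """
    if not log_pages:
        return list(seq_pages)

    merged, seen = [], set()
    for url in list(log_pages) + list(seq_pages):
        if url not in seen:  # skip pages already placed in the result
            seen.add(url)
            merged.append(url)
    return merged
```

Pages endorsed by user clicks always precede the engine's own ranking, and a page present in both sequences keeps only its earlier, log-derived position.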

These steps merge users' click behavior, which carries the knowledge and information of a large number of users, into the results returned by the search engine. Experiments show that this improves the search engine's ranking performance.

To verify the effectiveness, reliability, and applicability of the invention, we designed and ran validation experiments.

As data sources, we used the user query logs of four common search engines. We also selected about 320 user queries and labeled answer sets for them manually using the pooling technique (proposed by the Text REtrieval Conference, TREC, organized by the US National Institute of Standards and Technology, NIST) to serve as the test answers. The pool drew on major well-known search engines, including Sogou, Baidu, Google, Zhongsou, Yahoo, and Sina, with the top 20 results from each search engine as candidate answers. The validation experiments evaluate performance with mean average precision (MAP), a measure commonly used in information retrieval.
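For reference, the standard definition of mean average precision can be computed as in the sketch below. The patent does not spell out which variant it used, so this is the textbook form under that caveat; the function names and the `(ranked_urls, relevant_set)` input layout are illustrative.

```python
def average_precision(ranked, relevant):
    """Average precision of one ranked list against a set of relevant URLs."""
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, url in enumerate(ranked, start=1):
        if url in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_urls, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

Moving a relevant page from rank 3 to rank 1 raises its contribution from 1/3 to 1, which is how promoting user-endorsed pages shows up in this measure.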

A search engine can easily record its own users' visits and so obtain user behavior information for that single search engine and reorder retrieval results with it. Using search engine user logs collected from March 1 to April 8, 2007, we verified the method's performance by crawling Sogou result pages. Table 2 lists the performance gains after merging for different values of N when result pages are screened with the fixed result page count method; the gain is generally above 5%.

Table 2. Retrieval performance improvement from merging single-search-engine user behavior information

We also used user logs from several search engines to validate ranking with merged multi-engine user behavior information. We examined the improvement after reordering the results of six common search engines, screening result pages with the fixed result page count method and N = 3. Table 3 shows the improvement for each search engine: the method raises the evaluation measure by 15% on average, and the gain is especially pronounced for search engines whose original performance was poor, such as Sina and Zhongsou. For Sogou, single-engine user behavior information gives a 6.9% improvement (see Table 2), while multi-engine user behavior information gives 13.6% (see Table 3), a clearly larger effect.

Table 3. Retrieval performance improvement from merging multi-search-engine user behavior information

The invention automatically screens, from the logs of one or more search engines, the queries users care about and the high-confidence result pages for those queries; when a user submits a query, the relevant results are merged into those returned by the search engine and presented to the user. The method is simple, has low algorithmic complexity, and makes effective use of existing search engine user behavior information, applying the collective wisdom of users to improve retrieval results. It achieved good results on the test data and improved the search engine's retrieval performance. This indicates that the invention generalizes and adapts well, effectively improves search results, and has good prospects for application.

Brief Description of the Drawings

Figure 1. Basic process architecture of a search engine.
Figure 2. Flow of the method for reordering search engine retrieval results based on user behavior information.
Figure 3. Algorithm for merging click-through rates from multiple search engines.
Figure 4. Flow of the two result page screening methods.
Figure 5. Structure of the user behavior information.

Detailed Description

Figure 2 shows the flow of the method. The invention is widely adaptable and applicable for improving search engine performance. The detailed procedure is illustrated below by reordering retrieval results with merged multi-engine user behavior information from the logs of four common search engines.

1. Data preprocessing

The logs used are the user query click records of four common search engines collected over the 37 days from March 18 to April 23, 2007, with 58,092,696 non-empty query click records in total (32,983,339, 11,159,594, 3,450,045, and 10,499,718 for the four search engines respectively). The records include the following information:

Table 4. Information items in the user logs of four common search engines, provided by Sogou

The FromUrl field identifies the search engine to which a log record belongs; the query keywords are usually carried in the parameters of this address. ToUrl is the result page the user clicked. These logs therefore contain the data items listed in Table 1 and provide the user behavior information needed for merged ranking.

Log preprocessing consists of: filtering out records that are not search engine log records (for example, jumps between pages within a search engine's own site); classifying the records by search engine, yielding a separate set of user query click records for each of the four engines; extracting the query keywords from the FromUrl parameters, URL-decoding them, and converting them uniformly to GBK encoding; removing information not required by Table 1, together with related noise; and grouping the click information of identical queries to compute, for each query, the number of users and the number of user clicks on each clicked page.
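The preprocessing steps above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the host name `www.example-search.com`, the parameter name `q`, the `(user_id, from_url, to_url)` record layout, and the function names are all assumptions made for the example. It also assumes the query parameter is percent-encoded GBK; the decoded string can be written back out with `.encode("gbk")` where GBK bytes are required.

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qs

# Hypothetical mapping from search engine host to its query parameter name.
QUERY_PARAMS = {"www.example-search.com": "q"}


def extract_query(from_url, encoding="gbk"):
    """Return (engine_host, query) for a search referrer URL, or None."""
    parts = urlsplit(from_url)
    param = QUERY_PARAMS.get(parts.netloc)
    if param is None:
        return None  # not a known search engine: the record is filtered out
    # Assumption: the keywords are percent-encoded GBK bytes.
    values = parse_qs(parts.query, encoding=encoding, errors="replace").get(param)
    if not values or not values[0].strip():
        return None  # empty query
    return parts.netloc, values[0].strip()


def aggregate(records):
    """Group click records by engine and query.

    records: iterable of (user_id, from_url, to_url) tuples.
    Returns {engine: {query: {"users": set, "clicks": {url: count}}}}.
    """
    stats = defaultdict(
        lambda: defaultdict(lambda: {"users": set(), "clicks": defaultdict(int)})
    )
    for user_id, from_url, to_url in records:
        parsed = extract_query(from_url)
        if parsed is None:
            continue
        engine, query = parsed
        if urlsplit(to_url).netloc == engine:
            continue  # jump within the search engine's own site
        entry = stats[engine][query]
        entry["users"].add(user_id)
        entry["clicks"][to_url] += 1
    return stats
```

The per-query user sets and click counts produced here are the inputs that the later screening and click-through rate steps need.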

2. Screening the common query set

User queries to a search engine are repetitive and concentrated: queries on topics that users care about are submitted often. This concentration and repetition is the basis on which existing user behavior information can be used to improve retrieval performance.

The screening of the query set proceeds as follows. If logs from several search engines are used, each log is examined and screened independently.

Screening common queries in a single search engine log: after a search engine log has been preprocessed, each query Q is screened by its number of querying users. If the query was issued fewer than 20 times in total, it is considered to lack sufficient aggregate click behavior information for effective analysis and to be insufficiently representative of the topics users care about, and it is discarded. Otherwise the query is retained. Earlier analysis of Sogou logs found that more than 30,000 queries were issued more than 100 times, and that clicks on these queries accounted for about 70% of all clicks. This agrees with earlier research findings: a relatively small number of queries is issued repeatedly and accounts for most search engine traffic.
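The screening rule can be written as a one-line filter. The sketch below assumes a hypothetical dictionary mapping each query to its count in one engine's log; the threshold of 20 is the one given in the text, and the function name is illustrative.

```python
MIN_QUERY_COUNT = 20  # threshold stated in the description


def common_queries(query_counts, min_count=MIN_QUERY_COUNT):
    """Keep the queries issued at least `min_count` times in one engine's log."""
    return {query for query, count in query_counts.items() if count >= min_count}
```

When several engine logs are used, this filter would be applied to each log separately, as the text describes.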

3. Obtaining user click-through rate information

The user click-through rate of a query expresses users' preference among the result pages of that query. For a single search engine log, the click-through rate of each page under each query is computed directly with formula (1). If logs from several search engines are used, the per-engine click-through rates are merged with formula (2). Figure 3 describes the merging algorithm, which yields the merged click-through rates.
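Formulas (1) and (2) are not reproduced in this text, so the sketch below rests on two stated assumptions rather than on the patent's exact definitions. First, the single-engine click-through rate of a page is taken to be its share of all clicks for the query. Second, the merged rate is taken to be the weighted sum over engines of P(SEi | Q) · P(URL | SEi, Q), with P(SEi | Q) approximated by engine i's share of the users who issued Q. All function and parameter names are illustrative.

```python
def click_through_rates(clicks):
    """Assumed form of formula (1): each page's share of the clicks for one query.

    clicks: {url: click_count} for a single query in a single engine's log.
    """
    total = sum(clicks.values())
    if total == 0:
        return {}
    return {url: count / total for url, count in clicks.items()}


def merge_rates(per_engine):
    """Assumed form of formula (2): weighted sum of per-engine rates.

    per_engine: {engine: (user_count, {url: click_count})} for one query.
    The weight of an engine is its share of the query's users (an assumption
    standing in for the "query confidence" named in the claim).
    """
    total_users = sum(user_count for user_count, _ in per_engine.values())
    if total_users == 0:
        return {}
    merged = {}
    for user_count, clicks in per_engine.values():
        weight = user_count / total_users
        for url, rate in click_through_rates(clicks).items():
            merged[url] = merged.get(url, 0.0) + weight * rate
    return merged
```

Because the weights sum to 1 and each engine's rates sum to 1, the merged rates under these assumptions also sum to 1 and each lies between 0 and 1.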

The user click-through rate captures the aggregate user population's judgment of each result page for a given query. Click-through rates based on a single search engine are biased in two ways: by the engine's own result ranking, which steers clicks, and by its data set, which is limited by the engine's resources. Such logs are, however, relatively simple to collect, and the approach is easy to implement. Click-through rates based on several search engines largely avoid these biases, but because commercial search engines compete with one another, their logs are relatively hard to obtain.

4. Screening user log results and reordering search engine results

Once the click-through rate of each query's result pages is known, the clicked pages must be screened. Figure 4 describes two methods that use the click-through rates to screen clicked pages: the fixed click-through-rate sum method and the fixed result page count method. After screening, the common queries and their selected result pages are stored in a database, which provides a lookup service. When a user submits a query Q, it is looked up in the database of common queries and their result pages; if the database contains Q, the associated result pages are returned, otherwise the result is empty. The query is also submitted to the search engine to obtain the engine's results. The final result list for Q is built by taking first the result pages obtained from user behavior information and then the pages returned by the search engine, and is returned to the user.
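The two screening methods can be sketched as below, using the thresholds given in the claim (a cumulative sum above 0.8, a per-page rate above 0.1, and at most 3 pages). The function names are illustrative. One point is an assumption: the text does not say what happens when the fixed-sum conditions cannot all be met, so this sketch returns an empty list in that case.

```python
def fixed_rate_sum(rates, sum_threshold=0.8, min_rate=0.1):
    """Fixed click-through-rate sum method.

    Returns the smallest top-M prefix whose rates sum to more than
    `sum_threshold`, provided every page in it has a rate above `min_rate`.
    Assumption: returns [] when no such prefix exists.
    """
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    selected, total = [], 0.0
    for url, rate in ranked:
        if rate <= min_rate:
            return []  # a needed page is too weak (assumed behavior)
        selected.append(url)
        total += rate
        if total > sum_threshold:
            return selected
    return []  # the sum never exceeded the threshold


def fixed_page_count(rates, max_pages=3, min_rate=0.1):
    """Fixed result page count method: top pages above `min_rate`, at most `max_pages`."""
    ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
    return [url for url, rate in ranked[:max_pages] if rate > min_rate]
```

The first method adapts the number of pages to how concentrated the clicks are, while the second caps the list at a small fixed size; both return pages in descending order of click-through rate.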

Following the steps above, the query behavior and collective judgment of the search engine's user population can be used to improve search engine results and retrieval performance.
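The final lookup and merge can be sketched as follows. The dictionary `store` stands in for the database of common queries and their screened result pages, and `engine_pages` for the list returned by the search engine; both names, and the function name, are assumptions made for illustration.

```python
def rerank(query, store, engine_pages):
    """Place the pages derived from user behavior first, then the engine's pages.

    store: {query: [url, ...]} with pages in descending click-through order.
    engine_pages: the search engine's original result list for `query`.
    A page already placed in the result is not added again.
    """
    log_pages = store.get(query, [])  # empty when the query is not a common query
    merged, seen = [], set()
    for url in list(log_pages) + list(engine_pages):
        if url not in seen:
            seen.add(url)
            merged.append(url)
    return merged
```

For a query that is not in the store, the function returns the engine's own ranking with any repeated pages collapsed.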

Claims (1)

1. A method for reordering search engine retrieval results based on user behavior information, characterized in that the method is carried out on a search engine's computer by the following steps in order:

Step (1). Screening the set of users' common queries:

Step (1.1). Data preprocessing: a search engine web server extracts, from the user logs of at least one search engine, the user information needed to reorder search results based on user behavior. The resulting user information contains at least the following: Query: the query submitted by the user; URL: the address of the result page the user clicked for that query; Id: the identification number the system automatically assigns to each user each time the user uses the search engine. Step (1.1) comprises the following sub-steps in order:

Step (1.1.1). Converting the encoding of the records logged by the search engine web server to GBK, the national standard Chinese character encoding;

Step (1.1.2). Removing information other than Query, URL and Id, and arranging the log information into strings of these three content items;

Step (1.1.3). Within the scope of step (1.1.2), using a string matching algorithm such as KMP to filter noise out of the user queries, keeping only content items that directly reflect the query needs of ordinary search engine users;

Step (1.2). Extracting the number of querying users: for each query Q submitted by users in the user log over a set recent period, counting the number of users who submitted Q; this value represents users' attention to the query;

Step (1.3). Screening the common query set: if the number of querying users of a query Q in the search engine user log is below a set value, Q is excluded from the common query set; otherwise Q is placed in the common query set S;

Step (2). Extracting user click-through rate information:

Step (2.1). Extracting the user click-through rate for a single search engine: (formula not reproduced in the source text)

Step (2.2). Merging user click-through rate information across multiple search engines: the probability P(URL | query Q) denotes the merged user click-through rate of result page address URL for query Q: (formula not reproduced in the source text)

where P(SEi | query Q) denotes the probability of query Q in the i-th search engine SEi and is expressed by the query confidence of SEi: (formula not reproduced in the source text)

and P(URL | SEi, query Q) denotes, in the log of search engine SEi, the click-through rate of result page address URL for query Q, obtained by the method of step (2.1), with i = 1, 2, ..., I; P(URL | SEi, query Q) therefore lies between 0 and 1;

Step (3). Using user behavior information to improve search engine results:

Step (3.1). Screening the result pages clicked by users with one of the following two methods, then saving the resulting page set:

Fixed click-through-rate sum method: for query Q, find the M pages with the highest user click-through rates according to the search engine user behavior information, such that the sum of the merged click-through rates of the first M consecutive pages is greater than 0.8, the sum for the first M-1 consecutive pages is less than 0.8, and the click-through rate of each of the M pages is greater than 0.1;

Fixed result page count method: for query Q, find the first n consecutive pages with the highest user click-through rates, such that the click-through rate of each of the n pages is greater than 0.1 and n ≤ 3;

Step (3.2). Obtaining the search engine's original retrieval results: query Q is submitted to the designated search engine to obtain that engine's sequence of retrieval results;

Step (3.3). Merging retrieval results based on user behavior information: when a user submits query Q to the designated search engine, the engine's original result sequence SEQ is obtained as in step (3.2); Q is also looked up in the set of result pages determined from user log information in step (3.1), yielding the result page sequence LOG. The two sequences are merged as follows and the final result is returned to the user: each result page in LOG is taken in descending order of user click-through rate and placed in the final result page set, until LOG is exhausted; then each result page in SEQ is taken in order and placed in the final result page set, until SEQ is exhausted; a page already present in the final result page set is discarded.
CN 200710099594 2007-05-25 2007-05-25 Search engine retrieving result reordering method based on user behavior information CN100507920C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200710099594 CN100507920C (en) 2007-05-25 2007-05-25 Search engine retrieving result reordering method based on user behavior information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200710099594 CN100507920C (en) 2007-05-25 2007-05-25 Search engine retrieving result reordering method based on user behavior information

Publications (2)

Publication Number Publication Date
CN101055587A (en) 2007-10-17
CN100507920C (en) 2009-07-01

Family

ID=38795423

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200710099594 CN100507920C (en) 2007-05-25 2007-05-25 Search engine retrieving result reordering method based on user behavior information

Country Status (1)

Country Link
CN (1) CN100507920C (en)




Also Published As

Publication number Publication date Type
CN100507920C (en) 2009-07-01 grant


Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C14 Grant of patent or utility model
C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data