WO2009000174A1 - Method and device of web page rank - Google Patents

Method and device of web page rank

Info

Publication number
WO2009000174A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
category
user
web page
vector
Prior art date
Application number
PCT/CN2008/070608
Other languages
French (fr)
Chinese (zh)
Inventor
Zhiyuan Liu
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2009000174A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • The present invention relates to the field of computer applications, and in particular to a web page sorting method and apparatus.
  • Search engines are currently a fiercely competitive field.
  • Search engines compete not only on rich content but also on user experience.
  • The problem search engines face today is not too little information but too much: searching for a single keyword often returns thousands of results.
  • An embodiment of the present invention is implemented as a web page sorting method.
  • The method includes: each web page corresponds to at least one content-related web page category, and each web page corresponds to one web page category vector; the web page category vector includes at least one element, and the at least one element represents the respective weights of the at least one web page category corresponding to the web page;
  • when a second user searches for web pages, the web page category corresponding to the search content is determined, and the at least one web page found is sorted according to the value of the element corresponding to the determined web page category in each web page category vector.
  • A web page sorting apparatus comprising: a first module, configured to determine, from the web pages a user has visited, the web page category the user uses most, and to determine the web page category vector corresponding to each web page; wherein each web page corresponds to at least one content-related web page category, each web page corresponds to one web page category vector, the web page category vector includes at least one element, and the at least one element represents the respective weights of the at least one web page category corresponding to the web page;
  • a second module, configured to, when a first user clicks a web page, determine the element of the clicked page's web page category vector (as determined by the first module) that corresponds to the web page category the first user uses most, and increase the value of the determined element;
  • a third module, configured to, when a second user searches for web pages, determine the web page category corresponding to the search content, and sort the at least one web page found according to the value, obtained from the second module, of the element corresponding to the determined web page category in each web page category vector.
  • FIG. 1 is a flowchart of a web page sorting method according to an embodiment of the present invention.
  • FIG. 2 is a structural diagram of a search engine in an embodiment of the present invention.
  • FIG. 3 is a schematic structural diagram of a web page sorting apparatus according to an embodiment of the present invention.
  • Embodiments of the present invention assign each user an expert category according to the Internet Protocol (IP) logs of the user's visits and, when a user clicks a web page, add a score to the value of that page's web page category vector; when a user retrieves information, the search results are sorted according to the web page category vectors.
  • IP Internet Protocol
  • Embodiments of the present invention provide a method for sorting web pages.
  • each web page corresponds to at least one web page category related to the content
  • each web page corresponds to a web page category vector.
  • the web page category vector includes at least one element, and at least one element included in the web page category vector respectively represents a weight of each of the at least one web page category corresponding to the web page.
  • the web page category vector is an n-dimensional vector, where n is equal to the number of web page categories. It should be noted that the web page category vector can be implemented with an array containing n elements, n equal to the number of web page categories.
  • The web page category the user uses most is determined from the web pages the user has visited; that is, the user's expert category is determined.
  • The web page category a user uses most can be determined from the user's behavior. For example, the IP logs of the user's visits can be classified and the most-used category determined from the IP category the user visits most; alternatively, the search terms the user enters can be classified and the most-used category determined from the category to which the user's most frequent search terms belong.
  • Other ways of determining the user's most-used web page category from the user's behavior are well known to those skilled in the art and are not listed here.
  • After the user's most-used web page category has been determined, when a first user clicks a web page, the element of the clicked page's web page category vector that corresponds to the category the first user uses most is determined, and the value of that element is increased.
  • Specifically, the value of the element may be increased by adding 1 to it. The same operation is repeated when another user clicks a web page.
  • When a second user searches for web pages, the web page category corresponding to the search content entered by the second user is determined first, and the at least one web page found is then sorted according to the value of the element corresponding to the determined web page category in each web page category vector.
  • FIG. 1 shows the flow of a web page sorting method provided by an embodiment of the present invention, described in detail below.
  • In step S101, the web page category vector established by the user is stored.
  • A vector is a one-dimensional matrix that can hold an item's scores for all elements of a given set.
  • By assigning a vector to a web page, the embodiment stores the page's value for each category in the category set. For example, if the category set is {"sports", "news"}, the page's vector stores the page's score for "sports" and its score for "news", and both scores can be read by accessing the vector.
  • In practice the category set contains hundreds of categories, so the web page vectors store each page's score for each of those hundreds of categories.
  • An n-dimensional vector, called the web page category vector, is used for every web page.
  • The dimension n of the vector equals the number of categories in the web page category set A.
  • The vector gives the page's weight in each category, that is, the proportion of the page that falls in each category. Because a web page does not necessarily belong to a single category, a vector can express the page's weight in every category: the weight for each category is represented by one element, and the array of these elements constitutes the vector.
  • In the prior art, most websites are able to establish a category set A from the content of current Internet web pages, for example history, military, tourism, humanities, and automobiles.
  • In step S102, the IP logs of the user's visits are classified, and the user's expert category is determined from the IP category the user visits most.
  • The process of obtaining the IP logs of a user's visits is as follows.
  • The typical search engine structure shown in FIG. 2 includes a crawler, an indexer, a retriever, and so on. The crawler's main tasks are to assign each web page a Uniform Resource Locator identifier (Uniform Resource Locator Identify, URLID) and to download web pages.
  • The crawler assigns each Internet web page a unique identifier (ID) so that different pages can be told apart. Each URLID corresponds to a structure that includes the page's text content, its additional attributes, and so on.
  • The crawler downloads web pages from the Internet, assigns each a unique URLID, and stores them in the original database.
  • The indexer reads the web page information from the original database, builds an index, and stores it in the index database.
  • When a user enters search information, the retriever receives the input, obtains records from the index database, sorts them, and returns them to the user, while recording the user's operations in the user behavior log.
  • When a user's expert category is determined, the algorithm used is as follows.
  • The user's expert category represents the web page category the user uses most.
  • For example, the user enters the search string "T43", and the search engine classifies the string and obtains the category "computer".
  • When the search engine sorts the search results, it takes the web page category vectors into account and ranks pages with a larger "computer" weight first.
  • In step S103, when a user clicks a web page in the search engine's results, a score is added to the value of that page's web page category vector according to the user's determined expert category.
  • For example, after searching, the user clicks a web page. If the user is an expert in one of the categories of the web page category vector, a score is added to the corresponding element of the page's vector. That is, for the page the user clicked, the value in the web page category vector that corresponds to the user's expert category is increased, which increases the weight of that element.
  • In step S104, when a user searches through the search engine, the search results are optimally sorted with reference to the scores in the web page category vectors.
  • The algorithm used in this step is as follows.
  • The search results are pre-sorted; in this embodiment of the invention, the PageRank technique is used for the pre-sort.
  • For each search result page c, the web page category vector corresponding to page c is queried, and the page's expert recommendation value for category a is read.
  • FIG. 3 shows the structure of a web page sorting apparatus provided by an embodiment of the present invention.
  • The web page category vector storage module 11 stores the web page category vectors established by the user, where each vector identifies the weights of its corresponding web page over the web page category set.
  • Each web page corresponds to at least one content-related web page category, and each web page corresponds to one web page category vector; the web page category vector includes at least one element, and the at least one element represents the respective weights of the at least one web page category corresponding to the web page.
  • The user expert category determining module 12 classifies the IP logs of the user's visits and determines the user's expert category from the IP category the user visits most.
  • When a user clicks a web page in the search engine's results, the web page category vector scoring module 13 adds a score to the value of that page's web page category vector according to the user's expert category as determined by module 12. The specific process has been described above and is not repeated here.
  • When a user enters a query through the search engine, the web page optimization ranking module 14 optimizes the order of the retrieved web pages by referring to each page's web page category vector.
  • The web page display module 15 displays the optimized, sorted web pages.
  • Embodiments of the present invention assign each user an expert category according to the IP logs of the user's visits, add a score to the value of a page's web page category vector when a user clicks that page, and sort a user's search results according to the web page category vectors. This solves the prior-art problem that scoring pages directly by click count invites malicious clicks and indiscriminate score inflation.

Abstract

A method and device for web page ranking are provided, applicable to the field of computer applications. In the method, each web page corresponds to at least one content-related web page type, and each web page corresponds to one web page type vector containing at least one element, the elements representing the respective weights of the web page types to which the page corresponds. When a first user clicks a web page, the element of the clicked page's vector that corresponds to the web page type the first user uses most is determined, and the value of that element is increased. When a second user searches for web pages, the web page type corresponding to the search content is determined, and the pages found are sorted by the value of the element of each page's vector that corresponds to the determined type. This resolves the prior-art problems of malicious clicks and indiscriminate score inflation that arise when pages are scored directly by user click counts.

Description

Web page sorting method and device
Technical field
The present invention relates to the field of computer applications, and in particular to a web page sorting method and device.
Background of the invention
Search engines are currently a fiercely competitive field. Beyond rich content, search engines also compete on user experience. Generally speaking, the problem search engines face today is not too little information but too much: searching for a single keyword often returns thousands of results.
In practice, users of a search engine want the first page, or even the first five Uniform Resource Locators (URLs), to contain the information they are looking for, so ranking has become a key factor in search engine quality. The well-known search engine Google became the world's number one search engine in a short time precisely because the PageRank technique it invented solves the ranking problem effectively.
Today, however, Internet companies all understand PageRank and most have adopted it. In fact, no large search engine now ranks results with a single algorithm; rankings combine dozens or even hundreds of factors. Besides PageRank, commonly used algorithms include HITS (a hyperlink-based retrieval algorithm) and Hilltop (a search engine ranking algorithm suited to broad categories).
In the prior art, when the results of a user's search are ranked, web pages are scored directly by the number of user clicks.
Summary of the invention
An embodiment of the present invention is implemented as a web page sorting method, in which:
each web page corresponds to at least one content-related web page category, and each web page corresponds to one web page category vector; the web page category vector contains at least one element, and the at least one element represents the respective weights of the at least one web page category corresponding to the web page;
the web page category a user uses most is determined from the web pages the user has visited;
when a first user clicks a web page, the element of the clicked page's web page category vector that corresponds to the web page category the first user uses most is determined, and the value of the determined element is increased;
when a second user searches for web pages, the web page category corresponding to the search content is determined, and the at least one web page found is sorted according to the value of the element corresponding to the determined web page category in each web page category vector.
Another embodiment of the present invention provides a web page sorting device, comprising:
a first module, configured to determine, from the web pages a user has visited, the web page category the user uses most, and to determine the web page category vector corresponding to each web page; wherein each web page corresponds to at least one content-related web page category, each web page corresponds to one web page category vector, the web page category vector contains at least one element, and the at least one element represents the respective weights of the at least one web page category corresponding to the web page;
a second module, configured to, when a first user clicks a web page, determine the element of the clicked page's web page category vector (as determined by the first module) that corresponds to the web page category the first user uses most, and increase the value of the determined element;
a third module, configured to, when a second user searches for web pages, determine the web page category corresponding to the search content, and sort the at least one web page found according to the value, obtained from the second module, of the element corresponding to the determined web page category in each web page category vector.
Brief description of the drawings
FIG. 1 is a flowchart of a web page sorting method according to an embodiment of the present invention;
FIG. 2 is a structural diagram of a search engine in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a web page sorting device according to an embodiment of the present invention.
Mode for carrying out the invention
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it.
Embodiments of the present invention assign each user an expert category according to the Internet Protocol (IP) logs of the user's visits and, when a user clicks a web page, add a score to the value of that page's web page category vector; when a user retrieves information, the search results are sorted according to the web page category vectors.
An embodiment of the present invention provides a web page sorting method. In this method, each web page corresponds to at least one content-related web page category, and each web page corresponds to one web page category vector. The web page category vector contains at least one element, and the elements represent the respective weights of the web page categories to which the page corresponds. The web page category vector is an n-dimensional vector, where n equals the number of web page categories. Note that the web page category vector can be implemented as an array of n elements.
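The array representation described above can be sketched as follows. This is a minimal illustration in Python, not the patent's implementation; the names `CATEGORIES`, `CATEGORY_INDEX`, `new_category_vector`, and `page_vectors`, the five-category list, and the page identifier are all assumptions made for the example.

```python
# Illustrative category set (an assumption); the description only gives
# examples such as history, military, tourism, humanities, and automobiles.
CATEGORIES = ["history", "military", "tourism", "humanities", "automobiles"]

# Map each category name to its position in the vector.
CATEGORY_INDEX = {name: i for i, name in enumerate(CATEGORIES)}


def new_category_vector():
    """Return an n-element weight array, one element per category."""
    return [0] * len(CATEGORIES)


# One vector per web page, keyed here by a hypothetical page identifier.
page_vectors = {"page-1": new_category_vector()}

# Read the weight of "tourism" for page-1 by indexing into its vector.
weight = page_vectors["page-1"][CATEGORY_INDEX["tourism"]]
```

Storing the weights in a fixed-order array keeps each lookup a single index operation, at the cost of having to rebuild the vectors if the category set changes.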
First, the web page category the user uses most is determined from the web pages the user has visited; that is, the user's expert category is determined. The most-used category can be determined from the user's behavior. For example, the IP logs of the user's visits can be classified and the most-used category determined from the IP category the user visits most; alternatively, the search terms the user enters can be classified and the most-used category determined from the category to which the user's most frequent search terms belong. Other ways of determining the user's most-used web page category from the user's behavior are well known to those skilled in the art and are not listed here.
After the user's most-used web page category has been determined, when a first user clicks a web page, the element of the clicked page's web page category vector that corresponds to the category the first user uses most is determined, and the value of that element is increased. Specifically, the value may be increased by adding 1 to it. The same operation is repeated when another user clicks a web page.
When a second user searches for web pages, the web page category corresponding to the search content entered by the second user is determined first, and the at least one web page found is then sorted according to the value of the element corresponding to the determined web page category in each web page category vector.
FIG. 1 shows the flow of the web page sorting method provided by an embodiment of the present invention, described in detail below.
In step S101, the web page category vector established by the user is stored.
A vector is a one-dimensional matrix that can hold an item's scores for all elements of a given set. By assigning a vector to a web page, the embodiment stores the page's value for each category in the category set. For example, if the category set is {"sports", "news"}, the page's vector stores the page's score for "sports" and its score for "news", and both scores can be read by accessing the vector. In practice the category set contains hundreds of categories, so the web page vectors store each page's score for each of those hundreds of categories.
An n-dimensional vector, called the web page category vector, is used for every web page. The dimension n of the vector equals the number of categories in the web page category set A. The vector gives the page's weight in each category, that is, the proportion of the page that falls in each category. Because a web page does not necessarily belong to a single category, a vector can express the page's weight in every category: the weight for each category is represented by one element, and the array of these elements constitutes the vector. In the prior art, most websites are able to establish a category set A from the content of current Internet web pages, for example history, military, tourism, humanities, and automobiles.
In step S102, the IP logs of the user's visits are classified, and the user's expert category is determined from the IP category the user visits most.
The process of obtaining the IP logs of a user's visits is as follows. The typical search engine structure shown in FIG. 2 includes a crawler, an indexer, a retriever, and so on. The crawler's main tasks are to assign each web page a Uniform Resource Locator identifier (Uniform Resource Locator Identify, URLID) and to download web pages. The crawler assigns each Internet web page a unique identifier (ID) so that different pages can be told apart. Each URLID corresponds to a structure that includes the page's text content, its additional attributes, and so on.
The crawler downloads web pages from the Internet, assigns each a unique URLID, and stores them in the original database. The indexer reads the web page information from the original database, builds an index, and stores it in the index database.
When a user enters search information, the retriever receives the input, obtains records from the index database, sorts them, and returns them to the user, while recording the user's operations in the user behavior log.
When a user's expert category is determined, the algorithm used is as follows.
Define the expert array UserType[], where UserType[i] denotes the expert category of the i-th user.
For (each user i)
    Define the category counter array TypeCounter[]
    Read all historical search records of user i
    For (each search search[j] of user i)
        Classify search[j] to obtain category ID = a
        TypeCounter[a] = TypeCounter[a] + 1, adding 1 to this category for the user
    UserType[i] = the category with the highest count in TypeCounter
Return the expert categories UserType[].
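The counting procedure above can be sketched in Python as follows. The patent does not specify the query classifier, so `classify_query` here is a hypothetical keyword lookup used only to make the example runnable; `expert_categories`, the sample users, and the fallback category are likewise assumptions.

```python
from collections import Counter


def classify_query(query):
    """Hypothetical stand-in classifier: map a search string to a category."""
    keyword_to_category = {
        "T43": "computer",
        "laptop": "computer",
        "football": "sports",
    }
    # Unknown queries fall back to "news" purely for illustration.
    return keyword_to_category.get(query, "news")


def expert_categories(search_history):
    """Return {user: most frequent query category} from {user: [queries]}."""
    user_type = {}
    for user, queries in search_history.items():
        # Plays the role of TypeCounter[]: one count per category.
        type_counter = Counter(classify_query(q) for q in queries)
        if type_counter:
            # most_common(1) yields [(category, count)] for the top category.
            user_type[user] = type_counter.most_common(1)[0][0]
    return user_type


history = {"alice": ["T43", "laptop", "football"], "bob": ["football"]}
user_types = expert_categories(history)
```

In this sketch a user with no search history receives no expert category, which is one possible reading of the procedure; the source does not say how that case is handled.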
The user's expert category represents the web page category the user uses most.
For example, the user enters the search string "T43", and the search engine classifies the string and obtains the category "computer". When the search engine sorts the search results, it takes the web page category vectors into account and ranks pages with a larger "computer" weight first.
In step S103, when a user clicks a web page in the search engine's results, a score is added to the value of that page's web page category vector according to the user's determined expert category.
For example, after searching, the user clicks a web page. If the user is an expert in one of the categories of the web page category vector, a score is added to the corresponding element of the page's vector. That is, for the page the user clicked, the value in the web page category vector that corresponds to the user's expert category is increased, which increases the weight of that element.
In a specific implementation, when a score is added to the value of the web page category vector of the page the user clicked according to the user's expert category, the algorithm used is as follows.
IF (the user clicks a web page)
    Determine the user's expert category
    IF (the user belongs to expert category a, a ∈ A)
        Increment by 1 the value of element a of the page's web page category vector
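A minimal Python sketch of this click-scoring step, under stated assumptions: the three-category list, the `record_click` function, and the user and page identifiers are hypothetical, and users without a known expert category are simply ignored.

```python
# Illustrative category set A (an assumption for this example).
CATEGORIES = ["sports", "news", "computer"]
CATEGORY_INDEX = {name: i for i, name in enumerate(CATEGORIES)}


def record_click(page_vectors, user_types, user, page):
    """Add 1 to the clicked page's element for the user's expert category."""
    category = user_types.get(user)
    if category not in CATEGORY_INDEX:
        return  # no known expert category in A: the click adds no score
    # Create the page's vector on first use, then increment one element.
    vector = page_vectors.setdefault(page, [0] * len(CATEGORIES))
    vector[CATEGORY_INDEX[category]] += 1


page_vectors = {}
user_types = {"alice": "computer"}
record_click(page_vectors, user_types, "alice", "page-1")
record_click(page_vectors, user_types, "alice", "page-1")
record_click(page_vectors, user_types, "guest", "page-1")  # ignored
```

Only the element for the clicking user's expert category changes, so a click by a "computer" expert never raises a page's "sports" or "news" weight.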
In step S104, when a user searches through the search engine, the search results are optimally sorted with reference to the scores in the web page category vectors.
The algorithm used in this step is as follows.
IF (the user searches for the term "KKK")
    Classify "KKK" to obtain its category a (a ∈ A)
    The search engine calls the retriever to obtain the search results.
    Pre-sort the search results; in this embodiment of the invention, the PageRank technique is used for the pre-sort.
    For (each search result page c)
        Look up the web page category vector of page c and read the page's expert recommendation value for category a
        Adjust the rank of page c according to the size of its expert recommendation value, moving pages with larger values forward.
    Return the sorted set of web pages and display the sorted results.
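The re-ranking step can be sketched as below. The source does not say how the expert recommendation value is combined with the pre-sort order, so this example assumes the simplest reading: sort by the expert value for the query's category, and let the pre-sort order break ties. The function name `rerank` and the sample data are hypothetical.

```python
def rerank(presorted_pages, page_vectors, category_index):
    """Move pages with larger expert values for the query category forward.

    presorted_pages is assumed to be already ordered, for example by PageRank.
    Python's sort is stable, so pages with equal expert values keep that order.
    """
    def expert_value(page):
        vector = page_vectors.get(page)
        # Pages that have never been scored count as 0.
        return vector[category_index] if vector else 0

    return sorted(presorted_pages, key=expert_value, reverse=True)


# Vectors over the hypothetical categories ["sports", "news", "computer"].
page_vectors = {"p1": [0, 0, 1], "p2": [4, 0, 7], "p3": [0, 2, 0]}
presorted = ["p1", "p3", "p2", "p4"]  # p4 has no vector yet
ranked = rerank(presorted, page_vectors, category_index=2)
```

A production ranker would more likely blend the two signals than let one dominate; the strict ordering here is only meant to show where the category vector enters the sort.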
图 3示出了本发明实施例提供的网页排序装置的结构。  FIG. 3 shows the structure of a webpage sorting apparatus provided by an embodiment of the present invention.
The web page category vector storage module 11 stores the web page category vectors established by users, where each web page category vector identifies the weights of its corresponding web page over the set of web page categories. Each web page corresponds to at least one content-related web page category and to one web page category vector; the web page category vector contains at least one element, and each element represents the weight of one of the web page categories to which the page corresponds.
The user expert category determining module 12 classifies the IP logs of the user's accesses and determines the user's expert category from the IP category the user accesses most. When the user clicks a web page in the search engine's results, the web page category vector scoring module 13 adds points to the web page category vector of that page according to the user's expert category as determined by module 12. The detailed process is described above and is not repeated here. When a user enters a search term in the search engine to retrieve information, the web page optimization ranking module 14 re-ranks the retrieved web pages with reference to their web page category vectors. The web page display module 15 displays the re-ranked web pages.
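As a rough illustration of what the user expert category determining module 12 does, the following Python sketch counts the categories of the addresses in a user's access log and returns the most frequent one. The address-to-category table `ADDRESS_CATEGORY` and the sample log are assumptions; the text does not specify how an IP address is mapped to a category.

```python
from collections import Counter

# Hypothetical lookup from a visited address to a web page category.
ADDRESS_CATEGORY = {
    "192.0.2.1": "sports",
    "192.0.2.2": "finance",
    "192.0.2.3": "sports",
}


def expert_category(access_log):
    """Return the category the user accessed most, or None if none is known."""
    counts = Counter(
        ADDRESS_CATEGORY[addr] for addr in access_log if addr in ADDRESS_CATEGORY
    )
    if not counts:
        return None
    return counts.most_common(1)[0][0]


log = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.9"]
print(expert_category(log))  # sports
```

Addresses missing from the table are skipped, and when two categories tie, `most_common` returns the one counted first; a real system would need an explicit tie-breaking rule.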
The embodiment of the present invention assigns each user an expert category according to the IP logs of the user's accesses, adds points to the web page category vector of each web page the user clicks, and, when a user retrieves information, ranks the search results according to the web page category vectors. This solves the problem in the prior art that scoring web pages directly by click counts invites malicious clicking and leads to indiscriminate scoring and recommendation.
The above is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A web page sorting method, comprising:
each web page corresponding to at least one content-related web page category and to one web page category vector, the web page category vector containing at least one element, the at least one element representing the respective weights of the at least one web page category to which the web page corresponds;
determining, according to the web pages a user has visited, the web page category the user uses most;
when a first user clicks a web page, determining, in the web page category vector of the clicked web page, the element corresponding to the web page category the first user uses most, and increasing the value of the determined element;
when a second user searches for web pages, determining the web page category corresponding to the search content, and sorting the at least one web page found according to the value of the element corresponding to the determined web page category in each web page category vector.
2. The web page sorting method according to claim 1, wherein determining, according to the web pages the user has visited, the web page category the user uses most comprises:
classifying the IP logs of the user's accesses, and determining the web page category the user uses most according to the IP category the user accesses most.
3. The web page sorting method according to claim 2, wherein the web page category vector is an n-dimensional vector, n being equal to the number of web page categories.
4. The web page sorting method according to claim 2, wherein the value of an element in the web page category vector is a number of clicks, and increasing the value of the determined element comprises:
adding 1 to the value of the determined element.
5. The web page sorting method according to claim 2, wherein the method further comprises:
displaying the sorted web pages.
6. A web page sorting apparatus, comprising:
a first module, configured to determine, according to the web pages a user has visited, the web page category the user uses most, and to determine the web page category vector corresponding to each web page; wherein each web page corresponds to at least one content-related web page category and to one web page category vector, the web page category vector contains at least one element, and the at least one element represents the respective weights of the at least one web page category to which the web page corresponds;
a second module, configured to, when a first user clicks a web page, determine, in the web page category vector determined by the first module for the clicked web page, the element corresponding to the web page category the first user uses most, and increase the value of the determined element;
a third module, configured to, when a second user searches for web pages, determine the web page category corresponding to the search content, and sort the at least one web page found according to the value of the element corresponding to the determined web page category in each web page category vector from the second module.
7. The web page sorting apparatus according to claim 6, wherein the first module comprises:
a first unit, configured to determine the web page categories corresponding to a web page, and the web page category vector that corresponds to the web page and contains the weights of those web page categories;
a second unit, configured to determine, according to the web pages the user has visited, one of the web page categories determined by the first unit as the web page category the user uses most.
8. The web page sorting apparatus according to claim 7, wherein
the first unit is configured to classify the IP logs of the user's accesses and to determine the web page category the user uses most according to the IP category the user accesses most.
9. The web page sorting apparatus according to claim 8, wherein the web page category vector is an n-dimensional vector, n being equal to the number of web page categories.
10. The web page sorting apparatus according to claim 8, wherein the apparatus further comprises:
a fourth module, configured to receive the web pages sorted by the third module and to display the sorted web pages.
PCT/CN2008/070608 2007-06-25 2008-03-27 Method and device of web page rank WO2009000174A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN200710076164.2 2007-06-25
CN2007100761642A CN101079064B (en) 2007-06-25 2007-06-25 Web page sequencing method and device

Publications (1)

Publication Number Publication Date
WO2009000174A1 true WO2009000174A1 (en) 2008-12-31

Family

ID=38906543

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2008/070608 WO2009000174A1 (en) 2007-06-25 2008-03-27 Method and device of web page rank

Country Status (2)

Country Link
CN (1) CN101079064B (en)
WO (1) WO2009000174A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101515360A (en) 2009-04-13 2009-08-26 阿里巴巴集团控股有限公司 Method and server for recommending network object information to user
CN101840420B (en) * 2010-04-02 2011-12-28 清华大学 Search aid system, search aid method and program
CN101996240A (en) * 2010-10-13 2011-03-30 蔡亮华 Method and device for providing information
CN102542474B (en) 2010-12-07 2015-10-21 阿里巴巴集团控股有限公司 Result ranking method and device
CN102541857A (en) * 2010-12-08 2012-07-04 腾讯科技(深圳)有限公司 Webpage sorting method and device
CN102722503A (en) * 2011-03-31 2012-10-10 北京百度网讯科技有限公司 Method and device for sequencing search results
CN102231152B (en) * 2011-05-25 2014-09-03 北京捷讯华泰科技有限公司 Searching method for precisely inquiring based on IP (Internet Protocol) address of mobile terminal
CN102956009B (en) 2011-08-16 2017-03-01 阿里巴巴集团控股有限公司 A kind of electronic commerce information based on user behavior recommends method and apparatus
CN103164804B (en) 2011-12-16 2016-11-23 阿里巴巴集团控股有限公司 The information-pushing method of a kind of personalization and device
CN109344321B (en) * 2012-05-08 2021-11-02 潍坊久宝智能科技有限公司 System for obtaining user personalized features
TWI465948B (en) * 2012-05-25 2014-12-21 Gemtek Technology Co Ltd Method for dlna pre-browsing and customizing browsing result and digital media device using the same
CN102722545B (en) * 2012-05-25 2015-11-25 百度在线网络技术(北京)有限公司 A kind of method, device and equipment for sorting to releasing news
CN103399861B (en) * 2013-07-04 2017-03-08 百度在线网络技术(北京)有限公司 A kind of network address in Web side navigation recommends methods, devices and systems
CN104636366B (en) * 2013-11-11 2020-06-02 腾讯科技(深圳)有限公司 Method and device for acquiring search result queue
CN105224657B (en) * 2015-09-30 2018-10-12 北京奇虎科技有限公司 A kind of information recommendation method and electronic equipment based on search engine
CN107153656B (en) * 2016-03-03 2020-12-01 阿里巴巴集团控股有限公司 Information searching method and device
CN107870941B (en) * 2016-09-27 2021-11-02 北京搜狗科技发展有限公司 Webpage sorting method, device and equipment
CN108182186B (en) * 2016-12-08 2020-10-02 广东精点数据科技股份有限公司 Webpage sorting method based on random forest algorithm
CN106777201B (en) * 2016-12-23 2021-01-08 北京奇元科技有限公司 Method and device for sorting recommended data on search result page

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1389811A (en) * 2002-02-06 2003-01-08 北京造极人工智能技术有限公司 Intelligent search method of search engine
US20030131000A1 (en) * 2002-01-07 2003-07-10 International Business Machines Corporation Group-based search engine system
US20050256848A1 (en) * 2004-05-13 2005-11-17 International Business Machines Corporation System and method for user rank search
WO2006017364A1 (en) * 2004-07-13 2006-02-16 Google, Inc. Personalization of placed content ordering in search results
US7028027B1 (en) * 2002-09-17 2006-04-11 Yahoo! Inc. Associating documents with classifications and ranking documents based on classification weights

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
RU2709647C1 (en) * 2016-04-14 2019-12-19 Шанхай Яму Коммюникейшн Текнолоджи Ко., Лтд Method of associating a domain name with a characteristic of visiting a website
RU2709647C9 (en) * 2016-04-14 2020-04-02 Шанхай Яму Коммюникейшн Текнолоджи Ко., Лтд Method of associating a domain name with a characteristic of visiting a website

Also Published As

Publication number Publication date
CN101079064A (en) 2007-11-28
CN101079064B (en) 2011-11-30

Similar Documents

Publication Publication Date Title
WO2009000174A1 (en) Method and device of web page rank
Davison Recognizing nepotistic links on the web
US6560600B1 (en) Method and apparatus for ranking Web page search results
Wu et al. Identifying link farm spam pages
US7383299B1 (en) System and method for providing service for searching web site addresses
US9443022B2 (en) Method, system, and graphical user interface for providing personalized recommendations of popular search queries
US10025855B2 (en) Federated community search
US20050060290A1 (en) Automatic query routing and rank configuration for search queries in an information retrieval system
US20070233808A1 (en) Propagating useful information among related web pages, such as web pages of a website
US20060095430A1 (en) Web page ranking with hierarchical considerations
CN111708740A (en) Mass search query log calculation analysis system based on cloud platform
US20070067304A1 (en) Search using changes in prevalence of content items on the web
US20060282413A1 (en) System and method for a search engine using reading grade level analysis
KR20070098521A (en) A system and method for prioritizing websites during a webcrawling process
WO2005031614A1 (en) Systems and methods for clustering search results
WO2007051397A1 (en) An information retrieval system and information retrieval method
Baeza-Yates Web usage mining in search engines
WO2005111787A2 (en) A method for indexing and searching geocoded pages of a web site
US20150186385A1 (en) Method, System, and Graphical User Interface For Improved Search Result Displays Via User-Specified Annotations
US20070094250A1 (en) Using matrix representations of search engine operations to make inferences about documents in a search engine corpus
US20100125781A1 (en) Page generation by keyword
US9275145B2 (en) Electronic document retrieval system with links to external documents
WO2007042840A1 (en) Search using changes in prevalence of content items on the web
KR20060096356A (en) Server, method and system for providing information search service by using sheaf of pages
WO1997049048A1 (en) Hypertext document retrieval system and method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08715344

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 446/CHENP/2010

Country of ref document: IN

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.04.10)

122 Ep: pct application non-entry in european phase

Ref document number: 08715344

Country of ref document: EP

Kind code of ref document: A1