Web Page Sorting Method and Device
Technical Field
The present invention relates to the field of computer applications, and in particular to a web page sorting method and device.
Background of the Invention
Search engines are a fiercely competitive field. Besides rich content, search engines compete on user experience. In general, the problem search engines now face is not too little information but too much: a search for a single keyword often returns thousands of results.
In practice, users of a search engine want the information they are looking for to appear on the first page, or even within the first five uniform resource locators (URLs), so ranking has become a key factor in search engine quality. The well-known search engine Google became the world's leading search engine in a short time because the pagerank technology it invented addresses the ranking problem effectively.
Today, however, network companies understand the pagerank technology and most have adopted it. In fact, no large search engine now ranks its results with a single algorithm; the ranking combines dozens or even hundreds of factors. Besides pagerank, commonly used algorithms include the hits algorithm (a hyperlink-based retrieval algorithm) and the Hilltop algorithm (a search engine ranking algorithm suited to broad categories).
In the prior art, when the results of a user's search are sorted, web pages are scored directly by the number of user clicks.
Summary of the Invention
An embodiment of the present invention is implemented as a web page sorting method, the method comprising:
each web page corresponds to at least one content-related web page category, and each web page corresponds to one web page category vector, the web page category vector containing at least one element, the at least one element respectively representing the weight of each of the at least one web page category corresponding to the web page;
determining, according to the web pages a user has visited, the web page category the user uses most;
when a first user clicks a web page, determining, in the web page category vector corresponding to the clicked web page, the element corresponding to the web page category the first user uses most, and increasing the value of the determined element;
when a second user searches for web pages, determining the web page category corresponding to the search content, and sorting the at least one retrieved web page according to the values of the elements corresponding to the determined web page category in the respective web page category vectors.
Another embodiment of the present invention provides a web page sorting device, the device comprising:
a first module, configured to determine, according to the web pages a user has visited, the web page category the user uses most, and to determine the web page category vector corresponding to a web page; wherein each web page corresponds to at least one content-related web page category, and each web page corresponds to one web page category vector, the web page category vector containing at least one element, the at least one element respectively representing the weight of each of the at least one web page category corresponding to the web page;
a second module, configured to, when a first user clicks a web page, determine, in the web page category vector determined by the first module for the clicked web page, the element corresponding to the web page category the first user uses most, and increase the value of the determined element;
a third module, configured to, when a second user searches for web pages, determine the web page category corresponding to the search content, and sort the at least one retrieved web page according to the values of the elements corresponding to the determined web page category in the web page category vectors from the second module.
Brief Description of the Drawings
FIG. 1 is a flowchart of the web page sorting method provided by an embodiment of the present invention;
FIG. 2 is a structural diagram of the search engine in an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of the web page sorting device provided by an embodiment of the present invention.
Mode for Carrying Out the Invention
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
In the embodiments of the present invention, users are assigned to user expert categories according to the Internet Protocol (IP) logs of their visits, the value of the web page category vector corresponding to a web page is increased when a user clicks that page, and, when a user retrieves information, the user's search results are sorted according to the web page category vectors.
An embodiment of the present invention provides a web page sorting method. In this method, each web page corresponds to at least one content-related web page category, and each web page corresponds to one web page category vector. The web page category vector contains at least one element, and the elements respectively represent the weight of each of the web page categories corresponding to the web page. The web page category vector is an n-dimensional vector, where n equals the number of web page categories. Note that the web page category vector can be implemented as an array of n elements.
First, the web page category a user uses most, that is, the user's expert category, is determined according to the web pages the user has visited. The most-used category can be determined from the user's behavior. For example, the IP logs of the user's visits can be classified, and the most-used web page category determined from the IP category the user visits most; alternatively, the search terms the user has entered can be classified, and the most-used web page category determined from the category to which the user's most frequent search terms belong. Other ways of determining a user's most-used web page category from the user's behavior also exist; they are well known to those skilled in the art and are not listed one by one here.
After the web page category each user uses most has been determined, when a first user clicks a web page, the element corresponding to the first user's most-used web page category is located in the web page category vector of the clicked page, and the value of that element is increased. Specifically, the value of the element may be increased by 1. The operation is repeated whenever another user clicks a web page.
When a second user searches for web pages, the web page category corresponding to the search content entered by the second user is determined first, and the at least one retrieved web page is then sorted according to the values of the elements corresponding to that category in the respective web page category vectors.
FIG. 1 shows the flow of the web page sorting method provided by an embodiment of the present invention, described in detail below.
In step S101, the web page category vectors established by the user are stored.
Here, a vector is a one-dimensional matrix that can hold an item's score for every element of a set. In the embodiments of the present invention, a vector is assigned to each web page to hold that page's value for every category in the category set. For example, if the category set is {"sports", "news"}, the page's vector holds the page's score for "sports" and its score for "news", and both scores can be read by accessing the vector. In practice, the category set contains on the order of hundreds of categories, so each web page vector holds the page's score for each of those hundreds of categories.
Every web page uses an n-dimensional vector called the web page category vector. The dimension n of the vector equals the number of categories in the web page category set A. The vector gives the weight of the web page in each category, that is, the proportion of the page that falls in each category. Because a web page does not necessarily belong to a single category, a vector can express the page's weight in every category: the weight for each category is represented by one element, and the array formed by the elements for all categories constitutes the vector. In the prior art, most websites are able to establish a category set A from the content of current Internet web pages, for example history, military, tourism, humanities, automobiles, and so on.
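As a minimal sketch of the data structure described above, a web page category vector can be held as a plain list indexed by category position. The five-category set, the page identifier, and the helper names `new_category_vector` and `weight` are illustrative assumptions and do not appear in the original.

```python
# Hypothetical category set A; the text says real sets hold hundreds of categories.
CATEGORIES = ["history", "military", "tourism", "humanities", "automobiles"]
CATEGORY_INDEX = {name: i for i, name in enumerate(CATEGORIES)}


def new_category_vector():
    """Return an n-dimensional vector of zero weights, n = len(CATEGORIES)."""
    return [0] * len(CATEGORIES)


def weight(vector, category):
    """Read the weight that a page's vector holds for one category."""
    return vector[CATEGORY_INDEX[category]]


# One vector per page, keyed here by a made-up page identifier.
page_vectors = {"page-1": new_category_vector()}
page_vectors["page-1"][CATEGORY_INDEX["tourism"]] = 3
print(weight(page_vectors["page-1"], "tourism"))  # prints 3
```

A list keeps the vector dense, which matches the array implementation mentioned in the text; a sparse mapping would be another reasonable choice when most weights are zero.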
In step S102, the IP logs of the user's visits are classified, and the user's expert category is determined from the IP category the user visits most.
The IP logs of the user's visits are obtained as follows. FIG. 2 shows the typical structure of a search engine, which includes a crawler, an indexer, a retriever, and so on. The crawler's main tasks are to assign uniform resource locator identifiers (URLIDs) to web pages and to download web pages. The crawler assigns each web page on the Internet a unique identifier, the URLID, to distinguish different pages. Each URLID corresponds to a structure that includes the text content of the web page, additional attributes of the web page, and so on.
The crawler downloads web pages from the Internet, assigns each a unique URLID, and stores them in the original database. The indexer reads the web page information from the original database, builds an index, and stores it in the index database.
When a user enters search information to retrieve information, the retriever receives the user's input, obtains the matching records from the index database, sorts them, and returns them to the user; at the same time, it records the user's operations in the user behavior log.
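The division of work among the crawler, indexer, and retriever can be sketched with in-memory stand-ins for the three stores. This is an illustration under stated assumptions, not the search engine of FIG. 2: the dictionaries, the whitespace tokenizer, the example URLs, and the function names are all hypothetical, and ranking is left out.

```python
from collections import defaultdict
from itertools import count

raw_db = {}                  # URLID -> page record (stand-in for the original database)
index_db = defaultdict(set)  # term -> set of URLIDs (stand-in for the index database)
behavior_log = []            # (user, query) entries (stand-in for the user behavior log)
_next_id = count(1)


def crawl(url, text):
    """Crawler: assign a unique URLID and store the page in the original database."""
    url_id = next(_next_id)
    raw_db[url_id] = {"url": url, "text": text}
    return url_id


def build_index():
    """Indexer: read pages from the original database and index their terms."""
    for url_id, page in raw_db.items():
        for term in page["text"].lower().split():
            index_db[term].add(url_id)


def retrieve(user, query):
    """Retriever: log the query, then return matching URLIDs in URLID order."""
    behavior_log.append((user, query))
    return sorted(index_db.get(query.lower(), set()))


first = crawl("http://example.com/a", "ThinkPad T43 review")
second = crawl("http://example.com/b", "football news")
build_index()
print(retrieve("user-1", "T43"))  # prints [1]
```

The behavior log written by `retrieve` is the kind of record from which a user's expert category can later be derived.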
The algorithm used to determine the users' expert categories is as follows.
Define the expert array UserType[], where UserType[i] denotes the expert category of the i-th user.
For (each user i)
    Define the category counter array TypeCounter[]
    Read all historical search records of user i
    For (each search search[j] of user i)
        Classify search[j] to obtain category ID = a
        TypeCounter[a] = TypeCounter[a] + 1, that is, add 1 to this category for the user
    UserType[i] = the category with the largest count in TypeCounter
Return the expert categories UserType[].
Here, a user's expert category represents the web page category the user uses most.
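The pseudocode above can be sketched in Python as follows. The keyword-lookup classifier `classify_query` and the sample search histories are assumptions made only for illustration; the original does not specify how a search string is classified.

```python
from collections import Counter


def classify_query(query):
    """Toy classifier (an assumption): map a search string to a category ID."""
    keywords = {"t43": "computer", "laptop": "computer", "football": "sports"}
    for word in query.lower().split():
        if word in keywords:
            return keywords[word]
    return "other"


def expert_categories(search_history):
    """Return {user: most frequent category among that user's past searches}."""
    user_type = {}
    for user, searches in search_history.items():
        type_counter = Counter(classify_query(s) for s in searches)
        if type_counter:  # users with no history get no expert category
            user_type[user] = type_counter.most_common(1)[0][0]
    return user_type


history = {
    "alice": ["T43 battery", "laptop fan", "football scores"],
    "bob": ["football transfer"],
}
print(expert_categories(history))  # prints {'alice': 'computer', 'bob': 'sports'}
```

When two categories tie, `Counter.most_common` returns the one counted first; the original does not say how ties should be broken, so this is a side effect of the sketch rather than a design decision.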
For example, when a user enters the search string "T43", the search engine classifies the string and obtains the category "computer". When the search engine sorts the search results, it takes the web page category vectors into account and places the web pages with a larger "computer" weight first.
In step S103, when the user clicks a web page in the search engine's results, the value of the web page category vector corresponding to that page is increased according to the user's determined expert category.
For example, after searching with the search engine, a user clicks a web page. If the user is an expert in one of the categories of the web page category vector, the page's weight for that category is increased in the corresponding vector element. That is, for a web page the user clicks, the corresponding value of the page's category vector is increased according to the user's expert category, which raises the weight of that element.
In a specific implementation, the algorithm used to increase the value of the web page category vector of a clicked page according to the user's expert category is as follows.
IF (the user clicks a web page)
    Determine the user's expert category
    IF (the user belongs to expert category a, a ∈ A)
        Increment by 1 the element for category a in the web page's category vector
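A minimal Python sketch of the click-scoring step above follows. The three-category set, the user names, and the function name `record_click` are hypothetical, and the expert categories are assumed to have been computed beforehand.

```python
CATEGORIES = ["computer", "sports", "news"]  # hypothetical category set A


def record_click(page_vectors, user_type, user, page):
    """Add 1 to the clicked page's element for the clicking user's expert category."""
    category = user_type.get(user)
    if category not in CATEGORIES:
        return  # user has no expert category in A: the click adds nothing
    vector = page_vectors.setdefault(page, [0] * len(CATEGORIES))
    vector[CATEGORIES.index(category)] += 1


page_vectors = {}
user_type = {"alice": "computer", "bob": "sports"}
record_click(page_vectors, user_type, "alice", "page-1")
record_click(page_vectors, user_type, "alice", "page-1")
record_click(page_vectors, user_type, "bob", "page-1")
record_click(page_vectors, user_type, "carol", "page-1")  # unknown user, ignored
print(page_vectors["page-1"])  # prints [2, 1, 0]
```

Each click changes only the element for the clicker's own expert category, so the "news" element of the page stays at zero however often these users click.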
In step S104, when a user searches through the search engine, the user's search results are optimized and sorted with reference to the scores in the web page category vectors.
The algorithm used in this step is as follows.
IF (the user searches for the term "KKK")
    Classify "KKK" to obtain its category a, a ∈ A
    The search engine calls the retriever to obtain the search results.
    Pre-sort the search results; in this embodiment of the present invention, the pagerank technique is used for the pre-sort.
    For (each search result web page c)
        Look up the web page category vector of page c and read the page's expert recommendation value for category a
        Adjust the position of page c according to its expert recommendation value, moving pages with larger values forward.
    Return the sorted set of web pages and display the sorted results.
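The re-ranking step above can be sketched in Python as follows. The original does not state how the expert recommendation value is combined with the pre-sort, so this sketch assumes the simplest reading: a stable sort on the expert value in descending order, so that pages with equal values keep their pre-sorted order. The function name `rerank` and the sample data are hypothetical.

```python
def rerank(presorted_pages, page_vectors, category_index):
    """Move pages with a larger expert recommendation value for the query's
    category forward; ties keep their pre-sorted (e.g. pagerank) order."""

    def expert_value(page):
        vector = page_vectors.get(page)
        return vector[category_index] if vector else 0

    # sorted() is stable, also with reverse=True, so equal values
    # preserve the order produced by the pre-sort.
    return sorted(presorted_pages, key=expert_value, reverse=True)


presorted = ["p1", "p2", "p3", "p4"]  # order from the pre-sort
vectors = {"p1": [0, 5], "p2": [3, 0], "p3": [7, 1]}  # p4 has no vector yet
print(rerank(presorted, vectors, 0))  # prints ['p3', 'p2', 'p1', 'p4']
```

A weighted blend of the pre-sort score and the expert value would be another plausible reading of "adjust"; the stable sort is used here only because it needs no extra parameters.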
FIG. 3 shows the structure of the web page sorting device provided by an embodiment of the present invention.
The web page category vector storage module 11 stores the web page category vectors established by the user, where each web page category vector identifies the weights of its web page over the web page category set. Each web page corresponds to at least one content-related web page category, and each web page corresponds to one web page category vector; the web page category vector contains at least one element, and the elements respectively represent the weight of each of the web page categories corresponding to the web page.
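The user expert category determining module 12 classifies the IP logs of the user's visits and determines the user's expert category from the IP category the user visits most. When the user clicks a web page in the search engine's results, the web page category vector scoring module 13 increases the value of the web page category vector corresponding to that page according to the user's expert category as determined by the user expert category determining module 12. The specific process has been described above and is not repeated here.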
When a user enters search terms through the search engine to retrieve information, the web page optimization sorting module 14 optimizes and sorts the retrieved web pages with reference to their web page category vectors. The web page display module 15 displays the optimized, sorted web pages.
In the embodiments of the present invention, users are assigned to user expert categories according to the IP logs of their visits, the value of the web page category vector corresponding to a web page is increased when a user clicks that page, and, when a user retrieves information, the user's search results are sorted according to the web page category vectors. This solves the problem in the prior art that scoring web pages directly by the number of user clicks invites malicious clicking and leads to indiscriminate score increases and recommendations.
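One way to read this claim is that, under the scheme as described, clicks from a user whose expert category differs from the query's category leave that query's ordering unchanged. The following sketch illustrates this reading with a toy in-memory class; the class `PageSorter`, the two-category set, and the user and page names are hypothetical, and the sketch does not cover a malicious user whose expert category matches the query.

```python
class PageSorter:
    """Toy in-memory stand-in for the storage, scoring, and sorting modules."""

    def __init__(self, categories, user_type):
        self.categories = list(categories)
        self.user_type = dict(user_type)  # user -> expert category
        self.vectors = {}                 # page -> web page category vector

    def click(self, user, page):
        """Increment the page's element for the clicking user's expert category."""
        category = self.user_type.get(user)
        if category in self.categories:
            vec = self.vectors.setdefault(page, [0] * len(self.categories))
            vec[self.categories.index(category)] += 1

    def sort(self, pages, query_category):
        """Order pages by their weight for the query's category, largest first."""
        i = self.categories.index(query_category)
        zero = [0] * len(self.categories)
        return sorted(pages, key=lambda p: self.vectors.get(p, zero)[i], reverse=True)


sorter = PageSorter(["computer", "sports"], {"expert": "computer", "spammer": "sports"})
sorter.click("expert", "good-page")
for _ in range(1000):
    sorter.click("spammer", "spam-page")
print(sorter.sort(["spam-page", "good-page"], "computer"))  # prints ['good-page', 'spam-page']
```

The thousand clicks raise only the "sports" element of the spam page, so they affect searches classified as "sports" but not the "computer" query shown here.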
The above describes only preferred embodiments of the present invention and is not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.