CN109740062B

CN109740062B - Search task clustering method based on learning output

Info

Publication number: CN109740062B
Application number: CN201910006059.4A
Authority: CN
Inventors: 张引; 祝孟莨; 徐瑞康; 孙铭真; 赵玉丽; 张斌; 高克宁
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2019-01-04
Filing date: 2019-01-04
Publication date: 2020-10-16
Anticipated expiration: 2039-01-04
Also published as: CN109740062A

Abstract

The invention provides a search task clustering method based on learning output and belongs to the field of search engines. The method adopts a search task clustering method based on a Bayesian rose tree and, during clustering, uses a query similarity measure based on learning output to cluster the search tasks. The invention remedies the defect that conventional search task clustering methods attend only to the queries and clicks of the search process and ignore learning output; by taking learning output into account during clustering, it improves the clustering of search tasks.

Description

Search task clustering method based on learning output

Technical Field

The invention belongs to the field of search engines, and particularly relates to a search task clustering method based on learning output.

Background

As society grows more complex, people face increasingly complex problems in work and life. Search engines are among the most common tools people use to solve everyday problems. As people increasingly rely on search engines to solve the complex problems they encounter, researchers have begun to study new search technologies that support this.

One way to help people solve complex problems with search engines is to cluster the queries in a search log that belong to the same search task, thereby identifying the search task. A method that clusters the queries of search tasks and identifies the search tasks is called a search task clustering method. Most existing search task clustering methods adopt a search theory based on the literature paradigm, which is concerned only with the queries and clicks made during a search. However, when solving a complex problem, people typically learn some knowledge from the search results, form a learning output (such as knowledge memorized in the brain, recorded notes, or written programs), and only then perform the next search. Because search task clustering methods that adopt the literature paradigm attend only to the queries and clicks in a search task and ignore learning output, their clustering results are unsatisfactory. In contrast, search theories based on the learning paradigm are concerned not only with the queries and clicks made during the search process but also with the learning output produced during it. To address this defect of the existing literature-paradigm search task clustering methods, the invention provides a search task clustering method based on learning output. It adopts the Bayesian rose tree based search task clustering method proposed by Mehrotra and Yilmaz at the SIGIR 2017 conference and improves the clustering of search tasks by taking learning output into account during clustering.

Disclosure of Invention

The invention provides a search task clustering method based on learning outcome, which adopts a search task clustering method based on a Bayesian rose tree and adopts a query similarity measurement method based on learning outcome in the clustering process to realize clustering of search tasks.

The invention is realized by the following technical scheme:

a search task clustering method based on learning output comprises the following steps:

step 1: for a given search task, determining the user session identifier, query submission time, query term set, click result address set and learning output set of each query in the search task, wherein each query is represented as a quintuple consisting of the user session identifier, the query submission time, the query term set, the click result address set and the learning output set; the search task refers to an ordered set of query information generated by a user during a search process, each element of the set being one piece of query information; the query information refers to a quintuple consisting of a user session identifier, a query submission time, a query term set, a click result address set and a learning output set; the user session identifier refers to a pair formed by the user identifier and the session identifier to which the query information belongs; the user identifier refers to an identifier that uniquely distinguishes different users; the session identifier refers to an identifier that uniquely distinguishes different sessions; the query submission time refers to the time at which the query term set of the query information is submitted to the search engine; the query term set refers to an ordered set formed by the query terms that the user submits to the search engine at one time; the click result address set refers to an ordered set formed by the addresses of the search results that the user clicks on the search result list page obtained after submitting the query term set to the search engine;

step 2: determining the constituent symbols that make up the learning output, and collecting them to obtain the constituent symbol set of the learning output; for example, for a program written in an object-oriented programming language, the constituent symbols may be the programming interface classes provided by the programming language, and the resulting constituent symbol set may be the set of all the programming interface classes; the learning output set refers to an ordered set of the learning outputs that the user constructs after submitting the query term set to the search engine and obtaining the search result list page, and before submitting the next query term set to the search engine or completing the search task; the learning output refers to a result that the user constructs from learned knowledge during the search process, such as a written paper or a written program; the learning output is an ordered collection of constituent symbols in which symbols may repeat;

step 3: based on the constituent symbol set of the learning output, counting the number of occurrences of each constituent symbol in the learning output, and vectorizing the learning output into a constituent symbol vector; for example, a program written in an object-oriented programming language can be vectorized into a vector over the programming interface classes, in which the value of each entry is the number of times the corresponding programming interface class appears in the learning output; the constituent symbols refer to the symbols that make up a learning output; based on a constituent symbol set, a learning output can be vectorized into a constituent symbol vector; the constituent symbol set refers to an ordered set formed by a group of constituent symbols; the constituent symbol vector refers to a vector with the same length as the constituent symbol set; the value of each entry of the constituent symbol vector is a non-negative integer giving the number of times that the constituent symbol at the same position in the constituent symbol set occurs in the learning output corresponding to the constituent symbol vector;

step 4: performing learning-output-based clustering of the query information with the Bayesian rose tree search task clustering algorithm, with the following specific steps:

step 4.1: establishing a tree based on each piece of query information, wherein all the trees form a forest;

step 4.2: recursively merging the forest into a single tree with the Bayesian rose tree search task clustering algorithm, with the following steps:

step 4.2.1: calculating the marginal likelihood of a tree containing a set of query information by:

step 4.2.1.1: calculating the learning-output-based query similarity of a pair of pieces of query information from their learning outputs, with the following steps:

step 4.2.1.1.1: based on the constituent symbol vector of the first learning output in the learning output set of each piece of query information, calculating the first-learning-output Euclidean distance of a pair of queries using the Euclidean distance calculation method;

step 4.2.1.1.2: based on the constituent symbol vector of the first learning output in the learning output set of each piece of query information, calculating the first-learning-output Hamming distance of a pair of queries using the Hamming distance calculation method;

step 4.2.1.1.3: summing, as vectors, the constituent symbol vectors of all learning outputs in the learning output set of each piece of query information, and calculating the learning-output-set Euclidean distance of a pair of queries using the Euclidean distance calculation method;

step 4.2.1.1.4: summing, as vectors, the constituent symbol vectors of all learning outputs in the learning output set of each piece of query information, and calculating the learning-output-set Hamming distance of a pair of queries using the Hamming distance calculation method;

step 4.2.1.1.5: calculating the arithmetic mean of the first-learning-output Euclidean distance, the first-learning-output Hamming distance, the learning-output-set Euclidean distance and the learning-output-set Hamming distance as the learning-output-based query similarity of the pair of queries;
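
For reference, steps 4.2.1.1.1 to 4.2.1.1.5 can be written out as follows. This is a sketch under stated assumptions rather than the patent's own notation: the symbols u, v, n, d_E, d_H, f and s are introduced here for illustration, the Euclidean (Euler) distance is given its usual definition, and the Hamming distance of two integer-valued vectors is assumed to be the number of positions at which their entries differ. Here f_i is the constituent symbol vector of the first learning output of query i, and s_i is the vector sum of all constituent symbol vectors of query i.

```latex
d_E(u, v) = \sqrt{\sum_{k=1}^{n} (u_k - v_k)^2},
\qquad
d_H(u, v) = \bigl|\{\, k : u_k \neq v_k \,\}\bigr|

\mathrm{sim}(q_i, q_j) = \tfrac{1}{4}\bigl(
    d_E(f_i, f_j) + d_H(f_i, f_j) + d_E(s_i, s_j) + d_H(s_i, s_j)
\bigr)
```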

step 4.2.1.2: using the Bayesian rose tree search task clustering algorithm, with the learning-output-based query similarity as the similarity metric, calculating the marginal likelihood of a tree containing a set of query information;

step 4.2.2: recursively merging the forest into a single tree with the Bayesian rose tree search task clustering algorithm, according to the marginal likelihood of the tree containing a set of query information;

step 5: outputting the Bayesian rose tree structure obtained by clustering, which is the clustering result of the given search task.
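
Steps 1 to 3 above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `QueryInfo`, `build_symbol_set` and `vectorize`, the field names, and the choice of plain strings for constituent symbols are all assumptions made here.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Tuple


@dataclass
class QueryInfo:
    """One piece of query information: the quintuple of step 1 (hypothetical field names)."""
    user_session_id: Tuple[str, str]   # (user identifier, session identifier)
    submitted_at: datetime             # query submission time
    query_terms: List[str]             # ordered query term set
    clicked_urls: List[str]            # ordered click result address set
    # Learning output set: each learning output is an ordered sequence of
    # constituent symbols in which symbols may repeat.
    learning_outputs: List[List[str]] = field(default_factory=list)


def build_symbol_set(outputs: List[List[str]]) -> List[str]:
    """Step 2: collect the distinct constituent symbols in first-seen order."""
    seen = {}
    for output in outputs:
        for symbol in output:
            seen.setdefault(symbol, None)  # dicts keep insertion order
    return list(seen)


def vectorize(output: List[str], symbol_set: List[str]) -> List[int]:
    """Step 3: count how often each constituent symbol occurs in one learning output."""
    counts = Counter(output)
    # Symbols that are not in symbol_set are ignored in this sketch.
    return [counts[symbol] for symbol in symbol_set]


# Usage with made-up data: two small "programs" described by the classes they use.
query = QueryInfo(
    user_session_id=("u1", "s1"),
    submitted_at=datetime(2019, 1, 4, 9, 0),
    query_terms=["java", "read", "file"],
    clicked_urls=["https://example.com/a"],
    learning_outputs=[["File", "Scanner", "File"], ["Scanner"]],
)
symbols = build_symbol_set(query.learning_outputs)
vectors = [vectorize(lo, symbols) for lo in query.learning_outputs]
print(symbols)   # ['File', 'Scanner']
print(vectors)   # [[2, 1], [0, 1]]
```

In practice the constituent symbol set would be built over the learning outputs of all queries in the search task, so that every constituent symbol vector has the same length and the same position-to-symbol mapping.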

The invention adopts a search task clustering method based on a Bayesian rose tree and, during clustering, a query similarity measure based on learning output, to cluster search tasks. The invention remedies the defect that conventional search task clustering methods attend only to the queries and clicks of the search process and ignore learning output: it provides a search task clustering method based on learning output and improves the clustering of search tasks by taking learning output into account during clustering.

Drawings

FIG. 1 is a graph of a search task clustering process based on learning outcome.

Detailed Description

To address the problem of clustering search tasks based on learning output, the invention is described in detail below with reference to FIG. 1. The specific implementation steps are as follows:

Step 1: for a given search task, determining the user session identifier, query submission time, query term set, click result address set and learning output set of each query in the search task, wherein each query is represented as a quintuple consisting of the user session identifier, the query submission time, the query term set, the click result address set and the learning output set.

Step 2: determining the constituent symbols that make up the learning output, and collecting them to obtain the constituent symbol set of the learning output, $C = \{c_1, c_2, c_3, \ldots, c_i\}$.

Step 3: based on the constituent symbol set $C$ of the learning output, counting the number of occurrences of each constituent symbol in the learning output $LO_j$, and vectorizing the learning output into a constituent symbol vector.

Step 4.1: building a tree $T_i = \{d_i\}$ for each piece of query information $d_i$; all the trees form a forest $F = \{T_1, T_2, T_3, \ldots, T_n\}$.

Step 4.2.1: computing the marginal likelihood of a tree containing a set of query information. The method comprises the following steps:

Step 4.2.1.1: calculating the learning-output-based query similarity of a pair of pieces of query information $q_i$ and $q_j$, with the following steps:

Step 4.2.1.1.1: based on the constituent symbol vector $V_f$ of the first learning output in the learning output set of each piece of query information, calculating the first-learning-output Euclidean distance $h_2$ of the pair of queries using the Euclidean distance calculation method.

Step 4.2.1.1.2: based on the constituent symbol vector $V_f$ of the first learning output in the learning output set of each piece of query information, calculating the first-learning-output Hamming distance $a_2$ of the pair of queries using the Hamming distance calculation method.

Step 4.2.1.1.3: summing, as vectors, the constituent symbol vectors of all learning outputs in the learning output set of each piece of query information to obtain $V_s$, and calculating the learning-output-set Euclidean distance $h_1$ of the pair of queries using the Euclidean distance calculation method.

Step 4.2.1.1.4: summing, as vectors, the constituent symbol vectors of all learning outputs in the learning output set of each piece of query information to obtain $V_s$, and calculating the learning-output-set Hamming distance $a_1$ of the pair of queries using the Hamming distance calculation method.

Step 4.2.1.1.5: calculating the arithmetic mean of the first-learning-output Euclidean distance, the first-learning-output Hamming distance, the learning-output-set Euclidean distance and the learning-output-set Hamming distance, $(h_1 + h_2 + a_1 + a_2)/4$, as the learning-output-based query similarity of the pair of queries.
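
Steps 4.2.1.1.1 to 4.2.1.1.5 can be sketched in Python as follows. This is an illustration under assumptions, not the patented implementation: the function names are hypothetical, each query is assumed to have at least one learning output, all constituent symbol vectors are assumed to share one constituent symbol set, the Euclidean (Euler) distance takes its usual definition, and the Hamming distance of two count vectors is assumed to be the number of positions at which they differ.

```python
import math
from typing import List, Sequence


def euclidean(u: Sequence[int], v: Sequence[int]) -> float:
    """Usual Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Assumed Hamming distance: number of positions whose entries differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def sum_vectors(vectors: List[Sequence[int]]) -> List[int]:
    """Position-wise sum of all constituent symbol vectors of one query (V_s)."""
    return [sum(column) for column in zip(*vectors)]


def learning_output_similarity(vectors_i: List[Sequence[int]],
                               vectors_j: List[Sequence[int]]) -> float:
    """Arithmetic mean of the four distances described in the steps above."""
    first_i, first_j = vectors_i[0], vectors_j[0]                   # V_f of each query
    total_i, total_j = sum_vectors(vectors_i), sum_vectors(vectors_j)  # V_s of each query
    h2 = euclidean(first_i, first_j)   # first-learning-output Euclidean distance
    a2 = hamming(first_i, first_j)     # first-learning-output Hamming distance
    h1 = euclidean(total_i, total_j)   # learning-output-set Euclidean distance
    a1 = hamming(total_i, total_j)     # learning-output-set Hamming distance
    return (h1 + h2 + a1 + a2) / 4


# Usage with made-up count vectors over a three-symbol constituent symbol set.
vectors_qi = [[2, 1, 0], [0, 1, 0]]
vectors_qj = [[2, 0, 0], [1, 0, 3]]
print(learning_output_similarity(vectors_qi, vectors_qj))
```

For these inputs the first learning outputs differ in one position (both distances equal 1), and the summed vectors [2, 2, 0] and [3, 0, 3] differ in all three positions, so the result is (sqrt(14) + 1 + 3 + 1) / 4.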

Step 4.2.1.2: using the Bayesian rose tree search task clustering algorithm, with the learning-output-based query similarity $r_0$ as the similarity metric, computing the marginal likelihood of a tree containing a set of query information.

Step 4.2.2: recursively merging the forest into a single tree with the Bayesian rose tree search task clustering algorithm, according to the marginal likelihood of the tree containing a set of query information.
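
The control flow of steps 4.1 and 4.2 (start from a forest of single-query trees and merge repeatedly until one tree remains) can be sketched as below. This is deliberately a simplified stand-in, not the Bayesian rose tree algorithm: it scores a candidate merge by the average pairwise distance between the leaves of two trees instead of a marginal likelihood, and it only performs binary merges, whereas Bayesian rose trees allow nodes with more than two children. All names are hypothetical, and at least one query is assumed.

```python
from itertools import combinations
from typing import Callable, List, Union

Tree = Union[int, tuple]  # leaf: index of a query; internal node: tuple of subtrees


def leaves(tree: Tree) -> List[int]:
    """Return the query indices stored under a tree."""
    if isinstance(tree, int):
        return [tree]
    return [leaf for child in tree for leaf in leaves(child)]


def average_distance(a: Tree, b: Tree, distance: Callable[[int, int], float]) -> float:
    """Stand-in merge score: mean pairwise distance between the leaves of a and b."""
    pairs = [(i, j) for i in leaves(a) for j in leaves(b)]
    return sum(distance(i, j) for i, j in pairs) / len(pairs)


def merge_forest(n_queries: int, distance: Callable[[int, int], float]) -> Tree:
    """Greedy merging: repeatedly join the two closest trees until one tree remains."""
    forest: List[Tree] = list(range(n_queries))  # one single-leaf tree per query
    while len(forest) > 1:
        a, b = min(combinations(forest, 2),
                   key=lambda pair: average_distance(pair[0], pair[1], distance))
        forest.remove(a)
        forest.remove(b)
        forest.append((a, b))  # binary merge only in this sketch
    return forest[0]


# Usage with a made-up symmetric distance table over three queries.
table = {(0, 1): 1.0, (0, 2): 5.0, (1, 2): 4.0}
dist = lambda i, j: table[(min(i, j), max(i, j))]
print(merge_forest(3, dist))   # (2, (0, 1))
```

Replacing `average_distance` with a likelihood-based score and extending the merge step beyond binary joins is where a real Bayesian rose tree implementation would differ from this sketch.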

Claims

1. A search task clustering method based on learning outcome adopts a search task clustering method based on a Bayesian rose tree, and adopts a query similarity measurement method based on learning outcome in the clustering process to realize clustering of search tasks; the method is characterized by comprising the following specific steps:

step 2: determining the constituent symbols forming the learning output, and counting the constituent symbols forming the learning output to obtain a constituent symbol set forming the learning output; for a program written in an object-oriented programming language, the constituent symbols of the program are the programming interface classes provided by the programming language, and the obtained constituent symbol set is the set of all the programming interface classes; the learning output set refers to an ordered set of learning outputs constructed by a user after submitting the query term set to a search engine and obtaining a search result list page and before submitting the next query term set to the search engine or completing the search task; the learning output refers to a result constructed by the user from learned knowledge during the search process; the learning output is an ordered, repeatable set of constituent symbols;

step 3: counting the number of occurrences of each constituent symbol in the learning output based on the constituent symbol set of the learning output, and vectorizing the learning output into a constituent symbol vector; a program written in an object-oriented programming language is vectorized into a vector based on the programming interface classes, wherein the value of each entry in the vector represents the number of occurrences, in the learning output, of the programming interface class corresponding to that entry; the constituent symbols refer to the symbols forming a learning output; a learning output is vectorized into a constituent symbol vector based on a constituent symbol set; the constituent symbol set refers to an ordered set formed by a group of constituent symbols; the constituent symbol vector refers to a vector with the same length as the constituent symbol set; the value of each entry of the constituent symbol vector is a non-negative integer representing the number of occurrences, in the learning output corresponding to the constituent symbol vector, of the constituent symbol at the same position in the constituent symbol set;