CN109740062B - Search task clustering method based on learning output - Google Patents

Publication number
CN109740062B
Authority
CN
China
Prior art keywords
query
learning output
learning
search
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910006059.4A
Other languages
Chinese (zh)
Other versions
CN109740062A (en)
Inventor
张引
祝孟莨
徐瑞康
孙铭真
赵玉丽
张斌
高克宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201910006059.4A
Publication of CN109740062A
Application granted
Publication of CN109740062B
Expired - Fee Related
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a search task clustering method based on learning outcome and belongs to the field of search engines. The method adopts a search task clustering method based on a Bayesian rose tree and uses a learning-outcome-based query similarity measure during clustering, thereby clustering search tasks. The invention remedies the shortcoming of conventional search task clustering methods, which attend only to the queries and clicks in the search process and ignore learning outcomes, and improves the clustering of search tasks by taking learning outcomes into account during clustering.

Description

Search task clustering method based on learning output
Technical Field
The invention belongs to the field of search engines, and particularly relates to a search task clustering method based on learning output.
Background
As society as a whole grows more complex, people face increasingly complex problems in work and life. Search engines are among the most common tools people use to solve everyday problems. As people increasingly use search engines to solve the complex problems they encounter in work and life, researchers have begun to study how to develop new search technologies that help people solve such problems.
One way to help people solve complex problems with search engines is to cluster the queries in search logs that belong to the same search task, and thereby identify the search task. A method that clusters the queries of search tasks in order to identify the tasks is called a search task clustering method. Most existing search task clustering methods adopt a search theory based on the literature paradigm, which is concerned only with the queries and clicks made during a search. However, when solving a complex problem, people typically need to learn some knowledge from the search results and form a learning outcome (such as knowledge memorized in the brain, recorded notes, or written programs) before performing the next search. Search task clustering methods that adopt the literature paradigm attend only to the queries and clicks in a search task and ignore learning outcomes, so their clustering results are unsatisfactory. In contrast, search theories based on the learning paradigm are concerned not only with the queries and clicks during the search process but also with the learning outcomes produced during it. To address the shortcomings of existing search task clustering methods based on the literature paradigm, the invention provides a search task clustering method based on learning outcome, which adopts the Bayesian rose tree based search task clustering method proposed by Mehrotra and Yilmaz at the SIGIR 2017 conference and improves the clustering of search tasks by taking learning outcomes into account during clustering.
Disclosure of Invention
The invention provides a search task clustering method based on learning outcome, which adopts a search task clustering method based on a Bayesian rose tree and adopts a query similarity measurement method based on learning outcome in the clustering process to realize clustering of search tasks.
The invention is realized by the following technical scheme:
A search task clustering method based on learning outcome comprises the following steps:
step 1: for a given search task, determining the user session identifier, query submission time, query term set, click result address set and learning outcome set of each query in the search task, wherein each query is represented as a quintuple consisting of the user session identifier, the query submission time, the query term set, the click result address set and the learning outcome set; the search task refers to an ordered set of query information generated by a user during a search process, each element of the set being one piece of query information; the query information refers to a quintuple consisting of a user session identifier, a query submission time, a query term set, a click result address set and a learning outcome set; the user session identifier refers to a pair formed by the user identifier and the session identifier to which the query information belongs; the user identifier refers to an identifier that uniquely distinguishes different users; the session identifier refers to an identifier that uniquely distinguishes different sessions; the query submission time refers to the time at which the query term set of the query information is submitted to the search engine; the query term set refers to an ordered set formed by the query terms that a user submits to the search engine at one time; the click result address set refers to an ordered set formed by the addresses of the search results that the user clicks on the search result list page obtained after submitting the query term set to the search engine;
step 2: determining the constituent symbols that make up the learning outcomes, and collecting them to obtain the constituent symbol set of the learning outcomes; for example, for a program written in an object-oriented programming language, the constituent symbols may be the programming interface classes provided by the programming language, and the resulting constituent symbol set may be the set of all such programming interface classes; the learning outcome set refers to the ordered set of learning outcomes that a user constructs after submitting a query term set to the search engine and obtaining the search result list page, and before submitting the next query term set to the search engine or completing the search task; a learning outcome refers to a product that the user constructs during the search process using learned knowledge, such as a written paper or a written program; a learning outcome is an ordered collection of constituent symbols in which symbols may repeat;
step 3: based on the constituent symbol set of the learning outcomes, counting the number of occurrences of each constituent symbol in a learning outcome, and vectorizing the learning outcome into a constituent symbol vector; for example, a program written in an object-oriented programming language can be vectorized into a vector over the programming interface classes, in which the value of each entry is the number of times the corresponding programming interface class appears in the learning outcome; the constituent symbols refer to the symbols that make up a learning outcome; based on a constituent symbol set, a learning outcome can be vectorized into a constituent symbol vector; the constituent symbol set refers to an ordered set formed by a group of constituent symbols; the constituent symbol vector refers to a vector with the same length as the constituent symbol set; the value of each entry of the constituent symbol vector is a non-negative integer giving the number of times that the constituent symbol at the same position in the constituent symbol set occurs in the learning outcome corresponding to the vector;
step 4: using the Bayesian rose tree search task clustering algorithm to cluster the query information based on learning outcomes, with the following specific steps:
step 4.1: building a tree from each piece of query information, all of the trees together forming a forest;
step 4.2: using the Bayesian rose tree search task clustering algorithm to recursively merge the forest into a single tree, with the following steps:
step 4.2.1: calculating the marginal likelihood of a tree containing a group of query information, as follows:
step 4.2.1.1: calculating the learning-outcome-based query similarity from the learning outcomes of a pair of query information, as follows:
step 4.2.1.1.1: calculating the first-learning-outcome Euclidean distance of the pair of queries by applying the Euclidean distance to the constituent symbol vectors of the first learning outcome in each query's learning outcome set;
step 4.2.1.1.2: calculating the first-learning-outcome Hamming distance of the pair of queries by applying the Hamming distance to the constituent symbol vectors of the first learning outcome in each query's learning outcome set;
step 4.2.1.1.3: summing, for each piece of query information, the constituent symbol vectors of all learning outcomes in its learning outcome set, and calculating the learning-outcome-set Euclidean distance of the pair of queries by applying the Euclidean distance to the summed vectors;
step 4.2.1.1.4: summing, for each piece of query information, the constituent symbol vectors of all learning outcomes in its learning outcome set, and calculating the learning-outcome-set Hamming distance of the pair of queries by applying the Hamming distance to the summed vectors;
step 4.2.1.1.5: taking the arithmetic mean of the first-learning-outcome Euclidean distance, the first-learning-outcome Hamming distance, the learning-outcome-set Euclidean distance and the learning-outcome-set Hamming distance as the learning-outcome-based query similarity of the pair of queries;
step 4.2.1.2: using the Bayesian rose tree search task clustering algorithm with the learning-outcome-based query similarity as the similarity metric, calculating the marginal likelihood of a tree containing a group of query information;
step 4.2.2: using the Bayesian rose tree search task clustering algorithm to recursively merge the forest into a single tree according to the marginal likelihood of the trees containing groups of query information;
step 5: outputting the Bayesian rose tree structure obtained by clustering, which is the clustering result of the given search task.
The invention adopts a search task clustering method based on a Bayesian rose tree and uses a learning-outcome-based query similarity measure during clustering to cluster search tasks. It remedies the shortcoming of conventional search task clustering methods, which attend only to the queries and clicks in the search process and ignore learning outcomes, and improves the clustering of search tasks by taking learning outcomes into account during clustering.
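Steps 1 to 3 above can be illustrated with a minimal Python sketch. It is only one possible reading of the text, not the patented implementation: the names `QueryInfo`, `build_symbol_set` and `vectorize` are hypothetical, the submission time is assumed to be a numeric timestamp, and each learning outcome is assumed to have already been reduced to a list of symbol names (for example, the programming interface classes used in a program).

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class QueryInfo:
    """One piece of query information: the quintuple described in step 1."""
    user_session_id: Tuple[str, str]   # (user identifier, session identifier)
    submit_time: float                 # query submission time (assumed: a timestamp)
    query_terms: List[str]             # ordered query term set
    clicked_urls: List[str]            # ordered click result address set
    # Ordered learning outcome set; each outcome is an ordered list of
    # constituent symbols in which symbols may repeat.
    learning_outcomes: List[List[str]] = field(default_factory=list)


def build_symbol_set(outcomes: List[List[str]]) -> List[str]:
    """Step 2: collect the distinct constituent symbols, in first-seen order."""
    seen = {}
    for outcome in outcomes:
        for symbol in outcome:
            seen.setdefault(symbol, len(seen))
    return list(seen)


def vectorize(outcome: List[str], symbol_set: List[str]) -> List[int]:
    """Step 3: count how often each symbol of symbol_set occurs in outcome."""
    counts = Counter(outcome)
    return [counts[symbol] for symbol in symbol_set]


# Usage: a program that uses the class File twice and Scanner once.
query = QueryInfo(("u1", "s1"), 0.0, ["read", "file"], [],
                  [["File", "Scanner", "File"]])
symbols = build_symbol_set(query.learning_outcomes)
vector = vectorize(query.learning_outcomes[0], symbols)
```

The resulting vector has the same length as the constituent symbol set, and each entry is a non-negative count, which matches the definition of the constituent symbol vector given in step 3.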
Drawings
FIG. 1 is a graph of a search task clustering process based on learning outcome.
Detailed Description
To address the problem of search task clustering based on learning outcome, the invention is described in detail below with reference to FIG. 1. The specific implementation steps are as follows:
Step 1: for a given search task, determining the user session identifier, query submission time, query term set, click result address set and learning outcome set of each query in the search task, wherein each query is represented as a quintuple consisting of the user session identifier, the query submission time, the query term set, the click result address set and the learning outcome set.
Step 2: determining the constituent symbols that make up the learning outcomes, and collecting them to obtain the constituent symbol set of the learning outcomes, $C = \{c_1, c_2, c_3, \ldots, c_i\}$.
Step 3: based on the constituent symbol set $C$, counting the number of occurrences of each constituent symbol in a learning outcome $LO_j$, and vectorizing the learning outcome into a constituent symbol vector [formula image BDA0001935495100000051].
Step 4: using the Bayesian rose tree search task clustering algorithm to cluster the query information based on learning outcomes, with the following specific steps:
Step 4.1: for each piece of query information $d_i$, building a tree $T_i = \{d_i\}$, all of the trees together forming a forest $F = \{T_1, T_2, T_3, \ldots, T_n\}$.
Step 4.2: using the Bayesian rose tree search task clustering algorithm to recursively merge the forest into a single tree, with the following steps:
Step 4.2.1: calculating the marginal likelihood of a tree containing a group of query information [formula image BDA0001935495100000061], as follows:
Step 4.2.1.1: calculating the learning-outcome-based query similarity from the learning outcomes of a pair of query information $q_i$ and $q_j$, as follows:
Step 4.2.1.1.1: based on the constituent symbol vector $V_f$ of the first learning outcome in each query's learning outcome set, calculating the first-learning-outcome Euclidean distance $h_2$ of the pair of queries using the Euclidean distance.
Step 4.2.1.1.2: based on the constituent symbol vector $V_f$ of the first learning outcome in each query's learning outcome set, calculating the first-learning-outcome Hamming distance $a_2$ of the pair of queries using the Hamming distance.
Step 4.2.1.1.3: summing the constituent symbol vectors of all learning outcomes in each query's learning outcome set to obtain $V_s$, and calculating the learning-outcome-set Euclidean distance $h_1$ of the pair of queries using the Euclidean distance.
Step 4.2.1.1.4: summing the constituent symbol vectors of all learning outcomes in each query's learning outcome set to obtain $V_s$, and calculating the learning-outcome-set Hamming distance $a_1$ of the pair of queries using the Hamming distance.
Step 4.2.1.1.5: taking the arithmetic mean of the first-learning-outcome Euclidean distance, the first-learning-outcome Hamming distance, the learning-outcome-set Euclidean distance and the learning-outcome-set Hamming distance, $(h_1 + h_2 + a_1 + a_2)/4$, as the learning-outcome-based query similarity of the pair of queries.
Step 4.2.1.2: using the Bayesian rose tree search task clustering algorithm with the learning-outcome-based query similarity $r_0$ as the similarity metric, calculating the marginal likelihood of a tree containing a group of query information.
Step 4.2.2: using the Bayesian rose tree search task clustering algorithm to recursively merge the forest into a single tree according to the marginal likelihood of the trees containing groups of query information.
Step 5: outputting the Bayesian rose tree structure obtained by clustering, which is the clustering result of the given search task.
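The similarity calculation of steps 4.2.1.1.1 to 4.2.1.1.5 can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the distance in steps 4.2.1.1.1 and 4.2.1.1.3 is read as the ordinary Euclidean distance, the Hamming distance is assumed to be the number of positions at which two count vectors differ, and every query is assumed to have at least one learning outcome. The marginal likelihood computation and the tree merging of the Bayesian rose tree algorithm are not reproduced in the text and are therefore not sketched.

```python
import math
from typing import List, Sequence


def euclidean(u: Sequence[int], v: Sequence[int]) -> float:
    """Ordinary Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def hamming(u: Sequence[int], v: Sequence[int]) -> int:
    """Assumed definition: number of positions at which the vectors differ."""
    return sum(1 for a, b in zip(u, v) if a != b)


def vector_sum(vectors: List[Sequence[int]]) -> List[int]:
    """Element-wise sum of the constituent symbol vectors of one query."""
    total = [0] * len(vectors[0])
    for vec in vectors:
        for k, value in enumerate(vec):
            total[k] += value
    return total


def learning_outcome_similarity(vecs_i: List[Sequence[int]],
                                vecs_j: List[Sequence[int]]) -> float:
    """Mean of the four distances between two queries' learning outcome sets.

    vecs_i and vecs_j hold the constituent symbol vectors of each query's
    learning outcomes, in order; both lists are assumed to be non-empty.
    """
    h2 = euclidean(vecs_i[0], vecs_j[0])    # first learning outcome, Euclidean
    a2 = hamming(vecs_i[0], vecs_j[0])      # first learning outcome, Hamming
    sum_i, sum_j = vector_sum(vecs_i), vector_sum(vecs_j)
    h1 = euclidean(sum_i, sum_j)            # learning outcome set, Euclidean
    a1 = hamming(sum_i, sum_j)              # learning outcome set, Hamming
    return (h1 + h2 + a1 + a2) / 4


# Usage: query i has two learning outcomes, query j has one.
score = learning_outcome_similarity([[2, 1, 0], [0, 1, 0]], [[2, 0, 0]])
```

Because the value is a mean of distances, identical learning outcome sets give 0 and larger values indicate greater dissimilarity; how the Bayesian rose tree algorithm consumes this value is left to the cited method.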

Claims (1)

1. A search task clustering method based on learning outcome, which adopts a search task clustering method based on a Bayesian rose tree and uses a learning-outcome-based query similarity measure during clustering to cluster search tasks, characterized by comprising the following specific steps:
step 1: for a given search task, determining the user session identifier, query submission time, query term set, click result address set and learning outcome set of each query in the search task, wherein each query is represented as a quintuple consisting of the user session identifier, the query submission time, the query term set, the click result address set and the learning outcome set; the search task refers to an ordered set of query information generated by a user during a search process, each element of the set being one piece of query information; the query information refers to a quintuple consisting of a user session identifier, a query submission time, a query term set, a click result address set and a learning outcome set; the user session identifier refers to a pair formed by the user identifier and the session identifier to which the query information belongs; the user identifier refers to an identifier that uniquely distinguishes different users; the session identifier refers to an identifier that uniquely distinguishes different sessions; the query submission time refers to the time at which the query term set of the query information is submitted to the search engine; the query term set refers to an ordered set formed by the query terms that a user submits to the search engine at one time; the click result address set refers to an ordered set formed by the addresses of the search results that the user clicks on the search result list page obtained after submitting the query term set to the search engine;
step 2: determining the constituent symbols that make up the learning outcomes, and collecting them to obtain the constituent symbol set of the learning outcomes; for a program written in an object-oriented programming language, the constituent symbols are the programming interface classes provided by the programming language, and the resulting constituent symbol set is the set of all such programming interface classes; the learning outcome set refers to the ordered set of learning outcomes that a user constructs after submitting a query term set to the search engine and obtaining the search result list page, and before submitting the next query term set to the search engine or completing the search task; a learning outcome refers to a product that the user constructs during the search process using learned knowledge; a learning outcome is an ordered collection of constituent symbols in which symbols may repeat;
step 3: based on the constituent symbol set of the learning outcomes, counting the number of occurrences of each constituent symbol in a learning outcome, and vectorizing the learning outcome into a constituent symbol vector; a program written in an object-oriented programming language is vectorized into a vector over the programming interface classes, in which the value of each entry is the number of times the corresponding programming interface class appears in the learning outcome; the constituent symbols refer to the symbols that make up a learning outcome; a learning outcome is vectorized into a constituent symbol vector based on a constituent symbol set; the constituent symbol set refers to an ordered set formed by a group of constituent symbols; the constituent symbol vector refers to a vector with the same length as the constituent symbol set; the value of each entry of the constituent symbol vector is a non-negative integer giving the number of times that the constituent symbol at the same position in the constituent symbol set occurs in the learning outcome corresponding to the vector;
step 4: using the Bayesian rose tree search task clustering algorithm to cluster the query information based on learning outcomes, with the following specific steps:
step 4.1: building a tree from each piece of query information, all of the trees together forming a forest;
step 4.2: using the Bayesian rose tree search task clustering algorithm to recursively merge the forest into a single tree, with the following steps:
step 4.2.1: calculating the marginal likelihood of a tree containing a group of query information, as follows:
step 4.2.1.1: calculating the learning-outcome-based query similarity from the learning outcomes of a pair of query information, as follows:
step 4.2.1.1.1: calculating the first-learning-outcome Euclidean distance of the pair of queries by applying the Euclidean distance to the constituent symbol vectors of the first learning outcome in each query's learning outcome set;
step 4.2.1.1.2: calculating the first-learning-outcome Hamming distance of the pair of queries by applying the Hamming distance to the constituent symbol vectors of the first learning outcome in each query's learning outcome set;
step 4.2.1.1.3: summing, for each piece of query information, the constituent symbol vectors of all learning outcomes in its learning outcome set, and calculating the learning-outcome-set Euclidean distance of the pair of queries by applying the Euclidean distance to the summed vectors;
step 4.2.1.1.4: summing, for each piece of query information, the constituent symbol vectors of all learning outcomes in its learning outcome set, and calculating the learning-outcome-set Hamming distance of the pair of queries by applying the Hamming distance to the summed vectors;
step 4.2.1.1.5: taking the arithmetic mean of the first-learning-outcome Euclidean distance, the first-learning-outcome Hamming distance, the learning-outcome-set Euclidean distance and the learning-outcome-set Hamming distance as the learning-outcome-based query similarity of the pair of queries;
step 4.2.1.2: using the Bayesian rose tree search task clustering algorithm with the learning-outcome-based query similarity as the similarity metric, calculating the marginal likelihood of a tree containing a group of query information;
step 4.2.2: using the Bayesian rose tree search task clustering algorithm to recursively merge the forest into a single tree according to the marginal likelihood of the trees containing groups of query information;
step 5: outputting the Bayesian rose tree structure obtained by clustering, which is the clustering result of the given search task.
CN201910006059.4A 2019-01-04 2019-01-04 Search task clustering method based on learning output Expired - Fee Related CN109740062B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910006059.4A CN109740062B (en) 2019-01-04 2019-01-04 Search task clustering method based on learning output

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910006059.4A CN109740062B (en) 2019-01-04 2019-01-04 Search task clustering method based on learning output

Publications (2)

Publication Number Publication Date
CN109740062A CN109740062A (en) 2019-05-10
CN109740062B true CN109740062B (en) 2020-10-16

Family

ID=66363274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910006059.4A Expired - Fee Related CN109740062B (en) 2019-01-04 2019-01-04 Search task clustering method based on learning output

Country Status (1)

Country Link
CN (1) CN109740062B (en)

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200846942A (en) * 2007-05-21 2008-12-01 Univ Nat Taiwan Science Tech Clustering TRIZ analysis model
US8873836B1 (en) * 2012-06-29 2014-10-28 Emc Corporation Cluster-based classification of high-resolution data
CN103064941B (en) * 2012-12-25 2016-12-28 深圳先进技术研究院 Image search method and device
CN106372090B (en) * 2015-07-23 2021-02-09 江苏苏宁云计算有限公司 Query clustering method and device
CN107491447B (en) * 2016-06-12 2021-01-22 百度在线网络技术(北京)有限公司 Method for establishing query rewrite judging model, method for judging query rewrite and corresponding device
US20180285438A1 (en) * 2017-03-31 2018-10-04 Change Healthcase Holdings, Llc Database system and method for identifying a subset of related reports
CN108038183B (en) * 2017-12-08 2020-11-24 北京百度网讯科技有限公司 Structured entity recording method, device, server and storage medium
CN108228884B (en) * 2018-01-30 2022-04-05 东北大学 Reading difficulty oriented search result preview system and preview method

Also Published As

Publication number Publication date
CN109740062A (en) 2019-05-10

Similar Documents

Publication Publication Date Title
WO2022041727A1 (en) Question and answer management method, apparatus, and device for medical inquiry system, and storage medium
CN106021364B (en) Foundation, image searching method and the device of picture searching dependency prediction model
CN103064903B (en) Picture retrieval method and device
CN107590128B (en) Paper homonymy author disambiguation method based on high-confidence characteristic attribute hierarchical clustering method
CN110147421B (en) Target entity linking method, device, equipment and storage medium
CN109408578B (en) Monitoring data fusion method for heterogeneous environment
WO2011037603A1 (en) Searching for information based on generic attributes of the query
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN107291895B (en) Quick hierarchical document query method
US20030212663A1 (en) Neural network feedback for enhancing text search
US11537905B2 (en) Inference-based assignment of data type to data
CN113569057B (en) Sample query method oriented to ontology tag knowledge graph
CN112632261A (en) Intelligent question and answer method, device, equipment and storage medium
WO2020147259A1 (en) User portait method and apparatus, readable storage medium, and terminal device
CN112860916B (en) Movie-television-oriented multi-level knowledge map generation method
Wu et al. Discovering topical structures of databases
CN109740062B (en) Search task clustering method based on learning output
CN110019714A (en) More intent query method, apparatus, equipment and storage medium based on historical results
Balaji et al. An ensemble blocking scheme for entity resolution of large and sparse datasets
CN106528595B (en) Realm information based on website homepage content is collected and correlating method
Schenker et al. A comparison of two novel algorithms for clustering web documents
CN116881437B (en) Data processing system for acquiring text set
Dobrescu et al. Multi-modal CBIR algorithm based on Latent Semantic Indexing
AU2020104033A4 (en) CDM- Separating Items Device: Separating Items into their Corresponding Class using Iris Dataset Machine Learning Classification Device
CN115859968B (en) Policy granulation analysis system based on natural language analysis and machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20201016