CN114595305A

CN114595305A - Intention identification method based on semantic index

Info

Publication number: CN114595305A
Application number: CN202210223886.0A
Authority: CN
Inventors: 高航; 胡毅; 曹梦华
Original assignee: Hunan Xingsheng Optimization Network Technology Co ltd
Current assignee: Hunan Xingsheng Optimization Network Technology Co ltd
Priority date: 2022-03-09
Filing date: 2022-03-09
Publication date: 2022-06-07

Abstract

The invention discloses an intention identification method based on a semantic index. A search semantic model is trained on user search data; logs generated by a search engine are counted, and Query-and-intention statistical entries are established for each historical Query whose occurrence frequency exceeds a certain threshold; each historical Query is input into the search semantic model to obtain a semantic vector; the semantic vector is added to the statistical entry to obtain a semantic intention index; an online user Query is input into the search semantic model to obtain a Query semantic vector v; the semantic intention index is searched with v to obtain the records most semantically similar to the online Query and the corresponding semantic similarities; and the intention recognition result of the online Query is calculated from them. The invention realizes intention identification by combining semantic matching with intention statistics, and has better generalization and higher accuracy and recall.

Description

Intention identification method based on semantic index

Technical Field

The invention belongs to the technical field of search, and particularly relates to an intention identification method based on semantic indexing.

Background

Intelligent search analyzes the user input in the Query understanding stage, and then retrieves from the content library and ranks the results according to that analysis. Intent recognition is a key technology in Query understanding. For more sophisticated search engines, the content repository commonly contains multiple types of documents. In a single search request, users mostly care about one specific document or one type of document rather than all types. The purpose of intention recognition is to predict, from the search word (i.e. Query) input by the user, the types of document the user wishes to retrieve and the corresponding distribution of intent strength. On one hand, better intention identification narrows the range of document retrieval and makes the retrieval result more accurate; on the other hand, it provides an important basis for ranking the retrieval results. The quality of the ranking directly affects user satisfaction.

Currently, there are two mainstream types of intention recognition method. The first is based on dictionaries and rules. User search click logs are mined offline to count <historical Query, category distribution of clicked documents>. Online, the user Query is matched against the historical Queries by dictionary lookup, and the category distribution of the matched historical Query is taken as the intention recognition result of the online Query. The second is based on text classification, treating intent recognition as a text classification prediction task. The classification model is typically a multi-label text classification model such as SVM, TextCNN, or fastText. The training data is <historical Query, type of clicked document>.

The main problems of the existing methods are as follows. 1) The first method, being based on statistics of user search click behavior, accurately reflects the document category distribution corresponding to a search term, but dictionary lookup lacks generalization: when the online user input cannot be matched to a historical Query, intent recognition returns no result. 2) The second method generalizes better, but high accuracy and recall are difficult to achieve. Especially in scenarios with many document types, a multi-label classification model struggles to predict accurately and cannot provide a reasonable intention distribution from its prediction probabilities. Therefore, in practical applications, both methods have to be supplemented by a large amount of manual review and operational configuration.
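
The dictionary-based baseline described above, and its lack of generalization, can be illustrated with a minimal sketch. The function names, the log layout (one `(query, clicked_category)` pair per click), and the sample data are assumptions made for illustration only and do not come from the source.

```python
from collections import Counter, defaultdict


def build_intent_dictionary(click_log):
    """Count <historical Query, category distribution of clicked documents>.

    click_log is assumed to be an iterable of (query, clicked_category) pairs.
    """
    counts = defaultdict(Counter)
    for query, category in click_log:
        counts[query][category] += 1
    # Normalize the click counts of each Query into a category distribution.
    return {
        query: {cat: n / sum(cats.values()) for cat, n in cats.items()}
        for query, cats in counts.items()
    }


def lookup_intent(dictionary, query):
    # Exact-match lookup: returns None when the Query was never seen offline.
    return dictionary.get(query)


# Hypothetical sample log.
log = [("apple", "fruit"), ("apple", "fruit"), ("apple", "phone"), ("pear", "fruit")]
intent_dict = build_intent_dictionary(log)
seen = lookup_intent(intent_dict, "apple")
unseen = lookup_intent(intent_dict, "red apple")
```

Here `seen` carries the click-based distribution for "apple", while `unseen` is `None` because "red apple" has no exact match, which is the generalization gap the background section points out.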

Disclosure of Invention

In view of the above, the invention provides an intention identification method based on a semantic intention index. In the offline stage, the historical search Queries and the types of documents clicked after each search are counted to build the statistical intents; a search semantic model is then built with Word2vec, and the semantic intention index is constructed. In the online stage, the semantic vector of the user Query is obtained with the search semantic model; the semantic intention index is searched by semantic vector retrieval to obtain the top k records; and the intention distributions in the top k records are fused to obtain the final intention recognition result.

The invention discloses an intention identification method based on semantic indexing, which comprises the following steps:

training a search semantic model on user search data, wherein the training data comprises both the search word sequence of each single user and the search word sequence of each single document;

counting logs generated by a search engine, and establishing Query and intention counting entries for the historical Query with the occurrence frequency of the historical Query exceeding a certain threshold; inputting a historical Query into a search semantic model to obtain a semantic vector; adding the semantic vector into the statistical items to obtain a semantic intention index;

inputting the online user Query into the search semantic model to obtain a Query semantic vector v; retrieving the semantic intention index with the semantic vector v of the online Query, and obtaining the records in the semantic intention index that are most semantically similar to the online Query together with the corresponding semantic similarities;

and fusing the retrieval results, and calculating to obtain the intention identification result of the online Query.

Further, the search semantic model uses a neural network model of Word2vec or FastText.

Furthermore, when the Query is segmented into words, a domain lexicon is preferentially used; if no domain lexicon exists, word-level n-gram features are extracted on the basis of a general lexicon.

Further, the statistical entries of Query and intent are <q_i, intent_i>, where intent_i = {category_1: prob_1, category_2: prob_2, ...}, q_i is the ith Query, intent_i is the intent of the ith Query, category_1 and category_2 are the 1st and 2nd intent categories, and prob_1 and prob_2 are the probabilities of the 1st and 2nd intent categories.

Further, each record of the semantic intent index has the form <q_i, v_i, intent_i>.

Further, the semantic intention index is retrieved with the semantic vector v of the online Query using the cosine similarity method.

Further, a similarity threshold θ is set, and according to it the k records {rec_i | i = 1, 2, ..., k} most semantically similar to the online Query and the corresponding semantic similarities {sim_i | i = 1, 2, ..., k} are obtained.

Further, the final prediction result is calculated according to the following cases of the retrieval result:

if the retrieval result is not empty and the record with the maximum similarity in the result, rec_a, has similarity sim_a = 1, the intent_a of that record is taken as the final prediction result;

if the retrieval result is not empty and all the similarity degrees are less than 1, obtaining a final prediction result through weighting calculation:

wherein k is the number of records most semantically similar to the online Query;

if the retrieval result is null, the user input is recognized as a universal intent Query, i.e., a search without explicit intent.

Furthermore, the number of recognitions a, the number of clicks b, and the number of search click behaviors c of the universal intent Query are recorded, and an update value Q_u of the universal intent Query is calculated:

wherein α and β are preset weight parameters, a1 and a2 are preset recognition count thresholds, and b1 is a preset click count threshold; when the update value Q_u of the universal intent Query exceeds the update threshold Q1, the universal intent Query is added to the semantic intent index when the search semantic model and the statistical intents are periodically updated.

The invention has the following beneficial effects:

the invention realizes intention identification by utilizing a semantic matching and intention counting method, and has better generalization and higher accuracy and recall rate compared with the existing mainstream method, thereby providing support for subsequent search recall and result sequencing.

Drawings

FIG. 1 is a flow chart of an intent recognition method of the present invention;

FIG. 2 is a flow diagram of the present invention for constructing a semantic intent index offline;

FIG. 3 is a flow diagram of the online prediction of the present invention.

Detailed Description

The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.

The invention discloses an intention identification method based on semantic indexes, which comprises the following steps:

Step 1, offline stage: constructing the search semantic model

The search semantic model uses a neural network model such as Word2vec. The training data includes two corpora: 1) grouping by user, the search terms of each user are spliced into a sequence; 2) grouping by document clicked in search, the search terms corresponding to each document are spliced into a sequence. When the Query is segmented into words, a domain lexicon is preferably used. If no domain lexicon exists, word-level n-gram features can be extracted on the basis of a general lexicon. To guarantee the quality of the semantic model, two points are required: 1) so that the model covers the search semantics of a long period, the training corpus is extracted from search logs spanning a long period, usually one year in the case of e-commerce search; 2) so that the model learns the semantics of new Queries in time, the update period of the semantic model should not be too long. In the e-commerce search example, the model is typically updated once per week.
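
The two training corpora and the word-level n-gram fallback can be sketched as follows. This is a minimal illustration, assuming a log of `(user_id, query, clicked_doc_id)` rows in time order; the function names, the log layout, and the sample data are hypothetical. The resulting sequences would then be passed to a Word2vec-style trainer, which is not shown here.

```python
from collections import defaultdict


def build_training_corpora(search_log):
    """Build the per-user and per-document search term sequences.

    search_log is assumed to be an iterable of (user_id, query, clicked_doc_id)
    rows in time order; clicked_doc_id is None when nothing was clicked.
    """
    by_user = defaultdict(list)
    by_doc = defaultdict(list)
    for user_id, query, doc_id in search_log:
        by_user[user_id].append(query)      # corpus 1: grouped by user
        if doc_id is not None:
            by_doc[doc_id].append(query)    # corpus 2: grouped by clicked document
    return list(by_user.values()) + list(by_doc.values())


def word_ngrams(words, max_n=2):
    """Word-level n-grams (n = 1..max_n) over an already segmented Query."""
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            grams.append(" ".join(words[i:i + n]))
    return grams


# Hypothetical sample log.
log = [("u1", "apple", "d1"), ("u2", "fuji apple", "d1"), ("u1", "pear", None)]
corpora = build_training_corpora(log)
grams = word_ngrams(["red", "fuji", "apple"], max_n=2)
```

Grouping by clicked document places different Queries that led to the same document in one sequence, so a context-window model can learn that they are semantically close even when they share no words.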

Step 2, offline stage: constructing the semantic intent index

First, the logs generated by the search engine are counted, and Query-and-intent statistical entries are established for each historical Query whose occurrence frequency exceeds a certain threshold. The statistical entries have the form <q_i, intent_i>, where intent_i = {category_1: prob_1, category_2: prob_2, ...}. Then, each historical Query is input into the search semantic model to obtain a semantic vector. The semantic vector is added to the statistical entry to obtain the semantic intention index, in which each record has the form <q_i, v_i, intent_i>. The semantic intent index must be updated after each semantic model update.
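
A minimal sketch of this offline index construction is given below. It assumes one `(query, clicked_category)` row per click, treats the number of rows for a Query as its occurrence frequency, and takes the search semantic model as an opaque `embed` callable; all names, the threshold value, and the toy vectors are illustrative assumptions rather than details from the source.

```python
from collections import Counter, defaultdict


def build_semantic_intent_index(click_log, embed, freq_threshold=1):
    """Build records <q_i, v_i, intent_i> for sufficiently frequent Queries.

    embed is a hypothetical stand-in for the search semantic model: it maps a
    Query string to its semantic vector.
    """
    counts = defaultdict(Counter)
    for query, category in click_log:
        counts[query][category] += 1

    index = []
    for query, cats in counts.items():
        total = sum(cats.values())
        if total <= freq_threshold:
            continue  # keep only Queries whose frequency exceeds the threshold
        intent = {cat: n / total for cat, n in cats.items()}
        index.append({"query": query, "vector": embed(query), "intent": intent})
    return index


# Toy two-dimensional vectors standing in for a trained model (assumption).
toy_vectors = {"apple": [1.0, 0.0], "pear": [0.0, 1.0]}
log = [
    ("apple", "fruit"), ("apple", "fruit"), ("apple", "phone"),
    ("pear", "fruit"), ("pear", "fruit"),
    ("kiwi", "fruit"),  # appears once, so it is filtered out
]
index = build_semantic_intent_index(log, lambda q: toy_vectors[q], freq_threshold=1)
```

Because each stored vector comes from a particular model version, rebuilding the whole index after every model update, as the text requires, keeps the stored vectors comparable with the vectors computed online.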

Step 3, online stage: constructing the semantic vector of the user Query and searching the semantic intent index

The Query of the online user is input into the search semantic model to obtain the Query semantic vector v. The cosine similarity of the semantic vectors of two Queries is a decimal between 0 and 1 that reflects the semantic similarity of the two Queries. The semantic intent index is searched with v by the cosine similarity method. A relatively high similarity threshold θ is set, and according to it the k records {rec_i | i = 1, 2, ..., k} most semantically similar to the online Query and the corresponding semantic similarities {sim_i | i = 1, 2, ..., k} are obtained.
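
The retrieval step can be sketched as a brute-force cosine search with a threshold and a top-k cut. The record layout, the function names, and the values of `theta` and `k` are assumptions for illustration; the source does not specify how the index is searched beyond the use of cosine similarity, and a production system might well use an approximate nearest-neighbor index instead.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(index, v, theta=0.8, k=3):
    """Return up to k (similarity, record) pairs with similarity >= theta,
    sorted from most to least similar."""
    scored = [(cosine(v, rec["vector"]), rec) for rec in index]
    hits = [pair for pair in scored if pair[0] >= theta]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:k]


# Hypothetical toy index.
index = [
    {"query": "pear", "vector": [0.0, 1.0], "intent": {"fruit": 1.0}},
    {"query": "fuji apple", "vector": [0.9, 0.1], "intent": {"fruit": 1.0}},
    {"query": "apple", "vector": [1.0, 0.0], "intent": {"fruit": 0.7, "phone": 0.3}},
]
hits = retrieve(index, [1.0, 0.0], theta=0.8, k=3)
no_hits = retrieve(index, [-1.0, 0.0], theta=0.8, k=3)
```

In this toy data `hits` contains "apple" followed by "fuji apple", while "pear" falls below the threshold; `no_hits` is empty, which corresponds to the case where no historical Query is similar enough.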

Step 4, online stage: fusing the retrieval results returned by the previous step and calculating the intention identification result of the online Query

The final prediction result is calculated according to three cases of the retrieval result:

1) The retrieval result is not empty, and the record with the highest similarity in the result is rec_a with similarity sim_a = 1. This means the online Query and q_a are identical. Therefore, intent_a is taken as the final prediction result.

2) The retrieval result is not empty, and all similarities are less than 1. This indicates that the historical Queries in the retrieval result are highly similar to the online Query but not identical to it. In this case, the final prediction result is obtained by weighted calculation:

3) The retrieval result is empty, meaning the intent index contains no historical Query semantically similar to the online Query. In this case, the user input is identified as a universal intent Query, i.e., a search with no explicit intent.
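
The three cases can be sketched as below. The weighting formula itself is not reproduced in this text, so the sketch assumes one plausible form, a similarity-weighted average of the retrieved intent distributions normalized by the sum of similarities; that choice, the function name, and the tolerance `eps` are assumptions and may differ from the formula in the original filing.

```python
def fuse_intents(hits, eps=1e-9):
    """Fuse retrieved intents into one distribution.

    hits is assumed to be a list of (similarity, intent_dict) pairs sorted by
    descending similarity. Returns None for a universal intent Query.
    """
    if not hits:
        return None  # case 3: nothing similar enough, no explicit intent

    top_sim, top_intent = hits[0]
    if top_sim >= 1.0 - eps:
        return dict(top_intent)  # case 1: identical historical Query

    # Case 2 (assumed form): similarity-weighted average of the distributions.
    total = sum(sim for sim, _ in hits)
    fused = {}
    for sim, intent in hits:
        for category, prob in intent.items():
            fused[category] = fused.get(category, 0.0) + sim * prob / total
    return fused


exact = fuse_intents([(1.0, {"phone": 1.0}), (0.9, {"fruit": 1.0})])
weighted = fuse_intents([(0.9, {"fruit": 1.0}), (0.6, {"fruit": 0.5, "phone": 0.5})])
empty = fuse_intents([])
```

Under this assumed weighting, `weighted` is a proper distribution: the category probabilities still sum to 1 because every input distribution sums to 1 and the weights are normalized.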

It should be noted that: the semantic intent index contains a history of long-term statistics, so the third case is very rare. If the same or similar new Query appears continuously, the new Query will be identified as a universal Query in the early stage by the present invention. At this time, the type of the document returned by the search is not limited, and includes types desired and undesired by the user. As the cumulative number of occurrences increases and with a certain amount of search click behavior, new queries are added to the semantic intent index as the search semantic model and statistical intent are periodically updated. The specific steps of updating the new Query to the semantic intent index are as follows:

The number of recognitions a, the number of clicks b, and the number of search click behaviors c of the universal intent Query are recorded, and the update value Q_u of the universal intent Query is calculated:

wherein α and β are preset weight parameters, a1 and a2 are preset recognition count thresholds, and b1 is a preset click count threshold; when the update value Q_u of the universal intent Query exceeds the update threshold Q1, the universal intent Query is added to the semantic intent index when the search semantic model and the statistical intents are periodically updated.

The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then "X employs A or B" is satisfied under any of the foregoing instances.

Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present disclosure includes all such modifications and alterations, and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Furthermore, to the extent that the terms "includes," "has," "contains," or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising."

Each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module can be realized in the form of hardware, and can also be realized in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Each apparatus or system described above may execute the storage method in the corresponding method embodiment.

In summary, the above-mentioned embodiment is an implementation manner of the present invention, but the implementation manner of the present invention is not limited by the above-mentioned embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements which are included in the protection scope of the present invention.

Claims

1. The intention identification method based on the semantic index is characterized by comprising the following steps of:

2. The semantic-index-based intent recognition method of claim 1, wherein the search semantic model uses a neural network model of Word2vec or FastText.

3. The semantic-index-based intention recognition method according to claim 1, wherein a domain lexicon is preferentially used when a Query is segmented into words, and if no domain lexicon exists, word-level n-gram features are extracted on the basis of a general lexicon.

4. The semantic-index-based intent recognition method of claim 1, wherein the statistical entries of Query and intent are <q_i, intent_i>, where intent_i = {category_1: prob_1, category_2: prob_2, ...}, q_i is the ith Query, intent_i is the intent of the ith Query, category_1 and category_2 are the 1st and 2nd intent categories, and prob_1 and prob_2 are the probabilities of the 1st and 2nd intent categories.

5. The semantic-index-based intent recognition method of claim 4, wherein each record of the semantic intent index has the form <q_i, v_i, intent_i>.

6. The semantic-index-based intention recognition method of claim 1, wherein the semantic intention index is retrieved with the semantic vector v of the online Query using the cosine similarity method.

7. The semantic-index-based intention recognition method according to claim 1, wherein a similarity threshold θ is set, and according to the similarity threshold θ the k records {rec_i | i = 1, 2, ..., k} most semantically similar to the online Query and the corresponding semantic similarities {sim_i | i = 1, 2, ..., k} are obtained.

8. The semantic-index-based intention recognition method of claim 7, wherein the final prediction result is calculated according to the following cases of the retrieval result:

wherein k is the number of records most semantically similar to the online Query;

9. The semantic-index-based intention recognition method according to claim 8, wherein the number of recognitions a, the number of clicks b, and the number of search click behaviors c of the universal intent Query are recorded, and the update value Q_u of the universal intent Query is calculated:

wherein α and β are preset weight parameters, a1 and a2 are preset recognition count thresholds, and b1 is a preset click count threshold; when the update value Q_u of the universal intent Query exceeds the update threshold Q1, the universal intent Query is added to the semantic intent index when the search semantic model and the statistical intents are periodically updated.