CN114595305A - Intention identification method based on semantic index - Google Patents
- Publication number
- CN114595305A (application CN202210223886.0A)
- Authority
- CN
- China
- Prior art keywords
- semantic
- query
- search
- intent
- intention
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/194—Calculation of difference between files
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Probability & Statistics with Applications (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an intent recognition method based on a semantic index. A search semantic model is trained on user search data. Logs generated by the search engine are aggregated, and Query-intent statistical entries are created for each historical Query whose frequency of occurrence exceeds a threshold. Each historical Query is input into the search semantic model to obtain a semantic vector, and the vector is added to the statistical entry to form a semantic intent index. An online user Query is input into the search semantic model to obtain a Query semantic vector v. The semantic intent index is searched with v to retrieve the records most semantically similar to the online Query, together with their semantic similarities, and the intent recognition result of the online Query is calculated from them. By combining semantic matching with intent statistics, the invention achieves better generalization and higher accuracy and recall.
Description
Technical Field
The invention belongs to the technical field of search, and particularly relates to an intention identification method based on semantic indexing.
Background
Intelligent search analyzes the user input in the Query understanding stage, and then uses the analysis results to retrieve from the content library with high quality and to rank the results sensibly. Intent recognition is a key technology in Query understanding. In a mature search engine, the content library usually contains multiple types of documents. In a single search request, a user is mostly interested in one specific document or in documents of certain types, not in all types. The purpose of intent recognition is to predict, from the search terms (i.e. the Query) entered by the user, the types of documents the user wishes to retrieve and the corresponding distribution of intent strength. On the one hand, good intent recognition narrows the scope of document retrieval and makes the retrieval results more accurate; on the other hand, it provides an important basis for ranking the retrieval results. The quality of the ranking directly affects user satisfaction.
Currently, there are two mainstream types of intent recognition method. The first is based on dictionaries and rules. User search click logs are mined offline, and entries of the form <historical Query, category distribution of clicked documents> are counted. Online, the user Query is matched against the historical Queries by dictionary lookup, and the category distribution of the matched historical Query is taken as the intent recognition result of the online Query. The second is based on text classification. Intent recognition is treated as a text classification prediction task. The classification model is typically a multi-label text classification model such as SVM, TextCNN, or fastText. The training data is <historical Query, type of clicked document>.
The main problems of the existing methods are as follows. 1) Because the first method is based on statistics of users' search click behavior, it accurately reflects the document category distribution for a search term. However, dictionary lookup lacks generalization: when the user's online input matches no historical Query, intent recognition produces no result. 2) The second method generalizes better, but high accuracy and recall are difficult to achieve. Especially in scenarios with many document types, a multi-label classification model struggles to predict accurately and cannot provide a reasonable intent distribution from its prediction probabilities. Therefore, in practical applications, both methods need to be supplemented by a large amount of manual review and operational configuration.
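The generalization gap of the dictionary approach can be illustrated with a minimal Python sketch. The `history` table, its category names, and the function `lookup_intent` are hypothetical and serve only as an illustration; they do not come from the patent.

```python
# Hypothetical offline statistics: historical Query -> category distribution.
history = {
    "iphone 13": {"phone": 0.9, "accessory": 0.1},
}

def lookup_intent(query):
    """Exact-match dictionary lookup; returns None when the Query is unseen."""
    return history.get(query)

seen = lookup_intent("iphone 13")         # a historical Query: distribution found
unseen = lookup_intent("iphone13 case")   # a close variant: no result at all
```

A Query that differs from a historical Query by a single character already falls outside the dictionary, which is the limitation the semantic index is meant to address.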
Disclosure of Invention
In view of the above, the invention provides an intent recognition method based on a semantic intent index. In the offline stage, the method counts historical search Queries and the types of documents clicked after each search to construct the statistical intents, then builds a search semantic model using Word2vec and constructs the semantic intent index. In the online stage, the method obtains the semantic vector of the user Query from the search semantic model, searches the semantic intent index by semantic vector retrieval to obtain the top k records, and fuses the intent distributions in the top k records to obtain the final intent recognition result.
The invention discloses an intention identification method based on semantic indexing, which comprises the following steps:
training a search semantic model on user search data, wherein the training data comprises both the search term sequence of each single user and the search term sequence of each single document;
aggregating the logs generated by the search engine, and creating Query-intent statistical entries for each historical Query whose frequency of occurrence exceeds a threshold; inputting the historical Query into the search semantic model to obtain a semantic vector; adding the semantic vector to the statistical entry to obtain the semantic intent index;
inputting the online user Query into the search semantic model to obtain a Query semantic vector v; searching the semantic intent index with the semantic vector v of the online user Query, and obtaining the records in the semantic intent index that are most semantically similar to the online Query, together with the corresponding semantic similarities;
and fusing the retrieval results to calculate the intent recognition result of the online Query.
Further, the search semantic model uses a neural network model of Word2vec or FastText.
Further, when the Query is segmented into words, a domain lexicon is used preferentially; if no domain lexicon exists, word-level n-gram features are extracted on the basis of a general lexicon.
Further, the Query-intent statistical entry is $\langle q_i, \mathrm{intent}_i \rangle$, where $\mathrm{intent}_i = \{\mathrm{category}_1 : \mathrm{prob}_1, \mathrm{category}_2 : \mathrm{prob}_2, \dots\}$, $q_i$ is the i-th Query, $\mathrm{intent}_i$ is the i-th intent, $\mathrm{category}_1$ and $\mathrm{category}_2$ are the 1st and 2nd intent categories, and $\mathrm{prob}_1$ and $\mathrm{prob}_2$ are the probabilities of the 1st and 2nd intent categories.
Further, each record of the semantic intent index has the form $\langle q_i, v_i, \mathrm{intent}_i \rangle$.
Further, the semantic intent index is searched with the semantic vector v of the online user Query using cosine similarity.
Further, a similarity threshold $\theta$ is set, and the k records most semantically similar to the online Query, $\{rec_i \mid i = 1, 2, \dots, k\}$, and the corresponding semantic similarities $\{sim_i \mid i = 1, 2, \dots, k\}$ are obtained accordingly.
Further, the final prediction result is calculated from the retrieval results as follows:
if the retrieval result is not empty and the record with the greatest similarity, $rec_a$, has similarity $sim_a = 1$, the intent $\mathrm{intent}_a$ of that record is taken as the final prediction result;
if the retrieval result is not empty and all similarities are less than 1, the final prediction result is obtained by a weighted calculation [formula not reproduced in this text], where k is the number of records most semantically similar to the online Query;
if the retrieval result is empty, the user input is recognized as a general-intent Query, i.e. a search without explicit intent.
Furthermore, the number of recognitions a, the number of clicks b, and the number of search click behaviors c of the general-intent Query are recorded, and the update value $Q_u$ of the general-intent Query is calculated: [formula not reproduced in this text]
where $\alpha$ and $\beta$ are preset weight parameters, a1 and a2 are preset recognition-count thresholds, and b1 is a preset click-count threshold. When the update value $Q_u$ of the general-intent Query exceeds the update threshold Q1, the general-intent Query is added to the semantic intent index when the search semantic model and the statistical intents are periodically updated.
The invention has the following beneficial effects:
By combining semantic matching with intent statistics, the invention achieves intent recognition with better generalization and higher accuracy and recall than the existing mainstream methods, thereby supporting subsequent search recall and result ranking.
Drawings
FIG. 1 is a flow chart of an intent recognition method of the present invention;
FIG. 2 is a flow diagram of the present invention for constructing a semantic intent index offline;
FIG. 3 is a flow diagram of the online prediction of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
The invention discloses an intention identification method based on semantic indexes, which comprises the following steps:
Step 1, offline stage: constructing the search semantic model
The search semantic model uses a neural network model such as Word2vec. The training data comprises two corpora: 1) grouping by user, the search terms of each user are concatenated into a sequence; 2) grouping by the document clicked in a search, the search terms corresponding to each document are concatenated into a sequence. When segmenting the Query, a domain lexicon is used preferentially. If no domain lexicon exists, word-level n-gram features can be extracted on the basis of a general lexicon. To guarantee the quality of the semantic model, two points must be observed: 1) So that the model covers the search semantics of a long period, the training corpus is extracted from search logs spanning a long period; for e-commerce search, for example, this is usually one year. 2) So that the model learns the semantics of new Queries promptly, the update period of the semantic model should not be too long. For the e-commerce search example, the model is typically updated once a week.
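The two training corpora described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the log layout `(user_id, query, clicked_doc_id)`, the function name `build_corpora`, and the whitespace tokenizer standing in for lexicon-based segmentation are all hypothetical. The resulting token sequences would then be passed to a Word2vec-style trainer, which is not shown here.

```python
from collections import defaultdict

def build_corpora(click_log):
    """Build per-user and per-document search term sequences.

    click_log: iterable of (user_id, query, clicked_doc_id) tuples in
    chronological order (an assumed layout, not specified by the patent).
    """
    by_user = defaultdict(list)  # corpus 1: one sequence per user
    by_doc = defaultdict(list)   # corpus 2: one sequence per clicked document
    for user_id, query, doc_id in click_log:
        tokens = query.split()   # placeholder for lexicon-based segmentation
        by_user[user_id].extend(tokens)
        by_doc[doc_id].extend(tokens)
    # Both corpora are used together as training sentences.
    return list(by_user.values()) + list(by_doc.values())

log = [
    ("u1", "red running shoes", "d1"),
    ("u1", "trail sneakers", "d2"),
    ("u2", "running sneakers", "d1"),
]
sentences = build_corpora(log)
```

In this toy log, the per-document sequence for `d1` places "shoes" and "sneakers" in the same context, which is how Queries that lead to the same document can end up with similar vectors.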
Step 2, offline stage: constructing the semantic intent index
First, the logs generated by the search engine are aggregated. Query-intent statistical entries are created for each historical Query whose frequency of occurrence exceeds a threshold. Each statistical entry has the form $\langle q_i, \mathrm{intent}_i \rangle$, where $\mathrm{intent}_i = \{\mathrm{category}_1 : \mathrm{prob}_1, \mathrm{category}_2 : \mathrm{prob}_2, \dots\}$. Then the historical Query is input into the search semantic model to obtain a semantic vector. The semantic vector is added to the statistical entry to obtain the semantic intent index, in which each record has the form $\langle q_i, v_i, \mathrm{intent}_i \rangle$. The semantic intent index must be updated after each semantic model update.
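The index construction can be sketched in Python as follows. The sketch makes several assumptions that the patent does not state: the log is a list of `(query, clicked_category)` pairs, frequency is counted as the number of such pairs per Query, and the function `toy_embed` is a hypothetical stand-in for the trained search semantic model.

```python
from collections import Counter, defaultdict

def build_intent_index(click_log, embed, freq_threshold):
    """Return index records {query, vector, intent} for frequent Queries.

    click_log: iterable of (query, clicked_category) pairs (assumed layout).
    embed: callable mapping a Query string to its semantic vector.
    """
    query_freq = Counter()
    category_counts = defaultdict(Counter)
    for query, category in click_log:
        query_freq[query] += 1
        category_counts[query][category] += 1

    index = []
    for query, freq in query_freq.items():
        if freq <= freq_threshold:  # keep only Queries above the threshold
            continue
        total = sum(category_counts[query].values())
        # Normalize click counts into a category probability distribution.
        intent = {c: n / total for c, n in category_counts[query].items()}
        index.append({"query": query, "vector": embed(query), "intent": intent})
    return index

def toy_embed(query):
    """Hypothetical stand-in for the search semantic model."""
    return [float(len(query)), 1.0]

log = [("apple", "phone")] * 3 + [("apple", "fruit"), ("pear", "fruit")]
index = build_intent_index(log, toy_embed, freq_threshold=2)
```

With this log, only "apple" exceeds the threshold, and its intent distribution is 0.75 for "phone" and 0.25 for "fruit".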
Step 3, online stage: constructing the semantic vector of the user Query and searching the statistical intent index
The online user Query is input into the search semantic model to obtain the Query semantic vector v. The cosine similarity of the semantic vectors of two Queries is a decimal between 0 and 1 and reflects the semantic similarity of the two Queries. The statistical intent index is searched with v using cosine similarity. A relatively high similarity threshold $\theta$ is set, and the k records most semantically similar to the online Query, $\{rec_i \mid i = 1, 2, \dots, k\}$, and the corresponding semantic similarities $\{sim_i \mid i = 1, 2, \dots, k\}$ are obtained accordingly.
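The retrieval step can be sketched in Python as a brute-force scan. This is an illustration only: the record layout, the function names `cosine` and `search_index`, and the default values of `theta` and `k` are assumptions, and the patent does not specify how the index is searched internally.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def search_index(index, v, theta=0.8, k=3):
    """Return up to k (similarity, record) pairs with similarity >= theta,
    sorted from most to least similar."""
    scored = [(cosine(v, rec["vector"]), rec) for rec in index]
    hits = [(s, rec) for s, rec in scored if s >= theta]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits[:k]

index = [
    {"query": "a", "vector": [1.0, 0.0], "intent": {"phone": 1.0}},
    {"query": "b", "vector": [0.0, 1.0], "intent": {"fruit": 1.0}},
    {"query": "c", "vector": [3.0, 4.0], "intent": {"phone": 0.5, "fruit": 0.5}},
]
hits = search_index(index, [1.0, 0.0], theta=0.5, k=2)
```

Here record "a" matches exactly, record "c" passes the threshold with a lower similarity, and record "b" is filtered out. An empty list is returned when no record reaches `theta`.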
Step 4, online stage: fusing the retrieval results returned in the previous step and calculating the intent recognition result of the online Query
The final prediction result is calculated according to three cases of the retrieval result:
1) The retrieval result is not empty, and the record with the greatest similarity, $rec_a$, has similarity $sim_a = 1$. This means the online Query and $q_a$ are identical. Therefore, $\mathrm{intent}_a$ is taken as the final prediction result.
2) The retrieval result is not empty, and all similarities are less than 1. This means the historical Queries in the retrieval result are highly similar to the online Query but not identical. In this case, the final prediction result is obtained by a weighted calculation: [formula not reproduced in this text]
3) The retrieval result is empty. This means the intent index contains no historical Query semantically similar to the online Query. In this case, the user input is recognized as a general-intent Query, i.e. a search with no explicit intent.
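The three cases can be sketched in Python as follows. The weighting formula for the second case does not survive in this text, so the sketch assumes a similarity-weighted average of the retrieved intent distributions, normalized by the sum of similarities; this is one plausible reading, not the patent's confirmed formula. The function name `fuse_intents` and the use of `None` to signal a general-intent Query are likewise assumptions.

```python
import math
from collections import defaultdict

def fuse_intents(hits):
    """Fuse retrieved records into one intent distribution.

    hits: list of (similarity, record) pairs sorted by descending similarity,
    where record["intent"] maps category -> probability.
    Returns None for a general-intent Query (empty retrieval result).
    """
    if not hits:                       # case 3: nothing similar in the index
        return None
    best_sim, best_rec = hits[0]
    if math.isclose(best_sim, 1.0, abs_tol=1e-9):
        return dict(best_rec["intent"])  # case 1: identical historical Query
    # Case 2 (ASSUMED formula): similarity-weighted average of distributions.
    total = sum(sim for sim, _ in hits)
    fused = defaultdict(float)
    for sim, rec in hits:
        for category, prob in rec["intent"].items():
            fused[category] += sim * prob / total
    return dict(fused)

hits = [
    (0.9, {"intent": {"phone": 0.8, "fruit": 0.2}}),
    (0.6, {"intent": {"fruit": 1.0}}),
]
fused = fuse_intents(hits)
```

Under this assumed weighting, the fused probabilities still sum to 1, and records closer to the online Query contribute more to the result.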
Note that the semantic intent index contains long-term historical statistics, so the third case is very rare. If the same or similar new Queries keep appearing, the invention initially recognizes them as general-intent Queries. During that time, the types of documents returned by the search are not restricted and include both types the user wants and types the user does not want. As the cumulative number of occurrences grows and a certain amount of search click behavior accumulates, the new Query is added to the semantic intent index when the search semantic model and the statistical intents are periodically updated. The specific steps for adding a new Query to the semantic intent index are as follows:
The number of recognitions a, the number of clicks b, and the number of search click behaviors c of the general-intent Query are recorded, and the update value $Q_u$ of the general-intent Query is calculated: [formula not reproduced in this text]
where $\alpha$ and $\beta$ are preset weight parameters, a1 and a2 are preset recognition-count thresholds, and b1 is a preset click-count threshold. When the update value $Q_u$ of the general-intent Query exceeds the update threshold Q1, the general-intent Query is added to the semantic intent index when the search semantic model and the statistical intents are periodically updated.
The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then "X employs A or B" is satisfied under any of the foregoing instances.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present disclosure includes all such modifications and alterations, and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for a given or particular application. Furthermore, to the extent that the terms "includes," "has," "contains," or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising."
Each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module can be implemented in hardware or as a software functional module. The integrated module, if implemented as a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Each apparatus or system described above may execute the method in the corresponding method embodiment.
In summary, the above-mentioned embodiment is an implementation manner of the present invention, but the implementation manner of the present invention is not limited by the above-mentioned embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements which are included in the protection scope of the present invention.
Claims (9)
1. An intent recognition method based on a semantic index, characterized by comprising the following steps:
training a search semantic model on user search data, wherein the training data comprises both the search term sequence of each single user and the search term sequence of each single document;
aggregating the logs generated by the search engine, and creating Query-intent statistical entries for each historical Query whose frequency of occurrence exceeds a threshold; inputting the historical Query into the search semantic model to obtain a semantic vector; adding the semantic vector to the statistical entry to obtain the semantic intent index;
inputting the online user Query into the search semantic model to obtain a Query semantic vector v; searching the semantic intent index with the semantic vector v of the online user Query, and obtaining the records in the semantic intent index that are most semantically similar to the online Query, together with the corresponding semantic similarities;
and fusing the retrieval results to calculate the intent recognition result of the online Query.
2. The semantic-index-based intent recognition method of claim 1, wherein the search semantic model uses a neural network model of Word2vec or FastText.
3. The semantic-index-based intent recognition method of claim 1, wherein a domain lexicon is used preferentially when the Query is segmented into words, and if no domain lexicon exists, word-level n-gram features are extracted on the basis of a general lexicon.
4. The semantic-index-based intent recognition method of claim 1, wherein the Query-intent statistical entry is $\langle q_i, \mathrm{intent}_i \rangle$, where $\mathrm{intent}_i = \{\mathrm{category}_1 : \mathrm{prob}_1, \mathrm{category}_2 : \mathrm{prob}_2, \dots\}$, $q_i$ is the i-th Query, $\mathrm{intent}_i$ is the i-th intent, $\mathrm{category}_1$ and $\mathrm{category}_2$ are the 1st and 2nd intent categories, and $\mathrm{prob}_1$ and $\mathrm{prob}_2$ are the probabilities of the 1st and 2nd intent categories.
5. The semantic-index-based intent recognition method of claim 4, wherein each record of the semantic intent index has the form $\langle q_i, v_i, \mathrm{intent}_i \rangle$.
6. The semantic-index-based intent recognition method of claim 1, wherein the semantic intent index is searched with the semantic vector v of the online user Query using cosine similarity.
7. The semantic-index-based intent recognition method of claim 1, wherein a similarity threshold $\theta$ is set, and the k records most semantically similar to the online Query, $\{rec_i \mid i = 1, 2, \dots, k\}$, and the corresponding semantic similarities $\{sim_i \mid i = 1, 2, \dots, k\}$ are obtained accordingly.
8. The semantic-index-based intent recognition method of claim 7, wherein the final prediction result is calculated from the retrieval results as follows:
if the retrieval result is not empty and the record with the greatest similarity, $rec_a$, has similarity $sim_a = 1$, the intent $\mathrm{intent}_a$ of that record is taken as the final prediction result;
if the retrieval result is not empty and all similarities are less than 1, the final prediction result is obtained by a weighted calculation [formula not reproduced in this text], where k is the number of records most semantically similar to the online Query;
if the retrieval result is empty, the user input is recognized as a general-intent Query, i.e. a search without explicit intent.
9. The semantic-index-based intent recognition method of claim 8, wherein the number of recognitions a, the number of clicks b, and the number of search click behaviors c of the general-intent Query are recorded, and the update value $Q_u$ of the general-intent Query is calculated: [formula not reproduced in this text]
where $\alpha$ and $\beta$ are preset weight parameters, a1 and a2 are preset recognition-count thresholds, and b1 is a preset click-count threshold. When the update value $Q_u$ of the general-intent Query exceeds the update threshold Q1, the general-intent Query is added to the semantic intent index when the search semantic model and the statistical intents are periodically updated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210223886.0A CN114595305A (en) | 2022-03-09 | 2022-03-09 | Intention identification method based on semantic index |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114595305A (en) | 2022-06-07 |
Family
ID=81807274
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210223886.0A (Pending; CN114595305A (en)) | Intention identification method based on semantic index | 2022-03-09 | 2022-03-09 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114595305A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117312500A (en) * | 2023-11-30 | 2023-12-29 | 山东齐鲁壹点传媒有限公司 | Semantic retrieval model building method based on ANN and BERT |
CN117312500B (en) * | 2023-11-30 | 2024-02-27 | 山东齐鲁壹点传媒有限公司 | Semantic retrieval model building method based on ANN and BERT |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||