CN114595305A - Intention identification method based on semantic index - Google Patents

Intention identification method based on semantic index

Info

Publication number
CN114595305A
Authority
CN
China
Prior art keywords
semantic
query
search
intent
intention
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210223886.0A
Other languages
Chinese (zh)
Inventor
高航
胡毅
曹梦华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan Xingsheng Optimization Network Technology Co ltd
Original Assignee
Hunan Xingsheng Optimization Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan Xingsheng Optimization Network Technology Co ltd
Priority to CN202210223886.0A
Publication of CN114595305A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31Indexing; Data structures therefor; Storage structures
    • G06F16/316Indexing structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/194Calculation of difference between files
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intention identification method based on a semantic index. A search semantic model is trained on user search data. Logs generated by a search engine are counted, and Query-intent statistical entries are established for historical Queries whose frequency of occurrence exceeds a certain threshold. Each historical Query is input into the search semantic model to obtain a semantic vector, and the semantic vector is added to the statistical entry to obtain a semantic intent index. An online user Query is input into the search semantic model to obtain a Query semantic vector v. The semantic intent index is searched with v to obtain the records most semantically similar to the online Query, together with the corresponding semantic similarities, and the intent recognition result of the online Query is calculated from them. The invention realizes intent recognition by combining semantic matching with intent statistics, and has better generalization and higher accuracy and recall than existing mainstream methods.

Description

Intention identification method based on semantic index
Technical Field
The invention belongs to the technical field of search, and particularly relates to an intention identification method based on semantic indexing.
Background
Intelligent search analyzes the user input in the Query understanding stage, and then uses the analysis results to retrieve high-quality content from the content library and to rank the results reasonably. Intent recognition is a key technology in Query understanding. In a mature search engine, the content library usually contains multiple types of documents. In a single search request, a user is mostly interested in one specific document or in documents of certain types, not in all types. The purpose of intent recognition is to predict, from the search term input by the user (i.e. the Query), the types of documents the user wishes to retrieve and the corresponding distribution of intent strength. On the one hand, good intent recognition narrows the scope of document retrieval and makes the retrieval results more accurate; on the other hand, it provides an important basis for ranking the retrieval results. The quality of the ranking directly affects user satisfaction.
Currently, there are two mainstream types of intent recognition methods. The first is based on dictionaries and rules. User search click logs are mined offline to collect statistics of the form <historical Query, category distribution of clicked documents>. Online, the user Query is matched against the historical Queries by dictionary lookup, and the category distribution of the matched historical Query is taken as the intent recognition result of the online Query. The second is based on text classification, in which intent recognition is treated as a text classification prediction task. The classification model is typically a multi-label text classification model such as SVM, TextCNN, or FastText. The training data has the form <historical Query, type of clicked document>.
The main problems of the existing methods are as follows. 1) Because the first method is based on statistics of users' search click behavior, it accurately reflects the document category distribution corresponding to a search term, but dictionary lookup lacks generalization: when the user's online input cannot be matched to any historical Query, intent recognition returns no result. 2) The second method generalizes better, but high accuracy and recall are difficult to achieve. Especially in scenarios with many document types, a multi-label classification model struggles to predict accurately and cannot provide a reasonable intent distribution from its prediction probabilities. Therefore, in practical applications, both methods must be supplemented by a large amount of manual review and operational configuration.
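The generalization gap of the first (dictionary-based) method can be illustrated with a minimal Python sketch. The table contents and the function name `lookup_intent` are hypothetical and chosen only for illustration; they do not come from the source.

```python
# Hypothetical offline statistics:
# historical Query -> category distribution of the clicked documents.
intent_dict = {
    "apple": {"fruit": 0.7, "phone": 0.3},
    "milk": {"dairy": 1.0},
}


def lookup_intent(query, table):
    """Exact-match dictionary lookup; returns None if the Query was never seen."""
    return table.get(query)


# A Query that appears in the history returns its stored distribution.
print(lookup_intent("apple", intent_dict))
# A semantically close but unseen Query returns nothing at all.
print(lookup_intent("fresh apple", intent_dict))
```

The second call shows the problem described above: an unseen Query yields no intent, however close it is in meaning to a historical Query.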
Disclosure of Invention
In view of the above, the invention provides an intention identification method based on a semantic intent index. In the offline stage, the historical search Queries and the types of documents clicked after each search are counted to build the statistical intent; a search semantic model is then built with Word2vec, and a semantic intent index is constructed. In the online stage, the search semantic model is used to obtain the semantic vector of the user Query; the semantic intent index is searched by semantic vector retrieval to obtain the top k records; and the intent distributions in the top k records are fused to obtain the final intent recognition result.
The invention discloses an intention identification method based on semantic indexing, which comprises the following steps:
training a search semantic model on user search data, wherein the training data comprises both the search term sequence of each single user and the search term sequence of each single document;
counting the logs generated by a search engine, and establishing Query-intent statistical entries for historical Queries whose frequency of occurrence exceeds a certain threshold; inputting each historical Query into the search semantic model to obtain a semantic vector; adding the semantic vector to the statistical entry to obtain a semantic intent index;
inputting an online user Query into the search semantic model to obtain a Query semantic vector v; searching the semantic intent index with the semantic vector v of the online user Query to obtain the plurality of records in the semantic intent index that are most semantically similar to the online Query, together with the corresponding semantic similarities;
and fusing the retrieval results to calculate the intent recognition result of the online Query.
Further, the search semantic model uses a Word2vec or FastText neural network model.
Further, when a Query is segmented into words, a domain lexicon is used preferentially; if no domain lexicon exists, word-level n-gram features are extracted on the basis of a general lexicon.
Further, the statistical entries of Query and intent have the form $\langle q_i, \mathrm{intent}_i \rangle$, where $\mathrm{intent}_i = \{\text{category1}: \text{prob1},\ \text{category2}: \text{prob2},\ \dots\}$, $q_i$ is the i-th Query, $\mathrm{intent}_i$ is the i-th intent, category1 and category2 are the 1st and 2nd intent categories, and prob1 and prob2 are the probabilities of the 1st and 2nd intent categories.
Further, each record of the semantic intent index has the form $\langle q_i, v_i, \mathrm{intent}_i \rangle$.
Further, the semantic intent index is searched with the semantic vector v of the online user Query using the cosine similarity method.
Further, a similarity threshold $\theta$ is set, and according to it the k records $\{\mathrm{rec}_i \mid i = 1, 2, \dots, k\}$ most semantically similar to the online Query and the corresponding semantic similarities $\{\mathrm{sim}_i \mid i = 1, 2, \dots, k\}$ are obtained.
Further, the final prediction result is calculated from the retrieval results as follows:
if the retrieval result is not empty and the record with the maximum similarity, $\mathrm{rec}_a$, has similarity $\mathrm{sim}_a = 1$, the intent $\mathrm{intent}_a$ of that record is taken as the final prediction result;
if the retrieval result is not empty and all the similarities are less than 1, the final prediction result is obtained through a weighted calculation:
Figure BDA0003538493380000031
where k is the number of records most semantically similar to the online Query;
if the retrieval result is empty, the user input is recognized as a generalized intent Query, i.e. a search without explicit intent.
Furthermore, the number of recognitions a, the number of clicks b, and the number of search click behaviors c of the generalized intent Query are recorded, and the update value $Q_u$ of the generalized intent Query is calculated:
Figure BDA0003538493380000032
where $\alpha$ and $\beta$ are preset weight parameters, a1 and a2 are preset recognition-count thresholds, and b1 is a preset click-count threshold. When the update value $Q_u$ of the generalized intent Query exceeds the update threshold Q1, the generalized intent Query is added to the semantic intent index when the search semantic model and the statistical intent are periodically updated.
The invention has the following beneficial effects:
The invention realizes intent recognition by combining semantic matching with intent statistics. Compared with existing mainstream methods, it has better generalization and higher accuracy and recall, thereby supporting subsequent search recall and result ranking.
Drawings
FIG. 1 is a flow chart of an intent recognition method of the present invention;
FIG. 2 is a flow diagram of the present invention for constructing a semantic intent index offline;
FIG. 3 is a flow diagram of the online prediction of the present invention.
Detailed Description
The invention is further described with reference to the accompanying drawings, but the invention is not limited in any way, and any alterations or substitutions based on the teaching of the invention are within the scope of the invention.
The invention discloses an intention identification method based on semantic indexes, which comprises the following steps:
Step 1, offline stage: constructing the search semantic model
The search semantic model uses a neural network model such as Word2vec. The training data comprises two corpora: 1) with each user as a group, the search terms of that user are concatenated into a sequence; 2) with each document clicked in search as a group, the search terms that led to that document are concatenated into a sequence. When a Query is segmented into words, a domain lexicon is preferably used. If no domain lexicon exists, word-level n-gram features can be extracted on the basis of a general lexicon. To guarantee the quality of the semantic model, two points must be observed: 1) So that the model covers search semantics over a long period, the training corpus is extracted from search logs spanning a long period; for e-commerce search, for example, this is usually one year. 2) So that the model learns the semantics of new Queries in time, the update period of the semantic model should not be too long; for e-commerce search, the model is typically updated once a week.
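The two training corpora described in Step 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the log format `(user_id, query_tokens, clicked_doc_id)` and the function name `build_corpora` are hypothetical, and the Queries are assumed to be already segmented into words.

```python
from collections import defaultdict


def build_corpora(click_log):
    """Build the two kinds of training sequences from a search click log.

    click_log: iterable of (user_id, query_tokens, clicked_doc_id) tuples in
    time order (an assumed format). Returns a list of token sequences: one
    sequence per user, followed by one sequence per clicked document.
    """
    by_user = defaultdict(list)  # corpus 1: search terms grouped by user
    by_doc = defaultdict(list)   # corpus 2: search terms grouped by clicked document
    for user_id, tokens, doc_id in click_log:
        by_user[user_id].extend(tokens)
        if doc_id is not None:   # searches without a click only feed corpus 1
            by_doc[doc_id].extend(tokens)
    return list(by_user.values()) + list(by_doc.values())


log = [
    ("u1", ["red", "apple"], "d1"),
    ("u2", ["apple"], "d1"),
    ("u1", ["milk"], "d2"),
]
corpus = build_corpora(log)
print(corpus)
```

The resulting sequences could then be passed to a Word2vec trainer, for example gensim's `Word2Vec(sentences=corpus, vector_size=100, min_count=1)`; the library choice and parameter values are illustrative and are not specified by the source.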
Step 2, offline stage: constructing the semantic intent index
First, the logs generated by the search engine are counted. For each historical Query whose frequency of occurrence exceeds a certain threshold, a Query-intent statistical entry is established. The statistical entries have the form $\langle q_i, \mathrm{intent}_i \rangle$, where $\mathrm{intent}_i = \{\text{category1}: \text{prob1},\ \text{category2}: \text{prob2},\ \dots\}$. Then, each historical Query is input into the search semantic model to obtain a semantic vector. The semantic vector is added to the statistical entry to obtain the semantic intent index, in which each record has the form $\langle q_i, v_i, \mathrm{intent}_i \rangle$. The semantic intent index must be updated after each update of the semantic model.
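A minimal Python sketch of the index construction in Step 2 follows. The log format `(query, clicked_category)`, the function name, the parameter `freq_threshold`, and the `embed` callable (standing in for the search semantic model) are all assumptions made for illustration. The intent distribution is assumed here to be the relative click frequency of each document category.

```python
from collections import Counter, defaultdict


def build_semantic_intent_index(click_log, embed, freq_threshold=2):
    """Build records (q_i, v_i, intent_i) from a search click log.

    click_log: iterable of (query, clicked_category) pairs (assumed format).
    embed: hypothetical callable mapping a Query string to a vector.
    freq_threshold: only Queries occurring MORE often than this are indexed.
    """
    query_freq = Counter()
    category_counts = defaultdict(Counter)
    for query, category in click_log:
        query_freq[query] += 1
        category_counts[query][category] += 1

    index = []
    for query, freq in query_freq.items():
        if freq <= freq_threshold:
            continue  # too rare: statistics would be unreliable
        total = sum(category_counts[query].values())
        # Assumption: intent_i is the relative click frequency per category.
        intent = {c: n / total for c, n in category_counts[query].items()}
        index.append((query, embed(query), intent))
    return index


log = [("apple", "fruit")] * 3 + [("apple", "phone"), ("pear", "fruit")]
toy_embed = lambda q: [float(len(q))]  # placeholder for the semantic model
index = build_semantic_intent_index(log, toy_embed, freq_threshold=2)
print(index)
```

Because each record stores a vector produced by the current model, rebuilding the index after every model update keeps the stored vectors comparable with the vectors of online Queries.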
Step 3, online stage: constructing the semantic vector of the user Query and searching the semantic intent index
The online user Query is input into the search semantic model to obtain the Query semantic vector v. The cosine similarity of the semantic vectors of two Queries is a decimal between 0 and 1 and reflects the semantic similarity of the two Queries. The semantic intent index is searched with v using the cosine similarity method. A relatively high similarity threshold $\theta$ is set, and according to it the k records $\{\mathrm{rec}_i \mid i = 1, 2, \dots, k\}$ most semantically similar to the online Query and the corresponding semantic similarities $\{\mathrm{sim}_i \mid i = 1, 2, \dots, k\}$ are obtained.
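The retrieval in Step 3 can be sketched as a brute-force cosine similarity scan in Python. The function names, the default values of `theta` and `k`, and the choice to keep records whose similarity is greater than or equal to the threshold are assumptions; the source does not state how the index is scanned or whether the comparison is strict.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(v, index, theta=0.8, k=3):
    """Return up to k (similarity, record) pairs with similarity >= theta, best first.

    index: list of records (q_i, v_i, intent_i).
    """
    scored = [(cosine(v, rec[1]), rec) for rec in index]
    kept = [(s, rec) for s, rec in scored if s >= theta]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return kept[:k]


index = [
    ("apple", [1.0, 0.0], {"fruit": 1.0}),
    ("pear", [0.6, 0.8], {"fruit": 1.0}),
    ("phone", [0.0, 1.0], {"digital": 1.0}),
]
hits = retrieve([1.0, 0.0], index, theta=0.5, k=2)
print([(round(s, 3), rec[0]) for s, rec in hits])
```

A linear scan is enough to show the idea; for a large index, an approximate nearest-neighbor library would be a natural substitute, although the source does not name one.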
Step 4, online stage: fusing the retrieval results returned in the previous step and calculating the intent recognition result of the online Query
The final prediction result is calculated according to three cases of the retrieval result:
1) The retrieval result is not empty, and the record with the maximum similarity, $\mathrm{rec}_a$, has similarity $\mathrm{sim}_a = 1$. This means that the online Query and $q_a$ are identical. Therefore, $\mathrm{intent}_a$ is taken as the final prediction result.
2) The retrieval result is not empty, and all the similarities are less than 1. This means that the historical Queries in the retrieval result are highly similar to the online Query but not identical to it. In this case, the final prediction result is obtained through a weighted calculation:
Figure BDA0003538493380000051
3) The retrieval result is empty, which means that the semantic intent index contains no historical Query semantically similar to the online Query. In this case, the user input is recognized as a generalized intent Query, i.e. a search with no explicit intent.
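The three cases can be sketched in Python as follows. The weighting formula of case 2 appears in the source only as a figure that is not reproduced in this text, so the similarity-normalized weighted average used below is an assumption and not necessarily the patented formula. The function name `fuse_intents`, the input format, and the tolerance `eps` are likewise hypothetical.

```python
def fuse_intents(hits, eps=1e-9):
    """Fuse retrieval results into one intent distribution.

    hits: list of (similarity, intent_dict) pairs sorted by similarity, best first.
    Returns an intent distribution, or None for a generalized intent Query.
    """
    if not hits:
        return None  # case 3: nothing similar enough in the index
    best_sim, best_intent = hits[0]
    if best_sim >= 1.0 - eps:
        return dict(best_intent)  # case 1: identical historical Query
    # Case 2 (ASSUMED weighting): average the intent distributions, each
    # weighted by its similarity and normalized by the sum of similarities.
    total = sum(sim for sim, _ in hits)
    fused = {}
    for sim, intent in hits:
        for category, prob in intent.items():
            fused[category] = fused.get(category, 0.0) + sim * prob / total
    return fused


exact = fuse_intents([(1.0, {"fruit": 1.0}), (0.9, {"phone": 1.0})])
mixed = fuse_intents([(0.75, {"fruit": 1.0}), (0.25, {"phone": 1.0})])
print(exact, mixed, fuse_intents([]))
```

Under this assumed weighting the fused probabilities still sum to 1, so the output can be read as an intent distribution in the same form as the indexed entries.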
It should be noted that the semantic intent index contains long-term historical statistics, so the third case is very rare. If an identical or similar new Query keeps appearing, the invention initially recognizes it as a generalized intent Query. At this stage, the types of documents returned by the search are not restricted, so the results include both types the user wants and types the user does not want. As the cumulative number of occurrences grows and a certain amount of search click behavior accumulates, the new Query is added to the semantic intent index when the search semantic model and the statistical intent are periodically updated. The specific steps for adding a new Query to the semantic intent index are as follows:
The number of recognitions a, the number of clicks b, and the number of search click behaviors c of the generalized intent Query are recorded, and the update value $Q_u$ of the generalized intent Query is calculated:
Figure BDA0003538493380000061
where $\alpha$ and $\beta$ are preset weight parameters, a1 and a2 are preset recognition-count thresholds, and b1 is a preset click-count threshold. When the update value $Q_u$ of the generalized intent Query exceeds the update threshold Q1, the generalized intent Query is added to the semantic intent index when the search semantic model and the statistical intent are periodically updated.
The word "preferred" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "preferred" is not necessarily to be construed as advantageous over other aspects or designs. Rather, use of the word "preferred" is intended to present concepts in a concrete fashion. The term "or" as used in this application is intended to mean an inclusive "or" rather than an exclusive "or". That is, unless specified otherwise or clear from context, "X employs A or B" is intended to mean any of the natural inclusive permutations. That is, if X employs A; X employs B; or X employs both A and B, then "X employs A or B" is satisfied under any of the foregoing instances.
Also, although the disclosure has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present disclosure includes all such modifications and alterations, and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components (e.g., elements, etc.), the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the disclosure. In addition, while a particular feature of the disclosure may have been disclosed with respect to only one of several implementations, such feature may be combined with one or more other features of the other implementations as may be desired and advantageous for any given or particular application. Furthermore, to the extent that the terms "includes," "has," "contains," or variants thereof are used in either the detailed description or the claims, such terms are intended to be inclusive in a manner similar to the term "comprising."
Each functional unit in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units may be integrated into one module. The integrated module can be implemented in the form of hardware or in the form of a software functional module. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium. The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like. Each apparatus or system described above may execute the method of the corresponding method embodiment.
In summary, the above-mentioned embodiment is an implementation manner of the present invention, but the implementation manner of the present invention is not limited by the above-mentioned embodiment, and any other changes, modifications, substitutions, combinations, and simplifications which do not depart from the spirit and principle of the present invention should be regarded as equivalent replacements which are included in the protection scope of the present invention.

Claims (9)

1. An intention identification method based on a semantic index, characterized by comprising the following steps:
training a search semantic model on user search data, wherein the training data comprises both the search term sequence of each single user and the search term sequence of each single document;
counting the logs generated by a search engine, and establishing Query-intent statistical entries for historical Queries whose frequency of occurrence exceeds a certain threshold; inputting each historical Query into the search semantic model to obtain a semantic vector; adding the semantic vector to the statistical entry to obtain a semantic intent index;
inputting an online user Query into the search semantic model to obtain a Query semantic vector v; searching the semantic intent index with the semantic vector v of the online user Query to obtain the plurality of records in the semantic intent index that are most semantically similar to the online Query, together with the corresponding semantic similarities;
and fusing the retrieval results to calculate the intent recognition result of the online Query.
2. The semantic-index-based intention identification method of claim 1, wherein the search semantic model uses a Word2vec or FastText neural network model.
3. The semantic-index-based intention identification method of claim 1, wherein, when a Query is segmented into words, a domain lexicon is used preferentially, and if no domain lexicon exists, word-level n-gram features are extracted on the basis of a general lexicon.
4. The semantic-index-based intention identification method of claim 1, wherein the statistical entries of Query and intent have the form $\langle q_i, \mathrm{intent}_i \rangle$, where $\mathrm{intent}_i = \{\text{category1}: \text{prob1},\ \text{category2}: \text{prob2},\ \dots\}$, $q_i$ is the i-th Query, $\mathrm{intent}_i$ is the i-th intent, category1 and category2 are the 1st and 2nd intent categories, and prob1 and prob2 are the probabilities of the 1st and 2nd intent categories.
5. The semantic-index-based intention identification method of claim 4, wherein each record of the semantic intent index has the form $\langle q_i, v_i, \mathrm{intent}_i \rangle$.
6. The semantic-index-based intention identification method of claim 1, wherein the semantic intent index is searched with the semantic vector v of the online user Query using the cosine similarity method.
7. The semantic-index-based intention identification method of claim 1, wherein a similarity threshold $\theta$ is set, and according to it the k records $\{\mathrm{rec}_i \mid i = 1, 2, \dots, k\}$ most semantically similar to the online Query and the corresponding semantic similarities $\{\mathrm{sim}_i \mid i = 1, 2, \dots, k\}$ are obtained.
8. The semantic-index-based intention identification method of claim 7, wherein the final prediction result is calculated from the retrieval results as follows:
if the retrieval result is not empty and the record with the maximum similarity, $\mathrm{rec}_a$, has similarity $\mathrm{sim}_a = 1$, the intent $\mathrm{intent}_a$ of that record is taken as the final prediction result;
if the retrieval result is not empty and all the similarities are less than 1, the final prediction result is obtained through a weighted calculation:
Figure FDA0003538493370000021
where k is the number of records most semantically similar to the online Query;
if the retrieval result is empty, the user input is recognized as a generalized intent Query, i.e. a search without explicit intent.
9. The semantic-index-based intention identification method of claim 8, characterized in that the number of recognitions a, the number of clicks b, and the number of search click behaviors c of the generalized intent Query are recorded, and the update value $Q_u$ of the generalized intent Query is calculated:
Figure FDA0003538493370000022
where $\alpha$ and $\beta$ are preset weight parameters, a1 and a2 are preset recognition-count thresholds, and b1 is a preset click-count threshold. When the update value $Q_u$ of the generalized intent Query exceeds the update threshold Q1, the generalized intent Query is added to the semantic intent index when the search semantic model and the statistical intent are periodically updated.
CN202210223886.0A 2022-03-09 2022-03-09 Intention identification method based on semantic index Pending CN114595305A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210223886.0A CN114595305A (en) 2022-03-09 2022-03-09 Intention identification method based on semantic index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210223886.0A CN114595305A (en) 2022-03-09 2022-03-09 Intention identification method based on semantic index

Publications (1)

Publication Number Publication Date
CN114595305A true CN114595305A (en) 2022-06-07

Family

ID=81807274

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210223886.0A Pending CN114595305A (en) 2022-03-09 2022-03-09 Intention identification method based on semantic index

Country Status (1)

Country Link
CN (1) CN114595305A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312500A (en) * 2023-11-30 2023-12-29 山东齐鲁壹点传媒有限公司 Semantic retrieval model building method based on ANN and BERT

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117312500A (en) * 2023-11-30 2023-12-29 山东齐鲁壹点传媒有限公司 Semantic retrieval model building method based on ANN and BERT
CN117312500B (en) * 2023-11-30 2024-02-27 山东齐鲁壹点传媒有限公司 Semantic retrieval model building method based on ANN and BERT

Similar Documents

Publication Publication Date Title
US11790006B2 (en) Natural language question answering systems
US11442932B2 (en) Mapping natural language to queries using a query grammar
US20230350959A1 (en) Systems and methods for improved web searching
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN112069298A (en) Human-computer interaction method, device and medium based on semantic web and intention recognition
CN110765277B (en) Knowledge-graph-based mobile terminal online equipment fault diagnosis method
CN101542475A (en) System and method for searching and matching data having ideogrammatic content
KR20040013097A (en) Category based, extensible and interactive system for document retrieval
US20100312778A1 (en) Predictive person name variants for web search
US20100191758A1 (en) System and method for improved search relevance using proximity boosting
CN114911917B (en) Asset meta-information searching method and device, computer equipment and readable storage medium
CN115309872B (en) Multi-model entropy weighted retrieval method and system based on Kmeans recall
Delpeuch A survey of OpenRefine reconciliation services
US11151317B1 (en) Contextual spelling correction system
CN114595305A (en) Intention identification method based on semantic index
Navarro et al. Matchsimile: a flexible approximate matching tool for searching proper names
CN111859066B (en) Query recommendation method and device for operation and maintenance work order
CN114579729A (en) FAQ question-answer matching method and system fusing multi-algorithm model
CN116610853A (en) Search recommendation method, search recommendation system, computer device, and storage medium
CN113688633A (en) Outline determination method and device
CN113076740A (en) Synonym mining method and device in government affair service field
CN112507687A (en) Work order retrieval method based on secondary sorting
CN110991862A (en) Network management system for enterprise wind control analysis and control method thereof
CN116932487B (en) Quantized data analysis method and system based on data paragraph division
Bagheri et al. Sentiment miner: a novel unsupervised framework for aspect detection from customer reviews

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination