CN112035730A - Semantic retrieval method and device and electronic equipment

Info

Publication number: CN112035730A
Application number: CN202011221206.9A
Authority: CN
Inventors: 周阳; 钱泓锦; 刘占亮; 窦志成
Original assignee: Beijing Zhiyuan Artificial Intelligence Research Institute
Current assignee: Beijing Zhiyuan Artificial Intelligence Research Institute
Priority date: 2020-11-05
Filing date: 2020-11-05
Publication date: 2020-12-04
Anticipated expiration: 2040-11-05
Also published as: CN112035730B

Abstract

The invention discloses a semantic retrieval method, a semantic retrieval device and electronic equipment. The method comprises the following steps: receiving query information sent by a user; correcting the text in the query information to obtain a corrected text; performing user intention analysis on the corrected text, and determining a first score of the identified user intention; for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to the relevancy; for common question answers, retrieving based on pre-constructed vectorized FAQ question-answer pairs to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to the relevancy; and sorting the candidate answers according to the first score, the second score and the third score to obtain answers. Compared with keyword-based search, the method can better meet the query requirements of the user.

Description

Semantic retrieval method and device and electronic equipment

Technical Field

The invention relates to the technical field of information processing, in particular to a semantic retrieval method and device and electronic equipment.

Background

In the mass of information on the internet, users usually have to retrieve the information they need through a search engine. However, current search engines retrieve poorly: the user still has to screen a large number of returned webpages, so the need for convenient and fast retrieval is not met. This has given rise to intelligent services that digitize information by intelligent means, but it remains difficult to mine the associations between data, so much of the data is used inefficiently.

Most existing search engines rely on traditional techniques such as keyword matching, PageRank and inverted indexes. To meet users' query requirements as far as possible, the user query is often lexically analyzed (word segmentation, part-of-speech tagging, named entity recognition and the like) and the results are then combined into a query. Although this improves the query effect, it stops at shallow semantic analysis and cannot understand the user's query intention.

In knowledge-graph-based retrieval and question-answering systems, most queries are based on simple facts, namely one-hop queries; more complicated multi-hop queries often obtain poor retrieval results or fail to return any results.

Disclosure of Invention

The invention provides a semantic retrieval method, a semantic retrieval device and electronic equipment, which can effectively solve the problems that the existing retrieval method cannot understand the query intention and the query effect cannot meet the requirements of users.

A semantic retrieval method comprising:

receiving query information sent by a user;

correcting the text in the query information to obtain a corrected text;

performing user intention analysis on the corrected text based on a question template library, and determining a first score of the identified user intention, wherein the user intention comprises simple fact question answers and common question answers;

for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to the relevancy;

for common question answers, retrieving based on a pre-constructed vectorized FAQ question-answer pair to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to the relevancy;

and sequencing the candidate answers according to the first score, the second score and the third score to obtain answers.

Further, correcting the text in the query information to obtain a corrected text, including:

segmenting the text with a Chinese word segmenter, and carrying out error detection at character granularity and word granularity to generate a suspected error position candidate set;

traversing all suspected error positions, searching a pre-stored dictionary for phonetically similar and visually similar words to replace the words at the suspected error positions, and calculating sentence perplexity through a language model;

sorting the replacement results according to the sentence perplexity calculation result to obtain an optimal corrected word;

and generating the corrected text according to the optimal corrected word.

Further, for the simple fact question-answering, retrieving based on a pre-constructed knowledge graph, and obtaining a first candidate answer set comprises:

extracting entity information, relationship information and attribute information from the corrected text, and using a synonym dictionary to link the entity information, relationship information or attribute information to the corresponding entity, relationship or attribute in the knowledge graph to generate an SQL query statement;

and filling the extracted words into the corresponding slots of the SQL query statement, and executing the query to obtain a first candidate answer set.

Further, for the common question answers, retrieving based on a pre-constructed vectorized FAQ question-answer pair, and obtaining a second candidate answer set includes:

and performing text vectorization on the correction text, searching similar vectors from the vectorized FAQ question-answer pair, obtaining corresponding answers, and generating a second candidate answer set.

Further, finding similar vectors from the vectorized FAQ question-answer pair comprises:

calculating the similarity between the vectorized correction text and the questions in the vectorized FAQ question-answer pair, and returning the answers corresponding to the questions with the highest similarity; and/or

calculating the similarity between the vectorized correction text and the answer in the vectorized FAQ question-answer pair, and returning the answer with the highest similarity.

Further, ranking the candidate answers according to the first score, the second score, and the third score to obtain answers includes:

weighting and summing the first score and the second score of the simple fact question-answer to obtain a fourth score of each candidate answer in the first candidate answer set;

weighting and summing the first score and the third score of the common question answer to obtain a fifth score of each candidate answer in a second candidate answer set;

sorting all the candidate answers according to the fourth score and the fifth score, and selecting the answer with the highest sorting;

and generating an answer feedback to the user according to the selected answer and the answer template.

Further, the question template library is pre-constructed in the following way:

collecting historical user query information, and constructing the question template library according to the user query information;

the vectorized FAQ question-answer pair is pre-constructed in the following way:

collecting common questions of a user, making standard answers, and vectorizing the common questions and the standard answers to obtain the vectorized FAQ question-answer pairs.

A semantic retrieval apparatus comprising:

the receiving module is used for receiving query information sent by a user;

the error correction module is used for correcting the text in the query information to obtain a corrected text;

an intent determination module for performing a user intent analysis on the corrected text based on a question template library, determining a first score of the identified user intent, the user intent including simple fact question answering and common question answering;

the first retrieval module is used for, for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to the relevancy;

the second retrieval module is used for, for common question answers, retrieving based on a pre-constructed vectorized FAQ question-answer pair to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to the relevancy;

and the answer generation module is used for sequencing the candidate answers according to the first score, the second score and the third score to obtain answers.

An electronic device comprises a processor and a memory, wherein the memory stores a plurality of instructions, and the processor is used for reading the instructions and executing the semantic retrieval method.

A computer-readable storage medium having stored thereon a plurality of instructions readable by a processor and performing the semantic retrieval method described above.

The semantic retrieval method, the semantic retrieval device and the electronic equipment at least have the following beneficial effects:

(1) the natural language understanding based on the semantic level can better match the real intention of the user, improve the retrieval efficiency and accuracy, and better meet the query requirement of the user compared with the retrieval based on the key words;

(2) based on the synonym dictionary, the identified entities, attributes and relations can be given a normalized description, so that entities expressed in a non-standard or inaccurate way in the user's query sentence are normalized; this avoids failures to link an entity to the correct entity node in the knowledge graph because of its non-standard description, and improves the robustness of the knowledge-graph-based retrieval system;

(3) for non-simple fact queries such as FAQ, answers which best meet the user intention can be queried through a vectorization retrieval service at a semantic level.

Drawings

Fig. 1 is a flowchart of an embodiment of a semantic retrieval method provided in the present invention.

Fig. 2 is a flowchart of an embodiment of a text error correction method in the semantic retrieval method provided by the present invention.

Fig. 3 is a flowchart of an embodiment of a knowledge-graph-based retrieval method in the semantic retrieval method provided by the present invention.

Fig. 4 is a flowchart of an embodiment of a retrieval method based on vectorized FAQ question-answer pairs in the semantic retrieval method provided by the present invention.

Fig. 5 is a flowchart illustrating an embodiment of a method for ranking candidate answers to obtain answers in the semantic retrieval method according to the present invention.

Fig. 6 is a schematic structural diagram of an embodiment of a semantic retrieval apparatus according to the present invention.

Fig. 7 is a schematic structural diagram of an embodiment of an electronic device provided in the present invention.

Detailed Description

In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.

Referring to fig. 1, in some embodiments, there is provided a semantic retrieval method comprising:

step S101, receiving query information sent by a user;

step S102, correcting the text in the query information to obtain a corrected text;

step S103, performing user intention analysis on the corrected text based on a question template library, and determining a first score of the identified user intention, wherein the user intention comprises simple fact question answering and common question answers;

step S104, for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to the relevancy;

step S105, for common question answers, retrieving based on a pre-constructed vectorized FAQ question-answer pair to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to the relevancy;

and S106, sorting the candidate answers according to the first score, the second score and the third score to obtain answers.

The semantic retrieval method provided by the embodiment can better match the real intention of the user, improves the retrieval efficiency and accuracy, and can better meet the query requirement of the user compared with the retrieval based on the key words.

Specifically, before the above method is performed, a question template library, a knowledge graph, and vectorized FAQ (Frequently Asked Questions) question-answer pairs are constructed in advance.

The knowledge graph adopts an entity-relation-entity triple form, and can organize a large amount of discrete information in a structured way. For example, in the triple "high-tech enterprise certification - handling time - working days 09:00-12:00 and 13:30-17:00", the head entity is "high-tech enterprise certification", the tail entity is "working days 09:00-12:00 and 13:30-17:00", and the relationship between the two entities is "handling time".
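The triple form can be illustrated with a minimal sketch. It assumes an in-memory list as the store; the names `Triple`, `GRAPH` and `tails` are hypothetical and do not come from the patent, and the strings paraphrase the example above.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    head: str      # head entity
    relation: str  # relationship between the two entities
    tail: str      # tail entity


# Toy graph holding the example triple (wording paraphrased).
GRAPH: List[Triple] = [
    Triple(
        "high-tech enterprise certification",
        "handling time",
        "working days 09:00-12:00 and 13:30-17:00",
    ),
]


def tails(graph: List[Triple], head: str, relation: str) -> List[str]:
    """One-hop lookup: return every tail entity stored for (head, relation)."""
    return [t.tail for t in graph if t.head == head and t.relation == relation]
```

A one-hop query such as `tails(GRAPH, "high-tech enterprise certification", "handling time")` then returns the stored handling time, while an unknown head or relation yields an empty list.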

FAQ question-answer pairs are generally the most common questions and answers in service handling. Frequently asked user questions can be manually collected and labeled with standard answers, and the questions and corresponding answers are then vectorized with a single unified semantic vectorization model to obtain the vectorized FAQ question-answer pairs. Common vectorization schemes include BM25 and TF-IDF, as well as deep-learning semantic models such as BERT. After vectorization, a vector search tool such as Faiss or Annoy can be used for fast matching and search.

During cold start, users' habitual query phrasings, such as "what the phone of xxx is" and "where the xxx address is", may be recorded through various channels, such as a manned service window or mail. The question template library is constructed from this collected historical user query information.

In some embodiments, referring to fig. 2, in step S102, performing error correction on the text in the query information to obtain a corrected text, including:

step S1021, segmenting the text with a Chinese word segmenter, and carrying out error detection at character granularity and word granularity to generate a suspected error position candidate set;

step S1022, traversing all suspected error positions, searching a pre-stored dictionary for phonetically similar and visually similar words to replace the words at the suspected error positions, and calculating sentence perplexity through a language model;

step S1023, sorting the replacement results according to the sentence perplexity calculation result to obtain an optimal corrected word;

and step S1024, generating the corrected text according to the optimal corrected words.

Because user input may contain wrongly written characters, colloquial descriptions and non-standard terms (for example, "advanced technology enterprise" shortened to "advanced enterprise"), Chinese text error correction is required. Error correction has two main steps: error detection and error correction. In error detection, the text is first segmented by a Chinese word segmenter. Because wrongly written characters in a sentence often cause wrong segmentation, errors are detected at both character granularity and word granularity, and the suspected errors from the two granularities are merged into a suspected error position candidate set. In error correction, all suspected error positions are traversed, the word at each position is replaced with phonetically similar and visually similar words, the sentence perplexity is calculated through a language model, and the results for all candidates are compared and sorted to obtain the optimal corrected word. This text error correction method is controllable, flexible and fast, and occupies few resources.
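The detect-then-correct loop can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: a tiny add-one-smoothed bigram model stands in for the language model, a word is treated as suspect when it is absent from the toy vocabulary, English tokens replace Chinese text, and `CORPUS`, `CONFUSION`, `perplexity` and `correct` are hypothetical names.

```python
import math
from collections import Counter

# Toy training corpus for the stand-in bigram language model (assumption).
CORPUS = [
    ["apply", "for", "enterprise", "certification"],
    ["enterprise", "certification", "handling", "time"],
]
# Hypothetical dictionary of similar-looking or similar-sounding replacements.
CONFUSION = {"certifcation": ["certification", "verification"]}

PAIRS = [p for s in CORPUS for p in zip(["<s>"] + s, s)]
BIGRAMS = Counter(PAIRS)
CONTEXTS = Counter(prev for prev, _ in PAIRS)
KNOWN = {w for s in CORPUS for w in s}
VOCAB_SIZE = len(KNOWN) + 1  # +1 reserves probability mass for unknown words


def perplexity(tokens):
    """Sentence perplexity under the add-one-smoothed bigram model."""
    log_prob = 0.0
    for prev, cur in zip(["<s>"] + tokens, tokens):
        p = (BIGRAMS[(prev, cur)] + 1) / (CONTEXTS[prev] + VOCAB_SIZE)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))


def correct(tokens):
    """Replace each suspect token with the candidate giving the lowest perplexity."""
    tokens = list(tokens)
    suspects = [i for i, w in enumerate(tokens) if w not in KNOWN]
    for i in suspects:
        candidates = [tokens[i]] + CONFUSION.get(tokens[i], [])
        tokens[i] = min(
            candidates,
            key=lambda c: perplexity(tokens[:i] + [c] + tokens[i + 1:]),
        )
    return tokens
```

The original token is kept among the candidates, so a suspect position is only changed when a replacement actually lowers the sentence perplexity.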

In some embodiments, in step S103, short text classification is used to identify whether the user intention is simple fact question answering or common question answering. To improve the robustness of the system, each intention is scored, and this score is the first score. A higher first score indicates that the intention is more likely to match the user's true query intent.
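One way to picture the first score is a template-overlap scorer. The patent does not specify the classifier, so the sketch below is an assumption throughout: it scores each intention by the best Jaccard overlap between the query tokens and the templates of that intention, and `TEMPLATES`, `tokens` and `intent_scores` are hypothetical names.

```python
# Hypothetical question template library; "xxx" marks a slot and is ignored.
TEMPLATES = {
    "simple_fact": ["what is the phone of xxx", "where is the xxx address"],
    "faq": ["how do i apply for xxx", "what materials are needed for xxx"],
}


def tokens(text):
    """Lower-cased token set with the slot placeholder removed."""
    return {w for w in text.lower().replace("?", " ").split() if w != "xxx"}


def intent_scores(query):
    """Return a score in [0, 1] for every intention (the 'first score')."""
    q = tokens(query)
    scores = {}
    for intent, templates in TEMPLATES.items():
        best = 0.0
        for template in templates:
            t = tokens(template)
            union = q | t
            if union:
                best = max(best, len(q & t) / len(union))
        scores[intent] = best
    return scores
```

Keeping a score for both intentions, rather than a hard decision, lets the later ranking step recover when the classifier is only marginally confident.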

In some embodiments, referring to fig. 3, in step S104, for the simple fact question-answer, retrieving based on a pre-constructed knowledge graph, and obtaining a first candidate answer set includes:

step S1041, extracting entity information, relationship information and attribute information from the corrected text, and using a synonym dictionary to link the entity information, relationship information or attribute information to the corresponding entity, relationship or attribute in the knowledge graph to generate an SQL query statement;

step S1042, filling the extracted words into the corresponding slots of the SQL query statement, and executing the query to obtain a first candidate answer set.

Specifically, the entity linking step includes two parts: identification and disambiguation. The identification part mainly uses entity identification in lexical analysis to obtain entities and relationship attributes in user query. For some special fields, a field dictionary is also added in the lexical analysis. The disambiguation part mainly searches the identified entities from the knowledge graph, including aliases, acronyms and the like, as a candidate entity set. The method of Learning to Rank is then used to select the appropriate entity from the candidate set.
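The disambiguation part can be sketched as candidate generation followed by ranking. The sketch below is an assumption-laden stand-in: string similarity from the standard-library `difflib.SequenceMatcher` replaces a trained Learning to Rank model, and the alias table and the names `ALIASES`, `candidates` and `link` are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical knowledge-graph entities with their aliases and abbreviations.
ALIASES = {
    "high-tech enterprise certification": ["high-tech certification"],
    "technology SME registration": ["tech SME registration"],
}


def candidates(mention):
    """Collect entities whose name or alias contains, or is contained in, the mention."""
    m = mention.lower()
    found = []
    for entity, aliases in ALIASES.items():
        names = [n.lower() for n in [entity] + aliases]
        if any(m in n or n in m for n in names):
            found.append(entity)
    return found


def link(mention):
    """Pick the best candidate entity, or None when no candidate exists."""
    cands = candidates(mention)
    if not cands:
        return None

    def score(entity):
        # Stand-in ranking feature: best string similarity over all names.
        names = [entity] + ALIASES[entity]
        return max(
            SequenceMatcher(None, mention.lower(), n.lower()).ratio()
            for n in names
        )

    return max(cands, key=score)
```

In a real system the `score` function would be replaced by the learned ranker, which can use context features in addition to surface similarity.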

A second score of each candidate answer in the first candidate answer set is determined according to the relevance; the higher the second score, the more likely the retrieval result meets the user's requirement.
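Steps S1041 and S1042 can be sketched with the standard-library `sqlite3` module. The sketch assumes the knowledge graph is stored as a relational table of triples, which the patent does not state; the table layout, the synonym entries and the names `SYNONYMS`, `build_db` and `retrieve` are hypothetical. Slots are filled through parameter binding rather than string concatenation.

```python
import sqlite3

# Hypothetical synonym dictionary mapping user wording to graph vocabulary.
SYNONYMS = {
    "high-tech certification": "high-tech enterprise certification",
    "office hours": "handling time",
}

# SQL query statement with two slots: the head entity and the relation.
QUERY_TEMPLATE = "SELECT tail FROM triples WHERE head = ? AND relation = ?"


def normalize(mention):
    """Link a mention to its normalized form; unknown mentions pass through."""
    return SYNONYMS.get(mention, mention)


def build_db():
    """Create a toy in-memory triple store (assumed storage layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE triples (head TEXT, relation TEXT, tail TEXT)")
    conn.execute(
        "INSERT INTO triples VALUES (?, ?, ?)",
        (
            "high-tech enterprise certification",
            "handling time",
            "working days 09:00-12:00 and 13:30-17:00",
        ),
    )
    return conn


def retrieve(conn, entity_mention, relation_mention):
    """Fill the slots of the query with linked values and run it."""
    params = (normalize(entity_mention), normalize(relation_mention))
    rows = conn.execute(QUERY_TEMPLATE, params).fetchall()
    return [row[0] for row in rows]
```

With this setup, `retrieve(build_db(), "high-tech certification", "office hours")` resolves both mentions through the synonym dictionary before the query runs, which is the robustness benefit the synonym dictionary is meant to provide.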

In some embodiments, referring to fig. 4, in step S105, for the common question answers, retrieving based on a pre-constructed vectorized FAQ question-answer pair, and obtaining a second candidate answer set includes:

step S1051, carrying out text vectorization on the correction text;

step S1052, searching for similar vectors from the vectorized FAQ question-answer pair, obtaining corresponding answers, and generating a second candidate answer set.

Searching for similar vectors from the vectorized FAQ question-answer pair comprises similar question matching and question-answer matching. In practical application, similar question matching is mainly adopted, and the vectorization is mainly based on BERT semantic-level vectors.

A third score of each candidate answer in the second candidate answer set is determined according to the relevance; the higher the third score, the more likely the retrieval result meets the user's requirement.
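The two matching modes can be sketched with a brute-force cosine search. This is only an illustration under stated assumptions: a bag-of-words count vector stands in for a BERT sentence embedding, a linear scan stands in for a vector search tool, and the FAQ content and the names `FAQ`, `embed`, `cosine` and `search` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical FAQ question-answer pairs.
FAQ = [
    ("how do i apply for certification", "submit the application form online"),
    ("what is the handling time", "working days 09:00-12:00 and 13:30-17:00"),
]


def embed(text):
    """Toy stand-in for a semantic vectorization model: token counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)  # Counter yields 0 for missing keys
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Vectorize questions and answers once, ahead of query time.
INDEX = [(embed(q), embed(a), a) for q, a in FAQ]


def search(query, by="question", top_k=1):
    """Return (third_score, answer) pairs, best first.

    by="question" is similar question matching; by="answer" matches the
    query against the answers directly.
    """
    qv = embed(query)
    scored = [
        (cosine(qv, q_vec if by == "question" else a_vec), answer)
        for q_vec, a_vec, answer in INDEX
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:top_k]
```

The similarity returned with each answer can serve directly as the third score, since it already lies between 0 and 1 for count vectors.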

In some embodiments, referring to fig. 5, in step S106, ranking the candidate answers according to the first score, the second score, and the third score to obtain an answer includes:

step S1061, performing weighted summation on the first score and the second score of the simple fact question-answer to obtain a fourth score of each candidate answer in the first candidate answer set.

Step S1062, performing weighted summation on the first score and the third score of the common question answer to obtain a fifth score of each candidate answer in the second candidate answer set.

Step S1063, sorting all candidate answers according to the fourth score and the fifth score, and selecting the answer with the highest sorting; the sorting is in the order of scores from high to low.

And step S1064, generating an answer feedback to the user according to the selected answer and the answer template.
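Steps S1061 to S1064 can be sketched as follows. The weights and the answer template are assumptions, since the patent does not give values; `rank`, `respond`, `w_intent`, `w_relevance` and `ANSWER_TEMPLATE` are hypothetical names.

```python
# Hypothetical answer template used to phrase the final reply.
ANSWER_TEMPLATE = "The answer to your question is: {answer}"


def rank(intent_scores, kg_candidates, faq_candidates,
         w_intent=0.5, w_relevance=0.5):
    """Merge both candidate sets and sort them from high score to low.

    intent_scores:  first scores, keyed by "simple_fact" and "faq"
    kg_candidates:  (answer, second_score) pairs from the knowledge graph
    faq_candidates: (answer, third_score) pairs from the FAQ retrieval
    """
    ranked = []
    for answer, second in kg_candidates:
        # Fourth score: weighted sum of the first and second scores.
        fourth = w_intent * intent_scores["simple_fact"] + w_relevance * second
        ranked.append((fourth, answer))
    for answer, third in faq_candidates:
        # Fifth score: weighted sum of the first and third scores.
        fifth = w_intent * intent_scores["faq"] + w_relevance * third
        ranked.append((fifth, answer))
    ranked.sort(key=lambda item: item[0], reverse=True)
    return ranked


def respond(ranked):
    """Fill the answer template with the top-ranked answer."""
    return ANSWER_TEMPLATE.format(answer=ranked[0][1])
```

Because the intention score enters both sums, a knowledge-graph answer with slightly lower relevance can still outrank an FAQ answer when the query is much more likely to be a simple fact question.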

After the system goes online, logs are checked regularly and new questions raised by users are collected. Once standard answers have been labeled for them, the new question-answer pairs are vectorized and added to the vectorized FAQ question-answer pairs and/or updated into the knowledge graph, so that the system is continuously optimized.

In some embodiments, referring to fig. 6, there is provided a semantic retrieval apparatus including:

a receiving module 201, configured to receive query information sent by a user;

the error correction module 202 is configured to correct errors of the text in the query information to obtain a corrected text;

an intent determination module 203, configured to perform user intent analysis on the corrected text based on a question template library, and determine a first score of the identified user intent, where the user intent includes simple fact question answering and common question answering;

the first retrieval module 204 is configured to, for simple fact question answering, retrieve based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determine a second score of each candidate answer in the first candidate answer set according to the relevance;

the second retrieval module 205 is configured to, for common question answers, retrieve based on a pre-constructed vectorized FAQ question-answer pair to obtain a second candidate answer set, and determine a third score of each candidate answer in the second candidate answer set according to the relevance;

and the answer generating module 206 is configured to rank the candidate answers according to the first score, the second score, and the third score to obtain answers.

Specifically, the error correction module 202 is further configured to segment the text with a Chinese word segmenter, and perform error detection at character granularity and word granularity to generate a suspected error position candidate set; traverse all suspected error positions, search a pre-stored dictionary for phonetically similar and visually similar words to replace the words at the suspected error positions, and calculate sentence perplexity through a language model; sort the replacement results according to the sentence perplexity calculation result to obtain an optimal corrected word; and generate the corrected text according to the optimal corrected word.

The first retrieval module 204 is further configured to extract entity information, relationship information, and attribute information from the corrected text, link the entity information, relationship information, or attribute information to the corresponding entity, relationship, or attribute in the knowledge graph using a synonym dictionary, and generate an SQL query statement; and fill the extracted words into the corresponding slots of the SQL query statement, and execute the query to obtain a first candidate answer set.

The second retrieving module 205 is further configured to perform text vectorization on the correction text, search for similar vectors from the vectorized FAQ question-answer pair, obtain corresponding answers, and generate a second candidate answer set.

The second retrieving module 205 is further configured to calculate similarity between the vectorized correction text and the question in the vectorized FAQ question-answer pair, and return an answer corresponding to the question with the highest similarity; and/or calculating the similarity between the vectorized correction text and the answer in the vectorized FAQ question-answer pair, and returning the answer with the highest similarity.

The answer generating module 206 is further configured to perform weighted summation on the first score and the second score of the simple fact question-answer to obtain a fourth score of each candidate answer in the first candidate answer set; weighting and summing the first score and the third score of the common question answer to obtain a fifth score of each candidate answer in a second candidate answer set; sorting all the candidate answers according to the fourth score and the fifth score, and selecting the answer with the highest sorting; and generating an answer feedback to the user according to the selected answer and the answer template.

For the specific working principle, please refer to the above method embodiments, which are not described herein again.

Referring to fig. 7, in some embodiments, there is further provided an electronic device including a processor 301 and a memory 302, where the memory 302 stores a plurality of instructions, and the processor 301 is configured to read the plurality of instructions and execute the semantic retrieval method described above, for example, including: receiving query information sent by a user; correcting the text in the query information to obtain a corrected text; performing user intention analysis on the corrected text based on a question template library, and determining a first score of the identified user intention, wherein the user intention comprises simple fact question answers and common question answers; for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to the relevancy; for common question answers, retrieving based on a pre-constructed vectorized FAQ question-answer pair to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to the relevancy; and sequencing the candidate answers according to the first score, the second score and the third score to obtain answers.

In some embodiments, there is also provided a computer-readable storage medium storing a plurality of instructions that are readable by a processor and perform the semantic retrieval method described above, for example, comprising: receiving query information sent by a user; correcting the text in the query information to obtain a corrected text; performing user intention analysis on the corrected text based on a question template library, and determining a first score of the identified user intention, wherein the user intention comprises simple fact question answers and common question answers; for simple fact question answering, retrieving based on a pre-constructed knowledge graph to obtain a first candidate answer set, and determining a second score of each candidate answer in the first candidate answer set according to the relevancy; for common question answers, retrieving based on a pre-constructed vectorized FAQ question-answer pair to obtain a second candidate answer set, and determining a third score of each candidate answer in the second candidate answer set according to the relevancy; and sequencing the candidate answers according to the first score, the second score and the third score to obtain answers.

In summary, the semantic retrieval method, the semantic retrieval device, and the electronic device provided in the embodiments at least have the following advantages:

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A semantic retrieval method, comprising:

receiving query information sent by a user;

correcting the text in the query information to obtain a corrected text;

2. The method of claim 1, wherein correcting the text in the query information to obtain corrected text comprises:

and generating the corrected text according to the optimal corrected word.

3. The method of claim 2, wherein for simple fact questions and answers, retrieving based on a pre-constructed knowledge graph, obtaining a first set of candidate answers comprises:

4. The method of claim 3, wherein for the common question answers, retrieving based on a pre-constructed vectorized FAQ question-answer pair, obtaining a second set of candidate answers comprises:

5. The method of claim 4, wherein finding similar vectors from the vectorized FAQ question-answer pair comprises:

6. The method of claim 5, wherein ranking the candidate answers according to the first score, the second score, and the third score to obtain an answer comprises:

7. The method of claim 1, wherein the question template library is pre-constructed in the following manner:

8. A semantic retrieval apparatus, comprising:

the receiving module is used for receiving query information sent by a user;

9. An electronic device comprising a processor and a memory, the memory storing a plurality of instructions, the processor being configured to read the plurality of instructions and execute the semantic retrieval method according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a plurality of instructions readable by a processor and performing the semantic retrieval method of any one of claims 1 to 7.