CN110750616B - Retrieval-based chat method and apparatus, and computer device - Google Patents

Retrieval-based chat method and apparatus, and computer device

Info

Publication number
CN110750616B
CN110750616B (application CN201910985485.7A)
Authority
CN
China
Prior art keywords
question
answer
semantic
text
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910985485.7A
Other languages
Chinese (zh)
Other versions
CN110750616A (en)
Inventor
邵建智
毛晓曦
范长杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN201910985485.7A
Publication of CN110750616A
Application granted
Publication of CN110750616B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a retrieval-based chat method and apparatus, and a computer device. It relates to the technical field of natural language processing and addresses the technical problem that sentences sharing many words yet differing greatly in actual meaning distort retrieval results, yielding replies far from what is expected. The method is applied to a chat robot in which a text index library and a semantic index library are prestored; the question indexes in the text index library are text feature indexes, and the question indexes in the semantic index library are semantic vector indexes. The method comprises the following steps: determining a first candidate answer set from the text index library through text features of a target question to be answered; determining a second candidate answer set from the semantic index library through a semantic vector of the target question; and, based on a pre-trained matching model, determining from the first and second candidate answer sets at least one target answer with the highest matching degree with the target question as the reply to the target question.

Description

Retrieval-based chat method and apparatus, and computer device
Technical Field
The present application relates to the field of natural language processing technologies, and in particular to a retrieval-based chat method, apparatus, and computer device.
Background
With the rapid development of networks and information technology, the internet has become a primary means for people to acquire information. There are generally two ways to obtain information from the internet: one is to search keywords through a search engine to obtain relevant information; the other is to ask questions of the customer service provided by a website or social software and obtain corresponding answers. The former is based on keyword search: users find it difficult to express their retrieval intention clearly, the search engine returns numerous related web pages, and users struggle to quickly locate the required information. The latter is based on natural-language questions and can clearly express the user's intention, but manual question answering requires a large customer-service staff, causing a sharp increase in labor cost. To solve the labor-cost problem, major websites and product manufacturers adopt "auto-answer" chat robots to answer users' questions, such as SimSimi in Korea, Apple's Siri, JD.com's customer service bot, the Xiaoi Bot, Microsoft XiaoIce, and Microsoft Cortana.
The chat robot is an important application field of Natural Language Processing (NLP) technology, and it replies mainly by retrieval, i.e., by relying on an existing question-answer library. For knowledge question answering in the general/open domain, a best-matching answer can be found by maintaining a large amount of general/open-domain knowledge and retrieving over it. The current approach primarily judges whether two questions are close by the words that co-occur in them. However, due to the complexity of natural language, two sentences that share many words can differ greatly in actual meaning, so the reply can be far from what is expected.
Disclosure of Invention
The invention aims to provide a retrieval-based chat method, apparatus, and computer device, to solve the technical problem that sentences sharing many co-occurring words yet differing greatly in actual meaning distort retrieval results, yielding replies far from what is expected.
In a first aspect, an embodiment of the present application provides a retrieval-based chat method, applied to a chat robot in which a text index library and a semantic index library are prestored, where a question index in the text index library is a text feature index and a question index in the semantic index library is a semantic vector index; the method comprises the following steps:
determining a first candidate answer set through text features of a target question to be answered, based on the text index library;
determining a second candidate answer set through a semantic vector of the target question, based on the semantic index library;
and determining, based on a pre-trained matching model and according to the first candidate answer set and the second candidate answer set, at least one target answer with the highest matching degree with the target question as the reply to the target question.
In one possible implementation, the text index library includes a first question-answer corpus, and the question index in the text index library is used for indexing a question in the first question-answer corpus; the step of determining a first candidate answer set through the text features of the target question to be answered based on the text index library comprises:
determining, based on the text index library and according to the text features of the target question to be answered, a first preset number of first questions with the greatest text similarity to the target question;
and determining the first candidate answer set according to the answer corresponding to each first question in the first question-answer corpus.
In one possible implementation, the step of determining, based on the text index library and according to the text features of the target question to be answered, a first preset number of first questions with the greatest text similarity to the target question comprises:
determining, based on the text index library, the text similarity between a current question and the target question according to one or more of: the number of co-occurring words between the retrieved current question and the target question, the proportion of the co-occurring words in the current question and in the target question respectively, and the importance of the co-occurring words in the first question-answer corpus;
determining, among the questions of the first question-answer corpus, a first preset number of first questions with the greatest text similarity to the target question;
where the current question is any one of the questions retrieved from the text index library according to the text features of the target question, and the lower the frequency with which a co-occurring word appears in the first question-answer corpus, the higher its importance.
In one possible implementation, the step of determining, based on the text index library and according to the text features of the target question to be answered, a first preset number of first questions with the greatest text similarity to the target question comprises:
determining, based on the text index library, the text similarity between a current question and the target question according to the Jaccard distance between the two word sets corresponding to the retrieved current question and the target question;
determining, among the questions of the first question-answer corpus, a first preset number of first questions with the greatest text similarity to the target question;
where the current question is any one of the questions retrieved from the text index library according to the text features of the target question.
In one possible implementation, the semantic index library includes the first question-answer corpus, and the question index in the semantic index library is used for indexing a question in the first question-answer corpus; the step of determining a second candidate answer set through the semantic vector of the target question based on the semantic index library comprises:
determining, based on the semantic index library and according to the semantic vector of the target question to be answered, a second preset number of second questions with the greatest semantic similarity to the target question;
and determining the second candidate answer set according to the answer corresponding to each second question in the first question-answer corpus.
In one possible implementation, the step of determining, based on the semantic index library and according to the semantic vector of the target question to be answered, a second preset number of second questions with the greatest semantic similarity to the target question comprises:
determining the semantic vector of the target question;
ranking, based on the semantic index library, the questions in the semantic index library according to the semantic similarity between a current semantic vector and the semantic vector of the target question;
selecting, according to the ranking of the questions in the semantic index library, a second preset number of second questions with the greatest semantic similarity;
where the current semantic vector is any one of the semantic vectors in the semantic index library.
In one possible implementation, the semantic vector is extracted by a pre-trained natural language processing (NLP) model.
In one possible implementation, the step of ranking the questions in the semantic index library according to the semantic similarity between the current semantic vector and the semantic vector of the target question comprises:
determining, based on the semantic index library, the cosine similarity between the current semantic vector and the semantic vector of the target question;
and ranking the questions in the semantic index library according to the cosine similarity.
In one possible implementation, the NLP model includes any one of a Bidirectional Encoder Representations from Transformers (BERT) model, an Embeddings from Language Models (ELMo) model, and a Generative Pre-Training (GPT) model.
In one possible implementation, the pre-trained matching model is obtained by training an initial matching model on training samples, where the training samples include positive samples and negative samples; a positive sample is an original question in a second question-answer corpus together with the original answer corresponding to that original question; a negative sample is the original question together with the least similar answer among a third preset number of answers in the second question-answer corpus that are most similar to the original answer.
In one possible implementation, the method further includes: taking, in turn, each of a third preset number of answers in the second question-answer corpus most similar in text similarity and/or semantic similarity to the original answer as the current answer, and performing the following step: if the text similarity and/or semantic similarity between the current answer and the original answer is smaller than a preset value, determining that the current answer is the answer least similar to the original answer;
and if the text similarity and/or semantic similarity between every one of the third preset number of most similar answers and the original answer is larger than the preset value, randomly selecting one answer from the third preset number of most similar answers as the answer least similar to the original answer.
In one possible implementation, the first question-answer corpus is the same as the second question-answer corpus.
In a second aspect, a retrieval-based chat device is provided, applied to a chat robot in which a text index library and a semantic index library are prestored, where question indexes in the text index library are text feature indexes and question indexes in the semantic index library are semantic vector indexes; the device comprises:
a first determining module, configured to determine a first candidate answer set through text features of a target question to be answered, based on the text index library;
a second determining module, configured to determine a second candidate answer set through a semantic vector of the target question, based on the semantic index library;
and a third determining module, configured to determine, based on a pre-trained matching model and according to the first candidate answer set and the second candidate answer set, at least one target answer with the highest matching degree with the target question as the reply to the target question.
In a third aspect, an embodiment of the present application further provides a computer device, including a memory and a processor, where the memory stores a computer program executable on the processor, and the processor implements the method of the first aspect when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing machine-executable instructions which, when invoked and executed by a processor, cause the processor to perform the method of the first aspect.
The embodiments of the application bring the following beneficial effects. The embodiments provide a retrieval-based chat method, apparatus, and computer device. First, several candidate answer sets are determined from the text index library and the semantic index library, respectively, through the text features and the semantic vector of the target question; then, based on a pre-trained matching model, the final answer is selected from the candidate answer sets. Because similarity is measured from both the textual and the semantic perspective, high-quality candidate answers are obtained, the final answer determined from them better matches the expected result, and the user experience is friendlier.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
To illustrate the specific embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed in their description are briefly introduced below. The drawings described below show some embodiments of the present application; other drawings can be derived from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of a retrieval-based chat method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of the training process of a matching model according to an embodiment of the present application;
Fig. 3 is another schematic flowchart of a retrieval-based chat method according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a retrieval-based chat device according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "comprises" and "comprising," and any variations thereof, as used in the embodiments of this application, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Embodiments of the present invention are further described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a retrieval-based chat method according to an embodiment of the present application. The method is applied to a chat robot in which a text index library and a semantic index library are prestored; the question indexes in the text index library are text feature indexes, and the question indexes in the semantic index library are semantic vector indexes.
As an example, the chat robot may correspond to a terminal that provides a graphical user interface for interacting with the user, and the terminal may transmit the target question to be answered to the chat robot.
As another example, the chat robot may itself provide a local graphical user interface for interacting with the user to obtain the target question to be answered.
For example, a graphical user interface provided by the chat robot or the terminal may display a question input area and an answer display area. The user may enter the target question to be answered in the question input area, and the terminal determines the input target question in response to the input operation on the question input area. The input operation may take a variety of forms, such as a physical input operation triggered by pressing a physical key (e.g., keyboard input), a touch input operation triggered by touching a virtual key, or a handwriting input operation triggered by a sliding operation on a touch screen.
Of course, the terminal or the chat robot may also provide a local audio device for interacting with the user to obtain the target question to be answered. For example, a microphone provided by the chat robot or the terminal may capture audio uttered by the user; the target question to be answered is then obtained by recognizing the audio, and the terminal determines a reply in response to the target question. The chat robot or terminal may also provide a speaker to output the answer corresponding to the target question.
As shown in fig. 1, the method may specifically include the following steps:
and S110, determining a first alternative answer set through text characteristics of the target question to be answered based on the text index library.
The text index library may be built from the first question-answer corpus. An inverted index table may be established according to the text features corresponding to the questions in the first question-answer corpus; each entry in the inverted index table may include one text feature and every question having that text feature. Each question-answer pair in the corpus can thus be indexed through the inverted index table. A text feature may be a word or a phrase.
For example, for the question-answer pair (question: "You are beautiful"; answer: "Thank you for the compliment"), the text index library generated from this question-answer pair may be as shown in Table 1.
TABLE 1
[Table 1: each inverted-index entry maps a text feature of the question to the question-answer pair containing it]
Through the text features included in the target question, several questions with the same text features can be retrieved from the text index library, and the first candidate answer set can be determined from the answers corresponding to those questions.
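To make the inverted-index mechanism concrete, the following is a minimal Python sketch of building such an index and retrieving candidate questions. The whitespace tokenizer and the sample question-answer pairs are illustrative assumptions, not part of the patent; a real system for Chinese text would use a word segmenter.

```python
from collections import defaultdict

def tokenize(text):
    # Illustrative tokenizer; a production system for Chinese text would
    # use a word segmenter rather than whitespace splitting.
    return text.lower().split()

def build_text_index(qa_pairs):
    """Inverted index: each text feature (word) maps to the indices of the
    question-answer pairs whose question contains that feature."""
    index = defaultdict(set)
    for i, (question, _answer) in enumerate(qa_pairs):
        for word in tokenize(question):
            index[word].add(i)
    return index

def retrieve_candidates(index, qa_pairs, target_question):
    """Return every stored QA pair whose question shares at least one
    text feature with the target question."""
    hits = set()
    for word in tokenize(target_question):
        hits |= index.get(word, set())
    return [qa_pairs[i] for i in sorted(hits)]

qa_pairs = [("you are beautiful", "thank you for the compliment"),
            ("how is the weather today", "it is sunny")]
text_index = build_text_index(qa_pairs)
print(retrieve_candidates(text_index, qa_pairs, "you look beautiful"))
# -> [('you are beautiful', 'thank you for the compliment')]
```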
In addition, the first question-answer corpus may be predetermined in the chat robot; it can be collected in advance and obtained through preprocessing. Preprocessing the collected corpus may include data cleansing, such as filtering out pornographic, politically sensitive, and violent sentences through a sensitive-word lexicon and rules, and filtering out sentences that are too long or too short. The data is then sent to a labeling platform for manual annotation, where all inappropriate question-answer pairs are further filtered out manually, yielding a high-quality first question-answer corpus.
S120, determining a second candidate answer set through the semantic vector of the target question, based on the semantic index library.
The semantic index library may also be built from the first question-answer corpus. A semantic index table may be established according to the semantic vector corresponding to each question in the first question-answer corpus; each entry in the semantic index table may include one semantic vector and one question-answer pair, so that each question-answer pair in the corpus can be indexed through the semantic index table. Additionally, in some embodiments, the semantic index library may be built from a question-answer corpus different from the first question-answer corpus.
For example, for the question-answer pair (question: "You are beautiful"; answer: "Thank you for the compliment"), the semantic index library stores the semantic vector of the question as the index of this question-answer pair.
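A corresponding sketch for the semantic index library: each question is encoded into a fixed-length vector that indexes its question-answer pair. The random placeholder encoder below merely stands in for a pre-trained NLP encoder (e.g., a BERT sentence vector) and is an assumption for illustration only.

```python
import numpy as np

def encode(text):
    # Placeholder for a pre-trained encoder (BERT/ELMo/GPT); deterministic
    # within a process so the sketch runs without a trained model.
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    v = rng.standard_normal(512)
    return v / np.linalg.norm(v)  # unit-normalize for cosine similarity

def build_semantic_index(qa_pairs):
    """Semantic index: row i is the semantic vector of the question of
    qa_pairs[i]."""
    return np.stack([encode(question) for question, _answer in qa_pairs])
```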
S130, determining, based on a pre-trained matching model and according to the first candidate answer set and the second candidate answer set, at least one target answer with the highest matching degree with the target question as the reply to the target question.
According to the retrieval-based chat method provided by the embodiments of the application, several candidate answer sets are determined from the text index library and the semantic index library, respectively, through the text features and the semantic vector of the target question, and the final answer is selected from the candidate answer sets based on a pre-trained matching model. Similarity is measured from both the textual and the semantic angle, so high-quality candidate answers are obtained, the final answer determined from them better matches the expected result, and the user experience improves.
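Putting the three steps together, the following sketch (reusing the helpers from the two sketches above) merges the text-retrieved and semantically retrieved candidates and lets a scoring function pick the reply. The `score` callable stands in for the pre-trained matching model, whose concrete form the patent leaves open; `top_n` is an illustrative cutoff.

```python
import numpy as np

def reply(target_question, text_index, sem_index, qa_pairs, score, k=1, top_n=10):
    # S110: first candidate answer set via text features.
    text_cands = retrieve_candidates(text_index, qa_pairs, target_question)
    # S120: second candidate answer set via the semantic vector.
    q_vec = encode(target_question)
    sims = sem_index @ q_vec  # cosine similarity, vectors are unit-norm
    sem_cands = [qa_pairs[i] for i in np.argsort(sims)[::-1][:top_n]]
    # S130: rank the union of candidate answers with the matching model.
    answers = {a for _q, a in text_cands} | {a for _q, a in sem_cands}
    ranked = sorted(answers, key=lambda a: score(target_question, a), reverse=True)
    return ranked[:k]
```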
The pre-trained matching model can be obtained by training an initial matching model on training samples. As shown in Fig. 2, the training samples may include positive samples and negative samples.
In some embodiments, a positive sample may be an original question in the second question-answer corpus together with the original answer corresponding to that question; a negative sample is the original question together with the least similar answer among a third preset number of answers in the second question-answer corpus that are most similar to the original answer.
For example, suppose the second question-answer corpus contains 5 answers most similar to the original answer; similarity comparisons are performed within these 5 answers, and the answer least similar to the original answer is taken as the negative sample.
Negative samples obtained in this way ensure that the replies of the positive and negative samples have a certain similarity without being too similar. On the one hand, this avoids an excessive difference between positive and negative samples: if the positive sample shares co-occurring words with the original question while the negative sample essentially never does, the matching model over-relies on co-occurrence features, and at prediction time every reply sharing words with the question receives a high matching score, degrading the matching effect. On the other hand, it avoids too small a difference between positive and negative samples, in which case the negative sample could itself serve as a reply to the original question, so the training data would contain a large amount of noise and interfere with normal training of the matching model.
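The patent does not fix the matching model's architecture; as one plausible minimal form, the sketch below trains a small PyTorch scorer on (question vector, answer vector) pairs with binary labels, positives being original question-answer pairs and negatives built as described above. All names and sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MatchingModel(nn.Module):
    """Scores the matching degree of a (question, answer) vector pair."""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, q_vec, a_vec):
        return self.mlp(torch.cat([q_vec, a_vec], dim=-1)).squeeze(-1)

model = MatchingModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(q_vecs, a_vecs, labels):
    # labels: 1.0 for (original question, original answer),
    #         0.0 for (original question, selected negative answer).
    optimizer.zero_grad()
    loss = loss_fn(model(q_vecs, a_vecs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```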
In some embodiments, the method further comprises: taking, in turn, each of a third preset number of answers in the second question-answer corpus most similar in text similarity and/or semantic similarity to the original answer as the current answer, and performing the following steps:
if the text similarity and/or semantic similarity between the current answer and the original answer is smaller than a preset value, determining that the current answer is the answer least similar to the original answer;
and if the text similarity and/or semantic similarity between every one of the third preset number of most similar answers and the original answer is larger than the preset value, randomly selecting one answer from the third preset number of most similar answers as the answer least similar to the original answer.
As shown in Fig. 2, several replies most similar to the original reply (the original answer) are retrieved from the question-answer corpus using the inverted index table. After the n-th most similar reply is obtained, whether its similarity to the original reply is below a threshold can be calculated; if so, the n-th most similar reply is taken as the negative sample, and if not, the negative sample is randomly sampled from the most similar replies. The positive and negative samples are then used to train the matching model.
For example, suppose the second question-answer corpus contains 5 answers whose similarity to the original answer is greater than a first threshold. Among these 5 answers, the negative sample is selected from those whose similarity to the original answer is lower than a second threshold, the second threshold being greater than the first; if none of the 5 answers has similarity lower than the second threshold, one answer is randomly selected from the 5 as the negative sample.
With this negative-sample determination method, a relatively more appropriate negative sample can be determined under various conditions, ensuring that the replies of the positive and negative samples have a certain similarity without being too similar.
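A sketch of this negative-sample selection logic under the stated rules; `similar_answers` is assumed to be the third preset number of answers most similar to the original answer, retrieved via the text and/or semantic index and ordered most-similar first, and `similarity` is any of the similarity measures described above.

```python
import random

def pick_negative(original_answer, similar_answers, similarity, threshold,
                  rng=random.Random(0)):
    """Return the answer least similar to the original among its nearest
    neighbors, falling back to a random neighbor if all are too close."""
    for candidate in similar_answers:
        if similarity(candidate, original_answer) < threshold:
            return candidate  # dissimilar enough: use it as the negative
    # Every retrieved answer is above the threshold: sample randomly
    # among them, as the method prescribes.
    return rng.choice(similar_answers)
```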
As shown in Fig. 3, the second question-answer corpus may be the same as the first question-answer corpus of step S110 (the question-answer corpus in Fig. 3). For example, the text index library of the above steps may be used to retrieve a third preset number of answers most similar in text to the original answer, by retrieving, through the text features included in the original answer, answers sharing those text features; alternatively or additionally, the semantic index library of the above steps may be used to retrieve a third preset number of answers most similar in semantics to the original answer, by retrieving, through the semantic vector of the original answer, answers with similar semantic vectors.
For example, the original answer in the first question-answer corpus is fed to the text index library to obtain several answers textually similar to it; of course, the original answer may also be fed to the semantic index library to obtain several answers semantically similar to it.
Thus, by directly reusing the text index library and/or the semantic index library of the above steps, several answers similar in text and/or semantics to the original answer can be obtained quickly and accurately, improving the efficiency of data processing.
In some embodiments, step S110 may specifically include:
step 1.1), determining, based on the text index library and according to the text features of the target question to be answered, a first preset number of first questions with the greatest text similarity to the target question;
step 1.2), determining the first candidate answer set according to the answer corresponding to each first question in the first question-answer corpus.
For example, as shown in Fig. 3, several first questions whose text features are most similar to those of the target question to be answered may be retrieved from the text index library using the inverted index table; the answers corresponding to these first questions are then looked up in the first question-answer corpus and taken as the first candidate answer set.
The first candidate answer set obtained in this way makes full use of the text features of the target question to be answered, retrieving the answers corresponding to the questions with the greatest text similarity to the target question.
In some embodiments, the text similarity of two questions is determined according to one or more of: the number of co-occurring words between the two questions, the proportion of the co-occurring words in each of the two questions, and the importance of the co-occurring words in the question-answer corpus, where the lower the frequency with which a word appears in the corpus, the higher its importance. Exemplarily, step 1.1) may specifically include:
determining, based on the text index library, the text similarity between the current question and the target question according to one or more of: the number of co-occurring words between the retrieved current question and the target question, the proportion of the co-occurring words in the current question and in the target question respectively, and the importance of the co-occurring words in the first question-answer corpus;
determining, among the questions of the first question-answer corpus, a first preset number of first questions with the greatest text similarity to the target question;
where the current question is any one of the questions retrieved from the text index library according to the text features of the target question, and the lower the frequency with which a co-occurring word appears in the first question-answer corpus, the higher its importance.
Of course, the BM25 algorithm may be used to retrieve, from the text index library, the questions most similar to the target question. It should be noted that the BM25 algorithm can calculate the matching degree between the target question and every question in the text index library and return the questions with the highest matching degree.
For example, the BM25 algorithm analyzes the words in the target question, retrieves from the text index library the questions in which those words also appear, and takes into account both the frequency of the words shared between the target question and the indexed questions and the importance of those words (e.g., the fewer questions a word appears in, the greater its importance).
With this retrieval method, multiple text-based factors can be considered to retrieve, from the text index library, the questions textually most similar to the target question.
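For concreteness, here is a compact, self-contained BM25 scorer over tokenized questions; the parameter values k1 and b are conventional defaults, an assumption since the patent names only the algorithm.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized corpus question against the query tokens."""
    n = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n
    doc_freq = Counter()
    for doc in corpus_tokens:
        doc_freq.update(set(doc))
    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Rarer terms get a higher idf weight, matching the
            # "lower frequency -> higher importance" rule above.
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores
```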
In some embodiments, the text similarity of two questions is determined according to the Jaccard distance between the two word sets corresponding to the two questions. Exemplarily, step 1.1) may specifically include:
determining, based on the text index library, the text similarity between the current question and the target question according to the Jaccard distance between the two word sets corresponding to the retrieved current question and the target question;
determining, among the questions of the first question-answer corpus, a first preset number of first questions with the greatest text similarity to the target question;
where the current question is any one of the questions retrieved from the text index library according to the text features of the target question.
It should be noted that the Jaccard index can be used to measure the similarity between the texts of two questions. In this embodiment, the similarity index between the two questions is the number of words in the intersection of the two word sets corresponding to the texts of the two questions, divided by the number of words in their union. The formula is as follows:

J(Q₁, Q₂) = |W₁ ∩ W₂| / |W₁ ∪ W₂|

where W₁ and W₂ are the word sets of the two questions.
for example, there are 3 words in the current question and 5 words in the target question, where there are 2 words that co-occur between the current question and the target question. Then the jackard distance between the current question and the target question is: 2/(3 + 5) =1/4.
Of course, the similarity threshold may also be preset. For example, when calculating the similarity between the current question and the target question, assuming that the number of words appearing in both questions is less than 4 and the jlcard distance is less than 0.3, the answer may be considered to be dissimilar from the original answer, and is taken as a negative sample. Therefore, the number (2) of the words which appear together in the two problems is less than 4, and the Jacard distance 1/4 between the current problem and the target problem is less than the preset 0.3, so that the text similarity between the current problem and the target problem can be determined to be low.
This method of determining text similarity with the Jaccard measure calculates the similarity between two questions more accurately and reasonably, so the question with the greatest text similarity to the target question can be determined more accurately.
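A direct sketch of the stated intersection-over-union measure, reusing the illustrative tokenize helper from the inverted-index sketch above:

```python
def jaccard_similarity(question_a, question_b):
    """|intersection| / |union| of the two questions' word sets."""
    words_a = set(tokenize(question_a))
    words_b = set(tokenize(question_b))
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0
```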
In some embodiments, the semantic index library includes the first question-answer corpus, and the question index in the semantic index library is used for indexing a question in the first question-answer corpus; step S120 specifically includes:
step 2.1), determining, based on the semantic index library and according to the semantic vector of the target question to be answered, a second preset number of second questions with the greatest semantic similarity to the target question;
step 2.2), determining the second candidate answer set according to the answer corresponding to each second question in the first question-answer corpus.
This method of determining semantic similarity from semantic vectors can find pairs of questions that contain no common words yet have similar semantics, so the second candidate answer set better covers the actual variety of questions.
For example, compare the target question "You are beautiful" with the question "You look really good" in the first question-answer corpus: apart from "you", the two questions share no words, yet their semantics are very close, and an existing text-retrieval approach cannot retrieve such similar questions. With the method provided by the embodiments of the application, it can be computed that the semantic vector of "beautiful" is very similar to that of "look really good", so the question "You look really good" is determined to be similar to the target question, and its corresponding answer in the first question-answer corpus is also included in the candidate answer set. This avoids omitting a well-suited answer to the target question merely because the two questions share no common words.
In some embodiments, a correspondence between the semantic vector of each question in the first question-answer corpus and that question is established in advance. Exemplarily, step 2.1) may specifically include:
step 2.11), determining the semantic vector of the target question;
step 2.12), ranking, based on the semantic index library, the questions in the semantic index library according to the semantic similarity between the current semantic vector and the semantic vector of the target question;
step 2.13), selecting, according to the ranking of the questions in the semantic index library, a second preset number of second questions with the greatest semantic similarity;
where the current semantic vector is any one of the semantic vectors in the semantic index library.
For step 2.11), the target question may be encoded into a semantic vector using an encoder, as shown in Fig. 3. Specifically, the target question is input to the encoder, which encodes it into a fixed-length vector; this fixed-length vector serves as the semantic vector.
For step 2.12), because the semantic vector of each question in the first question-answer corpus corresponds to that question, the similarity between each question in the corpus and the target question can be obtained, through this correspondence, by calculating the semantic similarity between any semantic vector in the semantic index library and the semantic vector of the target question; all the questions in the first question-answer corpus can then be ranked by their similarity.
For step 2.13), a number of second questions with the greatest semantic similarity are selected from the first question-answer corpus according to the ranking obtained in step 2.12).
This method of determining semantic similarity from semantic vectors can find pairs of questions that contain no common word yet have similar semantics; because the semantic similarity between two questions is calculated through their semantic vectors, the retrieval result is more accurate and precise.
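Steps 2.11) through 2.13) reduce to a few lines; the sketch below reuses the illustrative encode helper and semantic index from the earlier sketch (unit-normalized vectors, so a dot product gives the cosine similarity).

```python
import numpy as np

def top_k_semantic(target_question, sem_index, qa_pairs, k):
    q_vec = encode(target_question)          # step 2.11: semantic vector
    sims = sem_index @ q_vec                 # step 2.12: similarity to each question
    order = np.argsort(sims)[::-1]           # rank questions by similarity
    return [qa_pairs[i] for i in order[:k]]  # step 2.13: k most similar
```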
In some embodiments, the semantic vector is extracted by a pre-trained natural language processing (NLP) model.
Illustratively, an NLP model may be trained in advance, and the trained model may be used to convert the target question and each question in the first question-answer corpus into fixed-length semantic word vectors, where each question may correspond to several semantic word vectors.
With a pre-trained NLP model, a question can be converted into the corresponding semantic word vectors more quickly and efficiently, and the semantic word vectors output by the model are more accurate.
In some embodiments, the NLP model includes any one of a BERT model, an ELMo model, and a GPT model.
The specific conversion process of the word vectors is described below, taking the BERT model as an example.
First, the target question is input into the BERT model. The BERT model looks up each word of the target question in a word-vector table (dictionary) and converts it into a one-dimensional original word vector expressing the word's own meaning. Then, for each word in the target question, the BERT model uses the original word vectors to calculate the relationships between that word and the other words in the target question, and fuses these relationships into the word's original word vector to obtain a multi-dimensional semantic word vector (512-dimensional in this BERT model). The semantic word vector represents the comprehensive semantics of the word within the target question; it can also be understood as the vector after the full-text semantic information (all the text of the target question) has been fused in. The BERT model then outputs, for each word in the target question, a vector fused with full-text semantic information. Equivalently, the output of the BERT model is a representation in which the semantics of each word of the target question are separately enhanced.
It should be noted that the meaning a word (a word or phrase; "word" is used for both in this embodiment) expresses in a text is generally related to its context. Analyzing a word in isolation makes its semantics hard to pin down, whereas analyzing it together with its context yields a more precise and richer semantic representation. The context of a word therefore helps enhance its semantic representation. Furthermore, different words in the context tend to contribute differently to this enhancement: some context words have a large effect on the semantics of the word being analyzed, while others have a relatively small effect.
In this embodiment, the BERT model can determine the interaction between each word of the input target question and the other words in the question, thereby using context-word information to enhance the word's semantic representation, so that the semantic word vector output for each word comprehensively and accurately represents the word's semantics within the whole target question.
A concrete implementation of the BERT model is described below.
The BERT model uses a Transformer encoder as its language model. The Transformer language model relies on a self-attention mechanism, through which each word of the input attends directly to all the other words.
The Attention mechanism mainly involves three concepts: Query, Key, and Value. In enhancing a word's semantic representation, the target word and each of its context words have their own original Value (the original word vector). The Attention mechanism takes the target word as the Query and each context word as a Key, uses the calculated similarity between the Query and each Key as a weight, and fuses the context words' Values into the target word's original Value according to these weights, yielding the target word's semantically enhanced Value (the semantic word vector), which is the output of Attention.
For a text (question) input into the BERT model, the Attention mechanism must enhance the semantic representation of every word; therefore each word in turn serves as the Query, and the semantic information of all the other words in the text is weighted and fused to obtain that word's enhanced semantic vector. Since the Query, Key, and Value vectors all come from the same input text (the same target question), this Attention mechanism is called Self-Attention.
Self-Attention may have multiple heads (Multi-head); the multi-head mechanism can be understood as fusing vectors along multiple paths for several different semantic scenes. For example, the sentence "A市长江大桥" ("A-City Yangtze River Bridge") has different readings under different semantic scenes: "A市 / 长江大桥" (A City / Yangtze River Bridge) or "A市长 / 江大桥" (Mayor of A City / Jiang Daqiao). For the character "长" in this sentence, in the former scene it forms a correct semantic unit only in combination with "江" (giving 长江, the Yangtze), while in the latter scene it must combine with "市" (giving 市长, mayor) to form a correct semantic unit. The purpose of Self-Attention is to enhance the target word's semantic representation using the other words of the text, and the words Self-Attention focuses on differ across semantic scenes. Multi-head Self-Attention can therefore be understood as considering, across several semantic scenes, different ways of fusing the target word's semantic vector with those of the other words in the text. Every head in Multi-head Self-Attention has exactly the same output form: a semantic word vector in which full-text semantic information has been fused into each word. Finally, Self-Attention linearly combines the semantic word vectors of a word's multiple heads, obtaining a final semantically enhanced word vector of the same length as the original word vector.
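The core computation described above can be summarized in a few lines of NumPy; this single-head sketch omits the learned Query/Key/Value projections of a full Transformer and is an illustration, not the BERT implementation.

```python
import numpy as np

def self_attention(X):
    """X: (seq_len, dim) original word vectors of one question.
    Each word acts as the Query against all words as Keys; the softmaxed
    similarities weight the Values into one enhanced vector per word."""
    d = X.shape[-1]
    Q, K, V = X, X, X  # full Transformers use learned projections of X
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over context
    return weights @ V  # each row fuses full-sentence semantic information
```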
The BERT model as a whole outputs not only semantic word vectors fused with full-text (target-question) semantic information, but also a semantic sentence vector for the target question. For example, after the target question is input, the BERT model inserts a [CLS] symbol before it and takes the vector output at that symbol's position as the overall textual semantic representation of the target question, usable for text classification. Compared with the individual semantic word vectors of the target question, this semantic sentence vector fuses the semantic information of all words in the question more comprehensively and objectively, capturing for example the overall positive or negative sentiment of a sentence. In practical applications, the vector the BERT model outputs for the first character, "[CLS]", is taken as the semantic sentence vector.
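As an assumed concrete realization (the patent does not name a library), the Hugging Face transformers package exposes both outputs discussed here: one context-fused vector per token, with position 0 holding the [CLS] sentence vector.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

inputs = tokenizer("你真漂亮", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

word_vectors = outputs.last_hidden_state[0]  # one fused vector per token
sentence_vector = word_vectors[0]            # the [CLS] position: sentence vector
```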
Furthermore, a 512-dimensional semantic word vector in this BERT model can be understood as using 512 dimensions to represent the word, where each dimension may correspond to the feature value of one feature.
For example, to capture the relationship between words and a Gender feature, suppose male is represented as -1 and female as +1; then the Gender feature value of "man" may be -1 and that of "woman" +1. Under this rule, the Gender feature value of "king" might be -0.95 and that of "queen" +0.97, while "apple" and "orange" are not gender-specific. As another example, another feature may be the degree of royalty of these words: "man", "woman", "apple", and "orange" are hardly related to royalty, so their royalty feature values may be close to 0, whereas "king" and "queen" are royal, so their royalty feature values can be determined to be large.
Of course, there may be many other features, such as Size, Cost, Live, Action, Noun, and so on.
In this BERT model there are 512 different features, so a 512-dimensional vector can represent the word "man", and likewise a 512-dimensional vector can represent "woman". Representing the words "apple" and "orange" this way, the two belong to similar fruits, so their feature representations may be very similar, though some features may differ, such as the color and taste of an orange versus those of an apple. In general, most features of "apple" and "orange" are similar, with many close feature values and a small number of differing ones; the features of the two words are thus effectively quantized, making it convenient to calculate the vector relationship between them.
For a 512-dimensional feature vector, one can imagine embedding the 512-dimensional data into a two- or three-dimensional space so that it can be visualized. In such a space, "man" and "woman" cluster together, "king" and "queen" cluster together, words for people and words for animals form their own groups, and fruit words gather together, so the feature relationships between words are mapped into the visual space. Analogously to the two- or three-dimensional case, a 512-dimensional space can be assumed; in the BERT model, this 512-dimensional space is used to represent the semantic word vector of each word.
In some embodiments, the semantic similarity of two questions is determined according to the cosine similarity of their two semantic vectors. Exemplarily, step 2.12) may specifically include:
determining, based on the semantic index library, the cosine similarity between the current semantic vector and the semantic vector of the target question;
and ranking the questions in the semantic index library according to the cosine similarity.
It should be noted that because similar words have similar semantic word vectors, similar semantic word vectors lie close together in the feature space. In a high-dimensional space, similarity in the vector space can represent similarity of word meaning in the text: words with similar meanings are relatively close in the feature vector space, and the directions of their two vectors are also very close.
In this embodiment, the cosine similarity between the target word vector and a semantic word vector in the semantic index library is used to determine the semantic similarity between them. That is, the angle between two word vectors serves as a measure of how similar the two words are: the more similar the words, the more similar the corresponding semantic word vectors, the smaller the angle between them, and the higher the cosine similarity. The closeness of two word vectors can be calculated with the cosine function, as follows:
similarity = cos(θ) = (A · B) / (‖A‖ ‖B‖) = Σᵢ AᵢBᵢ / ( √(Σᵢ Aᵢ²) · √(Σᵢ Bᵢ²) )
a and B are two word vectors respectively, cos (theta) represents the cosine of an included angle between the two word vectors A and B, and similarity represents the similarity between words corresponding to the two word vectors A and B.
Therefore, the distance between two semantic word vectors can be calculated more accurately by utilizing the cosine similarity, so that the similarity between two words can be determined more accurately, and the problem similarity sequencing result in the semantic index library is more accurate.
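The formula above is a one-liner in practice; a minimal sketch:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) between two semantic vectors a and b."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```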
Fig. 4 provides a schematic structural diagram of a retrieval-based chat device. The device is applied to a chat robot in which a text index library and a semantic index library are prestored; question indexes in the text index library are text feature indexes, and question indexes in the semantic index library are semantic vector indexes. As shown in Fig. 4, the retrieval-based chat device 400 includes:
a first determining module 401, configured to determine, based on the text index library, a first candidate answer set through text features of a target question to be answered;
a second determining module 402, configured to determine, based on the semantic index library, a second candidate answer set through a semantic vector of the target question;
and a third determining module 403, configured to determine, based on a pre-trained matching model and according to the first candidate answer set and the second candidate answer set, at least one target answer with the highest matching degree with the target question as the reply to the target question.
In some embodiments, the text index library includes the first question-answer corpus, and the question index in the text index library is used for indexing a question in the first question-answer corpus;
the first determining module 401 includes:
a first determining submodule, configured to determine, based on the text index library and according to the text features of the target question to be answered, a first preset number of first questions with the greatest text similarity to the target question;
and a second determining submodule, configured to determine the first candidate answer set according to the answer corresponding to each first question in the first question-answer corpus.
In some embodiments, the first determining submodule is specifically configured to:
determine, based on the text index library, the text similarity between the current question and the target question according to one or more of: the number of co-occurring words between the retrieved current question and the target question, the proportion of the co-occurring words in the current question and in the target question respectively, and the importance of the co-occurring words in the first question-answer corpus;
determine, among the questions of the first question-answer corpus, a first preset number of first questions with the greatest text similarity to the target question;
where the current question is any one of the questions retrieved from the text index library according to the text features of the target question, and the lower the frequency with which a co-occurring word appears in the first question-answer corpus, the higher its importance.
In some embodiments, the first determination submodule is specifically configured to:
determining, based on the text index library, the text similarity between the retrieved current question and the target question according to the Jaccard distance between the two word sets corresponding to the current question and the target question;
determining, among the questions of the first question-answer pair corpus, a first preset number of first questions with the maximum text similarity to the target question;
wherein the current question is any one of the questions retrieved from the text index library according to the text features of the target question. A sketch of the Jaccard computation follows.
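The Jaccard distance between two word sets is one minus their Jaccard similarity; a minimal sketch:

```python
def jaccard_distance(current_q: set, target_q: set) -> float:
    """Jaccard distance = 1 - |intersection| / |union|; a smaller
    distance means a higher text similarity between the questions."""
    if not current_q and not target_q:
        return 0.0  # two empty word sets are identical by convention
    return 1.0 - len(current_q & target_q) / len(current_q | target_q)
```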
In some embodiments, the semantic index library comprises the first question-answer pair corpus, and the question index in the semantic index library is used for indexing a question in the first question-answer pair corpus; the second determining module 402 includes:
the third determining submodule is used for determining a second preset number of second questions with the maximum semantic similarity with the target question according to the semantic vector of the target question to be answered based on the semantic index library;
and the fourth determining submodule is used for determining a second candidate answer set according to the answer corresponding to each second question in the first question-answer pair corpus.
In some embodiments, the third determining submodule is specifically configured to:
determining a semantic vector of the target question;
based on the semantic index library, sorting the questions in the semantic index library according to the semantic similarity between the current semantic vector and the semantic vector of the target question;
selecting a second preset number of second questions with the largest semantic similarity according to the ordering of the questions in the semantic index library;
wherein the current semantic vector is any one of the semantic vectors in the semantic index library. A sketch of this ranking step follows.
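A compact sketch of this ranking step, assuming the semantic vectors of the index library are stacked in a matrix and compared by cosine similarity (the measure used in a later embodiment); the function name and numpy layout are assumptions:

```python
import numpy as np

def top_k_questions(target_vec, index_vecs, k):
    """Sort every semantic vector in the index library against the target
    question's vector and return the indices of the k most similar."""
    index_vecs = np.asarray(index_vecs, dtype=np.float32)  # (n_questions, dim)
    target_vec = np.asarray(target_vec, dtype=np.float32)  # (dim,)
    norms = np.linalg.norm(index_vecs, axis=1) * np.linalg.norm(target_vec)
    sims = index_vecs @ target_vec / np.maximum(norms, 1e-9)
    return np.argsort(-sims)[:k]  # positions of the second questions
```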
In some embodiments, the semantic vectors are extracted by a pre-trained natural language processing (NLP) model.
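One common way to extract such a sentence-level semantic vector with the Hugging Face transformers library; the model checkpoint and the mean pooling over token states are assumptions, since the patent only states that the vector comes from a pre-trained NLP model such as BERT:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-chinese" and mean pooling are illustrative choices
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModel.from_pretrained("bert-base-chinese")

def semantic_vector(question: str) -> torch.Tensor:
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)    # ignore padding tokens
    return (hidden * mask).sum(dim=1).squeeze(0) / mask.sum()
```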
In some embodiments, the third determining submodule is specifically configured to:
determining, based on the semantic index library, the cosine similarity between the current semantic vector and the semantic vector of the target question;
and sorting the questions in the semantic index library according to the cosine similarity.
In some embodiments, the NLP model includes any one of a BERT model, an ELMo model, and a GPT model.
In some embodiments, the pre-trained matching model is obtained by training an initial matching model according to training samples, the training samples including positive samples and negative samples; the positive sample is an original question in the second question-answer pair corpus and the original answer corresponding to the original question; the negative sample is the original question paired with the least similar answer among a third preset number of answers in the second question-answer pair corpus that are most similar to the original answer.
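The patent does not disclose the matching model's architecture, so the sketch below stands in with a simple scorer over question/answer vectors trained with binary cross-entropy on the positive and negative samples just described; everything about the network itself is an assumption:

```python
import torch
import torch.nn as nn

class MatchingModel(nn.Module):
    """Illustrative initial matching model: scores a (question, answer)
    pair from the concatenation of their semantic vectors."""
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, q_vec, a_vec):
        return self.scorer(torch.cat([q_vec, a_vec], dim=-1)).squeeze(-1)

def train_step(model, optimizer, q_vec, a_vec, label):
    # label: 1.0 for a positive (original) pair, 0.0 for a negative pair
    loss = nn.functional.binary_cross_entropy_with_logits(
        model(q_vec, a_vec), label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```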
In some embodiments, the above apparatus further comprises:
an execution module, configured to sequentially take, as the current answer, each of a third preset number of answers in the second question-answer pair corpus that are most similar in text similarity and/or semantic similarity to the original answer, and to execute the following step: if the text similarity and/or the semantic similarity between the current answer and the original answer is smaller than a preset value, determining that the current answer is the least similar answer to the original answer;
and a selecting module, configured to randomly select one answer from the third preset number of most similar answers as the least similar answer to the original answer if the text similarity and/or the semantic similarity between each of those answers and the original answer is larger than the preset value. A sketch combining these two modules follows.
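Putting the execution module and the selecting module together, negative-answer selection might look like the following sketch; `similarity` stands for whatever text and/or semantic similarity function is in use, and `threshold` is the preset value:

```python
import random

def pick_negative_answer(original_answer, most_similar, similarity, threshold):
    """Walk the third preset number of most similar answers in order and
    return the first whose similarity to the original answer falls below
    the preset value; if all of them exceed it, choose one at random."""
    for current_answer in most_similar:
        if similarity(current_answer, original_answer) < threshold:
            return current_answer          # the least similar answer
    return random.choice(most_similar)     # random fallback
```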
In some embodiments, the first question-answer pair corpus is the same as the second question-answer pair corpus.
The retrieval type chatting device provided by the embodiment of the present application has the same technical features as the retrieval type chatting method provided by the foregoing embodiment, so it can solve the same technical problems and achieve the same technical effects.
As shown in Fig. 5, an embodiment of the present application provides a computer device 500, including a processor 501, a memory 502 and a bus. The memory 502 stores machine-readable instructions executable by the processor 501; when the computer device is operated, the processor 501 communicates with the memory 502 through the bus and executes the machine-readable instructions to perform the steps of the retrieval type chatting method described above.
Specifically, the memory 502 and the processor 501 may be a general-purpose memory and a general-purpose processor, which are not particularly limited here; when the processor 501 runs a computer program stored in the memory 502, the retrieval type chatting method can be executed.
Corresponding to the retrieval type chatting method, an embodiment of the present application further provides a computer readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the retrieval type chatting method are performed.
The retrieval type chatting device provided by the embodiment of the present application may be specific hardware on a device, or software or firmware installed on a device. The device provided by the embodiment of the present application has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, where the device embodiments are silent, reference may be made to the corresponding content in the foregoing method embodiments. It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the present application, or the portions thereof that substantially contribute to the prior art, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the retrieval type chatting method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
It should be noted that like reference numbers and letters refer to like items in the figures; thus, once an item is defined in one figure, it need not be further defined or explained in subsequent figures. Moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that the above-mentioned embodiments are only specific embodiments of the present application, used to illustrate the technical solutions of the present application rather than to limit them, and the scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field can still modify the technical solutions described in the foregoing embodiments, or readily conceive of changes, or make equivalent substitutions for some technical features, within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the scope of the embodiments of the present application and are intended to be covered by the scope of the present application.

Claims (13)

1. A retrieval type chatting method, characterized in that the method is applied to a chat robot in which a text index library and a semantic index library are prestored, the question indexes in the text index library are text feature indexes, and the question indexes in the semantic index library are semantic vector indexes; the method comprises the following steps:
determining a first candidate answer set through text features of a target question to be answered based on the text index library;
determining a second candidate answer set through a semantic vector of the target question based on the semantic index library;
determining, based on a pre-trained matching model, at least one target answer with the highest matching degree with the target question according to the first candidate answer set and the second candidate answer set, as a reply to the target question; the pre-trained matching model is used for determining the matching degree between the question and the answer;
the pre-trained matching model is obtained by training an initial matching model according to training samples, wherein the training samples comprise positive samples and negative samples; the positive sample is an original question in a second question-answer pair corpus and the original answer corresponding to the original question; the negative sample is the original question paired with the least similar answer among a third preset number of answers in the second question-answer pair corpus that are most similar to the original answer;
further comprising: sequentially taking, as the current answer, each of a third preset number of answers in the second question-answer pair corpus that are most similar in text similarity and/or semantic similarity to the original answer, and executing the following step: if the text similarity and/or the semantic similarity between the current answer and the original answer is smaller than a preset value, determining that the current answer is the least similar answer to the original answer; and if the text similarity and/or the semantic similarity between each of the third preset number of most similar answers and the original answer is larger than the preset value, randomly selecting one answer from the third preset number of most similar answers as the least similar answer to the original answer.
2. The method of claim 1, wherein the text index library comprises a first question-answer pair corpus, and wherein a question index in the text index library is used for indexing a question in the first question-answer pair corpus; the step of determining a first candidate answer set by the text features of the target question to be answered based on the text index library comprises:
determining a first preset number of first questions with the maximum text similarity with the target question according to the text features of the target question to be answered based on the text index library;
and determining a first candidate answer set according to the answer corresponding to each first question in the first question-answer pair corpus.
3. The method according to claim 2, wherein the step of determining a first preset number of first questions with the maximum text similarity to the target question according to the text features of the target question to be answered based on the text index library comprises:
based on the text index library, determining the text similarity of the current question and the target question according to one or more of the number of co-occurring words appearing between the retrieved current question and the target question, the proportion of the co-occurring words in the current question and the target question respectively, and the importance degree of the co-occurring words in the first question-answer pair corpus;
determining a first preset number of first questions with the maximum text similarity to the target question among the questions of the first question-answer pair corpus;
the current question is any one of the questions retrieved from the text index library according to the text features of the target question, and the lower the frequency with which a co-occurring word appears in the first question-answer pair corpus, the higher its importance degree.
4. The method according to claim 2, wherein the step of determining a first preset number of first questions with the maximum text similarity to the target question according to the text features of the target question to be answered based on the text index library comprises:
based on the text index library, determining the text similarity between the retrieved current question and the target question according to the Jaccard distance between the two word sets corresponding to the current question and the target question;
determining a first preset number of first questions with the maximum text similarity to the target question among the questions of the first question-answer pair corpus;
and the current question is any one of the questions retrieved from the text index library according to the text features of the target question.
5. The method according to claim 1, wherein the semantic index library comprises a first question-answer pair corpus, and the question index in the semantic index library is used for indexing a question in the first question-answer pair corpus; the step of determining a second candidate answer set by a semantic vector of the target question based on the semantic index library comprises:
determining a second preset number of second questions with the maximum semantic similarity with the target question according to the semantic vector of the target question to be answered based on the semantic index library;
and determining a second candidate answer set according to the answer corresponding to each second question in the first question-answer pair corpus.
6. The method according to claim 5, wherein the step of determining a second preset number of second questions with the largest semantic similarity to the target question according to the semantic vector of the target question to be answered based on the semantic index library comprises:
determining a semantic vector of the target question;
based on the semantic index library, sorting the questions in the semantic index library according to the semantic similarity between the current semantic vector and the semantic vector of the target question;
selecting a second preset number of second questions with the largest semantic similarity according to the ordering of the questions in the semantic index library;
and the current semantic vector is any one of the semantic vectors in the semantic index library.
7. The method of claim 6, wherein the semantic vector is extracted by a pre-trained Natural Language Processing (NLP) model.
8. The method according to claim 7, wherein the step of ordering the questions in the semantic index library according to semantic similarity between the current semantic vector and the semantic vector of the target question based on the semantic index library comprises:
based on the semantic index library, determining the cosine similarity between the current semantic vector and the semantic vector of the target question;
and sorting the questions in the semantic index library according to the cosine similarity.
9. The method of claim 7, wherein the NLP model comprises any one of a BERT model, an ELMo model, and a GPT model.
10. The method of claim 9, wherein the first question-answer pair corpus is identical to the second question-answer pair corpus.
11. A retrieval type chatting device, characterized in that the device is applied to a chat robot in which a text index library and a semantic index library are prestored, the question indexes in the text index library are text feature indexes, and the question indexes in the semantic index library are semantic vector indexes; the device comprises:
a first determining module, configured to determine a first candidate answer set through text features of the target question to be answered based on the text index library;
a second determining module, configured to determine, based on the semantic index library, a second candidate answer set through a semantic vector of the target question;
a third determining module, configured to determine, based on a pre-trained matching model, at least one target answer with a highest matching degree with the target question according to the first candidate answer set and the second candidate answer set, where the at least one target answer is used as a reply to the target question; the pre-trained matching model is used for determining the matching degree of the question and the answer;
the pre-trained matching model is obtained by training an initial matching model according to training samples, wherein the training samples comprise positive samples and negative samples; the positive sample is an original question in a second question-answer pair corpus and the original answer corresponding to the original question; the negative sample is the original question paired with the least similar answer among a third preset number of answers in the second question-answer pair corpus that are most similar to the original answer;
an execution module, configured to sequentially take, as the current answer, each of a third preset number of answers in the second question-answer pair corpus that are most similar in text similarity and/or semantic similarity to the original answer, and to execute the following step: if the text similarity and/or the semantic similarity between the current answer and the original answer is smaller than a preset value, determining that the current answer is the least similar answer to the original answer;
and a selecting module, configured to randomly select one answer from the third preset number of most similar answers as the least similar answer to the original answer if the text similarity and/or the semantic similarity between each of those answers and the original answer is larger than the preset value.
12. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the method of any one of claims 1 to 10 when executing the computer program.
13. A computer readable storage medium having stored thereon machine executable instructions which, when invoked and executed by a processor, cause the processor to perform the method of any of claims 1 to 10.
CN201910985485.7A 2019-10-16 2019-10-16 Retrieval type chatting method and device and computer equipment Active CN110750616B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910985485.7A CN110750616B (en) 2019-10-16 2019-10-16 Retrieval type chatting method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910985485.7A CN110750616B (en) 2019-10-16 2019-10-16 Retrieval type chatting method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN110750616A CN110750616A (en) 2020-02-04
CN110750616B true CN110750616B (en) 2023-02-03

Family

ID=69278631

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910985485.7A Active CN110750616B (en) 2019-10-16 2019-10-16 Retrieval type chatting method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN110750616B (en)

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111414746B (en) * 2020-04-10 2023-11-07 建信金融科技有限责任公司 Method, device, equipment and storage medium for determining matching statement
CN111523304B (en) * 2020-04-27 2022-08-02 华东师范大学 Automatic generation method of product description text based on pre-training model
CN111581354A (en) * 2020-05-12 2020-08-25 金蝶软件(中国)有限公司 FAQ question similarity calculation method and system
CN111680158A (en) * 2020-06-10 2020-09-18 创新奇智(青岛)科技有限公司 Short text classification method, device, equipment and storage medium in open field
CN111881695A (en) * 2020-06-12 2020-11-03 国家电网有限公司 Audit knowledge retrieval method and device
CN111797216B (en) * 2020-06-28 2024-04-05 北京百度网讯科技有限公司 Search term rewriting method, apparatus, device and storage medium
CN111859987B (en) * 2020-07-28 2024-05-17 网易(杭州)网络有限公司 Text processing method, training method and device for target task model
CN111930880A (en) * 2020-08-14 2020-11-13 易联众信息技术股份有限公司 Text code retrieval method, device and medium
CN111897943A (en) * 2020-08-17 2020-11-06 腾讯科技(深圳)有限公司 Session record searching method and device, electronic equipment and storage medium
CN112182159B (en) * 2020-09-30 2023-07-07 中国人民大学 Personalized search type dialogue method and system based on semantic representation
CN112241626B (en) * 2020-10-14 2023-07-07 网易(杭州)网络有限公司 Semantic matching and semantic similarity model training method and device
CN113159187B (en) * 2021-04-23 2024-06-14 北京金山数字娱乐科技有限公司 Classification model training method and device and target text determining method and device
CN113139043B (en) * 2021-04-29 2023-08-04 北京百度网讯科技有限公司 Question-answer sample generation method and device, electronic equipment and storage medium
CN113326420B (en) 2021-06-15 2023-10-27 北京百度网讯科技有限公司 Question retrieval method, device, electronic equipment and medium
CN114090747A (en) * 2021-10-14 2022-02-25 特斯联科技集团有限公司 Automatic question answering method, device, equipment and medium based on multiple semantic matching
CN116414958A (en) * 2023-02-06 2023-07-11 飞算数智科技(深圳)有限公司 Text corpus generation method and device, storage medium and electronic equipment

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 A kind of automatic question answering processing method and automatically request-answering system
CN110019695A (en) * 2017-08-07 2019-07-16 芋头科技(杭州)有限公司 A kind of automatic chatting response method
CN110019644B (en) * 2017-09-06 2022-10-14 腾讯科技(深圳)有限公司 Search method, apparatus and computer-readable storage medium in dialog implementation
US20190260694A1 (en) * 2018-02-16 2019-08-22 Mz Ip Holdings, Llc System and method for chat community question answering
CN109271505B (en) * 2018-11-12 2021-04-30 深圳智能思创科技有限公司 Question-answering system implementation method based on question-answer pairs
CN109684442B (en) * 2018-12-21 2021-03-23 东软集团股份有限公司 Text retrieval method, device, equipment and program product
CN109740077B (en) * 2018-12-29 2021-02-12 北京百度网讯科技有限公司 Answer searching method and device based on semantic index and related equipment thereof
CN110321925B (en) * 2019-05-24 2022-11-18 中国工程物理研究院计算机应用研究所 Text multi-granularity similarity comparison method based on semantic aggregated fingerprints
CN110287494A (en) * 2019-07-01 2019-09-27 济南浪潮高新科技投资发展有限公司 A method of the short text Similarity matching based on deep learning BERT algorithm

Also Published As

Publication number Publication date
CN110750616A (en) 2020-02-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant