CN116340481B - Method and device for automatically replying to question, computer readable storage medium and terminal - Google Patents

Method and device for automatically replying to question, computer readable storage medium and terminal

Info

Publication number
CN116340481B
CN116340481B (granted publication of application CN202310182371.5A)
Authority
CN
China
Prior art keywords
question
similarity
questions
standard
answer
Prior art date
Legal status
Active
Application number
CN202310182371.5A
Other languages
Chinese (zh)
Other versions
CN116340481A (en)
Inventor
史可欢
徐清
蔡华
Current Assignee
Huayuan Computing Technology Shanghai Co ltd
Original Assignee
Huayuan Computing Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Huayuan Computing Technology Shanghai Co Ltd
Priority claimed from CN202310182371.5A
Publication of CN116340481A
Application granted
Publication of CN116340481B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • G06F40/35Discourse or dialogue representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and device for automatically replying to a question, a computer readable storage medium and a terminal. The method comprises the following steps: determining a question-answer library, wherein the question-answer library comprises a plurality of groups of questions and one or more answers corresponding to each group of questions, and each group of questions comprises a standard question and one or more similar questions corresponding to the standard question; determining a first similarity between a first sentence vector of an input question and a second sentence vector of each standard question in the question-answer library; if the maximum first similarity is smaller than a first threshold, screening the question-answer library based on each first similarity to obtain a to-be-matched question set; respectively determining a second similarity between the first sentence vector and a third sentence vector of each question in the to-be-matched question set; and if the maximum second similarity is greater than or equal to a second threshold, taking at least one answer corresponding to the question to which the maximum second similarity belongs as the final answer to the input question, wherein the first threshold is less than or equal to the second threshold. The scheme can improve the efficiency and accuracy of automatic replies.

Description

Method and device for automatically replying to question, computer readable storage medium and terminal
Technical Field
The present invention relates to the field of natural language processing technologies, and in particular, to a method and apparatus for automatically replying to a question, a computer readable storage medium, and a terminal.
Background
In social production and consumption activities, staff in many fields face a large number of consultations and questions from visitors, and the timeliness and accuracy of the answers determine the visitors' experience and affect subsequent work. For example, a merchant or host on an e-commerce platform needs to answer customers' questions about stores, merchandise information and logistics status, and staff at a scenic-spot visitor center need to answer tourists' questions about ticket information, routes within the scenic spot, and cultural and historical background, and so on. Faced with a huge number of questions, manual answering cannot meet timeliness requirements. On this basis, research has begun into applying natural language processing technology to the question-answering field, so that user questions are answered automatically by a machine. This improves the timeliness and accuracy of replies to questions and has significant research value.
Automatically replying to a user's question requires finding, in a question-answer library, the question sentence most similar to the current question and then replying with the answer corresponding to that most similar question sentence; it is essentially a technique for matching a question against an existing question list by similarity. Specifically, for a question input by a user, a terminal device (e.g., a robot) needs to find the question sentence most similar to the input question in the question-answer library and then reply with the answer corresponding to that most similar question sentence. In the prior art, question matching is mainly realized by calculating the similarity between questions, and common schemes include the following:
(1) Character string matching methods. Character string matching mainly includes word-by-word exact matching, keyword matching, regular-expression matching, and text matching by computing the edit distance (which measures the degree of difference between strings) or finding the longest common subsequence. However, character string matching methods cannot accurately identify the semantic features of questions, and questions that differ greatly in wording yet are semantically similar are easily misjudged as dissimilar.
(2) Short text classification schemes based on machine learning. Existing short text classification methods generally require supervised training of a short text classification model on a large labelled data set (usually requiring manual annotation). Such training data is costly to obtain and often insufficient in quantity, and it is difficult for it to cover the many diverse expressions of similar semantics found in real language, which in turn makes the final answers insufficiently accurate.
Disclosure of Invention
The technical problem solved by the embodiment of the invention is how to answer the input questions efficiently and accurately.
In order to solve the above technical problems, an embodiment of the present invention provides a method for automatically replying to a question, including the following steps: determining a question-answer library, wherein the question-answer library comprises a plurality of groups of questions and one or more answers corresponding to each group of questions, and each group of questions comprises a standard question and one or more similar questions corresponding to the standard question; determining a first sentence vector of an input question and a second sentence vector of each standard question in the question-answer library, and then respectively determining a first similarity between the first sentence vector and the second sentence vector of each standard question; if the maximum first similarity obtained is smaller than a first threshold, screening the questions in the question-answer library based on the obtained first similarities to obtain a to-be-matched question set; respectively determining a second similarity between the first sentence vector and a third sentence vector of each question in the to-be-matched question set; and if the maximum second similarity obtained is greater than or equal to a second threshold, taking at least one answer corresponding to the question to which the maximum second similarity belongs as the final answer to the input question, wherein the first threshold is less than or equal to the second threshold.
Optionally, the determining the first sentence vector of the input question and the second sentence vector of each standard question in the question-answer library includes: respectively inputting the input question and each standard question in the question-answer library into a preset language model to determine the first sentence vector of the input question and the second sentence vector of each standard question; the language model is obtained by first training an initial language model with a whole word masking method on a preset Chinese data set to obtain an optimized language model, and then fine-tuning the optimized language model with a training question set.
Optionally, the initial language model is a Chinese BERT model; the fine-tuning of the optimized language model with the training question set includes: inputting the training question set into the optimized language model for iterative training using the unsupervised SimCSE contrastive learning method, to obtain the language model.
Optionally, the screening the questions in the question-answer library based on the obtained first similarities to obtain the to-be-matched question set includes: sorting the obtained first similarities in descending order of value; selecting a preset number of top-ranked first similarities from the sorted first similarities; and taking the standard questions to which the selected first similarities belong, together with the similar questions corresponding to those standard questions, as the to-be-matched question set.
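As an illustration of this screening rule, the sketch below keeps the groups whose standard questions have the k largest first similarities. The function and variable names (`screen_top_k`, `sims`) and the sample values are ours, not from the patent.

```python
def screen_top_k(first_similarities, k):
    """first_similarities: list of (group_id, first_similarity) pairs.
    Returns the ids of the k groups with the largest first similarities,
    in descending order of similarity."""
    ranked = sorted(first_similarities, key=lambda t: t[1], reverse=True)
    return [gid for gid, _ in ranked[:k]]

# Toy first similarities between an input question and four standard questions.
sims = [("g0", 0.42), ("g1", 0.87), ("g2", 0.61), ("g3", 0.15)]
```

With `k = 2`, only groups `g1` and `g2` (and their similar questions) would enter the to-be-matched question set.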
Optionally, the screening the questions in the question-answer library based on the obtained first similarities to obtain the to-be-matched question set includes: selecting, from the first similarities, those whose values are greater than or equal to a third threshold; and taking the standard questions to which the selected first similarities belong, together with their corresponding similar questions, as the to-be-matched question set, wherein the third threshold is less than the first threshold.
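The alternative screening rule above can be sketched the same way: every group whose first similarity reaches a third threshold (itself below the first threshold) survives. The threshold values and names here are illustrative assumptions.

```python
FIRST_THRESHOLD = 0.9
THIRD_THRESHOLD = 0.5  # the patent requires this to be below the first threshold

def screen_by_threshold(first_similarities, third_threshold=THIRD_THRESHOLD):
    """first_similarities: list of (group_id, first_similarity) pairs.
    Keeps every group whose first similarity reaches the third threshold."""
    return [gid for gid, s in first_similarities if s >= third_threshold]

# Toy first similarities between an input question and four standard questions.
sims = [("g0", 0.42), ("g1", 0.87), ("g2", 0.61), ("g3", 0.15)]
```

Unlike the top-k variant, the size of the resulting set varies with how many standard questions clear the threshold.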
Optionally, the method further comprises: and if the maximum first similarity is greater than or equal to a first threshold value, taking at least one answer corresponding to the standard question to which the maximum first similarity belongs as a final answer of the input question.
Optionally, the method further comprises: and if the obtained maximum second similarity is smaller than the second threshold value, confirming that no answer corresponding to the input question exists in the question-answer library.
Optionally, the determining the question-answer library includes: determining an initial question set, wherein the initial question set comprises a plurality of initial question subsets, each belonging to a different domain; determining the domain to which a preset initial question-answer library containing one or more standard questions belongs; determining, from the initial question set, the initial question subset of the same domain as the initial question-answer library as a candidate question subset; for each standard question in the initial question-answer library, determining a third similarity between the standard question and each question in the candidate question subset; selecting, from the candidate question subset, each question whose third similarity is greater than or equal to a fourth threshold as a similar question corresponding to the standard question; and adding each selected similar question to the initial question-answer library to obtain the question-answer library.
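The library-construction flow above can be sketched as follows. Word-overlap (Jaccard) similarity stands in for the sentence-vector third similarity, and the data structures, names, and the `FOURTH_THRESHOLD` value are all illustrative assumptions, not taken from the patent.

```python
FOURTH_THRESHOLD = 0.3  # illustrative value

def jaccard(q1: str, q2: str) -> float:
    """Word-overlap similarity, a toy stand-in for the third similarity."""
    a, b = set(q1.split()), set(q2.split())
    return len(a & b) / len(a | b)

def build_qa_library(initial_library, initial_question_set, domain):
    """initial_library: {standard_question: [answers]}.
    initial_question_set: {domain: [questions]} (the per-domain subsets).
    Attaches each same-domain question whose similarity to a standard
    question reaches the fourth threshold as a similar question."""
    candidates = initial_question_set.get(domain, [])  # same-domain subset only
    library = {}
    for std, answers in initial_library.items():
        similar = [q for q in candidates
                   if q != std and jaccard(std, q) >= FOURTH_THRESHOLD]
        library[std] = {"answers": answers, "similar": similar}
    return library

initial_questions = {
    "travel": ["how far is beijing from shanghai",
               "distance from beijing to shanghai",
               "what time does the museum open"],
    "retail": ["when will my order ship"],
}
lib = build_qa_library({"how far is beijing from shanghai": ["about 1200 km"]},
                       initial_questions, "travel")
```

Restricting the candidates to the same-domain subset is what keeps the per-standard-question similarity computation small.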
Optionally, after determining the question-answer library, the method further comprises: inputting a standard question to be added into a trained text generation model to generate a plurality of similar questions corresponding to the standard question to be added, wherein the trained text generation model is obtained by training a preset text generation model on a data set composed of a plurality of groups of similar questions; and adding at least part of the standard questions to be added and their corresponding similar questions to the question-answer library.
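The patent uses a trained text generation model for this step; as a toy stand-in, the sketch below generates paraphrase candidates from hand-written templates, just to show how generated similar questions would be attached to a standard question in the library. Everything here (templates, names) is illustrative.

```python
# Hand-written templates standing in for a trained text generation model.
TEMPLATES = [
    "could you tell me {q}",
    "i would like to know {q}",
    "{q}, please",
]

def generate_similar(standard_question: str):
    """Produce several paraphrase candidates for a standard question."""
    return [t.format(q=standard_question) for t in TEMPLATES]

def add_to_library(library: dict, standard_question: str, answers):
    """Attach the standard question, its generated similar questions,
    and its answers to the question-answer library."""
    library[standard_question] = {
        "answers": list(answers),
        "similar": generate_similar(standard_question),
    }
    return library

lib = add_to_library({}, "how far is beijing from shanghai", ["about 1200 km"])
```

A real implementation would replace `generate_similar` with inference on the trained text generation model.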
The embodiment of the invention also provides a device for automatically replying to a question, comprising: a question-answer library determining module, used for determining a question-answer library, wherein the question-answer library comprises a plurality of groups of questions and one or more answers corresponding to each group of questions, and each group of questions comprises a standard question and one or more similar questions corresponding to the standard question; a first similarity determining module, used for determining a first sentence vector of the input question and a second sentence vector of each standard question in the question-answer library, and then respectively determining the first similarity between the first sentence vector and the second sentence vector of each standard question; a question screening module, used for screening the questions in the question-answer library based on the obtained first similarities to obtain a to-be-matched question set if the maximum first similarity obtained is smaller than a first threshold; a second similarity determining module, used for respectively determining a second similarity between the first sentence vector and a third sentence vector of each question in the to-be-matched question set; and an answer determining module, used for taking at least one answer corresponding to the question to which the maximum second similarity belongs as the final answer to the input question if the maximum second similarity obtained is greater than or equal to a second threshold, wherein the first threshold is less than or equal to the second threshold.
The embodiment of the invention also provides a computer readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the steps of the above method for automatically replying to a question.
The embodiment of the invention also provides a terminal which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the method for automatically replying to the question when running the computer program.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
The embodiment of the invention provides a method for automatically replying to a question, which comprises the steps of firstly determining a question-answer library, wherein each group of questions in the question-answer library comprises a standard question and one or more similar questions corresponding to the standard question; then, respectively determining a first similarity between a first sentence vector of the input question and a second sentence vector of each standard question; if the obtained maximum first similarity is smaller than a first threshold value, screening the questions in the question-answering library based on the obtained first similarities to obtain a to-be-matched question set; respectively determining second similarity between the first sentence vector and a third sentence vector of each question in the to-be-matched question set; determining a final answer to the input question based on a comparison of the maximum second similarity to a second threshold; wherein the first threshold is less than or equal to the second threshold.
In the embodiment of the invention, on the one hand, in practical applications, questions input by users with the same or similar semantics may be worded differently, even very differently, while sentence vectors of question sentences can accurately reflect the semantic information they contain. Therefore, compared with existing character string matching or regular matching schemes, which easily miss question sentences that differ greatly in expression from the input question but are similar in meaning, matching question sentences based on the similarity of sentence vectors can effectively improve the accuracy of question matching and thus the accuracy of the final answer. On the other hand, the embodiment of the invention first screens the questions in the question-answer library based on the first similarity between the input question and each standard question in the question-answer library to obtain a to-be-matched question set, and then determines the final answer to the input question based on the second similarity between the input question and each question in the to-be-matched question set. Since the number of questions in the screened to-be-matched question set is significantly smaller than the number of questions in the question-answer library, the amount of data involved in the computation can be greatly reduced, lowering the computational cost and improving the efficiency of automatically replying to questions.
Further, the determining a first sentence vector of the input question and a second sentence vector of each standard question in the question-answer library includes: respectively inputting the input question and each standard question in the question-answer library into a preset language model to determine the first sentence vector of the input question and the second sentence vector of each standard question; the language model is obtained by first training an initial language model with a whole word masking method on a preset Chinese data set to obtain an optimized language model, and then fine-tuning the optimized language model with a training question set.
In the embodiment of the invention, fine-tuning the optimized language model obtained by initial training with the training question set allows the model parameters to be further optimized. Further, the fine-tuning method may be the unsupervised SimCSE contrastive learning method, whose training goal is to increase the similarity of the two sentence vectors within a positive pair as much as possible while reducing the similarity between sentence vectors of positive and negative examples as much as possible. Compared with the prior art, which uses labelled data sets for supervised training, this unsupervised training approach can significantly reduce training cost. Furthermore, the sentence vectors output by the fine-tuned language model reflect the semantic features of different input questions more accurately, improving the accuracy of subsequent sentence-vector similarity calculations and thus the accuracy of the final answer.
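The unsupervised SimCSE objective described here is commonly formulated as an InfoNCE contrastive loss. The sketch below computes that loss on toy vectors in plain Python; in real SimCSE the two "views" of each sentence come from two dropout-perturbed forward passes of the same encoder, which is skipped here, so all vectors, names, and the temperature value are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def simcse_loss(views_a, views_b, temperature=0.05):
    """Mean InfoNCE loss over a batch: each views_a[i] should be close to
    views_b[i] (its positive pair) and far from every other views_b[j]
    (in-batch negatives)."""
    total = 0.0
    n = len(views_a)
    for i in range(n):
        logits = [cosine(views_a[i], views_b[j]) / temperature for j in range(n)]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        total += -(logits[i] - log_denom)  # -log softmax of the positive
    return total / n

# Well-aligned positive pairs yield a much lower loss than shuffled pairs.
anchors = [[1.0, 0.0], [0.0, 1.0]]
good = simcse_loss(anchors, [[0.99, 0.01], [0.01, 0.99]])
bad = simcse_loss(anchors, [[0.01, 0.99], [0.99, 0.01]])
```

Training drives the encoder toward the `good` configuration: two dropout views of the same question collapse onto nearby vectors, while different questions stay apart.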
Further, the determining the question-answer library includes: determining an initial question set, wherein the initial question set comprises a plurality of initial question subsets, each belonging to a different domain; determining the domain to which a preset initial question-answer library containing one or more standard questions belongs; determining, from the initial question set, the initial question subset of the same domain as the initial question-answer library as a candidate question subset; for each standard question in the initial question-answer library, determining a third similarity between the standard question and each question in the candidate question subset; selecting, from the candidate question subset, each question whose third similarity is greater than or equal to a fourth threshold as a similar question corresponding to the standard question; and adding each selected similar question to the initial question-answer library to obtain the question-answer library. In the embodiment of the invention, the huge initial question set is first divided by domain; then, according to the preset domain to which the initial question-answer library belongs, the similar questions corresponding to each standard question are screened from the question subset of the same domain. In this way, the similar questions corresponding to each standard question can be determined quickly, improving the efficiency of configuring the question-answer library.
Further, after determining the question-answer library, the method further comprises: inputting a standard question to be added into a trained text generation model to generate a plurality of similar questions corresponding to the standard question to be added, wherein the trained text generation model is obtained by training a preset text generation model on a data set composed of a plurality of groups of similar questions; and adding at least part of the standard questions to be added and their corresponding similar questions to the question-answer library. Because the trained text generation model has learned the characteristics shared by similar questions, it can quickly and accurately generate multiple similar questions for any standard question that needs to be added to the question-answer library according to actual scene requirements, so that the question-answer library can be updated in time and the number of questions in it expanded.
Drawings
FIG. 1 is a flow chart of a method for automatically replying to a question in an embodiment of the present invention;
FIG. 2 is a flow chart of one embodiment of step S11 of FIG. 1;
FIG. 3 is a flow chart of a first embodiment of step S13 of FIG. 1;
FIG. 4 is a flow chart of a second embodiment of step S13 of FIG. 1;
fig. 5 is a schematic structural diagram of an apparatus for automatically replying to a question according to an embodiment of the present invention.
Detailed Description
As previously mentioned, techniques for implementing an automatic answer question may be considered essentially as techniques for matching questions to existing question listings in a question and answer library. The accuracy of the match often directly affects the accuracy of the final answer.
In the prior art, finding the most similar question by question matching is mainly realized by calculating the similarity between questions, and common schemes include the following:
(1) Character string matching methods. Character string matching mainly includes word-by-word exact matching, keyword matching, regular-expression matching, and text matching by computing the edit distance (which measures the degree of difference between strings) or finding the longest common subsequence. However, exact matching and keyword matching are prone to mismatches on ambiguous and negative sentences and cannot take context into account; regular matching requires complex rules to be configured manually for each type of question and can hardly cover all language expressions exhaustively; and computing the edit distance or the longest common subsequence may narrow the matching range, missing questions that differ greatly in wording but are semantically similar while mismatching sentences with similar characters but very different semantics.
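To make the edit-distance limitation concrete, here is the standard Levenshtein edit distance in Python (illustrative code, not from the patent); note how two paraphrases of the same question still end up far apart on this metric.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions turning string a into string b (dynamic programming)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[m][n]

# Two paraphrases of the same question share little surface form, so
# their edit distance is large even though the meaning is the same.
q1 = "how far is beijing from shanghai"
q2 = "what is the distance between shanghai and beijing"
d = edit_distance(q1, q2)
```

A threshold on `d` that admits this pair would also admit many genuinely unrelated sentence pairs, which is exactly the narrow-range/mismatch trade-off described above.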
(2) Machine learning matching schemes. These mainly take the form of short text classification, which can be implemented with common classification algorithms such as naive Bayes, support vector machines and XGBoost. Existing machine learning matching schemes generally require the question set of a domain to be manually sorted, classified and labelled to obtain a training data set; the model is then trained in a supervised manner on this data set to obtain a short text classification model for the domain; finally, user questions are assigned to the corresponding class by the model and the preset answer for that class is returned. This approach relies on a large amount of manual labelling and suffers from a limited number of classes and a limited training data set; the model must be retrained whenever the classes change, which is costly; and the training data set cannot cover the huge number of diverse expressions of similar semantics in real language, so the final answers are insufficiently accurate.
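For reference, the supervised short-text classification baseline described in (2) can be sketched as a minimal multinomial naive Bayes classifier over bag-of-words counts with Laplace smoothing; the training sentences and class labels below are toy examples of ours, illustrating the dependence on manually labelled data.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial naive Bayes over whitespace-tokenized bags of words."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            words = text.split()
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, text):
        best_label, best_score = None, -math.inf
        total = sum(self.class_counts.values())
        for label in self.class_counts:
            score = math.log(self.class_counts[label] / total)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for w in text.split():
                # Laplace (add-one) smoothed log likelihood per word
                score += math.log((self.word_counts[label][w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# A manually labelled training set — the costly ingredient the passage criticizes.
clf = NaiveBayes().fit(
    ["how far is beijing", "distance to shanghai",
     "when does my order ship", "track my package"],
    ["travel", "travel", "logistics", "logistics"])
```

Adding a new class or rewording the classes requires relabelling and refitting, which is the retraining cost the passage points out.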
In order to solve the above technical problems, an embodiment of the present invention provides a method for automatically replying to a question, which specifically includes: determining a question-answer library, wherein the question-answer library comprises a plurality of groups of questions and one or more answers corresponding to each group of questions, and each group of questions comprises a standard question and one or more similar questions corresponding to the standard question; determining a first sentence vector of an input question and a second sentence vector of each standard question in the question-answer library, and then respectively determining a first similarity between the first sentence vector and the second sentence vector of each standard question; if the maximum first similarity obtained is smaller than a first threshold, screening the questions in the question-answer library based on the obtained first similarities to obtain a to-be-matched question set; respectively determining a second similarity between the first sentence vector and a third sentence vector of each question in the to-be-matched question set; and if the maximum second similarity obtained is greater than or equal to a second threshold, taking at least one answer corresponding to the question to which the maximum second similarity belongs as the final answer to the input question, wherein the first threshold is less than or equal to the second threshold.
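A minimal sketch of this two-stage flow, with toy two-dimensional vectors standing in for real sentence embeddings. The threshold values, the top-level names (`qa_library`, `reply`), and the third-threshold screening rule chosen for stage 2 are illustrative assumptions, not the patent's.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

FIRST_THRESHOLD = 0.9    # first threshold
SECOND_THRESHOLD = 0.9   # second threshold (>= first threshold)
THIRD_THRESHOLD = 0.3    # screening threshold (< first threshold)

# Each group: a standard-question vector, similar-question vectors, answers.
qa_library = [
    {"standard": [1.0, 0.0], "similar": [[0.9, 0.1]], "answers": ["answer 0"]},
    {"standard": [0.0, 1.0], "similar": [[0.1, 0.9]], "answers": ["answer 1"]},
]

def reply(first_vec):
    # Stage 1: first similarity against standard questions only.
    sims = [cosine(first_vec, g["standard"]) for g in qa_library]
    best = max(range(len(sims)), key=lambda i: sims[i])
    if sims[best] >= FIRST_THRESHOLD:
        return qa_library[best]["answers"][0]
    # Screening: keep groups whose first similarity passes the third threshold.
    candidates = [g for g, s in zip(qa_library, sims) if s >= THIRD_THRESHOLD]
    # Stage 2: second similarity against every question in the screened set.
    scored = [(cosine(first_vec, vec), g)
              for g in candidates
              for vec in [g["standard"]] + g["similar"]]
    if not scored:
        return None
    s, g = max(scored, key=lambda t: t[0])
    return g["answers"][0] if s >= SECOND_THRESHOLD else None
```

Stage 1 touches only one vector per group; the full set of similar questions is compared only for the small screened subset, which is the source of the efficiency gain described below.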
In the embodiment of the invention, on the one hand, in practical applications, questions input by users with the same or similar semantics may be worded differently, even very differently, while sentence vectors of question sentences can accurately reflect the semantic information they contain. Therefore, compared with existing character string matching or regular matching schemes, which easily miss question sentences that differ greatly in expression from the input question but are similar in meaning, the embodiment of the invention matches question sentences based on the similarity of sentence vectors, which can effectively improve the accuracy of question matching and thus the accuracy of the determined final answer.
On the other hand, the embodiment of the invention firstly screens the questions in the question-answering library based on the first similarity of each standard question in the input question and the question-answering library to obtain a to-be-matched question set; and determining the final answer of the input question based on the second similarity between the input question and each question in the to-be-matched question set. The number of the questions of the to-be-matched question set obtained through screening is obviously smaller than that of questions in a question-answer library, so that the data quantity participating in operation can be greatly reduced, the operation cost is reduced, and the efficiency of automatically replying to questions is improved.
In order to make the above objects, features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.
Referring to fig. 1, fig. 1 is a flowchart of a method for automatically replying to a question according to an embodiment of the present invention. The method may include steps S11 to S15:
Step S11: determining a question-and-answer library, wherein the question-and-answer library comprises a plurality of groups of questions and one or more answers corresponding to each group of questions, and each group of questions comprises a standard question and one or more similar questions corresponding to the standard question;
Step S12: determining a first sentence vector of an input question, determining a second sentence vector of each standard question in the question-answering library, and then respectively determining a first similarity between the first sentence vector and the second sentence vector of each standard question;
Step S13: if the obtained maximum first similarity is smaller than a first threshold value, screening the questions in the question-answering library based on the obtained first similarities to obtain a to-be-matched question set;
Step S14: and respectively determining second similarity between the first sentence vector and a third sentence vector of each question in the to-be-matched question set.
Step S15: and if the obtained maximum second similarity is larger than or equal to a second threshold value, taking at least one answer corresponding to the question to which the maximum second similarity belongs as the final answer of the input question.
Wherein the first threshold is less than or equal to the second threshold.
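The two-stage flow of steps S11 to S15 can be sketched as follows. This is a minimal illustration with toy sentence vectors; the dictionary layout of the question-answer library and the threshold values are hypothetical stand-ins, not the claimed implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two sentence vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def auto_reply(input_vec, qa_library, t1=0.9, t2=0.98, t3=0.8):
    # Step S12: first similarity between the input question and each standard question
    firsts = [(cosine(input_vec, g["standard_vec"]), g) for g in qa_library]
    best1, group1 = max(firsts, key=lambda p: p[0])
    if best1 >= t1:                       # direct hit on a standard question
        return group1["answers"][0]
    # Step S13: screen question groups whose first similarity reaches the third threshold
    candidates = [g for s, g in firsts if s >= t3]
    # Step S14: second similarity against every question (standard and similar) kept
    pool = [(cosine(input_vec, v), g)
            for g in candidates
            for v in [g["standard_vec"], *g["similar_vecs"]]]
    if not pool:
        return None                       # no answer in the question-answer library
    # Step S15: answer only if the maximum second similarity clears the second threshold
    best2, group2 = max(pool, key=lambda p: p[0])
    return group2["answers"][0] if best2 >= t2 else None
```

A direct hit on a standard question (step S13's "greater than or equal" branch) short-circuits the screening; otherwise the second-stage match over the smaller candidate set decides the answer.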
In the implementation of step S11, in each set of questions in the question-answering library, the standard questions and the similar questions are semantically identical or similar.
As one non-limiting example, consider the following questions: "how far the distance between Beijing and Shanghai is", "how far the Beijing is from Shanghai", "how many kilometers the Shanghai is from Beijing", "how much the distance between Shanghai and Beijing is", "how many kilometers from Shanghai to Beijing", etc. Although the expressions of these questions differ to some extent (for example, in word order), their semantics are the same or similar (specifically, the semantic information contained in the questions is the same or similar), so these questions can be grouped as one set of questions in the question-answering library.
In each group of questions obtained by division, one question may be randomly selected as the standard question, with the remaining questions treated as its similar questions; alternatively, the question that conforms to a preset expression style may be taken as the standard question and the others as its similar questions; or the standard question and its similar questions may be determined in other suitable ways.
Further, referring to fig. 2, fig. 2 is a flowchart of a specific embodiment of step S11 in fig. 1, and the process of determining the question-answer library in step S11 may specifically include steps S21 to S26.
In step S21, an initial question set is determined, said initial question set comprising a plurality of initial question subsets, wherein each initial question subset belongs to a different domain, respectively.
The initial question set may be a set of historical user questions crawled from major websites with a web crawler, or a set of historical user questions gathered from business scenarios in different domains. The initial question set contains a relatively large number of questions, for example from thousands to tens of thousands or even more.
In a specific implementation, the initial problem set may be divided based on different fields, so as to obtain a plurality of initial problem subsets.
Specifically, single-level, two-level, or multi-level classification may be employed. Taking two-level classification as an example, the first-level domains may include, but are not limited to: education, e-commerce, law, etc.; the second-level domains under the e-commerce domain may be subdivided into: food, health products, cosmetics, clothing, and the like. The initial question subsets in the initial question set may belong to the same or different first-level domains, but each belongs to a different second-level domain. The domain classification of the initial question set may be performed based on keywords, based on an existing text classification model, or by other suitable methods.
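A keyword-based two-level classifier of the kind described above can be sketched as follows; the keyword map, domain names, and question texts are hypothetical examples, and a text classification model could be substituted for the keyword lookup.

```python
def classify_domain(question, keyword_map):
    # Minimal keyword-based two-level domain classifier.
    # keyword_map: {(level1_domain, level2_domain): [keywords]} -- hypothetical layout
    for domains, keywords in keyword_map.items():
        if any(k in question for k in keywords):
            return domains
    return None

def split_by_domain(questions, keyword_map):
    # Partition an initial question set into per-domain initial question subsets (step S21)
    subsets = {}
    for q in questions:
        domain = classify_domain(q, keyword_map)
        if domain is not None:
            subsets.setdefault(domain, []).append(q)
    return subsets
```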
In step S22, for a preset initial question-and-answer library including one or more standard questions, a domain to which the initial question-and-answer library belongs is determined.
In a specific implementation, the initial question-answer library may be built by, for example, manually dividing domains in combination with the application scene, and then collecting, for a preset domain, a number of questions belonging to that domain together with their answers.
In step S23, an initial question subset that is the same as the domain to which the initial question-answering library belongs is determined from the initial question set as a candidate question subset.
In step S24, for each standard question in the initial question-answer library, a third similarity between the standard question and each question in the candidate question subset is determined.
In a specific implementation, the semantic information or semantic features expressed or contained by each question may be accurately represented by a sentence vector, and the similarity (for example, cosine similarity) between the sentence vector of the standard question and the sentence vector of each question in the candidate question subset may be used as the third similarity. The sentence vector of a question may be obtained with a suitable sentence vector generation model.
The similarity between sentence vectors mentioned in the embodiments of the present invention may be determined in any existing suitable manner, for example by a dot product between two sentence vectors, or by a dot product between a sentence vector and a multidimensional matrix (a matrix formed by stacking a plurality of sentence vectors).
In step S25, each question whose third similarity is greater than or equal to a fourth threshold is selected from the candidate question subset as a similar question corresponding to the standard question;
In step S26, each selected similar question is added to the initial question-answer library to obtain the question-answer library.
Specifically, for each standard question in the initial question-answering library, adding a similar question corresponding to the standard question selected from the candidate question subset to the initial question-answering library, thereby obtaining the question-answering library.
The fourth threshold may be set according to actual scene requirements, for example, an appropriate value in the value interval [0.8,1.0] may be selected as the fourth threshold. It will be appreciated that the larger the fourth threshold, the closer the semantics between the standard question and the corresponding similar question.
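Steps S24 to S26 can be sketched as follows with toy sentence vectors; the dictionary representation of standard questions and candidates is an assumed illustration, and the fourth threshold of 0.9 is one value from the interval suggested above.

```python
import math

def expand_library(standards, candidates, fourth_threshold=0.9):
    # Steps S24-S26: for each standard question, pick every candidate question
    # whose third similarity (cosine of sentence vectors) clears the fourth
    # threshold, and record it as a similar question of that standard question.
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    library = {}
    for text, vec in standards.items():
        library[text] = [c_text for c_text, c_vec in candidates.items()
                         if cos(vec, c_vec) >= fourth_threshold]
    return library
```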
In a specific implementation, in addition to the method for determining a question and answer library provided in the embodiment shown in fig. 2, other suitable methods may be used to determine the question and answer library. For example, semantic similarity analysis may be performed manually on several candidate questions, and then each candidate question with the same or similar semantics may be divided into a set of questions in the question-answer library. For another example, respective initial sentence vectors may be generated for a plurality of candidate questions, and then the plurality of candidate questions with similarity between the initial sentence vectors being greater than or equal to a preset similarity threshold value are divided into a group of questions in the question-answer library.
Further, after determining the question and answer library, the method may further include: inputting a standard problem to be added into a trained text generation model to generate a plurality of similar problems corresponding to the standard problem to be added, wherein the trained text generation model is obtained by training a preset text generation model by adopting a data set formed by a plurality of groups of similar problems; and adding at least part of the standard questions to be added and the corresponding multiple similar questions to the question-answering library.
The multiple groups of similar questions may be obtained by classifying an initial question set (the initial question set in step S21 may be used directly) with an existing text classification model. The preset text generation model may be an existing machine learning model capable of text generation, such as the Bidirectional and Auto-Regressive Transformers (BART) model. BART is an open-source text understanding and generation model that performs well on tasks such as text summarization and dialog generation.
In the embodiment of the invention, the text generation model is trained with multiple groups of similar questions, so that the model can learn the characteristics shared by similar questions. Therefore, in combination with actual scene requirements, multiple similar questions can be generated quickly and accurately for any standard question that needs to be added to the question-answering library, and the library can be updated in time to expand the number of questions it contains.
With continued reference to fig. 1, in the implementation of step S12, the semantic information or semantic features contained in the input question are represented by using the first sentence vector, and the semantic information or semantic features contained in each standard question in the question-answering library are represented by using the second sentence vector; question matching is then performed based on a method of calculating similarity of sentence vectors (e.g., calculating cosine similarity).
It should be noted that, compared with the question itself or a character string derived from it, the sentence vector (sentence embedding) of a question reflects the semantic features of the whole sentence. This helps avoid the weakness of simple character matching, which, lacking semantic information, easily misjudges questions that differ greatly in wording but are semantically similar as dissimilar (or of low similarity).
Further, the method for determining the first sentence vector and the second sentence vector in step S12 may specifically include: respectively inputting the input question and each standard question in the question-answering library into a preset language model to determine the first sentence vector of the input question and the second sentence vector of each standard question in the question-answering library.
The first sentence vector of the input question and the second sentence vector of the standard question may be one-dimensional vector data having a preset length, for example, may be one-dimensional vector having a length of 768. Specifically, the length of the sentence vector may refer to the number of codes contained in the sentence vector. If there are N standard questions in the question-and-answer library, all second sentence vectors of the N standard questions may form a matrix with dimension n×768.
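With all second sentence vectors stacked into an N×768 matrix as described, the N first similarities can be obtained in a single matrix-vector product; this sketch uses numpy for illustration, with a toy dimension in place of 768.

```python
import numpy as np

def first_similarities(input_vec, standard_matrix):
    # Row-normalize both sides so a plain dot product yields cosine similarities.
    # standard_matrix: shape (N, d), one second sentence vector per standard question.
    q = input_vec / np.linalg.norm(input_vec)
    m = standard_matrix / np.linalg.norm(standard_matrix, axis=1, keepdims=True)
    return m @ q  # shape (N,): first similarity for each standard question
```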
The language model is obtained by first pre-training an initial language model with a whole-word masking method and a preset Chinese data set to obtain an optimized language model, and then performing fine-tuning training on the optimized language model with a training question set.
Wherein, the preset Chinese data set can be a large-scale data set obtained from different fields of multiple channels such as encyclopedia, news, question and answer, and the like; the training problem set may directly employ the initial problem set described in step S21.
Further, the initial language model is a Chinese BERT model; the fine-tuning training of the optimized language model with the training question set includes: inputting the training question set into the optimized language model for iterative training with an unsupervised SimCSE contrastive learning method and a preset loss function, until the loss function converges or the number of iterations reaches a preset count, so as to obtain the language model.
The loss function may be an existing contrastive learning loss function, such as the Multiple Negatives Ranking Loss.
Specifically, in the unsupervised SimCSE contrastive learning method, each question in the training question set is input into the optimized language model twice, each time with a different random dropout mask, so as to obtain a pair of different sentence vectors; this pair serves as a positive example, while the sentence vectors obtained by inputting the other questions in the training question set into the optimized language model serve as negative examples. The training objective is to increase the similarity between the two sentence vectors of the positive pair and to decrease the similarity between positive and negative examples, until a termination condition is reached.
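The contrastive objective can be illustrated with a toy Multiple Negatives Ranking loss over in-batch negatives; the two-dimensional "embeddings" here are stand-ins for the pairs of model-encoded sentence vectors, and the scale factor is an assumed hyperparameter.

```python
import math

def mnr_loss(anchors, positives, scale=20.0):
    # Multiple Negatives Ranking loss: each anchor's own positive is the target;
    # every other positive in the batch serves as an in-batch negative.
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [scale * cos(a, p) for p in positives]
        # Cross-entropy with the matching positive (index i) as the correct class
        log_softmax = logits[i] - math.log(sum(math.exp(z) for z in logits))
        total -= log_softmax
    return total / len(anchors)
```

When each anchor is close to its own positive and far from the others, the loss is near zero; mismatched pairs drive it up, which is exactly the training signal described above.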
Compared with supervised training, which requires labeled data sets that are difficult to acquire and makes training costly, the unsupervised training mode adopted here can significantly reduce the training cost. Further, the language model fine-tuned with the contrastive learning method outputs sentence vectors that more accurately reflect the semantic features of different input questions, improving its ability to distinguish similar questions from dissimilar ones; this improves the accuracy of subsequent sentence-vector similarity calculations and, in turn, the accuracy of the answers.
In a specific implementation, the language model may also be obtained by fine-tuning the Chinese RoBERTa-wwm-ext model with a training question set. The Chinese RoBERTa-wwm-ext model is trained from the Chinese BERT model with a whole-word masking method and a large-scale data set (an extended Chinese data set of about 5.4 billion words gathered from multiple channels such as encyclopedias, news, and question answering). Compared with the Chinese BERT model, the Chinese RoBERTa-wwm-ext model shows significant improvement on tasks such as machine reading comprehension, single-sentence classification, and sentence-pair classification.
For the method for performing fine tuning training on the chinese RoBERTa-wwm-ext model, reference is made to the above method for performing unsupervised training on the optimized language model, which is not described herein.
In the implementation of step S13, it is determined whether the maximum value (i.e., the maximum first similarity) of the first similarity between the first sentence vector of the input question and the second sentence vector of each standard question determined in step S12 is smaller than a first threshold; and if the maximum first similarity is smaller than a first threshold, screening the questions in the question-answering library based on the first similarities obtained in the step S12 to obtain a question set to be matched.
Further, if the maximum first similarity is greater than or equal to a first threshold, at least one answer corresponding to the standard question to which the maximum first similarity belongs is used as a final answer of the input question.
Specifically, if the maximum first similarity is greater than or equal to the first threshold, one answer may be selected at random from the answers corresponding to the standard question to which the maximum first similarity belongs and used as the final answer of the input question; alternatively, the answer with the highest priority among those answers may be selected as the final answer.
The standard problem to which the maximum first similarity belongs specifically refers to: and the second sentence vector of each standard question of the question-answering library has the largest first similarity value with the first sentence vector of the input question, and the standard question to which the second sentence vector belongs.
In a specific implementation, the first threshold may be set according to actual needs. It is understood that the greater the first threshold setting, the greater the probability that the maximum first similarity in step S13 is smaller than the first threshold; the smaller the first threshold setting is, the greater the probability that the maximum first similarity in step S13 is equal to or greater than the first threshold is.
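The answer-selection step for a direct hit can be sketched as follows; the `(answer_text, priority)` pair representation is an assumed illustration of "multiple answers with priorities", not a format stated in the source.

```python
import random

def pick_answer(answers, strategy="priority"):
    # answers: list of (answer_text, priority) pairs -- hypothetical representation.
    # When several answers correspond to the matched standard question, either
    # take the one with the highest priority or pick one at random.
    if strategy == "priority":
        return max(answers, key=lambda a: a[1])[0]
    return random.choice(answers)[0]
```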
Further, referring to fig. 3, fig. 3 is a flowchart of a first embodiment of step S13 in fig. 1. In the first specific embodiment of the step S13, the screening the questions in the question-answering library based on the obtained first similarities to obtain the question set to be matched may include steps S31 to S33.
In step S31, the obtained first similarities are sorted in descending order of value.

In step S32, a preset number of top-ranked first similarities are selected from the sorted first similarities.

In a specific implementation, the preset number may be set according to actual needs; for example, it may be determined as the total number of first similarities multiplied by a preset percentage. Without limitation, the preset percentage may be an appropriate value in [15%,30%]. It can be understood that the larger the preset number, the more questions the screened to-be-matched question set contains.
In step S33, the standard questions to which the selected first similarities belong, together with their corresponding similar questions, are used as the to-be-matched question set.
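Steps S31 to S33 can be sketched as a top-K screening; the 20% default fraction is one value from the percentage interval mentioned above, and the group objects stand in for standard questions plus their similar questions.

```python
def screen_top_k(first_sims, groups, fraction=0.2):
    # Steps S31-S33: sort the first similarities in descending order and keep
    # the top `fraction` of question groups (preset percentage, e.g. 15%-30%).
    k = max(1, round(len(first_sims) * fraction))
    ranked = sorted(range(len(groups)), key=lambda i: first_sims[i], reverse=True)
    return [groups[i] for i in ranked[:k]]
```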
Further, referring to fig. 4, fig. 4 is a flowchart of a second embodiment of step S13 in fig. 1. In the second specific embodiment of the step S13, the screening of the questions in the question-answering library based on the obtained first similarities may specifically include steps S41 to S42.
In step S41, from among the first similarities, a first similarity having a value equal to or greater than a third threshold value is selected.
Wherein the third threshold is less than the first threshold, and the first threshold is less than or equal to the second threshold.
In a specific embodiment, the second threshold may be set to 0.9; the first threshold may be set to an appropriate value in [0.8,0.9 ]; the third threshold value may be set to an appropriate value in [0.7, 0.8).
In another specific embodiment, the second threshold may be set to 0.98; the first threshold may be set to an appropriate value in [0.9,0.98 ]; the third threshold may be set to an appropriate value in [0.8,0.9).
It should be noted that, regarding the specific value settings of the first threshold, the second threshold, and the third threshold, on the premise of meeting the magnitude relation, the specific values may be set to other appropriate values according to the specific scene requirement, which is not limited in the embodiment of the present invention.
In step S42, the standard questions to which the selected first similarities belong, together with their corresponding similar questions, are used as the to-be-matched question set.
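The second screening embodiment (steps S41 to S42) is a plain threshold filter; the 0.8 default corresponds to one of the example third-threshold intervals above, and the group objects again stand in for standard questions plus their similar questions.

```python
def screen_by_threshold(first_sims, groups, third_threshold=0.8):
    # Steps S41-S42: keep every question group whose first similarity is
    # greater than or equal to the third threshold.
    return [g for s, g in zip(first_sims, groups) if s >= third_threshold]
```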
With continued reference to fig. 1, in the implementation of step S14, the specific method for determining the third sentence vector of each question in the set of questions to be matched may refer to the method for determining the first sentence vector and the second sentence vector in step S12, which is not described herein.
In the implementation of step S15, it is determined whether the maximum value of the second similarity (i.e., the maximum second similarity) between the first sentence vector of the input question and the third sentence vector of each question in the set of questions to be matched determined in step S14 is equal to or greater than a second threshold; and if the maximum second similarity is larger than or equal to the second threshold, at least one answer corresponding to the question to which the maximum second similarity belongs is taken as the final answer of the input question.
Further, if the maximum second similarity is smaller than the second threshold, confirming that no answer corresponding to the input question exists in the question-answer library.
Further, after confirming that no answer corresponding to the input question exists in the question-answer library, corresponding prompt information (or a fallback reply) can be output to the user, prompting that no answer was found for the input question or suggesting that the user ask through other channels (such as a manual service window). The prompt information can be presented as text, voice, animation, and the like.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an apparatus for automatically replying to a question according to an embodiment of the present invention. The device for automatically replying to the question can comprise:
a question and answer library determining module 51, configured to determine a question and answer library, where the question and answer library includes a plurality of sets of questions and one or more answers corresponding to each set of questions, and each set of questions includes a standard question and one or more similar questions corresponding to the standard question;
A first similarity determining module 52, configured to determine a first sentence vector of an input question, determine a second sentence vector of each standard question in the question-answer library, and then determine a first similarity between the first sentence vector and the second sentence vector of each standard question, respectively;
A question screening module 53, configured to screen questions in the question-answering library based on each obtained first similarity if the obtained maximum first similarity is smaller than a first threshold, to obtain a to-be-matched question set;
a second similarity determining module 54, configured to determine a second similarity between the first sentence vector and a third sentence vector of each question in the set of questions to be matched;
An answer determining module 55, configured to, if the obtained maximum second similarity is greater than or equal to a second threshold, take at least one answer corresponding to the question to which the maximum second similarity belongs as a final answer of the input question;
wherein the first threshold is less than or equal to the second threshold.
For the principles, specific implementations and advantages of the device for automatically replying to a question, please refer to the foregoing and the related descriptions of the method for automatically replying to a question shown in fig. 1 to 4, which are not repeated herein.
The embodiment of the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method for automatically replying to a question shown in fig. 1 to 4. The computer readable storage medium may include non-volatile or non-transitory memory, and may also include optical disks, mechanical hard disks, solid state disks, and the like.
Specifically, in the embodiment of the present invention, the processor may be a central processing unit (CPU), or another general purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present application may be either volatile or nonvolatile memory, or may include both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable programmable ROM (EPROM), an electrically erasable programmable ROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the method for automatically replying to the question shown in the figures 1 to 4 when running the computer program. The terminal can include, but is not limited to, terminal equipment such as a mobile phone, a computer, a tablet computer, a server, a cloud platform, and the like.
It should be understood that the term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean: a exists alone, A and B exist together, and B exists alone. In this context, the character "/" indicates that the front and rear associated objects are an "or" relationship.
The term "plurality" as used in the embodiments of the present application means two or more.
The descriptions "first", "second", etc. in the embodiments of the present application are only used to identify and distinguish the described objects; they imply no order, do not limit the number of devices in the embodiments of the present application, and should not be construed as limiting the embodiments of the present application.
It should be noted that the serial numbers of the steps in the present embodiment do not represent a limitation on the execution sequence of the steps.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by those skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention should therefore be determined by the appended claims.

Claims (9)

1. A method of automatically replying to a question, comprising:
Determining a question-and-answer library, wherein the question-and-answer library comprises a plurality of groups of questions and one or more answers corresponding to each group of questions, and each group of questions comprises a standard question and one or more similar questions corresponding to the standard question; determining a first sentence vector of an input question, determining a second sentence vector of each standard question in the question-answering library, and then respectively determining a first similarity between the first sentence vector and the second sentence vector of each standard question;
If the obtained maximum first similarity is smaller than a first threshold value, screening the questions in the question-answering library based on the obtained first similarities to obtain a to-be-matched question set;
Respectively determining second similarity between the first sentence vector and a third sentence vector of each question in the to-be-matched question set;
If the obtained maximum second similarity is greater than or equal to a second threshold value, at least one answer corresponding to the question to which the maximum second similarity belongs is used as the final answer of the input question;
Wherein the first threshold is less than or equal to the second threshold;
The screening of the questions in the question-answering library based on the obtained first similarity to obtain a question set to be matched includes:
Sorting the obtained first similarities in descending order of value; selecting a preset number of top-ranked first similarities from the sorted first similarities; and taking the standard questions to which the selected first similarities belong, together with their corresponding similar questions, as the to-be-matched question set;
or
Selecting, from the first similarities, the first similarities whose values are greater than or equal to a third threshold; and taking the standard questions to which the selected first similarities belong, together with their corresponding similar questions, as the to-be-matched question set; wherein the third threshold is less than the first threshold;
the determining the question-answer library comprises the following steps:
Determining an initial question set comprising a plurality of initial question subsets; determining the domain to which a preset initial question-answer library containing one or more standard questions belongs; determining, from the initial question set, an initial question subset in the same domain as the initial question-answer library as a candidate question subset; for each standard question in the initial question-answer library, determining a third similarity between the standard question and each question in the candidate question subset; selecting, from the candidate question subset, each question whose third similarity is greater than or equal to a fourth threshold as a similar question corresponding to the standard question; and adding each selected similar question to the initial question-answer library to obtain the question-answer library;
Wherein the plurality of initial question subsets are obtained by performing domain division on the initial question set in a multi-level classification manner, and each initial question subset belongs to a different domain.
2. The method of claim 1, wherein the determining a first sentence vector for the input question and determining a second sentence vector for each standard question in the question-answering library comprises:
Respectively inputting each standard question in the input question and the question-answering library into a preset language model to determine a first sentence vector of the input question and a second sentence vector of each standard question in the question-answering library;
The language model is obtained by first pre-training an initial language model with a whole-word masking method and a preset Chinese data set to obtain an optimized language model, and then performing fine-tuning training on the optimized language model with a training question set.
3. The method of claim 2, wherein the initial language model is a Chinese BERT model;
and wherein performing fine-tuning training on the optimized language model with the training question set comprises:
inputting the training question set into the optimized language model for iterative training with the unsupervised SimCSE contrastive learning method, to obtain the language model.
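The unsupervised SimCSE step above trains the encoder so that two dropout-perturbed encodings of the same question agree. A minimal NumPy sketch of the contrastive (InfoNCE) objective it optimizes follows; the function name and the temperature value are illustrative assumptions, and a real run would backpropagate this loss through the BERT encoder:

```python
import numpy as np

def simcse_loss(z1, z2, temperature=0.05):
    """InfoNCE objective used by unsupervised SimCSE: z1[i] and z2[i]
    are sentence vectors of the SAME training question under two
    different dropout masks; every z2[j] with j != i serves as an
    in-batch negative. Lower loss means matched pairs are most similar."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature                        # scaled cosine similarities
    # cross-entropy with the diagonal (the true pair) as the target class
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))
```

With perfectly aligned pairs the loss approaches zero; shuffling the pairing drives it up, which is exactly the signal that pulls paraphrased questions together in vector space.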
4. The method according to claim 1, wherein the method further comprises:
if the maximum first similarity is greater than or equal to the first threshold, taking at least one answer corresponding to the standard question to which the maximum first similarity belongs as the final answer to the input question.
5. The method according to claim 1, wherein the method further comprises:
if the obtained maximum second similarity is smaller than the second threshold, confirming that the question-answer library contains no answer corresponding to the input question.
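Claims 1, 4 and 5 together define a two-stage matching policy. The sketch below is an assumption-laden illustration: the names `reply` and `jaccard` are invented, a token-overlap score stands in for the sentence-vector similarity, and the screening step that narrows the second stage is omitted for brevity.

```python
def jaccard(a, b):
    # Stand-in similarity; the patent compares sentence vectors
    # produced by the fine-tuned language model instead.
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def reply(input_q, qa_library, sim, first_threshold, second_threshold):
    """Two-stage matching. qa_library maps each standard question to
    (similar_questions, answer); first_threshold <= second_threshold."""
    # Stage 1: compare the input question against standard questions only.
    first = {std: sim(input_q, std) for std in qa_library}
    best_std = max(first, key=first.get)
    if first[best_std] >= first_threshold:
        return qa_library[best_std][1]      # claim 4: answer directly
    # Stage 2: widen the comparison to the similar questions as well.
    best_score, best_answer = -1.0, None
    for std, (similars, answer) in qa_library.items():
        for q in [std] + similars:
            s = sim(input_q, q)
            if s > best_score:
                best_score, best_answer = s, answer
    if best_score >= second_threshold:
        return best_answer                  # second similarity high enough
    return None                             # claim 5: no answer in library
```

The cheap first stage handles questions phrased close to a standard question; only misses pay for the wider second-stage comparison against every similar question.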
6. The method of claim 1, wherein, after determining the question-answer library, the method further comprises:
inputting a standard question to be added into a trained text generation model to generate a plurality of similar questions corresponding to the standard question to be added, wherein the trained text generation model is obtained by training a preset text generation model on a dataset composed of a plurality of groups of similar questions;
and adding at least part of the standard questions to be added, together with their corresponding similar questions, to the question-answer library.
7. An apparatus for automatically replying to a question, comprising:
a question-answer library determining module, configured to determine a question-answer library, wherein the question-answer library comprises a plurality of groups of questions and one or more answers corresponding to each group of questions, and each group of questions comprises a standard question and one or more similar questions corresponding to the standard question;
a first similarity determining module, configured to determine a first sentence vector of an input question, determine a second sentence vector of each standard question in the question-answer library, and then respectively determine a first similarity between the first sentence vector and the second sentence vector of each standard question;
a question screening module, configured to screen the questions in the question-answer library based on the obtained first similarities to obtain a question set to be matched, if the obtained maximum first similarity is smaller than a first threshold;
a second similarity determining module, configured to respectively determine a second similarity between the first sentence vector and a third sentence vector of each question in the question set to be matched;
an answer determining module, configured to take at least one answer corresponding to the question to which the maximum second similarity belongs as the final answer to the input question, if the obtained maximum second similarity is greater than or equal to a second threshold;
wherein the first threshold is less than or equal to the second threshold;
wherein the question screening module further performs the following steps:
sorting the obtained first similarities in descending order of value; selecting a preset number of top-ranked first similarities from the sorted first similarities; and taking the standard questions to which the selected first similarities belong, together with the corresponding similar questions, as the question set to be matched;
or
selecting, from the first similarities, the first similarities whose values are greater than or equal to a third threshold; and taking the standard questions to which the selected first similarities belong, together with the corresponding similar questions, as the question set to be matched; wherein the third threshold is less than the first threshold;
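The two screening strategies of the question screening module can be sketched as a single function; the name `screen` and the example scores are illustrative assumptions:

```python
def screen(first_sims, top_k=None, third_threshold=None):
    """Build the set of standard questions to match against: either the
    top-k by first similarity, or every question whose first similarity
    is at least the third threshold (the third threshold sits below the
    first threshold). first_sims maps standard question -> first similarity."""
    if top_k is not None:
        # Strategy 1: sort descending and keep a preset number.
        ranked = sorted(first_sims, key=first_sims.get, reverse=True)
        keep = ranked[:top_k]
    else:
        # Strategy 2: keep everything at or above the third threshold.
        keep = [q for q, s in first_sims.items() if s >= third_threshold]
    # The question set to be matched is these standard questions plus
    # their similar questions, looked up in the question-answer library.
    return keep
```

Either strategy bounds the second-stage work: top-k gives a fixed-size candidate set, while the third threshold adapts the set size to how close the first-stage scores came to a match.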
wherein the question-answer library determining module further performs the following steps:
determining an initial question set comprising a plurality of initial question subsets; determining the domain to which a preset initial question-answer library containing one or more standard questions belongs; selecting, from the initial question set, the initial question subset belonging to the same domain as the initial question-answer library as a candidate question subset; for each standard question in the initial question-answer library, determining a third similarity between the standard question and each question in the candidate question subset; selecting, from the candidate question subset, each question whose third similarity is greater than or equal to a fourth threshold as a similar question corresponding to the standard question; and adding each selected similar question to the initial question-answer library to obtain the question-answer library;
wherein the plurality of initial question subsets are obtained by performing domain division on the initial question set in a multi-stage classification manner, each initial question subset belonging to a different domain.
8. A computer readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method for automatically replying to a question according to any one of claims 1 to 6.
9. A terminal comprising a memory and a processor, the memory having stored thereon a computer program capable of running on the processor, wherein the processor, when running the computer program, performs the steps of the method for automatically replying to a question according to any one of claims 1 to 6.
CN202310182371.5A 2023-02-27 2023-02-27 Method and device for automatically replying to question, computer readable storage medium and terminal Active CN116340481B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310182371.5A CN116340481B (en) 2023-02-27 2023-02-27 Method and device for automatically replying to question, computer readable storage medium and terminal

Publications (2)

Publication Number Publication Date
CN116340481A CN116340481A (en) 2023-06-27
CN116340481B true CN116340481B (en) 2024-05-10

Family

ID=86886854

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609101A (en) * 2017-09-11 2018-01-19 远光软件股份有限公司 Intelligent interactive method, equipment and storage medium
CN108664465A (en) * 2018-03-07 2018-10-16 珍岛信息技术(上海)股份有限公司 Method and related apparatus for automatically generating text
CN109241249A (en) * 2018-07-16 2019-01-18 阿里巴巴集团控股有限公司 Method and device for determining burst questions
CN111797214A (en) * 2020-06-24 2020-10-20 深圳壹账通智能科技有限公司 FAQ database-based problem screening method and device, computer equipment and medium
CN112685545A (en) * 2020-12-29 2021-04-20 浙江力石科技股份有限公司 Intelligent voice interaction method and system based on multi-core word matching



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant