CN111753076B - Dialogue method, dialogue device, electronic equipment and readable storage medium - Google Patents

Dialogue method, dialogue device, electronic equipment and readable storage medium

Info

Publication number
CN111753076B
CN111753076B (granted publication of application CN202010808251.8A)
Authority
CN
China
Prior art keywords
sample
reinforcement learning
learning model
round
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010808251.8A
Other languages
Chinese (zh)
Other versions
CN111753076A (en)
Inventor
侯政旭
赵瑞辉
黄展鹏
赵博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010808251.8A priority Critical patent/CN111753076B/en
Publication of CN111753076A publication Critical patent/CN111753076A/en
Application granted granted Critical
Publication of CN111753076B publication Critical patent/CN111753076B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to the technical field of artificial intelligence and discloses a dialogue method, a dialogue device, electronic equipment, and a readable storage medium. The dialogue method includes: acquiring a target question sentence input by a user; determining a feedback action corresponding to the target question sentence based on a trained reinforcement learning model, where the reinforcement learning model is trained based on at least two rounds of sample dialogue and a reward function determined from the at least two rounds of sample dialogue, and each round of sample dialogue includes a sample question sentence and a corresponding sample answer sentence; and determining a target answer sentence corresponding to the feedback action and outputting the target answer sentence. The dialogue method enables the finally trained reinforcement learning model to achieve higher prediction accuracy.

Description

Dialogue method, dialogue device, electronic equipment and readable storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a dialogue method, a dialogue device, electronic equipment, and a readable storage medium.
Background
With the development of information technology and the Internet, users often need to query all kinds of information online to obtain corresponding answers. Task-oriented dialogue has therefore become increasingly popular: it can be applied to telephone customer service, mobile customer service, and mobile assistants, and can complete basic task-oriented operations such as booking flights and booking hotels, greatly reducing the consumption of human resources.
In task-oriented dialogue, answers are mainly predicted by a reinforcement learning model, which must be trained with a reward function. At present, the reinforcement learning model is usually trained according to the dialogue result alone, so the resulting model cannot accurately tell which steps in each round of the process are correct and which actions are wrong, and the prediction accuracy of the trained reinforcement learning model is therefore low.
Disclosure of Invention
The purpose of the application is at least to obtain the target answer sentence corresponding to a target question sentence more accurately. The following technical solutions are provided:
in a first aspect, a dialog method is provided, including:
acquiring a target question sentence input by a user;
determining a feedback action corresponding to the target question sentence based on the trained reinforcement learning model; the reinforcement learning model is obtained by training based on feature vectors respectively corresponding to at least two rounds of sample dialogue and a reward function determined from the feature vectors; each round of sample dialogue comprises a sample question sentence and a corresponding sample answer sentence; the feature vector is used for expressing the probability of selecting the sample answer sentence for the sample question sentence in each round of dialogue;
and determining a target answer sentence corresponding to the feedback action, and outputting the target answer sentence.
In an alternative embodiment of the first aspect, the trained reinforcement learning model is trained by:
acquiring feature vectors corresponding to at least two sample conversations respectively;
determining a reward function corresponding to at least two conversations based on the obtained at least two feature vectors;
and training the initial reinforcement learning model based on the sample question sentences in each sample conversation, the sample feedback actions corresponding to the sample answer sentences and the reward function to obtain the trained reinforcement learning model.
In an optional embodiment of the first aspect, obtaining feature vectors corresponding to at least two sample dialogues respectively comprises:
extracting a first feature vector of the last sample dialogue of at least two sample dialogues and extracting second feature vectors corresponding to other sample dialogues based on a plurality of first feature extraction networks and second feature extraction networks; wherein the other sample conversations are sample conversations of the at least two sample conversations except the last sample conversation.
In an optional embodiment of the first aspect, extracting the first feature vector of the last sample session of the at least two sample sessions comprises:
inputting the last sample dialogue into a corresponding first feature extraction network;
and inputting the output characteristics of the first characteristic extraction network into a corresponding second characteristic extraction network to obtain a first characteristic vector.
In an optional embodiment of the first aspect, extracting second feature vectors corresponding to other sample dialogues includes:
and respectively extracting the features of other sample conversations through a plurality of first feature extraction networks, and acquiring second feature vectors according to the extracted features through a plurality of cascaded second feature extraction networks.
In an optional embodiment of the first aspect, the feature vector is the confidence of the corresponding round of sample dialogue; the confidence is used for expressing the probability that the second feature extraction network outputs the corresponding sample answer sentence for the sample question sentence of that round.
In an optional embodiment of the first aspect, determining a reward function corresponding to at least two rounds of dialog based on the obtained at least two feature vectors comprises:
determining a similarity between the first feature vector and the second feature vector;
a corresponding reward function is obtained based on the determined similarity.
In an optional embodiment of the first aspect, obtaining a corresponding reward function based on the determined similarity comprises:
and if the similarity is greater than a preset threshold value, acquiring a corresponding reward function based on the determined similarity.
In an optional embodiment of the first aspect, training the initial reinforcement learning model based on a sample question, a sample feedback action corresponding to the sample answer, and a reward function in each sample conversation to obtain a trained reinforcement learning model includes:
determining an environmental parameter corresponding to the sample dialogue; the environment parameters comprise at least one of conversation results and conversation fields where sample conversations are located;
and training the initial reinforcement learning model based on the environmental parameters, the sample question, the sample feedback action and the reward function to obtain the reinforcement learning model.
In a second aspect, a dialog device is provided, comprising:
the acquisition module is used for acquiring a target question sentence input by a user;
the first determination module is used for determining a feedback action corresponding to the target question sentence based on the trained reinforcement learning model; the reinforcement learning model is obtained by training based on feature vectors respectively corresponding to at least two rounds of sample dialogue and a reward function determined from the feature vectors; each round of sample dialogue comprises a sample question sentence and a corresponding sample answer sentence; the feature vector is used for expressing the probability of selecting the sample answer sentence for the sample question sentence in each round of dialogue;
and the second determining module is used for determining the target answer sentence corresponding to the feedback action and outputting the target answer sentence.
In an optional embodiment of the second aspect, further comprising a training module for:
acquiring feature vectors corresponding to at least two sample conversations respectively;
determining a reward function corresponding to at least two conversations based on the obtained at least two feature vectors;
and training the initial reinforcement learning model based on the sample question sentences in each sample conversation, the sample feedback actions corresponding to the sample answer sentences and the reward function to obtain the trained reinforcement learning model.
In an optional embodiment of the second aspect, when the training module obtains feature vectors corresponding to at least two sample dialogues, the training module is specifically configured to:
extracting a first feature vector of the last sample conversation of at least two sample conversations and extracting second feature vectors corresponding to other sample conversations based on the plurality of first feature extraction networks and the plurality of second feature extraction networks; wherein the other sample conversations are sample conversations of the at least two sample conversations except the last sample conversation.
In an optional embodiment of the second aspect, the training module, when extracting the first feature vector of the last sample session of the at least two sample sessions, is specifically configured to:
inputting the last sample dialogue into a corresponding first feature extraction network;
and inputting the output characteristics of the first characteristic extraction network into a corresponding second characteristic extraction network to obtain a first characteristic vector.
In an optional embodiment of the second aspect, when extracting the second feature vectors corresponding to the other sample dialogues, the training module is specifically configured to:
and respectively extracting the features of other sample conversations through a plurality of first feature extraction networks, and acquiring second feature vectors according to the extracted features through a plurality of cascaded second feature extraction networks.
In an optional embodiment of the second aspect, the feature vector is the confidence of the corresponding round of sample dialogue; the confidence is used for expressing the probability that the second feature extraction network outputs the corresponding sample answer sentence for the sample question sentence of that round.
In an optional embodiment of the second aspect, the training module, when determining the reward function corresponding to at least two sessions based on the obtained at least two feature vectors, is specifically configured to:
determining a similarity between the first feature vector and the second feature vector;
a corresponding reward function is obtained based on the determined similarity.
In an optional embodiment of the second aspect, the training module, when obtaining the corresponding reward function based on the determined similarity, is specifically configured to:
and if the similarity is greater than a preset threshold value, acquiring a corresponding reward function based on the determined similarity.
In an optional embodiment of the second aspect, the training module is specifically configured to, when training the initial reinforcement learning model based on the sample question sentences in each sample conversation, the sample feedback actions corresponding to the sample answer sentences, and the reward function to obtain a trained reinforcement learning model:
determining an environmental parameter corresponding to the sample dialogue; the environment parameters comprise at least one of conversation results and conversation fields where sample conversations are located;
and training the initial reinforcement learning model based on the environmental parameters, the sample question, the sample feedback action and the reward function to obtain the reinforcement learning model.
In a third aspect, an electronic device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the electronic device implements the dialogue method shown in the first aspect of the present application.
In a fourth aspect, a computer-readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the dialog method shown in the first aspect of the application.
The technical scheme provided by the application brings the beneficial effects that:
the method comprises the steps of determining a second feature vector through sample dialogues of at least two rounds except the last round of sample dialogues, determining a first feature vector based on the last round of sample dialogues, determining a reward function according to the first feature vector and the second feature vector, determining the reward function according to the number of sample dialogues or sample dialogue results, enabling the obtained reward function to be capable of fusing the feature vector of each round of sample dialogues, namely fusing the probability of selecting a sample answer for a sample question in each round of dialogues, enabling the obtained reward function to have a reference value, and enabling the prediction accuracy of a reinforcement learning model obtained through final training to be higher.
Further, when the similarity is greater than a preset threshold, a corresponding reward function is obtained based on the determined similarity, the higher the similarity between the first feature vector and the second feature vector is, the higher the value of the corresponding reward function is, the stronger the function of the reward function for predicting the next state is, and the higher the accuracy of the corresponding trained reinforcement learning model is.
Furthermore, the initial reinforcement learning model is trained based on the environment parameters, the sample question, the sample feedback action and the reward function, and corresponding rewards can be output in consideration of the environment of the field, so that the problem of field dependency is solved, and the trained reinforcement learning model can still keep high prediction accuracy rate for different fields of conversation.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Drawings
The foregoing and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is an application environment diagram of a dialog method according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a dialog method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a training dialogue model according to an embodiment of the present application;
fig. 4 is a schematic diagram of a scheme for obtaining a first feature vector according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a scheme for obtaining a first feature vector according to an example of the present application;
FIG. 6 is a schematic diagram of a scheme for obtaining a reward function according to an embodiment of the present disclosure;
FIG. 7 is a diagram illustrating an exemplary scenario for obtaining a reward function according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a scheme for training a reinforcement learning model according to an embodiment of the present disclosure;
fig. 9 is a schematic flowchart illustrating a process of determining cosine similarity in a dialog method according to an example provided by the present application;
fig. 10 is a schematic structural diagram of a dialogue device according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of an electronic device for conversation according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the drawings are exemplary only for the purpose of explaining the present application and are not to be construed as limiting the present application.
As used herein, the singular forms "a", "an", "the" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the implementation method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specifically studies how a computer can simulate or realize human learning behavior so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
At present, task-oriented dialogue has become increasingly popular. It can be applied to telephone customer service, mobile customer service, and mobile assistants, and can complete basic task-oriented operations such as booking flights and booking hotels, greatly reducing the consumption of human resources. Task-oriented dialogue mainly adopts a module-based system, which can be divided into five common parts: input sentence preprocessing, natural language understanding, dialogue state tracking, dialogue management, and dialogue response.
The application focuses on dialogue management, where mainstream models are learned by means of reinforcement learning; how to learn from data efficiently, however, is a problem. Reinforcement learning mainly consists of three parts: the first is the environment variables, the second is the reward function, and the third is the actions taken by the system. With the conventional, manually controlled reward function, learning from a large amount of sample data becomes very inefficient. The application learns the reward function of reinforcement learning by means of inverse reinforcement learning, so that the knowledge of dialogue management can be learned better and faster.
The technical solution currently in use is simple and convenient, but its effect is poor. Its primary mechanism is to design the reward function based on the number of dialogue turns and whether the dialogue state succeeds. As shown in formula (1) below, the reward takes the first value when the dialogue fails and the second value when the task succeeds, where Score is a manually set score awarded on success.
reward = failure value, if the dialogue fails; reward = Score, if the task succeeds   (1)
Such a function can adequately summarize whether the current dialogue is successful, but it cannot accurately describe which steps in each round of the process are correct and which actions are wrong, so it cannot provide correct feedback for reinforcement learning, and the function is therefore deficient. The traditional reward function is sparse: a proper reward is given to the system only at the very end, so the system learns inefficiently.
Put simply, a dialogue system is similar to a Go system: in a game of Go, the winner is known only at the very end, and likewise in a dialogue system success is known only at the end. This is essentially a sparse-feedback problem. The existing technology has no way to give feedback at every stage and only provides feedback at the very end, so the machine does not know how to learn or which actions are optimal, which causes a series of problems. Dialogue management is the core decision-making part of the whole system; if the dialogue management part cannot learn the reward correctly, the whole dialogue system becomes inefficient.
On the other hand, judging success simply by the number of turns is not enough, because in the normal course of a conversation the user may have already finished booking the hotel but still need to query hotel information, such as the detailed address or telephone number, and the machine needs to respond to that query. Such a query is part of a normal dialogue, but a reward function in the conventional sense counts it against the number of turns, which adversely affects the system.
Finally, when multi-domain questions are involved, the traditional reward function performs poorly, because with multi-domain questions the user is likely to ask more questions and request more tasks; if success is judged by the number of turns, the learning effect of the model is affected.
The present application provides a dialog method, apparatus, electronic device, and computer-readable storage medium, which are intended to solve the above technical problems in the prior art.
The following describes the technical solution of the present application and how to solve the above technical problems in detail by specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The dialog method provided by the application can be applied to the application environment shown in fig. 1. Specifically, the server 101 receives a target question which is sent by the terminal 102 and input by a user to the terminal, and the server 101 determines a feedback action corresponding to the target question and determines a target answer corresponding to the feedback action based on the trained reinforcement learning model; the server 101 transmits the determined target answer to the terminal 102.
Those skilled in the art will understand that the "terminal" used herein may be a Mobile phone, a tablet computer, a PDA (Personal Digital Assistant), an MID (Mobile Internet Device), etc.; a "server" may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
It should be understood that fig. 1 shows an application scenario only as an example and does not limit the application scenario of the dialogue method of the present application. In the above scenario, the server determines the target answer sentence based on the target question sentence; in other application scenarios, the terminal may itself receive the target question sentence input by the user and determine the corresponding target answer sentence based on it.
A possible implementation manner is provided in the embodiment of the present application, and as shown in fig. 2, a dialog method is provided, which is described by taking an example that the method is applied to the server in fig. 1, and may include the following steps:
step S201, acquiring a target question input by a user;
step S202, determining feedback actions corresponding to the target question sentences based on the trained reinforcement learning model;
step S203, determining a target answer sentence corresponding to the feedback action, and outputting the target answer sentence.
The question (post) and the answer (response) form a dialogue pair in a dialogue; the question is not necessarily phrased as a question, nor is the answer necessarily phrased as a direct reply.
Specifically, the user can input a target question to the terminal, and the terminal sends the target question to the server.
Among them, Reinforcement Learning (RL), also known as evaluative learning or augmented learning, is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem of an agent learning a strategy, in the course of interacting with the environment, so as to maximize its return or achieve a specific goal. The reinforcement learning model in the application is used to output the feedback action corresponding to the target question sentence based on the received question sentence, so as to determine the corresponding target answer sentence.
Specifically, the reinforcement learning model can be trained based on at least two rounds of sample dialogue and a reward function determined from the at least two rounds of sample dialogue; each round of sample dialogue includes a sample question sentence and a corresponding sample answer sentence. The training process of the reinforcement learning model is described in further detail below.
Specifically, the trained reinforcement learning model determines corresponding feedback actions based on the target question, and each feedback action is provided with a target answer, so that the target answer corresponding to the target question is determined.
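As an illustration of how steps S201-S203 fit together, the following Python sketch wires a hypothetical state encoder, the trained policy, and an action-to-answer table. The names encode_state, policy, and ACTION_TO_ANSWER are assumptions for illustration and do not appear in the application.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical mapping from feedback actions to answer templates (illustrative only).
ACTION_TO_ANSWER: Dict[str, str] = {
    "inform_hotel_address": "The hotel is located at the address on file.",
    "request_booking_date": "Which date would you like to book?",
}

def respond(target_question: str,
            history: List[str],
            encode_state: Callable[[str, List[str]], Sequence[float]],
            policy: Callable[[Sequence[float]], str]) -> str:
    """Steps S201-S203: question -> feedback action -> target answer sentence."""
    state = encode_state(target_question, history)   # dialogue state / feature vector
    action = policy(state)                           # trained reinforcement learning model
    # Each feedback action is provided with a target answer sentence.
    return ACTION_TO_ANSWER.get(action, "Sorry, could you rephrase that?")
```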
The training process of the dialogue model will be further explained in conjunction with the drawings and the specific embodiments.
A possible implementation manner is provided in the embodiment of the present application, as shown in fig. 3, the trained dialog model may be obtained by training in the following manner:
step S301, feature vectors corresponding to at least two sample dialogues are obtained.
Specifically, a feature extraction network may be employed to extract feature vectors corresponding to the sample session. For example, CNN (Convolutional Neural Networks) + LSTM (Long Short-Term Memory network) is used to extract feature vectors corresponding to the sample dialog.
The feature vector may be the confidence (belief state) of the corresponding round of sample dialogue, and the feature vector may be represented in the form of a one-hot encoding.
Specifically, the confidence may be used to represent the probability that the corresponding sample answer sentence is output for the sample question sentence of that round. That is, in each dialogue pair, each portion can actually be represented by a feature vector, and the feature vector can summarize the current dialogue information and give the current dialogue a reasonable embedded expression. It mainly consists of two parts: the first part is the confidence, namely the current system's guess for each module, which contains the probability value corresponding to each slot value; the second part is a one-hot code composed of the system and user behaviors — since the number of user and system behaviors is fixed, this code form is easy to construct.
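A minimal sketch of the per-round feature vector just described — slot-value confidences concatenated with one-hot codes of the system and user behaviors. The slot names and action inventories are invented for illustration; the application does not enumerate them.

```python
import numpy as np

# Hypothetical slot and action inventories (the application does not name concrete ones).
SLOTS = ["hotel_name", "price_range", "date"]
SYSTEM_ACTIONS = ["inform", "request", "confirm", "bye"]
USER_ACTIONS = ["inform", "request", "thank", "bye"]

def turn_feature_vector(slot_confidence: dict, system_action: str, user_action: str) -> np.ndarray:
    """Concatenate slot-value confidences with one-hot system/user action codes."""
    belief = np.array([slot_confidence.get(s, 0.0) for s in SLOTS], dtype=np.float32)
    sys_onehot = np.eye(len(SYSTEM_ACTIONS), dtype=np.float32)[SYSTEM_ACTIONS.index(system_action)]
    usr_onehot = np.eye(len(USER_ACTIONS), dtype=np.float32)[USER_ACTIONS.index(user_action)]
    return np.concatenate([belief, sys_onehot, usr_onehot])

# Example: a belief state with two guessed slot values, one system and one user action.
vec = turn_feature_vector({"hotel_name": 0.9, "date": 0.4}, "request", "inform")
```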
Specifically, the obtaining of the feature vectors corresponding to at least two sample dialogues in step S301 may include:
(1) and extracting a first feature vector of the last sample dialogue of at least two sample dialogues based on the plurality of first feature extraction networks and the second feature extraction network.
Wherein the first feature extraction network may be CNN and the second feature extraction network may be LSTM.
Specifically, as shown in fig. 4, extracting a first feature vector of a last sample session of at least two sample sessions may include:
a. inputting the last round of sample dialogue into the corresponding first feature extraction network;
b. inputting the output features of the first feature extraction network into the corresponding second feature extraction network to obtain the first feature vector.
As shown in fig. 5, for the last sample dialog, the sample dialog may be input to CNN first, and then the output features of CNN are input to LSTM, so as to obtain the feature vector corresponding to the one dialog output by LSTM.
(2) Extracting second feature vectors corresponding to other sample conversations; wherein the other sample conversations are sample conversations of the at least two sample conversations except the last sample conversation.
Specifically, extracting the second feature vectors corresponding to the other sample dialogues may include:
and respectively extracting the features of other sample conversations through a plurality of first feature extraction networks, and acquiring second feature vectors according to the extracted features through a plurality of cascaded second feature extraction networks.
That is, as shown in fig. 6, the first round among the other rounds of sample dialogue is also input directly into its corresponding first feature extraction network, and the output features of that first feature extraction network are input into the corresponding second feature extraction network to obtain the feature vector of the first round of sample dialogue. For every round among the other rounds except the first, the round of sample dialogue is input into its corresponding first feature extraction network, and the output features of that first feature extraction network together with the output features of the previous round's second feature extraction network are input into the current round's second feature extraction network; the output of the second feature extraction network of the (N-1)-th round of sample dialogue is taken as the second feature vector. For the N-th round of sample dialogue, the output features of its first feature extraction network are input into the corresponding second feature extraction network to obtain the first feature vector, and the reward function is determined based on the first feature vector and the second feature vector, where N is a natural number greater than 1.
As shown in fig. 7, taking CNN and LSTM as examples, each round of sample dialogue has its features extracted by a CNN, a plurality of cascaded LSTMs obtain the second feature vector from the features extracted by the CNNs, and the output of the LSTM corresponding to the last of the other rounds of sample dialogue is taken as the second feature vector.
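A compact PyTorch sketch of the extraction path described above and in figs. 5-7: each round is encoded by a CNN, rounds 1..N-1 feed a cascaded LSTM whose final output is the second feature vector, and the last round is encoded separately to give the first feature vector. For brevity the sketch shares one CNN and one LSTM cell across rounds, whereas the application describes a plurality of first and second feature extraction networks; the dimensions are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class RoundCNN(nn.Module):
    """First feature extraction network: a 1-D CNN over one round's token embeddings."""
    def __init__(self, emb_dim: int = 64, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, feat_dim, kernel_size=3, padding=1)

    def forward(self, round_emb: torch.Tensor) -> torch.Tensor:
        # round_emb: (batch, seq_len, emb_dim) -> (batch, feat_dim), max-pooled over tokens
        return torch.relu(self.conv(round_emb.transpose(1, 2))).max(dim=2).values

class RewardFeatureExtractor(nn.Module):
    def __init__(self, feat_dim: int = 128, hid_dim: int = 128):
        super().__init__()
        self.hid_dim = hid_dim
        self.cnn = RoundCNN(feat_dim=feat_dim)
        self.cascade_lstm = nn.LSTMCell(feat_dim, hid_dim)  # cascaded second feature extraction networks
        self.last_lstm = nn.LSTMCell(feat_dim, hid_dim)     # second feature extraction network for the last round

    def forward(self, rounds):
        # rounds: list of N tensors, each (batch, seq_len, emb_dim)
        batch = rounds[0].size(0)
        h = torch.zeros(batch, self.hid_dim)
        c = torch.zeros(batch, self.hid_dim)
        for r in rounds[:-1]:                       # rounds 1 .. N-1
            h, c = self.cascade_lstm(self.cnn(r), (h, c))
        second_vec = h                              # output of the (N-1)-th round's LSTM
        h0 = torch.zeros(batch, self.hid_dim)
        c0 = torch.zeros(batch, self.hid_dim)
        first_vec, _ = self.last_lstm(self.cnn(rounds[-1]), (h0, c0))
        return first_vec, second_vec
```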
In the above embodiment, the second feature vector is determined from the rounds of sample dialogue other than the last round among the at least two rounds, the first feature vector is determined from the last round of sample dialogue, and the reward function is determined from the first and second feature vectors. Compared with determining the reward function according to the number of sample dialogue turns or the sample dialogue result, the reward function obtained in this way fuses the feature vector of every round of sample dialogue, that is, the probability of selecting the sample answer sentence for the sample question sentence in each round, so it has greater reference value and the finally trained reinforcement learning model achieves higher prediction accuracy.
Step S302, based on the obtained at least two feature vectors, determining a reward function corresponding to at least two dialogues.
In particular, a corresponding reward function is determined based on the first feature vector and the second feature vector.
Specifically, the step S302 of determining a reward function corresponding to at least two dialogues based on the obtained at least two feature vectors may include:
(1) determining a similarity between the first feature vector and the second feature vector;
(2) a corresponding reward function is obtained based on the determined similarity.
Specifically, the cosine similarity may be used to determine the similarity between the first feature vector and the second feature vector, and the specific calculation formula may be as follows:
cos(θ) = (A · B) / (‖A‖ · ‖B‖)
wherein cos(θ) represents the cosine similarity; A represents the first feature vector; B represents the second feature vector.
In other embodiments, the similarity may also be calculated in other manners, for example, using euclidean distance, and the specific manner of calculating the similarity is not limited herein.
In some embodiments, obtaining a corresponding reward function based on the determined similarity may include:
and if the similarity is greater than a preset threshold value, acquiring a corresponding reward function based on the determined similarity.
Specifically, the second feature vector is determined based on the rounds of sample dialogue other than the last round, the first feature vector is determined based on the last round of sample dialogue, and the higher the similarity between the first feature vector and the second feature vector, the higher the value of the corresponding reward function. Therefore, a preset threshold may be set to decide whether to use the similarity to obtain the reward function; if the similarity is small, the reference value of the at least two rounds of sample dialogue may not be high.
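A small sketch of the similarity-based reward described above, assuming cosine similarity and a reward of zero when the similarity does not exceed the threshold; the below-threshold behaviour and the default threshold are assumptions, since the application only states that the reward is obtained from the similarity when it is greater than the preset threshold.

```python
import numpy as np

def reward_from_similarity(first_vec: np.ndarray, second_vec: np.ndarray,
                           threshold: float = 0.5) -> float:
    """Cosine similarity between the first and second feature vectors, used as the
    reward only when it exceeds the preset threshold (otherwise 0.0 is assumed)."""
    cos = float(np.dot(first_vec, second_vec) /
                (np.linalg.norm(first_vec) * np.linalg.norm(second_vec) + 1e-8))
    return cos if cos > threshold else 0.0
```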
In the above embodiment, when the similarity is greater than the preset threshold, the corresponding reward function is obtained based on the determined similarity, and the higher the similarity between the first feature vector and the second feature vector is, the higher the value of the corresponding reward function is, the stronger the function of the reward function for predicting the next state is, and the higher the accuracy of the corresponding trained reinforcement learning model is.
Step S303, training the initial reinforcement learning model based on the sample question sentence in each sample conversation, the sample feedback action corresponding to the sample answer sentence and the reward function to obtain the trained reinforcement learning model.
During training, the data set contains only dialogues that ended successfully, and these come from conversations between humans and machines, i.e., the sample dialogues. The application performs imitation through imitation learning, thereby improving the memory effect of the reward function. Inverse reinforcement learning is mainly divided into two approaches: the first achieves the purpose through imitation learning, and the second through Generative Adversarial Networks (GAN). In adversarial learning, the generator can serve as the reinforcement learning model of the application and the classifier can serve as the reward function of the application, so a trained reinforcement learning model can be obtained through iterative training. Since the data set comes entirely from dialogue samples and all of them are positive samples, the strategy of the application adopts imitation learning. A more accurate expression of the reward function is learned from the plurality of first feature extraction networks and the plurality of second feature extraction networks.
For example, D = {T_1, T_2, ..., T_m} may be used to denote a group of sample dialogues, where a group of sample dialogues may include m rounds and T_i denotes the i-th round of sample dialogue. Model learning is performed based on at least one group of sample dialogues to obtain the trained reinforcement learning model.
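To show how the pieces above might be combined during imitation-learning-style training over the groups D = {T_1, ..., T_m}, a loose sketch follows. feature_extractor, reward_from_similarity, policy, and policy_update stand for the components described in this section; they are placeholders, not names used in the application.

```python
def train_reward_and_policy(dialogue_groups, feature_extractor, reward_from_similarity,
                            policy, policy_update, epochs: int = 10):
    """Each element of dialogue_groups is one group of sample dialogue rounds T_1 .. T_m."""
    for _ in range(epochs):
        for rounds in dialogue_groups:
            # Feature vectors of the last round and of the other rounds.
            first_vec, second_vec = feature_extractor(rounds)
            reward = reward_from_similarity(first_vec, second_vec)
            # Update the policy (initial reinforcement learning model) with the learned
            # reward, e.g. by the PPO update described below.
            policy_update(policy, rounds, reward)
    return policy
```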
In the present application, the PPO (Proximal Policy Optimization) reinforcement learning method may be adopted; such an algorithm is generally implemented in the form of a CNN. Importance sampling introduces another function with a known probability distribution, as shown below: we can sample entirely from q(x) and then improve the new policy p(x), and this process can be repeated N times in a round instead of only once. The gradient of the average reward value over the N rounds is then rewritten as follows:
∇R̄_θ = E_{(s_t, a_t) ~ π_{θ'}} [ (p_θ(a_t | s_t) / p_{θ'}(a_t | s_t)) · A^{θ'}(s_t, a_t) · ∇ log p_θ(a_t | s_t) ]
in the formula, ∇R̄_θ represents the gradient of the average reward value over the N rounds; s_t represents the state, a_t represents the action, and t represents the time.
In practice there is also a clip operation. The clip operation adds a limit: the value of the first term is restricted to lie between the second and the third, and any value beyond them is replaced directly by the second or third. In essence this is a normalization operation that prevents the two probabilities from deviating too far from each other, which would make the reinforcement learning effect worse.
J^{CLIP}_{θ'}(θ) ≈ Σ_{(s_t, a_t)} min( (p_θ(a_t | s_t) / p_{θ'}(a_t | s_t)) · A^{θ'}(s_t, a_t), clip(p_θ(a_t | s_t) / p_{θ'}(a_t | s_t), 1 − ε, 1 + ε) · A^{θ'}(s_t, a_t) )
In the formula, ε represents the preset threshold: within the interval (1 − ε, 1 + ε) the probability ratio keeps its own value, and outside the interval it is replaced by the fixed boundary value.
Specifically, in the training process, in addition to adjusting the parameters of the initial reinforcement learning model based on the reward function, the parameters of the initial reinforcement learning model may also be adjusted based on parameters such as the dialogue field.
Specifically, as shown in fig. 8, training the initial reinforcement learning model to obtain a trained reinforcement learning model by using the sample question, the sample feedback action corresponding to the sample answer, and the reward function in each sample conversation may include:
(1) determining an environmental parameter corresponding to the sample dialogue; the environment parameter comprises at least one of a conversation result and a conversation field in which the sample conversation is positioned.
Wherein the conversation result comprises conversation success or conversation failure; the conversation domains are determined based on the data sets of the sample conversations, and when a sample conversation is obtained, a corresponding conversation domain may be determined based on the sample conversation.
(2) And training the initial reinforcement learning model based on the environmental parameters, the sample question, the sample feedback action and the reward function to obtain the reinforcement learning model.
Specifically, when a multi-domain question is involved, a user is likely to ask more questions and do more things, and the initial reinforcement learning model can be trained by combining with the dialogue domain.
In the above embodiment, the initial reinforcement learning model is trained based on the environment parameters, the sample question, the sample feedback action and the reward function, and the corresponding reward can be output in consideration of the environment of the field, so that the problem of field dependency is solved, and the trained reinforcement learning model can still maintain a high prediction accuracy rate for different field conversations.
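One plausible way to feed the environment parameters (dialogue domain, dialogue result) into training is to append them to the per-turn state before it reaches the model; the domain inventory below is invented for illustration, and the application does not prescribe this particular encoding.

```python
import numpy as np

DOMAINS = ["hotel", "flight", "restaurant"]   # hypothetical dialogue domains

def augment_state(turn_vector: np.ndarray, domain: str, dialogue_success: bool) -> np.ndarray:
    """Append environment parameters (dialogue domain, dialogue result) to the per-turn state."""
    domain_onehot = np.eye(len(DOMAINS), dtype=np.float32)[DOMAINS.index(domain)]
    result_flag = np.array([1.0 if dialogue_success else 0.0], dtype=np.float32)
    return np.concatenate([turn_vector.astype(np.float32), domain_onehot, result_flag])
```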
For better understanding of the above-described dialog method, an example of determining the reward function in the dialog method of the present invention is set forth in detail below, as shown in fig. 9:
in an example, taking five sample dialogs, and combining CNN and LSTM as an example, the process of determining the reward function in the dialog method provided by the present application may include the following steps:
1) Cascade four LSTM models. For the first round of sample dialogue, input the first round of sample dialogue f_1 into its CNN and input the output features of the CNN into the first LSTM;
2) For any one of the second to fourth rounds of sample dialogue, first perform feature extraction through the corresponding CNN, then take the output of the CNN and the output of the LSTM corresponding to the previous round of sample dialogue as the input of the current LSTM, and finally take the output of the LSTM corresponding to the fourth round of sample dialogue as the second feature vector;
3) For the fifth round of sample dialogue, input the fifth round of sample dialogue f_5 into its CNN, input the output features of the CNN into a separate LSTM, and take the output of that LSTM as the first feature vector.
The effect of the dialogue method of the present application will be described below with reference to test data.
Comparing the traditional scheme with the dialogue method proposed in this application, two indexes are used to measure the maturity of the model: the first index is the final success rate; the second indicator is the number of data required for model convergence.
Table 1: comparison of test results of the present application and conventional protocols
Algorithm           | Success rate | Number of conversations
Traditional rewards | 0.63         | 1200
Reward function     | 0.74         | 578
The experimental results in Table 1 show that the proposed method needs fewer sample dialogues to train the reinforcement learning model and achieves a higher success rate after training; that is, the method provided by the application makes the model converge faster and perform better, which matters significantly in practical application.
The success rate test is based on a simulated user, which saves both time and resources, and this form of training is relatively efficient.
According to the dialogue method, the second feature vector is determined from the rounds of sample dialogue other than the last round among the at least two rounds, the first feature vector is determined from the last round of sample dialogue, and the reward function is determined from the first and second feature vectors. Compared with determining the reward function according to the number of sample dialogue turns or the sample dialogue result, the reward function obtained in this way fuses the feature vector of every round of sample dialogue, that is, the probability of selecting the sample answer sentence for the sample question sentence in each round, so it has greater reference value and the finally trained reinforcement learning model achieves higher prediction accuracy.
Further, when the similarity is greater than a preset threshold, a corresponding reward function is obtained based on the determined similarity, the higher the similarity between the first feature vector and the second feature vector is, the higher the value of the corresponding reward function is, the stronger the function of the reward function for predicting the next state is, and the higher the accuracy of the corresponding trained reinforcement learning model is.
Furthermore, the initial reinforcement learning model is trained based on the environment parameters, the sample question, the sample feedback action and the reward function, and corresponding rewards can be output in consideration of the environment of the field, so that the problem of field dependency is solved, and the trained reinforcement learning model can still keep high prediction accuracy rate for different field conversations.
A possible implementation manner is provided in the embodiment of the present application, and as shown in fig. 10, a dialog apparatus 100 is provided, where the dialog apparatus 100 may include: an acquisition module 101, a first determination module 102 and a second determination module 103, wherein,
an obtaining module 101, configured to obtain a target question sentence input by a user;
a first determining module 102, configured to determine, based on the trained reinforcement learning model, a feedback action corresponding to the target question;
the reinforcement learning model is obtained by training based on the feature vectors respectively corresponding to at least two sample conversations and the reward functions determined by the feature vectors; each round of sample dialogue comprises a sample question sentence and a corresponding sample answer sentence; the feature vector is used for expressing the probability of selecting a sample answer sentence aiming at the sample question sentence in each pair of dialogs;
and the second determining module 103 is configured to determine a target answer sentence corresponding to the feedback action and output the target answer sentence.
The embodiment of the present application provides a possible implementation manner, further including a training module, configured to:
acquiring feature vectors corresponding to at least two sample conversations respectively;
determining a reward function corresponding to at least two conversations based on the obtained at least two feature vectors;
and training the initial reinforcement learning model based on the sample question sentences in each sample conversation, the sample feedback actions corresponding to the sample answer sentences and the reward function to obtain the trained reinforcement learning model.
The embodiment of the present application provides a possible implementation manner, and when obtaining feature vectors corresponding to at least two sample dialogues, the training module is specifically configured to:
extracting a first feature vector of the last sample conversation of at least two sample conversations and extracting second feature vectors corresponding to other sample conversations based on the plurality of first feature extraction networks and the plurality of second feature extraction networks; wherein the other sample conversations are sample conversations of at least two of the sample conversations except a last sample conversation.
In the embodiment of the present application, a possible implementation manner is provided, and when the training module extracts a first feature vector of a last sample dialog of at least two sample dialogs, the training module is specifically configured to:
inputting the last sample dialogue into a corresponding first feature extraction network;
and inputting the output characteristics of the first characteristic extraction network into a corresponding second characteristic extraction network to obtain a first characteristic vector.
The embodiment of the present application provides a possible implementation manner, and when the training module extracts the second feature vectors corresponding to other sample dialogues, the training module is specifically configured to:
and respectively extracting the features of other sample conversations through a plurality of first feature extraction networks, and acquiring second feature vectors according to the extracted features through a plurality of cascaded second feature extraction networks.
The embodiment of the application provides a possible implementation mode, and the feature vector is the confidence of a corresponding round of sample conversation; and the confidence coefficient is used for expressing the probability that the second feature extraction network outputs the corresponding sample question-answer for the round of sample question-sentences.
In an embodiment of the present application, a possible implementation manner is provided, and when determining, by a training module, a reward function corresponding to at least two sessions based on at least two obtained feature vectors, the training module is specifically configured to:
determining a similarity between the first feature vector and the second feature vector;
a corresponding reward function is obtained based on the determined similarity.
In an embodiment of the present application, a possible implementation manner is provided, and when the training module obtains a corresponding reward function based on the determined similarity, the training module is specifically configured to:
and if the similarity is greater than a preset threshold value, acquiring a corresponding reward function based on the determined similarity.
The embodiment of the application provides a possible implementation manner, and the training module is used for training the initial reinforcement learning model based on the sample question sentences in each sample conversation, the sample feedback actions corresponding to the sample answer sentences and the reward function, and specifically used for:
determining an environmental parameter corresponding to the sample dialogue; the environment parameters comprise at least one of conversation results and conversation fields where sample conversations are located;
and training the initial reinforcement learning model based on the environmental parameters, the sample question, the sample feedback action and the reward function to obtain the reinforcement learning model.
According to the dialogue device, the second feature vector is determined from the rounds of sample dialogue other than the last round among the at least two rounds, the first feature vector is determined from the last round of sample dialogue, and the reward function is determined from the first and second feature vectors. Compared with determining the reward function according to the number of sample dialogue turns or the sample dialogue result, the reward function obtained in this way fuses the feature vector of every round of sample dialogue, that is, the probability of selecting the sample answer sentence for the sample question sentence in each round, so it has greater reference value and the finally trained reinforcement learning model achieves higher prediction accuracy.
Further, the corresponding reward function is obtained based on the determined similarity only when the similarity is greater than a preset threshold. The higher the similarity between the first feature vector and the second feature vector, the larger the value of the corresponding reward function, the stronger its effect in predicting the next state, and the higher the accuracy of the reinforcement learning model trained with it.
Furthermore, the initial reinforcement learning model is trained based on the environment parameters, the sample question sentences, the sample feedback actions and the reward function, so that the output rewards take the domain environment into account. This alleviates the problem of domain dependency, and the trained reinforcement learning model maintains high prediction accuracy for dialogues in different domains.
The dialogue device of the embodiment of the present disclosure may execute the dialogue method provided by the embodiment of the present disclosure, and the implementation principles are similar. The actions executed by each module in the dialogue device in the embodiments of the present disclosure correspond to the steps in the dialogue method in the embodiments of the present disclosure. For a detailed functional description of each module of the dialogue device, reference may be made to the description of the corresponding dialogue method shown above, and details are not repeated here.
Based on the same principle as the method shown in the embodiments of the present disclosure, the embodiments of the present disclosure also provide an electronic device, which may include, but is not limited to, a processor and a memory, where the memory is configured to store computer operation instructions and the processor is configured to execute the dialogue method shown in the foregoing embodiments by invoking the computer operation instructions. Compared with the prior art, the dialogue method enables the finally trained reinforcement learning model to achieve higher prediction accuracy.
In an alternative embodiment, an electronic device is provided. As shown in fig. 11, the electronic device 4000 comprises a processor 4001 and a memory 4003, where the processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further comprise a transceiver 4004. In practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not limit the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof, and may implement or execute the various illustrative logical blocks, modules and circuits described in connection with this disclosure. The processor 4001 may also be a combination implementing a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 11, but that does not indicate only one bus or one type of bus.
The memory 4003 may be a ROM (Read Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read Only Memory), a CD-ROM (Compact Disc Read Only Memory) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
The memory 4003 is used for storing application codes for executing the scheme of the present application, and the execution is controlled by the processor 4001. Processor 4001 is configured to execute application code stored in memory 4003 to implement what is shown in the foregoing method embodiments.
Among them, electronic devices include but are not limited to: mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), in-vehicle terminals (e.g., in-vehicle navigation terminals), and the like, and fixed terminals such as digital TVs, desktop computers, and the like. The electronic device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
The present application provides a computer-readable storage medium, on which a computer program is stored, which, when running on a computer, enables the computer to execute the corresponding content in the foregoing method embodiments. Compared with the prior art, the dialogue method in the application enables the prediction accuracy of the reinforcement learning model obtained through final training to be higher.
It should be understood that, although the steps in the flowcharts of the figures are shown in an order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts of the figures may include multiple sub-steps or multiple stages, which are not necessarily performed at the same moment but may be performed at different moments, and are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the method shown in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation of the module itself, and for example, the acquiring module may also be described as a "module that acquires a target question".
The foregoing description is only an illustration of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of the above features, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by replacing the above features with (but not limited to) features with similar functions disclosed in the present disclosure.

Claims (9)

1. A dialogue method, comprising:
acquiring a target question sentence input by a user;
determining a feedback action corresponding to the target question based on the trained reinforcement learning model;
determining a target answer sentence corresponding to the feedback action, and outputting the target answer sentence;
the trained reinforcement learning model is obtained by training in the following way:
acquiring feature vectors respectively corresponding to at least two sample dialogues; each round of sample dialogue comprises a sample question sentence and a corresponding sample answer sentence; the feature vector is used for representing the probability of selecting the sample answer sentence for the sample question sentence in each round of dialogue;
determining a reward function corresponding to the at least two rounds of dialogue based on the obtained at least two feature vectors;
training an initial reinforcement learning model based on the sample question sentences in each round of sample dialogue, the sample feedback actions corresponding to the sample answer sentences and the reward function to obtain the trained reinforcement learning model;
the obtaining of the feature vectors corresponding to the at least two sample dialogues respectively includes:
extracting a first feature vector of the last round of sample dialogue of the at least two rounds of sample dialogue based on a first feature extraction network and a second feature extraction network;
for the first round of sample dialogue, inputting the sample dialogue into the corresponding first feature extraction network and second feature extraction network to obtain a feature vector corresponding to the first round of sample dialogue;
and for the rounds of sample dialogue other than the first round and the last round, inputting the sample dialogue of each round into the corresponding first feature extraction network, inputting the output features of the first feature extraction network and the output features of the second feature extraction network of the previous round into the second feature extraction network of the current round, and taking the output of the second feature extraction network of the penultimate round of sample dialogue in the at least two rounds of dialogue as a second feature vector.
2. The dialogue method according to claim 1, wherein the extracting a first feature vector of the last round of sample dialogue of the at least two rounds of sample dialogue comprises:
inputting the last round of sample dialogue into the corresponding first feature extraction network;
and inputting the output features of the first feature extraction network into the corresponding second feature extraction network to obtain the first feature vector.
3. The dialogue method according to claim 1, characterized in that the feature vector is a confidence of the corresponding round of sample dialogue; and the confidence is used for expressing the probability that the second feature extraction network outputs the corresponding sample answer sentence for the sample question sentence of that round.
4. The dialogue method according to claim 1, wherein the determining a reward function corresponding to the at least two rounds of dialogue based on the obtained at least two feature vectors comprises:
determining a similarity between the first feature vector and the second feature vector;
a corresponding reward function is obtained based on the determined similarity.
5. The dialogue method according to claim 4, wherein the obtaining a corresponding reward function based on the determined similarity comprises:
and if the similarity is greater than a preset threshold value, acquiring a corresponding reward function based on the determined similarity.
6. The dialogue method according to claim 1, wherein the training an initial reinforcement learning model based on the sample question sentences in each round of sample dialogue, the sample feedback actions corresponding to the sample answer sentences and the reward function to obtain the trained reinforcement learning model comprises:
determining the environment parameters corresponding to the sample dialogue; the environment parameters comprise at least one of a dialogue result and a dialogue domain in which the sample dialogue is located;
and training the initial reinforcement learning model based on the environment parameters, the sample question sentence, the sample feedback action and the reward function to obtain the trained reinforcement learning model.
7. A dialogue apparatus, comprising:
the acquisition module is used for acquiring a target question sentence input by a user;
the training module is used for acquiring feature vectors respectively corresponding to at least two sample dialogues; each round of sample dialogue comprises a sample question sentence and a corresponding sample answer sentence; the feature vector is used for representing the probability of selecting the sample answer sentence for the sample question sentence in each round of dialogue; determining a reward function corresponding to the at least two rounds of dialogue based on the obtained at least two feature vectors; and training an initial reinforcement learning model based on the sample question sentences in each round of sample dialogue, the sample feedback actions corresponding to the sample answer sentences and the reward function to obtain the trained reinforcement learning model;
a first determining module, configured to determine, based on the trained reinforcement learning model, a feedback action corresponding to the target question;
the second determining module is used for determining a target answer sentence corresponding to the feedback action and outputting the target answer sentence;
when the training module obtains the feature vectors corresponding to at least two sample dialogues respectively, the training module is specifically configured to:
extracting a first feature vector of the last round of sample dialogue of the at least two rounds of sample dialogue based on a first feature extraction network and a second feature extraction network;
for the first round of sample dialogue, inputting the sample dialogue into the corresponding first feature extraction network and second feature extraction network to obtain a feature vector corresponding to the first round of sample dialogue;
and for the rounds of sample dialogue other than the first round and the last round, inputting the sample dialogue of each round into the corresponding first feature extraction network, inputting the output features of the first feature extraction network and the output features of the second feature extraction network of the previous round into the second feature extraction network of the current round, and taking the output of the second feature extraction network of the penultimate round of sample dialogue in the at least two rounds of dialogue as a second feature vector.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the dialog method of any of claims 1-6 when executing the program.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when being executed by a processor, carries out the dialog method of any one of claims 1 to 6.
CN202010808251.8A 2020-08-12 2020-08-12 Dialogue method, dialogue device, electronic equipment and readable storage medium Active CN111753076B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010808251.8A CN111753076B (en) 2020-08-12 2020-08-12 Dialogue method, dialogue device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010808251.8A CN111753076B (en) 2020-08-12 2020-08-12 Dialogue method, dialogue device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111753076A CN111753076A (en) 2020-10-09
CN111753076B true CN111753076B (en) 2022-08-26

Family

ID=72713412

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010808251.8A Active CN111753076B (en) 2020-08-12 2020-08-12 Dialogue method, dialogue device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111753076B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535911B (en) * 2020-12-03 2024-04-12 腾讯科技(深圳)有限公司 Reward model processing method, electronic device, medium and computer program product
CN112507094B (en) * 2020-12-11 2021-07-13 润联软件系统(深圳)有限公司 Customer service robot dialogue method based on reinforcement learning and related components thereof
CN112507104B (en) * 2020-12-18 2022-07-22 北京百度网讯科技有限公司 Dialog system acquisition method, apparatus, storage medium and computer program product
CN113688977B (en) * 2021-08-30 2023-12-05 浙江大学 Human-computer symbiotic reinforcement learning method and device oriented to countermeasure task, computing equipment and storage medium
CN114881170B (en) * 2022-05-27 2023-07-14 北京百度网讯科技有限公司 Training method for neural network of dialogue task and dialogue task processing method
CN114969290A (en) * 2022-05-31 2022-08-30 中国电信股份有限公司 Dialogue information processing method, dialogue information processing device, electronic equipment and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388698A (en) * 2018-10-22 2019-02-26 北京工业大学 A guided automatic chatting method based on deep reinforcement learning
CN110334201A (en) * 2019-07-18 2019-10-15 中国工商银行股份有限公司 An intention recognition method, apparatus and system
CN110413754A (en) * 2019-07-22 2019-11-05 清华大学 Dialogue reward evaluation and dialogue method, medium, apparatus and computing device
CN110569344A (en) * 2019-08-22 2019-12-13 阿里巴巴集团控股有限公司 Method and device for determining standard question sentence corresponding to dialog text
CN111061850A (en) * 2019-12-12 2020-04-24 中国科学院自动化研究所 Dialog state tracking method, system and device based on information enhancement
CN111104951A (en) * 2018-10-25 2020-05-05 马上消费金融股份有限公司 Active learning method and device and terminal equipment
CN111400466A (en) * 2020-03-05 2020-07-10 中国工商银行股份有限公司 Intelligent dialogue method and device based on reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10424302B2 (en) * 2017-10-12 2019-09-24 Google Llc Turn-based reinforcement learning for dialog management
US11573991B2 (en) * 2018-11-30 2023-02-07 Samsung Electronics Co., Ltd. Deep reinforcement learning-based multi-step question answering systems

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109388698A (en) * 2018-10-22 2019-02-26 北京工业大学 A guided automatic chatting method based on deep reinforcement learning
CN111104951A (en) * 2018-10-25 2020-05-05 马上消费金融股份有限公司 Active learning method and device and terminal equipment
CN110334201A (en) * 2019-07-18 2019-10-15 中国工商银行股份有限公司 An intention recognition method, apparatus and system
CN110413754A (en) * 2019-07-22 2019-11-05 清华大学 Dialogue reward evaluation and dialogue method, medium, apparatus and computing device
CN110569344A (en) * 2019-08-22 2019-12-13 阿里巴巴集团控股有限公司 Method and device for determining standard question sentence corresponding to dialog text
CN111061850A (en) * 2019-12-12 2020-04-24 中国科学院自动化研究所 Dialog state tracking method, system and device based on information enhancement
CN111400466A (en) * 2020-03-05 2020-07-10 中国工商银行股份有限公司 Intelligent dialogue method and device based on reinforcement learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Learning cooperative visual dialog agents with deep reinforcement learning; Das A; Proceedings of the IEEE International Conference on Computer Vision; 20171031; full text *
Dialogue generation with deep reinforcement learning based on hierarchical encoding; Zhao Yuqing; Journal of Computer Applications; 20171010; full text *

Also Published As

Publication number Publication date
CN111753076A (en) 2020-10-09

Similar Documents

Publication Publication Date Title
CN111753076B (en) Dialogue method, dialogue device, electronic equipment and readable storage medium
US11829874B2 (en) Neural architecture search
CN110366734B (en) Optimizing neural network architecture
CN111897941B (en) Dialogue generation method, network training method, device, storage medium and equipment
US11610064B2 (en) Clarification of natural language requests using neural networks
CN111737476A (en) Text processing method and device, computer readable storage medium and electronic equipment
CN111666416A (en) Method and apparatus for generating semantic matching model
CN111782787B (en) Problem generation model training method and problem generation method
CN110929532B (en) Data processing method, device, equipment and storage medium
CN111737439A (en) Question generation method and device
CN114048301B (en) Satisfaction-based user simulation method and system
CN111461353A (en) Model training method and system
CN115186147A (en) Method and device for generating conversation content, storage medium and terminal
CN117575008A (en) Training sample generation method, model training method, knowledge question-answering method and knowledge question-answering device
CN117218482A (en) Model training method, video processing device and electronic equipment
CN115510194A (en) Question and answer sentence retrieval method and device, electronic equipment and storage medium
CN112149426B (en) Reading task processing method and related equipment
CN113535930B (en) Model training method, device and storage medium
CN117795527A (en) Evaluation of output sequences using autoregressive language model neural networks
CN113535911B (en) Reward model processing method, electronic device, medium and computer program product
CN112818084B (en) Information interaction method, related device, equipment and computer readable medium
CN110147881B (en) Language processing method, device, equipment and storage medium
CN115203420B (en) Entity relationship classification model training method, entity relationship classification method and device
CN114330512B (en) Data processing method, device, electronic equipment and computer readable storage medium
CN117815674B (en) Game information recommendation method and device, computer readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40030759

Country of ref document: HK

GR01 Patent grant
GR01 Patent grant