CN113111241B - Multi-turn conversation method based on conversation history and reinforcement learning in game conversation


Info

Publication number
CN113111241B
CN113111241B (application CN202110378191.5A)
Authority
CN
China
Prior art keywords
conversation
history
turn
opponent
current
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110378191.5A
Other languages
Chinese (zh)
Other versions
CN113111241A (en)
Inventor
庄越挺
汤斯亮
程广钊
谭炽烈
肖俊
李晓林
蒋韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Tongdun Holdings Co Ltd
Original Assignee
Zhejiang University ZJU
Tongdun Holdings Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU, Tongdun Holdings Co Ltd filed Critical Zhejiang University ZJU
Priority to CN202110378191.5A
Publication of CN113111241A
Application granted
Publication of CN113111241B
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/903Querying
    • G06F16/9032Query formulation
    • G06F16/90332Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/907Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/908Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a multi-turn dialogue method based on dialogue history and reinforcement learning in game dialogue, belonging to the field of intelligent agents and reinforcement learning models. The method comprises the following steps: first, multi-turn dialogue is treated as a finitely repeated game process, completed multi-turn dialogues are stored, and a historical dialogue information base is constructed; then, in a new multi-turn dialogue, an opponent action estimation model is built on a memory network, the turns of the current dialogue are used to retrieve the historical dialogue information base, and an estimate vector of the opponent's next-step strategy is generated through multi-step estimation; finally, an encoding-decoding model fuses the information of the current dialogue with the estimate vector and produces the next response. During multi-turn dialogue, the estimate vector derived from the past dialogue history is fused with the response vector of the current dialogue history, so that historical information is exploited more fully and the dialogue robot (intelligent agent) gains stronger adaptability and makes better responses.

Description

Multi-turn conversation method based on conversation history and reinforcement learning in game conversation
Technical Field
The invention relates to the field of intelligent agents and reinforcement learning models, in particular to a method for multi-turn dialogue of an intelligent agent.
Background
A virtual assistant or chat-partner system with sufficient intelligence once seemed a fantasy existing only in science-fiction movies. In recent years, however, human-machine dialogue has received increasing attention from researchers because of its potential and attractive commercial value. With the development of big data and deep learning techniques, building an automated human-machine dialogue system to serve as a personal assistant or chat partner is no longer far-fetched. Dialogue systems now draw attention across many fields, and continuous progress in deep learning has greatly promoted their development. For dialogue systems, deep learning can exploit large amounts of data to learn feature representations and reply-generation strategies with only a small amount of manual engineering. Today, conversational "big data" is easily accessible over the network, making it feasible to learn how to reply to almost any input and to build data-driven, open-domain dialogue systems between humans and computers. Deep learning techniques have, moreover, proven effective at capturing complex patterns in large data and underpin many research areas, such as computer vision, natural language processing, and recommendation systems.
From an application point of view, dialogue systems can be roughly divided into two categories: (1) task-oriented systems; (2) non-task-oriented systems (chat-type dialogue systems). Real-world dialogue tasks (such as bargaining and negotiation) are challenging: an opponent typically exhibits different patterns, and the dialogue spans many turns, though the number of turns is finite. However, current research rarely uses previous interactions (historical information).
Multi-turn conversation can be viewed as a finitely repeated game, and the conversation history comprises two parts: the first is the completed multi-turn conversations (referred to as the past conversation history), and the second is the turns already conducted in the current multi-turn conversation (referred to as the current conversation history). Existing dialogue systems focus only on exploiting the current conversation history and ignore the past conversation history. How to fully exploit historical information and respond better during a game conversation is therefore a pressing technical problem.
Each past conversation history is a complete dialogue process, and the historical information base stores complete conversations against different opponents, so this historical information is clearly valuable. In a new multi-turn conversation (e.g., conversational gaming, bargaining), these past conversation histories can be exploited to infer the opponent's type and policy so as to respond better.
Disclosure of Invention
The invention aims to provide a multi-turn conversation method based on conversation history and reinforcement learning in game conversation, so that the intelligent agent can adapt quickly in multi-turn conversation and infer the opponent's type and strategy faster in order to respond.
In order to achieve the purpose of the invention, the invention specifically adopts the following technical scheme:
a multi-turn conversation method based on conversation history and reinforcement learning in game conversation comprises the following steps:
S1: taking multi-turn conversation as a finitely repeated game process, storing the completed multi-turn conversations, and constructing a past conversation history information base;
S2: in a current multi-turn conversation that is under way but not yet finished, taking the turns already conducted in the current multi-turn conversation as the current conversation history, and retrieving from the past conversation history information base several complete multi-turn conversations most similar to the current conversation history as past history data; then, in an opponent action estimation model built on a memory network framework, using the current conversation history as the query and the past history data as the queried content, and generating an estimate vector of the opponent's subsequent actions through multi-step reasoning;
the opponent action estimation model is trained in advance, so that the output estimate vector of the opponent's subsequent actions approximates the actual vector of those actions;
S3: inputting the current conversation history and the estimate vector of the opponent's subsequent actions into a trained encoding-decoding model, which produces the next response.
Preferably, the opponent action estimation model is a one-step opponent action estimation model, whose output is an estimate vector representing the opponent's next action.
Preferably, the opponent action estimation model is a multi-step opponent action estimation model, whose output is an estimate vector representing all subsequent actions of the opponent in the current multi-turn dialogue.
Preferably, when a new multi-turn dialogue starts, the first few turns are answered directly by the multi-turn dialogue model without relying on the current dialogue history; in the remaining turns, the turns already conducted in the current multi-turn dialogue are taken as the current dialogue history, and the next response is produced according to S2 and S3.
Further, when a new multi-turn dialogue starts, the turns answered directly by the multi-turn dialogue model are the first 3 to 5 turns.
Preferably, in the encoding-decoding model, the vector obtained from the current dialogue history and the estimate vector of the opponent's subsequent actions are fused and encoded, then decoded by a neural network into natural language or an action to produce the next response.
Further, in the encoding-decoding model, the fusion encoding is performed either by concatenating the vectors directly or by fusing them through a self-attention mechanism.
Preferably, in the encoding-decoding model, the encoding part adopts a hierarchy-based encoder and the decoding part adopts a multilayer feedforward neural network.
Preferably, when training the opponent action estimation model, the current dialogue history is input into the model to generate an estimate vector of the opponent's subsequent actions, while the subsequent actions of each multi-turn dialogue in the past history data are input into a FusionNet neural network to generate the actual vector of those actions; the two vectors are driven arbitrarily close by optimizing the model parameters.
Preferably, the multi-turn conversations are task-type conversations and chat-type conversations.
During multi-turn conversation, the estimate vector derived from the past conversation history is fused with the response vector of the current conversation history, so that historical information is exploited more fully and the conversation robot (intelligent agent) gains stronger adaptability and makes better responses. The method serves as an architecture into which prior multi-turn dialogue methods or models can be fully integrated; it can also be combined with the latest research in the dialogue field, and thus has good extensibility.
Drawings
Fig. 1 is a flow chart of a method for multiple rounds of dialogue based on dialogue history and reinforcement learning in a gaming dialogue.
FIG. 2 is a diagram of a one-step opponent estimation model.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
In a gaming dialogue (e.g., bargaining), the type of opponent faced may differ from one encounter to the next, and the opponent's strategy may vary. In a new dialogue, quickly inferring the opponent's type and policy and making the most favorable response is a challenge.
The method provided by the invention is particularly applicable when the game dialogues, or the opponents involved, exhibit diverse strategies and types. In repeated games, the game history is the primary basis for making decisions and defeating an opponent. Historical information is a special form of knowledge from which to judge the opponent's type, infer incomplete information, and predict the opponent's behavior; absent other additional information about the opponent, it is arguably the only basis. Interaction in the real world (e.g., bargaining and negotiation) is a challenging task: the opponent often behaves differently, and the interaction usually spans multiple but finitely many turns. However, current research rarely uses previous interactions (historical information). Facing a wide variety of opponents or strategies, enabling the agent to adapt quickly is an important problem. Many new policies are integrations or variants of old policies, so historical information can be used to make the agent quickly adaptive.
A finitely repeated game is a primary (stage) game repeated a finite number of times, and multi-turn conversation can be regarded as such a process. The conversation history comprises two types: the first is the completed multi-turn conversations (called the past conversation history), and the second is the turns already conducted in the current multi-turn conversation (called the current conversation history). For a multi-turn conversation, the available information therefore includes not only the current conversation history but also the past conversation history. The invention exploits both types of historical information simultaneously, giving the conversation robot (intelligent agent) stronger adaptability and better responses. The implementation of the invention is explained in detail below.
Referring to fig. 1, in a preferred embodiment of the present invention, a method for multiple rounds of dialogue based on dialogue history and reinforcement learning in a gaming dialogue is provided, which comprises the following steps:
s1: and taking multiple rounds of conversations as a limited repeated game process, collecting and storing the completed multiple rounds of conversations in the agent, and constructing a historical information base of the previous conversations. In consideration of the limitation of storage capacity, complete rounds of conversations can be screened, the final scores can be obtained after the whole limited repeated game is ended, and the typical complete rounds of conversations with high scores are selected and stored. The conversation history which is finished before is used as the basis for the decision of the intelligent agent, so that the prior complete conversation information is stored and marked at the moment when the prior conversation history information base is constructed, and the rest current conversation history can be used for the inquiry of the next step.
S2: In a multi-turn conversation that is under way but not finished (denoted the current multi-turn conversation), take the turns already conducted as the current conversation history, and retrieve from the past conversation history information base several complete multi-turn conversations most similar to the current conversation history as past history data; during retrieval, only the first m turns of each past history are compared (m being the number of turns already conducted in the current conversation). The complete multi-turn conversations with the highest similarity to the current conversation history are taken as the retrieval results. The similarity can be computed with text-similarity techniques: first convert the conversation text into word vectors (word embeddings), then compute the cosine similarity between the vectors; other similarity measures may of course also be used. An Opponent Action Estimation (OAE) model is then constructed and trained on a Memory Network framework; in this model, the current conversation history serves as the query and the past history data as the queried content, and an estimate vector of the opponent's subsequent actions is generated through multi-step reasoning. A sketch of the retrieval step follows.
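A hedged sketch of the retrieval step, assuming a placeholder `embed` function in place of the actual word-embedding model; only the first m turns of each stored dialogue are compared, with cosine similarity as described above.

```python
import numpy as np

def embed(turns):
    """Stub sentence encoder returning one vector per turn.

    Stands in for the word-embedding step described above; a real system
    would use trained embeddings, so the hash-seeded vectors here are
    purely illustrative."""
    rng = np.random.default_rng(abs(hash(tuple(turns))) % (2**32))
    return rng.standard_normal((len(turns), 64))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_top_k(current_history, past_dialogues, k=5):
    """Return the k completed dialogues whose first m turns are most
    similar to the current m-turn history (m = len(current_history))."""
    m = len(current_history)
    query = embed(current_history).mean(axis=0)
    scored = []
    for dialogue in past_dialogues:
        if len(dialogue) < m:
            continue  # too short to compare against m turns
        vec = embed(dialogue[:m]).mean(axis=0)  # only the first m turns
        scored.append((cosine(query, vec), dialogue))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [d for _, d in scored[:k]]
```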
It should be noted that the opponent action estimation model must be trained before actual use, so that the output estimate vector of the opponent's subsequent actions approximates the actual vector of those actions, i.e., the two vectors become arbitrarily close.
In this embodiment, when the opponent action estimation model is trained, the current conversation history in the training data is input into the model (i.e., the memory network framework) to generate an estimate vector of the opponent's subsequent actions, while the subsequent actions of each multi-turn conversation in the past history data are input into a FusionNet neural network to generate the actual vector of those actions; the two vectors are driven arbitrarily close by optimizing the model parameters. If the number of turns already conducted in the current conversation is m, the subsequent actions fed to the FusionNet are the (m+1)-th turn alone, or the (m+1)-th turn together with all later turns of the completed multi-turn conversation. A sketch of this training step follows.
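A minimal sketch of this training step. The patent requires only that the two vectors be driven together by optimizing model parameters; the mean-squared-error objective and joint optimization of both networks are assumptions here.

```python
import torch
import torch.nn as nn

def train_step(oae, fusion_net, optimizer, history_batch, follow_up_batch):
    """One optimization step: drive the OAE estimate toward the encoded
    actual follow-up action. `oae` and `fusion_net` are any nn.Modules
    mapping their inputs to fixed-size vectors of the same dimension;
    `optimizer` is assumed to cover the parameters of both."""
    estimate = oae(history_batch)          # estimated follow-up action vector
    actual = fusion_net(follow_up_batch)   # actual vector from the FusionNet
    loss = nn.functional.mse_loss(estimate, actual)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```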
Which subsequent actions of the opponent are estimated depends on what the agent needs to predict: if m turns have been conducted in the current multi-turn dialogue, predicting only the (m+1)-th turn is called One-step Opponent Action Estimation (O-OAE), while predicting the (m+1)-th turn and all later actions is called Multi-step Opponent Action Estimation (M-OAE).
In the multi-step reasoning of the invention, the opponent action estimation model is therefore constructed and trained on a memory network framework. As shown in Fig. 2, the current conversation history serves as the query and the past history data as the queried content, and the memory network can reason in three steps (or more). The reasoning proceeds as follows: first, word vectors of the past history and the current history are obtained through an encoding matrix, and a softmax operation computes their similarity, yielding relevance weights over the past history; the past history is then re-encoded with a different encoding matrix and combined with the relevance weights in a weighted sum, which completes one reasoning step. Multi-step reasoning repeats these operations, with each step using a different encoding matrix for the past history. Finally, the estimate vector of the opponent's subsequent actions is generated. A minimal sketch follows.
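The following is a minimal end-to-end memory network in the spirit of this description, assuming pre-encoded query and memory vectors. The per-hop encoding matrices and softmax relevance weights follow the text above; the additive update `u + o` is a common memory-network convention and is an assumption here.

```python
import torch
import torch.nn as nn

class OpponentActionEstimator(nn.Module):
    """Minimal multi-hop memory network in the spirit of the OAE module.

    The current history (pre-encoded as `query`) attends over the K
    retrieved past dialogues (pre-encoded as `memories`); each hop uses
    its own input/output encoding matrices, as described above.
    """
    def __init__(self, dim=64, hops=3):
        super().__init__()
        self.A = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(hops)])
        self.C = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(hops)])
        self.out = nn.Linear(dim, dim)  # projects to the estimate vector

    def forward(self, query, memories):
        # query: (batch, dim); memories: (batch, K, dim)
        u = query
        for A, C in zip(self.A, self.C):
            keys = A(memories)                                   # encode past history
            weights = torch.softmax(
                torch.einsum("bd,bkd->bk", u, keys), dim=-1)     # relevance weights
            values = C(memories)                                 # second encoding matrix
            u = u + torch.einsum("bk,bkd->bd", weights, values)  # one reasoning step
        return self.out(u)  # estimate vector of the opponent's subsequent actions
```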
S3: Input the current conversation history, together with the estimate vector of the opponent's subsequent actions output by the opponent action estimation model in S2, into a trained encoding-decoding (Encoder-Decoder) model, which produces the next response.
In S1 to S3 above, the current multi-turn conversation relies on the turns already conducted as the current conversation history; when a new multi-turn conversation starts, however, the first few turns carry too little information, so the current conversation history is unreliable. Therefore, when a new multi-turn conversation starts, the first few turns can be answered directly by the multi-turn dialogue model, without responding based on the current conversation history; in the remaining turns, the turns already conducted are taken as the current conversation history, and the next response is produced according to S2 and S3. Here, the multi-turn dialogue model refers to the agent the conversation robot possessed before the invention is applied, which can generate responses according to existing methods and models.
The number of turns m answered directly by the multi-turn dialogue model at the start of a new conversation can be determined from the total number of turns and is typically set to 3 to 5. The first 3 to 5 turns are answered with existing methods and models; afterwards, under the S2-S3 framework, the K past conversation histories most similar to the current m-turn history are retrieved from the past conversation history information base (K being tuned in practice), and estimation then proceeds with the opponent action estimation model.
The encoding-decoding model encodes the current conversation history into a vector, fuses it with the estimate vector of the opponent's subsequent actions obtained in S2, and then decodes with a neural network into natural language or an action (depending on the form of the dialogue) to produce the next response. The fusion encoding can be performed in different ways: for example, the vectors can be concatenated directly (concat) or fused through a self-attention mechanism (self-attention). The encoding-decoding model may take various specific forms as long as it performs this function. In the encoding-decoding model of this embodiment, the encoding part adopts a hierarchy-based encoder and the decoding part a multilayer feedforward neural network: the current history is encoded by the hierarchical encoder, fused with the estimate vector, and the next action is finally generated by the multilayer feedforward network. The invention is thus a framework into which prior multi-turn dialogue methods or models can be fully integrated, and which can also be combined with the latest research in the dialogue field. Different kinds of game problems have different payoff matrices (payoff functions), but the opponent action estimation model is independent of the specific game problem; the module can be reused as long as the past history and the current history belong to the same game problem. A sketch of the encode-fuse-decode step is given below.
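The following sketch illustrates the encode-fuse-decode step with a hierarchical (word-level then turn-level) GRU encoder, concatenation fusion, and a feedforward decoder over a discrete action space; the vocabulary size, dimensions, and action-logit output format are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusionResponder(nn.Module):
    """Encode-fuse-decode sketch: a hierarchical encoder (a GRU over the
    words of each turn, then a GRU over turns) summarizes the current
    history; the summary is concatenated with the OAE estimate; and a
    feedforward decoder scores the next action. Concatenation is one of
    the two fusion options named above; all sizes are illustrative.
    """
    def __init__(self, vocab_size=10_000, dim=64, n_actions=32):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.word_rnn = nn.GRU(dim, dim, batch_first=True)  # within each turn
        self.turn_rnn = nn.GRU(dim, dim, batch_first=True)  # across turns
        self.decoder = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, n_actions))

    def forward(self, turns, estimate):
        # turns: (batch, n_turns, n_words) token ids; estimate: (batch, dim)
        b, t, w = turns.shape
        words = self.emb(turns).view(b * t, w, -1)
        _, h = self.word_rnn(words)             # (1, b*t, dim) per-turn vectors
        _, h = self.turn_rnn(h.view(b, t, -1))  # (1, b, dim) history vector
        fused = torch.cat([h.squeeze(0), estimate], dim=-1)  # concat fusion
        return self.decoder(fused)              # logits over candidate actions
```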
The multi-turn conversation method provided by the invention exploits historical information more fully and improves the adaptability and response accuracy of the conversation robot (intelligent agent). Current conversation robots fall mainly into task-type and chat-type robots.
A task-type dialogue robot aims to help the user complete a specific task (such as ordering food or booking tickets), and the fewer the dialogue turns the better. The invention makes full use of the past history information base and, guided by historical information, addresses targeting and adaptability, shortening the number of conversation turns, helping the user finish the task faster, and improving the user experience. Moreover, the invention can migrate quickly across different kinds of dialogue robots; for example, a ticket-booking robot can be rapidly built from a meal-ordering dialogue robot.
A chat-type conversation robot mainly chats with users, but current chat robots chiefly suffer from monotonous responses, repetitive language, overly short sessions, and the like. Unlike other chat robots based on retrieval models, the invention can provide more diverse responses through a rich historical information base; the multi-step reasoning model (the opponent action estimation model) gives the conversation robot simple logical reasoning and problem-transfer abilities, and making responses in specific styles for different user types renders the robot more intelligent and humanized.
Practical application results show that the multi-turn conversation method based on conversation history and reinforcement learning in game conversation provided by the invention gives both kinds of conversation robots stronger adaptability and better responses.
The above-described embodiments are merely preferred embodiments of the present invention, which should not be construed as limiting the invention. Various changes and modifications may be made by one of ordinary skill in the pertinent art without departing from the spirit and scope of the present invention. Therefore, the technical scheme obtained by adopting the mode of equivalent replacement or equivalent transformation is within the protection scope of the invention.

Claims (9)

1. A multi-turn conversation method based on conversation history and reinforcement learning in game conversation, characterized by comprising the following steps:
S1: taking multi-turn conversation as a finitely repeated game process, storing the completed multi-turn conversations, and constructing a past conversation history information base;
S2: in a current multi-turn conversation that is under way but not yet finished, taking the turns already conducted in the current multi-turn conversation as the current conversation history, and retrieving from the past conversation history information base several complete multi-turn conversations most similar to the current conversation history as past history data; then, in an opponent action estimation model built on a memory network framework, using the current conversation history as the query and the past history data as the queried content, and generating an estimate vector of the opponent's subsequent actions through multi-step reasoning;
the opponent action estimation model being trained in advance so that the output estimate vector of the opponent's subsequent actions approximates the actual vector of those actions; when training the opponent action estimation model, the current conversation history is input into the model to generate an estimate vector of the opponent's subsequent actions, while the subsequent actions of each multi-turn conversation in the past history data are input into a FusionNet neural network to generate the actual vector of those actions, the two vectors being driven arbitrarily close by optimizing the model parameters;
S3: inputting the current conversation history and the estimate vector of the opponent's subsequent actions into a trained encoding-decoding model, which produces the next response.
2. The method of claim 1, wherein the opponent action estimation model is a one-step opponent action estimation model that outputs an estimation vector representing the next action of the opponent.
3. The method of claim 1, wherein the opponent action estimation model is a multi-step opponent action estimation model that outputs an estimation vector representing all subsequent actions of the opponent in the current multi-turn conversation.
4. The method of claim 1, wherein, when a new multi-turn conversation starts, the first several turns are answered directly according to the multi-turn conversation model without responding based on the current conversation history; and in the remaining conversation turns, the turns already conducted in the current multi-turn conversation are taken as the current conversation history, and the next response is produced according to S2 and S3.
5. The method of claim 4, wherein, when a new multi-turn conversation starts, the turns answered directly according to the multi-turn conversation model are the first 3 to 5 turns.
6. A multi-turn dialogue method based on dialogue history and reinforcement learning in a gaming dialogue as recited in claim 1, wherein, in the coding-decoding model, the vector obtained from the current dialogue history and the estimated vector of the opponent's subsequent actions are encoded in a fused manner, then decoded into natural language or actions by a neural network to produce the next response.
7. A multiple-turn dialogue method based on dialogue history and reinforcement learning in gaming dialogues as recited in claim 6, wherein in the encoding-decoding model, the fusion encoding is performed by splicing vectors directly or by a self-attention mechanism.
8. The method for multiple rounds of dialogue based on dialogue history and reinforcement learning in gaming dialogue as recited in claim 1, wherein, in the encoding-decoding model, the encoding component employs a hierarchy-based encoder and the decoding component employs a multi-layer feed-forward neural network.
9. A method of multiple rounds of conversation based on conversation history and reinforcement learning in a gaming conversation as claimed in claim 1, wherein said multiple rounds of conversation are task-type conversations and chat-type conversations.
CN202110378191.5A 2021-04-08 2021-04-08 Multi-turn conversation method based on conversation history and reinforcement learning in game conversation Active CN113111241B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110378191.5A CN113111241B (en) 2021-04-08 2021-04-08 Multi-turn conversation method based on conversation history and reinforcement learning in game conversation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110378191.5A CN113111241B (en) 2021-04-08 2021-04-08 Multi-turn conversation method based on conversation history and reinforcement learning in game conversation

Publications (2)

Publication Number Publication Date
CN113111241A (en) 2021-07-13
CN113111241B (en) 2022-12-06

Family

ID=76715391

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110378191.5A Active CN113111241B (en) 2021-04-08 2021-04-08 Multi-turn conversation method based on conversation history and reinforcement learning in game conversation

Country Status (1)

Country Link
CN (1) CN113111241B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800294A (en) * 2019-01-08 2019-05-24 中国科学院自动化研究所 Autonomous evolution Intelligent dialogue method, system, device based on physical environment game

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885756B (en) * 2016-09-30 2020-05-08 华为技术有限公司 Deep learning-based dialogue method, device and equipment
CN108681610B (en) * 2018-05-28 2019-12-10 山东大学 generating type multi-turn chatting dialogue method, system and computer readable storage medium
CN111414460B (en) * 2019-02-03 2024-01-19 北京邮电大学 Multi-round dialogue management method and device combining memory storage and neural network
CN110188167B (en) * 2019-05-17 2021-03-30 北京邮电大学 End-to-end dialogue method and system integrating external knowledge
CN112115247B (en) * 2020-09-07 2023-10-10 中国人民大学 Personalized dialogue generation method and system based on long-short-time memory information

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800294A (en) * 2019-01-08 2019-05-24 中国科学院自动化研究所 Autonomous evolution Intelligent dialogue method, system, device based on physical environment game

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
GRS: A Generation-Retrieval Dialogue Model for Intelligent Customer Service in the E-commerce Domain; Guo Xiaozhe et al.; Journal of East China Normal University (Natural Science Edition); 2020-09-25 (No. 05); full text *

Also Published As

Publication number Publication date
CN113111241A (en) 2021-07-13

Similar Documents

Publication Publication Date Title
CN108681610B (en) generating type multi-turn chatting dialogue method, system and computer readable storage medium
CN110119467B (en) Project recommendation method, device, equipment and storage medium based on session
CN110427490B (en) Emotional dialogue generation method and device based on self-attention mechanism
US20190197402A1 (en) Adding deep learning based ai control
US20180314942A1 (en) Scalable framework for autonomous artificial intelligence characters
Gorniak et al. Situated language understanding as filtering perceived affordances
CN114443827A (en) Local information perception dialogue method and system based on pre-training language model
US20180314963A1 (en) Domain-independent and scalable automated planning system using deep neural networks
Dai et al. A survey on dialog management: Recent advances and challenges
CN113220856A (en) Multi-round dialogue system based on Chinese pre-training model
Wang et al. Skill-based hierarchical reinforcement learning for target visual navigation
Mitsopoulos et al. Toward a psychology of deep reinforcement learning agents using a cognitive architecture
CN113111241B (en) Multi-turn conversation method based on conversation history and reinforcement learning in game conversation
CN117573834A (en) Multi-robot dialogue method and system for software-oriented instant service platform
CN112418421A (en) Roadside end pedestrian trajectory prediction algorithm based on graph attention self-coding model
CN116841708A (en) Multi-agent reinforcement learning method based on intelligent planning
Musilek et al. Enhanced learning classifier system for robot navigation
Cordier et al. Diluted near-optimal expert demonstrations for guiding dialogue stochastic policy optimisation
Vriend Artificial intelligence and economic theory
CN113743605A (en) Method for searching smoke and fire detection network architecture based on evolution method
Pang et al. Natural Language-conditioned Reinforcement Learning with Inside-out Task Language Development and Translation
Dsouza et al. Optimizing MRC Tasks: Understanding and Resolving Ambiguities
Rohmatillah et al. Advances and Challenges in Multi-Domain Task-Oriented Dialogue Policy Optimization
Saha et al. Transfer Learning based Task-oriented Dialogue Policy for Multiple Domains using Hierarchical Reinforcement Learning
Kim et al. DESEM: Depthwise Separable Convolution-Based Multimodal Deep Learning for In-Game Action Anticipation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant