CN110413754B - Dialog and in-dialog reward evaluation and dialog methods, media, apparatuses, and computing devices - Google Patents


Info

Publication number
CN110413754B
CN110413754B (application CN201910663167.9A)
Authority
CN
China
Prior art keywords
conversation
reward
dialog
strategy
agent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910663167.9A
Other languages
Chinese (zh)
Other versions
CN110413754A (en
Inventor
黄民烈
高信龙一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201910663167.9A priority Critical patent/CN110413754B/en
Publication of CN110413754A publication Critical patent/CN110413754A/en
Application granted granted Critical
Publication of CN110413754B publication Critical patent/CN110413754B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present invention provide dialog reward evaluation, in-dialog reward evaluation, and dialog methods. The dialog reward evaluation method comprises: obtaining one or more rounds of dialog in which a target agent participates; and determining a reward for the one or more rounds of dialog in which the target agent participates based on the acquired one or more rounds of dialog and the corresponding human dialogs in the real scene. By learning the reward function from real human dialogs rather than designing it manually, a large amount of parameter-tuning work is eliminated. In addition, the in-dialog reward evaluation method and the dialog method can provide the reward of the dialog in real time within a dialog session, so as to guide the dialog strategy in each dialog turn.

Description

Dialog and in-dialog reward evaluation and dialog methods, media, apparatuses, and computing devices
Technical Field
Embodiments of the present invention relate to the field of human-computer dialog, and more particularly to a dialog reward evaluation method, an in-dialog reward evaluation method, a dialog method, and corresponding media, apparatuses, and computing devices.
Background
The intelligent human-machine dialog system is an intelligent system capable of dialog interaction with a user. Within such a system, the dialog strategy is the module that decides how to reply to the user. The earliest approach to designing a dialog strategy was for designers to craft different logic rules for different user inputs. The disadvantage of this approach is that the dialog strategy cannot be continuously optimized from user feedback, so its ability to adapt to the user and the environment cannot be strengthened.
In recent years, deep reinforcement learning methods have increasingly been used to optimize dialog strategies. In such methods, the dialog strategy is represented by a neural network and trained by reinforcement learning with a reward signal; the advantage is that system performance (such as the dialog success rate) keeps improving as users continue to use the system.
Current dialog systems require elaborately designed reward functions and pre-specified user goals. As the demand grows for systems that handle complex goals spanning multiple domains, the complexity of real-world tasks overwhelms such manually designed reward functions.
Disclosure of Invention
The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention, and it is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In view of the problems set forth above, the present invention proposes a dialog reward evaluation method, the dialog comprising multiple rounds of dialog between two dialog parties, one party being a target agent and the other party being a user or opponent agent, the method comprising:
obtaining one or more of the multiple sessions in which the target agent participates;
determining a reward of the one or more rounds of conversations in which the target agent participates based on the acquired one or more rounds of conversations and the corresponding human conversations in the real scene.
The invention also provides a method of assessing rewards in a conversation, the conversation comprising multiple rounds of conversations between two parties of the conversation, one party of the two parties of the conversation being a target agent and the other party being a user or opponent agent, the method comprising:
obtaining a current turn of the multiple turns of conversations in which the target agent participates, wherein the target agent has not output any utterance in the current turn of conversations;
predicting a conversation strategy to be adopted based on the current state of the target agent in the multiple rounds of conversations;
calculating, based on the method described above, the reward of the dialog of the target agent that includes the predicted dialog strategy to be taken.
The invention also provides a dialogue method, wherein the dialogue comprises multiple rounds of dialogue between two dialogue parties, one of the two dialogue parties is a target agent, and the other dialogue party is a user agent or an opponent agent, the method comprises the following steps:
tracking each turn of dialogue in the multiple turns of dialogue and generating a dialogue state of a corresponding turn of a corresponding object;
determining at least one dialog strategy that the target agent can adopt based on the dialog state of the current turn of the target agent;
respectively determining the reward of the target agent for adopting the corresponding conversation strategy based on the determined at least one conversation strategy and the conversation state of the current turn of the target agent;
selecting an optimal conversation strategy from at least one conversation strategy according to the reward and generating a corresponding utterance;
wherein the reward is calculated by a method as described in any of the preceding.
The present invention also provides a dialog bonus assessment apparatus, said dialog comprising a plurality of rounds of dialog between two parties of said dialog, one party of said two parties of said dialog being a target agent and the other party being a user or opponent agent, said apparatus comprising:
a conversation acquisition module configured to acquire one or more of the multiple rounds of conversations in which a target agent participates;
a reward determination module configured to determine a reward for the one or more rounds of dialog in which the target agent participates based on the acquired one or more rounds of dialog and the corresponding human dialog in the real scene.
The present invention also provides an apparatus for assessing rewards in a conversation, the conversation comprising multiple rounds of conversation between two parties to the conversation, one party of the two parties to the conversation being a target agent and the other party being a user or opponent agent, the apparatus comprising:
a dialog acquisition module configured to acquire a current turn of dialog of the multiple turns of dialog in which a target agent participates, wherein the target agent does not output any utterance in the current turn of dialog;
a conversation strategy prediction module configured to predict a conversation strategy to be taken based on a current state of the target agent in the plurality of rounds of conversations;
wherein the reward of the dialog of the target agent that includes the predicted dialog strategy to be taken is calculated by the apparatus described above.
The present invention also provides a dialog apparatus, the dialog comprising multiple rounds of dialog between two dialog parties, one party being a target agent and the other party being a user or opponent agent, the apparatus comprising:
a dialog tracking module configured to track each of the plurality of turns of dialog and generate a dialog state for a corresponding turn of the corresponding object;
a conversation policy sampling module configured to determine at least one conversation policy that a target agent is capable of adopting based on a conversation state of a current turn of the target agent;
a reward determination module configured to respectively determine the reward of the target agent adopting each of the at least one determined dialog strategy, based on the determined at least one dialog strategy and the dialog state of the target agent's current turn;
a sentence generation module configured to select an optimal dialog strategy from the at least one dialog strategy according to the reward and generate a corresponding utterance;
wherein the reward is calculated by the apparatus as described in any of the preceding.
The present invention also provides a computer-readable storage medium having stored thereon a computer program for executing any of the methods described above.
The present invention also provides a computing device, comprising: a processor; a memory for storing the processor-executable instructions;
the processor is configured to perform any one of the methods described above.
The dialog reward evaluation, in-dialog reward evaluation, and dialog methods, media, apparatuses, and computing devices according to embodiments of the present invention eliminate a large amount of parameter-tuning work by learning the reward function from real human dialogs rather than by manual design. In addition, the in-dialog reward evaluation method and the dialog method can provide the reward of the dialog in real time within a dialog session, so as to guide the dialog strategy in each dialog turn.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present invention will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
fig. 1 is a schematic flowchart of a dialog reward evaluation method according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of an in-dialog reward evaluation method according to an embodiment of the invention;
fig. 3 is a schematic flowchart of a dialog method according to an embodiment of the present invention;
FIG. 4 is a schematic block diagram of a dialog reward evaluation device according to an embodiment of the invention;
FIG. 5 is a schematic block diagram of an in-dialog reward evaluation device according to an embodiment of the invention;
Fig. 6 is a schematic block diagram of a dialog device according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computing device provided by an embodiment of the invention;
in the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
Embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that the following description is only exemplary and is not intended to limit the present invention. Further, in the following description, the same reference numerals will be used to refer to the same or like parts in different drawings. The different features in the different embodiments described below can be combined with each other to form further embodiments within the scope of the invention.
A dialog reward evaluation method according to an exemplary embodiment of the invention is described below with reference to fig. 1. The dialog comprises multiple rounds of dialog between two dialog parties, one party being a target agent and the other party being a user or opponent agent. The method comprises:
step S110, acquiring one or more of the multiple rounds of conversations in which the target agent participates;
the method disclosed by the invention can be applied to a plurality of preset fields, and the preset fields can include but are not limited to one or more of the following fields: ordering; booking tickets; shopping online; booking a taxi; reserving a hotel; and to look for music, movies, or certain products, etc.
According to an embodiment of the invention, the multiple rounds of dialog between the opponent and the agent may involve only one and the same domain, or alternatively multiple domains, with each round of dialog involving only one domain.
For example, it is assumed that the at least one preset domain includes two domains of ordering food and finding products.
Further, each round of dialog in the multiple rounds of dialog between the opponent (user/agent) and the agent comprises the opponent's dialog content and the agent's dialog content for that round (input text content or voice content).
As an example, the domain corresponding to the current round (i.e., the current domain) is any one of the at least one preset domain described above.
For example, U (i) represents opponent dialog content of the ith turn, and S (i) represents target agent dialog content of the ith turn, where i represents turn number, i =1,2,3, …. For example, U (1) represents the first round of opponent dialog content, S (1) represents the first round of target agent dialog content, and so on. Thus, multiple rounds of dialog can be represented as: u (1) → S (1) → U (2) → S (2) → U (3) → S (3) → … ….
It should be noted that the first round of the multi-round dialog may be set as the opponent's first-round speaking content (i.e. the first round of opponent dialog content) together with the target agent's first-round speaking content that immediately follows it (i.e. the first round of agent dialog content), the second round may be set as the second round of opponent dialog content and the second round of target agent dialog content, and so on.
It should be noted that in each round of conversation, the speaking time of the opponent's conversation content precedes the speaking time of the target agent's conversation content.
As an example, if the actual first speaker of the entire multi-turn conversation is the target agent, the first turn of the speaking content of the opponent (i.e., the first turn of the opponent conversation content U (1)) may be set to null.
When processing to a certain round (for example, the t-th round, t being a positive integer, i.e., t =1,2, …), the round is taken as the current round.
And step S120, determining the reward of the one or more conversations participated by the target agent based on the acquired one or more conversations and the corresponding human conversations in the real scene.
In an embodiment of the method, two reward calculation methods are proposed, one of which is:
comparing the acquired one or more rounds of conversations with corresponding human conversations in a real scene;
specifically, behavior tracks of human conversation and behavior tracks of intelligent agent conversation under corresponding scenes are respectively obtained; the action track can represent the conversation strategy taken by the corresponding object, and any object in the conversation can achieve a specific intention by taking the corresponding conversation strategy; for example, given a set of collected human dialog segments
Figure BDA0002139225640000061
Each dialog τ can be viewed as a trace of state-action pairs
Figure BDA0002139225640000062
Wherein
Figure BDA0002139225640000063
Is a user (opponent) conversation turn, { s i ,a i Is a system dialogue turn. It will be appreciated that the user (opponent) dialog may be given by another agent (user simulator), for example user simulator μ (a) u ,t u |s u ) According to the user's dialog state s u Giving action a u Wherein t is u A binary termination signal is indicated indicating whether the user wants to end the current session. The quality of the reply to the target agent dialog is then evaluated by comparing the target agent dialog to a sample of human dialog segments from the corpus.
The state of any object (target agent or user) is updated based on the acquired one or more rounds of dialog, and includes at least the opponent's action in the current round of dialog and the object's own action in the previous round. For example, at dialog turn $t$, the target agent state includes (1) the user action of the current turn $a_t^u$ and (2) the target agent action of the previous turn $a_{t-1}$.
It should be noted that, in order to allow the dialog strategy to output several intentions in one dialog turn, in an embodiment each target agent action $a$ or user action $a^u$ is a set of dialog acts drawn from the dialog act set $\mathcal{A}$, rather than a single element. In addition, a dialog act is an abstract representation of an intention (mainly informing and inquiring) and, in a multi-domain setting, may be represented by a quadruple of domain, intent, slot type and slot value (e.g., [restaurant, inform, food, Italian]). It should be noted that one dialog act may include several slots (e.g., several informable slots and requestable slots, or slots of other types).
In a dialog output by the agent, the slot value may be retrieved by the agent from an external database, and a count placeholder may be substituted for the slot value before the agent refills its truth value with the selected entity from the external database.
Thus, the object's state also includes the confidence state of all slots and an embedding vector of the number of results queried from the external database. Specifically, at dialog turn $t$, the target agent state $s_t$ includes (1) the user action of the current turn $a_t^u$; (2) the target agent action of the previous turn $a_{t-1}$; (3) the confidence state $b_t$ of all slots; and (4) the embedding vector $q_t$ of the number of results queried from the external database.
It should be noted that, since the present method makes decisions at the dialog-act level, updating the state can be achieved simply by extracting the slot values from the actions.
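The following is a minimal Python sketch of how such a turn-level state might be assembled; the helper names, the toy act vocabulary, and the bucketed count embedding are illustrative assumptions, not details fixed by the text:

```python
import numpy as np

# Toy dialog-act vocabulary: each act is a (domain, intent, slot) triple (slot values delexicalized).
ACT_VOCAB = [
    ("restaurant", "inform", "food"),
    ("restaurant", "inform", "area"),
    ("restaurant", "request", "phone"),
    ("hotel", "inform", "price"),
]
SLOTS = ["restaurant-food", "restaurant-area", "hotel-price"]

def acts_to_vec(acts):
    """Multi-hot encoding: an action is a set of dialog acts, not a single element."""
    v = np.zeros(len(ACT_VOCAB))
    for act in acts:
        v[ACT_VOCAB.index(act)] = 1.0
    return v

def db_count_embedding(n_results, buckets=(1, 2, 4, 8)):
    """One-hot bucket embedding of the number of entities returned by the external database."""
    v = np.zeros(len(buckets) + 1)
    v[sum(n_results >= b for b in buckets)] = 1.0
    return v

def build_state(user_acts_t, sys_acts_prev, belief_t, n_db_results):
    """State s_t = (a_t^u, a_{t-1}, b_t, q_t) flattened into one feature vector."""
    return np.concatenate([
        acts_to_vec(user_acts_t),                # (1) user action of the current turn
        acts_to_vec(sys_acts_prev),              # (2) system action of the previous turn
        np.array([belief_t[s] for s in SLOTS]),  # (3) confidence (belief) state of all slots
        db_count_embedding(n_db_results),        # (4) embedding of the DB query result count
    ])

s_t = build_state(
    user_acts_t=[("restaurant", "inform", "food")],
    sys_acts_prev=[("restaurant", "request", "phone")],
    belief_t={"restaurant-food": 0.9, "restaurant-area": 0.1, "hotel-price": 0.0},
    n_db_results=5,
)
print(s_t.shape)  # (4 + 4 + 3 + 5,) = (16,)
```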
The second way of calculating the reward is described below:
learning a reward function from human dialogs in the corresponding real scene;
specifically, the reward function can be learned based on the current optimal strategy viewed by the human dialogue behavior track in the corresponding real scene, the essence of the method is reverse reinforcement learning (IRL), the IRL does not know what the reward function is specifically, and only pi is the optimal strategy from the human or the optimal strategy * Expert demonstration of well-drawn (expert demonstrations)
Figure BDA0002139225640000077
The IRL aims to infer potential reward functions R from these expert presentations and to train the strategy accordingly. This is different from simple emulation learning (emulation learning), which simply learns the state-to-action mapping, the most fundamental difference being that IRL can let learned strategies understand human goals. Demonstration of experts in view of IRL
Figure BDA0002139225640000078
Viewed as being from the optimal strategy pi * The IRL problem may be used as a solution to the maximum likelihood estimation, specifically, the parameter θ about the dialog strategy is solved by the maximum likelihood estimation to determine the reward function:
Figure BDA0002139225640000073
Figure BDA0002139225640000074
in the above formula, p θ (τ) contains a parameterized reward function r θ (s t ,a 0 ) Wherein gamma is epsilon [0,1]Is a discount (discount) factor that controls the weight of the reward. Intuitively, it can be understood that the more closely the trajectory of human behavior, the more rewards are obtained, i.e., the higher the similarity to the trajectory of human behavior, the higher the reward is obtained by the target agent.
Determining a reward for the one or more conversations in which the target agent participates based on the one or more conversations and the reward function.
It should be understood that, in an embodiment of the present invention, the learning of the reward function does not need to be performed while the dialog reward evaluation method is executed; the reward function is obtained in the training phase.
In one example of this embodiment, session-level rewards are calculated, based on a maximum-entropy IRL algorithm, by maximizing the likelihood of the observed human dialog sessions so as to infer the underlying goals:

$$\max_\omega \; \mathbb{E}_{\tau \sim \mathcal{D}}\left[\log p_\omega(\tau)\right], \qquad p_\omega(\tau) = \frac{\exp\left(f_\omega(\tau)\right)}{Z},$$

$$f_\omega(\tau) = R(\tau) = \sum_t \gamma^t r_\omega(s_t, a_t), \qquad Z = \sum_{\tau'} \exp\left(f_\omega(\tau')\right),$$

where $f$ models the human dialog as a Boltzmann distribution, $R$ represents the return of the dialog, i.e., the $\gamma$-discounted cumulative reward, and $Z$ is the partition function used to normalize the probabilities.
Note that IRL is similar to the discriminator network in adversarial learning (AL), which evaluates the authenticity of a sample. In one example of this embodiment, a strong connection between GAN and maximum-entropy causal IRL is obtained by replacing the Boltzmann distribution in IRL with the estimate of the true data density used in AL:

$$\tilde{p}(\tau) \propto \exp\left(-c(\tau)\right), \qquad D(\tau) = \frac{\tilde{p}(\tau)}{\tilde{p}(\tau) + q(\tau)}.$$

In the above equations, $c(\cdot) = -r(\cdot)$ is a cost function, and $q(\tau)$ is the probability that the generator in RL or GAN generates the trajectory $\tau$. In this embodiment, the IRL algorithm is combined with the AL algorithm to achieve better automatic reward estimation.
With reference to fig. 2, an embodiment of the present invention further provides a method of assessing rewards in a conversation, the conversation including multiple rounds of conversation between two parties to the conversation, one of the parties to the conversation being a target agent and the other party being a user or opponent agent, the method including:
step S210, acquiring a current round of conversation in the multiple rounds of conversations participated by the target agent, wherein the target agent does not output any words in the current round of conversation;
step S220, predicting a conversation strategy to be adopted based on the current state of the target agent in the multiple rounds of conversations;
referring to what has been described in the previous embodiment, at the dialog turn t, the target agent states
Figure BDA0002139225640000091
Based on the current state s of the target agent t Deciding on the corresponding dialog strategy, i.e. taking the corresponding action a t
Step S230, calculating the reward of the dialog of the target agent including the estimated dialog strategy to be taken based on the method as described in any of the previous embodiments.
In one aspect, the behavior trajectory of the target agent's dialog, $\{(s_0, a_0), (s_1, a_1), \dots, (s_t, a_t)\}$, can be compared with the real behavior trajectories of human dialogs in the corresponding scene; the similarity between the two trajectories is then obtained, and the corresponding reward is derived from the similarity result.
Alternatively, the reward $r_\theta(s_t, a_t)$ may be calculated using the reward function of any of the previous embodiments.
In addition, in an embodiment, the agent may be configured to follow a stochastic policy $\pi$. The (state) value function of the agent estimates the expected return obtainable from the current state, defined as $V^\pi(s) = \mathbb{E}_\pi\big[\sum_t \gamma^t r_t \mid s_0 = s\big]$, and is used to evaluate the current state. The Q (state-action) value of a state-action pair $(s, a)$ is defined as $Q^\pi(s, a) = \mathbb{E}_\pi\big[\sum_t \gamma^t r_t \mid s_0 = s, a_0 = a\big]$, and estimates the expected return that can be obtained after taking action $a$ in state $s$. The only difference between the two is that in the former the action $a$ is drawn from the policy $\pi$, while in the latter the action $a$ is given. Therefore, by comparing the two, it can be determined whether taking action $a$ in state $s$ yields a higher expected return than before. Accordingly, in an embodiment of the present embodiment, the method further includes:
determining the advantage of the target agent adopting the predicted dialog strategy based on the reward of the dialog of the target agent that includes the predicted dialog strategy to be taken;
and taking the advantage of adopting the predicted dialog strategy as the reward of the target agent adopting that strategy in the current dialog.
From the above it is clear that directly optimizing the advantage, which does not change the expectation of the original return, is more stable and efficient than directly maximizing the return; accordingly, in one embodiment of the present embodiment, the reward function is constructed based on the policy advantage.
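In symbols, the advantage compares the two quantities just defined, $A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)$; a minimal sketch using a one-step estimate of Q from the estimated reward (one common choice, not something the text prescribes) is:

```python
def one_step_advantage(r_t, v_s_t, v_s_next, gamma=0.99, done=False):
    """A(s_t, a_t) ~ r_t + gamma * V(s_{t+1}) - V(s_t): how much better taking a_t was
    than the policy's average behavior from s_t (the TD residual as a one-step estimate)."""
    q_estimate = r_t + (0.0 if done else gamma * v_s_next)
    return q_estimate - v_s_t

print(one_step_advantage(r_t=0.5, v_s_t=1.0, v_s_next=1.2))  # 0.688
```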
In one embodiment of the present embodiment, the reward function is updated by a reward estimator to calculate the reward, in the manner of adversarial learning (AL): the reward estimator aims to distinguish real human dialogs from dialogs generated by the dialog strategy. To this end, it minimizes the KL distance to the real data while maximizing the KL distance to the policy distribution that is pre-trained on human dialogs by maximum likelihood estimation.
In addition, to recover an interpretable and robust reward from human dialogs, the reward estimator $f_\omega$ can be decomposed into a reward approximator $g_\omega$ and a reward shaping term $h_\omega$, specifically:

$$f_\omega(s_t, a_t, s_{t+1}) = g_\omega(s_t, a_t) + \gamma h_\omega(s_{t+1}) - h_\omega(s_t),$$

where the state-action pair $(s_t, a_t)$ is replaced by the state-action-state triple $(s_t, a_t, s_{t+1})$ as the input of the reward estimator.
In the present embodiment, according to the above decomposition, the reward estimator $f(s, a)$ is split into two networks $g(s, a)$ and $h(s)$, each a one-hidden-layer MLP (multi-layer perceptron).
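A minimal PyTorch sketch of this decomposition; the layer sizes and the logistic training signal are illustrative assumptions, the text only fixes the g/h split and the one-hidden-layer MLP structure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardEstimator(nn.Module):
    """f_w(s_t, a_t, s_{t+1}) = g_w(s_t, a_t) + gamma * h_w(s_{t+1}) - h_w(s_t)."""
    def __init__(self, state_dim, action_dim, hidden=64, gamma=0.99):
        super().__init__()
        self.gamma = gamma
        # g(s, a): reward approximator over the state-action pair (one hidden layer).
        self.g = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # h(s): shaping term over the state alone (one hidden layer).
        self.h = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s, a, s_next):
        g = self.g(torch.cat([s, a], dim=-1)).squeeze(-1)
        return g + self.gamma * self.h(s_next).squeeze(-1) - self.h(s).squeeze(-1)

def estimator_loss(f, human_batch, policy_batch):
    """Adversarial-style update: push f up on human turns, down on policy-generated turns."""
    s_h, a_h, sn_h = human_batch
    s_p, a_p, sn_p = policy_batch
    return -(F.logsigmoid(f(s_h, a_h, sn_h)).mean() + F.logsigmoid(-f(s_p, a_p, sn_p)).mean())
```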
Referring to fig. 3, an embodiment of the present invention further provides a dialog method, where the dialog includes multiple rounds of dialogs between two dialog parties, one of the two dialog parties is a target agent, and the other is a user or opponent agent, and the method includes:
tracking each turn of the multiple turns of conversations and generating a conversation state of a corresponding turn of the corresponding object;
in this embodiment, the State in the Dialog may be tracked based on a multi-domain Dialog State Tracker DST (Dialog State Tracker) at one Dialog action level, e.g., during a session, the DST records the Dialog action of one party and returns the State to the other party to decide what action to take next.
Determining at least one dialog strategy that the target agent can adopt based on the dialog state of the current turn of the target agent;
in this step, the dialog strategy may be determined by a preset dialog strategy selector that determines actions that the target agent can take based on the dialog state of the current turn of the target agent; wherein the dialog strategy selector is constructed based on a multi-layer perceptron.
The dialog strategy selector encourages the dialog strategy to imitate human dialog actions. It should be noted that human behavior is not copied directly; rather, the estimated reward is used to raise the probability that the strategy behaves like a human. According to the maximum entropy principle, the entropy-regularized expected return is maximized by minimizing the KL divergence between the policy distribution and the Boltzmann distribution $\exp(f_\omega(\tau))/Z$.
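In code, this often takes the form of an entropy-regularized objective; the sketch below is written under that assumption rather than taken from the text:

```python
import torch

def entropy_regularized_objective(f_values, log_probs):
    """Sample estimate of E_pi[f_w(s, a)] + H(pi) = E_pi[f_w(s, a) - log pi(a|s)].
    Maximizing this pushes pi toward the Boltzmann distribution exp(f_w)/Z, i.e. it
    minimizes the KL divergence between the policy and that distribution."""
    return (f_values - log_probs).mean()

f_values = torch.tensor([0.2, 0.7, -0.1])     # estimated rewards of sampled turns
log_probs = torch.tensor([-1.1, -0.3, -2.0])  # log pi(a|s) of the same sampled turns
print(entropy_regularized_objective(f_values, log_probs))
```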
In addition, in one embodiment of the present invention, the dialog strategy is optimized with the proximal policy optimization (PPO) method.
Respectively determining the reward of the target intelligent agent for adopting the corresponding conversation strategy based on the determined at least one conversation strategy and the conversation state of the current turn of the target intelligent agent;
selecting an optimal conversation strategy from at least one conversation strategy according to the reward and generating a corresponding utterance;
wherein the reward may be calculated by a method as described in any of the previous embodiments.
Next, a specific dialog training procedure is given to explain the above steps. First, the dialog strategy $\pi$ is pre-trained on the human dialogs $\mathcal{D}$ by maximum likelihood estimation (MLE); the pre-trained dialog strategy $\pi$ then interacts with the user simulator $\mu$ to serve as negative sampling and collect dialogs $\mathcal{D}_\Pi$. The human dialogs $\mathcal{D}$ and the negative-sample dialogs $\mathcal{D}_\Pi$ are then used to pre-train the reward estimator $f$, followed by iterative training:
1) Randomly sample human dialog sessions $\mathcal{D}_H$ from the data $\mathcal{D}$.
2) Execute the dialog strategy $\pi$ and interact with the user simulator $\mu$, $a^u \sim \mu(\cdot \mid s^u)$, $a \sim \pi(\cdot \mid s)$, to collect dialogs $\mathcal{D}_\Pi$.
3) Update the reward estimator $f$: optimize $\omega$ to maximize the reward-estimation objective $J_f$ in the manner of adversarial learning, so that the reward estimator maximizes the likelihood of the observed human dialog sessions and thereby infers the underlying goals.
4) Calculate the reward evaluation value of each state-action pair in $\mathcal{D}_\Pi$ with the reward estimator $f_\omega$.
5) Update the dialog strategy $\pi$ and the value function $V$: optimize $\theta$ to maximize $J_\pi$ and $J_V$, where

$$J_\pi(\theta) = \mathbb{E}_t\left[\min\left(\beta_t \hat{A}_t,\; \mathrm{clip}(\beta_t, 1-\epsilon, 1+\epsilon)\, \hat{A}_t\right)\right], \qquad J_V(\theta) = -\mathbb{E}_t\left[\big(V_\theta(s_t) - \hat{R}_t\big)^2\right],$$

$$\hat{A}_t = \sum_{l \ge 0} (\gamma\lambda)^l \delta_{t+l}, \qquad \delta_t = r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t).$$

In the above formulas, $V_\theta$ is the approximate value function, $\beta_t = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)$ is the probability ratio between the new policy and the old policy, $\hat{A}_t$ is the estimated advantage, $\hat{R}_t$ is the corresponding return target, $\delta$ is the TD residual term, and $\lambda$ (the GAE factor, which can be set to 0.95) and $\epsilon$ (the clipping factor, which can be set to 0.2) are hyper-parameters.
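A compact PyTorch sketch of step 5; only the clipped ratio objective, the value regression, and the GAE recursion come from the formulas above, everything else (tensor shapes, the single combined loss) is an illustrative assumption:

```python
import torch

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Backward GAE recursion: A_t = delta_t + gamma*lam*A_{t+1},
    delta_t = r_t + gamma*V(s_{t+1}) - V(s_t), with V(s_T) = 0 at the end of the session."""
    T = rewards.shape[0]
    adv = torch.zeros(T)
    next_adv, next_value = 0.0, 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]
        next_adv = delta + gamma * lam * next_adv
        adv[t] = next_adv
        next_value = values[t]
    return adv

def ppo_loss(log_probs_new, log_probs_old, values_new, values_old, advantages, eps=0.2):
    """Clipped surrogate J_pi (maximized) and value objective J_V (maximized), returned as one loss to minimize."""
    ratio = (log_probs_new - log_probs_old).exp()                  # beta_t
    returns = advantages + values_old                              # regression target for V_theta(s_t)
    j_pi = torch.min(ratio * advantages,
                     torch.clamp(ratio, 1 - eps, 1 + eps) * advantages).mean()
    j_v = -((values_new - returns) ** 2).mean()
    return -(j_pi + 0.5 * j_v)
```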
In addition, in the training phase of the dialog strategy selector and the reward estimator, multiple processes are run in parallel, and the data of all processes are finally merged to speed up training. Specifically, the strategy $\pi$ is distributed across N processes, each performing exploration and sampling; all collected data are merged, and the strategy is then optimized with the algorithm above, and this is repeated. In this way, training can be effectively accelerated and the variance introduced by the sampling process can be reduced.
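A minimal sketch of this parallel sampling; the rollout function and the use of multiprocessing.Pool are illustrative assumptions, any worker pool that merges per-process samples serves the same purpose:

```python
from multiprocessing import Pool
import random

def rollout(seed):
    """Hypothetical worker: run the current policy against the user simulator and return sampled turns."""
    random.seed(seed)
    # ... interact with the user simulator here; a real worker would return (state, action, reward) tuples ...
    return [("s", "a", random.random()) for _ in range(4)]

if __name__ == "__main__":
    n_processes = 8
    with Pool(n_processes) as pool:
        per_process = pool.map(rollout, range(n_processes))   # each process explores and samples
    merged = [step for traj in per_process for step in traj]  # merge all collected data
    print(len(merged))                                        # then run the policy optimization on `merged`
```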
With reference to fig. 4, an embodiment of the present invention further provides a conversational reward evaluation device 40, the conversation including multiple rounds of conversation between two parties of the conversation, one party being a target agent and the other party being a user or opponent agent, the device 40 including:
a conversation acquisition module 410 configured to acquire one or more of the multiple rounds of conversations in which the target agent participates;
a reward determination module 420 configured to determine a reward for the one or more rounds of dialog in which the target agent participates based on the acquired one or more rounds of dialog and the corresponding human dialog in the real scene.
In one embodiment of the present invention, the reward determination module 420 comprises:
a comparison unit configured to compare the acquired one or more turns of dialog with corresponding human dialogs in a real scene;
a reward determination unit configured to determine a reward for the one or more rounds of conversation in which the target agent participates based on the comparison result.
In one embodiment of the present invention, the comparison unit includes:
the behavior track acquisition subunit is configured to respectively acquire a behavior track of human conversation and a behavior track of intelligent agent conversation in corresponding scenes;
and the similarity result determining unit is configured to compare the behavior tracks between the two to determine a similarity result.
In one embodiment of the invention, the reward determination module 420 comprises:
a reward function determination unit configured to learn a reward function from a human conversation in a corresponding real scene;
a reward determination unit configured to determine a reward for the one or more rounds of dialog in which the target agent participates, in dependence on the one or more rounds of dialog and the reward function.
In an embodiment of the invention, the reward function determination unit is further configured to consider human dialogue behavior in the corresponding real scene as the current best strategy to learn the reward function.
In one embodiment of the invention, the reward function determination unit is further configured to learn the reward function based on a trajectory of human dialog behavior in a corresponding real scene.
In one embodiment of the invention, the reward function is learned for a training phase.
In one embodiment of the invention, the reward function determination unit is further configured to solve the parameter regarding the dialog strategy by maximum likelihood estimation to determine the reward function.
In one embodiment of the invention, the action track can represent the conversation strategy adopted by the corresponding object, and the corresponding conversation strategy adopted by any object in the conversation can achieve a specific intention.
In one embodiment of the invention, the behavior trace comprises at least one state-action pair of the object to which it belongs.
In an embodiment of the present invention, the state of the object is updated based on the acquired one or more rounds of conversations, and includes at least an opponent action in a current round of conversation and an action of the object in a previous round of conversation.
In one embodiment of the invention, the actions of the objects are representations of specific intents, including domains, intents, slot types, and slot values.
In one embodiment of the invention, the intent includes at least a notification and/or a query.
In one embodiment of the invention, the slot value is obtained by the agent from an external database.
In one embodiment of the invention, the state of the object also includes the confidence state of all slots and the embedded vector of the number of queries from the external database.
In one embodiment of the invention, the reward function further determines a dialog reward for the current round as input from the state-action pair of the current round of the subject and the state of the next round.
In one embodiment of the invention, the higher the similarity to the human behavior trajectory, the higher the reward the target agent receives.
With reference to fig. 5, an embodiment of the present invention further provides an apparatus 50 for assessing rewards in a conversation, the conversation including multiple rounds of conversation between two parties of the conversation, one party of the two parties of the conversation being a target agent and the other party being a user or opponent agent, the apparatus 50 comprising:
a dialog acquisition module 510 configured to acquire a current turn of dialog of the multiple turns of dialog in which the target agent participates, wherein the target agent does not output any utterance in the current turn of dialog;
a conversation strategy prediction module 520 configured to predict a conversation strategy to be taken based on a current state of the target agent in the plurality of conversations;
the reward for a conversation of the target agent including the estimated conversation strategy to be undertaken is calculated based on the means 40 as previously described.
In one embodiment of the invention, the reward is calculated based on a reward function as described previously.
In one embodiment of the present invention, the apparatus 50 further comprises:
a policy advantage determination module configured to determine the advantage of the target agent adopting the predicted strategy based on the reward of the dialog of the target agent that includes the predicted dialog strategy to be taken;
and a reward determination module configured to take the advantage of adopting the predicted dialog strategy as the reward of the target agent adopting that strategy in the current dialog.
In one embodiment of the present invention, the policy advantage determination module includes:
a stochastic policy acquisition unit configured to acquire a stochastic policy for the dialog of the target agent in the current state;
a unit configured to calculate, using the reward function, the reward of adopting the stochastic policy in the target agent's current state;
and a unit configured to determine the advantage of the target agent adopting the predicted strategy based on the rewards obtained by the target agent adopting the predicted strategy and the stochastic policy, respectively.
In one embodiment of the invention, the reward function is updated by a reward estimator to calculate the reward.
In one embodiment of the invention, the reward estimator updates the reward function by simultaneously minimizing the KL distance to the real data and maximizing the KL distance to the policy distribution that is pre-trained on human dialogs by maximum likelihood estimation.
Referring to fig. 6, an embodiment of the present invention further provides a dialog apparatus 60, where the dialog includes multiple rounds of dialogs between two parties of the dialog, one party of the two parties of the dialog being a target agent and the other party being a user or opponent agent, the apparatus 60 including:
a dialog tracking module 610 configured to track each of the multiple turns of dialog and generate a dialog state for a corresponding turn of the corresponding object;
a conversation policy sampling module 620 configured to determine at least one conversation policy that a target agent is capable of taking based on a conversation state of a current turn of the target agent;
a reward determination module 630 configured to determine rewards of the target agent for adopting corresponding conversation strategies based on the determined at least one conversation strategy and the conversation state of the current turn of the target agent, respectively;
a sentence generating module 640 configured to select an optimal conversation strategy from at least one conversation strategy according to the reward and generate a corresponding utterance;
wherein the reward is calculated by the device 40 or 50 as described above.
In one embodiment of the invention, the dialog policy is determined by a preset dialog policy selector that determines actions that the target agent is capable of taking based on the dialog state of the current turn of the target agent.
In one embodiment of the invention, the dialog strategy selector is built based on a multi-tier perceptron.
In one embodiment of the invention, the dialog strategy is optimized by a proximal policy optimization method.
In one embodiment of the invention, the conversation strategy selector and reward estimator are pre-trained based on real person conversation data.
In one embodiment of the invention, in the training phase of the dialog strategy selector and the reward estimator, multiple processes are run in parallel, and finally the data of all processes are integrated to accelerate training.
In addition, it is noted that the components of the above system may be configured by software, firmware, hardware or a combination thereof. The specific means or manner in which the configuration can be used is well known to those skilled in the art and will not be described further herein. When the software or firmware is implemented, a program constituting the software is installed from a storage medium or a network to a computer (for example, a general-purpose computer 700 shown in fig. 7) having a dedicated hardware configuration, and the computer can execute various functions and the like when various programs are installed.
FIG. 7 shows a schematic block diagram of a computer that may be used to implement methods and systems according to embodiments of the present invention.
In fig. 7, a Central Processing Unit (CPU) 701 performs various processes in accordance with a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 to a Random Access Memory (RAM) 703. In the RAM703, data necessary when the CPU701 executes various processes and the like is also stored as necessary. The CPU701, the ROM702, and the RAM703 are connected to each other via a bus 704. An input/output interface 705 is also connected to the bus 704.
The following components are connected to the input/output interface 705: an input section 706 (including a keyboard, a mouse, and the like), an output section 707 (including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker and the like), a storage section 708 (including a hard disk and the like), a communication section 709 (including a network interface card such as a LAN card, a modem, and the like). The communication section 709 performs communication processing via a network such as the internet. A driver 710 may also be connected to the input/output interface 705, as desired. A removable medium 711 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like can be mounted on the drive 710 as necessary, so that the computer program read out therefrom is mounted in the storage section 708 as necessary.
In the case where the above-described series of processes is realized by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 711.
It should be understood by those skilled in the art that such a storage medium is not limited to the removable medium 711 shown in fig. 7 in which the program is stored, distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 711 include a magnetic disk (including a flexible disk (registered trademark)), an optical disk (including a compact disk read only memory (CD-ROM) and a Digital Versatile Disk (DVD)), a magneto-optical disk (including a Mini Disk (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM702, a hard disk included in the storage section 708, or the like, in which programs are stored and which are distributed to users together with the apparatus including them.
The invention also provides a program product with machine readable instruction codes stored. The instruction codes are read by a machine and can execute the method according to the embodiment of the invention when being executed.
Accordingly, storage media carrying the above-described program product having machine-readable instruction code stored thereon are also within the scope of the present invention. Including, but not limited to, floppy disks, optical disks, magneto-optical disks, memory cards, memory sticks, and the like.
It should be noted that the method of the present invention is not limited to being performed in the chronological order described in the specification, and may be performed sequentially in other orders, in parallel, or independently. Therefore, the order of execution of the methods described in this specification does not limit the technical scope of the present invention.
The foregoing description of the various embodiments of the invention is provided for the purpose of illustration only and is not intended to be limiting of the invention. It should be noted that in the above description, features described and/or illustrated with respect to one embodiment may be used in the same or similar manner in one or more other embodiments, in combination with or instead of the features of the other embodiments. It will be understood by those skilled in the art that various changes and modifications may be made to the above-described embodiments without departing from the inventive concept of the present invention.
In summary, in the embodiments according to the present invention, the present invention provides the following technical solutions.
1. A dialog reward evaluation method, the dialog comprising multiple rounds of dialog between two dialog parties, one party being a target agent and the other party being a user or opponent agent, the method comprising:
obtaining one or more of the multiple sessions in which the target agent participates;
determining a reward for the one or more rounds of dialog in which the target agent participates based on the one or more rounds of dialog acquired and the corresponding human dialog in the real scene.
2. The method of claim 1, wherein determining the reward for the one or more conversations in which the target agent participates based on the one or more acquired conversations and the corresponding human conversation in the real scene comprises:
comparing the acquired one or more rounds of conversations with corresponding human conversations in a real scene;
determining a reward for the one or more conversations in which the target agent participates based on the comparison.
3. The method of claim 2, wherein comparing the one or more acquired conversations with a corresponding human conversation in a real scene comprises:
respectively acquiring a behavior track of human conversation and a behavior track of intelligent agent conversation in corresponding scenes;
and comparing the behavior tracks between the two to determine a similarity result.
4. The method of scheme 1, wherein determining the reward of the one or more conversations in which the target agent participates based on the acquired one or more conversations and the corresponding human conversation in the real scene comprises:
learning from human dialogue under corresponding real scene to obtain a reward function;
determining a reward for the one or more conversations in which the target agent participates based on the one or more conversations and the reward function.
5. The method according to scheme 4, wherein the human dialogue behavior in the corresponding real scene is regarded as the current best strategy to learn the reward function.
6. The method of claim 5, wherein the reward function is learned based on a trajectory of human dialog behavior in a corresponding real scene.
7. The method of claim 6, wherein the reward function is learned for a training phase.
8. The method of claim 7, wherein the parameters relating to the dialog strategy are solved by maximum likelihood estimation to determine the reward function.
9. The method of claim 3 or 8, wherein the action track can represent a dialog strategy taken by a corresponding object, and a specific intention can be reached by any object in the dialog taking the corresponding dialog strategy.
10. The method of claim 9, wherein the behavior trace comprises at least one state-action pair of the object to which it belongs.
11. The method according to claim 10, wherein the state of the object is updated based on the acquired one or more rounds of conversations, and includes at least an opponent action in a current round of conversation and an action of the object in a previous round of conversation.
12. The method of claim 11, wherein the action of the object is a representation of a particular intent, including a domain, an intent, a slot type, and a slot value.
13. The method of claim 12, wherein the intent comprises at least a notification and/or a query.
14. The method of claim 13, wherein the slot value is obtained by the agent from an external database.
15. The method of scheme 14, wherein the state of the object further includes a confidence state for all slots and an embedded vector of the number of queries from the external database for results.
16. The method of claim 15, wherein the reward function further determines a conversational reward for a current round as input from the state-action pair for the current round of the subject and the state of the next round.
17. The method of any of scenarios 10-16, wherein the higher the similarity to the human behavior trajectory, the higher the reward earned by the target agent.
18. A method of assessing rewards in a conversation, the conversation comprising multiple rounds of conversation between two parties of the conversation, one party being a target agent and the other party being a user or opponent agent, the method comprising:
obtaining a current round of conversations in the multiple rounds of conversations in which the target agent participates, wherein the target agent does not output any utterance in the current round of conversations;
predicting a conversation strategy to be adopted based on the current state of the target agent in the multiple rounds of conversations;
calculating the reward of the dialog of the target agent including the estimated dialog strategy to be taken based on the method according to any of the schemes 1-17.
19. The method of claim 18, wherein the reward is calculated based on a reward function according to any of claims 4-17.
20. The method of scheme 19, wherein the method further comprises:
determining the advantage of the target agent adopting the predicted dialog strategy based on the reward of the dialog of the target agent that includes the predicted dialog strategy to be taken;
and taking the advantage of adopting the predicted dialog strategy as the reward of the target agent adopting that strategy in the current dialog.
21. The method of claim 20, wherein determining the advantage of the target agent adopting the dialog strategy based on the reward of the dialog that includes the predicted dialog strategy to be taken by the target agent comprises:
acquiring a stochastic policy for the dialog of the target agent in the current state;
calculating, using the reward function according to any one of schemes 4 to 17, the reward of adopting the stochastic policy in the target agent's current state;
and determining the advantage of the target agent adopting the predicted strategy based on the rewards obtained by the target agent adopting the predicted strategy and the stochastic policy, respectively.
22. The method of claim 19, wherein the reward function is updated by a reward estimator to calculate the reward.
23. The method of claim 22, wherein the reward estimator updates the reward function by simultaneously minimizing the KL distance to the real data and maximizing the KL distance to the policy distribution that is pre-trained on human dialogs by maximum likelihood estimation.
24. A method of conversation, the conversation comprising multiple rounds of conversation between two parties of the conversation, one party being a target agent and the other party being a user or opponent agent, the method comprising:
tracking each turn of the multiple turns of conversations and generating a conversation state of a corresponding turn of the corresponding object;
determining at least one dialog strategy that the target agent can adopt based on the dialog state of the current turn of the target agent;
respectively determining the reward of the target agent for adopting the corresponding conversation strategy based on the determined at least one conversation strategy and the conversation state of the current turn of the target agent;
selecting an optimal conversation strategy from at least one conversation strategy according to the reward and generating a corresponding utterance;
wherein the reward is calculated by a method as described in any of schemes 1-17 or 18-23.
25. The method of claim 24, wherein the conversation policy is determined by a preset conversation policy selector that determines actions that the target agent can take based on the conversation state of the target agent's current turn.
26. The method of claim 25, wherein the dialog strategy selector is constructed based on a multi-tier perceptron.
27. The method of claim 25, wherein the dialog strategy is optimized using a proximal policy optimization method.
28. The method of claim 27, wherein the dialog strategy selector and the reward estimator are pre-trained based on human dialog data.
29. The method of claim 28, wherein during the training phase of the dialog strategy selector and reward estimator, multiple processes are run in parallel, and finally the data of all processes are integrated to speed up training.
30. A conversation incentive assessment device, said conversation comprising a plurality of rounds of conversation between two parties to the conversation, one party being a target agent and the other party being a user or opponent agent, said device comprising:
a conversation acquisition module configured to acquire one or more of the multiple rounds of conversations in which a target agent participates;
a reward determination module configured to determine a reward for the one or more rounds of dialog in which the target agent participates based on the acquired one or more rounds of dialog and the corresponding human dialog in the real scene.
31. The apparatus of claim 30, wherein the reward determination module comprises:
a comparison unit configured to compare the acquired one or more turns of dialog with corresponding human dialogs in a real scene;
a reward determination unit configured to determine a reward for the one or more rounds of conversation in which the target agent participates based on the comparison result.
32. The apparatus of claim 31, wherein the comparing unit comprises:
the behavior track acquisition subunit is configured to respectively acquire a behavior track of human conversation and a behavior track of intelligent agent conversation in corresponding scenes;
and the similarity result determining unit is configured to compare the behavior tracks between the two to determine a similarity result.
33. The apparatus of claim 30, wherein the reward determination module comprises:
a reward function determination unit configured to learn a reward function from a human conversation in a corresponding real scene;
a reward determination unit configured to determine a reward for the one or more rounds of dialog in which the target agent participates, in dependence on the one or more rounds of dialog and the reward function.
34. The apparatus of claim 33, wherein the reward function determination unit is further configured to consider human dialogue behavior in the corresponding real scene as a current best strategy to learn the reward function.
35. The apparatus of claim 34, wherein the reward function determination unit is further configured to learn the reward function based on a trajectory of human dialog behavior in a corresponding real scene.
36. The apparatus of claim 35, wherein the reward function is learned in a training phase.
37. The apparatus of claim 36, wherein the reward function determination unit is further configured to solve parameters regarding dialog strategies through maximum likelihood estimation to determine the reward function.
38. The apparatus of claim 32 or 37, wherein the behavior trace can represent a dialog strategy taken by a corresponding object, and a specific intention can be reached by any object in the dialog taking the corresponding dialog strategy.
39. The apparatus of claim 38, wherein the behavior trace comprises at least one state-action pair of an object to which it belongs.
40. The apparatus of claim 39, wherein the state of the object is updated based on the acquired one or more rounds of dialog and includes at least an opponent action in a current round of dialog and an action of the object in a previous round of dialog.
41. The apparatus of scheme 40, wherein the action of the object is a representation of a particular intent, including a domain, an intent, a slot type, and a slot value.
42. The apparatus of claim 41, wherein the intent comprises at least a notification and/or a query.
43. The apparatus of scheme 42, wherein the slot value is obtained by the agent from an external database.
44. The apparatus of scheme 43, wherein the state of the object further comprises a confidence (belief) state over all slots and an embedding vector of the number of results returned by querying the external database.
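Schemes 40-44 enumerate the ingredients of the object's state; one way such a state vector could be assembled is sketched below, where the field names, the one-hot encoding of the database result count and all dimensions are illustrative assumptions.

```python
import numpy as np

def build_state_vector(opponent_action_vec, own_last_action_vec,
                       belief_state_vec, num_db_results, max_db_bucket=6):
    """Concatenate the state ingredients of schemes 40-44 into a single vector."""
    # One-hot "embedding" of how many results the external database query returned.
    db_vec = np.zeros(max_db_bucket)
    db_vec[min(num_db_results, max_db_bucket - 1)] = 1.0

    return np.concatenate([
        opponent_action_vec,   # opponent's action in the current dialog turn
        own_last_action_vec,   # the object's own action in the previous turn
        belief_state_vec,      # confidence (belief) state over all slots
        db_vec,                # embedding of the number of database query results
    ])
```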
45. The apparatus of claim 44, wherein the reward function further takes as input the object's state-action pair of the current turn and the state of the next turn, and determines the dialog reward of the current turn.
46. The apparatus of any of schemes 39-45, wherein the higher the similarity to the human behavior trace, the higher the reward earned by the target agent.
47. An apparatus for assessing rewards in a conversation, the conversation comprising a plurality of rounds of conversation between two parties to the conversation, one party being a target agent and the other party being a user or opponent agent, the apparatus comprising:
a dialog acquisition module configured to acquire a current turn of dialog of the multiple turns of dialog in which a target agent participates, wherein the target agent does not output any utterance in the current turn of dialog;
a conversation strategy prediction module configured to predict a conversation strategy to be taken based on a current state of the target agent in the plurality of rounds of conversations;
and calculating, based on the apparatus according to any one of schemes 30-46, the reward of the dialog of the target agent that includes the estimated dialog strategy to be adopted.
48. The apparatus of claim 47 wherein the reward is calculated based on a reward function according to any of schemes 33-46.
49. The apparatus of scheme 48, wherein the apparatus further comprises:
a policy advantage determination module configured to determine an advantage of a target agent's adoption of an estimated policy based on a reward of a dialog of the target agent including the estimated dialog policy to be taken;
and the reward determining module is configured to take the advantage of the estimation dialogue strategy as the reward of the target intelligent agent for taking the estimation strategy in the current dialogue.
50. The apparatus of scheme 49, wherein the policy advantage determination module comprises:
the random strategy acquisition unit is configured to acquire a random strategy of the conversation of the target intelligent agent in the current state;
calculating, using the reward function according to any one of schemes 33-46, the reward of adopting the random strategy in the current state of the target agent;
and determining the advantage of the target agent adopting the estimated strategy based on the rewards obtained when the target agent adopts the estimated strategy and the random strategy, respectively.
51. The apparatus of claim 48, wherein the reward function is updated by a reward estimator to calculate the reward.
52. The apparatus of claim 51, wherein the reward estimator updates the reward function by simultaneously maximizing the KL divergence from the policy distribution pre-trained on human dialogs by maximum likelihood estimation and minimizing the KL divergence from the real data.
53. A conversation apparatus, the conversation comprising multiple rounds of conversation between two parties of the conversation, one party being a target agent and the other party being a user or opponent agent, the apparatus comprising:
a dialog tracking module configured to track each of the plurality of dialog turns and generate a dialog state for a corresponding turn of the corresponding object;
a conversation policy sampling module configured to determine at least one conversation policy that a target agent is capable of adopting based on a conversation state of a current turn of the target agent;
the reward determination module is configured to determine, based on the determined at least one conversation strategy and the conversation state of the current turn of the target agent, the reward of the target agent adopting each corresponding conversation strategy;
the sentence generation module is configured to select an optimal conversation strategy from at least one conversation strategy according to the reward and generate a corresponding utterance;
wherein the reward is calculated by a device according to any of schemes 30-46 or 47-52.
54. The apparatus of claim 53 wherein the conversation policy is determined by a preset conversation policy selector that determines actions that the target agent can take based on the conversation state of the target agent's current turn.
55. The apparatus of scheme 54, wherein the dialog strategy selector is constructed based on a multi-layer perceptron.
56. The apparatus of scheme 54, wherein the conversational strategy is optimized using a proximal policy optimization (PPO) method.
57. The apparatus of claim 56, wherein the conversation strategy selector and the reward estimator are pre-trained on real human dialog data.
58. The apparatus of claim 57, wherein during the training phase of the dialog strategy selector and reward estimator, multiple processes are performed in parallel, and finally the data of all processes are integrated to speed up training.
59. A computer-readable storage medium storing a computer program for performing the method of any one of the above schemes 1-17 and/or 18-23 and/or 24-29.
60. A computing device, the computing device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to perform the method of any of the above schemes 1-17 and/or 18-23 and/or 24-29.

Claims (44)

1. A dialog reward evaluation method, the dialog comprising multiple rounds of dialog between two parties of the dialog, one party being a target agent and the other party being a user or opponent agent, the method comprising:
obtaining one or more of the multiple sessions in which the target agent participates;
determining a reward for the one or more rounds of dialog in which the target agent participates based on the acquired one or more rounds of dialog and the corresponding human dialog in the real scene, including: learning a reward function from human dialogue under a corresponding real scene; determining a reward for the one or more rounds of dialogue in which the target agent participates based on the one or more rounds of dialogue and the reward function;
the human dialogue behavior under the corresponding real scene is taken as the current optimal strategy to learn the reward function; learning the reward function based on the human dialogue behavior track under the corresponding real scene;
solving the parameters regarding the dialog strategy by maximum likelihood estimation to determine the reward function:

θ* = argmax_θ E_{τ∼𝒟}[ log p_θ(τ) ],  p_θ(τ) ∝ exp( Σ_t γ^t r_θ(s_t, a_t) )

wherein 𝒟 is the set of human dialog segments, τ is a human dialog segment, p_θ(τ) involves the parameterized reward function r_θ(s_t, a_t), γ∈[0,1] is a discount factor that controls the weight on the reward, t is the dialog turn, s is the state, and a is the action;

the dialog-level reward is calculated based on the maximum entropy IRL algorithm, which maximizes the likelihood of the observed human dialog segments in order to infer the potential goal,

max_ω E_{τ∼𝒟}[ log p_ω(τ) ],  with p_ω(τ) = (1/Z) exp( R(τ) )

where ω represents the potential intent of the dialog, the human dialog is thereby modeled as a Boltzmann distribution, p_ω(τ) is the probability of the human dialog segment τ under the parameter θ with ω as the potential dialog intent, R(τ) denotes the return of a session, i.e. the γ-discounted cumulative reward, and Z is the corresponding partition function that normalizes the probabilities;

by replacing the Boltzmann distribution in IRL with an estimate of the true data density in adversarial learning (AL), a strong connection between GAN and maximum entropy causal IRL is obtained:

D(τ) = exp( f(τ) ) / ( exp( f(τ) ) + π(τ) )

in the above formula, D is the discriminator, f(τ) = R(τ) − log Z, R(τ) is the discounted cumulative reward, π(τ) is the probability that the policy in RL, or the generator in GAN, generates the trajectory τ, and Z is the corresponding partition function;
the behavior track can represent a conversation strategy taken by a corresponding object, and any object in the conversation can achieve a specific intention by taking the corresponding conversation strategy;
the behavior track at least comprises a state-action pair of an object to which the behavior track belongs;
the state of the object is updated based on the acquired one or more rounds of conversations, and at least comprises an opponent action in the current round of conversation and an action of the object in the previous round of conversation;
the actions of the objects are representations of specific intents, including domains, intents, slot types, and slot values.
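The formulas recited in claim 1 can be made concrete with a small numerical sketch that only illustrates the relation D(τ) = exp(f(τ)) / (exp(f(τ)) + π(τ)) with f(τ) = R(τ) − log Z; the per-turn rewards, the policy probability and the partition-function constant used below are toy values, not part of the claimed method.

```python
import math

def session_return(turn_rewards, gamma=0.99):
    """Discounted cumulative reward R(tau) of one dialog session."""
    return sum((gamma ** t) * r for t, r in enumerate(turn_rewards))

def discriminator(f_tau, pi_tau):
    """D(tau) = exp(f(tau)) / (exp(f(tau)) + pi(tau))."""
    return math.exp(f_tau) / (math.exp(f_tau) + pi_tau)

# Toy example: per-turn rewards of one dialog segment and the probability that
# the current policy (the generator in the GAN view) produces that segment.
log_Z = 0.0                                   # partition-function constant (assumed)
R = session_return([0.2, 0.5, 1.0])           # discounted session return R(tau)
print(discriminator(R - log_Z, pi_tau=0.05))  # discriminator output for this segment
```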
2. The method of claim 1, wherein determining the reward for the one or more rounds of dialog in which the targeted agent participates based on the one or more rounds of dialog acquired and the corresponding human dialog in the real scene comprises:
comparing the acquired one or more rounds of conversations with corresponding human conversations in a real scene;
determining a reward for the one or more conversations in which the target agent participates based on the comparison.
3. The method of claim 2, wherein comparing the one or more acquired rounds of dialog with the corresponding human dialog in the real scene comprises:
respectively acquiring a behavior track of human conversation and a behavior track of intelligent agent conversation in corresponding scenes; and comparing the behavior tracks between the two to determine a similarity result.
4. The method of claim 1, wherein the reward function is learned in a training phase.
5. The method of claim 1, wherein the intent comprises at least a notification and/or a query.
6. The method of claim 5, wherein the slot value is obtained by the agent from an external database.
7. The method of claim 6, wherein the state of the object further comprises a confidence (belief) state over all slots and an embedding vector of the number of results returned by querying the external database.
8. The method of claim 7, wherein the reward function further takes as input the object's state-action pair of the current turn and the state of the next turn, and determines the dialog reward of the current turn.
9. The method of any one of claims 5-8, wherein the higher the similarity to a human behavior trace, the higher the reward the target agent receives.
10. A method of assessing rewards in a conversation, the conversation comprising multiple rounds of conversation between two parties of the conversation, one party being a target agent and the other party being a user or opponent agent, the method comprising:
obtaining a current turn of the multiple turns of conversations in which the target agent participates, wherein the target agent has not output any utterance in the current turn of conversations;
predicting a conversation strategy to be adopted based on the current state of the target agent in the multiple rounds of conversations;
and calculating, based on the method according to any one of claims 1-9, the reward of the dialog of the target agent that includes the estimated dialog strategy to be adopted.
11. A method according to claim 10, wherein the reward is calculated based on a reward function according to any of claims 1-9.
12. The method of claim 11, wherein the method further comprises:
determining an advantage of the target agent adopting the estimated policy based on the reward of the dialog of the target agent including the estimated dialog policy to be adopted;
and taking the advantage of adopting the estimation conversation strategy as the reward of adopting the estimation strategy by the target intelligent agent in the current conversation.
13. The method of claim 12, wherein determining the advantage of the target agent adopting the estimated dialog strategy based on the reward of the dialog that includes the estimated dialog strategy to be adopted by the target agent comprises:
acquiring a random strategy for the dialog of the target agent in the current state;
calculating, using the reward function according to any one of claims 1 to 9, the reward of adopting the random strategy in the current state of the target agent;
and determining the advantage of the target agent adopting the estimated strategy based on the rewards obtained when the target agent adopts the estimated strategy and the random strategy, respectively.
14. The method of claim 11, wherein the reward function is updated by a reward estimator to calculate the reward.
15. The method of claim 14, wherein the reward estimator updates the reward function by simultaneously maximizing the KL divergence from the policy distribution pre-trained on human dialogs by maximum likelihood estimation and minimizing the KL divergence from the real data.
16. A method of conversation, the conversation comprising multiple rounds of conversation between two parties of the conversation, one party being a target agent and the other party being a user or opponent agent, the method comprising:
tracking each turn of the multiple turns of conversations and generating a conversation state of a corresponding turn of the corresponding object; determining at least one dialog strategy that the target agent can adopt based on the dialog state of the current turn of the target agent;
respectively determining the reward of the target intelligent agent for adopting the corresponding conversation strategy based on the determined at least one conversation strategy and the conversation state of the current turn of the target intelligent agent;
selecting an optimal conversation strategy from at least one conversation strategy according to the reward and generating a corresponding utterance;
wherein the reward is calculated by a method according to any one of claims 1-9 or 10-15.
17. The method of claim 16, wherein the conversation policy is determined by a preset conversation policy selector that determines actions that a target agent can take based on the conversation state of the target agent's current turn.
18. The method of claim 17, wherein the dialog strategy selector is constructed based on a multi-layer perceptron.
19. The method of claim 17, wherein the conversational strategy is optimized using a proximal policy optimization (PPO) method.
20. The method of claim 19, wherein the conversation strategy selector and the reward estimator are pre-trained on real human dialog data.
21. The method of claim 20, wherein during the training phase of the dialog strategy selector and reward estimator, multiple processes are performed in parallel, and finally the data of all processes are integrated to speed up training.
22. A dialog reward evaluation apparatus, the dialog comprising multiple rounds of dialog between two parties of the dialog, one party being a target agent and the other party being a user or opponent agent, the apparatus comprising:
a conversation acquisition module configured to acquire one or more of the multiple rounds of conversations in which the target agent participates;
a reward determination module configured to determine a reward for the one or more rounds of dialog in which the target agent participates based on the acquired one or more rounds of dialog and the corresponding human dialog in the real scene;
the reward determination module includes:
a reward function determination unit configured to learn a reward function from a human conversation in a corresponding real scene;
a reward determination unit configured to determine a reward of the one or more rounds of dialog in which the target agent participates, in dependence on the one or more rounds of dialog and the reward function;
wherein the reward function determination unit is further configured to regard human dialogue behavior in the corresponding real scene as a current optimal strategy to learn the reward function;
the reward function determination unit is further configured to learn the reward function based on a trajectory of human dialog behavior in a corresponding real scene;
the reward function determination unit is further configured to solve the parameters regarding the dialog strategy through maximum likelihood estimation to determine the reward function:

θ* = argmax_θ E_{τ∼𝒟}[ log p_θ(τ) ],  p_θ(τ) ∝ exp( Σ_t γ^t r_θ(s_t, a_t) )

wherein 𝒟 is the set of human dialog segments, τ is a human dialog segment, p_θ(τ) involves the parameterized reward function r_θ(s_t, a_t), γ∈[0,1] is a discount factor that controls the weight on the reward, t is the dialog turn, s is the state, and a is the action;

the dialog-level reward is calculated based on the maximum entropy IRL algorithm, which maximizes the likelihood of the observed human dialog segments in order to infer the potential goal,

max_ω E_{τ∼𝒟}[ log p_ω(τ) ],  with p_ω(τ) = (1/Z) exp( R(τ) )

where ω represents the potential intent of the dialog, the human dialog is thereby modeled as a Boltzmann distribution, p_ω(τ) is the probability of the human dialog segment τ under the parameter θ with ω as the potential dialog intent, R(τ) denotes the return of a session, i.e. the γ-discounted cumulative reward, and Z is the corresponding partition function that normalizes the probabilities;

by replacing the Boltzmann distribution in IRL with an estimate of the true data density in adversarial learning (AL), a strong connection between GAN and maximum entropy causal IRL is obtained:

D(τ) = exp( f(τ) ) / ( exp( f(τ) ) + π(τ) )

in the above formula, D is the discriminator, f(τ) = R(τ) − log Z, R(τ) is the discounted cumulative reward, π(τ) is the probability that the policy in RL, or the generator in GAN, generates the trajectory τ, and Z is the corresponding partition function;
the behavior track can represent a conversation strategy taken by a corresponding object, and any object in the conversation can achieve a specific intention by taking the corresponding conversation strategy;
the behavior track at least comprises a state-action pair of an object to which the behavior track belongs;
the state of the object is updated based on the acquired one or more rounds of conversations, and at least comprises an opponent action in the current round of conversation and an action of the object in the previous round of conversation;
the actions of the objects are representations of specific intents, including domains, intents, slot types, and slot values.
23. The apparatus of claim 22, wherein the reward determination module comprises:
a comparison unit configured to compare the acquired one or more turns of dialog with corresponding human dialogs in a real scene;
a reward determination unit configured to determine a reward of the one or more rounds of dialogue in which the target agent participates, based on the comparison result.
24. The apparatus of claim 23, wherein the comparing unit comprises:
the behavior track acquisition subunit is configured to respectively acquire a behavior track of human conversation and a behavior track of intelligent agent conversation in corresponding scenes;
and the similarity result determining unit is configured to compare the behavior tracks between the two to determine a similarity result.
25. The apparatus of claim 22, wherein the reward function is learned in a training phase.
26. The apparatus of claim 22, wherein the intent comprises at least a notification and/or a query.
27. The apparatus of claim 26, wherein the slot value is obtained by an agent from an external database.
28. The apparatus of claim 27, wherein the state of the object further comprises a confidence (belief) state over all slots and an embedding vector of the number of results returned by querying the external database.
29. The apparatus of claim 28, wherein the reward function further takes as input the object's state-action pair of the current turn and the state of the next turn, and determines the dialog reward of the current turn.
30. The apparatus of any of claims 26-29, wherein the higher the similarity to the human behavior trace, the higher the reward the target agent receives.
31. An apparatus for assessing rewards in a conversation, the conversation comprising a plurality of rounds of conversation between two parties to the conversation, one party being a target agent and the other party being a user or opponent agent, the apparatus comprising:
a dialog acquisition module configured to acquire a current turn of dialog of the multiple turns of dialog in which the target agent participates, wherein the target agent does not output any utterance in the current turn of dialog;
a conversation strategy estimation module configured to estimate a conversation strategy to be adopted based on the current state of the target agent in the multiple rounds of conversations;
and calculating, based on the apparatus according to any one of claims 22-30, the reward of the dialog of the target agent that includes the estimated dialog strategy to be adopted.
32. The apparatus of claim 31, wherein the reward is calculated based on a reward function of any of claims 22-30.
33. The apparatus of claim 32, wherein the apparatus further comprises:
a policy advantage determination module configured to determine an advantage of the target agent's adoption of the pre-estimated policy based on a reward of a dialog of the target agent including the pre-estimated dialog policy to be taken;
and the reward determining module is configured to take the advantage of the estimation dialogue strategy as the reward of the target intelligent agent for taking the estimation strategy in the current dialogue.
34. The apparatus of claim 33, wherein the policy advantage determination module comprises:
a random strategy acquisition unit configured to acquire a random strategy for the dialog of the target agent in the current state;
calculating, using the reward function according to any one of claims 22-30, the reward of adopting the random strategy in the current state of the target agent;
and determining the advantage of the target agent adopting the estimated strategy based on the rewards obtained when the target agent adopts the estimated strategy and the random strategy, respectively.
35. The apparatus of claim 32, wherein the reward function is updated by a reward estimator to calculate the reward.
36. The apparatus of claim 35, wherein the reward estimator updates the reward function by simultaneously maximizing the KL divergence from the policy distribution pre-trained on human dialogs by maximum likelihood estimation and minimizing the KL divergence from the real data.
37. A conversation apparatus, the conversation comprising multiple rounds of conversation between two parties of the conversation, one party being a target agent and the other party being a user or opponent agent, the apparatus comprising:
a dialog tracking module configured to track each of the plurality of turns of dialog and generate a dialog state for a corresponding turn of the corresponding object;
a conversation policy sampling module configured to determine at least one conversation policy that a target agent is capable of adopting based on a conversation state of a current turn of the target agent;
a reward determination module configured to determine, based on the determined at least one dialog strategy and the dialog state of the current turn of the target agent, the reward of the target agent adopting each corresponding dialog strategy;
and a sentence generation module configured to select an optimal dialog strategy from the at least one dialog strategy according to the reward and generate a corresponding utterance;
wherein the reward is calculated by an apparatus according to any of claims 22-30 or 31-36.
38. The apparatus of claim 37 wherein the conversation policy is determined by a preset conversation policy selector that determines actions that a target agent can take based on the conversation state of the target agent's current turn.
39. The apparatus of claim 38, wherein the dialog strategy selector is constructed based on a multi-layer perceptron.
40. The apparatus of claim 38, wherein the conversational strategy is optimized using a proximal policy optimization (PPO) method.
41. The apparatus of claim 40, wherein the conversation strategy selector and the reward estimator are pre-trained on real human dialog data.
42. The apparatus of claim 41, wherein in the training phase of the dialog strategy selector and reward estimator, a plurality of processes are performed in parallel, and finally data of all processes are integrated to speed up training.
43. A computer-readable storage medium, the storage medium storing a computer program for performing the method of any of the preceding claims 1-9 and/or 10-15 and/or 16-21.
44. A computing device, the computing device comprising: a processor;
a memory for storing the processor-executable instructions;
the processor configured to perform the method of any of the preceding claims 1-9 and/or 10-15 and/or 16-21.
CN201910663167.9A 2019-07-22 2019-07-22 Conversational (in) reward evaluation and conversational methods, media, apparatuses, and computing devices Active CN110413754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910663167.9A CN110413754B (en) 2019-07-22 2019-07-22 Conversational (in) reward evaluation and conversational methods, media, apparatuses, and computing devices

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910663167.9A CN110413754B (en) 2019-07-22 2019-07-22 Conversational (in) reward evaluation and conversational methods, media, apparatuses, and computing devices

Publications (2)

Publication Number Publication Date
CN110413754A CN110413754A (en) 2019-11-05
CN110413754B true CN110413754B (en) 2023-01-13

Family

ID=68362445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910663167.9A Active CN110413754B (en) 2019-07-22 2019-07-22 Conversational (in) reward evaluation and conversational methods, media, apparatuses, and computing devices

Country Status (1)

Country Link
CN (1) CN110413754B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866101B (en) * 2019-11-07 2022-11-01 昆明理工大学 Dialogue generation method based on near-end strategy optimization and counterstudy
CN111813904A (en) * 2020-05-28 2020-10-23 平安科技(深圳)有限公司 Multi-turn conversation management method and device and computer equipment
CN111753076B (en) * 2020-08-12 2022-08-26 腾讯科技(深圳)有限公司 Dialogue method, dialogue device, electronic equipment and readable storage medium
CN113535911B (en) * 2020-12-03 2024-04-12 腾讯科技(深圳)有限公司 Reward model processing method, electronic device, medium and computer program product
CN112507104B (en) * 2020-12-18 2022-07-22 北京百度网讯科技有限公司 Dialog system acquisition method, apparatus, storage medium and computer program product
CN112800192B (en) * 2021-01-14 2022-02-08 云从科技集团股份有限公司 Multi-turn dialog method, system, medium, and apparatus
CN113220858B (en) * 2021-05-31 2023-10-27 平安科技(深圳)有限公司 Dialogue system updating method, device, computer equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462024B (en) * 2014-10-29 2018-07-13 百度在线网络技术(北京)有限公司 The method and apparatus for generating dialogue action policy model
US11663409B2 (en) * 2015-01-23 2023-05-30 Conversica, Inc. Systems and methods for training machine learning models using active learning
CN106844627B (en) * 2017-01-20 2020-06-19 竹间智能科技(上海)有限公司 Online learning method and device based on dialog system
CN108962221B (en) * 2018-07-12 2020-08-04 苏州思必驰信息科技有限公司 Optimization method and system of online dialog state tracking model

Also Published As

Publication number Publication date
CN110413754A (en) 2019-11-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant