CN117194640A - User simulator construction method based on a generative adversarial network - Google Patents

User simulator construction method based on a generative adversarial network

Info

Publication number
CN117194640A
Authority
CN
China
Prior art keywords
generator, evaluation, generated, discriminator, training
Prior art date
Legal status
Pending
Application number
CN202311221966.3A
Other languages
Chinese (zh)
Inventor
孙鹏飞
戴新宇
常晶
江秀
Current Assignee
Nanjing University
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Nanjing University
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date: 2023-09-21
Filing date: 2023-09-21
Publication date: 2023-12-08
Application filed by Nanjing University and Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202311221966.3A
Publication of CN117194640A

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a user simulator construction method based on a generative adversarial network, which is applied to a dialogue system, the user simulator comprising the generative adversarial network. The method comprises the following steps: acquiring dialogue content between a user and a dialogue robot from the dialogue system; inputting the dialogue content into the generative adversarial network for training to obtain a trained generator and an evaluation discriminator, wherein the generator is used for generating replies based on the historical dialogue of the user, and the evaluation discriminator is used for distinguishing real replies from generated replies and evaluating the quality of the generated replies. With this method, the replies generated by the generator are more realistic, and the evaluation discriminator can evaluate the quality of the replies generated by the generator, so as to diagnose the dialogue system and realize dynamic evaluation of the dialogue system.

Description

User simulator construction method based on a generative adversarial network
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a user simulator construction method based on a generative adversarial network.
Background
User simulators can be divided into two main categories depending on the technology used: rule-based methods and model-based learning methods.
(1) The most widely used rule-based method is the agenda-based method. For example, documents [1,2] propose building agenda-based user simulators, which are typically used to build task-oriented dialogue systems.
(2) Model-learning-based methods can be divided into statistical-modeling-based methods, end-to-end supervised-learning-based methods, and joint-policy-optimization-based methods. For statistical-modeling-based methods, document [3] proposed a user simulator in 1997 that treats the user simulator as a bi-gram model. The subsequently proposed Levin model [4], Scheffler model [5], and Pietquin model [6] all make certain improvements on the bi-gram model so that the generation of user actions is constrained by the user goal. More recently, researchers have proposed building end-to-end supervised-learning-based methods in a data-driven manner. For example, document [7] proposes a dialogue-act-level seq2seq user simulation model that takes the dialogue context into account. Furthermore, in order to increase the efficiency of task-oriented dialogue systems, researchers have tried to train the simulator together with the target dialogue system using joint policy optimization, which can be regarded as a multi-agent approach. For example, document [8] proposes first training the dialogue system and the simulator on a dialogue corpus by supervised learning, and then fine-tuning both models by reinforcement learning.
Although the above methods achieved good results at the time, they face the following problems:
(1) For rule-based approaches, the advantage is that they allow cold starts and the user behavior can be controlled. However, for more complex tasks it is not feasible to define an explicit agenda structure, and the utterances lack the linguistic variation and flexibility of human dialogue, which may lead to sub-optimal performance in practical applications.
(2) For statistics-based methods, although the model parameters are simpler, there is a significant drawback: the model only attends to the last system action, and if the user's goal changes the resulting behavior may not be logical, so the model cannot produce consistent user behavior.
(3) For end-to-end supervised-learning-based methods, the benefit is that they do not require extensive feature engineering, but they typically require a large amount of labeled data to generalize well and to handle user states not included in the training data.
(4) For joint-policy-optimization-based methods, the process typically requires a large amount of interaction between the system and the user; however, obtaining real human users to interact with the system is time-consuming and laborious.
References
[1] Jost Schatzmann and Steve J. Young. "The Hidden Agenda User Simulation Model". In: IEEE Trans. Speech Audio Process. 17.4 (2009), pp. 733–747. doi: 10.1109/TASL.2008.2012071.
[2] Jost Schatzmann et al. "Agenda-Based User Simulation for Bootstrapping a POMDP Dialogue System". In: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Proceedings, April 22-27, 2007, Rochester, New York, USA. Ed. by Candace L. Sidner et al. The Association for Computational Linguistics, 2007, pp. 149–152. url: https://aclanthology.org/N07-2038/.
[3] Wieland Eckert, Esther Levin, and Roberto Pieraccini. "User modeling for spoken dialogue system evaluation". In: 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings. IEEE, 1997, pp. 80–87.
[4] Esther Levin, Roberto Pieraccini, and Wieland Eckert. "A stochastic model of human-machine interaction for learning dialog strategies". In: IEEE Trans. Speech Audio Process. 8.1 (2000), pp. 11–23. doi: 10.1109/89.817450.
[5] Konrad Scheffler and Steve Young. "Corpus-based dialogue simulation for automatic strategy learning and evaluation". In: Proc. NAACL Workshop on Adaptation in Dialogue Systems. 2001, pp. 64–70.
[6] Olivier Pietquin. A framework for unsupervised learning of dialogue strategies. Presses univ. de Louvain, 2005.
[7] Layla El Asri, Jing He, and Kaheer Suleman. "A Sequence-to-Sequence Model for User Simulation in Spoken Dialogue Systems". In: Interspeech 2016, 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, September 8-12, 2016. Ed. by Nelson Morgan. ISCA, 2016, pp. 1151–1155. doi: 10.21437/Interspeech.2016-1175.
[8] Bing Liu and Ian R. Lane. "Iterative policy learning in end-to-end trainable task-oriented neural dialog models". In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2017, Okinawa, Japan, December 16-20, 2017. IEEE, 2017, pp. 482–489. doi: 10.1109/ASRU.2017.8268975.
[9] Yizhe Zhang, Siqi Sun, Michel Galley, et al. "DialoGPT: Large-scale generative pre-training for conversational response generation". arXiv preprint arXiv:1911.00536, 2019.
[10] Daniel Adiwardana, Minh-Thang Luong, David R. So, et al. "Towards a human-like open-domain chatbot". arXiv preprint arXiv:2001.09977, 2020.
Abbreviations:
ABUS: Agenda-Based User Simulator, a user simulator based on an agenda
bi-gram: bigram, a two-element (word-pair) language model
seq2seq: Sequence to Sequence, a sequence-to-sequence model
Disclosure of Invention
Purpose of the application: the application aims to address the defects of the prior art by providing a user simulator construction method based on a generative adversarial network.
In order to solve this technical problem, the application discloses a user simulator construction method based on a generative adversarial network, which is applied to a dialogue system, wherein the user simulator comprises the generative adversarial network and the method comprises the following steps:
acquiring dialogue content between a user and a dialogue robot from a dialogue system;
inputting the dialogue content into the generative adversarial network for training to obtain a trained generator and an evaluation discriminator, wherein the generator is used for generating replies based on the historical dialogue of the user, and the evaluation discriminator is used for distinguishing real replies from generated replies and evaluating the quality of the generated replies.
Further, generating a reply based on the historical dialogue of the user comprises: concatenating all of the historical dialogue of the current sentence into a single text, inputting the text into the generator, and outputting the generated target response from the generator; the generator is denoted G_θ, the context in the historical dialogue is denoted X, the corresponding real target response is denoted Y, and the generated target response is denoted G_θ(X);
The evaluation discriminator being configured to distinguish real replies from generated replies and to evaluate the quality of the generated reply comprises: inputting the context X in the historical dialogue, the corresponding real target response Y, and the generated target response G_θ(X) into the evaluation discriminator; the evaluation discriminator judges whether there is a next-sentence relationship between the context X and the real target response Y, and between the context X and the generated target response G_θ(X), so as to discriminate real target responses from generated ones and to evaluate the quality of the generated target response G_θ(X).
Further, the generator employs an autoregressive pre-trained language model; the evaluation discriminator employs a language-model-based discriminator built on a pre-trained language model, denoted D_φ.
Further, the evaluation discriminator D_φ converts the evaluation task of the pre-trained language model into a two-class natural language inference task, i.e., formula (1):
wherein n_k ∈ {1, 0}: 1 indicates that Y is the next sentence of X; 0 indicates that Y is not the next sentence of X; S(·) denotes a similarity calculation function.
Further, the evaluation discriminator D_φ is trained by maximizing the return of real samples and minimizing the return of generated samples. Minimizing the return of generated samples means that low-quality samples generated by the generator G_θ can be identified by the evaluation discriminator D_φ and obtain a lower return; maximizing the return of real samples means that high-quality samples generated by the generator G_θ can obtain high rewards and low-quality samples are penalized, so as to encourage the evaluation discriminator D_φ to give high rewards to text that looks like real-world data.
The loss function L_D of the evaluation discriminator D_φ is as follows:
wherein pdata denotes the real data distribution.
Further, the generator G_θ is trained using the proximal policy optimization (PPO, Proximal Policy Optimization) algorithm.
Further, the evaluation discriminator D_φ evaluating the quality of the generated target response G_θ(X) comprises:
when D_φ(n_k | X, Y) < D_φ(n_k | X, G_θ(X)), the generated target response G_θ(X) is a high-quality sample;
when D_φ(n_k | X, Y) ≥ D_φ(n_k | X, G_θ(X)), the generated target response G_θ(X) is a low-quality sample.
Further, the generator G_θ uses the output of the evaluation discriminator D_φ as the source of the reward-and-penalty signal, the reward-and-penalty value R_θ,φ being as follows:
further, the generator G θ Gradient of the objective function of (2)Expressed as:
wherein, rewarding and punishing value R θ,φ Is a control update item, remembers the true target response y= (Y) 0 ,y 1 ,…,y i ,…,y n ) N represents the length of the target response, n is greater than or equal to 1, y i Representing a word in the target response Y, 0.ltoreq.i.ltoreq.n.
Further, inputting the dialogue content into the generative adversarial network for training to obtain the trained generator and evaluation discriminator comprises: pre-training the generator G_θ using maximum likelihood estimation (MLE, Maximum Likelihood Estimation) as the loss function, and alternately training the evaluation discriminator D_φ and the generator G_θ until training stops; alternately training the evaluation discriminator D_φ and the generator G_θ comprises:
calculating the reward-and-penalty value R_θ,φ for the dialogue content in the form of (X, Y) pairs using formula (4);
updating the evaluation discriminator D_φ with formula (2);
updating the generator G_θ with formula (5) using the proximal policy optimization algorithm.
Beneficial effects: the application provides a user simulator construction method based on a generative adversarial network, which effectively uses the idea of generative adversarial networks to construct a user simulator. When the evaluation discriminator recognizes that the quality of a generated sample is higher than that of the real sample, the reward signal is masked. With this masked reward mechanism, adversarial training is more stable and the replies generated by the generator are more realistic. The evaluation discriminator can also evaluate the quality of the replies generated by the generator, so as to diagnose the dialogue system and realize dynamic evaluation of the dialogue system.
Drawings
The foregoing and/or other advantages of the application will become more apparent from the following detailed description of the application when taken in conjunction with the accompanying drawings.
Fig. 1 is a schematic flow chart of the user simulator construction method based on a generative adversarial network according to an embodiment of the present application.
Fig. 2 is a schematic diagram of the generator architecture in the user simulator construction method based on a generative adversarial network according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described below with reference to the accompanying drawings.
A dialogue system is a complex system; interacting with it purely through human effort is unreasonable and unrealistic, yet high-quality human-machine interaction data is crucial to a dialogue system. Therefore, exploring a user simulator that imitates real users to generate a large amount of interaction data helps to diagnose the dialogue system and realizes dynamic evaluation of the dialogue system, which is a valuable research direction. The user simulator construction method based on a generative adversarial network provided by this embodiment can be applied to a chat dialogue system; the user simulator comprises a generative adversarial network, and the construction method, shown in Fig. 1, comprises the following steps:
acquiring dialogue content between a user and a dialogue robot from a dialogue system;
inputting the dialogue content into the generative adversarial network for training to obtain a trained generator and an evaluation discriminator, wherein the generator is used for generating replies based on the historical dialogue of the user, and the evaluation discriminator is used for distinguishing real replies from generated replies and evaluating the quality of the generated replies.
1 Generator
The goal of the generator is to learn the data distribution and then generate, based on the user's historical dialogue, replies that are as realistic as possible, so that the evaluation discriminator cannot distinguish real replies from generated ones. In this embodiment, in order to enable the generator to support multi-turn dialogue and to generate semantically consistent dialogue content during the interaction, all of the historical dialogue of the current sentence is directly concatenated into a single text, which is appended to the sequence as the input of the generator. In addition, since the length of the dialogue history is variable and the required output length also varies freely, this embodiment employs an autoregressive pre-trained language model, i.e., a unidirectional language model, as the generator G_θ. Fig. 2 illustrates the generator architecture based on a unidirectional language model; GPT-2 (Generative Pre-trained Transformer 2) is used as the generator G_θ.
Specifically, we define X and Y to represent the dialogue context and its target response, respectively. For the context X, whether the dialogue is single-turn or multi-turn, the turns are joined with the separator </s> and treated as one complete token sequence, i.e., X = (x_0, x_1, x_2, …, x_m), where m denotes the length of the context, m ≥ 0, and x_i denotes a word in the context X. For the target response Y, we define Y = (y_0, y_1, y_2, …, y_n) as the token sequence of the target response, where n denotes the length of the target response, n ≥ 1, and y_i denotes a word in the target response Y.
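For illustration only, the following minimal sketch shows how the dialogue history can be joined with the </s> separator and fed to an autoregressive GPT-2 generator. It assumes the Hugging Face transformers interface; the checkpoint name ("gpt2") and the sampling settings are illustrative assumptions rather than the exact configuration of this embodiment.

```python
# Illustrative sketch of the generator G_theta: the dialogue history is joined
# into one token sequence X and a reply G_theta(X) is sampled from GPT-2.
# Checkpoint and decoding settings are assumptions, not the embodiment's exact setup.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
generator = GPT2LMHeadModel.from_pretrained("gpt2")

def build_context(history):
    """Concatenate all turns of the dialogue history into the single text X,
    joined by the </s> separator described above."""
    return "</s>".join(history) + "</s>"

def generate_reply(history, max_new_tokens=40):
    """Sample a generated target response G_theta(X) conditioned on the context X."""
    inputs = tokenizer(build_context(history), return_tensors="pt")
    output_ids = generator.generate(
        **inputs,
        do_sample=True,                 # sampling keeps replies varied, like a real user
        top_p=0.9,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,
    )
    new_tokens = output_ids[0, inputs["input_ids"].shape[1]:]  # keep only the generated part
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

For example, generate_reply(["Hi, can you recommend a restaurant?"]) would return one sampled user-side reply for that single-turn history.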
2 Evaluation discriminator
In order to avoid the sparse and incomplete information easily caused by the single scalar reward signal of a conventional discriminator, the evaluation discriminator proposed in this embodiment can not only distinguish real samples from generated samples, but can also evaluate the quality of the generated samples, so that the information provided by the evaluation discriminator can be fully exploited to provide more diverse rewards for the generator. To this end, the evaluation discriminator built on a pre-trained language model is denoted D_φ; in a specific implementation, the evaluation discriminator D_φ can employ a bidirectional pre-trained language model (BERT, Bidirectional Encoder Representations from Transformers). Unlike a conventional discriminator, the evaluation discriminator D_φ of this embodiment converts the real/fake sample discrimination task of the original pre-trained language model into a two-class natural language inference task: the model judges whether there is a next-sentence relationship between the context X and the real target response Y, or between the context X and the generated target response G_θ(X), so as to discriminate real samples from generated samples, and the evaluation discriminator D_φ is also used to evaluate the quality of the generated samples. Specifically, the context X together with the corresponding target response Y, or together with the corresponding generated target response G_θ(X), is used as the input of the evaluation discriminator D_φ, i.e., D_φ(X, Y) or D_φ(X, G_θ(X)), on top of which a binary classifier is built, i.e.,
wherein n_k ∈ {1, 0}: 1 indicates that Y is the next sentence of X; 0 indicates that Y is not the next sentence of X; S(·) denotes a similarity calculation function.
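As a rough illustration of this two-class formulation, the sketch below builds the evaluation discriminator D_φ as a BERT sequence-pair classifier whose positive class plays the role of n_k = 1 (Y is the next sentence of X). The checkpoint name and the use of a generic classification head in place of the similarity function S(·) are assumptions.

```python
# Illustrative sketch of the evaluation discriminator D_phi: a BERT-based
# two-class classifier over the sentence pair (X, Y) or (X, G_theta(X)).
import torch
from transformers import BertForSequenceClassification, BertTokenizer

d_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
discriminator = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,  # label 1: Y is the next sentence of X; label 0: it is not
)

def d_phi(context, response):
    """Return the probability D_phi(n_k = 1 | X, Y), i.e. how likely `response`
    is a genuine next sentence of `context`."""
    inputs = d_tokenizer(context, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = discriminator(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()
```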
3 Training strategy
Before introducing the training strategy of the user simulator based on a generative adversarial network (US-GAN, User Simulator - Generative Adversarial Network) proposed in this embodiment, the loss function of the evaluation discriminator D_φ is defined. Here the evaluation discriminator is trained by maximizing the return of real samples and minimizing the return of generated samples. This is because, by minimizing the return of generated samples, low-quality samples generated by the generator G_θ are expected to be identified by the evaluation discriminator D_φ and obtain a lower return; the motivation for maximizing the return of real samples is that high-quality generated samples should obtain high rewards, while low-quality samples should be penalized to some degree, so as to encourage the evaluation discriminator D_φ to give high rewards to text that looks like real-world data. The loss function of the evaluation discriminator D_φ is therefore as follows:
wherein pdata denotes the real data distribution.
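The sketch below does not reproduce formula (2) itself; it only illustrates the described objective, rewarding real pairs and penalizing generated pairs, using a standard binary cross-entropy form, which is an assumption rather than the exact loss of the application.

```python
# Hypothetical training step for D_phi: real (X, Y) pairs are pushed toward the
# positive class and generated (X, G_theta(X)) pairs toward the negative class,
# i.e. real samples receive high return and generated samples receive low return.
import torch
import torch.nn.functional as F

def discriminator_step(discriminator, d_optimizer, real_batch, fake_batch):
    """real_batch / fake_batch: tokenized (X, Y) and (X, G_theta(X)) pairs."""
    real_logits = discriminator(**real_batch).logits
    fake_logits = discriminator(**fake_batch).logits
    real_labels = torch.ones(real_logits.size(0), dtype=torch.long)
    fake_labels = torch.zeros(fake_logits.size(0), dtype=torch.long)
    loss = (F.cross_entropy(real_logits, real_labels)
            + F.cross_entropy(fake_logits, fake_labels))
    d_optimizer.zero_grad()
    loss.backward()
    d_optimizer.step()
    return loss.item()
```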
For the generator, let G_θ(Y|X) be the probability of generating the dialogue response Y given the input context X. Since the response is generated word by word and each generated word is conditioned on the context X and the previously generated words, G_θ(Y|X) is defined as the joint probability of all words, i.e., G_θ(Y|X) = ∏_{i=0}^{n} G_θ(y_i | X, y_0, …, y_{i−1}).
the resulting sequence may be sampled from the distribution by the above formula. However, the evaluation arbiter cannot pass the gradient to the generator in the face of such discrete data due to the gradient-based generation countermeasure network (GAN, generative Adversarial Network). Thus, the generator is trained using a near-end policy optimization algorithm.
The evaluation discriminator D_φ evaluates the quality of the generated target response G_θ(X) as follows:
when D_φ(n_k | X, Y) < D_φ(n_k | X, G_θ(X)), the generated target response G_θ(X) is a high-quality sample;
when D_φ(n_k | X, Y) ≥ D_φ(n_k | X, G_θ(X)), the generated target response G_θ(X) is a low-quality sample.
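Expressed in code, and reusing the d_phi helper sketched earlier, the decision rule above reads as follows; taking the probability of the n_k = 1 class as the score is an illustrative assumption.

```python
# Sketch of the high/low-quality decision rule for a generated target response.
def is_high_quality(context, real_reply, generated_reply):
    """G_theta(X) counts as high quality when the evaluation discriminator
    scores it strictly higher than the real reply Y for the same context X."""
    return d_phi(context, real_reply) < d_phi(context, generated_reply)
```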
Considering the nature of a chit-chat dialogue robot, the real reply is not necessarily more realistic than the generated reply. Therefore, a discriminator is designed with which the relative realism between the generated reply and the real reply is estimated as the reward-and-penalty value. Specifically, we use the discriminator D_φ in the GAN as the source of the reward-and-penalty value R_θ,φ, and the reward-and-penalty value R_θ,φ of the generated text can be converted into the following form:
in text generation, the goal of the generator is to generate a sequence of text to maximize the desired rewards. So using likelihood ratio, generator G θ The gradient of the objective function of (2) is expressed as:
wherein, rewarding and punishing value R θ,φ Is a control update item, remembers the true target response y= (Y) 0 ,y 1 ,…,y i ,…,y n ),y i Representing a word in the target response Y, 0.ltoreq.i.ltoreq.n.
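The sketch below does not reproduce formulas (4) and (5); it only illustrates the masked-reward idea described under the beneficial effects (the reward signal is masked when the generated sample scores higher than the real one) together with a simple reward-weighted log-likelihood update. Zeroing the masked reward and using the summed log-probability are assumptions, not the exact formulas of the application.

```python
# Hypothetical reward-and-penalty value R_{theta,phi} with masking, and a
# reward-weighted log-likelihood update in which R_{theta,phi} controls the
# size of the generator update. Reuses the d_phi helper sketched earlier.
def masked_reward(context, real_reply, generated_reply):
    score_real = d_phi(context, real_reply)        # D_phi(n_k=1 | X, Y)
    score_fake = d_phi(context, generated_reply)   # D_phi(n_k=1 | X, G_theta(X))
    if score_real < score_fake:                    # generated reply already outscores the real one
        return 0.0                                 # mask the reward signal
    return score_fake                              # otherwise reward the realism of G_theta(X)

def generator_step(reply_logprobs, reward, g_optimizer):
    """reply_logprobs: per-token log-probs of the sampled reply under G_theta."""
    loss = -(reward * reply_logprobs.sum())        # reward-weighted negative log-likelihood
    g_optimizer.zero_grad()
    loss.backward()
    g_optimizer.step()
```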
Inputting the dialogue content into the generative adversarial network for training to obtain the trained generator and evaluation discriminator comprises: first, to reduce the gap between the general-purpose pre-trained model and the dialogue scenario, the generator is pre-trained on the dialogue content using MLE (Maximum Likelihood Estimation) as the loss function. The discriminator and the generator are then trained alternately until training stops, following the steps of calculating the reward-and-penalty value, updating the evaluation discriminator, and updating the generator with the proximal policy optimization algorithm.
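A hedged sketch of this alternating loop, reusing the helpers from the earlier sketches, is given below; it is not the application's own algorithm listing, and the number of epochs, the per-example updates, and the optimizer choices are illustrative assumptions.

```python
# Hypothetical alternating-training loop for US-GAN; MLE pre-training of the
# generator is assumed to have been done beforehand with standard LM fine-tuning.
def train_us_gan(dialog_pairs, generator, discriminator,
                 g_optimizer, d_optimizer, epochs=3):
    """dialog_pairs: list of (history, real_reply) tuples taken from the
    dialogue content between users and the dialogue robot."""
    for _ in range(epochs):
        for history, real_reply in dialog_pairs:
            context = build_context(history)
            generated_reply = generate_reply(history)

            # (1) reward-and-penalty value R_{theta,phi} for the sampled reply
            reward = masked_reward(context, real_reply, generated_reply)

            # (2) update the evaluation discriminator on real vs. generated pairs
            real_batch = d_tokenizer(context, real_reply,
                                     return_tensors="pt", truncation=True)
            fake_batch = d_tokenizer(context, generated_reply,
                                     return_tensors="pt", truncation=True)
            discriminator_step(discriminator, d_optimizer, real_batch, fake_batch)

            # (3) update the generator; the summed log-prob of the sampled reply
            # is used as a rough stand-in for per-token log-probs in ppo_loss
            ids = tokenizer(context + generated_reply, return_tensors="pt")["input_ids"]
            out = generator(ids, labels=ids)
            total_logprob = -out.loss * ids.size(1)
            surrogate = ppo_loss(total_logprob.unsqueeze(0),
                                 total_logprob.detach().unsqueeze(0), reward)
            g_optimizer.zero_grad()
            surrogate.backward()
            g_optimizer.step()
```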
In a specific implementation, the application also provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program, and when the computer program is executed by the data processing unit, some or all of the steps of the user simulator construction method based on a generative adversarial network provided by the application can be performed. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
It will be apparent to those skilled in the art that the technical solutions in the embodiments of the present application can be implemented by means of a computer program and its corresponding general hardware platform. Based on such understanding, the technical solutions in the embodiments of the present application may be embodied essentially in the form of a computer program, i.e. a software product, which may be stored in a storage medium and which includes several instructions to cause a device comprising a data processing unit (which may be a personal computer, a server, a single-chip microcomputer, an MCU, a network device, or the like) to perform the methods described in the embodiments or in some parts of the embodiments of the present application.
The present application provides a user simulator construction method based on a generative adversarial network. There are numerous methods and ways to implement this technical solution, and the above description is only a specific embodiment of the application. It should be noted that those skilled in the art can make several improvements and modifications without departing from the principle of the application, and such improvements and modifications should also be regarded as falling within the protection scope of the application. Components not explicitly described in this embodiment can be implemented with the prior art.

Claims (10)

1. A user simulator construction method based on a generative adversarial network, applied to a dialogue system, wherein the user simulator comprises the generative adversarial network, the method comprising:
acquiring dialogue content between a user and a dialogue robot from a dialogue system;
inputting the dialogue content into the generative adversarial network for training to obtain a trained generator and an evaluation discriminator, wherein the generator is used for generating replies based on the historical dialogue of the user, and the evaluation discriminator is used for distinguishing real replies from generated replies and evaluating the quality of the generated replies.
2. The user simulator construction method based on a generative adversarial network according to claim 1, wherein generating a reply based on the historical dialogue of the user comprises: concatenating all of the historical dialogue of the current sentence into a single text, inputting the text into the generator, and outputting the generated target response from the generator; the generator is denoted G_θ, the context in the historical dialogue is denoted X, the corresponding real target response is denoted Y, and the generated target response is denoted G_θ(X);
the evaluation discriminator being configured to distinguish real replies from generated replies and to evaluate the quality of the generated reply comprises: inputting the context X in the historical dialogue, the corresponding real target response Y, and the generated target response G_θ(X) into the evaluation discriminator; the evaluation discriminator judges whether there is a next-sentence relationship between the context X and the real target response Y, and between the context X and the generated target response G_θ(X), so as to discriminate real target responses from generated ones and to evaluate the quality of the generated target response G_θ(X).
3. The user simulator construction method based on a generative adversarial network according to claim 2, wherein the generator employs an autoregressive pre-trained language model; the evaluation discriminator employs a language-model-based discriminator built on a pre-trained language model, denoted D_φ.
4. The user simulator construction method based on a generative adversarial network according to claim 3, wherein the evaluation discriminator D_φ converts the evaluation task of the pre-trained language model into a two-class natural language inference task:
wherein n_k ∈ {1, 0}: 1 indicates that Y is the next sentence of X; 0 indicates that Y is not the next sentence of X; S(·) denotes a similarity calculation function.
5. The user simulator construction method based on a generative adversarial network according to claim 4, wherein the evaluation discriminator D_φ is trained by maximizing the return of real samples and minimizing the return of generated samples; minimizing the return of generated samples means that low-quality samples generated by the generator G_θ can be identified by the evaluation discriminator D_φ and obtain a lower return; maximizing the return of real samples means that high-quality samples generated by the generator G_θ can obtain high rewards and low-quality samples are penalized, so as to encourage the evaluation discriminator D_φ to give high rewards to text that looks like real-world data;
the loss function L_D of the evaluation discriminator D_φ is as follows:
wherein pdata denotes the real data distribution.
6. The user simulator construction method based on a generative adversarial network according to claim 5, wherein the generator G_θ is trained using the proximal policy optimization algorithm.
7. The user simulator construction method based on a generative adversarial network according to claim 6, wherein the evaluation discriminator D_φ evaluating the quality of the generated target response G_θ(X) comprises:
when D_φ(n_k | X, Y) < D_φ(n_k | X, G_θ(X)), the generated target response G_θ(X) is a high-quality sample;
when D_φ(n_k | X, Y) ≥ D_φ(n_k | X, G_θ(X)), the generated target response G_θ(X) is a low-quality sample.
8. The user simulator construction method based on a generative adversarial network according to claim 7, wherein the generator G_θ uses the output of the evaluation discriminator D_φ as the source of the reward-and-penalty signal, the reward-and-penalty value R_θ,φ being as follows:
9. The user simulator construction method based on a generative adversarial network according to claim 8, wherein the gradient of the objective function of the generator G_θ is expressed as:
wherein the reward-and-penalty value R_θ,φ is the control update term; the real target response is denoted Y = (y_0, y_1, …, y_i, …, y_n), where n denotes the length of the target response, n ≥ 1, y_i denotes a word in the target response Y, and 0 ≤ i ≤ n.
10. The user simulator construction method based on a generative adversarial network according to claim 9, wherein inputting the dialogue content into the generative adversarial network for training to obtain the trained generator and evaluation discriminator comprises: pre-training the generator G_θ using maximum likelihood estimation as the loss function, and alternately training the evaluation discriminator D_φ and the generator G_θ until training stops; alternately training the evaluation discriminator D_φ and the generator G_θ comprises:
calculating the reward-and-penalty value R_θ,φ for the dialogue content in the form of (X, Y) pairs using formula (4);
updating the evaluation discriminator D_φ with formula (2);
updating the generator G_θ with formula (5) using the proximal policy optimization algorithm.
Priority application CN202311221966.3A, filed 2023-09-21: User simulator construction method based on a generative adversarial network (status: Pending)
Publication CN117194640A, published 2023-12-08
Family ID: 88990329
Country: CN


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination