CN113392956B - GP-based deep Dyna-Q method for dialogue strategy learning - Google Patents

GP-based deep Dyna-Q method for dialogue strategy learning

Info

Publication number
CN113392956B
CN113392956B (application CN202110532520.7A)
Authority
CN
China
Prior art keywords
fact
model
experience
world
simulation experience
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110532520.7A
Other languages
Chinese (zh)
Other versions
CN113392956A (en)
Inventor
方文其
曹江
吴冠霖
平洋
栾绍童
闫顼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanhu Laboratory
Original Assignee
Nanhu Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanhu Laboratory filed Critical Nanhu Laboratory
Priority to CN202110532520.7A priority Critical patent/CN113392956B/en
Publication of CN113392956A publication Critical patent/CN113392956A/en
Application granted granted Critical
Publication of CN113392956B publication Critical patent/CN113392956B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention provides a GP-based deep Dyna-Q method for dialogue strategy learning, comprising the following steps: S1, a GP-based world model generates simulation experience; S2, a KL-divergence-based quality detector performs quality detection on the simulation experience; and S3, the dialogue strategy model is trained with the simulation experience that passes quality detection. The world model abandons the traditional DNN and is instead constructed as a Gaussian process model, which is easier to analyze. The KL-divergence-based quality detector effectively controls the quality of the simulation experience: by using the KL divergence to check the distribution of the experience, no extra work is needed to design and train a complicated quality detector, so the quality of the simulation experience is evaluated more easily, and computational efficiency is greatly improved while the robustness and effectiveness of the dialogue strategy are preserved.

Description

GP-based deep Dyna-Q method for dialogue strategy learning
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a GP-based deep Dyna-Q method for dialogue strategy learning.
Background
Task-completion dialogue strategy learning aims to build a task-oriented dialogue system that can help users accomplish specific single-domain or multi-domain tasks through several rounds of natural language interaction. Such systems have been widely used in chatbots and personal voice assistants, such as Apple's Siri and Microsoft's Cortana.
In recent years, reinforcement learning has become a mainstream method for dialogue strategy learning. Based on reinforcement learning, the dialogue system can gradually adjust and optimize its strategy through natural language interaction with users to improve performance. However, the original reinforcement learning methods require a large amount of human-machine interaction before a usable dialogue strategy is obtained, which not only increases the training cost but also degrades the user experience in the early training phase.
In order to solve the above problems and accelerate dialogue strategy learning, researchers have proposed the Deep Dyna-Q (DDQ) framework on the basis of the Dyna-Q framework. The DDQ framework introduces a world model that is trained with real user experience so that it behaves more like real users, and that is used to generate simulation experience of the dynamic environment. During dialogue strategy learning, the dialogue agent is trained with both real experience collected from actual interactions and simulation experience collected from interactions with the world model. By introducing the world model, only a small amount of real user interaction is needed, and the learning efficiency of the dialogue strategy can be significantly improved. However, DDQ still faces two important obstacles in further optimizing dialogue strategy learning from limited dialogue interactions:
first, in DDQ the world model is built as a deep neural network (DNN), whose performance depends largely on the amount of training data. In the initial training phase, when relatively little real experience is available, the strong data dependence of the DNN can cause the world model to generate low-quality simulation experience, and a large amount of real experience is required before the model can generate high-quality simulation experience. In other words, implementing the world model with a data-hungry model such as a DNN weakens the advantages of the Dyna-Q framework and makes DDQ less efficient in practice;
secondly, the simulation experience generated by the world model does not necessarily improve performance, and low-quality simulation experience can even severely harm it. To address this problem, some recent studies attempt to discriminate low-quality experience with a generative adversarial network (GAN) in order to control the quality of the simulation experience. However, GAN training is notoriously unstable, which with high probability causes dialogue strategy learning not to converge, and it is highly sensitive to the choice of hyper-parameters, so dialogue learning performance is severely constrained. Therefore, how to effectively screen out low-quality experience during dialogue strategy learning remains an open and important problem.
Disclosure of Invention
The invention aims to solve the problems and provides a GP-based deep Dyna-Q method for dialogue strategy learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
a GP-based deep Dyna-Q method for dialogue strategy learning, comprising the steps of:
s1, generating simulation experience by a world model based on GP;
s2, performing quality detection on the simulation experience by using a quality detector based on KL divergence;
and S3, training the dialogue strategy model by using the simulation experience qualified in quality detection.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, in step S2, simulation experience that passes quality detection is stored in a buffer for training the dialogue strategy model.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, the GP-based world model comprises a plurality of GP models and is denoted W(s, a; θ_w), where s is the current state, a is the last response action, and θ_w represents the parameters of the respective GP models.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, at least one set of simulation experience is generated by the predictions of the plurality of GP models in step S1, and each set of simulation experience comprises a response action a_u, a reward r and a variable t.
In the above GP-based deep Dyna-Q method for dialogue strategy learning, the world model comprises three GP models, which are respectively used to generate the response action a_u, the reward r and the variable t;
in the prediction stage, the world model generates a meta simulation experience e_i = (a_u^i, r^i, t^i) through the three GP models and, by taking the 50% confidence intervals of the response action a_u^i, the reward r^i and the variable t^i, obtains an upper-bound simulation experience e_l = (a_u^l, r^l, t^l) and a lower-bound simulation experience e_b = (a_u^b, r^b, t^b); the quality detector then separately checks the quality of the upper-bound simulation experience e_l, the lower-bound simulation experience e_b and the meta simulation experience e_i.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, in step S1, when a predicted response action a_u is not an integer, a_u is rounded to the nearest integer;
when a predicted response action a_u falls outside the defined action domain, the upper or lower bound of the action domain is selected directly.
In the above-described GP-based deep Dyna-Q method for dialogue strategy learning, the GP model takes the following form:

$y = f(x) + \varepsilon, \quad f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big), \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$

where $m(x)$ represents the mean; $k(x, x')$ is the kernel function; $\varepsilon$ is Gaussian noise; $\sigma_n^2$ is its variance; and $I$ is an identity matrix.

The kernel function takes the following form:

$k(r) = \sigma_f^2 \,\frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$

where $\sigma_f$ and $\ell$ are the amplitude and length-scale parameters, respectively; $\Gamma(\cdot)$ is the gamma function; $K_\nu(\cdot)$ is the modified Bessel function of the second kind; $\nu$ is a positive parameter of the covariance; and $r$ represents the distance between the observed target values.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, step S2 specifically comprises:
S21, storing user actions generated by the world model in a lexicon world-fact, and storing user actions generated by real users in a lexicon real-fact;
S22, measuring the similarity between the lexicon real-fact and the lexicon world-fact with the KL divergence, and evaluating the quality of the simulation experience accordingly;
the primary keys of the lexicon real-fact and the lexicon world-fact are user actions, and the corresponding values are the frequencies of those user actions.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, step S22 specifically comprises:
S221, tracking the KL divergence between the lexicon real-fact and the lexicon world-fact with a predefined variable KL_pre;
S222, storing, in a pre-established lexicon same-fact, the frequency values that the primary keys in the intersection of the lexicon real-fact and the lexicon world-fact have in the two lexicons;
S223, computing the current KL divergence based on the lexicon same-fact; if the current KL divergence is less than or equal to KL_pre, the current experience is detected as a qualified experience.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, in step S22, the frequency values that the primary keys in the intersection of the lexicon real-fact and the lexicon world-fact have in the two lexicons are stored in the pre-established lexicon same-fact, and the current experience is judged to be qualified when the length of the lexicon same-fact is smaller than a constant C.
The invention has the advantages that:
1. The world model of the present scheme abandons the traditional DNN model and is instead constructed as a Gaussian process model, which is easier to analyze;
2. The Gaussian-process-based world model avoids the dependence of the traditional DNN model on the amount of training data for the quality of the generated simulation experience, and can generate high-quality simulation experience to supplement the limited real experience;
3. The KL-divergence-based quality detector of the present scheme effectively controls the quality of the simulation experience: the distribution of the experience is checked with the KL divergence, and no extra work is needed to design and train a complicated quality detector, so the quality of the simulation experience is evaluated more easily, and computational efficiency is greatly improved while the robustness and effectiveness of the dialogue strategy are preserved.
Drawings
FIG. 1 is an architecture diagram of the dialogue learning method of the present invention;
FIG. 2 is a flow chart of a training phase of a world model in the dialogue learning method of the present invention;
FIG. 3 is a flow chart of the prediction phase of the world model in the dialogue learning method of the present invention;
FIG. 4 is a flow chart of KL divergence calculation in the dialogue learning method of the present invention;
FIG. 5 shows learning curves of DDQ and GPDDQ under different parameter settings, wherein:
(a) learning curves of DDQ at M=5000, N=16, K=0, 2, 5, 10, 20;
(b) learning curves of GPDDQ at M=5000, N=16, K=0, 2, 5, 10, 20;
(c) learning curves of DDQ at M=5000, N=4, K=0, 2, 5, 10, 20;
(d) learning curves of GPDDQ at M=5000, N=4, K=0, 2, 5, 10, 20;
FIG. 6 shows learning curves of DDQ/DQN and GPDDQ/GPDQN at M=5000, K=10, N=16, wherein:
(a) learning curves of DDQ/DQN;
(b) learning curves of GPDDQ/GPDQN;
FIG. 7 shows learning curves of DDQ and KL-GPDDQ under different parameter settings, wherein:
(a) learning curves of DDQ at M=5000, 3500, 2000, 1000, K=20, N=4;
(b) learning curves of KL-GPDDQ at M=5000, 3500, 2000, 1000, K=20, N=4;
(c) learning curves of DDQ at M=5000, 3500, 2000, 1000, K=30, N=4;
(d) learning curves of KL-GPDDQ at M=5000, 3500, 2000, 1000, K=30, N=4;
FIG. 8 shows learning curves of D3Q, DDQ, GPDDQ, UN-GPDDQ and KL-GPDDQ under different parameter settings, wherein:
(a) learning curves at M=5000, K=20, N=4;
(b) learning curves at M=5000, K=30, N=4.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the present scheme proposes a GP-based deep Dyna-Q method for dialogue strategy learning whose basic procedure is consistent with the prior art: for example, the dialogue strategy model and the world model are initialized with human conversation data, and dialogue strategy learning then starts from them. Dialogue strategy learning for the dialogue strategy model mainly comprises two parts: direct reinforcement learning and indirect reinforcement learning (also called planning). In direct reinforcement learning, a Deep Q-Network (DQN) is used to improve the dialogue strategy from real experience: the dialogue strategy model interacts with a user, and at each step selects the action a to execute according to the observed dialogue state s so as to maximize the value function Q. The dialogue strategy model then receives a reward r and the real user's action a_u^r, updates the current state to s', and stores the real experience (s, a, r, a_u^r, t) in the real user experience library, where t indicates whether the conversation has terminated.
The value function Q(s, a; θ_Q) to be maximized is approximated by a DNN and is updated by iteratively optimizing θ_Q to reduce the mean-squared loss. The loss function is as follows:

$L(\theta_Q) = \mathbb{E}\!\left[\big(r + \gamma \max_{a'} Q'(s', a'; \theta_{Q'}) - Q(s, a; \theta_Q)\big)^2\right]$

where $\gamma$ is the discount factor and $Q'(\cdot\,; \theta_{Q'})$ is a separate target network. In each iteration, the Q network is improved with mini-batch deep learning, and it can be trained with several optimization algorithms, such as Adam, stochastic gradient descent and RMSprop.
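For illustration, the following is a minimal PyTorch-style sketch of the mean-squared TD loss with a separate target network; the batch layout, the discount value and the use of PyTorch are assumptions made here for illustration, not part of the patent.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.9):
    """Mean-squared TD loss for Q(s, a; theta_Q) with a separate target
    network Q'; `batch` is assumed to hold tensors (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta_Q)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values       # max_a' Q'(s', a'; theta_Q')
        target = r + gamma * (1.0 - done) * q_next          # TD target
    return F.mse_loss(q_sa, target)
```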
During indirect reinforcement learning, the dialogue strategy model improves its dialogue strategy by interacting with the world model in order to reduce training cost. The frequency of planning is controlled by the parameter K, which means that K planning steps are performed for each step of direct reinforcement learning. When the world model captures the characteristics of the real environment accurately, K tends to be set large. At each planning step, the world model produces a response action a_u^w from the current state s and the last agent action a, generating the simulation experience (s, a, r, a_u^w, t') during planning.
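The interleaving of direct reinforcement learning and planning can be sketched as follows; the agent, environment and world-model interfaces (select_action, step, predict, update) are hypothetical names used only to illustrate the control flow of K planning steps per real step.

```python
def dyna_q_step(agent, real_env, world_model, real_buffer, sim_buffer, K):
    """One direct-RL step with the real user followed by K planning steps
    with the world model (all interfaces here are illustrative assumptions)."""
    # direct reinforcement learning: interact with the real user
    s = real_env.current_state()
    a = agent.select_action(s)                     # maximize Q(s, a)
    r, a_u_real, t = real_env.step(a)
    real_buffer.append((s, a, r, a_u_real, t))     # store real experience
    agent.update(real_buffer)

    # planning (indirect reinforcement learning): interact with the world model
    for _ in range(K):
        s_p = agent.sample_start_state()
        a_p = agent.select_action(s_p)
        a_u_sim, r_p, t_p = world_model.predict(s_p, a_p)
        sim_buffer.append((s_p, a_p, r_p, a_u_sim, t_p))
    agent.update(sim_buffer)
```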
In particular, on top of the above prior art, the present scheme proposes to construct the world model as a Gaussian process model, which is easier to analyze than the traditional DNN model and can generate high-quality simulation experience to supplement the limited real experience. In addition, the scheme provides a completely new way to evaluate the quality of simulation experience: based on the Kullback-Leibler (KL) divergence, simulation experience is compared directly with real experience, so its quality can be controlled effectively. By using the KL divergence to check the distribution of the experience, no additional quality detector needs to be trained, so the quality of the simulation experience can be evaluated more easily, and computational efficiency is greatly improved while the robustness and effectiveness of the dialogue strategy are preserved.
Specifically, as shown in fig. 1, the method includes the following steps:
s1, generating simulation experience by world model prediction based on GP;
s2, performing quality detection on the simulation experience by using a quality detector based on KL divergence;
and S3, storing the simulation experience qualified in the quality detection into a buffer, and training the dialogue strategy model by using the simulation experience stored into the buffer.
Specifically, the world model of this embodiment is denoted W(s, a; θ_w), where s is the current state, a is the last response action, and θ_w represents the parameters of the respective GP models. As shown in fig. 2 and fig. 3, the world model consists of three GP models GP1, GP2, GP3, each parameterized by a different θ_w. The three GP models are used to generate the response action a_u, the reward r and the variable t, respectively, and the simulation experience is expressed as e = (a_u, r, t).
Specifically, this embodiment generates a meta simulation experience e_i = (a_u^i, r^i, t^i) through the three GP models and, by taking the 50% confidence intervals of the response action a_u^i, the reward r^i and the variable t^i, obtains an upper-bound simulation experience e_l = (a_u^l, r^l, t^l) and a lower-bound simulation experience e_b = (a_u^b, r^b, t^b). That is, each prediction yields three simulation experiences e_i, e_l, e_b, whose quality is measured with the KL divergence as described in detail below.
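A minimal sketch of how the three experiences e_i, e_l and e_b could be assembled from the GP predictive distributions is given below; the predict(x) -> (mean, std) interface and the use of z ≈ 0.674 for the central 50% interval of a Gaussian are assumptions made for illustration.

```python
Z_50 = 0.6745   # z-value of the central 50% confidence interval of a Gaussian

def build_experiences(gp_action, gp_reward, gp_term, x):
    """Assemble the meta (e_i), upper-bound (e_l) and lower-bound (e_b)
    simulation experiences from the three GP predictive distributions;
    each gp_* object is assumed to expose predict(x) -> (mean, std)."""
    mu_a, sd_a = gp_action.predict(x)
    mu_r, sd_r = gp_reward.predict(x)
    mu_t, sd_t = gp_term.predict(x)

    e_i = (mu_a, mu_r, mu_t)                                             # predictive means
    e_l = (mu_a + Z_50 * sd_a, mu_r + Z_50 * sd_r, mu_t + Z_50 * sd_t)   # upper bounds
    e_b = (mu_a - Z_50 * sd_a, mu_r - Z_50 * sd_r, mu_t - Z_50 * sd_t)   # lower bounds
    return e_i, e_l, e_b
```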
Unlike DDQ, the world model here is essentially a regression model for generating the user action a_u. Considering that a user action should be an integer within a limited action domain, the present scheme further post-processes the actions generated by the world model:
first, when a predicted response action a_u is not an integer (the GP-based world model of this scheme is a regression model, so non-integer response actions are common), a_u is rounded to the nearest integer, a_u^l is replaced by the nearest integer above it, and a_u^b is replaced by the nearest integer below it; second, when a predicted response action a_u falls outside the defined action domain, the upper or lower bound of the action domain is selected directly.
In particular, in the GP regression problem of the world model, the observed target $y$ is generated from the latent function $f$ by adding independent Gaussian noise:

$y = f(x) + \varepsilon, \quad f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big), \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$

where $m(x)$ represents the mean; $k(x, x')$ is the kernel function; $\varepsilon$ is independent Gaussian noise with mean 0 and variance $\sigma_n^2$; and $I$ is an identity matrix.

According to the Bayesian principle, given the training data $(X, y)$ and a test input $x_*$, the conditional mean and covariance of the posterior distribution are as follows:

$\mu_* = m(x_*) + k(x_*, X)\,\big[K + \sigma_n^2 I\big]^{-1}\,\big(y - m(X)\big)$

$\sigma_*^2 = k(x_*, x_*) - k(x_*, X)\,\big[K + \sigma_n^2 I\big]^{-1}\,k(X, x_*)$

where $K = k(X, X)$ denotes the covariance matrix of the training inputs.
The GP1 model generates the action a_u, in which case the action a_u is the observed target y; the GP2 model generates the reward r, in which case the reward r is the observed target y; and the GP3 model generates the variable t, in which case t is the observed target y.
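The posterior formulas above can be computed directly with a few lines of linear algebra; the sketch below assumes the kernel and mean functions are supplied as plain callables and is meant only to illustrate the equations, not the patent's implementation.

```python
import numpy as np

def gp_posterior(X, y, x_star, kernel, mean_fn, noise_var):
    """Posterior mean and variance at a single test input x_star, following
    the formulas above; kernel(A, B) and mean_fn(A) are plain callables and
    all names are illustrative."""
    K = kernel(X, X) + noise_var * np.eye(len(X))       # K(X, X) + sigma_n^2 I
    k_star = kernel(X, x_star[None, :]).ravel()         # k(X, x_*)
    alpha = np.linalg.solve(K, y - mean_fn(X))          # (K + sigma_n^2 I)^{-1} (y - m(X))
    mu_star = mean_fn(x_star[None, :])[0] + k_star @ alpha
    v = np.linalg.solve(K, k_star)
    var_star = kernel(x_star[None, :], x_star[None, :])[0, 0] - k_star @ v
    return mu_star, var_star
```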
Preferably, the kernel function is a Matérn kernel:

$k(r) = \sigma_f^2 \,\frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$

where $\sigma_f$ and $\ell$ are the amplitude and length-scale parameters, respectively; $\Gamma(\cdot)$ is the gamma function; $K_\nu(\cdot)$ is the modified Bessel function of the second kind; $\nu$ is a positive parameter of the covariance; and $r$ represents the distance between the observed values. For multidimensional inputs, an automatic relevance determination (ARD) version of the kernel can be introduced to handle this case.
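For reference, a plain NumPy/SciPy sketch of the Matérn covariance is given below; the default parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.special import gamma, kv   # gamma function, modified Bessel function of the 2nd kind

def matern_kernel(r, sigma_f=1.0, length_scale=1.0, nu=1.5):
    """Matern covariance k(r) evaluated at distances r >= 0; sigma_f is the
    amplitude, length_scale the length scale (default values are illustrative)."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    k = np.empty_like(r)
    zero = (r == 0.0)
    k[zero] = sigma_f ** 2                               # limit of k(r) as r -> 0
    scaled = np.sqrt(2.0 * nu) * r[~zero] / length_scale
    k[~zero] = (sigma_f ** 2 * 2.0 ** (1.0 - nu) / gamma(nu)
                * scaled ** nu * kv(nu, scaled))
    return k
```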
In each round of learning of the world model, the current state s and the last agent action a are concatenated as the input to the world model. Here all GP priors are specified by a mean function and a Matérn kernel, and the world model W(s, a; θ_w) is trained to simulate the real dialogue environment. Specifically, as shown in fig. 2, the loss function is set to the sum of the negative log marginal likelihoods (NLL) of the three GP models (denoted "summation of three NLLs" in fig. 2). Thanks to conjugacy, each NLL can be computed analytically, and its general form can be written as:

$\mathrm{NLL} = \frac{1}{2}\, y^{\top} \big(K + \sigma_n^2 I\big)^{-1} y + \frac{1}{2}\, \log\big|K + \sigma_n^2 I\big| + \frac{n}{2}\, \log 2\pi$

where $|\cdot|$ denotes the determinant of a matrix and n is the number of training data points. In the training phase, the world model W(s, a; θ_w) is refined at the end of each iteration using real experience and the L-BFGS-B algorithm.
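One possible way to realize this training step with an off-the-shelf GP library is sketched below using scikit-learn, whose GaussianProcessRegressor minimizes the negative log marginal likelihood with L-BFGS-B by default; the kernel composition, noise level and feature construction are assumptions, not the patent's reference implementation.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

def fit_world_model(X, y_action, y_reward, y_term, nu=1.5):
    """Fit the three GP models of the world model W(s, a; theta_w).
    X stacks the concatenated (state, last action) vectors; the targets are
    the user action, the reward and the termination variable."""
    def make_gp():
        kernel = (ConstantKernel(1.0) * Matern(length_scale=1.0, nu=nu)
                  + WhiteKernel(noise_level=1e-2))      # amplitude * Matern + Gaussian noise
        return GaussianProcessRegressor(kernel=kernel, normalize_y=True)

    gps = [make_gp().fit(X, y) for y in (y_action, y_reward, y_term)]
    return gps  # [GP1 (action), GP2 (reward), GP3 (termination)]
```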
Further, in the present embodiment, the structure of the quality detector based on KL divergence is shown in fig. 4, and the detection method includes:
user actions a generated from a world modelu wStoring the user action a generated by the real user into a word library world-factu rStoring the obtained word into a word stock real-dit; the primary keys of the word stock real-fact and the word stock world-fact are user actions, and the corresponding values are the frequencies corresponding to the user actions;
frequency values of intersection main keys of the word stock real-fact and the word stock world-fact in the two word stocks are stored in a word stock same-fact established in advance, and similarity is measured by KL divergence (KL divergence);
in the initial stage, the word stock world-fact has only limited behaviors/actions, so the word stock same-fact length is also very small, and in order to preheat the world model, preferably, when the word stock same-fact length is smaller than a constant C, the simulation experience is regarded as qualified. The constant C is determined by one skilled in the art on a case-by-case basis and is not limited herein.
The similarity measure is that a variable KL is defined in advancepreThe variable KLpreIs set to a larger value for tracking the KL divergence between the lexicon real-fact and the lexicon world-fact. When the length of the thesame same-dit reaches a certain value, namely is larger than or equal to the constant C, calculating the current KL divergence based on the thesame same-dit, and if the current KL divergence is smaller than or equal to KLpreThen it means that the current experience is detected as a qualified experience since the current experience makes the world model more similar to the real user, and the qualified experience is pushed into the buffer MpFor training a dialogue strategy model.
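The detector can be sketched as follows; the lexicons are modeled as Counters keyed by user action, and the constant C, the initial value of KL_pre and the update of KL_pre on acceptance are illustrative assumptions drawn from the description above.

```python
from collections import Counter
import math

class KLQualityDetector:
    """Sketch of the KL-divergence quality detector: the lexicons real-fact,
    world-fact and same-fact are modeled with Counters / key sets keyed by
    user action; C, the initial KL_pre and its update rule are illustrative."""

    def __init__(self, C=20, kl_init=1e9):
        self.real_fact = Counter()    # actions generated by real users
        self.world_fact = Counter()   # actions generated by the world model
        self.C = C
        self.kl_pre = kl_init         # tracked KL divergence, starts large

    def add_real(self, action):
        self.real_fact[action] += 1

    def check(self, action):
        """Return True if the simulation experience that produced `action` qualifies."""
        self.world_fact[action] += 1
        same_keys = set(self.real_fact) & set(self.world_fact)  # same-fact primary keys
        if len(same_keys) < self.C:   # warm-up: the world model knows too few actions
            return True
        p_raw = [self.real_fact[k] for k in same_keys]
        q_raw = [self.world_fact[k] for k in same_keys]
        p_sum, q_sum = sum(p_raw), sum(q_raw)
        kl = sum((pi / p_sum) * math.log((pi / p_sum) / (qi / q_sum))
                 for pi, qi in zip(p_raw, q_raw))
        if kl <= self.kl_pre:         # experience brings world model closer to the real user
            self.kl_pre = kl
            return True
        return False
```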
The method improves the quality of the simulation experience from two directions: on one hand, high-quality simulation experience is generated at the generation stage of the world model; on the other hand, the quality of the simulation experience is effectively evaluated during checking. In this way, the quality of the generated experience is improved at the source and further verified afterwards, and unqualified simulation experience is eliminated so that low-quality experience does not degrade the performance of the model.
To demonstrate the effectiveness and superiority of the present solution, it was tested in the movie ticket purchase task and compared with other methods in two ways:
1) variation of performance under different superparameters
2) Comparison of Performance
1.1 data set
The same raw data as in the conventional DDQ method is used. It was collected via Amazon Mechanical Turk and manually labeled according to a schema defined by domain experts, which contains 11 dialogue acts and 16 slots; the data set contains 280 annotated dialogues in total, with an average length of about 11 turns.
1.2 dialog Agents used as reference
Different versions of task-completion dialogue agents are provided as performance benchmarks for the scheme:
• GPDDQ(M, K, N) is the agent learned by the GPDDQ method of the present scheme, where M is the buffer size, K is the number of planning steps and N is the batch size. The initial world model is pre-trained with human conversation data. Neither the uncertainty attribute (i.e. no confidence intervals are computed) nor the KL divergence check is used;
• UN-GPDDQ(M, K, N) is similar to GPDDQ(M, K, N), but uncertainty is taken into account: the world model returns e_l, e_i, e_b in its prediction phase;
• KL-GPDDQ(M, K, N) adds the KL divergence check on top of UN-GPDDQ(M, K, N);
• GPDDQ(M, K, N, rand-init θ_w) is an agent learned by the GPDDQ method, but its world model is initialized randomly: r and t are sampled randomly from the corresponding GP models, and the action a_u is sampled uniformly from the defined action domain;
• GPDDQ(M, K, N, fixed θ_w) trains the world model with human conversation data only in the warm-up stage, after which the world model is left unchanged;
• GPDQN(M, K, N) is obtained by direct reinforcement learning only; assuming its world model matches real users perfectly, its performance can be regarded as the upper bound of GPDDQ(M, K, N).
1.3 analysis of parameters
To show the advantage of the proposed model in terms of sensitivity to hyper-parameter changes, a series of experiments is performed in which the corresponding parameters are varied, such as the batch size, the number of planning steps, the parameter updating strategy and the buffer size.
1.3.1 batch size and planning step
In this set of experiments, the batch size is set to 16 and 4 and the agents are trained with different numbers of planning steps K. The main results are shown in fig. 5; statistically, GPDDQ clearly outperforms DDQ. As can be seen from fig. 5(a) and fig. 5(b), at the same value of K the converged success rate of GPDDQ is far better than that of DDQ: the success rate of GPDDQ converges to around 0.8, while that of DDQ is 0.74. As the number of planning steps increases, learning generally becomes faster, which matches the intuition that a larger number of planning steps brings a higher learning speed. Nevertheless, at K=20 and K=10 the learning curves do not differ much, because an excessively large K degrades the quality of the simulation experience. In practical applications, the best value of K has to be found to achieve the best balance between the quantity and the quality of the simulation experience.
Since the GP method is more robust to the influence of the hyper-parameters, it can be expected to perform better with small batches, so small-batch tests were further performed in this set of experiments. As shown in fig. 5(c) and fig. 5(d), the batch size is reduced to 4 while the other parameters are unchanged, and even at K=0 the performance of GPDDQ still exceeds DDQ. More importantly, there is no significant degradation in performance compared with the results for a batch size of 16. In contrast, for the DDQ method only the learning curve at K=10 is stronger than at K=0 in terms of success rate, and its performance drops sharply when K is increased to 20, which is caused by insufficient training of the DNN when the batch size is too small.
1.3.2 parameter update policy
In this set of experiments, M=5000, K=10 and N=16 are set, and the parameter updating strategy is varied; the results are shown in fig. 6. The experimental results show that the quality of the world model has a large influence on the performance of the agent. As shown in fig. 6, the DQN and GPDQN methods are completely model-free methods with K times the amount of training data of the other methods. Because of the randomness of the two, their curves differ slightly but are essentially the same. Clearly, the world model that is fixed after the warm-up stage produces the worst results. The large drop of the DDQ learning curve after 250 iterations is caused by the lack of training data, whereas the peak of each GPDDQ learning curve is essentially the same as that of DQN, and even with different parameter updating strategies the final success rate does not fluctuate much.
1.3.3 buffer size
In this set of experiments, the KL-GPDDQ method is evaluated by varying the size of the buffer. As shown in fig. 7, in terms of overall performance the proposed method is more stable under different conditions, including but not limited to different buffer sizes and numbers of planning steps. After the buffer size is reduced from 5000 to 1000, the learning curve of the proposed method does not change noticeably, whereas the performance of the DDQ method changes markedly. This happens because the world model built with a DNN in DDQ generates low-quality experience during planning, but, owing to the smaller buffer size, high-quality experience unexpectedly dominates the buffer, resulting in improved performance.
As for convergence, the success rate of the KL-GPDDQ method converges to about 0.8 after 200 iterations when K=20, whereas the DDQ method has not converged after 200 iterations, its success rate generally fluctuates below that of the proposed method, and its final converged success rate is lower. These experimental results fully demonstrate that the proposed method performs better and is more robust when a relatively small buffer is used.
1.4 Performance alignment
To demonstrate the performance of the present scheme, it is compared with other algorithms, as shown in Table 1. From Table 1 it can be seen that the DDQ method still performs the worst of all five agents. From the results of the GPDDQ, UN-GPDDQ and KL-GPDDQ agents it is evident that the KL divergence check of the present scheme is very helpful for improving performance, with clear gains in success rate and reward. Compared with DDQ, the method can improve the success rate by about 20 percent with less interaction with the user.
Table 1: experimental results of the different agents at training iterations {100, 200, 300}, with K = 20 and buffer size 5000 (Su = Success rate, Tu = Turns, Re = Reward).
In addition, as can be seen from fig. 8, the learning speed of the proposed method is much higher than that of DDQ and D3Q. It should be noted that the curve of D3Q fluctuates strongly and is very unstable; in particular, when K=30, D3Q does not even converge to an optimal value. So even though D3Q can cull low-quality experience, it remains difficult to use in practice because the GAN is too unstable.
From the above experiments, we can see that compared with the method based on the DDQ framework in the prior art, the scheme has obvious advantages, such as improving the system efficiency and the robustness.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Although the terms simulation experience, real experience, quality detector, human conversation data, GP model, world model, buffer, dialogue strategy model, real user experience library, and so on are used frequently herein, the possibility of using other terms is not excluded. These terms are used merely to describe and explain the nature of the invention more conveniently; construing them as imposing any additional limitation would be contrary to the spirit of the invention.

Claims (7)

1. A GP-based deep Dyna-Q method for dialogue strategy learning, characterized by comprising the steps of:
s1, generating simulation experience by a world model based on GP;
s2, performing quality detection on the simulation experience by using a quality detector based on KL divergence;
s3, training a dialogue strategy model by using simulation experience qualified in quality detection;
in step S1, the world model comprises three GP models, which are respectively used to generate the response action a_u, the reward r and the variable t;
the world model generates a meta simulation experience e_i = (a_u^i, r^i, t^i) through the three GP models and, by taking the 50% confidence intervals of the response action a_u^i, the reward r^i and the variable t^i, obtains an upper-bound simulation experience e_l = (a_u^l, r^l, t^l) and a lower-bound simulation experience e_b = (a_u^b, r^b, t^b), and the quality detector separately checks the quality of the upper-bound simulation experience e_l, the lower-bound simulation experience e_b and the meta simulation experience e_i;
when a predicted response action a_u is not an integer, a_u is rounded to the nearest integer; when a predicted response action a_u falls outside the defined action domain, the upper or lower bound of the action domain is selected directly;
step S2 specifically includes:
s21, storing user actions generated by the world model in a lexicon world-fact, and storing user actions generated by real users in a lexicon real-fact;
s22, measuring the similarity between the lexicon real-fact and the lexicon world-fact with the KL divergence, and evaluating the quality of the simulation experience accordingly.
2. The GP-based deep Dyna-Q method for dialogue strategy learning according to claim 1, wherein in step S2 the simulation experience that passes quality detection is stored in a buffer for training the dialogue strategy model.
3. The GP-based deep Dyna-Q method for dialogue strategy learning according to claim 1, wherein the world model is denoted W(s, a; θ_w), where s is the current state, a is the last response action, and θ_w represents the parameters of the respective GP models.
4. The GP-based deep Dyna-Q method for dialogue strategy learning according to claim 3, wherein the GP model takes the following form:

$y = f(x) + \varepsilon, \quad f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big), \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$

where $m(x)$ represents the mean; $k(x, x')$ is the kernel function; $\varepsilon$ is Gaussian noise; $\sigma_n^2$ is its variance; and $I$ is an identity matrix;

the kernel function takes the following form:

$k(r) = \sigma_f^2 \,\frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$

where $\sigma_f$ and $\ell$ are the amplitude and length-scale parameters, respectively; $\Gamma(\cdot)$ is the gamma function; $K_\nu(\cdot)$ is the modified Bessel function of the second kind; $\nu$ is a positive parameter of the covariance; and $r$ represents the distance between the observed target values.
5. The GP-based deep Dyna-Q method for dialogue strategy learning according to any one of claims 1 to 4, wherein the primary keys of the lexicon real-fact and the lexicon world-fact are both user actions, and the corresponding values are the frequencies of those user actions.
6. The GP-based deep Dyna-Q method for dialogue strategy learning according to claim 5, wherein step S22 specifically comprises:
S221, tracking the KL divergence between the lexicon real-fact and the lexicon world-fact with a predefined variable KL_pre;
S222, storing, in a pre-established lexicon same-fact, the frequency values that the primary keys in the intersection of the lexicon real-fact and the lexicon world-fact have in the two lexicons;
S223, computing the current KL divergence based on the lexicon same-fact; if the current KL divergence is less than or equal to KL_pre, the current experience is detected as a qualified experience.
7. The GP-based deep Dyna-Q method for dialogue strategy learning according to claim 6, wherein in step S22 the frequency values that the primary keys in the intersection of the lexicon real-fact and the lexicon world-fact have in the two lexicons are stored in the lexicon same-fact, and the current experience is judged to be qualified when the length of the lexicon same-fact is smaller than a constant C.
CN202110532520.7A 2021-05-17 2021-05-17 GP-based deep Dyna-Q method for dialogue strategy learning Active CN113392956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110532520.7A CN113392956B (en) 2021-05-17 2021-05-17 GP-based deep Dyna-Q method for dialogue strategy learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110532520.7A CN113392956B (en) 2021-05-17 2021-05-17 GP-based deep Dyna-Q method for dialogue strategy learning

Publications (2)

Publication Number Publication Date
CN113392956A CN113392956A (en) 2021-09-14
CN113392956B true CN113392956B (en) 2022-02-11

Family

ID=77617062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110532520.7A Active CN113392956B (en) 2021-05-17 2021-05-17 GP-based deep Dyna-Q method for dialogue strategy learning

Country Status (1)

Country Link
CN (1) CN113392956B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114647986B (en) * 2022-04-18 2023-08-08 南湖实验室 Intelligent decision method and system for realizing continuity action decision based on GP and PPO
CN114492215A (en) * 2022-04-18 2022-05-13 南湖实验室 GP world model for assisting training by utilizing strategy model and training method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962238A (en) * 2018-04-25 2018-12-07 苏州思必驰信息科技有限公司 Dialogue method, system, equipment and storage medium based on structural neural networks
CN111795700A (en) * 2020-06-30 2020-10-20 浙江大学 Unmanned vehicle reinforcement learning training environment construction method and training system thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11222283B2 (en) * 2018-10-23 2022-01-11 International Business Machines Corporation Hierarchical conversational policy learning for sales strategy planning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962238A (en) * 2018-04-25 2018-12-07 苏州思必驰信息科技有限公司 Dialogue method, system, equipment and storage medium based on structural neural networks
CN111795700A (en) * 2020-06-30 2020-10-20 浙江大学 Unmanned vehicle reinforcement learning training environment construction method and training system thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Switch-based Active Deep Dyna-Q:Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning;YueXin Wu等;《arXiv》;20181119;第1-8页 *
A Survey of Task-Oriented Dialogue Systems; Zhao Yangyang et al.; Chinese Journal of Computers; 2020-10; Vol. 43, No. 10; pp. 1862-1886 *
Improved DDPG Dialogue Policy Optimization Algorithm; Zhao Yinjiang et al.; Computer Engineering and Design; 2021-02; Vol. 42, No. 2; pp. 411-417 *

Also Published As

Publication number Publication date
CN113392956A (en) 2021-09-14


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant