CN113392956B - GP-based deep Dyna-Q method for dialogue strategy learning - Google Patents

GP-based deep Dyna-Q method for dialogue strategy learning

Info

Publication number
CN113392956B
CN113392956B (application CN202110532520.7A)
Authority
CN
China
Prior art keywords
fact
model
experience
world
simulation experience
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110532520.7A
Other languages
Chinese (zh)
Other versions
CN113392956A (en)
Inventor
方文其
曹江
吴冠霖
平洋
栾绍童
闫顼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanhu Laboratory
Original Assignee
Nanhu Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanhu Laboratory filed Critical Nanhu Laboratory
Priority to CN202110532520.7A priority Critical patent/CN113392956B/en
Publication of CN113392956A publication Critical patent/CN113392956A/en
Application granted granted Critical
Publication of CN113392956B publication Critical patent/CN113392956B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/332 - Query formulation
    • G06F16/3329 - Natural language query formulation or dialogue systems
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods

Abstract

The invention provides a GP-based deep Dyna-Q method for dialogue strategy learning, comprising the following steps: S1, a GP-based world model generates simulation experience; S2, a KL-divergence-based quality detector performs quality detection on the simulation experience; and S3, the dialogue strategy model is trained with the simulation experience that passes quality detection. The world model abandons the traditional DNN and is instead constructed as a Gaussian process model, which is easier to analyze. The KL-divergence-based quality detector effectively controls the quality of the simulation experience: by using the KL divergence to check the distribution of the experience, no extra work is needed to design and train a complicated quality detector, so the quality of the simulation experience is evaluated more easily, and computational efficiency is greatly improved while the robustness and effectiveness of the dialogue strategy are preserved.

Description

GP-based deep Dyna-Q method for dialogue strategy learning
Technical Field
The invention belongs to the technical field of machine learning, and particularly relates to a GP-based deep Dyna-Q method for dialogue strategy learning.
Background
Task-completion dialogue strategy learning aims to build a task-oriented dialogue system that can help users accomplish specific single-domain or multi-domain tasks through several rounds of natural language interaction. Such systems have been widely used in chatbots and personal voice assistants, such as Apple's Siri and Microsoft's Cortana.
In recent years, reinforcement learning has become a mainstream method for dialogue strategy learning. Based on reinforcement learning, the dialogue system can gradually adjust and optimize its strategy through natural language interaction with users to improve performance. However, the original reinforcement learning methods require a large amount of human-machine interaction before a usable dialogue strategy is obtained, which not only increases the training cost but also degrades the user experience in the early training phase.
In order to solve the above problems and accelerate dialogue strategy learning, researchers have proposed the Deep Dyna-Q (DDQ) framework on the basis of the Dyna-Q framework. The DDQ framework introduces a world model that is trained with real user experience so that it behaves more like real users, and that is used to generate simulation experience of the dynamic environment. During dialogue strategy learning, the dialogue agent is trained with both real experience collected from actual interactions and simulation experience collected from interactions with the world model. By introducing the world model, only a small amount of real user interaction is needed, and the learning efficiency of the dialogue strategy can be significantly improved. However, DDQ still faces two important obstacles in further optimizing dialogue strategy learning from limited dialogue interactions:
first, in DDQ the world model is built as a deep neural network (DNN), whose performance depends largely on the amount of training data. In the initial training phase, when relatively little real experience is available, the strong data dependence of the DNN can cause the world model to generate low-quality simulation experience, and a large amount of real experience is required before the model can generate high-quality simulation experience. In other words, implementing the world model with a data-hungry model such as a DNN weakens the advantages of the Dyna-Q framework and makes DDQ less efficient in practice;
secondly, the simulation experience generated by the world model does not necessarily improve performance, and low-quality simulation experience can even severely harm it. To address this problem, some recent studies attempt to discriminate low-quality experience with a generative adversarial network (GAN) in order to control the quality of the simulation experience. However, GAN training is notoriously unstable, which with high probability causes dialogue strategy learning not to converge, and it is highly sensitive to the choice of hyper-parameters, so dialogue learning performance is severely constrained. Therefore, how to effectively screen out low-quality experience during dialogue strategy learning remains an open and important problem.
Disclosure of Invention
The invention aims to solve the problems and provides a GP-based deep Dyna-Q method for dialogue strategy learning.
In order to achieve the purpose, the invention adopts the following technical scheme:
a GP-based deep Dyna-Q method for dialogue strategy learning, comprising the steps of:
s1, generating simulation experience by a world model based on GP;
s2, performing quality detection on the simulation experience by using a quality detector based on KL divergence;
and S3, training the dialogue strategy model by using the simulation experience qualified in quality detection.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, in step S2, simulation experience that passes quality detection is stored in a buffer for training the dialogue strategy model.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, the GP-based world model comprises a plurality of GP models and is denoted W(s, a; θ_w), where s is the current state, a is the last response action, and θ_w represents the parameters of the respective GP models.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, at least one set of simulation experience is generated by the predictions of the plurality of GP models in step S1, and each set of simulation experience comprises a response action a_u, a reward r and a variable t.
In the above GP-based deep Dyna-Q method for dialogue strategy learning, the world model comprises three GP models, which are respectively used to generate the response action a_u, the reward r and the variable t;
in the prediction stage, the world model generates a meta simulation experience e_i = (a_u^i, r^i, t^i) through the three GP models and, by taking the 50% confidence intervals of the response action a_u^i, the reward r^i and the variable t^i, obtains an upper-bound simulation experience e_l = (a_u^l, r^l, t^l) and a lower-bound simulation experience e_b = (a_u^b, r^b, t^b); the quality detector then separately checks the quality of the upper-bound simulation experience e_l, the lower-bound simulation experience e_b and the meta simulation experience e_i.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, in step S1, when a predicted response action a_u is not an integer, a_u is rounded to the nearest integer;
when a predicted response action a_u falls outside the defined action domain, the upper or lower bound of the action domain is selected directly.
In the above-described GP-based deep Dyna-Q method for dialogue strategy learning, the GP model takes the following form:

$y = f(x) + \varepsilon, \quad f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big), \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$

where $m(x)$ represents the mean; $k(x, x')$ is the kernel function; $\varepsilon$ is Gaussian noise; $\sigma_n^2$ is its variance; and $I$ is an identity matrix.

The kernel function takes the following form:

$k(r) = \sigma_f^2 \,\frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$

where $\sigma_f$ and $\ell$ are the amplitude and length-scale parameters, respectively; $\Gamma(\cdot)$ is the gamma function; $K_\nu(\cdot)$ is the modified Bessel function of the second kind; $\nu$ is a positive parameter of the covariance; and $r$ represents the distance between the observed target values.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, step S2 specifically comprises:
S21, storing user actions generated by the world model in a lexicon world-fact, and storing user actions generated by real users in a lexicon real-fact;
S22, measuring the similarity between the lexicon real-fact and the lexicon world-fact with the KL divergence, and evaluating the quality of the simulation experience accordingly;
the primary keys of the lexicon real-fact and the lexicon world-fact are user actions, and the corresponding values are the frequencies of those user actions.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, step S22 specifically comprises:
S221, tracking the KL divergence between the lexicon real-fact and the lexicon world-fact with a predefined variable KL_pre;
S222, storing, in a pre-established lexicon same-fact, the frequency values that the primary keys in the intersection of the lexicon real-fact and the lexicon world-fact have in the two lexicons;
S223, computing the current KL divergence based on the lexicon same-fact; if the current KL divergence is less than or equal to KL_pre, the current experience is detected as a qualified experience.
In the GP-based deep Dyna-Q method for dialogue strategy learning described above, in step S22, the frequency values that the primary keys in the intersection of the lexicon real-fact and the lexicon world-fact have in the two lexicons are stored in the pre-established lexicon same-fact, and the current experience is judged to be qualified when the length of the lexicon same-fact is smaller than a constant C.
The invention has the advantages that:
1. The world model of the present scheme abandons the traditional DNN model and is instead constructed as a Gaussian process model, which is easier to analyze;
2. The Gaussian-process-based world model avoids the dependence of the traditional DNN model on the amount of training data for the quality of the generated simulation experience, and can generate high-quality simulation experience to supplement the limited real experience;
3. The KL-divergence-based quality detector of the present scheme effectively controls the quality of the simulation experience: the distribution of the experience is checked with the KL divergence, and no extra work is needed to design and train a complicated quality detector, so the quality of the simulation experience is evaluated more easily, and computational efficiency is greatly improved while the robustness and effectiveness of the dialogue strategy are preserved.
Drawings
FIG. 1 is an architecture diagram of the dialogue learning method of the present invention;
FIG. 2 is a flow chart of a training phase of a world model in the dialogue learning method of the present invention;
FIG. 3 is a flow chart of the prediction phase of the world model in the dialogue learning method of the present invention;
FIG. 4 is a flow chart of KL divergence calculation in the dialogue learning method of the present invention;
FIG. 5 shows learning curves of DDQ and GPDDQ under different parameter settings, wherein:
(a) learning curves of DDQ at M=5000, N=16, K=0, 2, 5, 10, 20;
(b) learning curves of GPDDQ at M=5000, N=16, K=0, 2, 5, 10, 20;
(c) learning curves of DDQ at M=5000, N=4, K=0, 2, 5, 10, 20;
(d) learning curves of GPDDQ at M=5000, N=4, K=0, 2, 5, 10, 20;
FIG. 6 shows learning curves of DDQ/DQN and GPDDQ/GPDQN at M=5000, K=10, N=16, wherein:
(a) learning curves of DDQ/DQN;
(b) learning curves of GPDDQ/GPDQN;
FIG. 7 shows learning curves of DDQ and KL-GPDDQ under different parameter settings, wherein:
(a) learning curves of DDQ at M=5000, 3500, 2000, 1000, K=20, N=4;
(b) learning curves of KL-GPDDQ at M=5000, 3500, 2000, 1000, K=20, N=4;
(c) learning curves of DDQ at M=5000, 3500, 2000, 1000, K=30, N=4;
(d) learning curves of KL-GPDDQ at M=5000, 3500, 2000, 1000, K=30, N=4;
FIG. 8 shows learning curves of D3Q, DDQ, GPDDQ, UN-GPDDQ and KL-GPDDQ under different parameter settings, wherein:
(a) learning curves at M=5000, K=20, N=4;
(b) learning curves at M=5000, K=30, N=4.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.
As shown in fig. 1, the present scheme proposes a GP-based deep Dyna-Q method for dialogue strategy learning whose basic procedure is consistent with the prior art: for example, the dialogue strategy model and the world model are initialized with human conversation data, and dialogue strategy learning then starts from them. Dialogue strategy learning for the dialogue strategy model mainly comprises two parts: direct reinforcement learning and indirect reinforcement learning (also called planning). In direct reinforcement learning, a Deep Q-Network (DQN) is used to improve the dialogue strategy from real experience: the dialogue strategy model interacts with a user, and at each step selects the action a to execute according to the observed dialogue state s so as to maximize the value function Q. The dialogue strategy model then receives a reward r and the real user's action a_u^r, updates the current state to s', and stores the real experience (s, a, r, a_u^r, t) in the real user experience library, where t indicates whether the conversation has terminated.
The value function Q(s, a; θ_Q) to be maximized is approximated by a DNN and is updated by iteratively optimizing θ_Q to reduce the mean-squared loss. The loss function is as follows:

$L(\theta_Q) = \mathbb{E}\!\left[\big(r + \gamma \max_{a'} Q'(s', a'; \theta_{Q'}) - Q(s, a; \theta_Q)\big)^2\right]$

where $\gamma$ is the discount factor and $Q'(\cdot\,; \theta_{Q'})$ is a separate target network. In each iteration, the Q network is improved with mini-batch deep learning, and it can be trained with several optimization algorithms, such as Adam, stochastic gradient descent and RMSprop.
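For illustration, the following is a minimal PyTorch-style sketch of the mean-squared TD loss with a separate target network; the batch layout, the discount value and the use of PyTorch are assumptions made here for illustration, not part of the patent.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.9):
    """Mean-squared TD loss for Q(s, a; theta_Q) with a separate target
    network Q'; `batch` is assumed to hold tensors (s, a, r, s_next, done)."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta_Q)
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values       # max_a' Q'(s', a'; theta_Q')
        target = r + gamma * (1.0 - done) * q_next          # TD target
    return F.mse_loss(q_sa, target)
```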
During indirect reinforcement learning, the dialogue strategy model improves its dialogue strategy by interacting with the world model in order to reduce training cost. The frequency of planning is controlled by the parameter K, which means that K planning steps are performed for each step of direct reinforcement learning. When the world model captures the characteristics of the real environment accurately, K tends to be set large. At each planning step, the world model produces a response action a_u^w from the current state s and the last agent action a, generating the simulation experience (s, a, r, a_u^w, t') during planning.
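The interleaving of direct reinforcement learning and planning can be sketched as follows; the agent, environment and world-model interfaces (select_action, step, predict, update) are hypothetical names used only to illustrate the control flow of K planning steps per real step.

```python
def dyna_q_step(agent, real_env, world_model, real_buffer, sim_buffer, K):
    """One direct-RL step with the real user followed by K planning steps
    with the world model (all interfaces here are illustrative assumptions)."""
    # direct reinforcement learning: interact with the real user
    s = real_env.current_state()
    a = agent.select_action(s)                     # maximize Q(s, a)
    r, a_u_real, t = real_env.step(a)
    real_buffer.append((s, a, r, a_u_real, t))     # store real experience
    agent.update(real_buffer)

    # planning (indirect reinforcement learning): interact with the world model
    for _ in range(K):
        s_p = agent.sample_start_state()
        a_p = agent.select_action(s_p)
        a_u_sim, r_p, t_p = world_model.predict(s_p, a_p)
        sim_buffer.append((s_p, a_p, r_p, a_u_sim, t_p))
    agent.update(sim_buffer)
```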
In particular, on top of the above prior art, the present scheme proposes to construct the world model as a Gaussian process model, which is easier to analyze than the traditional DNN model and can generate high-quality simulation experience to supplement the limited real experience. In addition, the scheme provides a completely new way to evaluate the quality of simulation experience: based on the Kullback-Leibler (KL) divergence, simulation experience is compared directly with real experience, so its quality can be controlled effectively. By using the KL divergence to check the distribution of the experience, no additional quality detector needs to be trained, so the quality of the simulation experience can be evaluated more easily, and computational efficiency is greatly improved while the robustness and effectiveness of the dialogue strategy are preserved.
Specifically, as shown in fig. 1, the method includes the following steps:
s1, generating simulation experience by world model prediction based on GP;
s2, performing quality detection on the simulation experience by using a quality detector based on KL divergence;
and S3, storing the simulation experience qualified in the quality detection into a buffer, and training the dialogue strategy model by using the simulation experience stored into the buffer.
Specifically, the world model of this embodiment is denoted W(s, a; θ_w), where s is the current state, a is the last response action, and θ_w represents the parameters of the respective GP models. As shown in fig. 2 and fig. 3, the world model consists of three GP models GP1, GP2, GP3, each parameterized by a different θ_w. The three GP models are used to generate the response action a_u, the reward r and the variable t, respectively, and the simulation experience is expressed as e = (a_u, r, t).
Specifically, this embodiment generates a meta simulation experience e_i = (a_u^i, r^i, t^i) through the three GP models and, by taking the 50% confidence intervals of the response action a_u^i, the reward r^i and the variable t^i, obtains an upper-bound simulation experience e_l = (a_u^l, r^l, t^l) and a lower-bound simulation experience e_b = (a_u^b, r^b, t^b). That is, each prediction yields three simulation experiences e_i, e_l, e_b, whose quality is measured with the KL divergence as described in detail below.
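A minimal sketch of how the three experiences e_i, e_l and e_b could be assembled from the GP predictive distributions is given below; the predict(x) -> (mean, std) interface and the use of z ≈ 0.674 for the central 50% interval of a Gaussian are assumptions made for illustration.

```python
Z_50 = 0.6745   # z-value of the central 50% confidence interval of a Gaussian

def build_experiences(gp_action, gp_reward, gp_term, x):
    """Assemble the meta (e_i), upper-bound (e_l) and lower-bound (e_b)
    simulation experiences from the three GP predictive distributions;
    each gp_* object is assumed to expose predict(x) -> (mean, std)."""
    mu_a, sd_a = gp_action.predict(x)
    mu_r, sd_r = gp_reward.predict(x)
    mu_t, sd_t = gp_term.predict(x)

    e_i = (mu_a, mu_r, mu_t)                                             # predictive means
    e_l = (mu_a + Z_50 * sd_a, mu_r + Z_50 * sd_r, mu_t + Z_50 * sd_t)   # upper bounds
    e_b = (mu_a - Z_50 * sd_a, mu_r - Z_50 * sd_r, mu_t - Z_50 * sd_t)   # lower bounds
    return e_i, e_l, e_b
```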
Unlike DDQ, the world model here is essentially a regression model for generating the user action a_u. Considering that a user action should be an integer within a limited action domain, the present scheme further post-processes the actions generated by the world model:
first, when a predicted response action a_u is not an integer (the GP-based world model of this scheme is a regression model, so non-integer response actions are common), a_u is rounded to the nearest integer, a_u^l is replaced by the nearest integer above it, and a_u^b is replaced by the nearest integer below it; second, when a predicted response action a_u falls outside the defined action domain, the upper or lower bound of the action domain is selected directly.
In particular, in the GP regression problem of the world model, the observed target $y$ is generated from the latent function $f$ by adding independent Gaussian noise:

$y = f(x) + \varepsilon, \quad f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big), \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$

where $m(x)$ represents the mean; $k(x, x')$ is the kernel function; $\varepsilon$ is independent Gaussian noise with mean 0 and variance $\sigma_n^2$; and $I$ is an identity matrix.

According to the Bayesian principle, given the training data $(X, y)$ and a test input $x_*$, the conditional mean and covariance of the posterior distribution are as follows:

$\mu_* = m(x_*) + k(x_*, X)\,\big[K + \sigma_n^2 I\big]^{-1}\,\big(y - m(X)\big)$

$\sigma_*^2 = k(x_*, x_*) - k(x_*, X)\,\big[K + \sigma_n^2 I\big]^{-1}\,k(X, x_*)$

where $K = k(X, X)$ denotes the covariance matrix of the training inputs.
The GP1 model generates the action a_u, in which case the action a_u is the observed target y; the GP2 model generates the reward r, in which case the reward r is the observed target y; and the GP3 model generates the variable t, in which case t is the observed target y.
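The posterior formulas above can be computed directly with a few lines of linear algebra; the sketch below assumes the kernel and mean functions are supplied as plain callables and is meant only to illustrate the equations, not the patent's implementation.

```python
import numpy as np

def gp_posterior(X, y, x_star, kernel, mean_fn, noise_var):
    """Posterior mean and variance at a single test input x_star, following
    the formulas above; kernel(A, B) and mean_fn(A) are plain callables and
    all names are illustrative."""
    K = kernel(X, X) + noise_var * np.eye(len(X))       # K(X, X) + sigma_n^2 I
    k_star = kernel(X, x_star[None, :]).ravel()         # k(X, x_*)
    alpha = np.linalg.solve(K, y - mean_fn(X))          # (K + sigma_n^2 I)^{-1} (y - m(X))
    mu_star = mean_fn(x_star[None, :])[0] + k_star @ alpha
    v = np.linalg.solve(K, k_star)
    var_star = kernel(x_star[None, :], x_star[None, :])[0, 0] - k_star @ v
    return mu_star, var_star
```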
Preferably, the kernel function is a Matérn kernel:

$k(r) = \sigma_f^2 \,\frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$

where $\sigma_f$ and $\ell$ are the amplitude and length-scale parameters, respectively; $\Gamma(\cdot)$ is the gamma function; $K_\nu(\cdot)$ is the modified Bessel function of the second kind; $\nu$ is a positive parameter of the covariance; and $r$ represents the distance between the observed values. For multidimensional inputs, an automatic relevance determination (ARD) version of the kernel can be introduced to handle this case.
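For reference, a plain NumPy/SciPy sketch of the Matérn covariance is given below; the default parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.special import gamma, kv   # gamma function, modified Bessel function of the 2nd kind

def matern_kernel(r, sigma_f=1.0, length_scale=1.0, nu=1.5):
    """Matern covariance k(r) evaluated at distances r >= 0; sigma_f is the
    amplitude, length_scale the length scale (default values are illustrative)."""
    r = np.atleast_1d(np.asarray(r, dtype=float))
    k = np.empty_like(r)
    zero = (r == 0.0)
    k[zero] = sigma_f ** 2                               # limit of k(r) as r -> 0
    scaled = np.sqrt(2.0 * nu) * r[~zero] / length_scale
    k[~zero] = (sigma_f ** 2 * 2.0 ** (1.0 - nu) / gamma(nu)
                * scaled ** nu * kv(nu, scaled))
    return k
```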
In each round of learning of the world model, the current state s and the last agent action a are concatenated as the input to the world model. Here all GP priors are specified by a mean function and a Matérn kernel, and the world model W(s, a; θ_w) is trained to simulate the real dialogue environment. Specifically, as shown in fig. 2, the loss function is set to the sum of the negative log marginal likelihoods (NLL) of the three GP models (denoted "summation of three NLLs" in fig. 2). Thanks to conjugacy, each NLL can be computed analytically, and its general form can be written as:

$\mathrm{NLL} = \frac{1}{2}\, y^{\top} \big(K + \sigma_n^2 I\big)^{-1} y + \frac{1}{2}\, \log\big|K + \sigma_n^2 I\big| + \frac{n}{2}\, \log 2\pi$

where $|\cdot|$ denotes the determinant of a matrix and n is the number of training data points. In the training phase, the world model W(s, a; θ_w) is refined at the end of each iteration using real experience and the L-BFGS-B algorithm.
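One possible way to realize this training step with an off-the-shelf GP library is sketched below using scikit-learn, whose GaussianProcessRegressor minimizes the negative log marginal likelihood with L-BFGS-B by default; the kernel composition, noise level and feature construction are assumptions, not the patent's reference implementation.

```python
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern, WhiteKernel

def fit_world_model(X, y_action, y_reward, y_term, nu=1.5):
    """Fit the three GP models of the world model W(s, a; theta_w).
    X stacks the concatenated (state, last action) vectors; the targets are
    the user action, the reward and the termination variable."""
    def make_gp():
        kernel = (ConstantKernel(1.0) * Matern(length_scale=1.0, nu=nu)
                  + WhiteKernel(noise_level=1e-2))      # amplitude * Matern + Gaussian noise
        return GaussianProcessRegressor(kernel=kernel, normalize_y=True)

    gps = [make_gp().fit(X, y) for y in (y_action, y_reward, y_term)]
    return gps  # [GP1 (action), GP2 (reward), GP3 (termination)]
```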
Further, in the present embodiment, the structure of the quality detector based on KL divergence is shown in fig. 4, and the detection method includes:
user actions a generated from a world modelu wStoring the user action a generated by the real user into a word library world-factu rStoring the obtained word into a word stock real-dit; the primary keys of the word stock real-fact and the word stock world-fact are user actions, and the corresponding values are the frequencies corresponding to the user actions;
frequency values of intersection main keys of the word stock real-fact and the word stock world-fact in the two word stocks are stored in a word stock same-fact established in advance, and similarity is measured by KL divergence (KL divergence);
in the initial stage, the word stock world-fact has only limited behaviors/actions, so the word stock same-fact length is also very small, and in order to preheat the world model, preferably, when the word stock same-fact length is smaller than a constant C, the simulation experience is regarded as qualified. The constant C is determined by one skilled in the art on a case-by-case basis and is not limited herein.
The similarity measure is that a variable KL is defined in advancepreThe variable KLpreIs set to a larger value for tracking the KL divergence between the lexicon real-fact and the lexicon world-fact. When the length of the thesame same-dit reaches a certain value, namely is larger than or equal to the constant C, calculating the current KL divergence based on the thesame same-dit, and if the current KL divergence is smaller than or equal to KLpreThen it means that the current experience is detected as a qualified experience since the current experience makes the world model more similar to the real user, and the qualified experience is pushed into the buffer MpFor training a dialogue strategy model.
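The detector can be sketched as follows; the lexicons are modeled as Counters keyed by user action, and the constant C, the initial value of KL_pre and the update of KL_pre on acceptance are illustrative assumptions drawn from the description above.

```python
from collections import Counter
import math

class KLQualityDetector:
    """Sketch of the KL-divergence quality detector: the lexicons real-fact,
    world-fact and same-fact are modeled with Counters / key sets keyed by
    user action; C, the initial KL_pre and its update rule are illustrative."""

    def __init__(self, C=20, kl_init=1e9):
        self.real_fact = Counter()    # actions generated by real users
        self.world_fact = Counter()   # actions generated by the world model
        self.C = C
        self.kl_pre = kl_init         # tracked KL divergence, starts large

    def add_real(self, action):
        self.real_fact[action] += 1

    def check(self, action):
        """Return True if the simulation experience that produced `action` qualifies."""
        self.world_fact[action] += 1
        same_keys = set(self.real_fact) & set(self.world_fact)  # same-fact primary keys
        if len(same_keys) < self.C:   # warm-up: the world model knows too few actions
            return True
        p_raw = [self.real_fact[k] for k in same_keys]
        q_raw = [self.world_fact[k] for k in same_keys]
        p_sum, q_sum = sum(p_raw), sum(q_raw)
        kl = sum((pi / p_sum) * math.log((pi / p_sum) / (qi / q_sum))
                 for pi, qi in zip(p_raw, q_raw))
        if kl <= self.kl_pre:         # experience brings world model closer to the real user
            self.kl_pre = kl
            return True
        return False
```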
The method improves the quality of the simulation experience from two directions: on one hand, high-quality simulation experience is generated at the generation stage of the world model; on the other hand, the quality of the simulation experience is effectively evaluated during checking. In this way, the quality of the generated experience is improved at the source and further verified afterwards, and unqualified simulation experience is eliminated so that low-quality experience does not degrade the performance of the model.
To demonstrate the effectiveness and superiority of the present solution, it was tested in the movie ticket purchase task and compared with other methods in two ways:
1) variation of performance under different superparameters
2) Comparison of Performance
1.1 data set
The same raw data as in the conventional DDQ method is used. It was collected via Amazon Mechanical Turk and manually labeled according to a schema defined by domain experts, which contains 11 dialogue acts and 16 slots; the data set contains 280 annotated dialogues in total, with an average length of about 11 turns.
1.2 dialog Agents used as reference
Different versions of task-completion dialogue agents are provided as performance benchmarks for the scheme:
• GPDDQ(M, K, N) is the agent learned by the GPDDQ method of the present scheme, where M is the buffer size, K is the number of planning steps and N is the batch size. The initial world model is pre-trained with human conversation data. Neither the uncertainty attribute (i.e. no confidence intervals are computed) nor the KL divergence check is used;
• UN-GPDDQ(M, K, N) is similar to GPDDQ(M, K, N), but uncertainty is taken into account: the world model returns e_l, e_i, e_b in its prediction phase;
• KL-GPDDQ(M, K, N) adds the KL divergence check on top of UN-GPDDQ(M, K, N);
• GPDDQ(M, K, N, rand-init θ_w) is an agent learned by the GPDDQ method, but its world model is initialized randomly: r and t are sampled randomly from the corresponding GP models, and the action a_u is sampled uniformly from the defined action domain;
• GPDDQ(M, K, N, fixed θ_w) trains the world model with human conversation data only in the warm-up stage, after which the world model is left unchanged;
• GPDQN(M, K, N) is obtained by direct reinforcement learning only; assuming its world model matches real users perfectly, its performance can be regarded as the upper bound of GPDDQ(M, K, N).
1.3 analysis of parameters
To show the advantage of the proposed model in terms of sensitivity to hyper-parameter changes, a series of experiments is performed in which the corresponding parameters are varied, such as the batch size, the number of planning steps, the parameter updating strategy and the buffer size.
1.3.1 batch size and planning step
In this set of experiments, the batch size is set to 16 and 4 and the agents are trained with different numbers of planning steps K. The main results are shown in fig. 5; statistically, GPDDQ clearly outperforms DDQ. As can be seen from fig. 5(a) and fig. 5(b), at the same value of K the converged success rate of GPDDQ is far better than that of DDQ: the success rate of GPDDQ converges to around 0.8, while that of DDQ is 0.74. As the number of planning steps increases, learning generally becomes faster, which matches the intuition that a larger number of planning steps brings a higher learning speed. Nevertheless, at K=20 and K=10 the learning curves do not differ much, because an excessively large K degrades the quality of the simulation experience. In practical applications, the best value of K has to be found to achieve the best balance between the quantity and the quality of the simulation experience.
Since the GP method is more robust to the influence of the hyper-parameters, it can be expected to perform better with small batches, so small-batch tests were further performed in this set of experiments. As shown in fig. 5(c) and fig. 5(d), the batch size is reduced to 4 while the other parameters are unchanged, and even at K=0 the performance of GPDDQ still exceeds DDQ. More importantly, there is no significant degradation in performance compared with the results for a batch size of 16. In contrast, for the DDQ method only the learning curve at K=10 is stronger than at K=0 in terms of success rate, and its performance drops sharply when K is increased to 20, which is caused by insufficient training of the DNN when the batch size is too small.
1.3.2 parameter update policy
In this set of experiments, M=5000, K=10 and N=16 are set, and the parameter updating strategy is varied; the results are shown in fig. 6. The experimental results show that the quality of the world model has a large influence on the performance of the agent. As shown in fig. 6, the DQN and GPDQN methods are completely model-free methods with K times the amount of training data of the other methods. Because of the randomness of the two, their curves differ slightly but are essentially the same. Clearly, the world model that is fixed after the warm-up stage produces the worst results. The large drop of the DDQ learning curve after 250 iterations is caused by the lack of training data, whereas the peak of each GPDDQ learning curve is essentially the same as that of DQN, and even with different parameter updating strategies the final success rate does not fluctuate much.
1.3.3 buffer size
In this set of experiments, the KL-GPDDQ method is evaluated by varying the size of the buffer. As shown in fig. 7, in terms of overall performance the proposed method is more stable under different conditions, including but not limited to different buffer sizes and numbers of planning steps. After the buffer size is reduced from 5000 to 1000, the learning curve of the proposed method does not change noticeably, whereas the performance of the DDQ method changes markedly. This happens because the world model built with a DNN in DDQ generates low-quality experience during planning, but, owing to the smaller buffer size, high-quality experience unexpectedly dominates the buffer, resulting in improved performance.
As for convergence, the success rate of the KL-GPDDQ method converges to about 0.8 after 200 iterations when K=20, whereas the DDQ method has not converged after 200 iterations, its success rate generally fluctuates below that of the proposed method, and its final converged success rate is lower. These experimental results fully demonstrate that the proposed method performs better and is more robust when a relatively small buffer is used.
1.4 Performance alignment
To demonstrate the performance of the present scheme, it is compared with other algorithms, as shown in Table 1. From Table 1 it can be seen that the DDQ method still performs the worst of all five agents. From the results of the GPDDQ, UN-GPDDQ and KL-GPDDQ agents it is evident that the KL divergence check of the present scheme is very helpful for improving performance, with clear gains in success rate and reward. Compared with DDQ, the method can improve the success rate by about 20 percent with less interaction with the user.
Table 1: experimental results of the different agents at training iterations {100, 200, 300}, with K = 20 and buffer size 5000 (Su = Success rate, Tu = Turns, Re = Reward).
In addition, as can be seen from fig. 8, the learning speed of the proposed method is much higher than that of DDQ and D3Q. It should be noted that the curve of D3Q fluctuates strongly and is very unstable; in particular, when K=30, D3Q does not even converge to an optimal value. So even though D3Q can cull low-quality experience, it remains difficult to use in practice because the GAN is too unstable.
From the above experiments, we can see that compared with the method based on the DDQ framework in the prior art, the scheme has obvious advantages, such as improving the system efficiency and the robustness.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Various modifications or additions may be made to the described embodiments or alternatives may be employed by those skilled in the art without departing from the spirit or ambit of the invention as defined in the appended claims.
Although the terms simulation experience, real experience, quality detector, human conversation data, GP model, world model, buffer, dialogue strategy model, real user experience library, and so on are used frequently herein, the possibility of using other terms is not excluded. These terms are used merely to describe and explain the nature of the invention more conveniently; construing them as imposing any additional limitation would be contrary to the spirit of the invention.

Claims (7)

1. A GP-based deep Dyna-Q method for dialogue strategy learning, characterized by comprising the steps of:
s1, generating simulation experience by a world model based on GP;
s2, performing quality detection on the simulation experience by using a quality detector based on KL divergence;
s3, training a dialogue strategy model by using simulation experience qualified in quality detection;
in step S1, the world model comprises three GP models, which are respectively used to generate the response action a_u, the reward r and the variable t;
the world model generates a meta simulation experience e_i = (a_u^i, r^i, t^i) through the three GP models and, by taking the 50% confidence intervals of the response action a_u^i, the reward r^i and the variable t^i, obtains an upper-bound simulation experience e_l = (a_u^l, r^l, t^l) and a lower-bound simulation experience e_b = (a_u^b, r^b, t^b), and the quality detector separately checks the quality of the upper-bound simulation experience e_l, the lower-bound simulation experience e_b and the meta simulation experience e_i;
when a predicted response action a_u is not an integer, a_u is rounded to the nearest integer; when a predicted response action a_u falls outside the defined action domain, the upper or lower bound of the action domain is selected directly;
step S2 specifically includes:
s21, storing user actions generated by the world model in a lexicon world-fact, and storing user actions generated by real users in a lexicon real-fact;
s22, measuring the similarity between the lexicon real-fact and the lexicon world-fact with the KL divergence, and evaluating the quality of the simulation experience accordingly.
2. The GP-based deep Dyna-Q method for dialogue strategy learning according to claim 1, wherein in step S2 the simulation experience that passes quality detection is stored in a buffer for training the dialogue strategy model.
3. The GP-based deep Dyna-Q method for dialogue strategy learning according to claim 1, wherein the world model is denoted W(s, a; θ_w), where s is the current state, a is the last response action, and θ_w represents the parameters of the respective GP models.
4. The GP-based deep Dyna-Q method for dialogue strategy learning according to claim 3, wherein the GP model takes the following form:

$y = f(x) + \varepsilon, \quad f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big), \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2 I)$

where $m(x)$ represents the mean; $k(x, x')$ is the kernel function; $\varepsilon$ is Gaussian noise; $\sigma_n^2$ is its variance; and $I$ is an identity matrix;

the kernel function takes the following form:

$k(r) = \sigma_f^2 \,\frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right)$

where $\sigma_f$ and $\ell$ are the amplitude and length-scale parameters, respectively; $\Gamma(\cdot)$ is the gamma function; $K_\nu(\cdot)$ is the modified Bessel function of the second kind; $\nu$ is a positive parameter of the covariance; and $r$ represents the distance between the observed target values.
5. The GP-based deep Dyna-Q method for dialogue strategy learning according to any one of claims 1 to 4, wherein the primary keys of the lexicon real-fact and the lexicon world-fact are both user actions, and the corresponding values are the frequencies of those user actions.
6. The GP-based deep Dyna-Q method for dialogue strategy learning according to claim 5, wherein step S22 specifically comprises:
S221, tracking the KL divergence between the lexicon real-fact and the lexicon world-fact with a predefined variable KL_pre;
S222, storing, in a pre-established lexicon same-fact, the frequency values that the primary keys in the intersection of the lexicon real-fact and the lexicon world-fact have in the two lexicons;
S223, computing the current KL divergence based on the lexicon same-fact; if the current KL divergence is less than or equal to KL_pre, the current experience is detected as a qualified experience.
7. The GP-based deep Dyna-Q method for dialogue strategy learning according to claim 6, wherein in step S22 the frequency values that the primary keys in the intersection of the lexicon real-fact and the lexicon world-fact have in the two lexicons are stored in the lexicon same-fact, and the current experience is judged to be qualified when the length of the lexicon same-fact is smaller than a constant C.
CN202110532520.7A 2021-05-17 2021-05-17 GP-based deep Dyna-Q method for dialogue strategy learning Active CN113392956B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110532520.7A CN113392956B (en) 2021-05-17 2021-05-17 GP-based deep Dyna-Q method for dialogue strategy learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110532520.7A CN113392956B (en) 2021-05-17 2021-05-17 GP-based deep Dyna-Q method for dialogue strategy learning

Publications (2)

Publication Number Publication Date
CN113392956A CN113392956A (en) 2021-09-14
CN113392956B true CN113392956B (en) 2022-02-11

Family

ID=77617062

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110532520.7A Active CN113392956B (en) 2021-05-17 2021-05-17 GP-based deep Dyna-Q method for dialogue strategy learning

Country Status (1)

Country Link
CN (1) CN113392956B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114647986B (en) * 2022-04-18 2023-08-08 南湖实验室 Intelligent decision method and system for realizing continuity action decision based on GP and PPO
CN114492215A (en) * 2022-04-18 2022-05-13 南湖实验室 GP world model for assisting training by utilizing strategy model and training method thereof

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962238A (en) * 2018-04-25 2018-12-07 苏州思必驰信息科技有限公司 Dialogue method, system, equipment and storage medium based on structural neural networks
CN111795700A (en) * 2020-06-30 2020-10-20 浙江大学 Unmanned vehicle reinforcement learning training environment construction method and training system thereof

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11222283B2 (en) * 2018-10-23 2022-01-11 International Business Machines Corporation Hierarchical conversational policy learning for sales strategy planning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108962238A (en) * 2018-04-25 2018-12-07 苏州思必驰信息科技有限公司 Dialogue method, system, equipment and storage medium based on structural neural networks
CN111795700A (en) * 2020-06-30 2020-10-20 浙江大学 Unmanned vehicle reinforcement learning training environment construction method and training system thereof

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Switch-based Active Deep Dyna-Q:Efficient Adaptive Planning for Task-Completion Dialogue Policy Learning;YueXin Wu等;《arXiv》;20181119;第1-8页 *
A Survey of Task-Oriented Dialogue Systems; Zhao Yangyang et al.; Chinese Journal of Computers; 2020-10; Vol. 43, No. 10; pp. 1862-1886 *
Improved DDPG Dialogue Policy Optimization Algorithm; Zhao Yinjiang et al.; Computer Engineering and Design; 2021-02; Vol. 42, No. 2; pp. 411-417 *

Also Published As

Publication number Publication date
CN113392956A (en) 2021-09-14


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant