CN116628149A - Variational autoregressive dialogue generation device and method based on joint hidden variables - Google Patents

Variational autoregressive dialogue generation device and method based on joint hidden variables

Info

Publication number
CN116628149A
CN116628149A (application number CN202310482318.7A)
Authority
CN
China
Prior art keywords
knowledge
hidden
sentence
dialogue
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310482318.7A
Other languages
Chinese (zh)
Inventor
王博
马尚朝
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202310482318.7A priority Critical patent/CN116628149A/en
Publication of CN116628149A publication Critical patent/CN116628149A/en
Pending legal-status Critical Current

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3344 Query execution using natural language analysis
    • G06F40/35 Discourse or dialogue representation
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G06N5/025 Extracting rules from data
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a variational autoregressive dialogue generation device and method based on joint hidden variables, comprising an encoder module, a knowledge selection module and a variational autoregressive decoder module. The encoder module encodes the dialogue context and the knowledge sentence set respectively into feature representations in vector form, where the knowledge sentence set is retrieved from an external text library using the dialogue context and contains a number of knowledge sentences with labels; word-level and sentence-level encoded representations are constructed with a pre-trained language model based on a multi-layer self-attention mechanism. The knowledge selection module selects, from the knowledge sentence set, the knowledge sentence most semantically relevant to the dialogue context for reply generation, based on the dialogue context and the currently given knowledge sentence set. The variational autoregressive decoder module comprises a variational layer and stacked decoding layers: the variational layer computes the reply sequence hidden variables, and the final reply sentence is then generated by the stacked decoding layers.

Description

Variational autoregressive dialogue generation device and method based on joint hidden variables
Technical Field
The invention belongs to the technical field of natural language processing, and in particular relates to a dialogue generation device and method that integrate external unstructured knowledge and jointly handle knowledge selection and knowledge-aware reply generation.
Background
Dialogue generation techniques represented by sequence-to-sequence models are relatively mature, but they tend to produce short, generic replies, mainly because the dialogue system lacks knowledge. Knowledge is crucial for understanding and generating language, and external knowledge such as common sense and background knowledge is an important information source for organizing dialogue sentences; unstructured external knowledge such as encyclopedia articles, domain knowledge documents and social media comments is easier to obtain than structured knowledge (knowledge graphs). A dialogue system that incorporates external unstructured knowledge can identify the entities or topics the user mentions and relate them to facts in the real world, for example retrieving relevant background information, introducing new dialogue topics, and talking with the user in a proactive manner; such a system can also be improved simply by adding knowledge, giving it high extensibility. It is therefore necessary to introduce external unstructured knowledge into the dialogue system to improve the quality of dialogue replies, resulting in more informative and diversified dialogue.
In a dialogue system that incorporates unstructured external knowledge, the typical task framework includes two subtasks: knowledge selection and reply generation. Existing knowledge selection methods use either the reply sentence or the labelled knowledge as posterior information. For example, the PostKS algorithm uses the reply sentence as posterior information: without relying on labelled knowledge, it models the posterior distribution of knowledge selection and improves knowledge accuracy by reducing the gap between the prior and posterior distributions of knowledge selection before carrying out reply generation, but its knowledge selection is based on a single dialogue round. The SKLS algorithm models knowledge selection as sequential hidden variables in a multi-round dialogue scenario and improves the accuracy of knowledge selection by relying on the dialogue context and the previously selected knowledge. The PIPM algorithm further improves SKLS: to address the problem that the prior knowledge selection module may choose inaccurate knowledge in the testing stage because it has no access to posterior information, which in turn harms knowledge awareness during reply generation, it supplies predicted reply sentence information as a supplement, although the posterior information for knowledge selection still comes from the reply sentence. Researchers have also proposed a difference-aware knowledge selector that considers differences between the knowledge selected in the current and previous rounds to support smooth transitions in knowledge selection. In addition, the DukeNet algorithm considers knowledge tracking and knowledge shifting and improves the accuracy of both based on a dual learning paradigm; the CoLV algorithm uses labelled knowledge as posterior information, considers the connection between knowledge selection and reply generation, and improves the diversity of reply generation through a joint hidden variable model, in which the hidden variable for reply generation is a single global hidden variable.
Research analysis shows that the prior art still faces the following challenges: (1) the posterior distribution of the knowledge selection hidden variable is computed from the dialogue context and the labelled knowledge, while the prior distribution is computed from the dialogue context only; the gap between the prior and posterior distributions can lead to unreasonable choices in the knowledge selection stage and thus degrade the quality of the reply sentence; (2) the hidden variable for reply generation is a single global, sentence-level hidden variable, which is insufficient to model the complex semantics and diversity of reply generation and may cause the global hidden variable to be ignored during autoregressive decoding.
Disclosure of Invention
The invention aims to overcome the above defects in the prior art and provides a variational autoregressive dialogue generation device and method based on joint hidden variables, which not only use complementary posterior information to improve the accuracy of knowledge selection, but also model the diversity of knowledge selection and of reply generation through a conditional variational autoencoder, while a variational autoregressive decoder models the reply hidden variable at each decoding step, so that replies are more diverse and more informative.
The invention aims at realizing the following technical scheme:
A variational autoregressive dialogue generation device based on joint hidden variables comprises an encoder module, a knowledge selection module and a variational autoregressive decoder module;
the encoder module is used for encoding the dialogue context and the knowledge sentence set respectively into feature representations in vector form, the knowledge sentence set containing a number of knowledge sentences with labels; word-level and sentence-level encoded representations are constructed with a pre-trained language model based on a multi-layer self-attention mechanism;
the knowledge selection module selects, based on the dialogue context and the currently given knowledge sentence set, the knowledge sentence most semantically relevant to the dialogue context from the knowledge sentence set for reply generation; it models the posterior distribution and the prior distribution of the knowledge selection hidden variable, where the knowledge selection hidden variable is a hidden variable that follows a categorical distribution over the knowledge sentence set conditioned on the dialogue context; the posterior distribution is modelled from the labelled knowledge sentence and the dialogue context, and the prior distribution is modelled using predicted posterior information and the dialogue context, where the predicted posterior information refers to the predicted reply sentence information; the prior distribution is made to approximate the posterior distribution in the training stage, so that in the testing stage the prior distribution can be used to select a knowledge sentence, which is sent to the variational autoregressive decoder module;
the variational autoregressive decoder module comprises a variational layer and stacked decoding layers; the variational layer computes the reply sequence hidden variables, which in the training stage include both posterior and prior sequence hidden variables, and the selected reply sequence hidden variables are fused with the decoding hidden states of the prior path and passed to the decoding layers, where the selected reply sequence hidden variables are the posterior and prior reply sequence hidden variables in the training and testing stages respectively, and the decoding hidden states of the prior path refer to the hidden state representations obtained by passing the already generated reply sequence through the variational layer; the final reply sentence is then generated by the stacked decoding layers.
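The overall composition described above can be pictured with a short structural sketch. The sketch below assumes PyTorch and the HuggingFace transformers library; every class, layer and dimension name is illustrative and does not reproduce the patent's reference implementation (in particular, the real decoding layers use two cross-attention sub-layers and a copy mechanism, which are elided here).

```python
# Structural sketch only -- module names and shapes are illustrative assumptions.
import torch.nn as nn
from transformers import BertModel


class KnowledgeSelector(nn.Module):
    """Holds the projections used for the prior/posterior categorical distributions."""
    def __init__(self, d: int):
        super().__init__()
        self.w_post = nn.Linear(2 * d, d, bias=False)   # fuses context + labelled knowledge
        self.w_prior = nn.Linear(2 * d, d, bias=False)  # fuses context + predicted reply info


class VariationalAutoregressiveDecoder(nn.Module):
    """Variational layer (per-step latent variables) followed by stacked decoding layers."""
    def __init__(self, d: int, num_layers: int = 6, vocab_size: int = 30522):
        super().__init__()
        self.variational_layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        layer = nn.TransformerDecoderLayer(d_model=d, nhead=8, batch_first=True)
        self.decoding_layers = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.to_vocab = nn.Linear(d, vocab_size)


class JointLatentDialogueModel(nn.Module):
    def __init__(self, d: int = 768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-uncased")  # shared BERT encoder
        self.knowledge_selector = KnowledgeSelector(d)
        self.decoder = VariationalAutoregressiveDecoder(d)
```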
Further, the method comprises the steps of:
S1, encoding the dialogue context and the knowledge sentence set with the self-attention-based pre-trained language model BERT to construct word-level and sentence-level encoded representations;
S2, modelling the posterior distribution and prior distribution of the knowledge selection hidden variable; the posterior distribution is modelled from the labelled knowledge sentence and the dialogue context, the prior distribution is modelled using predicted posterior information and the dialogue context, and a hidden variable is sampled from the prior distribution to select a knowledge sentence, which is sent to the variational autoregressive decoder module;
S3, generating the reply sentence through the variational autoregressive decoder module; the variational autoregressive decoder module comprises a variational layer and stacked decoding layers; the variational layer computes the posterior and prior distributions of the reply sequence hidden variables and fuses the selected reply sequence hidden variables with the decoding hidden states of the prior path before passing them to the decoding layers, where the selected reply sequence hidden variables are the posterior and prior reply sequence hidden variables in the training and testing stages respectively, and the decoding hidden states of the prior path refer to the hidden state representations obtained by passing the already generated reply sequence through the variational layer; the stacked decoding layers generate the reply sentence based on the output of the variational layer, the dialogue context and the selected knowledge.
Further, in step S1, given the current-round dialogue context $C_t$ and knowledge sentence set $K_t$, BERT encoding is used to obtain the corresponding word-level feature representations: the word-level representation vector $H^{C_t}$ of the dialogue context and its average-pooled sentence-level representation vector $h^{C_t}$; for any sentence $K_{t,l}$ in the knowledge sentence set $K_t$, the word-level and sentence-level representation vectors $H^{K_{t,l}}$ and $h^{K_{t,l}}$ are obtained in the same way; the sentence-level representation of the whole knowledge sentence set $K_t$ is denoted $h^{K_t} \in \mathbb{R}^{L \times d}$, where $L$ is the knowledge sentence set size and $d$ is the hidden state dimension.
Further, step S2 includes:
S2.1, calculating the posterior distribution of the knowledge selection hidden variable $z_t^k$: concatenate the representation vectors of the dialogue context and the selected (labelled) knowledge, i.e. $[h^{C_t}; h^{K_{t,a}}]$, perform dot-product attention with the sentence-level representation $h^{K_t}$, and obtain the posterior hidden variable representation $q_\phi(z_t^k \mid C_t, K_{t,a})$ through a softmax normalization layer;
S2.2, calculating the prior distribution of the knowledge selection hidden variable $z_t^k$: the predicted reply sentence information is used to supplement the prior information for knowledge selection. Dot-product attention is first computed between the dialogue context and the sentence-level representation $h^{K_t}$, i.e. $\tilde{h}^{K_t} = \mathrm{Attn}(h^{C_t}, h^{K_t})$, where $\tilde{h}^{K_t}$ denotes the dialogue-aware fused feature representation of the knowledge sentence set and $\mathrm{Attn}$ denotes the attention computation;
the dialogue context and the fused knowledge sentence set feature representation are then concatenated, i.e. $[h^{C_t}; \tilde{h}^{K_t}]$, and a multi-layer perceptron predicts the word probability distribution $T$ of the reply sentence over the BERT vocabulary; $T$ is weighted and summed with the embedding representation $E$ of the whole BERT vocabulary, the result is concatenated with the dialogue context feature representation $h^{C_t}$ to obtain $\tilde{h}^{Y_t}$, and after a feature transformation dot-product attention is computed with the sentence-level representation $h^{K_t}$; the attention scores over the knowledge sentence set obtained through a softmax normalization layer serve as the prior hidden variable distribution $p_\theta(z_t^k \mid C_t, K_t)$;
S2.3, sampling a prior hidden variable from the prior distribution of the knowledge selection hidden variable $z_t^k$ to select a knowledge sentence $K_{t,sel}$: a sample is drawn from the prior hidden variable, which follows a categorical distribution, i.e. $z_t^k \sim p_\theta(z_t^k \mid C_t, K_t)$, and the corresponding knowledge sentence is sent to the variational autoregressive decoder module, where $\sim$ denotes sampling from the distribution; the reconstruction loss of the knowledge selection task in the knowledge selection module is $\mathcal{L}_{sel} = -\log p_\theta(K_{t,a} \mid C_t, K_t)$, where $K_{t,a}$ denotes the labelled knowledge.
Further, step S3 includes:
S3.1, modelling the posterior distribution of the reply sequence hidden variables through the variational layer: in the training stage, the word embedding representation and position encoding of the input reply sequence $Y_t$ are added element-wise as the initial input of the variational layer, denoted $S_0$; the variational layer then yields the posterior distribution parameters of the reply sequence hidden variables, namely the mean and variance vectors $\mu^{post}_{t,n}$ and $\sigma^{post}_{t,n}$, and the posterior hidden variable $z^{y}_{t,n}$ is obtained by sampling with the reparameterisation technique. The reply sequence hidden variables before time step $n$ are denoted $z^{y}_{t,<n}$, and the posterior distribution is abbreviated as $q_\phi(z^{y}_{t,n} \mid Y_t, z^{y}_{t,<n}, C_t, K_{t,sel}) = \mathcal{N}\big(\mu^{post}_{t,n}, (\sigma^{post}_{t,n})^2 I\big)$, from which the reply sequence hidden variable $z^{y}_{t,n}$ of time step $n$ is sampled;
where $C_t$ is the current-round dialogue context, $K_{t,sel}$ is the selected knowledge sentence, and $I$ denotes the identity matrix;
S3.2, modelling the prior distribution of the reply sequence hidden variables: this applies in both the training and testing stages; the variational layer yields the prior distribution parameters of the reply sequence hidden variables, namely the mean and variance vectors $\mu^{prior}_{t,n}$ and $\sigma^{prior}_{t,n}$, and the prior hidden variable of time step $n$ is obtained by sampling with the reparameterisation technique. The reply sequence before time step $n$ is denoted $Y_{t,<n}$, and the prior distribution is abbreviated as $p_\theta(z^{y}_{t,n} \mid Y_{t,<n}, z^{y}_{t,<n}, C_t, K_{t,sel}) = \mathcal{N}\big(\mu^{prior}_{t,n}, (\sigma^{prior}_{t,n})^2 I\big)$, from which the reply sequence hidden variable of time step $n$ is sampled;
S3.3, fusing the reply sequence hidden variables in the variational layer with the decoding hidden states of the prior path: the decoding hidden states in the prior path are fused with the selected reply sequence hidden variables $z^{y}_{t,n}$ to obtain the representation vector $S_1$, where the selected reply sequence hidden variables are the posterior and prior reply sequence hidden variables in the training and testing stages respectively; the hidden state vector $\tilde{S}_1$ sent into the decoding layers is then obtained after layer normalization;
S3.4, the stacked decoding layers generate the reply sentence based on the output of the variational layer, the dialogue context and the selected knowledge: the stacked decoding layers yield the decoding generation probability, the probability of copying from the selected knowledge and the probability of copying from the dialogue context, and these probabilities are weighted and summed to obtain the probability distribution of the final generated word over the BERT vocabulary; the word with the highest probability over the BERT vocabulary is obtained through the argmax function, and the final reply sentence is obtained by concatenating the highest-probability words.
Preferably, the stacked decoding layers are based on the Transformer decoder layer architecture, denoted Trs_Decs, and yield the hidden state representation $H^{dec}$ of the last layer.
The stacked decoding layers use decoding layers with a copy mechanism, and the probability distribution of the final generated word is a weighted sum of the decoding-layer generation distribution, the copy-from-selected-knowledge distribution and the copy-from-dialogue-context distribution. After the generated word sequence is processed by the decoder, the hidden state of the final position and the generation probability distribution over the BERT vocabulary are obtained. The hidden state of the final position is processed together with the word-level vector representation of the selected knowledge and of the dialogue context to obtain the corresponding copy-word probability distributions and relevance scores. The relevance scores are weighted and summed with the word-level vector representation of the selected knowledge and of the dialogue context, respectively, to obtain the attention representations of the selected knowledge and of the dialogue context. The hidden state of the final position and the attention representations of the selected knowledge and of the dialogue context are passed through a multi-layer perceptron to obtain the weight coefficients of the three probability distributions. Finally, the weight coefficients are weighted and summed with the corresponding probability distributions to obtain the final probability distribution of the generated word over the BERT vocabulary.
The invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the above variational autoregressive dialogue generation method based on joint hidden variables.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the variational autoregressive dialogue generation method based on joint hidden variables.
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1) The invention uses the knowledge sentence set and the dialogue context to predict the reply sentence information, which is then used together with the dialogue context as supplementary information for computing the prior distribution of the knowledge selection hidden variable. The knowledge selection task is modelled through the knowledge selection hidden variable, and the prior distribution of the knowledge selection hidden variable is made to approximate the posterior distribution in the training stage, so that knowledge sentences more semantically relevant to the dialogue context can be selected and the accuracy of the knowledge selection task in the testing stage is higher.
2) The invention uses the variational autoregressive decoder to model the reply sequence hidden variables, i.e. a reply hidden variable is modelled at each decoding time step, so that the generated reply sentences are more diverse, more informative and closer to human replies.
Drawings
FIG. 1 is a flow chart of the variational autoregressive dialogue generation method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of the operation of the variational autoregressive decoder module according to an embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment provides a variational autoregressive dialogue generation device based on joint hidden variables, which comprises an encoder module, a knowledge selection module and a variational autoregressive decoder module. The encoder module encodes the dialogue context and the knowledge sentence set respectively into feature representations in vector form, where the knowledge sentence set is retrieved from an external text library using the dialogue context and contains a number of knowledge sentences with labels; a pre-trained language model based on a multi-layer self-attention mechanism constructs the word-level and sentence-level encoded representations, which are used by the knowledge selection module and the variational autoregressive decoder module.
The knowledge selection module selects, based on the dialogue context and the currently given knowledge sentence set, the knowledge sentence most semantically relevant to the dialogue context from the knowledge sentence set for reply generation. Specifically, it models the posterior distribution and the prior distribution of the knowledge selection hidden variable, where the knowledge selection hidden variable is a hidden variable that follows a categorical distribution over the knowledge sentence set conditioned on the dialogue context; the posterior distribution is modelled from the labelled knowledge sentence and the dialogue context, and the prior distribution is modelled using predicted posterior information and the dialogue context, where the predicted posterior information refers to the predicted reply sentence information. The prior distribution is made to approximate the posterior distribution in the training stage, and in the testing stage the prior distribution is used to select a knowledge sentence, which is sent to the variational autoregressive decoder module.
The variational autoregressive decoder module generates the reply sentence based on the dialogue context and the selected knowledge sentence, and specifically comprises a variational layer and stacked decoding layers. The module uses reply sequence hidden variables to model the diversity of reply sentences, where diversity means that the reply sentences contain many different words, and a reply hidden variable is modelled at each decoding step. The variational layer computes the reply sequence hidden variables, which in the training stage include both posterior and prior sequence hidden variables, and the selected reply sequence hidden variables are fused with the decoding hidden states of the prior path and passed to the decoding layers, where the selected reply sequence hidden variables are the posterior and prior reply sequence hidden variables in the training and testing stages respectively, and the decoding hidden states of the prior path refer to the hidden state representations obtained by passing the already generated reply sequence through the variational layer; the final reply is then generated by the stacked decoding layers.
Step 1, the self-attention-based pre-trained language model BERT encodes the dialogue context and the knowledge sentence set to construct word-level and sentence-level encoded representations: given the current-round dialogue context $C_t$ and knowledge sentence set $K_t$, BERT encoding is used to obtain the corresponding word-level feature representations. For example, the special symbols [CLS] and [SEP] are added to the beginning and end of the sentences in the dialogue context (this operation is denoted SP), which are then fed into the BERT model to obtain the word-level representation vector of the dialogue context, and the sentence-level representation vector is obtained with AvgPool average pooling, as in the following formulas:
$H^{C_t} = \mathrm{BERT}(\mathrm{SP}(C_t)), \quad h^{C_t} = \mathrm{AvgPool}(H^{C_t})$
For any sentence $K_{t,l}$ in the knowledge sentence set $K_t$, the word-level and sentence-level representation vectors $H^{K_{t,l}}$ and $h^{K_{t,l}}$ are obtained through the same processing as for the dialogue context; the sentence-level representation vector of the whole knowledge sentence set $K_t$ is the matrix $h^{K_t} \in \mathbb{R}^{L \times d}$, where $L$ is the knowledge sentence set size and $d$ is the hidden state dimension.
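As an illustration of step 1, the following hedged sketch encodes a dialogue context and two knowledge sentences with a pre-trained BERT model and derives word-level and average-pooled sentence-level vectors; the model name, the example sentences and the variable names are assumptions, not the patent's data.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")


def encode(sentences):
    """Return word-level (B, T, d) and average-pooled sentence-level (B, d) representations."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    word_level = out.last_hidden_state                      # one vector per token ([CLS]/[SEP] added by the tokenizer)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    sent_level = (word_level * mask).sum(1) / mask.sum(1)   # AvgPool over real tokens only
    return word_level, sent_level


# Dialogue context C_t and the L candidate knowledge sentences K_t are encoded the same way.
H_C, h_C = encode(["do you like jazz music ?"])
H_K, h_K = encode(["Jazz originated in New Orleans .", "Jazz has roots in blues ."])
print(h_C.shape, h_K.shape)   # (1, 768) and (L=2, 768)
```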
Step 2, modelling the posterior distribution and prior distribution of the knowledge selection hidden variable: the posterior distribution is modelled from the labelled knowledge sentence and the dialogue context, the prior distribution is modelled using predicted posterior information and the dialogue context, and a hidden variable is sampled from the prior distribution to select a knowledge sentence, which is sent to the variational autoregressive decoder module. The specific processing is as follows:
Step 2.1, calculating the posterior distribution of the knowledge selection hidden variable $z_t^k$: the dialogue context is concatenated with the representation vector of the selected knowledge, where the selected knowledge is the knowledge sentence corresponding to the label; dot-product attention is then computed against the sentence-level representation vector of the whole knowledge sentence set and normalized by softmax to obtain the posterior distribution of the knowledge selection hidden variable, abbreviated as $q_\phi(z_t^k \mid C_t, K_{t,a})$, as in the following formula:
$q_\phi(z_t^k \mid C_t, K_{t,a}) = \mathrm{softmax}\big( h^{K_t} \, W_{post} \, [h^{C_t}; h^{K_{t,a}}]^{T} \big)$
where $[;]$ and $T$ denote vector concatenation and transposition respectively, and $W_{post}$ is a learnable parameter.
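A minimal sketch of the posterior computation in step 2.1, with random stand-in vectors for the encoded representations; the projection W_post, the scaling by sqrt(d) and all shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, L = 768, 2
h_C = torch.randn(d)          # sentence-level dialogue context h^{C_t}
h_K_label = torch.randn(d)    # sentence-level labelled knowledge h^{K_{t,a}}
h_K = torch.randn(L, d)       # sentence-level candidates h^{K_t}

W_post = nn.Linear(2 * d, d, bias=False)   # learnable projection (assumed shape)

query = W_post(torch.cat([h_C, h_K_label]))       # fuse context and labelled knowledge
scores = h_K @ query / d ** 0.5                   # dot-product attention over the L candidates
posterior = F.softmax(scores, dim=-1)             # q(z_t^k | C_t, K_{t,a}): categorical over sentences
```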
Step 2.2, calculating the prior distribution of the knowledge selection hidden variable: the invention proposes to supplement the prior information for knowledge selection with predicted reply sentence information. First, dot-product attention is computed between the sentence-level representation vectors of the dialogue context and of the knowledge sentence set, and the resulting scores are multiplied with the overall feature representation of the knowledge sentence set:
$q_P(K_t \mid C_t) = \mathrm{softmax}\big( h^{K_t} \, W_P \, (h^{C_t})^{T} \big), \quad \tilde{h}^{K_t} = q_P(K_t \mid C_t) \, h^{K_t}$
where $q_P(K_t \mid C_t)$ denotes the attention scores between the dialogue context and the knowledge sentence set, $W_P$ is a learnable parameter, and $\tilde{h}^{K_t}$ denotes the dialogue-aware fused feature representation of the knowledge sentence set.
The dialogue context and the fused knowledge sentence set feature representation are then concatenated, and a multi-layer perceptron MLP predicts the word probability distribution $T$ of the reply sentence over the vocabulary; $T$ is weighted and summed with the embedding representation $E$ of the whole vocabulary to serve as a bag-of-words semantic representation of the predicted reply sentence, which approximates the representation of the actual reply sentence and narrows the gap between the prior and posterior information in the knowledge selection task. Finally, the dialogue context and the predicted reply sentence information are concatenated to form a new query vector, as in the following formulas:
$T = \mathrm{softmax}\big( \mathrm{MLP}([h^{C_t}; \tilde{h}^{K_t}]) \big) \in \mathbb{R}^{|V|}, \quad \tilde{h}^{Y_t} = T E$
$p_\theta(z_t^k \mid C_t, K_t) = \mathrm{softmax}\big( h^{K_t} \, W_{prior} \, [h^{C_t}; \tilde{h}^{Y_t}]^{T} \big)$
where $|V|$ denotes the vocabulary size, $\tilde{h}^{Y_t}$ denotes the constructed knowledge selection prior information, $W_{prior}$ is a learnable parameter, $E$ denotes the vocabulary embedding matrix, and $p_\theta(z_t^k \mid C_t, K_t)$ denotes the prior distribution of the knowledge selection hidden variable. This embodiment uses a KL-divergence loss to constrain the distance between the posterior and prior distributions in the knowledge selection task, i.e.
$\mathcal{L}^{k}_{KL} = \mathrm{KL}\big( q_\phi(z_t^k \mid C_t, K_{t,a}) \,\|\, p_\theta(z_t^k \mid C_t, K_t) \big)$
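The construction of the prior in step 2.2 can be sketched as follows. The layer names (W_p, bow_mlp, W_prior), the stand-in embedding matrix E and the attention scaling are assumptions; only the overall flow (dialogue-aware knowledge fusion, bag-of-words reply prediction, prior attention, KL term) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, L, V = 768, 2, 30522
h_C, h_K = torch.randn(d), torch.randn(L, d)
E = torch.randn(V, d)                      # stand-in for the BERT vocabulary embedding matrix

W_p = nn.Linear(d, d, bias=False)
bow_mlp = nn.Sequential(nn.Linear(2 * d, d), nn.GELU(), nn.Linear(d, V))
W_prior = nn.Linear(2 * d, d, bias=False)

# dialogue-aware fusion of the knowledge sentence set
attn = F.softmax(h_K @ W_p(h_C) / d ** 0.5, dim=-1)         # q_P(K_t | C_t)
k_fused = attn @ h_K

# predicted reply as a bag-of-words distribution T over the vocabulary
T = F.softmax(bow_mlp(torch.cat([h_C, k_fused])), dim=-1)
reply_bow = T @ E                                            # weighted sum of vocabulary embeddings

query = W_prior(torch.cat([h_C, reply_bow]))
prior = F.softmax(h_K @ query / d ** 0.5, dim=-1)            # p(z_t^k | C_t, K_t)

# KL term that pulls the prior toward the posterior during training
posterior = torch.tensor([0.9, 0.1])                         # placeholder posterior over L = 2 candidates
kl_loss = F.kl_div(prior.log(), posterior, reduction="sum")  # KL(posterior || prior)
```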
Step 2.3, sampling a prior hidden variable from the prior distribution of the knowledge selection hidden variable $z_t^k$ to select a knowledge sentence $K_{t,sel}$: a sample is drawn from the prior hidden variable, which follows a categorical distribution, i.e. $z_t^k \sim p_\theta(z_t^k \mid C_t, K_t)$, and the corresponding knowledge sentence is sent to the variational autoregressive decoder module, where $\sim$ denotes sampling from the distribution. The reconstruction loss of the knowledge selection task in the knowledge selection module is $\mathcal{L}_{sel} = -\log p_\theta(K_{t,a} \mid C_t, K_t)$, where $K_{t,a}$ denotes the labelled knowledge.
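A small sketch of step 2.3, assuming a two-candidate categorical prior; sampling at test time and the negative log-likelihood reconstruction loss against the labelled index are shown with placeholder tensors.

```python
import torch
import torch.nn.functional as F

prior = torch.tensor([0.7, 0.3])            # p(z_t^k | ...), L = 2 candidate sentences
label_index = torch.tensor(0)               # index of the labelled knowledge K_{t,a}

# test time: draw the knowledge sentence index from the categorical prior
sel = torch.distributions.Categorical(probs=prior).sample()

# training time: negative log-likelihood of the labelled knowledge under the prior
recon_loss = F.nll_loss(prior.log().unsqueeze(0), label_index.unsqueeze(0))
print(sel.item(), recon_loss.item())
```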
Step 3, modelling the reply sequence hidden variables and reply generation through the variational autoregressive decoder module. The variational autoregressive decoder module comprises a variational layer and stacked decoding layers; the variational layer computes the posterior and prior distributions of the reply sequence hidden variables and fuses the selected reply sequence hidden variables with the decoding hidden states of the prior path before passing them to the decoding layers, where the selected reply sequence hidden variables are the posterior and prior reply sequence hidden variables in the training and testing stages respectively, and the decoding hidden states of the prior path refer to the hidden state representations obtained by passing the already generated reply sequence through the decoder. The stacked decoding layers generate the reply sentence based on the output of the variational layer, the dialogue context and the selected knowledge. The specific steps are as follows:
Step 3.1, modelling the posterior distribution of the reply sequence hidden variables through the variational layer: during the training stage, the word embedding representation and position encoding of the input reply sequence $Y_t$ are added element-wise as the initial input of the variational layer, denoted $S_0$. As shown in FIG. 2, this embodiment uses different mask matrices to compute the parameters of the posterior and prior distributions of the reply sequence hidden variables simultaneously in the training stage, without encoding the reply sequence in the reverse direction.
The variational layer, denoted Trs_Var, yields the posterior distribution parameters of the reply sequence hidden variables, namely the mean and variance vectors $\mu^{post}_{t,n}$ and $\sigma^{post}_{t,n}$, as in the following formulas:
$H^{var} = \mathrm{Trs\_Var}\big(S_0, H^{C_t}, H^{K_{t,sel}}\big), \quad \mu^{post}_{t,n} = \mathrm{MLP}_{\mu}\big(H^{var}_{n}\big), \quad \sigma^{post}_{t,n} = \mathrm{Softplus}\big(\mathrm{MLP}_{\sigma}(H^{var}_{n})\big)$
where $\mathrm{Trs\_Var}$ denotes the Transformer-based variational layer, $\mathrm{MLP}_{\mu}$ and $\mathrm{MLP}_{\sigma}$ denote multi-layer perceptron networks, and Softplus denotes an activation function that keeps the variance positive. The posterior hidden variable $z^{y}_{t,n}$ is obtained by sampling with the reparameterisation technique; the posterior distribution is abbreviated as
$q_\phi(z^{y}_{t,n} \mid Y_t, z^{y}_{t,<n}, C_t, K_{t,sel}) = \mathcal{N}\big(\mu^{post}_{t,n}, (\sigma^{post}_{t,n})^2 I\big)$
and the reply sequence hidden variable of time step $n$ sampled from the posterior distribution is
$z^{y}_{t,n} = \mu^{post}_{t,n} + \sigma^{post}_{t,n} \odot \epsilon$
where $n \in [0, N]$ denotes the time step, $\epsilon$ is a vector sampled from the standard normal distribution $\mathcal{N}(0, I)$, $z^{y}_{t,<n}$ denotes the reply sequence hidden variables before time step $n$, and $I$ denotes the identity matrix.
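A hedged sketch of the reparameterisation in step 3.1. A plain Transformer block stands in for the variational layer (the actual layer also attends to the dialogue context and selected knowledge and uses the mask matrices mentioned above); layer names and shapes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d, N = 768, 16                       # hidden size, reply length
S0 = torch.randn(1, N, d)            # word embedding + position encoding of Y_t

var_layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
mu_mlp, sigma_mlp = nn.Linear(d, d), nn.Linear(d, d)

h = var_layer(S0)                    # one state per decoding step n
mu = mu_mlp(h)                       # posterior means mu_{t,n}
sigma = F.softplus(sigma_mlp(h))     # Softplus keeps the variances positive

eps = torch.randn_like(mu)           # eps ~ N(0, I)
z_post = mu + sigma * eps            # z_{t,n}^y = mu + sigma * eps (reparameterisation trick)
```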
Step 3.2, modelling the prior distribution of the reply sequence hidden variables: this applies in both the training and testing stages, and the steps are similar to the posterior distribution calculation in step 3.1, except that a masking operation is needed when computing the multi-head self-attention. The variational layer yields the mean vector $\mu^{prior}_{t,n}$ and variance vector $\sigma^{prior}_{t,n}$ of the prior distribution of the reply sequence hidden variables, and the prior hidden variable of time step $n$ is obtained by sampling with the reparameterisation technique. The prior distribution is abbreviated as
$p_\theta(z^{y}_{t,n} \mid Y_{t,<n}, z^{y}_{t,<n}, C_t, K_{t,sel}) = \mathcal{N}\big(\mu^{prior}_{t,n}, (\sigma^{prior}_{t,n})^2 I\big)$
and the reply sequence hidden variable of time step $n$ is sampled from it analogously to step 3.1.
A KL-divergence loss is used to constrain the distance between the posterior and prior distributions of the sequence hidden variables in the reply generation task, as in the following formula:
$\mathcal{L}^{y}_{KL} = \sum_{n=0}^{N} \mathrm{KL}\big( q_\phi(z^{y}_{t,n} \mid Y_t, z^{y}_{t,<n}, C_t, K_{t,sel}) \,\|\, p_\theta(z^{y}_{t,n} \mid Y_{t,<n}, z^{y}_{t,<n}, C_t, K_{t,sel}) \big)$
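Since both distributions are diagonal Gaussians, the per-step KL term has a closed form; the sketch below computes it with placeholder tensors, and the function name and shapes are assumptions.

```python
import torch

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal Gaussians, summed over latent dims."""
    var_q, var_p = sigma_q.pow(2), sigma_p.pow(2)
    kl = 0.5 * (torch.log(var_p / var_q) + (var_q + (mu_q - mu_p).pow(2)) / var_p - 1.0)
    return kl.sum(dim=-1)

# placeholder posterior/prior parameters for N = 16 steps of a 768-dim latent
mu_q, sigma_q = torch.randn(16, 768), torch.rand(16, 768) + 0.1
mu_p, sigma_p = torch.randn(16, 768), torch.rand(16, 768) + 0.1
kl_per_step = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)   # one value per decoding step n
loss_kl_y = kl_per_step.sum()
```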
step 3.3, fusing the hidden variable of the recovery sequence in the variable hierarchy with the decoding hidden state of the prior path: decoding hidden state of prior path and selected hidden variable of recovery sequenceFurther fusion to obtain a representation vector S 1 And performing layer normalization processing, wherein the hidden state vector fed into the decoding layer is +.>Wherein the hidden variables selected during the training and testing phases are posterior and a priori hidden variables, respectively.
Step 3.4, the stacked decoding layers generate the reply sentence based on the output of the variational layer, the dialogue context and the selected knowledge: the stacked decoding layers are denoted Trs_Decs, and as shown in FIG. 2, each decoding layer contains two cross-attention sub-layers that attend to the dialogue context and the selected knowledge respectively. The input received by the decoding layers is $\tilde{S}_1$; after processing by the stacked decoding layers, the hidden state representation of the last layer is obtained, as in the following formula:
$H^{dec} = \mathrm{Trs\_Decs}\big(\tilde{S}_1, H^{C_t}, H^{K_{t,sel}}\big)$
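One stacked decoding layer with two cross-attention sub-layers, as described above, might look like the following sketch; the sub-layer ordering, the feed-forward width and the normalization placement are assumptions rather than the patent's exact layer design.

```python
import torch
import torch.nn as nn

class DualCrossAttentionDecoderLayer(nn.Module):
    def __init__(self, d: int = 768, nhead: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, nhead, batch_first=True)
        self.ctx_attn = nn.MultiheadAttention(d, nhead, batch_first=True)   # attends to H^{C_t}
        self.know_attn = nn.MultiheadAttention(d, nhead, batch_first=True)  # attends to H^{K_{t,sel}}
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.norms = nn.ModuleList([nn.LayerNorm(d) for _ in range(4)])

    def forward(self, s, h_ctx, h_know, causal_mask=None):
        x = self.norms[0](s + self.self_attn(s, s, s, attn_mask=causal_mask)[0])
        x = self.norms[1](x + self.ctx_attn(x, h_ctx, h_ctx)[0])     # cross-attention over the dialogue context
        x = self.norms[2](x + self.know_attn(x, h_know, h_know)[0])  # cross-attention over the selected knowledge
        return self.norms[3](x + self.ffn(x))

layer = DualCrossAttentionDecoderLayer()
out = layer(torch.randn(1, 16, 768), torch.randn(1, 20, 768), torch.randn(1, 30, 768))
```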
Assume that the already generated word sequence is $\{y_{t,0}, \dots, y_{t,j-1}\}$; after this sequence is processed by the variational autoregressive decoder, the hidden state representation of the last position in the last layer is obtained, denoted $h^{dec}_{t,j}$. The probability score $p_\theta(y_{t,j} \mid \mathrm{Gen})$ of the word $y_{t,j}$ generated at step $j$ over the vocabulary is then obtained through a fully connected layer and a normalization operation, as in the following formula:
$p_\theta(y_{t,j} \mid \mathrm{Gen}) = \mathrm{softmax}\big( W_{gen} \, h^{dec}_{t,j} \big)$
the decoding layer with the copying mechanism is used, and the probability distribution of the finally generated word is the weighted sum of the decoding layer generation, the copying of the selected knowledge and the probability distribution of the copying dialogue.
The hidden state $h^{dec}_{t,j}$ at decoding time step $j$ and the word-level representation $H^{K_{t,sel}}$ of the selected knowledge are each projected to new representation vectors, the relevance scores of their dot-product operation are normalized, and these scores are used as the probabilities of the corresponding words being copied from the knowledge, computed as in the following formulas:
$\alpha^{k}_{j,i} = \mathrm{softmax}_i\big( (W_q h^{dec}_{t,j})^{T} (W_k H^{K_{t,sel}}_{i}) \big), \quad p_\theta(y_{t,j} \mid CP_k) = \sum_{i:\, K_{t,sel,i} = y_{t,j}} \alpha^{k}_{j,i}$
where $\alpha^{k}_{j,i}$ denotes the normalized score of the dot product between the projected decoding state and the selected knowledge, i.e. the relevance score between the $i$-th word $K_{t,sel,i}$ of the selected knowledge and the decoding state $h^{dec}_{t,j}$, and $p_\theta(y_{t,j} \mid CP_k)$ is the sum of the probabilities of copying, at decoding step $j$, the words in the selected knowledge that are identical to $y_{t,j}$.
Similarly, for the probability score $p_\theta(y_{t,j} \mid CP_c)$ of copying a word from the dialogue context, the decoding hidden state $h^{dec}_{t,j}$ and the word-level representation $H^{C_t}$ of the dialogue context are each projected to new representation vectors, the relevance scores of their dot-product operation are normalized, and these scores are used as the probabilities of the corresponding words being copied from the dialogue context, computed as follows:
$\alpha^{c}_{j,i} = \mathrm{softmax}_i\big( (W'_q h^{dec}_{t,j})^{T} (W'_c H^{C_t}_{i}) \big), \quad p_\theta(y_{t,j} \mid CP_c) = \sum_{i:\, C_{t,i} = y_{t,j}} \alpha^{c}_{j,i}$
where $\alpha^{c}_{j,i}$ denotes the normalized score of the dot product between the projected decoding state and the dialogue context, i.e. the relevance score between the $i$-th word $C_{t,i}$ of the dialogue context $C_t$ and the decoding state $h^{dec}_{t,j}$, and $p_\theta(y_{t,j} \mid CP_c)$ is the sum of the probabilities of copying, at decoding step $j$, the words in the dialogue context that are identical to $y_{t,j}$.
The weight coefficients $\lambda_1$, $\lambda_2$ and $\lambda_3$ for combining the three probabilities are computed by the multi-layer perceptron MLP from the hidden state of the current time step and the attention representations of the selected knowledge and of the dialogue context; the final generation probability score, abbreviated as $p_\theta(y_{t,j})$, is given by the following formula:
$p_\theta(y_{t,j}) = \lambda_1\, p_\theta(y_{t,j} \mid \mathrm{Gen}) + \lambda_2\, p_\theta(y_{t,j} \mid CP_k) + \lambda_3\, p_\theta(y_{t,j} \mid CP_c)$
where $p_\theta(y_{t,j})$ denotes the final probability score of outputting the vocabulary word $y_{t,j}$ at time step $j$.
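The mixture of the three distributions can be sketched as follows, with random placeholder scores; mapping copy-attention scores onto vocabulary ids with scatter_add is one possible realisation of the copy mechanism and is an assumption, as are the token ids used.

```python
import torch

V = 30522
p_gen = torch.softmax(torch.randn(V), dim=-1)          # p(y | Gen) over the vocabulary

know_ids = torch.tensor([101, 2003, 4827])             # token ids of the selected knowledge (placeholders)
ctx_ids = torch.tensor([102, 2079, 2017])              # token ids of the dialogue context (placeholders)
a_know = torch.softmax(torch.randn(3), dim=-1)         # normalised copy-attention scores alpha^k
a_ctx = torch.softmax(torch.randn(3), dim=-1)          # normalised copy-attention scores alpha^c

p_copy_k = torch.zeros(V).scatter_add_(0, know_ids, a_know)   # p(y | CP_k)
p_copy_c = torch.zeros(V).scatter_add_(0, ctx_ids, a_ctx)     # p(y | CP_c)

lam = torch.softmax(torch.randn(3), dim=-1)             # lambda_1..3, produced by an MLP in the model
p_final = lam[0] * p_gen + lam[1] * p_copy_k + lam[2] * p_copy_c
assert torch.isclose(p_final.sum(), torch.tensor(1.0), atol=1e-5)   # still a valid distribution
```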
The loss of the reply generation task is a negative log-likelihood loss, as in the following equation:
$\mathcal{L}_{NLL} = -\sum_{j} \log p_\theta(y_{t,j})$
Step 3.5, model training: the optimization objective of the variational autoregressive dialogue generation framework proposed in this embodiment covers the two subtasks of knowledge selection and reply generation: the knowledge selection subtask loss includes the KL-divergence loss $\mathcal{L}^{k}_{KL}$ of step 2.2 and the reconstruction loss $\mathcal{L}_{sel}$ of step 2.3, and the reply generation subtask loss includes the KL-divergence loss $\mathcal{L}^{y}_{KL}$ of step 3.2 and the negative log-likelihood loss $\mathcal{L}_{NLL}$ of step 3.4.
The complete optimization objective of the variational autoregressive dialogue generation model is the weighted sum of the four losses:
$\mathcal{L} = \gamma_1 \mathcal{L}^{k}_{KL} + \gamma_2 \mathcal{L}_{sel} + \gamma_3 \mathcal{L}^{y}_{KL} + \gamma_4 \mathcal{L}_{NLL}$
where $\gamma_1$, $\gamma_2$, $\gamma_3$ and $\gamma_4$ are the weights of the individual losses; all four weights are set to 1.0 in the experiments.
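A minimal sketch of the combined objective in step 3.5 with placeholder loss values; only the weighted four-term sum follows the description, the numbers are arbitrary.

```python
import torch

loss_kl_k = torch.tensor(0.42)    # KL loss of the knowledge-selection latent (step 2.2)
loss_rec_k = torch.tensor(1.35)   # reconstruction loss of knowledge selection (step 2.3)
loss_kl_y = torch.tensor(0.88)    # KL loss of the reply-sequence latents (step 3.2)
loss_nll_y = torch.tensor(2.10)   # negative log-likelihood of the reply (step 3.4)

gammas = (1.0, 1.0, 1.0, 1.0)     # all loss weights set to 1.0 in the experiments
total = (gammas[0] * loss_kl_k + gammas[1] * loss_rec_k
         + gammas[2] * loss_kl_y + gammas[3] * loss_nll_y)
# total.backward() would be called on the real computation graph during training
```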
Supplementing the prior information of knowledge selection with the predicted reply sentence information improves the accuracy of knowledge selection, and the diversity of knowledge selection is modelled with a conditional variational autoencoder; the variational autoregressive decoder is used to model the reply sequence hidden variables, i.e. a reply hidden variable is modelled at each decoding time step, which in turn models the higher-dimensional diversity of reply sentences. This embodiment runs experiments on the public open-domain Wizard-of-Wikipedia (WoW) knowledge-grounded dialogue dataset, which contains label data for the knowledge selection task. Each data sample in the WoW dataset includes a dialogue context, a corresponding knowledge sentence set and a reply sentence; the dataset is available from https://parl.ai/projects/wizard_of_wikipedia/.
The knowledge in the WoW dataset comes from encyclopedia websites; one of the two parties in a conversation is the learner (Apprentice) and the other is the expert (Wizard). The wizard can obtain unstructured knowledge (paragraphs or sentences) during the dialogue and reply to the apprentice with reference to this external knowledge, while the apprentice cannot see the knowledge content selected or used by the wizard. Both sides actively keep the conversation going. The dataset covers 1365 open-domain dialogue topics and is divided into a training set, a validation set and two test sets: the topics in one test set appear in the training or validation set (Test Seen), while the other test set examines the zero-shot learning ability of the model, i.e. its topics never appear in the training or validation set (Test Unseen). The role to be modelled is the wizard. The invention keeps the original split of the WoW dataset; Table 1 gives the statistics of the dataset.
TABLE 1
In this embodiment, the reply sentence information is predicted using the information in the external knowledge sentence set, which supplements the prior information for knowledge selection, brings the prior distribution of knowledge selection close to the posterior distribution in the training stage and improves the accuracy of knowledge selection, while the diversity of knowledge selection is modelled with a conditional variational autoencoder; the reply sequence hidden variables are modelled with the variational autoregressive decoder, i.e. a reply hidden variable is modelled at each decoding time step, which in turn models the higher-dimensional diversity of reply sentences. This enables the model to select the external knowledge most relevant to the dialogue context and makes the generated reply content more informative and diverse.
In this embodiment, the results of the model of the invention are compared with several knowledge-grounded dialogue generation baseline models, namely MemNet, PostKS, SKLS, DukeNet and CoLV, using knowledge selection accuracy (ACC), the reply quality metrics BLEU-4 and ROUGE-2 (RG-2), and the reply diversity metric Dist-2; Table 2 shows the experimental results of the baseline models and the model of the invention.
TABLE 2
As shown in Table 2, the variational autoregressive dialogue generation model proposed by the invention outperforms the compared baseline models on most metrics (except RG-2 on Test Unseen), which overall illustrates the effectiveness and strong generalization of the proposed model. Compared with the strong CoLV baseline, the knowledge selection accuracy improves to some extent on both test sets, showing that using supplementary posterior information in the knowledge selection task allows the prior distribution of the knowledge selection hidden variable to learn the regularities of the posterior distribution in the training stage. However, because of the diversity of knowledge selection, the ACC metric does not improve dramatically, since multiple pieces of knowledge may fit the current dialogue context. In the reply generation task, the invention improves the BLEU-4 metric compared with all baseline models, indicating that the model can generate reply sentences that conform to human habits and carry more information, and further demonstrating the effectiveness of combining the knowledge selection and reply generation hidden variables. In terms of reply diversity, the invention is clearly superior to the baseline models on the Dist-2 metric, showing that introducing the reply sequence hidden variables improves the diversity of the reply sentences generated by the dialogue model, and that introducing the copy mechanism lets the model learn to copy words from the dialogue context and the knowledge.
In this embodiment, an ablation experiment is also designed to demonstrate the effectiveness of each module in the proposed model framework; Table 3 shows the ablation results. The control groups in the ablation experiment are set as follows:
(1) w/o PI: in the modelling of the knowledge selection hidden variable, the posterior prediction information is not used and only the dialogue context is used;
(2) w/o K var: the knowledge selection hidden variable is removed, and the knowledge corresponding to the highest attention score is used;
(3) w/o Y var-seq: the reply sequence hidden variables become a single hidden variable, which is concatenated and fed into the decoder;
(4) w/o Y var: the reply sequence hidden variable modelling module is removed, and the output is decoded directly;
(5) w/o all: the model becomes a basic encoder-decoder model.
Apart from the modifications mentioned, the above five ablation settings are identical to the complete model of the invention.
TABLE 3
The conclusions drawn from the ablation experiments are as follows: when the posterior prediction information is not used, the knowledge selection accuracy decreases and the reply generation quality is also affected, with the BLEU-4 metric dropping slightly. After removing the knowledge selection hidden variable, the ACC metric drops considerably on both Test Seen and Test Unseen; the model then selects external knowledge only according to the attention scores, so it may make inappropriate choices and the reply content lacks information. After changing the sequence hidden variables into a single hidden variable, the reply generation metrics decrease, illustrating the importance of modelling sequence hidden variables for reply diversity; after completely removing the reply sequence hidden variables, the BLEU-4, RG-2 and Dist-2 reply metrics all drop and the knowledge selection ACC metric also drops, which demonstrates the effectiveness of jointly modelling the knowledge selection and reply sequence hidden variables. Finally, with all hidden variable modelling removed, the overall model becomes a basic encoder-decoder model whose performance is close to MemNet.
The comparison and ablation experiments and their analysis show that, relative to the baseline models, both the knowledge selection effect and the reply generation on the public knowledge-grounded dialogue dataset WoW are improved to some extent, and that the design of each module in the model framework contributes effectively to the overall model performance. The technical effects of the application are as follows: the application predicts the reply sentence information using the information in the external knowledge sentence set, thereby supplementing the prior information for knowledge selection, bringing the prior distribution of the knowledge selection hidden variable close to the posterior distribution in the training stage and making the knowledge selection in the testing stage more accurate; and because the variational autoregressive decoder is used to model the reply sequence hidden variables, i.e. a reply hidden variable is modelled at each decoding time step, the generated reply sentences are more diverse, more informative and closer to human replies.
Preferably, the embodiment of the present application further provides a specific implementation of an electronic device capable of implementing all the steps of the variational autoregressive dialogue generation method based on joint hidden variables in the above embodiment, and the electronic device specifically includes the following contents:
a processor (processor), a memory (memory), a communication interface (Communications Interface), and a bus;
The processor, the memory and the communication interface complete communication with each other through buses; the communication interface is used for realizing information transmission among relevant equipment such as server-side equipment, metering equipment and user-side equipment.
The processor is configured to invoke the computer program in the memory, and when the processor executes the computer program, implement all the steps in the method for generating a variational autoregressive dialogue based on joint hidden variables in the above embodiment.
The embodiment of the present application also provides a computer-readable storage medium capable of implementing all the steps of the variational autoregressive dialogue generation method based on joint hidden variables in the above embodiment; the computer-readable storage medium stores a computer program which, when executed by a processor, implements all the steps of the method.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
Although the application provides method operational steps as an example or a flowchart, more or fewer operational steps may be included based on conventional or non-inventive labor. The order of steps recited in the embodiments is merely one way of performing the order of steps and does not represent a unique order of execution. When implemented by an actual device or client product, the instructions may be executed sequentially or in parallel (e.g., in a parallel processor or multi-threaded processing environment) as shown in the embodiments or figures.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The invention is not limited to the embodiments described above. The above description of specific embodiments is intended to describe and illustrate the technical aspects of the present invention, and is intended to be illustrative only and not limiting. Numerous specific modifications can be made by those skilled in the art without departing from the spirit of the invention and scope of the claims, which are within the scope of the invention.

Claims (7)

1. A variational autoregressive dialogue generation device based on joint hidden variables, characterized by comprising an encoder module, a knowledge selection module and a variational autoregressive decoder module;
the encoder module is used for encoding the dialogue context and the knowledge sentence set respectively into feature representations in vector form, the knowledge sentence set containing a number of knowledge sentences with labels; word-level and sentence-level encoded representations are constructed with a pre-trained language model based on a multi-layer self-attention mechanism;
the knowledge selection module selects, based on the dialogue context and the currently given knowledge sentence set, the knowledge sentence most semantically relevant to the dialogue context from the knowledge sentence set for reply generation; it models the posterior distribution and the prior distribution of the knowledge selection hidden variable, where the knowledge selection hidden variable is a hidden variable that follows a categorical distribution over the knowledge sentence set conditioned on the dialogue context; the posterior distribution is modelled from the labelled knowledge sentence and the dialogue context, and the prior distribution is modelled using predicted posterior information and the dialogue context, where the predicted posterior information refers to the predicted reply sentence information; the prior distribution is made to approximate the posterior distribution in the training stage, so that in the testing stage the prior distribution can be used to select a knowledge sentence, which is sent to the variational autoregressive decoder module;
the variational autoregressive decoder module comprises a variational layer and stacked decoding layers; the variational layer computes the reply sequence hidden variables, which in the training stage include both posterior and prior sequence hidden variables, and the selected reply sequence hidden variables are fused with the decoding hidden states of the prior path and passed to the decoding layers, where the selected reply sequence hidden variables are the posterior and prior reply sequence hidden variables in the training and testing stages respectively, and the decoding hidden states of the prior path refer to the hidden state representations obtained by passing the already generated reply sequence through the variational layer; the final reply sentence is then generated by the stacked decoding layers.
2. A variational autoregressive dialogue generation method based on joint hidden variables, implemented with the variational autoregressive dialogue generation device of claim 1, comprising:
S1, encoding the dialogue context and the knowledge sentence set with the self-attention-based pre-trained language model BERT to construct word-level and sentence-level encoded representations;
S2, modeling the posterior distribution and the prior distribution of the knowledge selection hidden variable; the posterior distribution is modeled from the labeled knowledge sentence and the dialogue, the prior distribution is modeled using predicted posterior information and the dialogue, and a hidden variable is sampled from the prior distribution to select the knowledge sentence to be sent to the variational autoregressive decoder module;
S3, generating a reply sentence through the variational autoregressive decoder module; the variational autoregressive decoder module comprises a variational layer and stacked decoding layers; the variational layer computes the posterior distribution and the prior distribution of the reply sequence hidden variables, and the fusion of the selected reply sequence hidden variable with the decoding hidden state of the prior path is passed to the decoding layers, wherein the selected reply sequence hidden variable is the posterior reply sequence hidden variable in the training stage and the prior reply sequence hidden variable in the testing stage, and the decoding hidden state of the prior path refers to the hidden state representation obtained by passing the already generated reply sequence through the variational layer; the stacked decoding layers generate the reply sentence based on the output of the variational layer, the dialogue context and the selected knowledge.
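As a purely illustrative aside (not claim language), the training-time coupling of the two distributions described in claim 2, where the prior is driven toward the posterior so that only the prior is needed at test time, is commonly realized with a KL term between two categorical distributions. The sketch below is a minimal example under assumed shapes; the function name and toy values are hypothetical, and the patent does not prescribe this exact formulation.

import torch
import torch.nn.functional as F

def knowledge_selection_kl(posterior_logits: torch.Tensor,
                           prior_logits: torch.Tensor) -> torch.Tensor:
    """KL(posterior || prior) between two categorical distributions over L knowledge sentences."""
    log_q = F.log_softmax(posterior_logits, dim=-1)   # posterior log-probabilities
    log_p = F.log_softmax(prior_logits, dim=-1)       # prior log-probabilities
    return (log_q.exp() * (log_q - log_p)).sum(dim=-1).mean()

# toy usage: a batch of 2 dialogues, each with 4 candidate knowledge sentences
posterior_logits = torch.randn(2, 4)
prior_logits = torch.randn(2, 4)
kl_term = knowledge_selection_kl(posterior_logits, prior_logits)  # non-negative scalar
print(float(kl_term))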
3. The variational autoregressive dialogue generation method based on joint hidden variables according to claim 2, wherein in step S1, given the current round dialogue context C_t and the knowledge sentence set K_t, the BERT model is used for encoding to obtain the corresponding word-level representation vectors of the dialogue context together with its average-pooled sentence-level representation vector; for any sentence K_{t,l} in the knowledge sentence set K_t, word-level and sentence-level representation vectors are likewise obtained; the sentence-level representation of the whole knowledge sentence set K_t is recorded as a matrix of size L×d, where L is the knowledge sentence set size and d denotes the hidden state dimension.
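For orientation only, step S1 of claim 3 can be approximated with an off-the-shelf BERT encoder. The following sketch uses the Hugging Face transformers library as a stand-in; the model checkpoint, the example sentences and the masked average pooling are assumptions, since the claim does not fix these details.

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode(sentences):
    """Return word-level (token) representations and mean-pooled sentence-level vectors."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    word_level = out.last_hidden_state                      # (N, T, d)
    mask = batch["attention_mask"].unsqueeze(-1).float()    # exclude padding from the mean
    sent_level = (word_level * mask).sum(1) / mask.sum(1)   # (N, d) average pooling
    return word_level, sent_level

context = ["do you like jazz music ?"]                      # toy dialogue context C_t
knowledge = ["jazz originated in new orleans .",            # toy knowledge sentence set K_t
             "jazz uses blue notes and swing rhythms ."]
_, ctx_sent = encode(context)     # sentence-level context vector, d = 768
_, kn_sents = encode(knowledge)   # (L, d) sentence-level matrix, L = knowledge set size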
4. The variational autoregressive dialogue generation method based on joint hidden variables according to claim 2, wherein step S2 comprises:
S2.1, calculating the posterior distribution of the knowledge selection hidden variable: the representation vectors of the dialogue context and the selected knowledge are concatenated, a dot-product attention calculation is performed between this concatenated vector and the sentence-level representations of the knowledge sentence set, and the posterior hidden variable representation is obtained through a softmax normalization layer;
S2.2, calculating the prior distribution of the knowledge selection hidden variable: the predicted reply sentence information is used to supplement the prior information for knowledge selection; a dot-product attention calculation is performed between the dialogue context representation and the sentence-level representations of the knowledge sentences, yielding a dialogue-context-aware fused feature representation of the knowledge sentence set, where Attn denotes the attention computation;
the dialogue context representation and the fused feature representation of the knowledge sentence set are concatenated, and the word probability distribution T of the reply sentence over the BERT vocabulary is predicted through a multi-layer perceptron; T is weighted and summed with the embedding representation E of the whole BERT vocabulary, the result is concatenated with the dialogue context feature representation and passed through a feature transformation, a dot-product attention calculation is then performed with the sentence-level representations of the knowledge sentences, and the attention scores over the knowledge sentence set obtained through a softmax normalization layer are used as the prior hidden variable;
S2.3, sampling a prior hidden variable from the prior distribution of the knowledge selection hidden variable to select the knowledge sentence K_{t,sel}: sampling is performed from the prior hidden variable, which obeys a categorical distribution, and the knowledge sentence corresponding to the sampled result is sent to the variational autoregressive decoder module; the reconstruction loss of the knowledge selection task in the knowledge selection module is defined with respect to the label knowledge K_{t,a}.
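The selection and loss computation in steps S2.1-S2.3 amount to building a categorical distribution over the candidate knowledge sentences and penalizing it against the labeled sentence. A schematic sketch under assumed tensor shapes follows; the random features, the label indices and the argmax-style selection are illustrative assumptions rather than the patent's exact formulation.

import torch
import torch.nn.functional as F

def knowledge_distribution(query: torch.Tensor, kn_sents: torch.Tensor) -> torch.Tensor:
    """query: (B, d); kn_sents: (B, L, d) -> (B, L) softmax-normalized attention scores."""
    scores = torch.einsum("bd,bld->bl", query, kn_sents)  # dot-product attention
    return F.softmax(scores, dim=-1)

B, L, d = 2, 4, 768
query = torch.randn(B, d)                 # e.g. fused dialogue-context / predicted-reply features
kn_sents = torch.randn(B, L, d)           # sentence-level knowledge representations
probs = knowledge_distribution(query, kn_sents)

selected = probs.argmax(dim=-1)           # index of K_{t,sel} forwarded to the decoder module
label = torch.tensor([1, 3])              # indices of the labeled knowledge K_{t,a} (toy values)
recon_loss = F.nll_loss(torch.log(probs + 1e-9), label)  # knowledge-selection reconstruction loss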
5. The variational autoregressive dialogue generation method based on joint hidden variables according to claim 2, wherein step S3 comprises:
S3.1, modeling the posterior distribution of the reply sequence hidden variables through the variational layer: in the training stage, the word embedding representation and the position encoding of the input reply sequence Y_t are added element-wise as the initial input of the variational layer, denoted S_0; the posterior distribution parameters of the reply sequence hidden variables, namely the mean and variance vectors, are then obtained through the variational layer, and the posterior hidden variable is sampled by the reparameterization technique; the reply sequence hidden variables before time step n are recorded, and the posterior distribution of the reply sequence hidden variable at time step n is a Gaussian parameterized by this mean and variance;
wherein C_t is the current round dialogue context, K_{t,sel} is the selected knowledge sentence, and I denotes the identity matrix;
S3.2, modeling the prior distribution of the reply sequence hidden variables; this applies to both the training and testing stages: the prior distribution parameters of the reply sequence hidden variables, namely the mean and variance vectors, are obtained through the variational layer, and the prior hidden variable of time step n is sampled by the reparameterization technique; the reply sequence before time step n is denoted Y_{t,<n}, and the reply sequence hidden variable of time step n is sampled from the prior distribution conditioned on Y_{t,<n}, the dialogue context and the selected knowledge;
S3.3, fusing the reply sequence hidden variable with the decoding hidden state of the prior path in the variational layer: the decoding hidden state of the prior path is combined with the selected reply sequence hidden variable and further fused to obtain the representation vector S_1, where the selected reply sequence hidden variable is the posterior reply sequence hidden variable in the training stage and the prior reply sequence hidden variable in the testing stage; the resulting hidden state vector is then sent to the decoding layers;
S3.4, the stacked decoding layers generate the reply sentence based on the output of the variational layer, the dialogue context and the selected knowledge: the stacked decoding layers obtain the decoding (generation) probability, the probability of copying from the selected knowledge and the probability of copying from the dialogue context, and these probabilities are weighted and summed to obtain the probability distribution of the finally generated word over the BERT vocabulary; the word with the highest probability on the BERT vocabulary is obtained through the argmax function, and the final reply sentence is obtained by concatenating the words generated in this way.
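Two building blocks of step S3 lend themselves to a compact, non-limiting illustration: the reparameterized sampling of the reply sequence hidden variable (S3.1/S3.2) and the weighted sum of generation and copy probabilities over the BERT vocabulary (S3.4). The sketch below is a simplified, assumption-laden rendering; the gate weights, tensor shapes and random inputs are placeholders, not values from the patent.

import torch

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """z = mu + sigma * eps with eps ~ N(0, I); keeps sampling differentiable."""
    return mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

def mix_probabilities(p_generate, p_copy_knowledge, p_copy_context, gates):
    """Weighted sum of three per-token vocabulary distributions; gates should sum to 1."""
    g_gen, g_kn, g_ctx = gates
    return g_gen * p_generate + g_kn * p_copy_knowledge + g_ctx * p_copy_context

vocab_size = 30522                          # size of the BERT vocabulary
mu, logvar = torch.zeros(1, 768), torch.zeros(1, 768)
z = reparameterize(mu, logvar)              # posterior or prior latent, depending on the stage

p_gen = torch.softmax(torch.randn(1, vocab_size), dim=-1)  # decoding (generation) probabilities
p_kn = torch.softmax(torch.randn(1, vocab_size), dim=-1)   # copy-from-selected-knowledge probabilities
p_ctx = torch.softmax(torch.randn(1, vocab_size), dim=-1)  # copy-from-dialogue-context probabilities
p_final = mix_probabilities(p_gen, p_kn, p_ctx, gates=(0.6, 0.3, 0.1))
next_token = p_final.argmax(dim=-1)         # highest-probability word (argmax over the vocabulary)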
6. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the variational autoregressive dialogue generation method based on joint hidden variables according to any one of claims 2 to 5.
7. A computer readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the variational autoregressive dialogue generation method based on joint hidden variables according to any one of claims 2 to 5.
CN202310482318.7A 2023-04-30 2023-04-30 Variable autoregressive dialogue generation device and method based on joint hidden variables Pending CN116628149A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310482318.7A CN116628149A (en) 2023-04-30 2023-04-30 Variable autoregressive dialogue generation device and method based on joint hidden variables

Publications (1)

Publication Number Publication Date
CN116628149A true CN116628149A (en) 2023-08-22

Family

ID=87612527

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310482318.7A Pending CN116628149A (en) 2023-04-30 2023-04-30 Variable autoregressive dialogue generation device and method based on joint hidden variables

Country Status (1)

Country Link
CN (1) CN116628149A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination