CN114610861A - End-to-end dialogue method for integrating knowledge and emotion based on variational self-encoder - Google Patents

End-to-end dialogue method for integrating knowledge and emotion based on variational self-encoder

Info

Publication number
CN114610861A
Authority
CN
China
Prior art keywords
knowledge
emotion
encoder
vector
variational self
Legal status
Granted
Application number
CN202210508804.7A
Other languages
Chinese (zh)
Other versions
CN114610861B (en)
Inventor
谢冰
宋伟
朱世强
袭向明
金天磊
周元海
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Application filed by Zhejiang Lab
Priority to CN202210508804.7A
Publication of CN114610861A
Application granted
Publication of CN114610861B
Status: Active

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F40/30 Handling natural language data; Semantic analysis
    • G06N3/04 Neural networks; Architecture, e.g. interconnection topology

Abstract

The invention discloses an end-to-end dialogue method that integrates knowledge and emotion based on a variational self-encoder, comprising the following steps: acquiring emotion labels, dialogues, knowledge and replies and preprocessing them as training data; building and training a model consisting of a variational self-encoder module and a copy module; and preprocessing test data, inputting it into the trained model to predict a reply, and continuing the end-to-end conversation. The encoding module of the variational self-encoder encodes the emotion label and the semantic information of the input dialogue. The decoding module of the variational self-encoder integrates knowledge and emotion into the generated content. The copy module produces the output reply by combining the content generated by the decoder, the input dialogue and the knowledge. The method adopts a variational self-encoder structure to generate rich replies, introduces emotion labels to control the emotion type of the reply, and copies information from the input dialogue and knowledge, so that the generated replies are both rich and controllable.

Description

End-to-end dialogue method for integrating knowledge and emotion based on variational self-encoder
Technical Field
The invention belongs to the field of natural language processing, in particular to text generation and dialogue systems, and specifically relates to an end-to-end dialogue method that integrates knowledge and emotion based on a variational self-encoder (variational autoencoder).
Background
In 1950, Alan Turing proposed the Turing test in "Computing Machinery and Intelligence" as a method of detecting whether a machine can chat like a human. The Turing test can be described as separating the tester from the testees (one person and one machine) and letting the tester ask the testees questions at will through some channel (e.g., a keyboard). After a number of tests, if more than 30% of the testers cannot determine whether the testee is a human or a machine, the machine passes the test and is considered to possess artificial intelligence. Turing thus proposed a standard for verifying whether a chat robot possesses intelligence, which can be regarded as the beginning of chat robot research.
In 1966, Joseph Weizenbaum of the Massachusetts Institute of Technology developed a conversational robot named ELIZA, which played the role of a psychotherapist, helping people with psychological illnesses by conversing with the user. ELIZA implemented dialogue using pattern matching and reply selection, which gave it limited conversational ability: it could only answer questions in a particular field. Nevertheless, ELIZA inspired later chat robot research.
In 1972, a chat robot named PARRY appeared, which was designed to simulate a patient with schizophrenia. PARRY had a personality and a better dialogue control structure than ELIZA. However, PARRY's overall language understanding ability was low; it could not learn knowledge from a dialogue and responded slowly.
In 1988, the JABBERWACKY chat robot emerged, developed with the CleverScript scripting language; it exhibited conversational memory by pattern-matching against historical conversation records to produce replies.
In 1995, ALICE was developed by Richard Wallace. It is considered a further advance in the history of chat robot development. Like ELIZA, ALICE is a template-matching-based approach. The AIML (Artificial Intelligence Markup Language) language was created specifically for developing ALICE. ALICE contains about 41,000 templates and associated patterns, which allow it to conduct multi-turn, multi-topic conversations. ALICE's excellent performance won it the Loebner Prize in 2000 and 2001. However, it is still a rule-based chat robot, and its conversation is not particularly intelligent.
With the popularization of the mobile internet and smartphones, chat robots developed further. The SIRI voice assistant for the Apple mobile phone, which went online in 2010, can be regarded as the pioneer of personal voice assistants for smartphones. Users interact with SIRI through voice, and SIRI can invoke rich Internet resources to serve them; it can also recommend suitable goods or services to a user based on accumulated user data.
In 2011, the Watson robot developed by IBM defeated two human champions on the quiz program Jeopardy and exhibited powerful natural language understanding and information retrieval capabilities. Google Now began development in 2012, and Microsoft promoted the personal service assistant Cortana in 2014. In the same year, Amazon introduced Alexa, a voice assistant dedicated to smart homes and the Internet of Things; research on voice assistants reached a wave of enthusiasm. Microsoft also released the XiaoIce chat robot in 2014, dedicated to social scenarios. XiaoIce is set up as an 18-year-old girl with both IQ and EQ; she can generate long, emotional replies and supports multiple languages, and had wide impact after release. It is, however, a complex set of systems rather than an end-to-end model.
The rise of deep learning has had a profound effect on the field of natural language processing. Methods that generate replies with models trained on large-scale corpora have gradually appeared and achieved good results. Models trained on large-scale corpora can handle unseen conversations as well as multi-turn dialogue, and achieve good fluency and logical coherence in the generated replies. Typical models are the GPT series, DialoGPT, Meena, Blender, the Baidu PLATO series, and so on. Deep learning has enabled end-to-end chat robot architectures to exhibit ever greater capabilities.
Current end-to-end models can generate readable replies from historical conversation records, but the richness of the generated dialogue still needs to be enhanced. How to enable the model to generate a reply with a specified emotion is a considerable problem to be solved. A reply containing the specified emotion and related knowledge would greatly improve the quality of the reply and the user's conversational experience.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an end-to-end dialogue method that integrates knowledge and emotion based on a variational self-encoder.
In order to achieve this technical purpose, the technical scheme of the invention is as follows: a first aspect of an embodiment of the present invention provides an end-to-end dialogue method integrating knowledge and emotion based on a variational self-encoder, the method comprising the following steps:
(1) acquiring emotion labels, conversations, knowledge and replies, and preprocessing to obtain training data;
(2) building a model consisting of a variational self-encoder module and a copy module; the variational self-encoder module comprises an encoder and a decoder; the encoder is used for encoding the emotion label and the semantic information of the dialogue to obtain a dialogue coding matrix; the decoder comprises an encoding end and a decoding end and is used for generating a knowledge coding matrix from the knowledge and for autoregressively generating, in combination with the knowledge coding matrix, a decoding vector and a predicted emotion label; the copy module updates the state vector by combining the dialogue coding matrix and knowledge coding matrix generated by the variational self-encoder module with the current decoding vector, and uses the updated state vector together with the dialogue coding matrix and the knowledge coding matrix to predict and generate the output reply;
(3) inputting the training data preprocessed in the step (1) into the model constructed in the step (2) for training the model and storing the model;
(4) acquiring emotion labels and dialogues, selecting knowledge, and preprocessing the emotion labels and dialogues (including splicing) to obtain prediction data;
(5) inputting the prediction data preprocessed in the step (4) into the model trained in the step (3) for model prediction to obtain a reply.
Further, the preprocessing in step (1) and step (4) comprises converting the emotion label into a one-hot category label and splicing the emotion label with the dialogue; the splicing proceeds as follows: start with the separator [CLS], then concatenate the emotion label and a separator [SEP], then concatenate the historical dialogue turns separated by [SEP]; the total length must not exceed 512.
Further, the loss function of model training is:

$$\mathrm{Loss} = -\sum_{t}\log P\big(y'_t = y_t \mid y'_{<t}, U, K, em\big) + \mathrm{CE}\big(em', em\big)$$

wherein Loss is the loss value, em' is the predicted emotion label, Y' is the predicted reply, $y'_t$ is the character predicted at time t, $y_t$ is the label character at time t, $y'_{<t}$ are the characters predicted before time t, U is the dialogue, K is the knowledge, em is the emotion label, and CE is the cross-entropy.
Further, the dialogue coding matrix is input into a feed-forward neural network to generate the mean and variance of a normal distribution; the knowledge is input into the encoding end of the decoder of the variational self-encoder to obtain the knowledge coding matrix; the normal distribution is sampled to obtain a sampling vector; when the model predicts and generates a reply, the sampling vector is added to the word embedding vector corresponding to the dialogue start character; and the decoding end of the decoder of the variational self-encoder module outputs a decoding matrix used to predict the emotion label of the generated reply.
Furthermore, the copying module carries out weighted summation on the dialogue coding matrix to obtain a dialogue reading vector, and carries out weighted summation on the knowledge coding matrix to obtain a knowledge reading vector; and splicing the dialogue reading vector, the knowledge reading vector and the state vector with an output vector generated by the current decoder, and obtaining a new state vector after passing through a feedforward neural network.
Further, the copy module has a generation mode and a copy mode; in the generation mode, the updated state vector passes through a linear layer to produce a score for every character; in the copy mode, the vector corresponding to each input character in the knowledge coding matrix passes through a linear-layer mapping and an activation function and then takes an inner product with the updated state vector to obtain a score for generating that input character, while the vector corresponding to each input character in the dialogue coding matrix passes through a linear-layer mapping and an activation function and then takes an inner product with the sum of the updated state vector and the vector sampled from the normal distribution to obtain a score for generating that input character; combining the generation mode and the copy mode, the scores of each character across the modes are added and normalized to obtain the probability with which the model generates that character.
Further, the step (5) is specifically: selecting characters by greedy search or beam search based on the probabilities with which the model generates characters, thereby generating a reply; generation of the reply completes when [CLS], [SEP], a start symbol or an end symbol is produced; the decoding matrix output by the decoding end of the variational self-encoder module's decoder is average-pooled and input into a feed-forward neural network to obtain the predicted emotion label; after the reply generated by the model is sent to the user, the user replies with new content; the reply generated by the model and the user's new reply are spliced into the dialogue; a new emotion label is selected and spliced to the front of the dialogue; and knowledge is selected and input into the model, continuing the end-to-end conversation.
A second aspect of embodiments of the present invention provides a neural network for an end-to-end dialog incorporating knowledge and emotion, comprising:
a variational self-encoder module comprising an encoder and a decoder; the encoder is used for encoding the emotion labels and semantic information of the conversation to generate a conversation encoding matrix and normally distributed parameters; the decoder is used for generating a knowledge coding matrix by combining knowledge, and generating a decoding vector and a predicted emotion tag by combining autoregression of the knowledge coding matrix; an encoder in the variational self-encoder module consists of a plurality of encoding layers, is realized by adopting a Transformer model structure and is an encoder end of a Transformer; each coding layer comprises a multi-head attention layer, a residual connection layer, a normalization layer, a linear layer, a residual connection layer and a normalization layer which are connected in sequence; a decoder in the variational self-encoder module consists of a plurality of decoding layers, is realized by adopting a Transformer model structure and is an encoding end and a decoding end of a Transformer model; each decoding layer comprises a multi-head mask attention layer, a residual connecting layer, a normalization layer, a cross attention layer, a residual connecting layer, a normalization layer, a linear layer, a residual connecting layer and a normalization layer which are connected in sequence;
the copying module updates the state vector by combining the dialog coding matrix and the knowledge coding matrix generated by the variational self-encoder module with the current decoding vector; and generating an output reply by using the updated state vector and combining with the dialogue coding matrix and the knowledge coding matrix for prediction.
A third aspect of the embodiments of the present invention provides an end-to-end dialog apparatus based on knowledge and emotion blending of variational self-encoders, including a memory and a processor, where the memory is coupled to the processor; wherein the memory is used for storing program data, and the processor is used for executing the program data to realize the end-to-end conversation method based on the integration knowledge and emotion of the variational self-encoder.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described end-to-end dialogue method integrating knowledge and emotion based on a variational self-encoder.
The invention has the beneficial effects that:
1. A variational self-encoder structure is adopted: emotion types and conversation records are encoded into a specific normal distribution, and a sample vector drawn from that distribution is then input into the decoder so as to generate rich replies.
2. The Transformer structure is used for the decoder of the variational self-encoder. The Encoder end of the Transformer encodes the knowledge, and the sampled vector is input into the Decoder end of the Transformer for generating a reply, deeply fusing the information of the emotion type, the dialogue and the knowledge.
3. The variational self-encoder and the Transformer structure are combined, so that the model can combine knowledge to generate emotionally controllable and diversified replies.
4. The method adopts a copy mechanism that can copy information from the dialogue and from the knowledge, so the model can generate low-frequency words appearing in the dialogue, such as names, as well as low-frequency words appearing in the knowledge, such as technical terms, making the generated replies rich and controllable.
5. The copy mechanism employs different strategies when copying information from the dialogue and from the knowledge. The sample vector drawn from the generated normal distribution is merged in when copying information from the dialogue, whereas it is not merged in when copying information from the knowledge. The sample vector contains the information of the emotion label: merging it when copying from the dialogue lets the model focus more on the emotional features in the dialogue, while omitting it when copying from the knowledge makes the copying of knowledge more objective.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a view of the overall model structure according to the present invention;
FIG. 3 is a diagram showing an encoder structure of a Transformer;
FIG. 4 is a decoder structure diagram of a Transformer;
FIG. 5 is a view of a copy module structure;
FIG. 6 is an exemplary graph of model operation;
fig. 7 is a schematic diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
The end-to-end dialogue method integrating knowledge and emotion based on a variational self-encoder of the present invention will be described in detail with reference to the accompanying drawings. The features of the following examples and embodiments may be combined with each other without conflict.
As shown in fig. 1, the end-to-end dialogue method integrating knowledge and emotion based on a variational self-encoder proposed by the present invention includes the following steps:
(1) acquiring emotion labels, dialogues, knowledge and replies, and preprocessing them to obtain training data.
Specifically, the training data includes emotion tags, conversations, knowledge, and replies. An emotion tag is a tag that represents an emotion and is used to control the generation of a reply with that emotion. The conversation is a chat conversation record. Knowledge is knowledge information related to chat content. The reply is the output that the model should predict from the current emotion tags, dialogue and knowledge.
For example the following training data:
emotion label: "questions".
Conversation: "I ensure that people do not drown or hurt in or near water".
Knowledge: "in some regions, the rescuer is part of the emergency services system for an accident, and in some communities, the rescuer may be the primary EMS provider. ".
And (3) recovering: "in some places, it is not just helpful or not to help the rescuer handle other emergencies, such as mountain rescue
Figure 688529DEST_PATH_IMAGE005
”。
Emotion label: "happy".
Conversation: "I ensure that people do not drown or hurt in or near water".
Knowledge: "in some regions, the rescuer is part of the emergency services system for an accident, and in some communities, the rescuer may be the primary EMS provider. ".
And (3) recovering: it is very meaningful to be able to do this kind of rescue and support. ".
When the training data is processed, the emotion labels are converted into one-hot category labels. The one-hot category label is used to compute the classification loss when judging the emotion type of the model's generated result. For example, assume a total of five emotions, namely neutral, question, happy, sad and angry, corresponding respectively to the one-hot vectors [1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0] and [0,0,0,0,1].
Starting with the separator [CLS], the dialogue turns are spliced with [SEP] symbols, and the [CLS] symbol and the emotion label are spliced before the conversation record, giving the input form: [CLS] emotion label [SEP] dialogue 1 [SEP] dialogue 2 [SEP] ... [SEP]. The spliced dialogue is input into the encoder of the variational self-encoder to obtain the dialogue coding matrix.
When the length of the spliced conversation exceeds 512, the previous conversation is discarded. For example, the dialog:
the visitor: teacher, you are good. The recent mood of the people is depressed, and the deep heart is more dysphoria. The ability level of the former friends of the user is almost equal, and the ability level of the former friends of the user is not as good as that of the former friends of the user. At present, when big families get together to have a dinner and chat each time, the people feel unconsciously, and the gap between the people is larger and larger. For this reason, the mind has a strong sense of frustration, making it difficult for I to accept such results. I do not see that someone else is superior to i but, in contrast, i appear to be particularly frustrating.
The consultant: that means you are not satisfied with the current situation of oneself
Figure 776571DEST_PATH_IMAGE005
The visitor: is. The university is a brand-new beginning, i want to achieve good results as long as i strives, but the university is ended in two years, i do not strive as soon as i do not have the anticipatory effort, and are just abridged, i speak in class, but learning efficiency is not high, and people are worried about. For the self without state, I are very helpless and therefore want to change.
The consultant: you say that you are not working diligently now, that you have been diligent in the past
Figure 574763DEST_PATH_IMAGE005
The visitor: yes, this must have. I remember that when I am in junior middle school, dad transfers I to a school with better teaching conditions in the city, and a new teacher who is a new college in a new environment always has the feeling that the lattice is not imported. Classmates and teachers are questioned about my learning ability, so to prove to them that i is not erratic, and that i is diligent learning each day, reciting words and lessons with a time of early exercise, making full use of each minute of the break-in time per second, also for the expectation of innocent negative parents. After all, the time and the labor are not relieved, i obtain excellent results in the first monthly exam, and teachers and students can look at the time and the labor is saved.
The emotion label is "question", and the spliced input is:
[CLS] question [SEP] So you mean you are not satisfied with your current situation? [SEP] Yes. University was a brand-new start, and I believed that as long as I worked hard I could achieve good results. But two years of university have passed and I have not worked as hard as I expected; I have just been drifting along. I attend classes, but my learning efficiency is low, and I am anxious about it. I feel helpless about this listless self, and that is why I want to change. [SEP] You say you are not working hard now; have you worked hard in the past? [SEP] Yes, certainly. I remember that in junior middle school my dad transferred me to a school in the city with better teaching conditions, and as a new student facing new teachers in a new environment I always felt out of place. Classmates and teachers doubted my learning ability, so to prove to them that I was not inferior I studied diligently every day, recited words and texts during morning exercise time, and made full use of every minute and second of break time, also so as not to let my parents down. The effort paid off: I achieved excellent results in the first monthly exam, and teachers and classmates began to see me in a new light. [SEP]
Due to the length limitation, the first utterance spoken by the visitor ("Hello, teacher. ...") is discarded. When the length of a single dialogue turn already exceeds 512, the excess portion is truncated so that the spliced input length does not exceed 512.
After the dialog is spliced, the dialog is converted into an integer index vector. Knowledge and replies are also converted to integer index vectors in the same way.
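As a concrete illustration of this preprocessing, the following Python sketch converts an emotion label to a one-hot vector, splices the input in the [CLS] emotion [SEP] dialogue ... form with the 512-length truncation described above, and maps tokens to integer indices. The vocabulary, character-level tokenization and function names are illustrative assumptions, not part of the patent.

```python
# Illustrative preprocessing sketch; the vocabulary and character-level
# tokenization are assumptions, not specified by the patent.
EMOTIONS = ["neutral", "question", "happy", "sad", "angry"]

def one_hot(emotion):
    vec = [0] * len(EMOTIONS)
    vec[EMOTIONS.index(emotion)] = 1          # e.g. "question" -> [0,1,0,0,0]
    return vec

def splice(emotion, turns, max_len=512):
    """[CLS] emotion [SEP] turn1 [SEP] turn2 [SEP] ...; the oldest turns are
    dropped first, and a single over-long turn is truncated, keeping <= 512."""
    header = ["[CLS]", emotion, "[SEP]"]
    body = []
    for turn in reversed(turns):               # keep the most recent turns
        piece = list(turn) + ["[SEP]"]         # character-level tokens
        if len(header) + len(piece) + len(body) > max_len:
            break
        body = piece + body
    if not body and turns:                     # one turn alone exceeds the limit
        piece = list(turns[-1]) + ["[SEP]"]
        body = piece[-(max_len - len(header)):]
    return header + body

def to_ids(tokens, vocab):
    unk = vocab["[UNK]"]
    return [vocab.get(t, unk) for t in tokens]  # integer index vector
```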
(2) Building a model consisting of a variational self-encoder module and a copy module. The variational self-encoder module comprises an encoder and a decoder. The encoder is used for encoding the emotion label and the semantic information of the dialogue to obtain a dialogue coding matrix. The decoder comprises an encoding end and a decoding end; it generates a knowledge coding matrix from the knowledge and autoregressively generates, in combination with the knowledge coding matrix, a decoding vector and a predicted emotion label. The copy module updates the state vector by combining the dialogue coding matrix and knowledge coding matrix generated by the variational self-encoder module with the current decoding vector, and uses the updated state vector together with the dialogue coding matrix and the knowledge coding matrix to predict and generate the output reply.
Specifically, the model structure is as shown in fig. 2, and the model is mainly divided into two parts: the variabilities are derived from the encoder module and the copy module.
The encoder of the variational self-encoder is shown in the left column of fig. 2. It is realized with the Transformer model structure, using the encoder end of a Transformer. The Transformer encoder end adopted in this example consists of 6 encoder (coding) layers. The structure of each coding layer is shown in fig. 3: the input is processed in sequence by a multi-head attention layer, a residual connection layer, a normalization layer, a linear layer, a residual connection layer and a normalization layer. The encoder encodes the emotion label and the semantic information of the dialogue into the dialogue coding matrix, from which the mean and variance of a normal distribution can be predicted through a multi-layer feed-forward network. The decoder comprises an encoding end and a decoding end: the knowledge is input into the encoding end to obtain the knowledge coding matrix; the normal distribution is sampled to obtain a sampling vector; when the model predicts and generates a reply, the sampling vector is added to the word embedding vector corresponding to the dialogue start character; and the decoding end outputs a decoding matrix used to predict the emotion label of the generated reply.
The decoder of the variational self-encoder is implemented with the Transformer structure, as shown in the middle and right columns of fig. 2. The middle column of fig. 2 represents the encoding end of the Transformer, and the right column represents the decoding end. This example uses a 6-layer Transformer structure. The encoder structure of the Transformer encoding end is shown in fig. 3, and the decoder (decoding layer) of the decoding end is shown in fig. 4. The decoder input is processed in sequence by a multi-head masked attention layer, a residual connection layer, a normalization layer, a cross attention layer, a residual connection layer, a normalization layer, a linear layer, a residual connection layer and a normalization layer. After reply generation is completed, the output of the Transformer decoding end is passed through a multi-layer feed-forward network to predict the emotion type of the output.
The structure of the copy module is shown in fig. 5. And carrying out weighted summation on the dialogue coding matrix to obtain a dialogue reading vector, and carrying out weighted summation on the knowledge coding matrix to obtain a knowledge reading vector. And then splicing the dialogue reading vector, the knowledge reading vector, the state vector and the output vector generated by the decoder at the current moment, and obtaining a new state vector after passing through a feedforward neural network. The new state vector will be used to generate the output. The copy module has a generation mode and a copy mode.
In the generation mode, the new state vector generates a score for each character through the linear layer.
In the copy mode, after the vector corresponding to each input character in the knowledge coding matrix passes through the mapping and activation function of the linear layer, the vector and the new state vector are subjected to inner product to obtain the score for generating the input character. And after the vector corresponding to each input character in the dialogue coding matrix passes through the mapping and activation function of the linear layer, performing inner product on the vector and the sum of the new state vector and the vector sampled from the normal distribution to obtain the score for generating the input character.
And adding the scores of all the characters generated in the generation mode and the copy mode, and dividing the scores by the normalization factor to obtain the generation probability of each character.
(3) Inputting the training data preprocessed in the step (1) into the model constructed in the step (2) for model training and storing.
With the training data prepared and the model built, model training begins. In the embodiment of the invention, the model is trained with the Teacher Forcing method. The loss function of model training is the sum of the loss between the generated reply and the target reply and the loss between the emotion label of the generated reply and the target emotion label:

$$\mathrm{Loss} = -\sum_{t}\log P\big(y'_t = y_t \mid y'_{<t}, U, K, em\big) + \mathrm{CE}\big(em', em\big)$$

wherein Loss is the loss value, em' is the predicted emotion label, Y' is the predicted reply, $y'_t$ is the character predicted at time t, $y_t$ is the label character at time t, $y'_{<t}$ are the characters predicted before time t, U is the dialogue, K is the knowledge, em is the emotion label, and CE is the cross-entropy.
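For concreteness, a minimal PyTorch-style sketch of this loss under teacher forcing is given below; the framework, tensor shapes and variable names are assumptions for illustration only.

```python
import torch.nn.functional as F

def dialogue_loss(char_logits, target_ids, emotion_logits, emotion_label):
    """Sketch of the training loss: character-level cross-entropy between the
    generated reply and the target reply, plus cross-entropy between the
    predicted and target emotion labels. Assumed shapes: char_logits (T, V),
    target_ids (T,), emotion_logits (1, num_emotions), emotion_label (1,)."""
    gen_loss = F.cross_entropy(char_logits, target_ids)
    emo_loss = F.cross_entropy(emotion_logits, emotion_label)
    return gen_loss + emo_loss
```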
The model is saved after training.
(4) Acquiring emotion labels and dialogues, selecting knowledge, and preprocessing the emotion labels and dialogues, including splicing, to obtain prediction data.
Specifically, after the model is trained, it can be used to generate replies. The emotion type for which a reply is desired and the conversation record are spliced into the form [CLS] emotion label [SEP] dialogue 1 [SEP] dialogue 2 [SEP] ... [SEP]. When the length of the spliced dialogue exceeds 512, earlier turns are discarded; when a single turn already exceeds 512, the excess portion is truncated so that the spliced input length does not exceed 512. The spliced character string is converted into an integer index vector. Related knowledge is selected, its length is kept within 512, and it is converted into an integer index vector.
(5) Inputting the prediction data preprocessed in the step (4) into the model trained in the step (3) for model prediction to obtain a reply.
The specific steps are as follows: characters are selected by greedy search or beam search based on the probabilities with which the model generates characters, producing a reply; generation completes when [CLS], [SEP], a start symbol or an end symbol is produced. The decoding matrix output by the decoding end of the variational self-encoder module's decoder is average-pooled and input into a feed-forward neural network to obtain the predicted emotion label. After the reply generated by the model is sent to the user, the user replies with new content; the model's reply and the user's new reply are spliced into the dialogue, a new emotion label is selected and spliced to the front of the dialogue, knowledge is selected, and the input is fed to the model, continuing the end-to-end conversation.
The spliced dialogue integer index vector and the integer index vector of the related knowledge are then input into the model, and the model generates a reply.
Specifically, the integer index vector of the dialogue is input into the encoder of the model's variational self-encoder. The encoder adds the embedding vectors corresponding to the integer indices to the position encoding to obtain a matrix, expressed by the formula:

$$H_0 = \mathrm{Embedding}(U) + \mathrm{PE}(U)$$

where $H_0$ is the resulting matrix, U is the spliced dialogue integer index vector, $\mathrm{Embedding}(\cdot)$ is the word-embedding operation, and $\mathrm{PE}(\cdot)$ is the position-encoding operation.
The position-encoding operation represents position information by means of trigonometric functions, specifically:

$$PE_{(k,2i)} = \sin\big(k / 10000^{2i/d_{\mathrm{model}}}\big), \qquad PE_{(k,2i+1)} = \cos\big(k / 10000^{2i/d_{\mathrm{model}}}\big)$$

where $PE_{(k,2i)}$ is the 2i-th component of the encoding vector at position k, $PE_{(k,2i+1)}$ is the (2i+1)-th component of the encoding vector at position k, k is the position, i is the dimension component, and $d_{\mathrm{model}}$ is the model dimension.
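A short sketch of this sinusoidal position encoding, assuming an even model dimension, might look as follows.

```python
import torch

def position_encoding(max_len, d_model):
    """PE[k, 2i] = sin(k / 10000^(2i/d)), PE[k, 2i+1] = cos(k / 10000^(2i/d))."""
    k = torch.arange(max_len).unsqueeze(1).float()   # positions (max_len, 1)
    i = torch.arange(0, d_model, 2).float()          # even dimension indices
    angle = k / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                   # even components
    pe[:, 1::2] = torch.cos(angle)                   # odd components
    return pe
```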
The resulting matrix is then input into the multi-layer encoder to obtain the dialogue coding output. The calculation process can be expressed as:

$$M_U = \mathrm{Encoder}(H_0)$$

where Encoder denotes the encoding operation of the Transformer's Encoder end and $M_U$ is the dialogue coding matrix.
Within the encoder, the input passes in sequence through multi-head attention, residual connection, layer normalization, a linear layer, residual connection and layer normalization. Expressed as formulas:

$$A_i = \mathrm{LayerNormalization}\big(h_{i-1} + \mathrm{MultiHeadAttention}(h_{i-1}, h_{i-1}, h_{i-1})\big)$$
$$h_i = \mathrm{LayerNormalization}\big(A_i + \mathrm{Linear}(A_i)\big)$$

where $h_{i-1}$ is the input of the i-th layer encoder, $A_i$ is an intermediate result matrix, $h_i$ is the output of the i-th layer encoder, and LayerNormalization denotes layer normalization. MultiHeadAttention is the multi-head attention calculation:

$$\mathrm{MultiHeadAttention}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}\big(Q W_i^Q,\, K W_i^K,\, V W_i^V\big)$$

where Q, K, V are the input matrices, Concat is the vector splicing operation, h is the number of heads, and $W_i^Q$, $W_i^K$, $W_i^V$ are the weights of the i-th head.
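The multi-head attention calculation can be sketched as below; the per-head weight lists and the explicit loop are used here purely for clarity.

```python
import torch

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Sketch: project Q/K/V per head, apply scaled dot-product attention,
    concatenate the heads, and apply the output weight W^O.
    Q, K, V: (L, d_model); Wq/Wk/Wv: lists of (d_model, d_k) weights."""
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        q, k, v = Q @ wq, K @ wk, V @ wv
        scores = q @ k.transpose(0, 1) / (q.shape[-1] ** 0.5)
        heads.append(torch.softmax(scores, dim=-1) @ v)   # Attention(QW, KW, VW)
    return torch.cat(heads, dim=-1) @ Wo                  # Concat(head_1..h) W^O
```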
The above process of passing the spliced dialogue integer index vector through the model to obtain the dialogue coding matrix is the Transformer encoding process, summarized as:

$$M_U = \mathrm{TransformerEncode}(U)$$

After the dialogue coding matrix $M_U$ is obtained, it is input into a feed-forward network layer to obtain the mean and standard deviation of the normal distribution:

$$m,\, d = \mathrm{MLP}(M_U)$$

where m is the mean, d is the standard deviation, and MLP is a multi-layer feed-forward neural network.
The integer index vector of the related knowledge is input into the decoder of the variational self-encoder (its Transformer encoding end) to obtain the knowledge coding matrix. This operation is the same as the one that derives the dialogue coding matrix from the spliced dialogue integer index vector:

$$M_K = \mathrm{TransformerEncode}(K)$$

where $M_K$ is the knowledge coding matrix and K is the knowledge index vector.
After the knowledge coding matrix is obtained, the Transformer decoding operation begins. A sample is drawn from the standard normal distribution, multiplied by the standard deviation, and added to the mean to obtain the sample vector:

$$b = m + d \odot \epsilon, \qquad \epsilon \sim N(0, I)$$

where $\epsilon$ is the sample drawn from the standard normal distribution N(0, I), m is the mean, d is the standard deviation, $\odot$ is element-wise multiplication, and b is the resulting sample vector.
The obtained sample vector is added to the embedding vector of the start character and used for autoregressively generating decoding vectors. Specifically, the Transformer decoder adds the embedding vectors corresponding to the integer indices to the position encoding to obtain a matrix, and when the input is the integer index corresponding to the start string, the sampled vector is added as well:

$$G_t = \mathrm{Embedding}(Y'_{<t}) + \mathrm{PE}(Y'_{<t}) + \mathbb{1}[t = 0]\, b$$

where t denotes the count of generated characters and t = 0 denotes the start character; $G_t$ is the matrix obtained when generating the t-th character; Embedding is the word-embedding operation; PE is the position-encoding operation; $Y'_{<t}$ is the character string generated before time t; and b is the sampling vector, added only at the position of the start character.
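The sampling and the injection of the sample vector at the start character can be sketched as follows; `mlp`, `embed` and `pos_enc` are assumed modules standing in for the networks described above.

```python
import torch

def sample_latent(M_U, mlp):
    """b = m + d * eps with eps ~ N(0, I); mlp is assumed to map the dialogue
    coding matrix M_U to the mean m and standard deviation d."""
    m, d = mlp(M_U)
    eps = torch.randn_like(m)
    return m + d * eps

def decoder_input(embed, pos_enc, generated_ids, b):
    """Embedding plus position encoding; the sample vector b is added only at
    the start-character position (t = 0)."""
    x = embed(generated_ids) + pos_enc[: generated_ids.shape[0]]
    bump = torch.zeros_like(x)
    bump[0] = b
    return x + bump
```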
The obtained matrix is then input into the multi-layer decoder to obtain the decoding matrix:

$$O_t = \mathrm{Decoder}(G_t, M_K)$$

Within the decoder, the input passes in sequence through multi-head masked attention, residual connection, layer normalization, cross attention, residual connection, layer normalization, a linear layer, residual connection and layer normalization. Expressed as formulas:

$$A_i = \mathrm{LayerNormalization}\big(h_{i-1} + \mathrm{MaskedMultiHeadAttention}(h_{i-1}, h_{i-1}, h_{i-1})\big)$$
$$B_i = \mathrm{LayerNormalization}\big(A_i + \mathrm{MultiHeadAttention}(A_i, M_K, M_K)\big)$$
$$h_i = \mathrm{LayerNormalization}\big(B_i + \mathrm{Linear}(B_i)\big)$$

where $h_{i-1}$ is the input of the i-th layer decoder when generating the character at time t, $A_i$ and $B_i$ are intermediate result matrices, $M_K$ is the knowledge coding matrix, and $h_i$ is the output of the i-th layer decoder when generating the character at time t. MultiHeadAttention is the same multi-head attention calculation as in the Transformer encoder. MaskedMultiHeadAttention is multi-head attention with a mask; the specific calculation is:

$$\mathrm{MaskedMultiHeadAttention}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(Q W_i^Q)(K W_i^K)^\top}{\sqrt{d_k}} + M\right) V W_i^V$$

where Q, K, V are the input matrices, Concat is the vector splicing operation, h is the number of heads, $W_i^Q$, $W_i^K$, $W_i^V$ are the weights of the i-th head, and M is the mask matrix.
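The mask matrix M used by the masked attention is typically a causal (upper-triangular) mask; a sketch under that assumption:

```python
import torch

def causal_mask(length):
    """Mask matrix M added to the attention scores before softmax: future
    positions receive -inf, so each character attends only to characters
    that have already been generated."""
    return torch.triu(torch.full((length, length), float("-inf")), diagonal=1)

# usage sketch: weights = softmax(q @ k.T / sqrt(d_k) + causal_mask(L))
```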
The dialogue coding matrix, the knowledge coding matrix, the sample vector and the decoding matrix generated by the variational self-encoder are input into the copy module to generate the output. The copy module structure is shown in fig. 5. The copy module uses the dialogue coding matrix, the knowledge coding matrix and the sampling vector to update the state vector and to produce a dialogue selective-read vector and a knowledge selective-read vector; it then generates the reply characters using the state vector, the sampling vector, the dialogue selective-read vector and the knowledge selective-read vector.
The copy module normalizes, over the spliced input dialogue, the generation probabilities of the positions whose character equals the character generated at the previous moment, obtaining weights; it then takes a weighted sum of the corresponding vectors in the dialogue coding matrix to produce the dialogue read vector. Expressed as formulas:

$$\rho^U_{t,i} = \frac{1}{Z_U}\, p_u(u_i)\,\mathbb{1}[u_i = y_{t-1}], \qquad r^U_t = \sum_i \rho^U_{t,i}\, M_U[i]$$

where $Z_U$ is the normalization factor, $p_u(u_i)$ is the probability of copying $u_i$ from the dialogue, $\rho^U_{t,i}$ is the weight corresponding to the i-th dialogue character at time t, $M_U[i]$ is the vector corresponding to the i-th character in the dialogue coding matrix, and $r^U_t$ is the dialogue read vector at time t.
The knowledge read vector is generated in the same way: the generation probabilities of the positions in the spliced knowledge whose character equals the character generated at the previous moment are normalized to obtain weights, and a weighted sum is taken with the corresponding vectors in the knowledge coding matrix:

$$\rho^K_{t,i} = \frac{1}{Z_K}\, p_k(k_i)\,\mathbb{1}[k_i = y_{t-1}], \qquad r^K_t = \sum_i \rho^K_{t,i}\, M_K[i]$$

where $Z_K$ is the normalization factor, $p_k(k_i)$ is the probability of copying $k_i$ from the knowledge, $\rho^K_{t,i}$ is the weight corresponding to the i-th knowledge character at time t, $M_K[i]$ is the vector corresponding to the i-th character in the knowledge coding matrix, and $r^K_t$ is the knowledge read vector at time t.
After the dialogue selective-read vector and the knowledge selective-read vector are obtained, the state vector of the previous moment, the decoding vector generated at the previous moment in the decoding matrix, the dialogue selective-read vector and the knowledge selective-read vector are spliced and input into a feed-forward neural network to obtain the new state vector:

$$s_t = \mathrm{MLP}\big([\, s_{t-1};\; o_{t-1};\; r^U_t;\; r^K_t \,]\big)$$

where $s_{t-1}$ is the state vector of the previous moment, $o_{t-1}$ is the decoding vector generated at the previous moment, and $[\cdot\,;\cdot]$ denotes splicing.
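A sketch of the selective read and the state update described above; `ffn` and the tensor layouts are illustrative assumptions.

```python
import torch

def selective_read(enc, copy_probs, prev_char, input_ids):
    """Weighted sum over the encoding matrix: positions whose input character
    equals the previously generated character receive their normalized copy
    probability as weight; all other positions receive weight zero."""
    same = (input_ids == prev_char).float()
    w = copy_probs * same
    w = w / w.sum().clamp(min=1e-12)           # the normalization factor
    return w @ enc                             # read vector r_t

def update_state(s_prev, o_prev, r_dialogue, r_knowledge, ffn):
    """New state s_t: splice the previous state, the previous decoding vector
    and the two read vectors, then apply a feed-forward network."""
    return ffn(torch.cat([s_prev, o_prev, r_dialogue, r_knowledge], dim=-1))
```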
The new state vector is used to generate the reply character. In the generation mode, the new state vector is mapped into the character space through a linear layer, giving a score for each character:

$$\psi_g(y_t = v_i) = e_i^\top\, W_g\, s_t$$

where $\psi_g$ is the scoring function of the generation mode, $y_t$ is the character generated at the current moment, $v_i$ is the i-th character in the lexicon, $e_i$ is the one-hot vector whose i-th element is 1 and whose remaining elements are 0, $W_g$ is a linear layer, and $s_t$ is the new state vector.
In the copy mode, copies are made from the dialogue and from the knowledge, respectively, using the new state vector. When copying from the dialogue, the vector corresponding to each input character in the dialogue coding matrix passes through the linear-layer mapping and the activation function and then takes an inner product with the sum of the new state vector and the vector sampled from the normal distribution, giving the score for generating that input character. Specifically:

$$\psi_u(y_t = u_i) = \sigma\big(W_u\, M_U[i]\big)^\top (s_t + b)$$

where $\psi_u$ is the scoring function when copying from the dialogue, $y_t$ is the character generated at the current moment, $u_i$ is the i-th character in the spliced dialogue, $M_U[i]$ is the encoding vector corresponding to the i-th character in the spliced dialogue, $W_u$ is a linear layer, $\sigma$ is the activation function, $s_t$ is the new state vector, and b is the sample vector.
When copying from the knowledge, the vector corresponding to each input character in the knowledge coding matrix passes through the linear-layer mapping and the activation function and then takes an inner product with the new state vector, giving the score for generating that input character. Specifically:

$$\psi_k(y_t = k_i) = \sigma\big(W_k\, M_K[i]\big)^\top s_t$$

where $\psi_k$ is the scoring function when copying from the knowledge, $y_t$ is the character generated at the current moment, $k_i$ is the i-th character in the knowledge, $M_K[i]$ is the encoding vector corresponding to the i-th character in the knowledge, $W_k$ is a linear layer, $\sigma$ is the activation function, and $s_t$ is the new state vector.
Combining the model's lexicon with the character set of the input dialogue and the character set of the knowledge gives the normalization factor:

$$Z = \sum_{v \in \mathcal{V} \cup \{\mathrm{UNK}\}} e^{\psi_g(v)} + \sum_i e^{\psi_u(u_i)} + \sum_i e^{\psi_k(k_i)}$$

where Z is the normalization factor, v is a character, $\mathcal{V}$ is the lexicon of the model, UNK is the unknown character, $\psi_g$ is the scoring function of the generation mode, $\psi_u$ is the scoring function when copying from the dialogue, and $\psi_k$ is the scoring function when copying from the knowledge.
Dividing the exponentiated scores of each mode by the normalization factor gives the probability of generating a character. Specifically, the probability of generating character $y_t$ in the generation mode is:

$$p_g(y_t) = \frac{1}{Z}\, e^{\psi_g(y_t)}$$

The probability of copying character $y_t$ from the dialogue is:

$$p_u(y_t) = \frac{1}{Z} \sum_{i:\, u_i = y_t} e^{\psi_u(u_i)}$$

The probability of copying character $y_t$ from the knowledge is:

$$p_k(y_t) = \frac{1}{Z} \sum_{i:\, k_i = y_t} e^{\psi_k(k_i)}$$
combining the generation mode and the copy mode, the probability of generating characters by the model is the sum of the probabilities of generating characters in each mode, and is specifically represented as:
Figure 912794DEST_PATH_IMAGE085
wherein the content of the first and second substances,
Figure 845720DEST_PATH_IMAGE086
for generating patterns
Figure 481101DEST_PATH_IMAGE061
The probability of (a) of (b) being,
Figure 902855DEST_PATH_IMAGE087
to copy from input dialog
Figure 496647DEST_PATH_IMAGE061
The probability of (a) of (b) being,
Figure 371062DEST_PATH_IMAGE088
to copy from knowledge
Figure 75713DEST_PATH_IMAGE061
The probability of (a) of (b) being,
Figure 617553DEST_PATH_IMAGE061
for the character generated at the time t,
Figure 116667DEST_PATH_IMAGE089
is the state vector at the time t,
Figure 743958DEST_PATH_IMAGE090
for the character generated at time t-1,
Figure 721141DEST_PATH_IMAGE091
to variate the dialog coding matrix output from the encoder module of the encoder,
Figure 117487DEST_PATH_IMAGE092
the method is characterized in that a knowledge coding matrix output by an Encoder end of a decoder Transformer of a variational self-Encoder is used, b is a sampling sample, g represents a generation mode, and c represents a copy mode.
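The combined generate/copy distribution can be sketched as follows, with one shared softmax playing the role of the normalization factor Z; the weight modules and the tanh activation are assumptions consistent with the description above.

```python
import torch

def mode_probabilities(s_t, b, M_U, M_K, W_gen, W_u, W_k):
    """Scores from the generation mode, the dialogue-copy mode (state plus
    sample vector b) and the knowledge-copy mode share one normalization Z."""
    score_g = W_gen(s_t)                              # one score per lexicon char
    score_u = torch.tanh(W_u(M_U)) @ (s_t + b)        # one score per dialogue char
    score_k = torch.tanh(W_k(M_K)) @ s_t              # one score per knowledge char
    probs = torch.softmax(torch.cat([score_g, score_u, score_k]), dim=-1)
    n_g, n_u = score_g.numel(), score_u.numel()
    return probs[:n_g], probs[n_g:n_g + n_u], probs[n_g + n_u:]

def char_probability(v, p_g, p_u, p_k, dialogue_ids, knowledge_ids):
    """Total probability of character v: the sum over all three modes."""
    return p_g[v] + p_u[dialogue_ids == v].sum() + p_k[knowledge_ids == v].sum()
```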
After the model outputs the probability of each character, methods such as greedy search or beam search can be used to select characters; each method has its own advantages. This embodiment adopts simple greedy search to select the generated character, i.e., the character with the maximum probability is chosen as the generated character.
The model autoregressively generates the reply until [CLS], [SEP], a start symbol or an end symbol is produced. If none of these has been generated, generation stops when the length of the generated reply reaches a set threshold; the threshold should be less than 512.
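A greedy decoding loop with these stopping criteria might look as follows; `step` stands in for one forward pass of the model and is an assumption.

```python
def greedy_decode(step, start_id, stop_ids, max_len=512):
    """Autoregressive greedy search: at each step take the character with the
    maximum probability; stop on [CLS]/[SEP]/start/end symbols or at max_len."""
    out = [start_id]
    while len(out) < max_len:
        probs = step(out)                  # probability vector over characters
        next_id = int(probs.argmax())
        if next_id in stop_ids:
            break
        out.append(next_id)
    return out[1:]                         # drop the start symbol
```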
After the autoregressive generation of the reply is completed, the decoding matrix produced by the decoder of the variational self-encoder is average-pooled and then input into a feed-forward neural network to predict the emotion label of the reply:

$$em' = \mathrm{MLP}\big(\mathrm{AvgPool}(O)\big)$$

where em' is the predicted emotion label, O is the decoder output of the 6th Transformer layer, i.e., the decoding matrix, and AvgPool is the average pooling operation.
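The emotion prediction head reduces to average pooling followed by a classifier; a sketch, with `mlp` as an assumed module mapping the model dimension to the number of emotions:

```python
def predict_emotion(O, mlp):
    """Average-pool the decoding matrix O (shape (T, d_model)) over time and
    classify: em' = MLP(AvgPool(O))."""
    pooled = O.mean(dim=0)                 # AvgPool over the time dimension
    return int(mlp(pooled).argmax())       # index of the predicted emotion label
```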
After the output generated by the model is returned to the user, the user replies with new content. The model's output and the user's new reply are spliced onto the previous dialogue record, and the emotion label is set to a new label to obtain a new spliced dialogue. New knowledge is selected, and the new spliced dialogue and the knowledge are input into the model to obtain the model's reply. Looping in this way sustains a continuous conversation with the user.
An example of the model operation is shown in FIG. 6. The encoder of the diversity auto-encoder in the example employs the Bert model. Input the emotional tag "question", input the pairIn some regions, the rescuer is part of the emergency services system for accidents, and in some communities, the rescuer may be the primary EMS provider. ". Model output reply "in some places, it is not also helpful for the rescuer to deal with other emergencies, such as mountain rescue
Figure 417385DEST_PATH_IMAGE005
". The emotional label predicted by the model is "question". The model generates a reply that is related to the input knowledge and carries the sentiment of the question, consistent with the requirements of the input sentiment tag. The emotion label predicted by the model is also consistent with the input emotion label.
The emotion label is then changed to "happy", and the input concatenated dialogue becomes "[CLS] happy [SEP] I make sure that people do not drown or get hurt in or near the water [SEP]". The input related knowledge remains unchanged. The model outputs the reply "Great! A profession that can handle rescue and relief work is truly glorious.", and the emotion label predicted by the model is "happy".

An example of a multi-turn dialogue: the input emotion label is "question" and the input dialogue is "Hello", concatenated as "[CLS] question [SEP] Hello [SEP]". The input related knowledge is "A psychological counselor refers to a professional who, applying psychology and related knowledge, following psychological principles, and using the techniques and methods of psychological counseling, helps a help-seeker relieve psychological problems. The seven principles followed by psychological counselors are the confidentiality principle, the comprehension principle, the time-limit principle, the principle of helping people help themselves, the principle of 'not rejecting those who come, not pursuing those who leave', the principle of objective neutrality and unconditional positive regard, and the principle of delaying important decisions.". The model generates the reply "Please have a seat. Have you had any experience with psychological counseling before?", and the predicted emotion label is "question". The user answers "No. I want to ask: will our conversation be overheard by other people?", and a neutral reply is desired from the model. The emotion label is set to "neutral" and the user's reply is spliced onto the dialogue, giving the concatenated input "[CLS] neutral [SEP] Hello [SEP] Please have a seat. Have you had any experience with psychological counseling before? [SEP] No. I want to ask: will our conversation be overheard by other people? [SEP]". The input knowledge remains unchanged. The model generates the reply "Please trust me: confidentiality is one of the principles of psychological counselors, and the counseling room is well soundproofed, so the conversation cannot be overheard.", and the model predicts the emotion label "neutral".
Corresponding to the foregoing embodiments of the end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder, the present invention also provides embodiments of an end-to-end dialogue apparatus integrating knowledge and emotion based on the variational self-encoder.
Referring to FIG. 7, an end-to-end dialogue apparatus integrating knowledge and emotion based on a variational self-encoder according to an embodiment of the present invention includes one or more processors configured to implement the end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder of the foregoing embodiments.
The embodiments of the end-to-end dialogue apparatus integrating knowledge and emotion based on the variational self-encoder of the present invention can be applied to any device with data processing capability, such as a computer or a similar device or apparatus. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, as a logical apparatus, it is formed by the processor of the device with data processing capability reading the corresponding computer program instructions from nonvolatile memory into memory and running them. In terms of hardware, FIG. 7 shows a hardware structure diagram of a device with data processing capability on which the end-to-end dialogue apparatus integrating knowledge and emotion based on the variational self-encoder of the present invention is located. In addition to the processor, memory, network interface, and nonvolatile memory shown in FIG. 7, the device on which the apparatus of the embodiment is located may also include other hardware according to the actual function of that device, which is not described again here.
The implementation of the functions and effects of each unit in the above apparatus is described in detail in the implementation of the corresponding steps of the above method and is not repeated here.
Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present invention. A person of ordinary skill in the art can understand and implement the solution without inventive effort.
Embodiments of the present invention also provide a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder of the above embodiments.
The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of such a device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash card (Flash Card) provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used to store the computer program and other programs and data required by the device, and may also be used to temporarily store data that has been output or is to be output.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A variational self-encoder based knowledge and emotion infused end-to-end dialog method, said method comprising the steps of:
(1) acquiring emotion labels, conversations, knowledge and replies, and preprocessing to obtain training data;
(2) building a model consisting of a variational self-encoder module and a copy module; the variational self-encoder module comprises an encoder and a decoder; the encoder is used for encoding the emotion label and the semantic information of the dialogue to obtain a dialogue encoding matrix; the decoder comprises an encoding end and a decoding end, and is used for generating a knowledge encoding matrix from the knowledge and for autoregressively generating a decoding vector and a predicted emotion label conditioned on the knowledge encoding matrix; the copy module updates a state vector by combining the dialogue encoding matrix and the knowledge encoding matrix generated by the variational self-encoder module with the current decoding vector, and uses the updated state vector, together with the dialogue encoding matrix and the knowledge encoding matrix, to predict and generate the output reply;
(3) inputting the training data preprocessed in the step (1) into the model constructed in the step (2) for training the model and storing the model;
(4) acquiring an emotion label and a dialogue, selecting knowledge, and preprocessing the emotion label and the dialogue, including splicing, to obtain prediction data;
(5) inputting the prediction data preprocessed in the step (4) into the model trained in the step (3) for model prediction to obtain a reply.
2. The end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder according to claim 1, wherein the preprocessing in step (1) and step (4) comprises converting the emotion label into a one-hot category label and splicing the emotion label with the dialogue; the process of splicing the emotion label and the dialogue is specifically: starting with the separator [CLS], then concatenating the emotion label and a separator [SEP], and then concatenating the historical dialogue turns separated by [SEP], with the total length not exceeding 512.
3. The variational self-encoder based knowledge-and-emotion infused end-to-end dialog method of claim 1, wherein the model training loss function formula is as follows:
$$\mathrm{Loss} = -\log P\big(em' = em \mid U, K\big) - \sum_{t} \log P\big(y'_t = y_t \mid y'_{<t}, U, K, em\big)$$

wherein Loss is the loss value, em' is the predicted emotion label, Y' is the predicted reply, $y'_t$ is the character predicted at time t, $y_t$ is the ground-truth character at time t, $y'_{<t}$ denotes the characters predicted before time t, U is the dialogue, K is the knowledge, and em is the emotion label.
4. The end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder according to claim 1, wherein the dialogue encoding matrix is input into a feedforward neural network to generate the mean and variance of a normal distribution; the knowledge is input into the encoding end of the decoder of the variational self-encoder to obtain the knowledge encoding matrix; the normal distribution is sampled to obtain a sampling vector; when the model predicts and generates a reply, the sampling vector is added to the word embedding vector corresponding to the start character of the dialogue; and the decoding end of the decoder of the variational self-encoder module outputs a decoding matrix, which is used for predicting the emotion label of the generated reply.
5. The end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder according to claim 1, wherein the copy module performs a weighted summation over the dialogue encoding matrix to obtain a dialogue reading vector, and a weighted summation over the knowledge encoding matrix to obtain a knowledge reading vector; the dialogue reading vector, the knowledge reading vector, and the state vector are spliced with the output vector currently generated by the decoder, and a new state vector is obtained after passing through a feedforward neural network.
6. The end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder according to claim 5, wherein the copy module has a generation mode and a copy mode; in the generation mode, the updated state vector generates scores for all characters through a linear layer; in the copy mode, the vector corresponding to each input character in the knowledge encoding matrix passes through a linear-layer mapping and an activation function and then takes an inner product with the updated state vector to obtain the score of generating that input character, and the vector corresponding to each input character in the dialogue encoding matrix passes through a linear-layer mapping and an activation function and then takes an inner product with the sum of the updated state vector and the vector sampled from the normal distribution to obtain the score of generating that input character; and the generation mode and the copy mode are combined by adding the scores of each character across the two modes and normalizing, obtaining the probability of the model generating each character.
7. The end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder according to claim 5, wherein step (5) is specifically: selecting characters by greedy search or beam search based on the probabilities of the characters generated by the model, thereby generating the reply; the generation of the reply is completed when [CLS], [SEP], a start symbol, or an end symbol is generated; the decoding matrix output by the decoding end of the decoder of the variational self-encoder module is average-pooled and then input into a feedforward neural network to obtain the predicted emotion label; after the reply generated by the model is sent to the user, the user replies with new content; the reply generated by the model and the new reply of the user are spliced into the dialogue; a new emotion label is selected and spliced to the front of the dialogue; and knowledge is selected and input into the model, so that the end-to-end dialogue continues.
8. A neural network for end-to-end dialogue integrating knowledge and emotion, comprising:
a variational self-encoder module comprising an encoder and a decoder; the encoder is used for encoding the emotion label and the semantic information of the dialogue to generate a dialogue encoding matrix and the parameters of a normal distribution; the decoder is used for generating a knowledge encoding matrix from the knowledge and for autoregressively generating a decoding vector and a predicted emotion label conditioned on the knowledge encoding matrix; the encoder in the variational self-encoder module consists of a plurality of encoding layers, is implemented with the Transformer model structure, and corresponds to the encoder side of a Transformer; each encoding layer comprises, connected in sequence, a multi-head attention layer, a residual connection layer, a normalization layer, a linear layer, a residual connection layer, and a normalization layer; the decoder in the variational self-encoder module consists of a plurality of decoding layers, is implemented with the Transformer model structure, and corresponds to the encoding side and decoding side of a Transformer model; each decoding layer comprises, connected in sequence, a multi-head masked attention layer, a residual connection layer, a normalization layer, a cross-attention layer, a residual connection layer, a normalization layer, a linear layer, a residual connection layer, and a normalization layer;
the copying module updates the state vector by combining the dialog coding matrix and the knowledge coding matrix generated by the variational self-encoder module with the current decoding vector; and generating an output reply by using the updated state vector and combining with the dialogue coding matrix and the knowledge coding matrix for prediction.
9. A variational self-encoder based knowledge and emotion infused end-to-end dialog device, comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is configured to store program data and the processor is configured to execute the program data to implement the variational self-encoder based knowledge-and-emotion-infused end-to-end dialog method of any of the preceding claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out a method of end-to-end dialogue based on knowledge and emotion infused by a variational self-encoder as claimed in any one of claims 1 to 7.
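To make the training loss of claim 3 concrete, here is a minimal sketch assuming PyTorch and the reconstruction given above (token-level cross-entropy on the reply plus cross-entropy on the predicted emotion label); whether the patent's original formula also includes a KL term for the variational latent is not recoverable from the text, so none is shown:

```python
import torch.nn.functional as F

def training_loss(token_logits, target_tokens, emotion_logits, target_emotion):
    """Loss = reply cross-entropy + emotion cross-entropy (sketch of claim 3).

    token_logits:   (batch, T, vocab) autoregressive predictions
    target_tokens:  (batch, T) ground-truth reply characters
    emotion_logits: (batch, num_emotions) predicted emotion distribution
    target_emotion: (batch,) ground-truth emotion label ids
    """
    reply_loss = F.cross_entropy(
        token_logits.transpose(1, 2), target_tokens)   # -sum_t log P(y_t | ...)
    emotion_loss = F.cross_entropy(emotion_logits, target_emotion)
    return reply_loss + emotion_loss
```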
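Similarly, a minimal sketch of the latent sampling described in claim 4, assuming PyTorch and the standard reparameterization trick; the layer names, the hidden size, and the pooling applied before the feedforward network are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    """Maps the dialogue encoding matrix to a normal distribution and samples
    the vector that is added to the reply start-character embedding (claim 4)."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.mu_ffn = nn.Linear(hidden, hidden)      # mean of the normal distribution
        self.logvar_ffn = nn.Linear(hidden, hidden)  # log-variance, for stability

    def forward(self, dialogue_matrix: torch.Tensor) -> torch.Tensor:
        pooled = dialogue_matrix.mean(dim=1)         # assumed pooling before the FFN
        mu, logvar = self.mu_ffn(pooled), self.logvar_ffn(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterized sample
        return z  # at prediction time, added to the start character's word embedding
```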
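Finally, a minimal sketch of the generate/copy combination in claims 5 and 6, again assuming PyTorch; the tensor shapes, the attention used for the weighted summations, and the way copy scores are scattered onto vocabulary positions before the final normalization are all illustrative assumptions rather than the patent's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyModule(nn.Module):
    """Sketch of the copy module: a generate mode over the vocabulary plus
    copy modes over the dialogue and knowledge tokens (claims 5-6)."""
    def __init__(self, hidden: int = 768, vocab: int = 21128):
        super().__init__()
        self.state_ffn = nn.Linear(4 * hidden, hidden)  # fuse reads + state + decode vec
        self.gen_proj = nn.Linear(hidden, vocab)        # generate-mode scores
        self.copy_dlg = nn.Linear(hidden, hidden)       # copy mode, dialogue side
        self.copy_knw = nn.Linear(hidden, hidden)       # copy mode, knowledge side

    def forward(self, state, dec_vec, dlg_mat, knw_mat, z, dlg_ids, knw_ids):
        # Weighted summations over the two encoding matrices (claim 5).
        dlg_read = (F.softmax(dlg_mat @ state.unsqueeze(-1), dim=1) * dlg_mat).sum(dim=1)
        knw_read = (F.softmax(knw_mat @ state.unsqueeze(-1), dim=1) * knw_mat).sum(dim=1)
        # Splice the reads, the state vector, and the current decoding vector,
        # then update the state through a feedforward layer.
        state = torch.tanh(self.state_ffn(
            torch.cat([dlg_read, knw_read, state, dec_vec], dim=-1)))
        # Generate mode: scores for all characters from a linear layer.
        scores = self.gen_proj(state)
        # Copy mode: mapped, activated token vectors take inner products with the
        # state; the dialogue side also adds the sampled latent vector z (claim 6).
        knw_scores = (torch.tanh(self.copy_knw(knw_mat)) @ state.unsqueeze(-1)).squeeze(-1)
        dlg_scores = (torch.tanh(self.copy_dlg(dlg_mat)) @ (state + z).unsqueeze(-1)).squeeze(-1)
        # Add copy scores onto the vocabulary positions of the input tokens,
        # then normalize the combined scores into generation probabilities.
        scores = scores.scatter_add(1, knw_ids, knw_scores)
        scores = scores.scatter_add(1, dlg_ids, dlg_scores)
        return F.softmax(scores, dim=-1), state
```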
CN202210508804.7A 2022-05-11 2022-05-11 End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder Active CN114610861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210508804.7A CN114610861B (en) 2022-05-11 2022-05-11 End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder

Publications (2)

Publication Number Publication Date
CN114610861A true CN114610861A (en) 2022-06-10
CN114610861B CN114610861B (en) 2022-08-26

Family

ID=81870487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210508804.7A Active CN114610861B (en) 2022-05-11 2022-05-11 End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder

Country Status (1)

Country Link
CN (1) CN114610861B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180374474A1 (en) * 2017-06-22 2018-12-27 Baidu Online Network Technology (Beijing) Co., Ltd. Method and Apparatus for Broadcasting a Response Based on Artificial Intelligence, and Storage Medium
CN107798140A (en) * 2017-11-23 2018-03-13 北京神州泰岳软件股份有限公司 A kind of conversational system construction method, semantic controlled answer method and device
CN108595436A (en) * 2018-04-28 2018-09-28 合肥工业大学 The generation method and system of emotion conversation content, storage medium
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN110032636A (en) * 2019-04-30 2019-07-19 合肥工业大学 Emotion based on intensified learning talks with the method that asynchronous generation model generates text
US20210089588A1 (en) * 2019-09-24 2021-03-25 Salesforce.Com, Inc. System and Method for Automatic Task-Oriented Dialog System
CN111241250A (en) * 2020-01-22 2020-06-05 中国人民大学 Emotional dialogue generation system and method
CN112084314A (en) * 2020-08-20 2020-12-15 电子科技大学 Knowledge-introducing generating type session system
CN112289239A (en) * 2020-12-28 2021-01-29 之江实验室 Dynamically adjustable explaining method and device and electronic equipment
CN113239157A (en) * 2021-03-31 2021-08-10 北京百度网讯科技有限公司 Method, device, equipment and storage medium for training conversation model
CN113379606A (en) * 2021-08-16 2021-09-10 之江实验室 Face super-resolution method based on pre-training generation model
CN114168707A (en) * 2021-10-28 2022-03-11 上海大学 Recommendation-oriented emotion type conversation method
CN114168721A (en) * 2021-11-18 2022-03-11 华东师范大学 Method for constructing knowledge enhancement model for multi-sub-target dialogue recommendation system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RAMAN GOEL,等: "Emotion-Aware Transformer Encoder for Empathetic Dialogue Generation", 《2021 9TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS (ACIIW)》 *
WANG QINGLIN, et al.: "Research on a Text Sentiment Enhancement Method Based on Global Semantic Learning", SCIENCE TECHNOLOGY AND ENGINEERING *

Also Published As

Publication number Publication date
CN114610861B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
Bibauw et al. Discussing with a computer to practice a foreign language: Research synthesis and conceptual framework of dialogue-based CALL
Clarke Language and action: A structural model of behaviour
CN107944027B (en) Method and system for creating semantic key index
Kim et al. Design principles and architecture of a second language learning chatbot
Merdivan et al. Dialogue systems for intelligent human computer interactions
CN108595436B (en) Method and system for generating emotional dialogue content and storage medium
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
CN112765333B (en) Automatic dialogue generation method and system based on emotion and prompt word combination
CN115563290B (en) Intelligent emotion recognition method based on context modeling
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN113918813A (en) Method and device for recommending posts based on external knowledge in chat record form
CN113761156A (en) Data processing method, device and medium for man-machine interaction conversation and electronic equipment
Tseng et al. Approaching Human Performance in Behavior Estimation in Couples Therapy Using Deep Sentence Embeddings.
Wang et al. Transformer-based empathetic response generation using dialogue situation and advanced-level definition of empathy
Wang et al. Information-enhanced hierarchical self-attention network for multiturn dialog generation
CN117271745A (en) Information processing method and device, computing equipment and storage medium
Xu et al. CLUF: A neural model for second language acquisition modeling
Tu Learn to speak like a native: AI-powered chatbot simulating natural conversation for language tutoring
CN114610861B (en) End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder
CN116561265A (en) Personalized dialogue generation method, model training method and device
KR102395702B1 (en) Method for providing english education service using step-by-step expanding sentence structure unit
Jiang et al. An affective chatbot with controlled specific emotion expression
Zhong et al. Question generation based on chat‐response conversion
CN115934909B (en) Co-emotion reply generation method and device, terminal and storage medium
Mallios Virtual doctor: an intelligent human-computer dialogue system for quick response to people in need

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant