CN114610861A - End-to-end dialogue method for integrating knowledge and emotion based on variational self-encoder - Google Patents

End-to-end dialogue method for integrating knowledge and emotion based on variational self-encoder

Info

Publication number
CN114610861A
Authority
CN
China
Prior art keywords
knowledge
emotion
encoder
vector
variational self
Legal status
Granted
Application number
CN202210508804.7A
Other languages
Chinese (zh)
Other versions
CN114610861B (en)
Inventor
谢冰
宋伟
朱世强
袭向明
金天磊
周元海
Current Assignee
Zhejiang Lab
Original Assignee
Zhejiang Lab
Application filed by Zhejiang Lab
Priority to CN202210508804.7A
Publication of CN114610861A
Application granted
Publication of CN114610861B
Status: Active

Classifications

    • G06F16/3329 Natural language query formulation or dialogue systems
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/35 Clustering; Classification
    • G06F40/30 Handling natural language data; Semantic analysis
    • G06N3/04 Neural networks; Architecture, e.g. interconnection topology

Abstract

The invention discloses an end-to-end dialogue method that integrates knowledge and emotion based on a variational self-encoder, comprising the following steps: acquiring emotion labels, dialogues, knowledge and replies and preprocessing them as training data; building and training a model consisting of a variational self-encoder module and a copy module; and preprocessing test data, inputting it into the trained model to predict a reply, and continuing the end-to-end conversation. The encoding module of the variational self-encoder encodes the emotion label and the semantic information of the input dialogue. The decoding module of the variational self-encoder integrates knowledge and emotion into the generated content. The copy module produces the output reply by combining the content generated by the decoder, the input dialogue and the knowledge. The method adopts a variational self-encoder structure to generate rich replies, introduces emotion labels to control the emotion type of the reply, and copies information from the input dialogue and knowledge, so that the generated replies are both rich and controllable.

Description

End-to-end dialogue method for integrating knowledge and emotion based on variational self-encoder
Technical Field
The invention belongs to the field of natural language processing, in particular to text generation and dialogue systems, and specifically relates to an end-to-end dialogue method that integrates knowledge and emotion based on a variational self-encoder (variational autoencoder).
Background
In 1950, Alan Turing proposed the Turing test in "Computing Machinery and Intelligence" as a method of detecting whether a machine can chat like a human. The Turing test can be described as separating the tester from the testees (one person and one machine) and letting the tester ask the testees questions at will through some channel (e.g., a keyboard). After a number of tests, if more than 30% of the testers cannot determine whether the testee is a human or a machine, the machine passes the test and is considered to possess artificial intelligence. Turing thus proposed a standard for verifying whether a chat robot possesses intelligence, which can be regarded as the beginning of chat robot research.
In 1966, Joseph Weizenbaum of the Massachusetts Institute of Technology developed a conversational robot named ELIZA, which played the role of a psychotherapist, helping people with psychological illnesses by conversing with the user. ELIZA implemented dialogue using pattern matching and reply selection, which gave it limited conversational ability: it could only answer questions in a particular field. Nevertheless, ELIZA inspired later chat robot research.
In 1972, a chat robot named PARRY appeared, which was designed to simulate a patient with schizophrenia. PARRY had a personality and a better dialogue control structure than ELIZA. However, PARRY's overall language understanding ability was low; it could not learn knowledge from a dialogue and responded slowly.
In 1988, the JABBERWACKY chat robot emerged, developed with the CleverScript scripting language; it exhibited conversational memory by pattern-matching against historical conversation records to produce replies.
In 1995, ALICE was developed by Richard Wallace. It is considered a further advance in the history of chat robot development. Like ELIZA, ALICE is a template-matching-based approach. The AIML (Artificial Intelligence Markup Language) language was created specifically for developing ALICE. ALICE contains about 41,000 templates and associated patterns, which allow it to conduct multi-turn, multi-topic conversations. ALICE's excellent performance won it the Loebner Prize in 2000 and 2001. However, it is still a rule-based chat robot, and its conversation is not particularly intelligent.
With the popularization of the mobile internet and smartphones, chat robots developed further. The SIRI voice assistant for the Apple mobile phone, which went online in 2010, can be regarded as the pioneer of personal voice assistants for smartphones. Users interact with SIRI through voice, and SIRI can invoke rich Internet resources to serve them; it can also recommend suitable goods or services to a user based on accumulated user data.
In 2011, the Watson robot developed by IBM defeated two human champions on the quiz program Jeopardy and exhibited powerful natural language understanding and information retrieval capabilities. Google Now began development in 2012, and Microsoft promoted the personal service assistant Cortana in 2014. In the same year, Amazon introduced Alexa, a voice assistant dedicated to smart homes and the Internet of Things; research on voice assistants reached a wave of enthusiasm. Microsoft also released the XiaoIce chat robot in 2014, dedicated to social scenarios. XiaoIce is set up as an 18-year-old girl with both IQ and EQ; she can generate long, emotional replies and supports multiple languages, and had wide impact after release. It is, however, a complex set of systems rather than an end-to-end model.
The rise of deep learning has had a profound effect on the field of natural language processing. Methods that generate replies with models trained on large-scale corpora have gradually appeared and achieved good results. Models trained on large-scale corpora can handle unseen conversations as well as multi-turn dialogue, and achieve good fluency and logical coherence in the generated replies. Typical models are the GPT series, DialoGPT, Meena, Blender, the Baidu PLATO series, and so on. Deep learning has enabled end-to-end chat robot architectures to exhibit ever greater capabilities.
Current end-to-end models can generate readable replies from historical conversation records, but the richness of the generated dialogue still needs to be enhanced. How to enable the model to generate a reply with a specified emotion is a considerable problem to be solved. A reply containing the specified emotion and related knowledge would greatly improve the quality of the reply and the user's conversational experience.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides an end-to-end dialogue method that integrates knowledge and emotion based on a variational self-encoder.
In order to achieve this technical purpose, the technical scheme of the invention is as follows: a first aspect of an embodiment of the present invention provides an end-to-end dialogue method integrating knowledge and emotion based on a variational self-encoder, the method comprising the following steps:
(1) acquiring emotion labels, conversations, knowledge and replies, and preprocessing to obtain training data;
(2) building a model consisting of a variational self-encoder module and a copy module; the variational self-encoder module comprises an encoder and a decoder; the encoder is used for encoding the emotion label and the semantic information of the dialogue to obtain a dialogue coding matrix; the decoder comprises an encoding end and a decoding end and is used for generating a knowledge coding matrix from the knowledge and for autoregressively generating, in combination with the knowledge coding matrix, a decoding vector and a predicted emotion label; the copy module updates the state vector by combining the dialogue coding matrix and knowledge coding matrix generated by the variational self-encoder module with the current decoding vector, and uses the updated state vector together with the dialogue coding matrix and the knowledge coding matrix to predict and generate the output reply;
(3) inputting the training data preprocessed in the step (1) into the model constructed in the step (2) for training the model and storing the model;
(4) acquiring emotion labels and dialogues, selecting knowledge, and preprocessing the emotion labels and dialogues (including splicing) to obtain prediction data;
(5) inputting the prediction data preprocessed in the step (4) into the model trained in the step (3) for model prediction to obtain a reply.
Further, the preprocessing in step (1) and step (4) comprises converting the emotion label into a one-hot category label and splicing the emotion label with the dialogue; the splicing proceeds as follows: start with the separator [CLS], then concatenate the emotion label and a separator [SEP], then concatenate the historical dialogue turns separated by [SEP]; the total length must not exceed 512.
Further, the loss function of model training is:

$$\mathrm{Loss} = -\sum_{t}\log P\big(y'_t = y_t \mid y'_{<t}, U, K, em\big) + \mathrm{CE}\big(em', em\big)$$

wherein Loss is the loss value, em' is the predicted emotion label, Y' is the predicted reply, $y'_t$ is the character predicted at time t, $y_t$ is the label character at time t, $y'_{<t}$ are the characters predicted before time t, U is the dialogue, K is the knowledge, em is the emotion label, and CE is the cross-entropy.
Further, the dialogue coding matrix is input into a feed-forward neural network to generate the mean and variance of a normal distribution; the knowledge is input into the encoding end of the decoder of the variational self-encoder to obtain the knowledge coding matrix; the normal distribution is sampled to obtain a sampling vector; when the model predicts and generates a reply, the sampling vector is added to the word embedding vector corresponding to the dialogue start character; and the decoding end of the decoder of the variational self-encoder module outputs a decoding matrix used to predict the emotion label of the generated reply.
Furthermore, the copying module carries out weighted summation on the dialogue coding matrix to obtain a dialogue reading vector, and carries out weighted summation on the knowledge coding matrix to obtain a knowledge reading vector; and splicing the dialogue reading vector, the knowledge reading vector and the state vector with an output vector generated by the current decoder, and obtaining a new state vector after passing through a feedforward neural network.
Further, the copy module has a generation mode and a copy mode; in the generation mode, the updated state vector passes through a linear layer to produce a score for every character; in the copy mode, the vector corresponding to each input character in the knowledge coding matrix passes through a linear-layer mapping and an activation function and then takes an inner product with the updated state vector to obtain a score for generating that input character, while the vector corresponding to each input character in the dialogue coding matrix passes through a linear-layer mapping and an activation function and then takes an inner product with the sum of the updated state vector and the vector sampled from the normal distribution to obtain a score for generating that input character; combining the generation mode and the copy mode, the scores of each character across the modes are added and normalized to obtain the probability with which the model generates that character.
Further, the step (5) is specifically: selecting characters by greedy search or beam search based on the probabilities with which the model generates characters, thereby generating a reply; generation of the reply completes when [CLS], [SEP], a start symbol or an end symbol is produced; the decoding matrix output by the decoding end of the variational self-encoder module's decoder is average-pooled and input into a feed-forward neural network to obtain the predicted emotion label; after the reply generated by the model is sent to the user, the user replies with new content; the reply generated by the model and the user's new reply are spliced into the dialogue; a new emotion label is selected and spliced to the front of the dialogue; and knowledge is selected and input into the model, continuing the end-to-end conversation.
A second aspect of embodiments of the present invention provides a neural network for an end-to-end dialog incorporating knowledge and emotion, comprising:
a variational self-encoder module comprising an encoder and a decoder; the encoder is used for encoding the emotion labels and semantic information of the conversation to generate a conversation encoding matrix and normally distributed parameters; the decoder is used for generating a knowledge coding matrix by combining knowledge, and generating a decoding vector and a predicted emotion tag by combining autoregression of the knowledge coding matrix; an encoder in the variational self-encoder module consists of a plurality of encoding layers, is realized by adopting a Transformer model structure and is an encoder end of a Transformer; each coding layer comprises a multi-head attention layer, a residual connection layer, a normalization layer, a linear layer, a residual connection layer and a normalization layer which are connected in sequence; a decoder in the variational self-encoder module consists of a plurality of decoding layers, is realized by adopting a Transformer model structure and is an encoding end and a decoding end of a Transformer model; each decoding layer comprises a multi-head mask attention layer, a residual connecting layer, a normalization layer, a cross attention layer, a residual connecting layer, a normalization layer, a linear layer, a residual connecting layer and a normalization layer which are connected in sequence;
the copying module updates the state vector by combining the dialog coding matrix and the knowledge coding matrix generated by the variational self-encoder module with the current decoding vector; and generating an output reply by using the updated state vector and combining with the dialogue coding matrix and the knowledge coding matrix for prediction.
A third aspect of the embodiments of the present invention provides an end-to-end dialog apparatus based on knowledge and emotion blending of variational self-encoders, including a memory and a processor, where the memory is coupled to the processor; wherein the memory is used for storing program data, and the processor is used for executing the program data to realize the end-to-end conversation method based on the integration knowledge and emotion of the variational self-encoder.
A fourth aspect of embodiments of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the above-described end-to-end dialogue method integrating knowledge and emotion based on a variational self-encoder.
The invention has the beneficial effects that:
1. A variational self-encoder structure is adopted: emotion types and conversation records are encoded into a specific normal distribution, and a sample vector drawn from that distribution is then input into the decoder so as to generate rich replies.
2. The Transformer structure is used for the decoder of the variational self-encoder. The Encoder end of the Transformer encodes the knowledge, and the sampled vector is input into the Decoder end of the Transformer for generating a reply, deeply fusing the information of the emotion type, the dialogue and the knowledge.
3. The variational self-encoder and the Transformer structure are combined, so that the model can combine knowledge to generate emotionally controllable and diversified replies.
4. The method adopts a copy mechanism that can copy information from the dialogue and from the knowledge, so the model can generate low-frequency words appearing in the dialogue, such as names, as well as low-frequency words appearing in the knowledge, such as technical terms, making the generated replies rich and controllable.
5. The copy mechanism employs different strategies when copying information from the dialogue and from the knowledge. The sample vector drawn from the generated normal distribution is merged in when copying information from the dialogue, whereas it is not merged in when copying information from the knowledge. The sample vector contains the information of the emotion label: merging it when copying from the dialogue lets the model focus more on the emotional features in the dialogue, while omitting it when copying from the knowledge makes the copying of knowledge more objective.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a view of the overall model structure according to the present invention;
FIG. 3 is a diagram showing an encoder structure of a Transformer;
FIG. 4 is a decoder structure diagram of a Transformer;
FIG. 5 is a view of a copy module structure;
FIG. 6 is an exemplary graph of model operation;
fig. 7 is a schematic diagram of an apparatus according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present invention. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the invention, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
The end-to-end dialogue method integrating knowledge and emotion based on a variational self-encoder of the present invention will be described in detail with reference to the accompanying drawings. The features of the following examples and embodiments may be combined with each other without conflict.
As shown in fig. 1, the end-to-end dialogue method integrating knowledge and emotion based on a variational self-encoder proposed by the present invention includes the following steps:
(1) acquiring emotion labels, dialogues, knowledge and replies, and preprocessing them to obtain training data.
Specifically, the training data includes emotion tags, conversations, knowledge, and replies. An emotion tag is a tag that represents an emotion and is used to control the generation of a reply with that emotion. The conversation is a chat conversation record. Knowledge is knowledge information related to chat content. The reply is the output that the model should predict from the current emotion tags, dialogue and knowledge.
For example the following training data:
emotion label: "questions".
Conversation: "I ensure that people do not drown or hurt in or near water".
Knowledge: "in some regions, the rescuer is part of the emergency services system for an accident, and in some communities, the rescuer may be the primary EMS provider. ".
And (3) recovering: "in some places, it is not just helpful or not to help the rescuer handle other emergencies, such as mountain rescue
Figure 688529DEST_PATH_IMAGE005
”。
Emotion label: "happy".
Conversation: "I ensure that people do not drown or hurt in or near water".
Knowledge: "in some regions, the rescuer is part of the emergency services system for an accident, and in some communities, the rescuer may be the primary EMS provider. ".
And (3) recovering: it is very meaningful to be able to do this kind of rescue and support. ".
When the training data is processed, the emotion labels are converted into one-hot category labels. The one-hot category label is used to compute the classification loss when judging the emotion type of the model's generated result. For example, assume a total of five emotions, namely neutral, question, happy, sad and angry, corresponding respectively to the one-hot vectors [1,0,0,0,0], [0,1,0,0,0], [0,0,1,0,0], [0,0,0,1,0] and [0,0,0,0,1].
Starting with the separator [CLS], the dialogue turns are spliced with [SEP] symbols, and the [CLS] symbol and the emotion label are spliced before the conversation record, giving the input form: [CLS] emotion label [SEP] dialogue 1 [SEP] dialogue 2 [SEP] ... [SEP]. The spliced dialogue is input into the encoder of the variational self-encoder to obtain the dialogue coding matrix.
When the length of the spliced conversation exceeds 512, the previous conversation is discarded. For example, the dialog:
the visitor: teacher, you are good. The recent mood of the people is depressed, and the deep heart is more dysphoria. The ability level of the former friends of the user is almost equal, and the ability level of the former friends of the user is not as good as that of the former friends of the user. At present, when big families get together to have a dinner and chat each time, the people feel unconsciously, and the gap between the people is larger and larger. For this reason, the mind has a strong sense of frustration, making it difficult for I to accept such results. I do not see that someone else is superior to i but, in contrast, i appear to be particularly frustrating.
The consultant: that means you are not satisfied with the current situation of oneself
Figure 776571DEST_PATH_IMAGE005
The visitor: is. The university is a brand-new beginning, i want to achieve good results as long as i strives, but the university is ended in two years, i do not strive as soon as i do not have the anticipatory effort, and are just abridged, i speak in class, but learning efficiency is not high, and people are worried about. For the self without state, I are very helpless and therefore want to change.
The consultant: you say that you are not working diligently now, that you have been diligent in the past
Figure 574763DEST_PATH_IMAGE005
The visitor: yes, this must have. I remember that when I am in junior middle school, dad transfers I to a school with better teaching conditions in the city, and a new teacher who is a new college in a new environment always has the feeling that the lattice is not imported. Classmates and teachers are questioned about my learning ability, so to prove to them that i is not erratic, and that i is diligent learning each day, reciting words and lessons with a time of early exercise, making full use of each minute of the break-in time per second, also for the expectation of innocent negative parents. After all, the time and the labor are not relieved, i obtain excellent results in the first monthly exam, and teachers and students can look at the time and the labor is saved.
The emotion label is "question", and the spliced input is:
[CLS] question [SEP] So you mean you are not satisfied with your current situation? [SEP] Yes. University was a brand-new start, and I believed that as long as I worked hard I could achieve good results. But two years of university have passed and I have not worked as hard as I expected; I have just been drifting along. I attend classes, but my learning efficiency is low, and I am anxious about it. I feel helpless about this listless self, and that is why I want to change. [SEP] You say you are not working hard now; have you worked hard in the past? [SEP] Yes, certainly. I remember that in junior middle school my dad transferred me to a school in the city with better teaching conditions, and as a new student facing new teachers in a new environment I always felt out of place. Classmates and teachers doubted my learning ability, so to prove to them that I was not inferior I studied diligently every day, recited words and texts during morning exercise time, and made full use of every minute and second of break time, also so as not to let my parents down. The effort paid off: I achieved excellent results in the first monthly exam, and teachers and classmates began to see me in a new light. [SEP]
Due to the length limitation, the first utterance spoken by the visitor ("Hello, teacher. ...") is discarded. When the length of a single dialogue turn already exceeds 512, the excess portion is truncated so that the spliced input length does not exceed 512.
After the dialog is spliced, the dialog is converted into an integer index vector. Knowledge and replies are also converted to integer index vectors in the same way.
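As a concrete illustration of this preprocessing, the following Python sketch converts an emotion label to a one-hot vector, splices the input in the [CLS] emotion [SEP] dialogue ... form with the 512-length truncation described above, and maps tokens to integer indices. The vocabulary, character-level tokenization and function names are illustrative assumptions, not part of the patent.

```python
# Illustrative preprocessing sketch; the vocabulary and character-level
# tokenization are assumptions, not specified by the patent.
EMOTIONS = ["neutral", "question", "happy", "sad", "angry"]

def one_hot(emotion):
    vec = [0] * len(EMOTIONS)
    vec[EMOTIONS.index(emotion)] = 1          # e.g. "question" -> [0,1,0,0,0]
    return vec

def splice(emotion, turns, max_len=512):
    """[CLS] emotion [SEP] turn1 [SEP] turn2 [SEP] ...; the oldest turns are
    dropped first, and a single over-long turn is truncated, keeping <= 512."""
    header = ["[CLS]", emotion, "[SEP]"]
    body = []
    for turn in reversed(turns):               # keep the most recent turns
        piece = list(turn) + ["[SEP]"]         # character-level tokens
        if len(header) + len(piece) + len(body) > max_len:
            break
        body = piece + body
    if not body and turns:                     # one turn alone exceeds the limit
        piece = list(turns[-1]) + ["[SEP]"]
        body = piece[-(max_len - len(header)):]
    return header + body

def to_ids(tokens, vocab):
    unk = vocab["[UNK]"]
    return [vocab.get(t, unk) for t in tokens]  # integer index vector
```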
(2) Building a model consisting of a variational self-encoder module and a copy module. The variational self-encoder module comprises an encoder and a decoder. The encoder is used for encoding the emotion label and the semantic information of the dialogue to obtain a dialogue coding matrix. The decoder comprises an encoding end and a decoding end; it generates a knowledge coding matrix from the knowledge and autoregressively generates, in combination with the knowledge coding matrix, a decoding vector and a predicted emotion label. The copy module updates the state vector by combining the dialogue coding matrix and knowledge coding matrix generated by the variational self-encoder module with the current decoding vector, and uses the updated state vector together with the dialogue coding matrix and the knowledge coding matrix to predict and generate the output reply.
Specifically, the model structure is as shown in fig. 2, and the model is mainly divided into two parts: the variabilities are derived from the encoder module and the copy module.
The encoder of the variational self-encoder is shown in the left column of fig. 2. It is realized with the Transformer model structure, using the encoder end of a Transformer. The Transformer encoder end adopted in this example consists of 6 encoder (coding) layers. The structure of each coding layer is shown in fig. 3: the input is processed in sequence by a multi-head attention layer, a residual connection layer, a normalization layer, a linear layer, a residual connection layer and a normalization layer. The encoder encodes the emotion label and the semantic information of the dialogue into the dialogue coding matrix, from which the mean and variance of a normal distribution can be predicted through a multi-layer feed-forward network. The decoder comprises an encoding end and a decoding end: the knowledge is input into the encoding end to obtain the knowledge coding matrix; the normal distribution is sampled to obtain a sampling vector; when the model predicts and generates a reply, the sampling vector is added to the word embedding vector corresponding to the dialogue start character; and the decoding end outputs a decoding matrix used to predict the emotion label of the generated reply.
The decoder of the variational self-encoder is implemented with the Transformer structure, as shown in the middle and right columns of fig. 2. The middle column of fig. 2 represents the encoding end of the Transformer, and the right column represents the decoding end. This example uses a 6-layer Transformer structure. The encoder structure of the Transformer encoding end is shown in fig. 3, and the decoder (decoding layer) of the decoding end is shown in fig. 4. The decoder input is processed in sequence by a multi-head masked attention layer, a residual connection layer, a normalization layer, a cross attention layer, a residual connection layer, a normalization layer, a linear layer, a residual connection layer and a normalization layer. After reply generation is completed, the output of the Transformer decoding end is passed through a multi-layer feed-forward network to predict the emotion type of the output.
The structure of the copy module is shown in fig. 5. And carrying out weighted summation on the dialogue coding matrix to obtain a dialogue reading vector, and carrying out weighted summation on the knowledge coding matrix to obtain a knowledge reading vector. And then splicing the dialogue reading vector, the knowledge reading vector, the state vector and the output vector generated by the decoder at the current moment, and obtaining a new state vector after passing through a feedforward neural network. The new state vector will be used to generate the output. The copy module has a generation mode and a copy mode.
In the generation mode, the new state vector generates a score for each character through the linear layer.
In the copy mode, after the vector corresponding to each input character in the knowledge coding matrix passes through the mapping and activation function of the linear layer, the vector and the new state vector are subjected to inner product to obtain the score for generating the input character. And after the vector corresponding to each input character in the dialogue coding matrix passes through the mapping and activation function of the linear layer, performing inner product on the vector and the sum of the new state vector and the vector sampled from the normal distribution to obtain the score for generating the input character.
And adding the scores of all the characters generated in the generation mode and the copy mode, and dividing the scores by the normalization factor to obtain the generation probability of each character.
(3) Inputting the training data preprocessed in the step (1) into the model constructed in the step (2) for model training and storing.
With the training data prepared and the model built, model training begins. In the embodiment of the invention, the model is trained with the Teacher Forcing method. The loss function of model training is the sum of the loss between the generated reply and the target reply and the loss between the emotion label of the generated reply and the target emotion label:

$$\mathrm{Loss} = -\sum_{t}\log P\big(y'_t = y_t \mid y'_{<t}, U, K, em\big) + \mathrm{CE}\big(em', em\big)$$

wherein Loss is the loss value, em' is the predicted emotion label, Y' is the predicted reply, $y'_t$ is the character predicted at time t, $y_t$ is the label character at time t, $y'_{<t}$ are the characters predicted before time t, U is the dialogue, K is the knowledge, em is the emotion label, and CE is the cross-entropy.
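For concreteness, a minimal PyTorch-style sketch of this loss under teacher forcing is given below; the framework, tensor shapes and variable names are assumptions for illustration only.

```python
import torch.nn.functional as F

def dialogue_loss(char_logits, target_ids, emotion_logits, emotion_label):
    """Sketch of the training loss: character-level cross-entropy between the
    generated reply and the target reply, plus cross-entropy between the
    predicted and target emotion labels. Assumed shapes: char_logits (T, V),
    target_ids (T,), emotion_logits (1, num_emotions), emotion_label (1,)."""
    gen_loss = F.cross_entropy(char_logits, target_ids)
    emo_loss = F.cross_entropy(emotion_logits, emotion_label)
    return gen_loss + emo_loss
```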
The model is saved after training.
(4) Acquiring emotion labels and dialogues, selecting knowledge, and preprocessing the emotion labels and dialogues, including splicing, to obtain prediction data.
Specifically, after the model is trained, it can be used to generate replies. The emotion type for which a reply is desired and the conversation record are spliced into the form [CLS] emotion label [SEP] dialogue 1 [SEP] dialogue 2 [SEP] ... [SEP]. When the length of the spliced dialogue exceeds 512, earlier turns are discarded; when a single turn already exceeds 512, the excess portion is truncated so that the spliced input length does not exceed 512. The spliced character string is converted into an integer index vector. Related knowledge is selected, its length is kept within 512, and it is converted into an integer index vector.
(5) Inputting the prediction data preprocessed in the step (4) into the model trained in the step (3) for model prediction to obtain a reply.
The specific steps are as follows: characters are selected by greedy search or beam search based on the probabilities with which the model generates characters, producing a reply; generation completes when [CLS], [SEP], a start symbol or an end symbol is produced. The decoding matrix output by the decoding end of the variational self-encoder module's decoder is average-pooled and input into a feed-forward neural network to obtain the predicted emotion label. After the reply generated by the model is sent to the user, the user replies with new content; the model's reply and the user's new reply are spliced into the dialogue, a new emotion label is selected and spliced to the front of the dialogue, knowledge is selected, and the input is fed to the model, continuing the end-to-end conversation.
The spliced dialogue integer index vector and the integer index vector of the related knowledge are then input into the model, and the model generates a reply.
Specifically, the integer index vector of the dialogue is input into the encoder of the model's variational self-encoder. The encoder adds the embedding vectors corresponding to the integer indices to the position encoding to obtain a matrix, expressed by the formula:

$$H_0 = \mathrm{Embedding}(U) + \mathrm{PE}(U)$$

where $H_0$ is the resulting matrix, U is the spliced dialogue integer index vector, $\mathrm{Embedding}(\cdot)$ is the word-embedding operation, and $\mathrm{PE}(\cdot)$ is the position-encoding operation.
The position-encoding operation represents position information by means of trigonometric functions, specifically:

$$PE_{(k,2i)} = \sin\big(k / 10000^{2i/d_{\mathrm{model}}}\big), \qquad PE_{(k,2i+1)} = \cos\big(k / 10000^{2i/d_{\mathrm{model}}}\big)$$

where $PE_{(k,2i)}$ is the 2i-th component of the encoding vector at position k, $PE_{(k,2i+1)}$ is the (2i+1)-th component of the encoding vector at position k, k is the position, i is the dimension component, and $d_{\mathrm{model}}$ is the model dimension.
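A short sketch of this sinusoidal position encoding, assuming an even model dimension, might look as follows.

```python
import torch

def position_encoding(max_len, d_model):
    """PE[k, 2i] = sin(k / 10000^(2i/d)), PE[k, 2i+1] = cos(k / 10000^(2i/d))."""
    k = torch.arange(max_len).unsqueeze(1).float()   # positions (max_len, 1)
    i = torch.arange(0, d_model, 2).float()          # even dimension indices
    angle = k / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                   # even components
    pe[:, 1::2] = torch.cos(angle)                   # odd components
    return pe
```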
The resulting matrix is then input into the multi-layer encoder to obtain the dialogue coding output. The calculation process can be expressed as:

$$M_U = \mathrm{Encoder}(H_0)$$

where Encoder denotes the encoding operation of the Transformer's Encoder end and $M_U$ is the dialogue coding matrix.
Within the encoder, the input passes in sequence through multi-head attention, residual connection, layer normalization, a linear layer, residual connection and layer normalization. Expressed as formulas:

$$A_i = \mathrm{LayerNormalization}\big(h_{i-1} + \mathrm{MultiHeadAttention}(h_{i-1}, h_{i-1}, h_{i-1})\big)$$
$$h_i = \mathrm{LayerNormalization}\big(A_i + \mathrm{Linear}(A_i)\big)$$

where $h_{i-1}$ is the input of the i-th layer encoder, $A_i$ is an intermediate result matrix, $h_i$ is the output of the i-th layer encoder, and LayerNormalization denotes layer normalization. MultiHeadAttention is the multi-head attention calculation:

$$\mathrm{MultiHeadAttention}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O, \qquad \mathrm{head}_i = \mathrm{Attention}\big(Q W_i^Q,\, K W_i^K,\, V W_i^V\big)$$

where Q, K, V are the input matrices, Concat is the vector splicing operation, h is the number of heads, and $W_i^Q$, $W_i^K$, $W_i^V$ are the weights of the i-th head.
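The multi-head attention calculation can be sketched as below; the per-head weight lists and the explicit loop are used here purely for clarity.

```python
import torch

def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    """Sketch: project Q/K/V per head, apply scaled dot-product attention,
    concatenate the heads, and apply the output weight W^O.
    Q, K, V: (L, d_model); Wq/Wk/Wv: lists of (d_model, d_k) weights."""
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        q, k, v = Q @ wq, K @ wk, V @ wv
        scores = q @ k.transpose(0, 1) / (q.shape[-1] ** 0.5)
        heads.append(torch.softmax(scores, dim=-1) @ v)   # Attention(QW, KW, VW)
    return torch.cat(heads, dim=-1) @ Wo                  # Concat(head_1..h) W^O
```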
The above process of passing the spliced dialogue integer index vector through the model to obtain the dialogue coding matrix is the Transformer encoding process, summarized as:

$$M_U = \mathrm{TransformerEncode}(U)$$

After the dialogue coding matrix $M_U$ is obtained, it is input into a feed-forward network layer to obtain the mean and standard deviation of the normal distribution:

$$m,\, d = \mathrm{MLP}(M_U)$$

where m is the mean, d is the standard deviation, and MLP is a multi-layer feed-forward neural network.
The integer index vector of the related knowledge is input into the decoder of the variational self-encoder (its Transformer encoding end) to obtain the knowledge coding matrix. This operation is the same as the one that derives the dialogue coding matrix from the spliced dialogue integer index vector:

$$M_K = \mathrm{TransformerEncode}(K)$$

where $M_K$ is the knowledge coding matrix and K is the knowledge index vector.
After the knowledge coding matrix is obtained, the Transformer decoding operation begins. A sample is drawn from the standard normal distribution, multiplied by the standard deviation, and added to the mean to obtain the sample vector:

$$b = m + d \odot \epsilon, \qquad \epsilon \sim N(0, I)$$

where $\epsilon$ is the sample drawn from the standard normal distribution N(0, I), m is the mean, d is the standard deviation, $\odot$ is element-wise multiplication, and b is the resulting sample vector.
The obtained sample vector is added to the embedding vector of the start character and used for autoregressively generating decoding vectors. Specifically, the Transformer decoder adds the embedding vectors corresponding to the integer indices to the position encoding to obtain a matrix, and when the input is the integer index corresponding to the start string, the sampled vector is added as well:

$$G_t = \mathrm{Embedding}(Y'_{<t}) + \mathrm{PE}(Y'_{<t}) + \mathbb{1}[t = 0]\, b$$

where t denotes the count of generated characters and t = 0 denotes the start character; $G_t$ is the matrix obtained when generating the t-th character; Embedding is the word-embedding operation; PE is the position-encoding operation; $Y'_{<t}$ is the character string generated before time t; and b is the sampling vector, added only at the position of the start character.
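The sampling and the injection of the sample vector at the start character can be sketched as follows; `mlp`, `embed` and `pos_enc` are assumed modules standing in for the networks described above.

```python
import torch

def sample_latent(M_U, mlp):
    """b = m + d * eps with eps ~ N(0, I); mlp is assumed to map the dialogue
    coding matrix M_U to the mean m and standard deviation d."""
    m, d = mlp(M_U)
    eps = torch.randn_like(m)
    return m + d * eps

def decoder_input(embed, pos_enc, generated_ids, b):
    """Embedding plus position encoding; the sample vector b is added only at
    the start-character position (t = 0)."""
    x = embed(generated_ids) + pos_enc[: generated_ids.shape[0]]
    bump = torch.zeros_like(x)
    bump[0] = b
    return x + bump
```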
The obtained matrix is then input into the multi-layer decoder to obtain the decoding matrix:

$$O_t = \mathrm{Decoder}(G_t, M_K)$$

Within the decoder, the input passes in sequence through multi-head masked attention, residual connection, layer normalization, cross attention, residual connection, layer normalization, a linear layer, residual connection and layer normalization. Expressed as formulas:

$$A_i = \mathrm{LayerNormalization}\big(h_{i-1} + \mathrm{MaskedMultiHeadAttention}(h_{i-1}, h_{i-1}, h_{i-1})\big)$$
$$B_i = \mathrm{LayerNormalization}\big(A_i + \mathrm{MultiHeadAttention}(A_i, M_K, M_K)\big)$$
$$h_i = \mathrm{LayerNormalization}\big(B_i + \mathrm{Linear}(B_i)\big)$$

where $h_{i-1}$ is the input of the i-th layer decoder when generating the character at time t, $A_i$ and $B_i$ are intermediate result matrices, $M_K$ is the knowledge coding matrix, and $h_i$ is the output of the i-th layer decoder when generating the character at time t. MultiHeadAttention is the same multi-head attention calculation as in the Transformer encoder. MaskedMultiHeadAttention is multi-head attention with a mask; the specific calculation is:

$$\mathrm{MaskedMultiHeadAttention}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^O$$
$$\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(Q W_i^Q)(K W_i^K)^\top}{\sqrt{d_k}} + M\right) V W_i^V$$

where Q, K, V are the input matrices, Concat is the vector splicing operation, h is the number of heads, $W_i^Q$, $W_i^K$, $W_i^V$ are the weights of the i-th head, and M is the mask matrix.
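The mask matrix M used by the masked attention is typically a causal (upper-triangular) mask; a sketch under that assumption:

```python
import torch

def causal_mask(length):
    """Mask matrix M added to the attention scores before softmax: future
    positions receive -inf, so each character attends only to characters
    that have already been generated."""
    return torch.triu(torch.full((length, length), float("-inf")), diagonal=1)

# usage sketch: weights = softmax(q @ k.T / sqrt(d_k) + causal_mask(L))
```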
The dialogue coding matrix, the knowledge coding matrix, the sample vector and the decoding matrix generated by the variational self-encoder are input into the copy module to generate the output. The copy module structure is shown in fig. 5. The copy module uses the dialogue coding matrix, the knowledge coding matrix and the sampling vector to update the state vector and to produce a dialogue selective-read vector and a knowledge selective-read vector; it then generates the reply characters using the state vector, the sampling vector, the dialogue selective-read vector and the knowledge selective-read vector.
The copy module normalizes, over the spliced input dialogue, the generation probabilities of the positions whose character equals the character generated at the previous moment, obtaining weights; it then takes a weighted sum of the corresponding vectors in the dialogue coding matrix to produce the dialogue read vector. Expressed as formulas:

$$\rho^U_{t,i} = \frac{1}{Z_U}\, p_u(u_i)\,\mathbb{1}[u_i = y_{t-1}], \qquad r^U_t = \sum_i \rho^U_{t,i}\, M_U[i]$$

where $Z_U$ is the normalization factor, $p_u(u_i)$ is the probability of copying $u_i$ from the dialogue, $\rho^U_{t,i}$ is the weight corresponding to the i-th dialogue character at time t, $M_U[i]$ is the vector corresponding to the i-th character in the dialogue coding matrix, and $r^U_t$ is the dialogue read vector at time t.
The knowledge read vector is generated in the same way: the generation probabilities of the positions in the spliced knowledge whose character equals the character generated at the previous moment are normalized to obtain weights, and a weighted sum is taken with the corresponding vectors in the knowledge coding matrix:

$$\rho^K_{t,i} = \frac{1}{Z_K}\, p_k(k_i)\,\mathbb{1}[k_i = y_{t-1}], \qquad r^K_t = \sum_i \rho^K_{t,i}\, M_K[i]$$

where $Z_K$ is the normalization factor, $p_k(k_i)$ is the probability of copying $k_i$ from the knowledge, $\rho^K_{t,i}$ is the weight corresponding to the i-th knowledge character at time t, $M_K[i]$ is the vector corresponding to the i-th character in the knowledge coding matrix, and $r^K_t$ is the knowledge read vector at time t.
After the dialogue selective-read vector and the knowledge selective-read vector are obtained, the state vector of the previous moment, the decoding vector generated at the previous moment in the decoding matrix, the dialogue selective-read vector and the knowledge selective-read vector are spliced and input into a feed-forward neural network to obtain the new state vector:

$$s_t = \mathrm{MLP}\big([\, s_{t-1};\; o_{t-1};\; r^U_t;\; r^K_t \,]\big)$$

where $s_{t-1}$ is the state vector of the previous moment, $o_{t-1}$ is the decoding vector generated at the previous moment, and $[\cdot\,;\cdot]$ denotes splicing.
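A sketch of the selective read and the state update described above; `ffn` and the tensor layouts are illustrative assumptions.

```python
import torch

def selective_read(enc, copy_probs, prev_char, input_ids):
    """Weighted sum over the encoding matrix: positions whose input character
    equals the previously generated character receive their normalized copy
    probability as weight; all other positions receive weight zero."""
    same = (input_ids == prev_char).float()
    w = copy_probs * same
    w = w / w.sum().clamp(min=1e-12)           # the normalization factor
    return w @ enc                             # read vector r_t

def update_state(s_prev, o_prev, r_dialogue, r_knowledge, ffn):
    """New state s_t: splice the previous state, the previous decoding vector
    and the two read vectors, then apply a feed-forward network."""
    return ffn(torch.cat([s_prev, o_prev, r_dialogue, r_knowledge], dim=-1))
```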
The new state vector is used to generate the reply character. In the generation mode, the new state vector is mapped into the character space through a linear layer, giving a score for each character:

$$\psi_g(y_t = v_i) = e_i^\top\, W_g\, s_t$$

where $\psi_g$ is the scoring function of the generation mode, $y_t$ is the character generated at the current moment, $v_i$ is the i-th character in the lexicon, $e_i$ is the one-hot vector whose i-th element is 1 and whose remaining elements are 0, $W_g$ is a linear layer, and $s_t$ is the new state vector.
In the copy mode, copies are made from the dialogue and from the knowledge, respectively, using the new state vector. When copying from the dialogue, the vector corresponding to each input character in the dialogue coding matrix passes through the linear-layer mapping and the activation function and then takes an inner product with the sum of the new state vector and the vector sampled from the normal distribution, giving the score for generating that input character. Specifically:

$$\psi_u(y_t = u_i) = \sigma\big(W_u\, M_U[i]\big)^\top (s_t + b)$$

where $\psi_u$ is the scoring function when copying from the dialogue, $y_t$ is the character generated at the current moment, $u_i$ is the i-th character in the spliced dialogue, $M_U[i]$ is the encoding vector corresponding to the i-th character in the spliced dialogue, $W_u$ is a linear layer, $\sigma$ is the activation function, $s_t$ is the new state vector, and b is the sample vector.
When copying from the knowledge, the vector corresponding to each input character in the knowledge coding matrix passes through the linear-layer mapping and the activation function and then takes an inner product with the new state vector, giving the score for generating that input character. Specifically:

$$\psi_k(y_t = k_i) = \sigma\big(W_k\, M_K[i]\big)^\top s_t$$

where $\psi_k$ is the scoring function when copying from the knowledge, $y_t$ is the character generated at the current moment, $k_i$ is the i-th character in the knowledge, $M_K[i]$ is the encoding vector corresponding to the i-th character in the knowledge, $W_k$ is a linear layer, $\sigma$ is the activation function, and $s_t$ is the new state vector.
Combining the model's lexicon with the character set of the input dialogue and the character set of the knowledge gives the normalization factor:

$$Z = \sum_{v \in \mathcal{V} \cup \{\mathrm{UNK}\}} e^{\psi_g(v)} + \sum_i e^{\psi_u(u_i)} + \sum_i e^{\psi_k(k_i)}$$

where Z is the normalization factor, v is a character, $\mathcal{V}$ is the lexicon of the model, UNK is the unknown character, $\psi_g$ is the scoring function of the generation mode, $\psi_u$ is the scoring function when copying from the dialogue, and $\psi_k$ is the scoring function when copying from the knowledge.
Dividing the exponentiated scores of each mode by the normalization factor gives the probability of generating a character. Specifically, the probability of generating character $y_t$ in the generation mode is:

$$p_g(y_t) = \frac{1}{Z}\, e^{\psi_g(y_t)}$$

The probability of copying character $y_t$ from the dialogue is:

$$p_u(y_t) = \frac{1}{Z} \sum_{i:\, u_i = y_t} e^{\psi_u(u_i)}$$

The probability of copying character $y_t$ from the knowledge is:

$$p_k(y_t) = \frac{1}{Z} \sum_{i:\, k_i = y_t} e^{\psi_k(k_i)}$$
combining the generation mode and the copy mode, the probability of generating characters by the model is the sum of the probabilities of generating characters in each mode, and is specifically represented as:
Figure 912794DEST_PATH_IMAGE085
wherein the content of the first and second substances,
Figure 845720DEST_PATH_IMAGE086
for generating patterns
Figure 481101DEST_PATH_IMAGE061
The probability of (a) of (b) being,
Figure 902855DEST_PATH_IMAGE087
to copy from input dialog
Figure 496647DEST_PATH_IMAGE061
The probability of (a) of (b) being,
Figure 371062DEST_PATH_IMAGE088
to copy from knowledge
Figure 75713DEST_PATH_IMAGE061
The probability of (a) of (b) being,
Figure 617553DEST_PATH_IMAGE061
for the character generated at the time t,
Figure 116667DEST_PATH_IMAGE089
is the state vector at the time t,
Figure 743958DEST_PATH_IMAGE090
for the character generated at time t-1,
Figure 721141DEST_PATH_IMAGE091
to variate the dialog coding matrix output from the encoder module of the encoder,
Figure 117487DEST_PATH_IMAGE092
the method is characterized in that a knowledge coding matrix output by an Encoder end of a decoder Transformer of a variational self-Encoder is used, b is a sampling sample, g represents a generation mode, and c represents a copy mode.
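The combined generate/copy distribution can be sketched as follows, with one shared softmax playing the role of the normalization factor Z; the weight modules and the tanh activation are assumptions consistent with the description above.

```python
import torch

def mode_probabilities(s_t, b, M_U, M_K, W_gen, W_u, W_k):
    """Scores from the generation mode, the dialogue-copy mode (state plus
    sample vector b) and the knowledge-copy mode share one normalization Z."""
    score_g = W_gen(s_t)                              # one score per lexicon char
    score_u = torch.tanh(W_u(M_U)) @ (s_t + b)        # one score per dialogue char
    score_k = torch.tanh(W_k(M_K)) @ s_t              # one score per knowledge char
    probs = torch.softmax(torch.cat([score_g, score_u, score_k]), dim=-1)
    n_g, n_u = score_g.numel(), score_u.numel()
    return probs[:n_g], probs[n_g:n_g + n_u], probs[n_g + n_u:]

def char_probability(v, p_g, p_u, p_k, dialogue_ids, knowledge_ids):
    """Total probability of character v: the sum over all three modes."""
    return p_g[v] + p_u[dialogue_ids == v].sum() + p_k[knowledge_ids == v].sum()
```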
After the model outputs the probability of each character, methods such as greedy search or beam search can be used to select characters; each method has its own advantages. This embodiment adopts simple greedy search to select the generated character, i.e., the character with the maximum probability is chosen as the generated character.
The model autoregressively generates the reply until [CLS], [SEP], a start symbol or an end symbol is produced. If none of these has been generated, generation stops when the length of the generated reply reaches a set threshold; the threshold should be less than 512.
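A greedy decoding loop with these stopping criteria might look as follows; `step` stands in for one forward pass of the model and is an assumption.

```python
def greedy_decode(step, start_id, stop_ids, max_len=512):
    """Autoregressive greedy search: at each step take the character with the
    maximum probability; stop on [CLS]/[SEP]/start/end symbols or at max_len."""
    out = [start_id]
    while len(out) < max_len:
        probs = step(out)                  # probability vector over characters
        next_id = int(probs.argmax())
        if next_id in stop_ids:
            break
        out.append(next_id)
    return out[1:]                         # drop the start symbol
```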
After the autoregressive generation of the reply is completed, the decoding matrix produced by the decoder of the variational self-encoder is average-pooled and then input into a feed-forward neural network to predict the emotion label of the reply:

$$em' = \mathrm{MLP}\big(\mathrm{AvgPool}(O)\big)$$

where em' is the predicted emotion label, O is the decoder output of the 6th Transformer layer, i.e., the decoding matrix, and AvgPool is the average pooling operation.
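The emotion prediction head reduces to average pooling followed by a classifier; a sketch, with `mlp` as an assumed module mapping the model dimension to the number of emotions:

```python
def predict_emotion(O, mlp):
    """Average-pool the decoding matrix O (shape (T, d_model)) over time and
    classify: em' = MLP(AvgPool(O))."""
    pooled = O.mean(dim=0)                 # AvgPool over the time dimension
    return int(mlp(pooled).argmax())       # index of the predicted emotion label
```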
After the output generated by the model is returned to the user, the user replies with new content. The model's output and the user's new reply are spliced onto the previous dialogue record, and the emotion label is set to a new label to obtain a new spliced dialogue. New knowledge is selected, and the new spliced dialogue and the knowledge are input into the model to obtain the model's reply. Looping in this way sustains a continuous conversation with the user.
An example of the model operation is shown in FIG. 6. The encoder of the diversity auto-encoder in the example employs the Bert model. Input the emotional tag "question", input the pairIn some regions, the rescuer is part of the emergency services system for accidents, and in some communities, the rescuer may be the primary EMS provider. ". Model output reply "in some places, it is not also helpful for the rescuer to deal with other emergencies, such as mountain rescue
Figure 417385DEST_PATH_IMAGE005
". The emotional label predicted by the model is "question". The model generates a reply that is related to the input knowledge and carries the sentiment of the question, consistent with the requirements of the input sentiment tag. The emotion label predicted by the model is also consistent with the input emotion label.
The emotion label is then changed to "happy", and the input concatenated dialogue becomes "[CLS] happy [SEP] I make sure that people do not drown or get hurt in or near the water [SEP]". The input related knowledge remains unchanged. The model outputs the reply "Great! A profession that can handle rescue and relief work is truly glorious.", and the emotion label predicted by the model is "happy".

An example of a multi-turn dialogue: the input emotion label is "question" and the input dialogue is "Hello", concatenated as "[CLS] question [SEP] Hello [SEP]". The input related knowledge is "A psychological counselor refers to a professional who, applying psychology and related knowledge, following psychological principles, and using the techniques and methods of psychological counseling, helps a help-seeker relieve psychological problems. The seven principles followed by psychological counselors are the confidentiality principle, the comprehension principle, the time-limit principle, the principle of helping people help themselves, the principle of 'not rejecting those who come, not pursuing those who leave', the principle of objective neutrality and unconditional positive regard, and the principle of delaying important decisions.". The model generates the reply "Please have a seat. Have you had any experience with psychological counseling before?", and the predicted emotion label is "question". The user answers "No. I want to ask: will our conversation be overheard by other people?", and a neutral reply is desired from the model. The emotion label is set to "neutral" and the user's reply is spliced onto the dialogue, giving the concatenated input "[CLS] neutral [SEP] Hello [SEP] Please have a seat. Have you had any experience with psychological counseling before? [SEP] No. I want to ask: will our conversation be overheard by other people? [SEP]". The input knowledge remains unchanged. The model generates the reply "Please trust me: confidentiality is one of the principles of psychological counselors, and the counseling room is well soundproofed, so the conversation cannot be overheard.", and the model predicts the emotion label "neutral".
Corresponding to the foregoing embodiments of the end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder, the present invention also provides embodiments of an end-to-end dialogue apparatus integrating knowledge and emotion based on the variational self-encoder.
Referring to FIG. 7, an end-to-end dialogue apparatus integrating knowledge and emotion based on a variational self-encoder according to an embodiment of the present invention includes one or more processors configured to implement the end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder of the foregoing embodiments.
The embodiments of the end-to-end dialogue apparatus integrating knowledge and emotion based on the variational self-encoder of the present invention can be applied to any device with data processing capability, such as a computer or a similar device or apparatus. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, as a logical apparatus, it is formed by the processor of the device with data processing capability reading the corresponding computer program instructions from nonvolatile memory into memory and running them. In terms of hardware, FIG. 7 shows a hardware structure diagram of a device with data processing capability on which the end-to-end dialogue apparatus integrating knowledge and emotion based on the variational self-encoder of the present invention is located. In addition to the processor, memory, network interface, and nonvolatile memory shown in FIG. 7, the device on which the apparatus of the embodiment is located may also include other hardware according to the actual function of that device, which is not described again here.
The implementation of the functions and effects of each unit in the above apparatus is described in detail in the implementation of the corresponding steps of the above method and is not repeated here.
Since the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details. The apparatus embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of the present invention. A person of ordinary skill in the art can understand and implement the solution without inventive effort.
Embodiments of the present invention also provide a computer-readable storage medium on which a program is stored; when executed by a processor, the program implements the end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder of the above embodiments.
The computer-readable storage medium may be an internal storage unit of any device with data processing capability described in the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of such a device, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a flash card (Flash Card) provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used to store the computer program and other programs and data required by the device, and may also be used to temporarily store data that has been output or is to be output.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A variational self-encoder based knowledge and emotion infused end-to-end dialog method, said method comprising the steps of:
(1) acquiring emotion labels, conversations, knowledge and replies, and preprocessing to obtain training data;
(2) building a model consisting of a variational self-encoder module and a copy module; the variational self-encoder module comprises an encoder and a decoder; the encoder is used for encoding the emotion label and the semantic information of the dialogue to obtain a dialogue encoding matrix; the decoder comprises an encoding end and a decoding end, and is used for generating a knowledge encoding matrix from the knowledge and for autoregressively generating a decoding vector and a predicted emotion label conditioned on the knowledge encoding matrix; the copy module updates a state vector by combining the dialogue encoding matrix and the knowledge encoding matrix generated by the variational self-encoder module with the current decoding vector, and uses the updated state vector, together with the dialogue encoding matrix and the knowledge encoding matrix, to predict and generate the output reply;
(3) inputting the training data preprocessed in the step (1) into the model constructed in the step (2) for training the model and storing the model;
(4) acquiring an emotion label and a dialogue, selecting knowledge, and preprocessing the emotion label and the dialogue, including splicing, to obtain prediction data;
(5) inputting the prediction data preprocessed in the step (4) into the model trained in the step (3) for model prediction to obtain a reply.
2. The end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder according to claim 1, wherein the preprocessing in step (1) and step (4) comprises converting the emotion label into a one-hot category label and splicing the emotion label with the dialogue; the process of splicing the emotion label and the dialogue is specifically: starting with the separator [CLS], then concatenating the emotion label and a separator [SEP], and then concatenating the historical dialogue turns separated by [SEP], with the total length not exceeding 512.
3. The variational self-encoder based knowledge-and-emotion infused end-to-end dialog method of claim 1, wherein the model training loss function formula is as follows:
$$\mathrm{Loss} = -\log P\big(em' = em \mid U, K\big) - \sum_{t} \log P\big(y'_t = y_t \mid y'_{<t}, U, K, em\big)$$

wherein Loss is the loss value, em' is the predicted emotion label, Y' is the predicted reply, $y'_t$ is the character predicted at time t, $y_t$ is the ground-truth character at time t, $y'_{<t}$ denotes the characters predicted before time t, U is the dialogue, K is the knowledge, and em is the emotion label.
4. The end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder according to claim 1, wherein the dialogue encoding matrix is input into a feedforward neural network to generate the mean and variance of a normal distribution; the knowledge is input into the encoding end of the decoder of the variational self-encoder to obtain the knowledge encoding matrix; the normal distribution is sampled to obtain a sampling vector; when the model predicts and generates a reply, the sampling vector is added to the word embedding vector corresponding to the start character of the dialogue; and the decoding end of the decoder of the variational self-encoder module outputs a decoding matrix, which is used for predicting the emotion label of the generated reply.
5. The end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder according to claim 1, wherein the copy module performs a weighted summation over the dialogue encoding matrix to obtain a dialogue reading vector, and a weighted summation over the knowledge encoding matrix to obtain a knowledge reading vector; the dialogue reading vector, the knowledge reading vector, and the state vector are spliced with the output vector currently generated by the decoder, and a new state vector is obtained after passing through a feedforward neural network.
6. The end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder according to claim 5, wherein the copy module has a generation mode and a copy mode; in the generation mode, the updated state vector generates scores for all characters through a linear layer; in the copy mode, the vector corresponding to each input character in the knowledge encoding matrix passes through a linear-layer mapping and an activation function and then takes an inner product with the updated state vector to obtain the score of generating that input character, and the vector corresponding to each input character in the dialogue encoding matrix passes through a linear-layer mapping and an activation function and then takes an inner product with the sum of the updated state vector and the vector sampled from the normal distribution to obtain the score of generating that input character; and the generation mode and the copy mode are combined by adding the scores of each character across the two modes and normalizing, obtaining the probability of the model generating each character.
7. The end-to-end dialogue method integrating knowledge and emotion based on the variational self-encoder according to claim 5, wherein step (5) is specifically: selecting characters by greedy search or beam search based on the probabilities of the characters generated by the model, thereby generating the reply; the generation of the reply is completed when [CLS], [SEP], a start symbol, or an end symbol is generated; the decoding matrix output by the decoding end of the decoder of the variational self-encoder module is average-pooled and then input into a feedforward neural network to obtain the predicted emotion label; after the reply generated by the model is sent to the user, the user replies with new content; the reply generated by the model and the new reply of the user are spliced into the dialogue; a new emotion label is selected and spliced to the front of the dialogue; and knowledge is selected and input into the model, so that the end-to-end dialogue continues.
8. A neural network for end-to-end dialogue integrating knowledge and emotion, comprising:
a variational self-encoder module comprising an encoder and a decoder; the encoder is used for encoding the emotion label and the semantic information of the dialogue to generate a dialogue encoding matrix and the parameters of a normal distribution; the decoder is used for generating a knowledge encoding matrix from the knowledge and for autoregressively generating a decoding vector and a predicted emotion label conditioned on the knowledge encoding matrix; the encoder in the variational self-encoder module consists of a plurality of encoding layers, is implemented with the Transformer model structure, and corresponds to the encoder side of a Transformer; each encoding layer comprises, connected in sequence, a multi-head attention layer, a residual connection layer, a normalization layer, a linear layer, a residual connection layer, and a normalization layer; the decoder in the variational self-encoder module consists of a plurality of decoding layers, is implemented with the Transformer model structure, and corresponds to the encoding side and decoding side of a Transformer model; each decoding layer comprises, connected in sequence, a multi-head masked attention layer, a residual connection layer, a normalization layer, a cross-attention layer, a residual connection layer, a normalization layer, a linear layer, a residual connection layer, and a normalization layer;
the copying module updates the state vector by combining the dialog coding matrix and the knowledge coding matrix generated by the variational self-encoder module with the current decoding vector; and generating an output reply by using the updated state vector and combining with the dialogue coding matrix and the knowledge coding matrix for prediction.
9. A variational self-encoder based knowledge and emotion infused end-to-end dialog device, comprising a memory and a processor, wherein the memory is coupled to the processor; wherein the memory is configured to store program data and the processor is configured to execute the program data to implement the variational self-encoder based knowledge-and-emotion-infused end-to-end dialog method of any of the preceding claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, is adapted to carry out a method of end-to-end dialogue based on knowledge and emotion infused by a variational self-encoder as claimed in any one of claims 1 to 7.
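To make the training loss of claim 3 concrete, here is a minimal sketch assuming PyTorch and the reconstruction given above (token-level cross-entropy on the reply plus cross-entropy on the predicted emotion label); whether the patent's original formula also includes a KL term for the variational latent is not recoverable from the text, so none is shown:

```python
import torch.nn.functional as F

def training_loss(token_logits, target_tokens, emotion_logits, target_emotion):
    """Loss = reply cross-entropy + emotion cross-entropy (sketch of claim 3).

    token_logits:   (batch, T, vocab) autoregressive predictions
    target_tokens:  (batch, T) ground-truth reply characters
    emotion_logits: (batch, num_emotions) predicted emotion distribution
    target_emotion: (batch,) ground-truth emotion label ids
    """
    reply_loss = F.cross_entropy(
        token_logits.transpose(1, 2), target_tokens)   # -sum_t log P(y_t | ...)
    emotion_loss = F.cross_entropy(emotion_logits, target_emotion)
    return reply_loss + emotion_loss
```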
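Similarly, a minimal sketch of the latent sampling described in claim 4, assuming PyTorch and the standard reparameterization trick; the layer names, the hidden size, and the pooling applied before the feedforward network are illustrative assumptions:

```python
import torch
import torch.nn as nn

class LatentSampler(nn.Module):
    """Maps the dialogue encoding matrix to a normal distribution and samples
    the vector that is added to the reply start-character embedding (claim 4)."""
    def __init__(self, hidden: int = 768):
        super().__init__()
        self.mu_ffn = nn.Linear(hidden, hidden)      # mean of the normal distribution
        self.logvar_ffn = nn.Linear(hidden, hidden)  # log-variance, for stability

    def forward(self, dialogue_matrix: torch.Tensor) -> torch.Tensor:
        pooled = dialogue_matrix.mean(dim=1)         # assumed pooling before the FFN
        mu, logvar = self.mu_ffn(pooled), self.logvar_ffn(pooled)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterized sample
        return z  # at prediction time, added to the start character's word embedding
```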
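Finally, a minimal sketch of the generate/copy combination in claims 5 and 6, again assuming PyTorch; the tensor shapes, the attention used for the weighted summations, and the way copy scores are scattered onto vocabulary positions before the final normalization are all illustrative assumptions rather than the patent's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CopyModule(nn.Module):
    """Sketch of the copy module: a generate mode over the vocabulary plus
    copy modes over the dialogue and knowledge tokens (claims 5-6)."""
    def __init__(self, hidden: int = 768, vocab: int = 21128):
        super().__init__()
        self.state_ffn = nn.Linear(4 * hidden, hidden)  # fuse reads + state + decode vec
        self.gen_proj = nn.Linear(hidden, vocab)        # generate-mode scores
        self.copy_dlg = nn.Linear(hidden, hidden)       # copy mode, dialogue side
        self.copy_knw = nn.Linear(hidden, hidden)       # copy mode, knowledge side

    def forward(self, state, dec_vec, dlg_mat, knw_mat, z, dlg_ids, knw_ids):
        # Weighted summations over the two encoding matrices (claim 5).
        dlg_read = (F.softmax(dlg_mat @ state.unsqueeze(-1), dim=1) * dlg_mat).sum(dim=1)
        knw_read = (F.softmax(knw_mat @ state.unsqueeze(-1), dim=1) * knw_mat).sum(dim=1)
        # Splice the reads, the state vector, and the current decoding vector,
        # then update the state through a feedforward layer.
        state = torch.tanh(self.state_ffn(
            torch.cat([dlg_read, knw_read, state, dec_vec], dim=-1)))
        # Generate mode: scores for all characters from a linear layer.
        scores = self.gen_proj(state)
        # Copy mode: mapped, activated token vectors take inner products with the
        # state; the dialogue side also adds the sampled latent vector z (claim 6).
        knw_scores = (torch.tanh(self.copy_knw(knw_mat)) @ state.unsqueeze(-1)).squeeze(-1)
        dlg_scores = (torch.tanh(self.copy_dlg(dlg_mat)) @ (state + z).unsqueeze(-1)).squeeze(-1)
        # Add copy scores onto the vocabulary positions of the input tokens,
        # then normalize the combined scores into generation probabilities.
        scores = scores.scatter_add(1, knw_ids, knw_scores)
        scores = scores.scatter_add(1, dlg_ids, dlg_scores)
        return F.softmax(scores, dim=-1), state
```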
CN202210508804.7A 2022-05-11 2022-05-11 End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder Active CN114610861B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210508804.7A CN114610861B (en) 2022-05-11 2022-05-11 End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder

Publications (2)

Publication Number Publication Date
CN114610861A true CN114610861A (en) 2022-06-10
CN114610861B CN114610861B (en) 2022-08-26

Family

ID=81870487

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210508804.7A Active CN114610861B (en) 2022-05-11 2022-05-11 End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder

Country Status (1)

Country Link
CN (1) CN114610861B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180374474A1 (en) * 2017-06-22 2018-12-27 Baidu Online Network Technology (Beijing) Co., Ltd. Method and Apparatus for Broadcasting a Response Based on Artificial Intelligence, and Storage Medium
CN107798140A (en) * 2017-11-23 2018-03-13 北京神州泰岳软件股份有限公司 A kind of conversational system construction method, semantic controlled answer method and device
CN108595436A (en) * 2018-04-28 2018-09-28 合肥工业大学 The generation method and system of emotion conversation content, storage medium
CN108805089A (en) * 2018-06-14 2018-11-13 南京云思创智信息科技有限公司 Based on multi-modal Emotion identification method
CN110032636A (en) * 2019-04-30 2019-07-19 合肥工业大学 Emotion based on intensified learning talks with the method that asynchronous generation model generates text
US20210089588A1 (en) * 2019-09-24 2021-03-25 Salesforce.Com, Inc. System and Method for Automatic Task-Oriented Dialog System
CN111241250A (en) * 2020-01-22 2020-06-05 中国人民大学 Emotional dialogue generation system and method
CN112084314A (en) * 2020-08-20 2020-12-15 电子科技大学 Knowledge-introducing generating type session system
CN112289239A (en) * 2020-12-28 2021-01-29 之江实验室 Dynamically adjustable explaining method and device and electronic equipment
CN113239157A (en) * 2021-03-31 2021-08-10 北京百度网讯科技有限公司 Method, device, equipment and storage medium for training conversation model
CN113379606A (en) * 2021-08-16 2021-09-10 之江实验室 Face super-resolution method based on pre-training generation model
CN114168707A (en) * 2021-10-28 2022-03-11 上海大学 Recommendation-oriented emotion type conversation method
CN114168721A (en) * 2021-11-18 2022-03-11 华东师范大学 Method for constructing knowledge enhancement model for multi-sub-target dialogue recommendation system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RAMAN GOEL,等: "Emotion-Aware Transformer Encoder for Empathetic Dialogue Generation", 《2021 9TH INTERNATIONAL CONFERENCE ON AFFECTIVE COMPUTING AND INTELLIGENT INTERACTION WORKSHOPS AND DEMOS (ACIIW)》 *
WANG QINGLIN, et al.: "Research on a Text Sentiment Enhancement Method Based on Global Semantic Learning", SCIENCE TECHNOLOGY AND ENGINEERING *

Also Published As

Publication number Publication date
CN114610861B (en) 2022-08-26

Similar Documents

Publication Publication Date Title
Bibauw et al. Discussing with a computer to practice a foreign language: Research synthesis and conceptual framework of dialogue-based CALL
Clarke Language and action: A structural model of behaviour
CN107944027B (en) Method and system for creating semantic key index
Kim et al. Design principles and architecture of a second language learning chatbot
Merdivan et al. Dialogue systems for intelligent human computer interactions
CN108595436B (en) Method and system for generating emotional dialogue content and storage medium
Kelly et al. Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors
CN112765333B (en) Automatic dialogue generation method and system based on emotion and prompt word combination
CN115563290B (en) Intelligent emotion recognition method based on context modeling
CN111930914A (en) Question generation method and device, electronic equipment and computer-readable storage medium
CN113918813A (en) Method and device for recommending posts based on external knowledge in chat record form
CN113761156A (en) Data processing method, device and medium for man-machine interaction conversation and electronic equipment
Tseng et al. Approaching Human Performance in Behavior Estimation in Couples Therapy Using Deep Sentence Embeddings.
Wang et al. Transformer-based empathetic response generation using dialogue situation and advanced-level definition of empathy
Wang et al. Information-enhanced hierarchical self-attention network for multiturn dialog generation
CN117271745A (en) Information processing method and device, computing equipment and storage medium
Xu et al. CLUF: A neural model for second language acquisition modeling
Tu Learn to speak like a native: AI-powered chatbot simulating natural conversation for language tutoring
CN114610861B (en) End-to-end dialogue method integrating knowledge and emotion based on variational self-encoder
CN116561265A (en) Personalized dialogue generation method, model training method and device
KR102395702B1 (en) Method for providing english education service using step-by-step expanding sentence structure unit
Jiang et al. An affective chatbot with controlled specific emotion expression
Zhong et al. Question generation based on chat‐response conversion
CN115934909B (en) Co-emotion reply generation method and device, terminal and storage medium
Mallios Virtual doctor: an intelligent human-computer dialogue system for quick response to people in need

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant