CN111897933B - Emotion dialogue generation method and device and emotion dialogue model training method and device - Google Patents

Emotion dialogue generation method and device and emotion dialogue model training method and device

Info

Publication number
CN111897933B
CN111897933B
Authority
CN
China
Prior art keywords
emotion
information
dialogue
sample
reply
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010733061.4A
Other languages
Chinese (zh)
Other versions
CN111897933A (en)
Inventor
梁云龙
孟凡东
周杰
徐金安
陈钰枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202010733061.4A priority Critical patent/CN111897933B/en
Publication of CN111897933A publication Critical patent/CN111897933A/en
Application granted granted Critical
Publication of CN111897933B publication Critical patent/CN111897933B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/332 Query formulation
    • G06F 16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174 Facial expression recognition

Abstract

The disclosure provides an emotion dialogue generation method and device and an emotion dialogue model training method and device, and relates to the field of artificial intelligence. The method comprises the following steps: acquiring multi-source knowledge related to conversation participants; encoding the multi-source knowledge to obtain multi-source knowledge feature information, and determining reply emotion prediction information corresponding to the current conversation participant according to the multi-source knowledge feature information; encoding the feature information corresponding to the historical reply information together with the multi-source knowledge feature information to obtain the feature information to be replied; and determining a target reply sentence according to the emotion features corresponding to the reply emotion prediction information and the feature information to be replied. With the method and device, the emotion of the reply of the current conversation participant can be accurately predicted from the multi-source knowledge and introduced into the generated reply, which improves the accuracy of the reply and thereby the user experience.

Description

Emotion dialogue generation method and device and emotion dialogue model training method and device
Technical Field
The present disclosure relates to the field of artificial intelligence, and more particularly, to an emotion conversation generating method, an emotion conversation generating device, an emotion conversation model training method, an emotion conversation model training device, a computer-readable storage medium, and an electronic device.
Background
With the rapid development of artificial intelligence, human-machine conversation is receiving widespread attention from academia and industry. Human-machine conversation is a mode of operating a computer in which a computer operator or user interacts with the machine in a conversational manner, and it includes both spoken and written dialogue.
With the continual update of intelligent voice devices and intelligent voice software, voice dialogue has become a major research direction. To improve the quality of voice conversation, it is of great significance for a machine to conduct voice conversations that carry emotion. However, current emotion dialogue systems are built for limited scenarios: for example, the machine expresses a given emotion, or it only senses emotion from the user's speech text and generates corresponding reply content. Because emotion cannot always be accurately perceived from text content alone, the machine may fail to make replies closely related to the scene, resulting in a poor user experience.
It should be noted that the information disclosed in the above background section is only for enhancing understanding of the background of the present disclosure and thus may include information that does not constitute prior art known to those of ordinary skill in the art.
Disclosure of Invention
The embodiment of the disclosure provides an emotion dialogue generation method, an emotion dialogue generation device, an emotion dialogue model training method, an emotion dialogue model training device, a computer-readable storage medium and electronic equipment, so that the accuracy of emotion prediction and reply prediction can be improved at least to a certain extent, and the user experience is further improved.
Other features and advantages of the present disclosure will be apparent from the following detailed description, or may be learned in part by the practice of the disclosure.
According to an aspect of an embodiment of the present disclosure, there is provided an emotion dialogue generation method including: acquiring multi-source knowledge related to a conversation participant; encoding the multi-source knowledge to obtain multi-source knowledge characteristic information, and determining reply emotion prediction information corresponding to the current conversation participant according to the multi-source knowledge characteristic information; coding the characteristic information corresponding to the historical reply information and the multi-source knowledge characteristic information to acquire the characteristic information to be replied; and determining a target reply sentence according to the emotion characteristics corresponding to the reply emotion prediction information and the characteristic information to be replied.
According to an aspect of an embodiment of the present disclosure, there is provided an emotion dialogue generation device including: the multi-source knowledge acquisition module is used for acquiring multi-source knowledge related to the dialogue participants; the reply emotion prediction module is used for encoding the multi-source knowledge to obtain multi-source knowledge characteristic information and determining reply emotion prediction information corresponding to the current dialogue participant according to the multi-source knowledge characteristic information; the to-be-replied characteristic determining module is used for encoding characteristic information corresponding to the historical reply information and the multi-source knowledge characteristic information so as to obtain the to-be-replied characteristic information; and the reply sentence generation module is used for determining a target reply sentence according to the emotion characteristics corresponding to the reply emotion prediction information and the characteristic information to be replied.
In some embodiments of the disclosure, based on the foregoing, the reply emotion prediction module includes: the feature extraction unit is used for carrying out feature extraction on various knowledge in the multi-source knowledge so as to acquire a plurality of sub-feature information; the characteristic splicing unit is used for splicing the sub-characteristic information to obtain first spliced characteristic information; and the full-connection unit is used for carrying out full-connection processing on the first spliced characteristic information so as to acquire the multi-source knowledge characteristic information.
In some embodiments of the disclosure, the multi-source knowledge includes: historical conversations related to the conversation participants, and emotion information, expression images and personality information of the conversation participants corresponding to each sentence in the historical conversations; based on the foregoing, the feature extraction unit includes: the first sub-feature acquisition unit is used for sequentially encoding the context relation in the history dialogue through a long short-term memory network and a self-attention network so as to acquire first sub-feature information; the second sub-feature acquisition unit is used for carrying out feature extraction on the expression image through an image feature extraction network to acquire facial expression features, and carrying out full-connection processing on the facial expression features through a feedforward neural network to acquire second sub-feature information; a third sub-feature obtaining unit, configured to query in an emotion lookup table according to the emotion information, so as to obtain third sub-feature information; and the fourth sub-feature acquisition unit is used for inquiring in the personality lookup table according to the personality information so as to acquire fourth sub-feature information.
In some embodiments of the present disclosure, based on the foregoing solution, the first sub-feature obtaining unit includes: the hidden layer information extraction unit is used for extracting the characteristics of each statement in the history dialogue through the long-short-term memory network so as to obtain hidden layer characteristic information corresponding to each statement; the position coding unit is used for carrying out position coding on the hidden layer characteristic information according to the appearance sequence of each statement so as to obtain dialogue characteristic information; and the sub-feature information acquisition unit is used for encoding the dialogue feature information through the self-attention network so as to acquire the first sub-feature information.
In some embodiments of the present disclosure, the hidden layer information extraction unit is configured to: and taking the hidden layer vector corresponding to the last word in each statement as the hidden layer characteristic information.
In some embodiments of the disclosure, based on the foregoing, the reply emotion prediction module is further configured to: performing dimension reduction processing on the multi-source knowledge feature information to obtain a feature vector with a first dimension of 1; and carrying out full connection processing and normalization processing on the feature vector to obtain the reply emotion prediction information.
In some embodiments of the disclosure, based on the foregoing solution, the feature to be replied determining module includes: the history reply feature acquisition unit is used for carrying out vector conversion on the history reply information to acquire a history reply vector, and encoding the history reply vector based on a self-attention mechanism to acquire history reply feature information; the to-be-replied feature acquisition unit is used for encoding the historical reply feature information and the multi-source knowledge feature information based on a self-attention mechanism, and performing full-connection processing on the encoded feature information to acquire the to-be-replied feature information.
In some embodiments of the disclosure, based on the foregoing scheme, the reply sentence generating module is configured to: splicing the emotion characteristics and the characteristic information to be replied to obtain second spliced characteristic information; and carrying out normalization processing on the second splicing characteristic information to obtain current reply information, and determining the target reply statement according to the current reply information and the historical reply information.
In some embodiments of the disclosure, based on the foregoing solution, the reply sentence generating module may be further configured to obtain the second spliced characteristic information by: splicing the emotion characteristics, the personality characteristics corresponding to the current dialogue participant and the characteristic information to be replied to obtain the second spliced characteristic information.
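As a concrete illustration of how the reply sentence generation module described above could combine these signals, the following is a minimal PyTorch-style sketch. The class name, tensor shapes and the single linear output layer are illustrative assumptions and are not specified by the disclosure.

```python
import torch
import torch.nn as nn

class ReplyGenerator(nn.Module):
    """Sketch of the reply sentence generation step described above (names assumed)."""

    def __init__(self, feature_dim, emotion_dim, persona_dim, vocab_size):
        super().__init__()
        self.out_proj = nn.Linear(feature_dim + emotion_dim + persona_dim, vocab_size)

    def forward(self, to_reply_features, emotion_features, persona_features):
        # to_reply_features: (batch, seq_len, feature_dim) feature information to be replied
        # emotion_features:  (batch, emotion_dim) features of the predicted reply emotion
        # persona_features:  (batch, persona_dim) personality features of the current participant
        seq_len = to_reply_features.size(1)
        emo = emotion_features.unsqueeze(1).expand(-1, seq_len, -1)
        per = persona_features.unsqueeze(1).expand(-1, seq_len, -1)
        # Second spliced feature information: concatenate along the feature dimension.
        spliced = torch.cat([to_reply_features, emo, per], dim=-1)
        # Normalization (softmax) to obtain the probability of the next reply tokens.
        return torch.softmax(self.out_proj(spliced), dim=-1)
```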
According to an aspect of an embodiment of the present disclosure, there is provided a training method of emotion dialogue model, including: acquiring a multi-source knowledge sample related to a conversation participant, and inputting the multi-source knowledge sample into an emotion conversation model to be trained, wherein the multi-source knowledge sample at least comprises a history conversation sample and a reply sample, and emotion samples corresponding to sentences in the history conversation sample and the reply sample; encoding the historical dialogue sample and emotion samples corresponding to each sentence in the historical dialogue sample to obtain sample characteristic information, and determining predicted emotion information corresponding to a current dialogue participant according to the sample characteristic information; coding the characteristic information corresponding to the reply sample and the sample characteristic information to obtain predicted reply characteristic information, and determining a predicted reply sentence according to the emotion characteristic corresponding to the predicted emotion information and the predicted reply characteristic information; and constructing a first loss function according to the predicted reply statement and the reply sample, constructing a second loss function according to the predicted emotion information and the emotion sample corresponding to the reply sample, and optimizing parameters of the emotion conversation model to be trained according to the first loss function and the second loss function so as to obtain the emotion conversation model.
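A minimal sketch of how the two loss functions described in this training method might be combined in practice, assuming both are implemented as cross-entropy losses in PyTorch; the function name, tensor shapes and the weighting factor are illustrative assumptions rather than details given by the disclosure.

```python
import torch.nn.functional as F

def joint_loss(reply_logits, reply_target, emotion_logits, emotion_target, emotion_weight=1.0):
    """Combine the reply-generation loss and the emotion-prediction loss.

    reply_logits:   (batch, seq_len, vocab_size) scores for the predicted reply sentence
    reply_target:   (batch, seq_len) token ids of the reply sample
    emotion_logits: (batch, num_emotions) scores for the predicted emotion
    emotion_target: (batch,) emotion class ids of the emotion sample for the reply
    """
    # First loss: cross-entropy between the predicted reply sentence and the reply sample.
    first_loss = F.cross_entropy(
        reply_logits.reshape(-1, reply_logits.size(-1)),
        reply_target.reshape(-1),
    )
    # Second loss: cross-entropy between the predicted emotion and the emotion sample.
    second_loss = F.cross_entropy(emotion_logits, emotion_target)
    # Optimize the model parameters against the (weighted) sum of both losses.
    return first_loss + emotion_weight * second_loss
```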
According to an aspect of an embodiment of the present disclosure, there is provided a training apparatus for emotion dialogue model, including: the system comprises a sample acquisition module, a training module and a training module, wherein the sample acquisition module is used for acquiring a multi-source knowledge sample related to a conversation participant, inputting the multi-source knowledge sample into an emotion conversation model to be trained, and the multi-source knowledge sample at least comprises a historical conversation sample, a reply sample and emotion samples corresponding to sentences in the historical conversation sample and the reply sample; the emotion prediction module is used for encoding the historical dialogue sample and emotion samples corresponding to each sentence in the historical dialogue sample to obtain sample characteristic information, and determining predicted emotion information corresponding to a current dialogue participant according to the sample characteristic information; the reply prediction module is used for encoding the characteristic information corresponding to the reply sample and the sample characteristic information to obtain prediction reply characteristic information, and determining a prediction reply statement according to the emotion characteristic corresponding to the prediction emotion information and the prediction reply characteristic information; and the parameter optimization module is used for constructing a first loss function according to the predicted reply statement and the reply sample, constructing a second loss function according to the predicted emotion information and the emotion sample corresponding to the reply sample, and optimizing parameters of the emotion conversation model to be trained according to the first loss function and the second loss function so as to acquire the emotion conversation model.
In some embodiments of the disclosure, based on the foregoing solution, the multi-source knowledge sample further includes an expression image sample corresponding to each sentence in the historical dialogue sample and a personality sample of the dialogue participant, as well as an expression image sample corresponding to the reply sample and a personality sample of the dialogue participant.
In some embodiments of the disclosure, based on the foregoing, the emotion prediction module is further configured to: encoding the historical dialogue sample, emotion samples, expression image samples and personality samples of dialogue participants corresponding to all sentences in the historical dialogue sample to acquire sample characteristic information; the reply prediction module is further configured to: and determining the predicted reply sentence according to the emotion characteristics corresponding to the predicted emotion information, the personality characteristics corresponding to the current dialogue participant and the predicted reply characteristic information, wherein the current dialogue participant is the dialogue participant corresponding to the reply sample.
According to one aspect of the disclosed embodiments, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the emotion conversation generation method and the training method of the emotion conversation model provided in the above-described implementation.
According to an aspect of the embodiments of the present disclosure, there is provided an electronic device including: one or more processors; and the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the emotion dialogue generation method and the emotion dialogue model training method provided by the implementation mode.
In the technical schemes provided by some embodiments of the present disclosure, multi-source knowledge of a conversation participant is encoded through an emotion conversation model to obtain reply emotion prediction information, and feature information corresponding to historical reply information and multi-source knowledge feature information generated in an encoding process are processed to obtain feature information to be replied; and finally, determining a target reply sentence according to the emotion characteristics and the characteristic information to be replied, which correspond to the reply emotion prediction information. According to the technical scheme, on one hand, emotion in the reply sentence can be accurately predicted according to multi-source knowledge related to the conversation participants, and a reply sentence corresponding to the latest conversation sentence is generated, so that the prediction accuracy of emotion of the current conversation participant is improved, and the accuracy of the reply sentence is improved; on the other hand, the performance of the emotion dialogue system can be improved, and the user experience is further improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure. It will be apparent to those of ordinary skill in the art that the drawings in the following description are merely examples of the disclosure and that other drawings may be derived from them without undue effort. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of embodiments of the present disclosure may be applied;
FIG. 2 schematically illustrates an architecture diagram of an Emo-HRED model in the related art;
FIG. 3 schematically illustrates a flow diagram of a method of emotion dialogue generation in accordance with one embodiment of the present disclosure;
FIG. 4 schematically illustrates a schematic diagram of a dialog containing multi-source knowledge in accordance with one embodiment of the present disclosure;
FIG. 5 schematically illustrates an architectural diagram of an emotion conversation model, according to one embodiment of the present disclosure;
FIG. 6 schematically illustrates a flow diagram for obtaining return emotion prediction information, according to one embodiment of the present disclosure;
FIG. 7 schematically illustrates a flow diagram of emotion conversation model training, according to one embodiment of the present disclosure;
FIG. 8 schematically illustrates a framework diagram of an emotion conversation generation device in accordance with one embodiment of the present disclosure;
FIG. 9 schematically illustrates a frame diagram of a training apparatus for emotion conversation models in accordance with one embodiment of the present disclosure;
FIG. 10 illustrates a schematic diagram of a computer system suitable for use in implementing the emotion conversation generation device and training device for emotion conversation models of embodiments of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. However, the exemplary embodiments may be embodied in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of the example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the disclosure. One skilled in the relevant art will recognize, however, that the disclosed aspects may be practiced without one or more of the specific details, or with other methods, components, devices, steps, etc. In other instances, well-known methods, devices, implementations, or operations are not shown or described in detail to avoid obscuring aspects of the disclosure.
The block diagrams depicted in the figures are merely functional entities and do not necessarily correspond to physically separate entities. That is, the functional entities may be implemented in software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
The flow diagrams depicted in the figures are exemplary only, and do not necessarily include all of the elements and operations/steps, nor must they be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the order of actual execution may be changed according to actual situations.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of the embodiments of the present disclosure may be applied.
As shown in fig. 1, system architecture 100 may include a terminal device 101, a network 102, and a server 103. The terminal device 101 may be a terminal device with a voice collection module, such as a mobile phone, a portable computer or a tablet computer; further, the terminal device 101 may also be a terminal device including both a voice collection module and an image collection module. The network 102 is the medium used to provide the communication link between the terminal device 101 and the server 103, and may include various connection types, such as wired communication links and wireless communication links; in the disclosed embodiments, the network 102 between the terminal device 101 and the server 103 may be a wireless communication link, in particular a mobile network.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminals, networks, and servers, as desired for implementation. For example, the server 103 may be a server cluster formed by a plurality of servers, and may be used to store information related to emotion dialogue generation.
In one embodiment of the present disclosure, a user speaks to the terminal device 101. A voice collection module in the terminal device 101 collects the user's voice information; when the terminal device 101 also has an image collection module, the user's facial expressions can be collected as well. Dialogue sentences of the user are obtained from the collected voice information, and expression images of the user are obtained from the collected facial expressions. The terminal device 101 then sends the dialogue sentences, or the dialogue sentences together with the expression images, to the server 103 through the network 102. The server 103 can preliminarily determine the emotion, or the emotion and the personality, of the dialogue participant corresponding to each dialogue sentence from the received information. It can then perform reply emotion prediction according to the dialogue sentences and the emotion information corresponding to each sentence, or according to the dialogue sentences together with the expression image, emotion information and personality information of the dialogue participant corresponding to each sentence. The feature information to be replied is determined according to the historical reply information and the feature information into which the historical dialogue information is fused, and the target reply sentence containing emotion is determined according to the predicted reply emotion prediction information corresponding to the current dialogue participant and the feature information to be replied, or additionally according to the personality features of the current dialogue participant. In a human-machine conversation scene, the conversation takes place between a person and a robot, so the dialogue participants include the user and the robot; the current dialogue participant is the robot, and the personality features of the current dialogue participant are those of the role played by the robot. During the emotion dialogue, the emotion dialogue model processes the dialogue sentences and the emotion corresponding to each sentence, or the dialogue sentences and the expression image, personality information and emotion information of the dialogue participants corresponding to each sentence, to obtain the reply emotion prediction information and the feature information to be replied. The target reply sentence can then be determined according to the emotion features corresponding to the reply emotion prediction information and the feature information to be replied, or according to the emotion features corresponding to the reply emotion prediction information, the personality features of the current dialogue participant and the feature information to be replied.
It should be noted that, the emotion dialogue generation method provided in the embodiments of the present disclosure is generally executed by a server, and accordingly, the emotion dialogue generation device is generally disposed in the server. However, in other embodiments of the present disclosure, the emotion dialogue generation scheme provided in the embodiments of the present disclosure may also be executed by a terminal device, for example, software corresponding to the emotion dialogue generation method is installed in the terminal device in a plug-in manner, and after the terminal device acquires a voice signal and an expression image of a user, the terminal device may directly call the software to perform emotion dialogue.
In the related art, an Emo-HRED model is generally used to process emotion dialogue tasks. The Emo-HRED model is based on a deep RNN network, and FIG. 2 shows a schematic architecture of the Emo-HRED model. As shown in FIG. 2, features are extracted sequentially from the word vectors (w_{1,1}, ..., w_{1,N1}) of the input sentence through RNN units, and the feature information output by the last RNN unit is taken as the hidden layer information h_utt of the sentence. The hidden layer information h_utt is copied to obtain the feature information h_dlg; the feature information h_dlg is then linearly transformed to obtain the emotion information h_emo. Finally, the feature information h_dlg and the emotion information h_emo are processed through a plurality of RNN units to obtain the reply (w_{2,1}, ..., w_{2,N2}). Next, the sentences (w_{1,1}, ..., w_{1,N1}) and (w_{2,1}, ..., w_{2,N2}) are taken as the input, and repeating the above procedure yields the corresponding reply (w_{3,1}, ..., w_{3,N3}); that is, the (N+1)-th reply sentence can be obtained by repeating the above flow with the previous N sentences as input.
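For orientation only, the following is a rough PyTorch-style sketch of the related-art Emo-HRED flow described above (word-level RNN producing h_utt, copied to h_dlg, linearly transformed to h_emo, then decoded into the reply). The module sizes, the use of GRU cells and the class name are assumptions, not details taken from the Emo-HRED work or from this disclosure.

```python
import torch
import torch.nn as nn

class EmoHREDSketch(nn.Module):
    """Illustrative sketch: word-level RNN -> h_utt -> h_dlg -> h_emo -> reply."""

    def __init__(self, vocab_size, emb_dim=128, hid_dim=256, emo_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.utt_rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)    # sentence-level RNN
        self.emo_proj = nn.Linear(hid_dim, emo_dim)                  # linear transform to h_emo
        self.dec_rnn = nn.GRU(emb_dim, hid_dim + emo_dim, batch_first=True)
        self.out = nn.Linear(hid_dim + emo_dim, vocab_size)

    def forward(self, input_ids, reply_ids):
        # Encode the input sentence word by word; the last hidden state is h_utt.
        _, h_utt = self.utt_rnn(self.embed(input_ids))               # (1, batch, hid_dim)
        h_dlg = h_utt                                                # copied dialogue-level state
        h_emo = self.emo_proj(h_dlg)                                 # emotion information
        # Decode the reply conditioned on both h_dlg and h_emo.
        init = torch.cat([h_dlg, h_emo], dim=-1)                     # (1, batch, hid+emo)
        dec_out, _ = self.dec_rnn(self.embed(reply_ids), init)
        return self.out(dec_out)                                     # scores for w_{2,1} ... w_{2,N2}
```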
It can be seen that the method only carries out emotion analysis on the text sentences to realize emotion dialogue generation, but inaccuracy or errors may exist in the process of acquiring emotion information, so that reply sentences are inaccurate and do not correspond to input sentences, and further user experience is reduced.
In view of the problems existing in the related art, the embodiments of the present disclosure provide an emotion dialogue generation method implemented on the basis of machine learning, which is a branch of artificial intelligence (AI). Artificial intelligence is a theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the abilities of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer Vision (CV) is the science of studying how to make machines "see". More specifically, it uses cameras and computers in place of human eyes to recognize, track and measure targets, and further performs graphics processing so that the result becomes an image more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric recognition techniques such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills, and how it can reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental way to give computers intelligence; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and demonstration learning.
With the research and advancement of artificial intelligence technology, it is being studied and applied in more and more fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart healthcare and smart customer service. It is believed that with the development of technology, artificial intelligence will be applied in still more fields and play an increasingly important role.
The scheme provided by the embodiment of the disclosure relates to natural language processing technology of artificial intelligence, and is specifically described by the following embodiments:
The embodiment of the disclosure first provides an emotion dialogue generation method, which can be applied to the field of human-machine interaction as well as to voice dialogue between robots, and the like. Taking the simplest person-versus-robot scenario in human-machine interaction as an example, the implementation details of the technical solution of the embodiments of the present disclosure are elaborated as follows:
fig. 3 schematically illustrates a flow diagram of an emotion conversation generation method, which may be performed by a server, which may be the server 103 shown in fig. 1, according to one embodiment of the present disclosure. Referring to fig. 3, the emotion dialogue generation method at least includes steps S310 to S340, and is described in detail as follows:
in step S310, multi-source knowledge about the participants of the conversation is acquired.
In one embodiment of the present disclosure, multi-source knowledge related to the conversation participants is acquired first. The conversation participants are all the roles taking part in the conversation in a scene, including the user and the robot, and the multi-source knowledge is multiple types of knowledge related to the conversation participants; the specific types of knowledge can be determined according to actual needs. For example, since the conversation is an emotion conversation, the dialogue sentences produced by the conversation participants during the conversation need to be acquired in order to accurately predict the emotion that the reply should contain and to better determine the content of the reply. Furthermore, in order to capture the emotion of the conversation participants, the emotion that the reply should contain can be predicted according to the dialogue sentences and the emotion corresponding to each sentence, and the predicted emotion can be fused into the content of the reply so as to realize a high-quality emotion conversation. In other words, the multi-source knowledge may consist of the dialogue sentences and the emotion information corresponding to each sentence. In the embodiment of the disclosure, feature extraction may be performed on the dialogue sentences through a pre-trained emotion recognition model to obtain the emotion information corresponding to each dialogue sentence, where the emotion recognition model may be trained on a corpus and the emotion labels annotated for that corpus.
Further, in order to better grasp the emotion of the conversation participants and correctly predict the emotion that the reply of the current conversation participant should contain, the expression image of a conversation participant while speaking can be obtained for auxiliary analysis, so that the current emotion of the participant can be analyzed accurately from the expression. In addition, the personality of a conversation participant influences his or her manner of expression, so the personality information of the conversation participants can also be obtained. The multi-source knowledge is then formed by the dialogue sentences together with the emotion information, expression images and personality information corresponding to each sentence, and is further processed through the emotion dialogue model to obtain the corresponding reply. In a human-machine conversation scene, the expression image of the user can be obtained directly, and the emotion information and personality information of the conversation participants can be determined according to the user's dialogue sentences and expression images; the personality of the robot is produced by training on a large number of training samples, and since the robot can play multiple roles it can have multiple personalities, which change correspondingly with the conversation scene. Of course, other information about the conversation participants can also be obtained for auxiliary emotion analysis and prediction, helping the robot make the most appropriate reply.
In one embodiment of the present disclosure, the multi-source knowledge may include a historical dialogue related to the conversation participants and the emotion information corresponding to each sentence in the historical dialogue; it may include the historical dialogue and the emotion information and expression image corresponding to each sentence; it may include the historical dialogue, the emotion information corresponding to each sentence and the personality information of the conversation participants; or it may include the historical dialogue and the emotion information, expression image and personality information of the conversation participants corresponding to each sentence. Here, the historical dialogue consists of all sentences spoken by the user and the robot from the initial moment to the current moment. The emotion information corresponding to each sentence is the emotion determined from the dialogue, or from the dialogue and the expression image, such as sadness, happiness, surprise or fear. The expression image of the user is obtained by the robot photographing the user while the user speaks: the robot captures the user's expression for each sentence, and several images may be captured continuously; in order to accurately judge the emotion when each sentence is spoken, the image among them in which the facial features and expression of the participant can be clearly distinguished may be used as the expression image, while the expression image corresponding to a sentence spoken by the robot can be retrieved according to the emotion that the sentence should contain. The personality information is the personality of a conversation participant, such as open, perfectionist, conscientious or easygoing. FIG. 4 shows a schematic diagram of a dialogue containing multi-source knowledge. As shown in FIG. 4, part (a) is the content of a historical dialogue including 8 sentences U1-U8, part (b) is the emotion corresponding to each sentence, part (c) contains the expression images of the conversation participants involved, and part (d) gives the names of the conversation participants, which correspond to their personalities; since each person's personality is unique, the personality of a participant is defined once the participant's name is determined. It should be noted that FIG. 4 shows multi-source knowledge containing 4 kinds of knowledge; the composition of the multi-source knowledge can be increased or decreased on this basis for different application scenarios, and the dialogue shown in FIG. 4 is a training sample used only to schematically illustrate the composition of multi-source knowledge.
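To make the composition of the multi-source knowledge concrete, the following is a small illustrative Python data structure for one conversation like the one in FIG. 4. The field names and the example personality labels are assumptions, not a format defined by the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Utterance:
    """One sentence of the historical dialogue with its aligned multi-source knowledge."""
    text: str                                # the dialogue sentence, e.g. U1 ... U8 in FIG. 4
    speaker: str                             # name of the participant, which fixes the personality
    emotion: str                             # emotion label, e.g. "sadness" or "happiness"
    expression_image: Optional[str] = None   # path to the expression image captured for this sentence

@dataclass
class MultiSourceKnowledge:
    """Multi-source knowledge for one conversation: history dialogue plus per-sentence knowledge."""
    history: List[Utterance]                 # all sentences from the initial moment to the current moment
    personalities: Dict[str, str] = field(default_factory=dict)  # speaker name -> personality label
```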
It should be noted that when the user speaks the first sentence to the robot, the robot cannot yet determine the user's personality information and emotion information; these can then be determined by analyzing the dialogue sentence, or the dialogue sentence together with the expression image captured while speaking. For example, suppose a round of dialogue is "M: Hello! S: Today is Monday." Whether the emotion of user S is happy or sad cannot be determined accurately from this sentence alone, so the emotion can be determined from the expression image captured while user S speaks: if user S is frowning in the expression image, the emotion can be judged as sad; if the corners of user S's mouth are raised and the eyes are slightly narrowed, the emotion can be judged as happy. As for the personality information of the user, the user's dialogue sentences can be processed through a designed personality prediction model to obtain the personality information.
In one embodiment of the present disclosure, after the multi-source knowledge of the conversation participants is acquired, the multi-source knowledge may be input into an emotion conversation model, feature extraction is performed on the multi-source knowledge through the emotion conversation model, and a reply sentence corresponding to the latest dialogue sentence is output. In an embodiment of the disclosure, the emotion conversation model includes an encoder and a decoder. The encoder is configured to process the input multi-source knowledge to obtain the multi-source knowledge feature information and the reply emotion prediction information. The decoder is configured to determine the feature information to be replied according to the historical reply information and the multi-source knowledge feature information, and finally to determine the target reply sentence according to the emotion features corresponding to the reply emotion prediction information and the feature information to be replied; further, the decoder may determine the target reply sentence according to the emotion features corresponding to the reply emotion prediction information, the feature information to be replied and the personality information of the current conversation participant. The reply emotion prediction information is the emotion, predicted by the encoder, that the reply sentence should contain, and the target reply sentence is the reply sentence corresponding to the latest dialogue sentence.
In step S320, the multi-source knowledge is encoded to obtain multi-source knowledge feature information, and the reply emotion prediction information corresponding to the current conversation participant is determined according to the multi-source knowledge feature information.
In one embodiment of the present disclosure, FIG. 5 shows an architectural diagram of an emotion conversation model, as shown in FIG. 5, emotion conversation model 500 includes encoder 501 and decoder 502; wherein encoder 501 includes a multi-source knowledge input layer 501-1, a feature initialization layer 501-2, a feature stitching layer 501-3, a feed forward neural network layer 501-4, and an emotion prediction layer 501-5.
Next, based on the structure of the encoder 501, a detailed description is given of how to obtain the reply emotion prediction information corresponding to the current conversation participant, specifically the robot.
The multi-source knowledge related to the conversation participants acquired in step S310 may be transmitted to the feature initialization layer 501-2 through the multi-source knowledge input layer 501-1, and feature extraction is performed on the various kinds of knowledge in the multi-source knowledge through the feature initialization layer 501-2 to obtain multiple pieces of sub-feature information. The feature initialization layer 501-2 sends the multiple pieces of sub-feature information to the feature splicing layer 501-3, which splices the sub-feature information to obtain the first spliced feature information. Finally, the feature splicing layer 501-3 may send the first spliced feature information to the feedforward neural network layer 501-4, which performs full-connection processing on it to obtain the multi-source knowledge feature information. Further, the multi-source knowledge feature information may be transmitted to the emotion prediction layer 501-5, which processes it to obtain the reply emotion prediction information.
In one embodiment of the present disclosure, the number and types of sub-feature information correspond to the categories of knowledge that the multi-source knowledge includes, and the more knowledge categories the multi-source knowledge includes, the more accurate the obtained reply emotion prediction information and target reply sentences corresponding to the current conversation participant. When the multi-source knowledge includes a history dialogue related to a dialogue participant, and emotion information, an expression image, and personality information of the dialogue participant corresponding to each sentence in the history dialogue, the plurality of sub-feature information includes first sub-feature information corresponding to the history dialogue, second sub-feature information corresponding to the expression image, third sub-feature information corresponding to the emotion information, and fourth sub-feature information corresponding to the personality information. Specifically, the feature initialization layer 501-2 includes a long-short-term memory network, a self-attention network, an image feature extraction network and a feedforward neural network, and the context relationship in the history dialogue can be encoded sequentially through the long-short-term memory network and the self-attention network to obtain the first sub-feature information; extracting the characteristics of the expression image through an image characteristic extraction network to obtain expression characteristics, and carrying out full-connection processing on the facial expression characteristics through a feedforward neural network layer to obtain second sub-characteristic information; and simultaneously, inquiring in an emotion lookup table according to the emotion information to acquire the third sub-feature information, and inquiring in a personality lookup table according to the personality information to acquire the fourth sub-feature information.
In one embodiment of the present disclosure, when the context in the historical dialogue is encoded sequentially through the long short-term memory network and the self-attention network to obtain the first sub-feature information, feature extraction may first be performed on each dialogue sentence in the historical dialogue through the long short-term memory network to obtain the hidden layer feature information corresponding to each dialogue sentence; secondly, position coding is performed on the hidden layer feature information according to the order in which the sentences appear, so as to obtain the dialogue feature information; finally, the dialogue feature information is encoded through the self-attention network to obtain the first sub-feature information. In the embodiment of the disclosure, the hidden layer vector corresponding to the last word in each dialogue sentence may be used as the hidden layer feature information. For example, if the historical dialogue includes N sentences, feature extraction is performed word by word from the first sentence to the N-th sentence through the long short-term memory network, and the hidden layer vector of the last word in each sentence is taken as that sentence's hidden layer feature information, so the hidden layer feature information corresponding to the historical dialogue is h_u = {h_u1, h_u2, ..., h_uN}. Then, to distinguish the order of the dialogue, the hidden layer feature information is position-coded. Since the (N-1)-th sentence is the basis on which the N-th sentence is generated, the (N-1)-th sentence is the one closest to the N-th sentence and the 1st sentence is the farthest from it; position coding therefore yields the dialogue feature information H_u = {[h_u1; PE_N], [h_u2; PE_{N-1}], ..., [h_uN; PE_1]}. Further, to obtain feature information at the level of the entire dialogue, a self-attention network may be used to encode the hidden layer feature information of all dialogue sentences to obtain the first sub-feature information that represents the entire historical dialogue. In embodiments of the present disclosure, a multi-head self-attention mechanism may be employed, which can be expressed as X_u = MultiHead(H_u, H_u, H_u), where the three copies of the dialogue feature information H_u serve as Q (query), K (key) and V (value) respectively. When X_u is determined, Q and K are aligned and the similarity between the elements of Q and K is computed; the similarity values are then normalized to produce attention values; finally, the corresponding V entries are weighted and summed with the attention values to obtain the feature vectors. The same operation is performed multiple times, and the resulting feature vectors are spliced to obtain X_u. Of course, a single-head attention mechanism may also be used to encode the hidden layer feature information of all dialogue sentences, which is not specifically limited in the embodiments of the present disclosure.
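The following PyTorch-style sketch illustrates the history-dialogue path just described: a sentence-level LSTM whose last hidden state is kept, a position code concatenated in reverse order of appearance, and multi-head self-attention with Q = K = V. The class name, the dimensions, and the use of a learned position embedding (rather than a fixed one) are assumptions, and (hid_dim + pos_dim) is assumed to be divisible by the number of heads.

```python
import torch
import torch.nn as nn

class HistoryDialogueEncoder(nn.Module):
    """Sketch of the path: LSTM -> position code -> multi-head self-attention -> X_u."""

    def __init__(self, emb_dim, hid_dim, pos_dim, num_heads=4):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hid_dim, batch_first=True)
        self.pos_emb = nn.Embedding(512, pos_dim)          # distance-from-latest-sentence encoding
        self.attn = nn.MultiheadAttention(hid_dim + pos_dim, num_heads, batch_first=True)

    def forward(self, sentences):
        # sentences: list of N tensors, each (num_words, emb_dim), one per dialogue sentence.
        hidden = []
        for sent in sentences:
            out, _ = self.lstm(sent.unsqueeze(0))          # (1, num_words, hid_dim)
            hidden.append(out[:, -1, :])                   # hidden vector of the last word: h_ui
        h_u = torch.stack(hidden, dim=1)                   # (1, N, hid_dim)
        n = h_u.size(1)
        # Sentence N gets PE_1 (closest), sentence 1 gets PE_N (farthest).
        positions = torch.arange(n, 0, -1)                 # N, N-1, ..., 1
        pe = self.pos_emb(positions).unsqueeze(0)          # (1, N, pos_dim)
        H_u = torch.cat([h_u, pe], dim=-1)                 # [h_ui ; PE]
        # Multi-head self-attention with Q = K = V = H_u gives the dialogue-level representation X_u.
        X_u, _ = self.attn(H_u, H_u, H_u)
        return X_u                                         # (1, N, hid_dim + pos_dim)
```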
In one embodiment of the present disclosure, when feature extraction is performed on the expression images through the image feature extraction network to obtain the facial expression features, and full-connection processing is performed on the facial expression features through the feedforward neural network layer, the expression images may be processed with a tool such as OpenFace to obtain the facial expression features G_f = {g_f1, g_f2, ..., g_fN}; the feedforward neural network layer then performs full-connection processing on the facial expression features G_f to obtain the second sub-feature information, which can be expressed as X_f = FFN(G_f). It should be noted that the image feature extraction network may be a network other than OpenFace, and the feedforward neural network layer may be a common fully connected layer, as long as the extraction of the facial expression features G_f and the full-connection processing of G_f are accomplished.
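A minimal sketch of the X_f = FFN(G_f) step, assuming the facial expression features G_f have already been extracted by an external tool (such as OpenFace) into a tensor of shape N x face_dim; the two-layer feed-forward structure is an assumption.

```python
import torch.nn as nn

class ExpressionFeatureEncoder(nn.Module):
    """Sketch: full connection over facial features extracted by an external tool."""

    def __init__(self, face_dim, out_dim):
        super().__init__()
        # FFN(G_f): maps G_f (N x face_dim) to the second sub-feature information X_f (N x out_dim).
        self.ffn = nn.Sequential(nn.Linear(face_dim, out_dim), nn.ReLU(), nn.Linear(out_dim, out_dim))

    def forward(self, g_f):
        return self.ffn(g_f)   # X_f = FFN(G_f)
```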
In one embodiment of the present disclosure, there are two trainable parameter lookup tables: the emotion lookup table EMB1 and the personality lookup table EMB2, both of which are k-v (key-value) tables. In the emotion lookup table EMB1 the emotion type is the key and the feature vector corresponding to that emotion type is the value; in the personality lookup table EMB2 the personality type is the key and the feature vector corresponding to that personality type is the value. After the emotion information and the personality information are obtained, the corresponding feature vectors, namely the third sub-feature information X_e and the fourth sub-feature information X_s, can be obtained by querying the corresponding lookup tables. As an example, emotions can be classified into 7 categories, namely neutral, happiness, surprise, sadness, anger, disgust and fear, so the emotion lookup table EMB1 is a k-v table with 7 entries; personalities may be classified into 20 types, such as open, perfectionist, conscientious and easygoing, so the personality lookup table EMB2 may be a k-v table with 20 entries.
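The two trainable lookup tables can be sketched as embedding tables, for example as below; the concrete label lists, the shortened personality list and the feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Trainable k-v lookup tables (label sets follow the example above; dimensions are assumed).
EMOTIONS = ["neutral", "happiness", "surprise", "sadness", "anger", "disgust", "fear"]
PERSONALITIES = ["open", "perfectionist", "conscientious", "easygoing"]  # ... up to 20 types

class KnowledgeLookupTables(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.emb_emotion = nn.Embedding(len(EMOTIONS), feat_dim)            # EMB1
        self.emb_personality = nn.Embedding(len(PERSONALITIES), feat_dim)   # EMB2

    def forward(self, emotion_labels, personality_labels):
        # Map label strings to row indices (keys) and look up the feature vectors (values).
        e_idx = torch.tensor([EMOTIONS.index(e) for e in emotion_labels])
        s_idx = torch.tensor([PERSONALITIES.index(s) for s in personality_labels])
        return self.emb_emotion(e_idx), self.emb_personality(s_idx)         # X_e, X_s
```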
In one embodiment of the present disclosure, after the first, second, third and fourth sub-feature information are acquired, they may first be spliced to obtain first spliced feature information, and the first spliced feature information may then be fully connected to obtain the multi-source knowledge feature information. Since the first, second, third and fourth sub-feature information are all of size N×D, where N is the number of sentences contained in the history dialogue and D is the dimension of the feature information corresponding to each sentence (for example 128 or 256), the four pieces of sub-feature information can be spliced along the second dimension D to obtain first spliced feature information of size N×4D, which may be expressed as H_ufes = Concat(X_u; X_f; X_e; X_s). When the first spliced feature information is fully connected, the feedforward neural network layer 501-4 may be used to obtain the multi-source knowledge feature information, which is the final representation at the encoder side and may be expressed as H_enc = FFN(H_ufes); the multi-source knowledge feature information is the feature information that fuses all of the dialogue information in the history dialogue. Of course, the feedforward neural network layer 501-4 may be replaced by a common fully connected layer to perform the full connection processing on the first spliced feature information and obtain the multi-source knowledge feature information.
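A minimal sketch of this fusion step follows. The projection from 4D to 2D is an assumption that is merely consistent with the N×2D size mentioned in the next paragraph; the actual output width of the disclosure's feedforward layer is not specified here.

```python
import torch
import torch.nn as nn

N, D = 3, 128
X_u, X_f, X_e, X_s = (torch.randn(N, D) for _ in range(4))

H_ufes = torch.cat([X_u, X_f, X_e, X_s], dim=-1)   # (N, 4D) spliced features
ffn = nn.Sequential(nn.Linear(4 * D, 2 * D), nn.ReLU(),
                    nn.Linear(2 * D, 2 * D))
H_enc = ffn(H_ufes)                                # (N, 2D), encoder-side representation
print(H_enc.shape)                                 # torch.Size([3, 256])
```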
In one embodiment of the present disclosure, after the multi-source knowledge feature information is acquired, the reply emotion prediction information may be determined according to the multi-source knowledge feature information. Fig. 6 shows a flow chart of acquiring the reply emotion prediction information. As shown in fig. 6, in step S601, dimension reduction is performed on the multi-source knowledge feature information to obtain a feature vector whose first dimension is 1; in step S602, full connection processing and normalization processing are performed on the feature vector to obtain the reply emotion prediction information. In step S601, since only one comprehensive emotion is ultimately used, and the multi-source knowledge feature information after the full connection processing may have a size of N×2D, the multi-source knowledge feature information needs to be compressed so that its first dimension is reduced to 1, forming feature information of size 1×M (M is a positive integer less than or equal to 2D). The dimension reduction may specifically be implemented with a pooling operation, for example average pooling or maximum pooling; in the embodiment of the present disclosure, average pooling is used to reduce the dimension of the multi-source knowledge feature information, and the feature vector after dimension reduction may be expressed as H_mean = MeanPooling(H_enc). The size of the pooling kernel can be set according to actual needs, as long as the first dimension is compressed from N to 1. In step S602, the reply emotion prediction information, which is emotion probability distribution information, is obtained by fully connecting and normalizing the feature vector. For example, when emotions are classified into 7 categories, the reply emotion prediction information consists of the probability values corresponding to the 7 emotion categories and may be expressed as P = Softmax(W_P · H_mean), where W_P is the weight used when fully connecting the feature vector H_mean.
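The following sketch illustrates steps S601 and S602: mean pooling over the sentence dimension followed by a softmax classifier. Dimensions and the use of a single linear layer for W_P are assumptions.

```python
import torch
import torch.nn as nn

N, two_D, n_emotions = 3, 256, 7
H_enc = torch.randn(N, two_D)                   # multi-source knowledge features

H_mean = H_enc.mean(dim=0, keepdim=True)        # S601: average pooling, (1, 2D)
W_p = nn.Linear(two_D, n_emotions)              # full-connection weights W_P
P = torch.softmax(W_p(H_mean), dim=-1)          # S602: reply emotion prediction info
print(P)                                        # probability over 7 emotion categories
```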
In step S330, the feature information corresponding to the historical reply information and the multi-source knowledge feature information are encoded to obtain the feature information to be replied.
In one embodiment of the present disclosure, after the multi-source knowledge feature information and the reply emotion prediction information are acquired, the acquired information may be processed by a decoder to obtain a target reply sentence containing emotion. The decoder determines the reply content from the final representation at the encoder side, i.e. the multi-source knowledge feature information, and the reply is typically generated word by word. For example, if the user asks "How is the weather today?" and the reply to be generated is "The weather is nice today", the robot generates the reply in word order; that is, when the t-th word is generated, words 1 to t-1 have already been generated. At the same time, because the words are interrelated, determining the t-th word depends not only on the history dialogue sentences but also on the historical reply information that has already been generated.
Returning to FIG. 5, as shown in FIG. 5, the decoder 502 includes a history reply input layer 502-1, a self-attention network layer 502-2, a codec self-attention network layer 502-3, and a feed forward neural network layer 502-4. Next, how to acquire the feature information to be replied to is described based on the structure of the decoder 502.
First, the historical reply information is sent to the historical reply input layer 502-1, and a word embedding operation is performed on it to obtain the historical reply vector corresponding to the historical reply information. The historical reply vector is then sent to the self-attention network layer 502-2 and encoded based on a self-attention mechanism to obtain the historical reply feature information; a multi-head attention mechanism may also be used when encoding the historical reply vector, in which case the historical reply feature information may be expressed as H_r = MultiHead(R_r, R_r, R_r), where R_r is the historical reply vector. Next, the historical reply feature information and the multi-source knowledge feature information are input into the coding-decoding self-attention network layer 502-3; since the multi-source knowledge feature information contains all of the information of the history dialogue and the historical reply feature information contains the replies already generated, encoding the two with a self-attention mechanism yields the correlation between the multi-source knowledge feature information and the historical reply information. Finally, the feature information output by the coding-decoding self-attention network layer 502-3 is fully connected through the feedforward neural network layer 502-4 to obtain the feature information to be replied, which is the correlation between the multi-source knowledge feature information and the historical reply information after the full connection processing, and may be expressed as O = FFN(MultiHead(H_r, H_enc, H_enc)), where H_r, H_enc and H_enc serve as Q, K and V respectively; the computation flow of the multi-head self-attention here is the same as that used to obtain the first sub-feature information, and is not repeated here.
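The decoder-side computation can be sketched as follows: self-attention over the history reply, encoder-decoder attention against H_enc, then a feed-forward layer. The module choices and dimensions are illustrative assumptions, not the exact layers 502-2 to 502-4.

```python
import torch
import torch.nn as nn

class ReplyDecoderBlock(nn.Module):
    def __init__(self, d_model=256, num_heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                 nn.Linear(d_model, d_model))

    def forward(self, R_r, H_enc):
        # R_r: (1, t-1, d) history reply vectors; H_enc: (1, N, d) encoder output
        H_r, _ = self.self_attn(R_r, R_r, R_r)        # H_r = MultiHead(R_r, R_r, R_r)
        attn, _ = self.cross_attn(H_r, H_enc, H_enc)  # MultiHead(H_r, H_enc, H_enc)
        return self.ffn(attn)                         # O: feature information to be replied

block = ReplyDecoderBlock()
O = block(torch.randn(1, 4, 256), torch.randn(1, 3, 256))
print(O.shape)   # torch.Size([1, 4, 256])
```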
In step S340, a target reply sentence is determined according to the emotion feature corresponding to the reply emotion prediction information and the feature information to be replied.
In one embodiment of the present disclosure, the feature information to be replied obtained in step S330 may guide the reply that is finally output, and in order to match the finally generated reply with the sentence of the user, the target reply sentence may be determined according to the emotion feature corresponding to the reply emotion prediction information and the feature information to be replied.
In one embodiment of the present disclosure, the decoder 502 further includes a feature stitching layer 502-5, a normalization layer 502-6, and an output layer 502-7. The feature stitching layer 502-5 stitches the emotion feature corresponding to the reply emotion prediction information with the feature information to be replied to obtain second stitched feature information; the normalization layer 502-6 then normalizes the second stitched feature information to obtain the current reply information, a target reply sentence is determined according to the current reply information and the historical reply information, and the target reply sentence is output through the output layer 502-7. The emotion feature corresponding to the reply emotion prediction information may be determined according to the reply emotion prediction information and the emotion lookup table.
In order to further improve the matching degree of the target reply sentence and the dialogue sentence of the dialogue participant, the emotion feature and the personality feature corresponding to the current dialogue participant can be introduced at the same time when the current reply information is predicted, and the target reply sentence with higher matching degree with the dialogue sentence of the dialogue participant can be obtained based on the feature information to be replied and the emotion feature and the personality feature of the current dialogue participant. Specifically, when feature stitching is performed, the feature stitching layer 502-5 is used to stitch the emotion feature corresponding to the reply emotion prediction information, the personalized feature corresponding to the current dialogue participant and the feature information to be replied to obtain second stitching feature information, and further normalize the second stitching feature information to obtain current reply information, and determine a target reply sentence according to the current reply information and the historical reply information. The personality characteristics of the current conversation participant are the personality characteristics of the robot, and can be determined according to the personality information and the personality lookup table of the current conversation participant, and further, as one person corresponds to a unique personality, a personality lookup table can be formed according to the name of the conversation participant and the personality characteristic vector corresponding to the conversation participant, and further, the personality characteristics of the current conversation participant can be determined according to the name of the current conversation participant and the personality lookup table.
The accuracy of emotion in the target reply sentence can be ensured by introducing the reply emotion prediction information and the personalized information of the current conversation participant, the matching degree of the target reply sentence and the historical conversation is improved, and the use experience of a user on a man-machine conversation product is further improved.
When the emotion feature corresponding to the reply emotion prediction information and the feature information to be replied are spliced, the second spliced feature information may be expressed as O_es = Concat(O; E_g); when the emotion feature corresponding to the reply emotion prediction information, the personality feature corresponding to the current dialogue participant and the feature information to be replied are spliced, the second spliced feature information may be expressed as O_es = Concat(O; E_g; S_g), where E_g and S_g are the emotion feature and the personality feature of the current dialogue participant respectively, E_g = [E_p; ...; E_p] and S_g = [S_p; ...; S_p]. In the expressions of E_g and S_g, E_p = Σ(P · EMB1); since P is the probability distribution over the emotion categories and the emotion lookup table EMB1 holds the feature information corresponding to each emotion category, E_p is the comprehensive representation of the various emotions, i.e. the comprehensive emotion feature; S_p is the personality feature corresponding to the current dialogue participant.
After the second spliced feature information is obtained, it can be normalized and decoded to obtain the current reply information, namely the t-th word, whose expression is shown below:
P(y_t | y_1:t-1; G; E_p; θ) = Softmax(W_o · O_es^t), or alternatively

P(y_t | y_1:t-1; G; E_p; S_p; θ) = Softmax(W_o · O_es^t)

where W_o is the weight.
From the above expressions it can be seen that the current reply information is essentially a probability distribution over the words in the vocabulary, and the word finally output can be determined from this distribution, so as to obtain the target reply sentence. The vocabulary can be understood as a data set containing a large number of words: all replies are composed of words in the vocabulary, and the emotion dialogue model only needs to select the words with the highest probability values from the vocabulary according to the history dialogue and combine them to generate the corresponding reply. In the embodiment of the disclosure, if the t-th word is the last word, the target reply sentence is formed from the first t-1 words and the t-th word; if the t-th word is not the last word, the processing flow of the decoder is repeated until the last word is obtained, thereby forming the target reply sentence.
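A sketch of one decoding step follows: the comprehensive emotion feature E_p (and optionally the personality feature S_p) is spliced onto the feature to be replied, projected to the vocabulary and normalized, and the next word is chosen. Greedy argmax selection and all dimensions are assumptions for illustration.

```python
import torch
import torch.nn as nn

d_model, d_emo, vocab_size = 256, 128, 10000
O_t = torch.randn(1, d_model)                # feature to be replied at step t
P = torch.softmax(torch.randn(7), dim=-1)    # reply emotion prediction information
EMB1 = torch.randn(7, d_emo)                 # emotion lookup table
E_p = P @ EMB1                               # E_p = sum(P * EMB1), comprehensive emotion
S_p = torch.randn(d_emo)                     # personality feature of the robot

O_es_t = torch.cat([O_t, E_p.unsqueeze(0), S_p.unsqueeze(0)], dim=-1)
W_o = nn.Linear(d_model + 2 * d_emo, vocab_size)
probs = torch.softmax(W_o(O_es_t), dim=-1)   # P(y_t | y_1:t-1; G; E_p; S_p; theta)
y_t = probs.argmax(dim=-1)                   # id of the t-th word
print(y_t.shape)                             # torch.Size([1])
```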
In the above embodiments, the emotion dialogue method of the present disclosure is described as applied to an emotion dialogue scene of one person versus one robot; the emotion dialogue method of the present disclosure may also be applied to a scene of multiple persons versus one robot. For example, two users A and B and a robot M are carrying out a man-machine dialogue, and each round of dialogue may take place between A and B, between A and M, or between B and M. For each sentence spoken by user A or B, the robot M may acquire the corresponding voice information and expression image, and by analyzing the voice information and the expression image it can determine to whom user A or B is speaking. For example, if A says "B, your clothes today are beautiful", the robot M can confirm from the voice information that A is speaking to B, and may reply after user B has replied. When determining the reply information, the robot M may collect the voice information and expression images of users A and B, infer the emotion information and personality information of A and B from them, and then determine the content to be replied by combining the voice information, the expression images, the emotion information and the personality information. For example, A says with a smile "B, your clothes today are beautiful" and B smiles and says "Thank you"; the robot M can judge that the emotion information of both users is happy and predict that their personalities are mild, obtain the reply emotion prediction information of happiness according to the multi-source knowledge feature information extracted by the encoder, and determine the target reply sentence through the decoder by combining the reply emotion prediction information and the personality feature of the robot M. The target reply sentence may be, for example, "Yes, it looks as beautiful as a flower", so that the entire emotion dialogue resembles communication between people, without abruptness or logical deviation.
According to the emotion dialogue generation method of the present disclosure, the multi-source knowledge is encoded through an emotion dialogue model to obtain multi-source knowledge feature information, and the reply emotion prediction information is then determined according to the multi-source knowledge feature information; next, the feature information to be replied is determined according to the feature information corresponding to the historical reply information and the multi-source knowledge feature information, the reply emotion prediction information is introduced through the splicing operation, and the target reply sentence is obtained by decoding. In the embodiment of the disclosure, the multi-source knowledge may specifically include a history dialogue, the emotion information corresponding to each sentence in the history dialogue, the expression images and the personality information of the dialogue participants, and when the reply emotion prediction information corresponding to the current dialogue participant is introduced, the personality feature of the current dialogue participant may also be introduced. On the one hand, processing the multi-modal information improves the precision of the reply emotion prediction information corresponding to the current dialogue participant and lays a foundation for obtaining a high-quality target reply sentence; on the other hand, by comprehensively considering the feature information to be replied, the reply emotion prediction information and the personality feature information of the dialogue participant, the accuracy of the target reply sentence and its matching degree with the user's words can be improved, further improving the user experience.
In one embodiment of the present disclosure, before the multi-source knowledge of the dialogue participants is processed with the emotion dialogue model to obtain the target reply sentence, the emotion dialogue model needs to be trained to obtain a stable emotion dialogue model. During model training, multi-source knowledge samples related to the dialogue participants can be obtained, and the emotion dialogue model to be trained is trained according to the multi-source knowledge samples to obtain the emotion dialogue model.
Fig. 7 shows a schematic flow chart of emotion dialogue model training, as shown in fig. 7, in step S701, a multi-source knowledge sample is obtained, and the multi-source knowledge sample is input into an emotion dialogue model to be trained, where the multi-source knowledge sample at least includes a history dialogue sample and a reply sample, and emotion samples corresponding to each sentence and reply sample in the history dialogue sample; in step S702, the historical dialog samples and emotion samples corresponding to each sentence in the historical dialog samples are encoded to obtain sample feature information, and predicted emotion information corresponding to the current dialog participant is determined according to the sample feature information; in step S703, the feature information corresponding to the reply sample and the sample feature information are encoded to obtain predicted reply feature information, and a predicted reply sentence is determined according to the emotion feature corresponding to the predicted emotion information and the predicted reply feature information; in step S704, a first loss function is constructed according to the predicted reply sentence and the reply sample, and a second loss function is constructed according to the predicted emotion information and the emotion sample corresponding to the reply sample, and parameters of the emotion conversation model to be trained are optimized according to the first loss function and the second loss function, so as to obtain the emotion conversation model.
In one embodiment of the present disclosure, similar to the composition of the multi-source knowledge in the above embodiments, the multi-source knowledge sample may include a dialogue sample and an emotion sample corresponding to each sentence in the dialogue sample; or a dialogue sample, an expression image sample and an emotion sample corresponding to each sentence in the dialogue sample, and a personality sample of the dialogue participant; or a dialogue sample, an emotion sample corresponding to each sentence in the dialogue sample, and a personality sample of the dialogue participant; or a dialogue sample, and an expression image sample, an emotion sample and a personality sample of the dialogue participant corresponding to each sentence in the dialogue sample. In the model training process, a complete dialogue is usually used as a dialogue sample, and when model training is performed the dialogue sample can be divided into a history dialogue sample and a reply sample. Taking the dialogue shown in fig. 4 as an example, dialogues U1-U4 can be used as the history dialogue sample and dialogue U5 as the reply sample. When the multi-source knowledge sample includes a dialogue sample and the emotion samples corresponding to each sentence in the dialogue sample, the emotion information corresponding to dialogues U1-U4 can be used as the emotion samples corresponding to each sentence in the history dialogue sample, and the emotion information corresponding to dialogue U5 as the emotion sample corresponding to the reply sample; when the multi-source knowledge sample includes a dialogue sample and the expression image sample, emotion sample and personality sample of the dialogue participant corresponding to each sentence in the dialogue sample, the emotion information, expression image and personality information corresponding to each sentence in the dialogue sample can be used as the emotion sample, expression image sample and personality sample corresponding to each sentence in the history dialogue sample, and the emotion information, expression image and personality information corresponding to dialogue U5 as the emotion sample, expression image sample and personality sample corresponding to the reply sample, and so on. Likewise, dialogues U1-U5 may also be used as the history dialogue sample and dialogue U6 as the reply sample, and so on, so that the emotion dialogue model to be trained is trained multiple times according to multi-source knowledge samples of different compositions.
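A small sketch of slicing one complete dialogue into such training pairs is given below: the first k sentences form the history dialogue sample and sentence k+1 is the reply sample, together with the aligned emotion labels. The function name, the minimum history length and the field names are illustrative assumptions.

```python
def build_samples(dialogue, emotions, min_history=4):
    """dialogue: list of sentences U1..Un; emotions: one label per sentence."""
    samples = []
    for k in range(min_history, len(dialogue)):
        samples.append({
            "history": dialogue[:k],            # history dialogue sample
            "history_emotions": emotions[:k],   # emotion samples per sentence
            "reply": dialogue[k],               # reply sample (expected output)
            "reply_emotion": emotions[k],       # supervises emotion prediction
        })
    return samples

dialogue = ["U1", "U2", "U3", "U4", "U5", "U6"]
emotions = ["Neutral", "Happiness", "Neutral", "Sadness", "Happiness", "Neutral"]
for s in build_samples(dialogue, emotions):
    print(len(s["history"]), s["reply"], s["reply_emotion"])
```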
Because the more the types of knowledge contained in the multi-source knowledge sample are, the better the emotion dialogue model obtained by training according to the multi-source knowledge sample is, in the embodiment of the disclosure, the emotion dialogue model to be trained is preferably trained by the multi-source knowledge sample containing the dialogue sample, the emotion image sample corresponding to each sentence in the dialogue sample, the emotion sample and the personality sample of the dialogue participant.
In step S704, after the predicted reply sentence is obtained, a first loss function may be constructed according to the predicted reply sentence and the reply sample, where the first loss function is used to characterize the deviation degree of the target predicted reply sentence and the reply sample, and a second loss function may be constructed according to the predicted emotion information and the emotion sample corresponding to the reply sample, where the second loss function is used to characterize the deviation degree of the predicted emotion information and the standard emotion sample, and then model parameters are optimized according to the first loss function and the second loss function until the training of the preset times is completed or the loss function reaches the minimum, so that a stable emotion dialogue model may be obtained. In an embodiment of the present disclosure, the first loss function and the second loss function may each be a cross entropy loss function, whose expressions are shown in formula (1) and formula (2), respectively:
L_MLL = -Σ_{t=1}^{J} log P(y_t | y_1:t-1; G; E_p; S_p; θ)    (1)

L_CLS = -Σ_{i=1}^{C_1} z_i log(p_i)    (2)

where L_MLL is the first loss function, y_t is the t-th generated word, J is the total number of words contained in the target predicted reply sentence, L_CLS is the second loss function, p_i is the predicted value corresponding to the i-th emotion category, z_i is the standard value corresponding to the i-th emotion category, and C_1 is the total number of emotion categories, which may specifically be 7.
When the second loss function is calculated, the emotion feature corresponding to the emotion sample of the reply sample can be obtained by encoding the emotion sample in a one-hot manner, i.e. in the vector representation corresponding to the emotion sample only the position corresponding to the standard emotion category is 1 and the positions corresponding to the other emotion categories are 0; the emotion feature of the predicted emotion information is the emotion probability distribution P obtained by processing the multi-source knowledge sample through the flow for acquiring the reply emotion prediction information shown in fig. 6; finally, the cross-entropy loss function is determined according to the emotion feature of the labeled emotion sample and the emotion feature of the predicted emotion information.
After the first loss function and the second loss function are obtained, the first loss function and the second loss function can be added, and the emotion dialogue model to be trained is subjected to reverse parameter adjustment according to the added loss functions until a stable emotion dialogue model is obtained. In embodiments of the present disclosure, the first and second loss functions may also be other types of loss functions, as the embodiments of the present disclosure are not particularly limited in this regard.
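The joint objective can be sketched as follows: a cross-entropy generation loss over the reply words plus a cross-entropy emotion classification loss, added together and back-propagated. All tensors and sizes are placeholders for illustration.

```python
import torch
import torch.nn as nn

word_logits = torch.randn(6, 10000, requires_grad=True)  # logits for J=6 reply words
target_words = torch.randint(0, 10000, (6,))              # reply sample word ids
emo_logits = torch.randn(1, 7, requires_grad=True)        # predicted emotion logits
target_emotion = torch.tensor([1])                        # standard emotion category id

L_MLL = nn.CrossEntropyLoss()(word_logits, target_words)  # first loss function
L_CLS = nn.CrossEntropyLoss()(emo_logits, target_emotion) # second loss function
loss = L_MLL + L_CLS                                       # add the two loss functions
loss.backward()                                            # reverse parameter adjustment
print(float(L_MLL), float(L_CLS))
```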
In one embodiment of the present disclosure, the structure of the emotion dialogue model to be trained is similar to that of the emotion dialogue model, including an encoder to be trained and a decoder to be trained, the data processing flow of steps S702-S703 in fig. 7 is similar to that of the emotion dialogue model, and when predicted emotion information is obtained, firstly, a historical dialogue sample and emotion samples corresponding to each sentence in the historical dialogue sample are initialized, and sub-feature information corresponding to the historical dialogue sample and the emotion samples is obtained; then, splicing and fully connecting the plurality of sub-feature information to obtain sample feature information; and finally, determining predicted emotion information corresponding to the current conversation participant according to the sample characteristic information. When a prediction reply statement is acquired, firstly, characteristic information corresponding to a reply sample is encoded based on a self-attention mechanism; then, based on a self-attention mechanism, the characteristic information of the coded reply sample and the sample characteristic information are coded and fully connected to obtain predicted reply characteristic information; and finally, carrying out splicing and normalization processing on the emotion characteristics corresponding to the predicted reply characteristic information and the predicted emotion information so as to obtain a predicted reply sentence. The specific processing method in each step is the same as the specific processing method in each step of emotion dialogue in the above embodiment, and will not be described herein. It should be noted that, in the model training process, the information input to the historical reply input layer of the decoder to be trained is a reply corresponding to the last sentence of the historical dialogue sample, namely a reply sample, and the expected output of the decoder is also the reply sample. In addition, according to different knowledge categories in the multi-source knowledge sample, the objects aimed by each data processing flow in the model training process are different, and specific reference may be made to the data processing flow and the model training flow related to the emotion dialogue generation method embodiment, which are not described herein again.
In one embodiment of the present disclosure, the performance of the emotion dialogue model obtained by the training method of the embodiments of the present disclosure is greatly improved compared with that of the existing Emo-HERD model. Table 1 shows the performance parameters obtained when the emotion dialogue model of the embodiment of the present disclosure and the existing Emo-HERD model each process two-source knowledge, specifically:
Table 1 Model performance parameters obtained by processing two-source knowledge
Model PPL BLEU DIST-1 DIST-2 w-avg
Emo-HERD 116.14 1.87 1.23 3.77 29.67
The present disclosure 108.51 2.00 1.22 4.54 30.13
Five evaluation indices are shown in Table 1. PPL is the perplexity, which characterizes the quality of the reply; the smaller the value, the better. BLEU measures sentence coherence, generally judged according to the overlap of up to 4 consecutive words; the larger the value, the better. Dist-1 and Dist-2 measure diversity, i.e. the diversity of the words in the reply: Dist-1 denotes the diversity of single words and Dist-2 the diversity of 2 consecutive words; the larger the value, the better. w-avg is the average value of emotion prediction, specifically the average of the F1 values corresponding to the 7 emotion categories; the larger the value, the better. Since one of the main tasks of the emotion dialogue model is to predict the emotion of the current dialogue participant, the index w-avg is of the highest importance when evaluating model performance.
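For concreteness, the Dist-1 and Dist-2 diversity indices can be computed as the ratio of distinct unigrams (or bigrams) to the total number of generated tokens; the sketch below is a generic implementation under that assumption, not the exact evaluation script used for Table 1.

```python
def distinct_n(replies, n):
    """Dist-n: distinct n-grams divided by total n-grams over all replies."""
    ngrams, total = set(), 0
    for reply in replies:
        tokens = reply.split()
        grams = list(zip(*(tokens[i:] for i in range(n))))
        ngrams.update(grams)
        total += len(grams)
    return len(ngrams) / total if total else 0.0

replies = ["the weather is nice today", "the weather is good"]
print(distinct_n(replies, 1), distinct_n(replies, 2))   # Dist-1, Dist-2
```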
As can be seen from Table 1, among the performance parameters obtained when the emotion dialogue model of the embodiment of the present disclosure processes two-source knowledge, only the Dist-1 value is slightly inferior to that obtained when the existing Emo-HERD model processes two-source knowledge, while all of the other performance parameters are superior; that is, the emotion dialogue model of the present disclosure can process multi-source knowledge to obtain higher-quality replies. In addition, the more types of knowledge the multi-source knowledge sample contains, the better the performance of the emotion dialogue model trained on it; therefore, training the emotion dialogue model to be trained on a multi-source knowledge sample containing more knowledge types, such as four-source knowledge, can yield an emotion dialogue model with better performance, and using this emotion dialogue model to process multi-source knowledge makes the emotion of the reply fit the context better, the logic smoother and the wording more appropriate.
The following describes an embodiment of an apparatus of the present disclosure, which may be used to perform the emotion dialogue generation method in the above embodiment of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiment of the emotion dialogue generation method described in the present disclosure.
Fig. 8 schematically illustrates a block diagram of an emotion dialogue generation device according to one embodiment of the present disclosure.
Referring to fig. 8, an emotion dialogue generation device 800 according to an embodiment of the present disclosure includes: a multi-source knowledge acquisition module 801, a reply emotion prediction module 802, a feature to reply determination module 803 and a reply sentence generation module 804.
Wherein, the multi-source knowledge acquisition module 801 is configured to acquire multi-source knowledge related to a conversation participant; a reply emotion prediction module 802, configured to encode the multi-source knowledge to obtain multi-source knowledge feature information, and determine reply emotion prediction information corresponding to a current dialogue participant according to the multi-source knowledge feature information; the feature to be replied determining module 803 is configured to encode feature information corresponding to the historical reply information and the multi-source knowledge feature information to obtain feature information to be replied; and a reply sentence generating module 804, configured to determine a target reply sentence according to the emotion feature corresponding to the reply emotion prediction information and the feature information to be replied.
In some embodiments of the present disclosure, the reply emotion prediction module 802 includes: the feature extraction unit is used for carrying out feature extraction on various knowledge in the multi-source knowledge so as to acquire a plurality of sub-feature information; the characteristic splicing unit is used for splicing the sub-characteristic information to obtain first spliced characteristic information; and the full-connection unit is used for carrying out full-connection processing on the first splicing characteristic information so as to acquire the multi-source knowledge characteristic information.
In some embodiments of the disclosure, the multi-source knowledge includes: a history dialogue, emotion information corresponding to each sentence in the history dialogue, expression images and personality information of dialogue participants; the feature extraction unit includes: the first sub-feature acquisition unit is used for sequentially encoding the context relation in the history dialogue through a long-period memory network and a self-attention network so as to acquire first sub-feature information; the second sub-feature acquisition unit is used for carrying out feature extraction on the expression image through an image feature extraction network to acquire facial expression features, and carrying out full-connection processing on the facial expression features through a feedforward neural network to acquire second sub-feature information; a third sub-feature obtaining unit, configured to query in an emotion lookup table according to the emotion information, so as to obtain third sub-feature information; and the fourth sub-feature acquisition unit is used for inquiring in the personality lookup table according to the personality information so as to acquire fourth sub-feature information.
In some embodiments of the present disclosure, the first sub-feature acquisition unit includes: the hidden layer information extraction unit is used for extracting the characteristics of each statement in the history dialogue through the long-short-term memory network so as to obtain hidden layer characteristic information corresponding to each statement; the position coding unit is used for carrying out position coding on the hidden layer characteristic information according to the appearance sequence of each statement so as to obtain dialogue characteristic information; and the sub-feature information acquisition unit is used for encoding the dialogue feature information through the self-attention network so as to acquire the first sub-feature information.
In some embodiments of the present disclosure, the hidden layer information extraction unit is configured to: and taking the hidden layer vector corresponding to the last word in each statement as the hidden layer characteristic information.
In some embodiments of the present disclosure, the reply emotion prediction module 802 is further configured to: perform dimension reduction processing on the multi-source knowledge feature information to obtain a feature vector with a first dimension of 1; and carry out full connection processing and normalization processing on the feature vector to obtain the reply emotion prediction information.
In some embodiments of the present disclosure, the feature to be replied determining module 803 includes: the history reply feature acquisition unit is used for carrying out vector conversion on the history reply information to acquire a history reply vector, and encoding the history reply vector based on a self-attention mechanism to acquire history reply feature information; the to-be-replied feature acquisition unit is used for encoding the historical reply feature information and the multi-source knowledge feature information based on a self-attention mechanism, and performing full-connection processing on the encoded feature information to acquire the to-be-replied feature information.
In some embodiments of the present disclosure, the reply sentence generation module 804 is configured to: splicing the emotion characteristics and the characteristic information to be replied to obtain second spliced characteristic information; and carrying out normalization processing on the second splicing characteristic information to obtain current reply information, and determining the target reply statement according to the current reply information and the historical reply information.
In some embodiments of the present disclosure, when acquiring the second spliced characteristic information, the reply sentence generation module 804 may be further configured to: splice the emotion characteristics, the personalized characteristics corresponding to the current dialogue participant and the characteristic information to be replied to obtain the second spliced characteristic information.
FIG. 9 schematically illustrates a block diagram of a training apparatus for emotion conversation models, in accordance with one embodiment of the present disclosure.
Referring to fig. 9, a training apparatus 900 of emotion conversation model according to an embodiment of the present disclosure includes: a sample acquisition module 901, a mood prediction module 902, a reply prediction module 903, and a parameter optimization module 904.
The sample acquiring module 901 is configured to acquire a multi-source knowledge sample, input the multi-source knowledge sample into an emotion dialogue model to be trained, where the multi-source knowledge sample at least includes a history dialogue sample and a reply sample, and emotion samples corresponding to each sentence in the history dialogue sample and the reply sample; the emotion prediction module 902 is configured to encode the historical dialog sample and emotion samples corresponding to each sentence in the historical dialog sample to obtain sample feature information, and determine predicted emotion information corresponding to a current dialog participant according to the sample feature information; the reply prediction module 903 is configured to encode feature information corresponding to the reply sample and the sample feature information to obtain predicted reply feature information, and determine a predicted reply sentence according to an emotion feature corresponding to the predicted emotion information and the predicted reply feature information; the parameter optimization module 904 is configured to construct a first loss function according to the predicted reply sentence and the reply sample, construct a second loss function according to the predicted emotion information and the emotion sample corresponding to the reply sample, and optimize parameters of the emotion conversation model to be trained according to the first loss function and the second loss function so as to obtain the emotion conversation model.
In some embodiments of the disclosure, the multi-source knowledge sample further includes an expression image sample and a personality sample of the dialog participant corresponding to each sentence in the historical dialog sample, and an expression image sample and a personality sample of the dialog participant corresponding to the reply sample.
In some embodiments of the present disclosure, the emotion prediction module is further configured to: encoding the historical dialogue sample, emotion samples, expression image samples and personality samples of dialogue participants corresponding to all sentences in the historical dialogue sample to acquire sample characteristic information; the reply prediction module is further configured to: and determining the predicted reply sentence according to the emotion characteristics corresponding to the predicted emotion information, the personality characteristics corresponding to the current dialogue participant and the predicted reply characteristic information, wherein the current dialogue participant is the dialogue participant corresponding to the reply sample.
FIG. 10 illustrates a schematic diagram of a computer system suitable for use in implementing the emotion conversation generation device, training device for emotion conversation models of embodiments of the present disclosure.
The computer system 1000 of the emotion dialogue generation device and the training device for emotion dialogue model shown in fig. 10 is only an example, and should not impose any limitation on the functions and the scope of use of the embodiment of the present disclosure.
As shown in fig. 10, the computer system 1000 includes a central processing unit (Central Processing Unit, CPU) 1001 which can perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a random access Memory (Random Access Memory, RAM) 1003, implementing the emotion conversation generation method described in the above embodiment. In the RAM 1003, various programs and data required for system operation are also stored. The CPU 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. An Input/Output (I/O) interface 1005 is also connected to bus 1004.
The following components are connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output portion 1007 including a Cathode Ray Tube (CRT), a liquid crystal display (Liquid Crystal Display, LCD), and a speaker; a storage portion 1008 including a hard disk or the like; and a communication section 1009 including a network interface card such as a LAN (Local Area Network ) card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The drive 1010 is also connected to the I/O interface 1005 as needed. A removable medium 1011, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is installed as needed in the drive 1010, so that a computer program read out therefrom is installed as needed in the storage section 1008.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 1009, and/or installed from the removable medium 1011. When executed by the central processing unit (CPU) 1001, the computer program performs various functions defined in the system of the present disclosure.
It should be noted that, the computer readable medium shown in the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present disclosure, however, the computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units involved in the embodiments of the present disclosure may be implemented by means of software, or may be implemented by means of hardware, and the described units may also be provided in a processor. Wherein the names of the units do not constitute a limitation of the units themselves in some cases.
As another aspect, the present disclosure also provides a computer-readable medium that may be contained in the emotion dialogue generation device described in the above embodiment; or may exist alone without being incorporated into the electronic device. The computer-readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to implement the methods described in the above embodiments.
It should be noted that although in the above detailed description several modules or units of a device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit in accordance with embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into a plurality of modules or units to be embodied.
From the above description of embodiments, those skilled in the art will readily appreciate that the example embodiments described herein may be implemented in software, or may be implemented in software in combination with the necessary hardware. Thus, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.) or on a network, and includes several instructions to cause a computing device (may be a personal computer, a server, a touch terminal, or a network device, etc.) to perform the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any adaptations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (13)

1. An emotion dialogue generation method, comprising:
acquiring multi-source knowledge related to a conversation participant, wherein the multi-source knowledge is a plurality of types of knowledge related to the conversation participant;
encoding the multi-source knowledge to obtain multi-source knowledge characteristic information, and determining reply emotion prediction information corresponding to the current conversation participant according to the multi-source knowledge characteristic information;
coding the characteristic information corresponding to the historical reply information and the multi-source knowledge characteristic information to acquire the characteristic information to be replied;
Determining a target reply sentence according to the emotion characteristics corresponding to the reply emotion prediction information and the characteristic information to be replied;
wherein the multi-source knowledge comprises: a history dialogue, emotion information corresponding to each sentence in the history dialogue, expression images and personality information of dialogue participants; the encoding the multi-source knowledge to obtain multi-source knowledge feature information includes:
sequentially encoding the context relation in the history dialogue through a long-short-term memory network and a self-attention network to obtain first sub-feature information; extracting features of the expression image through an image feature extraction network to obtain facial expression features, and performing full-connection processing on the facial expression features through a feedforward neural network to obtain second sub-feature information; inquiring in an emotion lookup table according to the emotion information to obtain third sub-feature information; inquiring in a personality lookup table according to the personality information to acquire fourth sub-feature information;
splicing the sub-feature information to obtain first spliced feature information;
and performing full connection processing on the first spliced characteristic information to acquire the multi-source knowledge characteristic information.
2. The emotion conversation generation method of claim 1, wherein the sequentially encoding the context relationships in the history conversation via a long-short-term memory network and a self-attention network to obtain first sub-feature information includes:
extracting features of each statement in the history dialogue through the long-short-term memory network so as to obtain hidden layer feature information corresponding to each statement;
position coding is carried out on the hidden layer characteristic information according to the appearance sequence of each statement so as to obtain dialogue characteristic information;
the dialogue feature information is encoded through the self-attention network to acquire the first sub-feature information.
3. The emotion conversation generation method according to claim 1 or 2, wherein the determining, from the multi-source knowledge feature information, the reply emotion prediction information corresponding to the current conversation participant includes:
performing dimension reduction processing on the multi-source knowledge feature information to obtain a feature vector with a first dimension of 1;
and carrying out full connection processing and normalization processing on the characteristic vector to obtain the reply emotion prediction information.
4. The emotion dialogue generation method according to claim 1, wherein the encoding the feature information corresponding to the historical reply information and the multi-source knowledge feature information to obtain feature information to be replied includes:
Performing vector conversion on the historical reply information to obtain a historical reply vector, and encoding the historical reply vector based on a self-attention mechanism to obtain historical reply characteristic information;
and encoding the historical reply feature information and the multi-source knowledge feature information based on a self-attention mechanism, and performing full-connection processing on the encoded feature information to acquire the feature information to be replied.
5. The emotion conversation generation method of claim 4, wherein the determining a target reply sentence from the emotion feature corresponding to the reply emotion prediction information and the feature information to be replied includes:
splicing the emotion characteristics and the characteristic information to be replied to obtain second spliced characteristic information;
and carrying out normalization processing on the second splicing characteristic information to obtain current reply information, and determining the target reply statement according to the current reply information and the historical reply information.
6. The emotion conversation generation method of claim 5, wherein the acquiring second concatenation feature information further includes:
and splicing the emotion characteristics, the personalized characteristics corresponding to the current dialogue participant and the characteristic information to be replied to obtain the second spliced characteristic information.
7. A training method of an emotion dialogue model, wherein the emotion dialogue model is used for implementing the emotion dialogue generation method according to any one of claims 1 to 6, and the training method comprises:
obtaining a multi-source knowledge sample and inputting the multi-source knowledge sample into an emotion dialogue model to be trained, wherein the multi-source knowledge sample at least comprises a historical dialogue sample, a reply sample, and emotion samples corresponding to each sentence in the historical dialogue sample and the reply sample;
encoding the historical dialogue sample and the emotion samples corresponding to each sentence in the historical dialogue sample to obtain sample feature information, and determining predicted emotion information corresponding to a current dialogue participant according to the sample feature information;
encoding the feature information corresponding to the reply sample and the sample feature information to obtain predicted reply feature information, and determining a predicted reply sentence according to the emotion feature corresponding to the predicted emotion information and the predicted reply feature information;
constructing a first loss function according to the predicted reply sentence and the reply sample, constructing a second loss function according to the predicted emotion information and the emotion sample corresponding to the reply sample, and optimizing parameters of the emotion dialogue model to be trained according to the first loss function and the second loss function to obtain the emotion dialogue model;
wherein the encoding the historical dialogue sample and the emotion samples corresponding to each sentence in the historical dialogue sample to obtain sample feature information comprises:
initializing the historical dialogue sample and the emotion samples corresponding to each sentence in the historical dialogue sample to obtain sub-feature information corresponding to the historical dialogue sample and the emotion samples;
and splicing and fully connecting the plurality of pieces of sub-feature information to obtain the sample feature information.
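Claim 7 trains the model with two objectives: a first loss between the predicted reply sentence and the reply sample, and a second loss between the predicted emotion information and the emotion sample of the reply. The sketch below shows such a joint training step; the use of cross-entropy for both losses, the weighting factor, and the tensor shapes are assumptions, since the patent only states that both losses are used to optimize the parameters.

```python
import torch
import torch.nn as nn

reply_loss_fn = nn.CrossEntropyLoss()    # first loss: predicted reply vs. reply sample
emotion_loss_fn = nn.CrossEntropyLoss()  # second loss: predicted emotion vs. emotion sample
emotion_loss_weight = 1.0                # assumed weighting; not specified by the patent

def training_step(reply_logits, reply_target_ids, emotion_logits, emotion_target, optimizer):
    """reply_logits: (B, T, vocab), reply_target_ids: (B, T),
    emotion_logits: (B, num_emotions), emotion_target: (B,)."""
    first_loss = reply_loss_fn(reply_logits.flatten(0, 1), reply_target_ids.flatten())
    second_loss = emotion_loss_fn(emotion_logits, emotion_target)
    loss = first_loss + emotion_loss_weight * second_loss   # joint objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```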
8. The method according to claim 7, wherein the multi-source knowledge sample further comprises an expression image sample and a personality sample of the dialogue participant corresponding to each sentence in the historical dialogue sample, and an expression image sample and a personality sample of the dialogue participant corresponding to the reply sample.
9. The method according to claim 8, wherein the obtaining sample feature information further comprises:
encoding the historical dialogue sample and the emotion samples, expression image samples, and personality samples of the dialogue participants corresponding to each sentence in the historical dialogue sample to obtain the sample feature information;
and the determining a predicted reply sentence further comprises:
determining the predicted reply sentence according to the emotion feature corresponding to the predicted emotion information, the personality feature corresponding to the current dialogue participant, and the predicted reply feature information, wherein the current dialogue participant is the dialogue participant corresponding to the reply sample.
10. An emotion dialogue generation device comprising:
a multi-source knowledge acquisition module, configured to acquire multi-source knowledge related to a dialogue participant, wherein the multi-source knowledge is a plurality of types of knowledge related to the dialogue participant;
a reply emotion prediction module, configured to encode the multi-source knowledge to obtain multi-source knowledge feature information, and determine reply emotion prediction information corresponding to the current dialogue participant according to the multi-source knowledge feature information;
a to-be-replied feature determining module, configured to encode feature information corresponding to the historical reply information and the multi-source knowledge feature information to obtain feature information to be replied;
a reply sentence generation module, configured to determine a target reply sentence according to the emotion feature corresponding to the reply emotion prediction information and the feature information to be replied;
wherein the multi-source knowledge comprises: a historical dialogue, emotion information corresponding to each sentence in the historical dialogue, expression images, and personality information of dialogue participants; and the to-be-replied feature determining module is further configured to:
sequentially encode the context relation in the historical dialogue through a long short-term memory network and a self-attention network to obtain first sub-feature information;
extract features of the expression image through an image feature extraction network to obtain facial expression features, and perform full-connection processing on the facial expression features through a feedforward neural network to obtain second sub-feature information;
query an emotion lookup table according to the emotion information to obtain third sub-feature information;
query a personality lookup table according to the personality information to obtain fourth sub-feature information;
splice the sub-feature information to obtain first spliced feature information;
and perform full-connection processing on the first spliced feature information to obtain the multi-source knowledge feature information.
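The encoder behind the claim-10 module combines four sub-features: LSTM plus self-attention over the historical dialogue, image features passed through a feedforward layer, and embedding lookup tables for emotion and personality, all spliced and fully connected. The sketch below assembles these pieces under assumptions: the backbone producing image_features, the pooling over the dialogue sequence, the vocabulary sizes, and all dimensions are illustrative rather than taken from the patent.

```python
import torch
import torch.nn as nn

class MultiSourceKnowledgeEncoder(nn.Module):
    """Sketch of the multi-source knowledge encoding in claim 10."""

    def __init__(self, d_model=256, vocab_size=30000, num_emotions=7,
                 num_personas=100, image_feat_dim=2048):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)          # long short-term memory network
        self.self_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.image_ffn = nn.Linear(image_feat_dim, d_model)              # feedforward over image features
        self.emotion_table = nn.Embedding(num_emotions, d_model)         # emotion lookup table
        self.persona_table = nn.Embedding(num_personas, d_model)         # personality lookup table
        self.fuse = nn.Linear(4 * d_model, d_model)                      # full connection after splicing

    def forward(self, dialogue_ids, image_features, emotion_ids, persona_ids):
        # dialogue_ids: (B, T); image_features: (B, image_feat_dim) from an
        # assumed pretrained extractor; emotion_ids, persona_ids: (B,)
        h, _ = self.lstm(self.token_embed(dialogue_ids))
        attn_out, _ = self.self_attn(h, h, h)
        first = attn_out.mean(dim=1)                    # first sub-feature information
        second = self.image_ffn(image_features)         # second sub-feature information
        third = self.emotion_table(emotion_ids)         # third sub-feature information
        fourth = self.persona_table(persona_ids)        # fourth sub-feature information
        spliced = torch.cat([first, second, third, fourth], dim=-1)  # first spliced feature information
        return self.fuse(spliced)                       # multi-source knowledge feature information
```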
11. A training device for an emotion dialogue model, wherein the emotion dialogue model is used for implementing the emotion dialogue generation method according to any one of claims 1 to 6, the training device comprising:
a sample acquisition module, configured to acquire a multi-source knowledge sample and input the multi-source knowledge sample into an emotion dialogue model to be trained, wherein the multi-source knowledge sample at least comprises a historical dialogue sample, a reply sample, and emotion samples corresponding to each sentence in the historical dialogue sample and the reply sample;
an emotion prediction module, configured to encode the historical dialogue sample and the emotion samples corresponding to each sentence in the historical dialogue sample to obtain sample feature information, and determine predicted emotion information corresponding to a current dialogue participant according to the sample feature information;
a reply prediction module, configured to encode the feature information corresponding to the reply sample and the sample feature information to obtain predicted reply feature information, and determine a predicted reply sentence according to the emotion feature corresponding to the predicted emotion information and the predicted reply feature information;
a parameter optimization module, configured to construct a first loss function according to the predicted reply sentence and the reply sample, construct a second loss function according to the predicted emotion information and the emotion sample corresponding to the reply sample, and optimize parameters of the emotion dialogue model to be trained according to the first loss function and the second loss function to obtain the emotion dialogue model;
wherein the emotion prediction module is further configured to:
initialize the historical dialogue sample and the emotion samples corresponding to each sentence in the historical dialogue sample to obtain sub-feature information corresponding to the historical dialogue sample and the emotion samples;
and splice and fully connect the plurality of pieces of sub-feature information to obtain the sample feature information.
12. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the emotion dialogue generation method according to any one of claims 1 to 6 and the training method of the emotion dialogue model according to any one of claims 7 to 9.
13. An electronic device, comprising:
one or more processors;
storage means for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the emotion dialogue generation method according to any one of claims 1 to 6 and the training method of the emotion dialogue model according to any one of claims 7 to 9.
CN202010733061.4A 2020-07-27 2020-07-27 Emotion dialogue generation method and device and emotion dialogue model training method and device Active CN111897933B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010733061.4A CN111897933B (en) 2020-07-27 2020-07-27 Emotion dialogue generation method and device and emotion dialogue model training method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010733061.4A CN111897933B (en) 2020-07-27 2020-07-27 Emotion dialogue generation method and device and emotion dialogue model training method and device

Publications (2)

Publication Number Publication Date
CN111897933A CN111897933A (en) 2020-11-06
CN111897933B true CN111897933B (en) 2024-02-06

Family

ID=73190148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010733061.4A Active CN111897933B (en) 2020-07-27 2020-07-27 Emotion dialogue generation method and device and emotion dialogue model training method and device

Country Status (1)

Country Link
CN (1) CN111897933B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112256856A (en) * 2020-11-16 2021-01-22 北京京东尚科信息技术有限公司 Robot dialogue method, device, electronic device and storage medium
CN112949684B (en) * 2021-01-28 2022-07-29 天津大学 Multimodal dialogue emotion information detection method based on reinforcement learning framework
CN113157874B (en) * 2021-02-20 2022-11-22 北京百度网讯科技有限公司 Method, apparatus, device, medium, and program product for determining user's intention
CN113221560B (en) * 2021-05-31 2023-04-18 平安科技(深圳)有限公司 Personality trait and emotion prediction method, personality trait and emotion prediction device, computer device, and medium
CN113326704B (en) * 2021-06-03 2022-07-19 北京聆心智能科技有限公司 Emotion support conversation generation method and system based on comprehensive strategy
CN115470325A (en) * 2021-06-10 2022-12-13 腾讯科技(深圳)有限公司 Message reply method, device and equipment
CN113094478B (en) * 2021-06-10 2021-08-13 平安科技(深圳)有限公司 Expression reply method, device, equipment and storage medium
CN114282549A (en) * 2021-08-06 2022-04-05 腾讯科技(深圳)有限公司 Method and device for identifying root relation between information, electronic equipment and storage medium
CN115730070B (en) * 2022-11-25 2023-08-08 重庆邮电大学 Man-machine co-emotion conversation method, system, electronic equipment and medium
CN115934909B (en) * 2022-12-02 2023-11-17 苏州复变医疗科技有限公司 Co-emotion reply generation method and device, terminal and storage medium
CN117556832B (en) * 2023-11-23 2024-04-09 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Semantic constraint-based emotion support dialogue bidirectional generation method

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2018133761A1 (en) * 2017-01-17 2018-07-26 华为技术有限公司 Method and device for man-machine dialogue
CN108121823A (en) * 2018-01-11 2018-06-05 哈尔滨工业大学 Babbling emotions dialog generation system and method
WO2020135194A1 (en) * 2018-12-26 2020-07-02 深圳Tcl新技术有限公司 Emotion engine technology-based voice interaction method, smart terminal, and storage medium
RU2720359C1 (en) * 2019-04-16 2020-04-29 Хуавэй Текнолоджиз Ко., Лтд. Method and equipment for recognizing emotions in speech
CN110188177A (en) * 2019-05-28 2019-08-30 北京搜狗科技发展有限公司 Talk with generation method and device
CN110188182A (en) * 2019-05-31 2019-08-30 中国科学院深圳先进技术研究院 Model training method, dialogue generation method, device, equipment and medium
CN110347792A (en) * 2019-06-25 2019-10-18 腾讯科技(深圳)有限公司 Talk with generation method and device, storage medium, electronic equipment
CN110427490A (en) * 2019-07-03 2019-11-08 华中科技大学 A kind of emotion dialogue generation method and device based on from attention mechanism
CN110609891A (en) * 2019-09-18 2019-12-24 合肥工业大学 Visual dialog generation method based on context awareness graph neural network
CN111428015A (en) * 2020-03-20 2020-07-17 腾讯科技(深圳)有限公司 Information generation method, device, equipment and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
MojiTalk: Generating Emotional Responses at Scale; Xianda Zhou et al.; Computation and Language (cs.CL); full text *
A Survey of Task-Oriented Dialogue Systems; Zhao Yangyang et al.; Chinese Journal of Computers; Vol. 43, No. 10; full text *

Also Published As

Publication number Publication date
CN111897933A (en) 2020-11-06

Similar Documents

Publication Publication Date Title
CN111897933B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN111966800B (en) Emotion dialogue generation method and device and emotion dialogue model training method and device
CN113420807A (en) Multi-mode fusion emotion recognition system and method based on multi-task learning and attention mechanism and experimental evaluation method
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
Chiu et al. How to train your avatar: A data driven approach to gesture generation
CN108595436B (en) Method and system for generating emotional dialogue content and storage medium
CN110297887B (en) Service robot personalized dialogue system and method based on cloud platform
CN111159368A (en) Reply generation method for personalized dialogue
CN115329779B (en) Multi-person dialogue emotion recognition method
CN112214591B (en) Dialog prediction method and device
CN111930918B (en) Cross-modal bilateral personalized man-machine social interaction dialog generation method and system
CN110321418A (en) A kind of field based on deep learning, intention assessment and slot fill method
CN109857865B (en) Text classification method and system
CN112364148B (en) Deep learning method-based generative chat robot
CN114495927A (en) Multi-modal interactive virtual digital person generation method and device, storage medium and terminal
CN114895817B (en) Interactive information processing method, network model training method and device
CN111653270B (en) Voice processing method and device, computer readable storage medium and electronic equipment
CN113704419A (en) Conversation processing method and device
CN112214585A (en) Reply message generation method, system, computer equipment and storage medium
CN115937369A (en) Expression animation generation method and system, electronic equipment and storage medium
CN111598153A (en) Data clustering processing method and device, computer equipment and storage medium
CN116564338B (en) Voice animation generation method, device, electronic equipment and medium
CN117251057A (en) AIGC-based method and system for constructing AI number wisdom
CN112307179A (en) Text matching method, device, equipment and storage medium
CN114386426B (en) Gold medal speaking skill recommendation method and device based on multivariate semantic fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant