CN113947702A - Multi-modal emotion recognition method and system based on context awareness - Google Patents

Multi-modal emotion recognition method and system based on context awareness Download PDF

Info

Publication number
CN113947702A
CN113947702A (application number CN202111080047.XA)
Authority
CN
China
Prior art keywords
emotion
feature
modal
context
scene
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111080047.XA
Other languages
Chinese (zh)
Inventor
张立华
杨鼎康
王顺利
邝昊鹏
黄帅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN202111080047.XA priority Critical patent/CN113947702A/en
Publication of CN113947702A publication Critical patent/CN113947702A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G06F18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-modal emotion recognition method and system based on context awareness. The system comprises a multi-modal information acquisition unit, an emotion processing unit based on multi-modal behavioral expression, an emotion analysis unit based on scene context, an emotion analysis unit based on agent group interaction, an emotion analysis unit based on agent and context interaction, a feature fusion unit based on adaptive planning, a recognition unit based on discrete emotion, a prediction unit based on continuous emotion, and a display module. One part of the multi-modal data is derived from facial expression, gait and gesture information; the other part comes from the scene context, the context of interactions among agents, and the context of interactions between the scene and the agents. Compared with the prior art, the method effectively addresses the low efficiency of emotion recognition in real scenes, the insufficient accuracy of existing recognition algorithms, and the susceptibility of their robustness and generalization capability to external interference.

Description

Multi-modal emotion recognition method and system based on context awareness
Technical Field
The invention relates to the technical field of emotion recognition, in particular to a multi-modal emotion recognition method and system based on context awareness.
Background
Emotion recognition is the basis of affective human-machine interaction: it enables a machine to understand human perceptual thinking, drives the continuing development of machine intelligence, and has become a key element of natural human-machine interaction. In recent years, multimodal emotion recognition techniques have received increasing attention from researchers. Motivated by research in emotional psychology, they aim to fuse emotional signals such as facial expressions, voice, body posture and gait, and to improve the accuracy and precision of emotion recognition through various fusion schemes.
Context awareness, a current research hotspot in computer vision, plays a significant role in understanding human emotion in real scenes. The situational context in which a person is located contains rich semantic information. By perceiving human emotion in different situational contexts with deep learning, and by fusing the multimodal emotional features extracted from the context at the feature level and the decision level, emotional cues beyond the human subject can be obtained to improve emotional expression and emotional understanding.
Applying context-awareness techniques in real environments to multimodal emotion recognition is a new field of great research value. Most current work is built on deep learning network architectures. Early work aimed to achieve emotion recognition by combining the intuitive emotional expression of facial expressions with overall contextual information. Subsequent work used a Region Proposal Network (RPN) to extract context elements from samples, fed them as nodes of an emotion graph into a graph convolutional network (GCN) to encode the context information, and finally performed multimodal emotion recognition by feature concatenation. In more recent work, some researchers treat all information other than the face as context and extract contextual emotional expression by masking the human face in the image. Other notable work starts from a psychological viewpoint: after learning emotional feature expression in the context with an attention mechanism, it explores the proximity and distance between human subjects in multi-person interaction through heat maps, thereby mining emotional context information among groups and improving recognition accuracy.
Existing context-aware multimodal emotion recognition methods usually encode and extract features only from the complete contextual semantic information collected in images and videos. They do not account for the interference that emotional changes of other subjects in the background cause to the emotion prediction of the recognized subject, which greatly reduces accuracy. Meanwhile, the encoding of emotional information in multi-person interaction contexts is simplistic: modeling is usually done with a graph convolutional network or a heat map, which treats the high-dimensional emotional distances among multiple persons as constants and only coarsely measures changes in emotional tension, making it difficult to capture the contextual representations related to the subject's emotional change and lacking rationality. Moreover, changes in the interaction context between the human subject and the environment are rarely considered, so most multimodal emotional feature expressions contain redundancy and errors, and the robustness of the emotion recognition model cannot be guaranteed.
Current multimodal approaches also tend to focus on extracting emotional cues from the context while excluding the human subject, neglecting external behavioral manifestations of human emotion such as facial expressions, gait and gestures, modal signals that are closely tied to the expression of emotional information. In addition, the traditional definition of emotion based on a discrete emotion model cannot scientifically and effectively describe the nature of emotional change, so the evaluation and analysis of emotion recognition results lack validity.
In summary, developing a novel context-aware method that fully exploits external behavioral expression modalities of the human body, such as facial expressions, gait and gestures, combined with a multi-modal emotion recognition system that models and analyses the emotional interactions among human subjects, scenes and groups, is an urgent problem to be solved by those skilled in the art.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a multi-modal emotion recognition method and system based on context awareness, which fully utilize external behavioral expression modalities of the human body, such as facial expressions, gait and gestures, combined with modeling and analysis of the emotional interactions among human subjects, scenes and groups.
The purpose of the invention can be realized by the following technical scheme:
a multi-modal emotion recognition method based on context awareness comprises the following steps:
multi-modal information acquisition: collecting a video and a real-world image for emotion recognition, wherein the video contains the human subject to be emotion-recognized and other agents;
emotion processing based on multi-modal behavioral expression: extracting feature vectors of facial expressions, facial landmark points, human posture and human gestures from the video, and generating an external behavioral expression modal vector by initial feature cascading;
emotion analysis step based on scene context: adding masks to the human body in each video frame in the real world image and the video to obtain a scene image, and then extracting the features of scene emotion semantics to obtain a first emotion feature vector;
emotion analysis based on agent group interaction: extracting the human subject and other-agent information respectively from the real-world image and each video frame of the video, then extracting initial characterization features, feeding each initial characterization feature as an emotion node into a graph attention network, and constructing an emotion relationship graph; according to the emotion relationship graph, calculating the strength and degree of the emotional influence of the different other agents on the human subject, determining the weights of the emotional feature vectors generated by the interaction of the other agents through emotion similarity coefficients, and performing a weighted average with the initial characterization features to obtain a second emotion feature vector;
emotion analysis based on agent and context interaction: adding masks to the other agents in the real-world image and in each video frame of the video to obtain a scene image, and extracting initial scene features; establishing a basic feature graph from the initial characterization features of the other agents, and performing feature aggregation on the initial scene features and the basic feature graph to obtain a third emotion feature vector;
and (3) feature fusion step: performing feature fusion on the external behavior expression modal vector, the first emotion feature vector, the second emotion feature vector and the third emotion feature vector to obtain a fusion feature vector;
and emotion recognition: and performing emotion recognition according to the fusion feature vector.
Further, in the emotion analysis step based on scene context, the feature extraction of scene emotion semantics is specifically: selecting a residual neural network as the backbone of the main model, sequentially and alternately embedding a channel- and space-based attention mechanism module into a plurality of residual connection blocks of the residual network to form a complete attention extraction network, and feeding the scene image into the attention extraction network for feature extraction of scene emotion semantics.
Further, the channel- and space-based attention mechanism module includes a channel attention mechanism and a spatial attention mechanism. The channel attention mechanism infers a 1D channel attention map through global average pooling and then merges features at the output layer through channel-wise multiplication; the spatial attention mechanism infers a 2D spatial attention map through a global max-pooling layer and then merges features at the output layer through channel-wise multiplication.
Further, the feature fusion performed in the feature fusion step specifically includes:
and selecting strongly-correlated feature vectors and weakly-correlated feature vectors from the external behavior expression modal vector, the first emotion feature vector, the second emotion feature vector and the third emotion feature vector, performing feature fusion on the strongly-correlated feature vectors through feature cascade operation, and performing feature fusion on the weakly-correlated feature vectors through a multiplicative fusion mode.
Further, the emotion recognition step specifically comprises a discrete emotion recognition sub-step and a continuous emotion prediction sub-step;
the discrete emotion recognition sub-step comprises the following steps: mapping the fusion feature vector to a range from 0 to 1, then calculating a cross entropy loss function for each output node and the corresponding label, and predicting the obtained expression label by calculating the probability of each type of possible output expression label.
Further, the continuous emotion prediction substep comprises data normalization, tag difference summation, error amplitude calculation and continuous numerical prediction in sequence, and is realized by a pre-constructed and trained network model, and the network model adopts mean square error loss to calculate the sum of squares of the difference between a predicted numerical value and a target numerical value so as to train the network model.
Further, the expression labels in the discrete emotion recognition sub-step include happiness, surprise, sadness, disgust, excitement, peace, fear and anger;
the output of the continuous emotion prediction sub-step is the predicted values, from 1 to 10, of the VAD continuous emotion model, which refers to the arousal, dominance and valence (pleasure) of emotion.
Furthermore, in the emotion processing step based on multi-modal behavioral expression, facial expression contours are extracted through a face detector, and then feature extraction operation is carried out through a designed convolutional neural network to obtain facial expression feature vectors;
extracting a plurality of facial landmark points through a facial detector, and acquiring and converting the facial landmark points into emotion feature vectors through a convolutional neural network;
extracting a plurality of coordinate points of the human body posture through a posture detector, and feeding the coordinate points to an encoder network for feature extraction to obtain a feature vector of the human body posture;
and extracting key points representing the human hand through a pose detector, and obtaining the feature vector of the human gesture with a convolutional neural network.
The invention also provides a system adopting the multi-modal emotion recognition method based on context awareness, which comprises the following steps:
a multimodal information collection unit configured to perform the multimodal information collection step;
the emotion processing unit is configured to execute the emotion processing step based on the multi-modal behavioral expression;
a context-based emotion analysis unit configured to perform the context-based emotion analysis step;
an emotion analysis unit based on agent population interaction, configured to perform the emotion analysis step based on agent population interaction;
an emotion analysis unit based on agent and context interaction, configured to execute the emotion analysis step based on agent and context interaction;
an adaptive planning based feature fusion unit configured to perform the feature fusion step;
an emotion recognition unit configured to perform the emotion recognition step.
Further, the system further comprises a display module configured to display the output result of the emotion recognition unit.
Compared with the prior art, the invention has the following advantages:
(1) Unlike traditional multi-modal emotion recognition methods, the invention provides emotion understanding and reasoning based on context awareness, and performs emotion judgment and analysis with multi-modal semantic assistance beyond the emotion recognition subject. Specifically, the scene-context emotion analysis unit extracts the objects contained in the real world and the contextual emotional semantic information of the background environment, thereby reinforcing the external representation of emotion and improving the judgment capability of emotion recognition; the agent group interaction emotion analysis unit analyses the emotion transfer relationship between the emotion recognition subject and the surrounding agents and uses a graph attention neural network to analyse the emotional intensity between different agents, so as to assist and enhance the emotional characterization of the recognized subject; the agent and context interaction emotion analysis unit aims to mine the hidden emotional states triggered by the social activities of other agents in the scene, and completes the emotional expression space of the subject through feature aggregation.
(2) The multi-modal information based on external behavioral expression effectively alleviates the performance degradation caused by the loss or corruption of part of the modal information due to occlusion and sensor noise in daily life. Meanwhile, facial expression information and facial key-point information are both used in the facial analysis, and feature extraction and fusion are carried out through convolutional neural networks, maximizing the characterization capability of external emotion.
(3) Unlike the single feature-concatenation scheme of traditional feature fusion, the adaptive planning fusion unit proposed by the invention dynamically plans the fusion strategies of different modalities, adaptively accounts for the differences and correlations between heterogeneous modalities, and, through intelligent selection between multiplicative fusion and feature cascading, fully mines latent emotional features and further strengthens the classification and prediction capability of explicit emotional features.
(4) The multi-task learning scheme based on emotion classification and prediction effectively reveals the evolution and expression of emotion and redefines the emotion recognition task: instead of considering only discrete emotion classification as in traditional schemes, it combines discrete emotion nodes in a high-dimensional emotion space with emotional states changing in a continuous space, and the joint training of the two effectively improves the reliability and accuracy of the multi-modal emotion recognition model.
Drawings
FIG. 1 is a schematic block diagram of a multi-modal emotion recognition system based on context awareness provided in an embodiment of the present invention;
FIG. 2 is a schematic block diagram of an emotion processing unit based on multi-modal behavioral expressions provided in an embodiment of the present invention;
FIG. 3 is a schematic block diagram of an emotion analysis unit based on context provided in an embodiment of the present invention;
FIG. 4 is a schematic block diagram of an emotion analysis unit based on agent group interaction provided in an embodiment of the present invention;
FIG. 5 is a schematic block diagram of an emotion analysis unit based on agent and context interaction provided in an embodiment of the present invention;
FIG. 6 is a schematic block diagram of a feature fusion unit based on adaptive programming according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a discrete emotion-based recognition unit provided in an embodiment of the present invention;
FIG. 8 is a schematic block diagram of a prediction unit based on continuous emotion according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
The embodiment provides a multi-modal emotion recognition method based on context awareness, which comprises the following steps:
multi-mode information acquisition: collecting a video and a real world image for emotion recognition, wherein the video comprises a human main body and other agents to be subjected to emotion recognition;
emotion processing based on multi-modal behavioral expression: extracting feature vectors of facial expression features, facial landmark points, human postures and human gestures according to the video, and generating an external behavior expression modal vector in an initial feature cascade mode;
emotion analysis step based on scene context: adding masks to human bodies in real world images and video frames to obtain scene images, and then extracting features of scene emotion semantics to obtain a first emotion feature vector;
emotion analysis based on agent group interaction: extracting the human subject and other-agent information respectively from the real-world image and each video frame of the video, then extracting initial characterization features, feeding each initial characterization feature as an emotion node into a graph attention network, and constructing an emotion relationship graph; according to the emotion relationship graph, calculating the strength and degree of the emotional influence of the different other agents on the human subject, determining the weights of the emotional feature vectors generated by the interaction of the other agents through emotion similarity coefficients, and performing a weighted average with the initial characterization features to obtain a second emotion feature vector;
emotion analysis based on agent and context interaction: adding masks to the other agents in the real-world image and in each video frame of the video to obtain a scene image, and extracting initial scene features; establishing a basic feature graph from the initial characterization features of the other agents, and performing feature aggregation on the initial scene features and the basic feature graph to obtain a third emotion feature vector;
and (3) feature fusion step: performing feature fusion on the external behavior expression modal vector, the first emotion feature vector, the second emotion feature vector and the third emotion feature vector to obtain a fusion feature vector;
and emotion recognition: and performing emotion recognition according to the fusion feature vector.
The steps are described in detail below.
1.1 Emotion processing procedure based on Multi-modal behavioral expressions
In the emotion processing step based on multi-modal behavioral expression, facial expression contours are extracted through a face detector, and then feature extraction operation is carried out through a designed convolutional neural network to obtain facial expression feature vectors;
extracting a plurality of facial landmark points through a facial detector, and acquiring and converting the facial landmark points into emotion feature vectors through a convolutional neural network;
extracting a plurality of coordinate points of the human body posture through a posture detector, and feeding the coordinate points to an encoder network for feature extraction to obtain a feature vector of the human body posture;
and extracting key points representing the human hand through a pose detector, and obtaining the feature vector of the human gesture with a convolutional neural network.
1.2 Emotion analysis step based on scene context
In the emotion analysis step based on scene context, the feature extraction of scene emotion semantics is specifically: selecting a residual neural network as the backbone of the main model, sequentially and alternately embedding a channel- and space-based attention mechanism module into a plurality of residual connection blocks of the residual network to form a complete attention extraction network, and feeding the scene image into the attention extraction network for feature extraction of scene emotion semantics.
The channel- and space-based attention mechanism module comprises a channel attention mechanism and a spatial attention mechanism. The channel attention mechanism infers a 1D channel attention map through global average pooling and then merges features at the output layer through channel-wise multiplication; the spatial attention mechanism infers a 2D spatial attention map through a global max-pooling layer and then merges features at the output layer through channel-wise multiplication.
1.3 feature fusion step
The feature fusion in the feature fusion step specifically comprises the following steps:
and selecting strongly-correlated feature vectors and weakly-correlated feature vectors from the external behavior expression modal vector, the first emotion feature vector, the second emotion feature vector and the third emotion feature vector, performing feature fusion on the strongly-correlated feature vectors through feature cascade operation, and performing feature fusion on the weakly-correlated feature vectors through a multiplicative fusion mode.
1.4 emotion recognition step
The emotion recognition step specifically comprises a discrete emotion recognition sub-step and a continuous emotion prediction sub-step;
the discrete emotion recognition sub-step comprises the following steps: mapping the fusion feature vector to a range from 0 to 1, then calculating a cross entropy loss function for each output node and the corresponding label, and predicting the obtained expression label by calculating the probability of each type of expression label which is possible to be output.
The continuous emotion prediction sub-step comprises the steps of sequentially carrying out data normalization, label difference summation, error amplitude calculation and continuous numerical value prediction, the continuous emotion prediction sub-step is realized through a pre-constructed and trained network model, and the network model adopts mean square error loss to calculate the sum of squares of the difference between a predicted numerical value and a target numerical value so as to train the network model.
The expression labels in the discrete emotion recognition sub-step include happiness, surprise, sadness, disgust, excitement, peace, fear and anger;
the output of the continuous emotion prediction sub-step is the predicted values, from 1 to 10, of the VAD continuous emotion model, which refers to the arousal, dominance and valence (pleasure) of emotion.
The embodiment also provides a system adopting the multi-modal emotion recognition method based on context awareness, which comprises the following steps:
a multimodal information acquisition unit configured to perform a multimodal information acquisition step;
the emotion processing unit is configured to execute emotion processing steps based on the multi-modal behavioral expressions;
a context-based emotion analysis unit configured to perform a context-based emotion analysis step;
an emotion analysis unit based on agent crowd interaction, configured to perform emotion analysis steps based on agent crowd interaction;
the emotion analysis unit is used for carrying out emotion analysis steps based on the interaction between the agent person and the situation;
an adaptive planning based feature fusion unit configured to perform a feature fusion step;
an emotion recognition unit configured to perform an emotion recognition step;
specifically, the emotion recognition unit in this embodiment includes a recognition unit based on discrete emotion and a prediction unit based on continuous emotion.
Preferably, the system further comprises a display module configured to display the output result of the emotion recognition unit.
Specifically, the multi-modal data is derived on one hand from the face, posture and gesture information of the external expression of the human body, and on the other hand from the complete scene information, the interaction information among all agents, and the scene-agent interaction information in the images or videos acquired during preprocessing. The corresponding emotion features are then extracted through different neural networks and processing techniques in the different emotion processing and analysis units. In the adaptive planning feature fusion unit, in order to counter the signal interference and redundant information produced in the multi-modal data acquisition unit, feature fusion is performed adaptively by combining multiplicative and cascading strategies, ensuring the completeness and effectiveness of the multi-modal emotional features. The fused features are then fed to the discrete emotion recognition unit, which, after training the network with a multi-label classification loss, outputs the emotion categories; specifically, the emotion categories include happiness, surprise, sadness, disgust, excitement, peace, fear and anger. The continuous emotion prediction unit, after training the network with a mean square error loss, outputs the predicted values, from 1 to 10, of the VAD continuous emotion model. Specifically, the VAD model refers to the arousal, dominance and valence (pleasure) of emotion, measures the change of the emotion space in a continuous state, and can more vividly depict emotional intensity and characterize emotional differences. The results of the discrete emotion analysis and the continuous emotion prediction are then presented by the display unit.
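As a rough illustration of the data flow just described, the following minimal PyTorch sketch assembles hypothetical feature vectors from the four analysis units (a 104-dimensional behavioral vector and three 26-dimensional context vectors), fuses them by simple concatenation, and attaches two output heads for the eight discrete labels and the three VAD values. All dimensions, layer choices and names are illustrative assumptions rather than the patented implementation, and the plain concatenation merely stands in for the adaptive fusion unit.

    import torch
    import torch.nn as nn

    class EmotionHeads(nn.Module):
        # Two output heads: 8 discrete expression labels and 3 continuous VAD values.
        def __init__(self, fused_dim):
            super().__init__()
            self.discrete = nn.Linear(fused_dim, 8)    # happiness ... anger
            self.continuous = nn.Linear(fused_dim, 3)  # valence, arousal, dominance

        def forward(self, fused):
            probs = torch.sigmoid(self.discrete(fused))  # per-label probabilities in (0, 1)
            vad = self.continuous(fused)                 # continuous VAD predictions
            return probs, vad

    # Hypothetical unit outputs: h1 = 104-d behavioral vector, h2..h4 = 26-d context vectors.
    h1, h2, h3, h4 = torch.randn(1, 104), torch.randn(1, 26), torch.randn(1, 26), torch.randn(1, 26)
    fused = torch.cat([h1, h2, h3, h4], dim=1)           # cascade stand-in for adaptive fusion
    probs, vad = EmotionHeads(fused.shape[1])(fused)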
The specific implementation of each unit of the system is described in detail below.
2.1 Emotion processing Unit based on Multi-modal behavioral expressions
Fig. 2 is a schematic block diagram of the emotion processing unit based on multi-modal behavioral expression according to this embodiment. The unit includes four sub-feature extraction units: a facial expression extraction unit, a facial key-point extraction unit, an emotion posture extraction unit, and an emotion gesture extraction unit. For the facial expression extraction unit, the facial expression contour is first extracted through the OpenFace face detector to obtain face images of size 224 x 224, and feature extraction is then performed through a designed five-layer convolutional neural network. The head of the network comprises five convolutional layers, each followed by a batch normalization layer and a ReLU activation layer, and the tail comprises two max-pooling layers for feature dimension reduction and scaling, finally yielding a feature vector of size 26 x 1. For the facial key-point extraction unit, 68 facial key points are first extracted through the OpenFace face detector and converted into an initial feature vector of size 136 x 1. This vector passes through two one-dimensional convolutional layers with connected batch normalization and ReLU activation layers, and a feature vector of size 26 x 1 is finally obtained through a fully connected layer. For the emotion posture extraction unit, 26 coordinate points of the human posture are extracted through the AlphaPose pose detector, converted into a feature vector of size 26 x 2 x 1 and fed into an encoder network for feature extraction, finally yielding a feature vector of size 26 x 1. For the emotion gesture extraction unit, key points representing the human hand are extracted through the AlphaPose pose detector, a 512 x 1 feature vector is obtained through three transposed convolutional layers with connected batch normalization and ReLU activation layers, and a feature vector of size 26 x 1 is then obtained through three convolutional layers. An external behavioral expression modal vector of size 104 x 1 is then obtained by initial feature cascading for modal output.
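A minimal PyTorch sketch of the facial expression branch described above is given below; the channel widths, strides and the final projection to a 26-dimensional vector are assumptions, since the embodiment only specifies the layer counts, the 224 x 224 input and the 26 x 1 output.

    import torch
    import torch.nn as nn

    class FacialExpressionBranch(nn.Module):
        # Five conv + BN + ReLU blocks at the head, two max-pooling layers at the tail,
        # then an assumed linear projection to the 26-d facial expression feature.
        def __init__(self):
            super().__init__()
            chans = [3, 16, 32, 64, 64, 64]              # assumed channel widths
            layers = []
            for cin, cout in zip(chans[:-1], chans[1:]):
                layers += [nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                           nn.BatchNorm2d(cout), nn.ReLU(inplace=True)]
            self.head = nn.Sequential(*layers)
            self.tail = nn.Sequential(nn.MaxPool2d(2), nn.MaxPool2d(2))
            self.proj = nn.Linear(64, 26)

        def forward(self, face):                         # face: (B, 3, 224, 224)
            x = self.tail(self.head(face))               # -> (B, 64, 1, 1) with the sizes above
            return self.proj(x.flatten(1))               # -> (B, 26) facial expression feature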
2.2 Emotion analysis Unit based on scene context
Fig. 3 is a schematic block diagram of the emotion analysis unit based on scene context, which mainly includes the stages of data preprocessing, channel-based attention feature extraction, space-based attention feature extraction, heat-map generation, feature extraction and feature output. Specifically, in the data preprocessing stage, masks are added to the human agent subject to be emotion-recognized in the input real-world images and in the video frames captured from the video, to obtain a scene image that retains only scene information. The residual neural network ResNet-18 is then selected as the backbone of the main model, and a channel- and space-based attention mechanism module is sequentially and alternately embedded into its 8 residual connection blocks to form a complete attention extraction network for feature extraction of scene emotion semantics. For the channel attention mechanism, a 1D channel attention map is inferred through global average pooling, and features are then merged at the output layer through channel-wise multiplication; for the spatial attention mechanism, a 2D spatial attention map is inferred through a global max-pooling layer, and features are then merged at the output layer through channel-wise multiplication. With the help of the attention mechanism, the system can focus more on the emotion cues related to the corresponding agent and generate an attention heat map. In the attention heat map, this embodiment can visualize and label the scene semantics with higher weights, facilitating analysis of the association and coupling between scene information and emotion. Finally, feature extraction is performed and an emotion feature vector of size 26 x 1 is obtained.
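The following is a minimal sketch of one such channel-plus-spatial attention block, again assuming PyTorch; the reduction ratio, the 7 x 7 spatial kernel and the small MLP in the channel branch are assumptions rather than details taken from the patent. In the embodiment, one such block would be inserted after selected residual blocks of ResNet-18.

    import torch
    import torch.nn as nn

    class ChannelSpatialAttention(nn.Module):
        # Channel attention from global average pooling, spatial attention from
        # channel-wise global max pooling, each merged by element-wise multiplication.
        def __init__(self, channels, reduction=8):
            super().__init__()
            self.channel_mlp = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels))
            self.spatial_conv = nn.Conv2d(1, 1, kernel_size=7, padding=3)

        def forward(self, x):                                          # x: (B, C, H, W)
            c = torch.sigmoid(self.channel_mlp(x.mean(dim=(2, 3))))    # 1D channel attention map
            x = x * c[:, :, None, None]                                # channel-wise multiplication
            s = torch.sigmoid(self.spatial_conv(x.max(dim=1, keepdim=True).values))  # 2D spatial map
            return x * s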
2.3 Emotion analysis unit based on agent group interaction
Fig. 4 is a schematic block diagram of the emotion analysis unit based on agent group interaction according to this embodiment, which mainly includes four steps: data preprocessing, feature pre-extraction, emotion relationship graph construction and emotion feature output. Specifically, in the data preprocessing stage, the human subject I_agent to be emotion-recognized and the other-agent information I_p are respectively extracted from the input real-world images and the video-frame representation I obtained by video capture. Two initial features are first extracted through the deep residual network ResNet-18 and denoted f_agent and f_p respectively; the different features are then fed as emotion nodes into a Graph Attention Network to construct the emotion relationship graph. Then, considering that different other agents have different strengths and degrees of emotional influence on the recognized subject, the emotion similarity coefficient between agents is calculated as n_ij = α([W f_agent || W f_p]), where W is a weight parameter, α(·) is a feature mapping and || denotes the concatenation operation. To enhance the relational learning of emotion transfer, this embodiment also uses a multi-head attention mechanism to fuse neighbourhood node features, i.e. the emotion relation coefficient is computed three times between each pair of adjacent feature nodes. Finally, a weighted average is taken over the other-agent features under their different weights and the original features of the recognized subject, and an emotion feature vector h_3 of size 26 x 1 is output.
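A minimal sketch of the attention coefficient n_ij = α([W f_agent || W f_p]) and the three-head aggregation is shown below, assuming PyTorch; the 26-dimensional node features, the LeakyReLU mapping used for α(·) and the simple averaging with the subject feature are illustrative assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class EmotionGraphAttention(nn.Module):
        # Computes attention coefficients between the subject node and each other-agent node
        # for several heads, then takes a weighted average with the subject's own feature.
        def __init__(self, dim=26, heads=3):
            super().__init__()
            self.W = nn.ModuleList([nn.Linear(dim, dim, bias=False) for _ in range(heads)])
            self.a = nn.ModuleList([nn.Linear(2 * dim, 1, bias=False) for _ in range(heads)])

        def forward(self, f_agent, f_others):          # f_agent: (dim,), f_others: (N, dim)
            fused = []
            for W, a in zip(self.W, self.a):
                h_agent, h_others = W(f_agent), W(f_others)
                e = F.leaky_relu(a(torch.cat([h_agent.expand_as(h_others), h_others], dim=-1)))
                w = torch.softmax(e, dim=0)            # emotion similarity weights over neighbours
                fused.append((w * h_others).sum(dim=0))
            return (torch.stack(fused).mean(dim=0) + f_agent) / 2   # weighted-average output h3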
2.4 Emotion analysis Unit based on agent and context interaction
FIG. 5 is a schematic block diagram of the emotion analysis unit based on agent and context interaction according to this embodiment. The unit aims to explore the emotional influence that cues generated by the interaction of other agents with the scene have on the recognized subject. It comprises four steps: data preprocessing, agent feature graph construction, feature aggregation and modal output. Specifically, an image I_s that retains only scene information is obtained by adding masks to all other agents in the input real-world image and in the video-frame representation I obtained by video capture. The initial scene features f_s are first extracted through the deep residual network ResNet-18; then the f_agent obtained in the agent group interaction emotion analysis unit is fed into a two-layer graph convolutional neural network to build a basic feature graph. A long short-term memory network is then used to aggregate f_s and f_agent, obtaining an emotion feature vector h_4 of size 26 x 1.
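The two-layer graph convolution over the agent features followed by LSTM-based aggregation with the scene feature could be sketched as follows; the adjacency handling, the pseudo-sequence fed to the LSTM and the 26-dimensional sizes are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class AgentContextAggregation(nn.Module):
        # Two graph-convolution layers (A·X·W form) build the basic feature graph,
        # then an LSTM aggregates the scene feature f_s with the pooled graph feature.
        def __init__(self, dim=26):
            super().__init__()
            self.gc1 = nn.Linear(dim, dim)
            self.gc2 = nn.Linear(dim, dim)
            self.lstm = nn.LSTM(input_size=dim, hidden_size=dim, batch_first=True)

        def forward(self, f_s, f_agent, adj):          # f_s: (dim,), f_agent: (N, dim), adj: (N, N)
            x = torch.relu(adj @ self.gc1(f_agent))
            x = torch.relu(adj @ self.gc2(x))
            graph_feat = x.mean(dim=0)                 # pooled agent-graph representation
            seq = torch.stack([f_s, graph_feat]).unsqueeze(0)   # (1, 2, dim) pseudo-sequence
            out, _ = self.lstm(seq)                    # LSTM-based feature aggregation
            return out[0, -1]                          # h4: 26-d emotion feature vector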
2.5 feature fusion unit based on adaptive programming
Fig. 6 is a schematic block diagram of the feature fusion unit based on adaptive planning according to this embodiment. Specifically, the unit adaptively screens and fuses features according to the characteristics of the multi-modal features obtained from the emotion processing and analysis units. When the input features are strongly correlated feature vectors, such as those based on facial expressions and facial key points, the system automatically performs a feature cascade operation for feature fusion in order to preserve the completeness and strong expressive capability of the emotional feature space; when the input features are weakly correlated feature vectors, such as those based on the emotional information transferred between agents and the emotional semantic information of the scene, the system automatically applies multiplicative fusion, letting features interpenetrate and complement one another while keeping the original dimension unchanged, so as to bridge the differences between features to the greatest extent. The fused features are finally output.
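A minimal sketch of this fusion rule is given below; how the input vectors are judged strongly or weakly correlated is assumed to be decided upstream, and the example vector names and sizes are hypothetical.

    import torch

    def adaptive_fusion(strong_feats, weak_feats):
        # Strongly correlated vectors are concatenated (feature cascade); weakly correlated
        # vectors are merged multiplicatively, keeping their original dimension unchanged.
        cascade = torch.cat(strong_feats, dim=-1)
        multiplicative = weak_feats[0]
        for f in weak_feats[1:]:
            multiplicative = multiplicative * f
        return torch.cat([cascade, multiplicative], dim=-1)

    # Hypothetical usage with 26-d vectors from the units above:
    h_face, h_landmark = torch.randn(26), torch.randn(26)   # strongly correlated
    h_scene, h_group = torch.randn(26), torch.randn(26)     # weakly correlated
    fused = adaptive_fusion([h_face, h_landmark], [h_scene, h_group])   # 52 + 26 = 78 dims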
2.6 recognition Unit based on discrete Emotion
Fig. 7 is a schematic block diagram of the recognition unit based on discrete emotion provided in this embodiment. In this unit, the fused emotion feature vector is normalized, the output feature values are mapped to between 0 and 1 with a Sigmoid function, a cross-entropy loss function is then calculated for each output node and its corresponding label, and the probability of each class of possible output expression labels is computed, the eight probabilities summing to 1. The label with the highest probability is then selected as the expression label predicted by the system.
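As an illustration of this head, the sketch below maps a fused vector to eight scores, squashes them with a Sigmoid and applies a per-node cross-entropy (binary cross-entropy) against the multi-label target; the 78-dimensional input is hypothetical, and the normalization that makes the eight probabilities sum to 1 is not reproduced here.

    import torch
    import torch.nn.functional as F

    fused = torch.randn(1, 78)                      # fused emotion feature (hypothetical size)
    logits = torch.nn.Linear(78, 8)(fused)          # eight expression labels
    probs = torch.sigmoid(logits)                   # each output mapped to 0..1
    target = torch.tensor([[0., 0., 1., 0., 0., 0., 0., 0.]])   # e.g. the "sadness" label
    loss = F.binary_cross_entropy(probs, target)    # cross-entropy per output node
    pred = probs.argmax(dim=1)                      # label with the highest probability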
2.7 prediction Unit based on continuous Emotion
Fig. 8 is a schematic block diagram of the prediction unit based on continuous emotion according to this embodiment, which mainly includes the steps of data normalization, label difference summation, error magnitude calculation and final continuous value prediction. In particular, the unit uses the mean square error loss, i.e. the sum of the squares of the differences between the predicted values and the target values. Because of the reliability of data collection in this embodiment, prediction is robust to outliers; the model strives to reduce the errors caused by outliers, which improves its overall performance.
Optimizing the mean square error loss yields the mean of all observations, and during training the gradient of the mean square error decreases as the loss function decreases rather than jumping abruptly at the extreme point, so the error magnitude calculation has good characteristics there. A strategy of dynamically adjusting the learning rate is also added; together, these characteristics allow a more accurate result to be obtained in the final training.
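A minimal training sketch for this unit is given below: the VAD labels are normalised, the mean square error (the squared label differences averaged over the batch) is minimised, and the learning rate is decayed dynamically. The network, optimizer and scheduler settings are assumptions for illustration only.

    import torch
    import torch.nn as nn

    model = nn.Linear(78, 3)                                   # predicts valence, arousal, dominance
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)
    criterion = nn.MSELoss()

    fused = torch.randn(4, 78)                                 # batch of fused feature vectors
    vad_target = torch.randint(1, 11, (4, 3)).float() / 10.0   # 1..10 labels normalised to (0, 1]

    optimizer.zero_grad()
    loss = criterion(model(fused), vad_target)                 # mean of squared label differences
    loss.backward()
    optimizer.step()
    scheduler.step()                                           # dynamically adjusted learning rate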
After the emotion information is obtained through the discrete emotion recognition unit and the continuous emotion prediction unit, result display and presentation are further achieved in a visualization mode in the display unit.
The multi-modal emotion recognition system based on context awareness disclosed by the embodiment introduces scene semantics, agent interaction and emotion characteristics contained in the agent and scene interaction in context awareness for the first time, and effectively promotes the development of multi-modal emotion recognition in the real world. Meanwhile, by fully collecting external emotion information such as facial expression information, facial key point information, gesture signals and posture signals of an emotion recognition main body and combining different recognition units to carry out preprocessing and feature extraction on heterogeneous modal information, organic fusion of multi-modal features is further realized by using a feature fusion mode of self-adaptive programming. Finally, the multi-task learning mode based on emotion classification and prediction greatly enhances the accuracy of multi-mode emotion recognition, and improves the generalization performance and accuracy of the model. The method provided by the embodiment can provide a complete and effective emotion distinguishing feature space, and provides reliable guarantee for subsequent human natural emotion understanding and emotion characterization of the open world.
The foregoing detailed description of the preferred embodiments of the invention has been presented. It should be understood that numerous modifications and variations could be devised by those skilled in the art in light of the present teachings without departing from the inventive concepts. Therefore, the technical solutions available to those skilled in the art through logic analysis, reasoning and limited experiments based on the prior art according to the concept of the present invention should be within the scope of protection defined by the claims.

Claims (10)

1. A multi-modal emotion recognition method based on context awareness is characterized by comprising the following steps:
multi-mode information acquisition: collecting a video and a real world image for emotion recognition, wherein the video comprises a human main body and other agents to be subjected to emotion recognition;
emotion processing based on multi-modal behavioral expression: extracting feature vectors of facial expressions, facial landmark points, human posture and human gestures from the video, and generating an external behavioral expression modal vector by initial feature cascading;
emotion analysis step based on scene context: adding masks to the human body in each video frame in the real world image and the video to obtain a scene image, and then extracting the features of scene emotion semantics to obtain a first emotion feature vector;
emotion analysis based on agent group interaction: extracting the human subject and other-agent information respectively from the real-world image and each video frame of the video, then extracting initial characterization features, feeding each initial characterization feature as an emotion node into a graph attention network, and constructing an emotion relationship graph; according to the emotion relationship graph, calculating the strength and degree of the emotional influence of the different other agents on the human subject, determining the weights of the emotional feature vectors generated by the interaction of the other agents through emotion similarity coefficients, and performing a weighted average with the initial characterization features to obtain a second emotion feature vector;
emotion analysis based on agent and context interaction: adding masks to the other agents in the real-world image and in each video frame of the video to obtain a scene image, and extracting initial scene features; establishing a basic feature graph from the initial characterization features of the other agents, and performing feature aggregation on the initial scene features and the basic feature graph to obtain a third emotion feature vector;
and (3) feature fusion step: performing feature fusion on the external behavior expression modal vector, the first emotion feature vector, the second emotion feature vector and the third emotion feature vector to obtain a fusion feature vector;
and emotion recognition: and performing emotion recognition according to the fusion feature vector.
2. The multi-modal emotion recognition method based on context awareness according to claim 1, wherein in the emotion analysis step based on scene context, the feature extraction of scene emotion semantics is specifically: selecting a residual neural network as the backbone of the main model, sequentially and alternately embedding a channel- and space-based attention mechanism module into a plurality of residual connection blocks of the residual network to form a complete attention extraction network, and feeding the scene image into the attention extraction network for feature extraction of scene emotion semantics.
3. The method according to claim 2, wherein the channel- and space-based attention mechanism module comprises a channel attention mechanism and a spatial attention mechanism; the channel attention mechanism infers a 1D channel attention map through global average pooling and then merges features at the output layer through channel-wise multiplication; the spatial attention mechanism infers a 2D spatial attention map through a global max-pooling layer and then merges features at the output layer through channel-wise multiplication.
4. The multi-modal emotion recognition method based on context awareness, as claimed in claim 1, wherein the feature fusion performed in the feature fusion step specifically comprises:
and selecting strongly-correlated feature vectors and weakly-correlated feature vectors from the external behavior expression modal vector, the first emotion feature vector, the second emotion feature vector and the third emotion feature vector, performing feature fusion on the strongly-correlated feature vectors through feature cascade operation, and performing feature fusion on the weakly-correlated feature vectors through a multiplicative fusion mode.
5. The multi-modal emotion recognition method based on context awareness, as claimed in claim 1, wherein the emotion recognition step specifically comprises a discrete emotion recognition sub-step and a continuous emotion prediction sub-step;
the discrete emotion recognition sub-step comprises the following steps: mapping the fusion feature vector to a range from 0 to 1, then calculating a cross entropy loss function for each output node and the corresponding label, and predicting the obtained expression label by calculating the probability of each type of possible output expression label.
6. The multi-modal emotion recognition method based on context awareness according to claim 5, wherein the continuous emotion prediction sub-step comprises, in sequence, data normalization, label difference summation, error magnitude calculation and continuous numerical prediction, and is implemented by a pre-constructed and trained network model, the network model being trained with a mean square error loss that computes the sum of the squares of the differences between the predicted values and the target values.
7. The multi-modal emotion recognition method based on context awareness according to claim 5, wherein the expression labels in the discrete emotion recognition sub-step include happiness, surprise, sadness, disgust, excitement, peace, fear and anger;
the output of the continuous emotion prediction sub-step is the predicted values, from 1 to 10, of the VAD continuous emotion model, which refers to the arousal, dominance and valence (pleasure) of emotion.
8. The multi-modal emotion recognition method based on context awareness of claim 1, wherein in the emotion processing step based on multi-modal behavioral expression, facial expression contours are extracted through a face detector, and then feature extraction operation is performed through a designed convolutional neural network to obtain facial expression feature vectors;
extracting a plurality of facial landmark points through a facial detector, and acquiring and converting the facial landmark points into emotion feature vectors through a convolutional neural network;
extracting a plurality of coordinate points of the human body posture through a posture detector, and feeding the coordinate points to an encoder network for feature extraction to obtain a feature vector of the human body posture;
and extracting key points representing the human hand through a pose detector, and obtaining the feature vector of the human gesture with a convolutional neural network.
9. A system for applying a context awareness based multi-modal emotion recognition method as claimed in any of claims 1-8, comprising:
a multimodal information collection unit configured to perform the multimodal information collection step;
the emotion processing unit is configured to execute the emotion processing step based on the multi-modal behavioral expression;
a context-based emotion analysis unit configured to perform the context-based emotion analysis step;
an emotion analysis unit based on agent population interaction, configured to perform the emotion analysis step based on agent population interaction;
an emotion analysis unit based on agent and context interaction, configured to execute the emotion analysis step based on agent and context interaction;
an adaptive planning based feature fusion unit configured to perform the feature fusion step;
an emotion recognition unit configured to perform the emotion recognition step.
10. The system of claim 9, further comprising a display module configured to display the output result of the emotion recognition unit.
CN202111080047.XA 2021-09-15 2021-09-15 Multi-modal emotion recognition method and system based on context awareness Pending CN113947702A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111080047.XA CN113947702A (en) 2021-09-15 2021-09-15 Multi-modal emotion recognition method and system based on context awareness

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111080047.XA CN113947702A (en) 2021-09-15 2021-09-15 Multi-modal emotion recognition method and system based on context awareness

Publications (1)

Publication Number Publication Date
CN113947702A true CN113947702A (en) 2022-01-18

Family

ID=79328552

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111080047.XA Pending CN113947702A (en) 2021-09-15 2021-09-15 Multi-modal emotion recognition method and system based on context awareness

Country Status (1)

Country Link
CN (1) CN113947702A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496077A (en) * 2022-11-18 2022-12-20 之江实验室 Multimode emotion analysis method and device based on modal observation and grading
CN115496077B (en) * 2022-11-18 2023-04-18 之江实验室 Multimode emotion analysis method and device based on modal observation and grading
CN117149944A (en) * 2023-08-07 2023-12-01 北京理工大学珠海学院 Multi-mode situation emotion recognition method and system based on wide time range
CN117149944B (en) * 2023-08-07 2024-04-23 北京理工大学珠海学院 Multi-mode situation emotion recognition method and system based on wide time range
CN117058597A (en) * 2023-10-12 2023-11-14 清华大学 Dimension emotion recognition method, system, equipment and medium based on audio and video
CN117058597B (en) * 2023-10-12 2024-01-05 清华大学 Dimension emotion recognition method, system, equipment and medium based on audio and video
CN117235604A (en) * 2023-11-09 2023-12-15 江苏云幕智造科技有限公司 Deep learning-based humanoid robot emotion recognition and facial expression generation method

Similar Documents

Publication Publication Date Title
Wang et al. RGB-D-based human motion recognition with deep learning: A survey
KR101986002B1 (en) Artificial agents and method for human intention understanding based on perception-action connected learning, recording medium for performing the method
CN113947702A (en) Multi-modal emotion recognition method and system based on context awareness
CN111523378B (en) Human behavior prediction method based on deep learning
Areeb et al. Helping hearing-impaired in emergency situations: A deep learning-based approach
CN111434118B (en) Apparatus and method for generating user interest information
KR102132407B1 (en) Method and apparatus for estimating human emotion based on adaptive image recognition using incremental deep learning
CN110598587B (en) Expression recognition network training method, system, medium and terminal combined with weak supervision
Boualia et al. Pose-based human activity recognition: a review
Dharanya et al. Facial Expression Recognition through person-wise regeneration of expressions using Auxiliary Classifier Generative Adversarial Network (AC-GAN) based model
CN113673244B (en) Medical text processing method, medical text processing device, computer equipment and storage medium
Basly et al. DTR-HAR: deep temporal residual representation for human activity recognition
CN110633689B (en) Face recognition model based on semi-supervised attention network
Heidari et al. Progressive spatio-temporal bilinear network with Monte Carlo dropout for landmark-based facial expression recognition with uncertainty estimation
Bendre et al. Show why the answer is correct! towards explainable ai using compositional temporal attention
Zhang et al. Emotion recognition from body movements with as-lstm
Yee et al. Apex frame spotting using attention networks for micro-expression recognition system
Hussain et al. Deep learning for audio visual emotion recognition
Vayadande et al. Lipreadnet: A deep learning approach to lip reading
Fatima et al. Use of affect context in dyadic interactions for continuous emotion recognition
Dubey Usage of deep learning in recent applications
De et al. Computational intelligence for human action recognition
CN116844225B (en) Personalized human body action recognition method based on knowledge distillation
KR102585149B1 (en) Customized self-driving devices and methods based on deep learning using cognitive characteristics
CN116434335B (en) Method, device, equipment and storage medium for identifying action sequence and deducing intention

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination