CN114399818A - Multi-mode face emotion recognition method and device - Google Patents

Multi-mode face emotion recognition method and device Download PDF

Info

Publication number
CN114399818A
Authority
CN
China
Prior art keywords
sequence
modal
frame
face
emotion recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210010258.4A
Other languages
Chinese (zh)
Inventor
刘羽中
李华亮
范圣平
沈雅利
王琪如
谢庭军
翟永昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Electric Power Research Institute of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Electric Power Research Institute of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd, Electric Power Research Institute of Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202210010258.4A priority Critical patent/CN114399818A/en
Publication of CN114399818A publication Critical patent/CN114399818A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-modal face emotion recognition method and a device, wherein the method comprises the following steps: acquiring video data of an operator in a target scene; performing frame extraction on the video data according to a preset time interval to obtain a multi-mode information sequence; extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face; processing the spectrogram corresponding to each frame of auditory modal information by using a convolutional neural network model to obtain a voice characteristic sequence; inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention for fusion coding to obtain a time sequence fusion feature vector; and inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score, and obtaining the face emotion of the operator according to the emotion score. By adopting the embodiment provided by the invention, the emotion recognition efficiency and accuracy are greatly improved.

Description

Multi-mode face emotion recognition method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a multi-mode face emotion recognition method and device.
Background
Face emotion recognition is a major research hotspot in modern artificial intelligence, and the technology is widely applied in fields such as automatic driving, human-computer interaction, and health monitoring. In the health monitoring field, a health monitoring system captures the emotional state of an operator in real time by analyzing the operator's face and voice in live video, which helps detect and address mental health problems in a timely manner.
The field of emotion recognition mainly classifies emotional indicators into two categories: discrete dimensions and continuous dimensions. For the discrete dimensions, Ekman et al. defined 7 major expression states in early studies, including anger, fear, disgust, joy, sadness, surprise, and neutrality. For the continuous dimensions, researchers later expressed emotion in two dimensions: the arousal level is expressed by arousal, and the positivity level is expressed by valence. Because the continuous dimensions better match the continuity of human emotion, they have attracted more and more researchers in recent years. At present, the mainstream video face emotion recognition methods mainly extract frames from a video, extract face features from each image with a neural network, recognize specific emotions, and finally fuse the emotions of different video frames. Patent application No. 202010252867 (patent document 1) discloses an expression recognition method based on a lightweight convolutional neural network, which builds a lightweight convolutional network model with grouped convolution and detects and corrects the input image with a face corrector, thereby improving the recognition efficiency of the model. Patent application No. 202011345478 (patent document 2) discloses a facial expression recognition method, device and equipment based on deep learning, in which an original facial image is input to a generative adversarial network to generate a synthesized facial image, and the original facial image and the synthesized facial image are used together as a training set, thereby alleviating the over-fitting problem in the training strategy.
In the prior art, for example in patent documents 1 and 2, only a single visual modality is used for face emotion recognition, and the sound modality, which is important in a real office scene, is ignored; for example, if a person is tense and fearful, his or her voice becomes trembling and unnatural. In addition, the training data and test data in the prior art both come from the Internet; however, in practice there are significant differences in speech and language across scenes (for example, different devices and recording environments affect the quality of video collection), so the data distribution of the training data is inconsistent with that of the real test environment, and the model cannot be effectively generalized to real scenes. Moreover, the prior art mainly fuses multi-frame classifications after single-frame recognition, which is inefficient in operation and may produce incoherent emotion recognition results.
Disclosure of Invention
The embodiment of the invention provides a multi-modal face emotion recognition method and device, which fully utilize complementary information among different modalities, perform fusion alignment on the different modalities based on space attention and channel attention, and greatly improve the emotion recognition efficiency and accuracy.
In order to achieve the above object, a first aspect of embodiments of the present application provides a multimodal facial emotion recognition method, including:
acquiring video data of an operator in a target scene;
performing frame extraction on the video data according to a preset time interval to obtain a multi-mode information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information;
extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face;
processing the spectrogram corresponding to each frame of auditory modal information by using a convolutional neural network model to obtain a voice characteristic sequence;
inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention for fusion coding to obtain a time sequence fusion feature vector;
and inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score, and obtaining the face emotion of the operator according to the emotion score.
In a possible implementation manner of the first aspect, the extracting visual modality information of each frame to obtain a facial expression feature sequence related to a face key point specifically includes:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
In a possible implementation manner of the first aspect, the processing, by using the convolutional neural network model, the spectrogram corresponding to each frame of auditory modality information to obtain the speech feature sequence specifically includes:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
In a possible implementation manner of the first aspect, the inputting the normalized facial expression feature sequence and the normalized speech feature sequence into an attention-based time series learning model for fusion coding to obtain a time series fusion feature vector specifically includes:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
In a possible implementation manner of the first aspect, when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to a gradient of the mean square error loss function, and when a loss value of the mean square error loss function is lower than a set threshold, the multi-modal emotion recognition model stops training.
A second aspect of an embodiment of the present application provides a multimodal facial emotion recognition apparatus, including:
the data acquisition module is used for acquiring video data of the operating personnel in a target scene;
the frame extracting module is used for extracting frames from the video data according to a preset time interval to obtain a multi-mode information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information;
the visual modal module is used for extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face;
the auditory modality module is used for processing the spectrogram corresponding to each frame of auditory modality information by using a convolutional neural network model to obtain a voice characteristic sequence;
the fusion coding module is used for inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention to perform fusion coding to obtain a time sequence fusion feature vector;
and the recognition module is used for inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score and obtaining the face emotion of the operator according to the emotion score.
In a possible implementation manner of the second aspect, the visual modality module is specifically configured to:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
In one possible implementation form of the second aspect, the hearing modality module is specifically configured to:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
In a possible implementation manner of the second aspect, the fusion coding module is specifically configured to:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
In a possible implementation manner of the second aspect, when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to a gradient of the mean square error loss function, and when a loss value of the mean square error loss function is lower than a set threshold value, the multi-modal emotion recognition model stops training.
Compared with the prior art, the multi-modal face emotion recognition method and device provided by the embodiments of the invention collect data in a real scene for model training to obtain a normalized facial expression feature sequence and a normalized voice feature sequence, and efficiently encode the time-series fusion feature vector of the operator with the constructed attention-based time-series learning model. Throughout the coding process, the implicit emotion contained in the voice modality is integrated into face emotion recognition, and the different modalities are fused and aligned based on spatial attention and channel attention. Compared with the prior art, which uses the visual modality alone, the method and device provided by the embodiments of the invention can fully utilize the complementary information among different modalities, obtain a more stable effect when a single modality is noisy, and avoid the inconsistent recognition results caused by the prior-art approach of first performing single-frame recognition on a video and then fusing the results of different frames.
In addition, the embodiment of the invention carries out time sequence learning model training by collecting data in a real scene, solves the problem of inconsistent distribution of training data and real scene data in the prior art, and greatly improves the efficiency and accuracy rate of emotion recognition.
Drawings
Fig. 1 is a schematic flow chart of a multi-modal face emotion recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a face key point marker used in an embodiment of the present invention;
FIG. 3 is a spectrogram corresponding to a single frame of auditory modality information provided by an embodiment of the present invention;
fig. 4 is a schematic diagram of an attention-based time sequence learning model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a method for recognizing a multi-modal face emotion, where the method includes:
and S10, acquiring the video data of the operator in the target scene.
S11, performing frame extraction on the video data according to a preset time interval to obtain a multi-modal information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information.
And S12, extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face.
And S13, processing the spectrogram corresponding to each frame of auditory modal information by using a convolutional neural network model to obtain a voice characteristic sequence.
And S14, inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention for fusion coding to obtain a time sequence fusion feature vector.
And S15, inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score, and obtaining the face emotion of the operator according to the emotion score.
The embodiment of the invention provides a multi-modal face emotion recognition method, which is a multi-modal continuous-dimension emotion recognition technique based on face video. By collecting a video data set in the real scene, it solves the prior-art problem that the distribution of the training data is inconsistent with that of the real-scene data, and for different real scenes the pre-trained model is migrated to the new scene with a continual learning method. Secondly, the embodiment of the invention encodes the information of different frames in the video with the attention-based time-series learning model, which improves the operating efficiency of the model. In addition, considering that the voice modality is also a non-negligible modality in real scenes, the embodiment of the invention integrates the implicit emotion contained in the voice modality into face emotion recognition, thereby improving the accuracy of face emotion recognition.
S12 and S13 are the processes of extracting the face features and voice features from the video data. After extraction is completed, fusion coding is carried out in S14 with the attention-based time-series learning model, so that multi-frame multi-modal information is processed at the same time instead of single-frame multi-modal information being recognized one by one, and the emotion recognition results obtained through the multi-modal emotion recognition model are therefore consistent.
Illustratively, in steps S10 and S11, the video data of workers in the real scene is collected, and each worker contributes a 3-minute speech video with audio. With a worker's speech video denoted S and the preset time interval set to 3 seconds, a picture frame is extracted from S every 3 seconds, and the audio within those 3 seconds is taken as the audio information corresponding to that picture frame. The multi-modal information sequence after video processing is S = [s_1, s_2, ..., s_n], where n is the length of the sequence; s_n = {v_n, a_n} is single-frame multi-modal information, where v_n is the visual modality information and a_n is the auditory modality information.
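As a concrete illustration of S10 and S11 (not part of the patent disclosure), the sketch below splits a speech video into such a sequence, assuming OpenCV and librosa are available and the audio track has been exported to a separate WAV file (e.g., with ffmpeg); the function name and the 3-second default interval are illustrative only.

import cv2
import librosa

def extract_multimodal_sequence(video_path, audio_path, interval_s=3.0, sr=16000):
    # Build s_1..s_n, where each s_k = (v_k, a_k) pairs one picture frame
    # with the audio segment of length interval_s starting at the same time.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 1.0
    duration_s = cap.get(cv2.CAP_PROP_FRAME_COUNT) / fps
    sequence, t = [], 0.0
    while t + interval_s <= duration_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)        # seek to time t
        ok, frame = cap.read()                            # visual modality v_k
        if not ok:
            break
        audio, _ = librosa.load(audio_path, sr=sr,        # auditory modality a_k
                                offset=t, duration=interval_s)
        sequence.append((frame, audio))
        t += interval_s
    cap.release()
    return sequence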
Exemplarily, S12 specifically includes:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
The multi-modal frame information sequence extracted in step S11 is subjected to feature preprocessing. For the single-frame visual modality information v_n extracted in step S11, the invention uses a cascade residual regression tree model: a face-region alignment algorithm is realized by building cascaded regression trees, and a model structure is trained that can accurately mark key points along the important contours of the facial organs. Using this model structure, the face region is extracted from v_n and marked with 68 key points, covering the eyebrows, eyes, nose, lips, and the lower outline of the face, as shown in FIG. 2.
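For illustration only, a minimal sketch of this key-point step, assuming dlib's pre-trained 68-point shape predictor (an ensemble-of-regression-trees model, used here as a stand-in for the cascade residual regression tree model trained in the patent) and its standard model file name:

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(frame_bgr):
    # Returns the face rectangle and the 68 key points (eyebrows, eyes,
    # nose, lips, lower face outline) for the first detected face.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None, None
    rect = faces[0]                       # assume one operator per frame
    shape = predictor(gray, rect)
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)],
                   dtype=np.float32)
    return rect, pts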
Let the length and width of the face region in the original image, in pixels, be H and W respectively. The coordinate set of the key points is recorded as {(x_i, y_i), i ∈ 0, 1, ..., 67}, where i = 30 is the key point at the nose; the coordinates of the other key points can be normalized and standardized on the basis of this central key point. The coordinates of each key point P_i = (x_i, y_i) are converted as follows:
[Formula given as an image in the original publication: normalization of the key point coordinates relative to the nose key point and the face region size H × W.]
Let d(i, j) denote the Euclidean distance between point P_i and point P_j, and let D(i) denote the set of key points in the face region associated with point P_i. The Euclidean distance between each pair of adjacent key points is standardized as d_nir(i, i+1), normalized as follows:
[Formula given as an image in the original publication: normalization of the inter-key-point Euclidean distances.]
Let a_1(i, j, k) denote the angle formed by line segment ij and line segment jk, and let a_2(i) denote the angle between the line connecting the nose point P_30 with point P_i and the vertical direction. The two defined angles are respectively calculated as:
[Formulas given as images in the original publication: the calculations of the angle features a_1(i, j, k) and a_2(i).]
The normalized distance-ratio features and the angle features of the main key points are spliced together, so that each set of 68 face key points can be converted into 97 features (66 + 8 + 14 + 7 + 2). That is, each facial expression can be converted either into the original one-dimensional vector P with n = 68 or into the converted one-dimensional vector P' with n = 97; the latter reflects the overall characteristics of the facial features and is closer to the way humans intuitively grasp facial expression characteristics.
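Because the exact normalization, distance-ratio, and angle formulas appear only as images in the original publication, the sketch below shows one plausible construction of such geometric features around the nose key point (index 30); it does not reproduce the patent's 97-feature layout and is an assumption for illustration.

import numpy as np

def keypoint_features(rect, pts):
    # pts: (68, 2) key point coordinates; rect: dlib face rectangle.
    H = rect.bottom() - rect.top()
    W = rect.right() - rect.left()
    nose = pts[30]
    norm = (pts - nose) / np.array([W, H], dtype=np.float32)  # nose-centred, size-normalized coords

    # distances between consecutive key points, normalized by the outer-eye-corner distance
    d = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    eye_dist = np.linalg.norm(pts[45] - pts[36]) + 1e-6
    dist_ratio = d / eye_dist

    # angle between the nose-to-point line and the vertical direction
    vec = pts - nose
    angles = np.arctan2(vec[:, 0], vec[:, 1] + 1e-6)

    return np.concatenate([norm.ravel(), dist_ratio, angles]).astype(np.float32)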
Exemplarily, S13 specifically includes:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
For the single frame of auditory modality information a_n extracted in step S11, the embodiment of the invention uses the Python audio processing library librosa, which yields a spectrogram of size 496 × 369. As shown in FIG. 3, the horizontal direction is the time axis and the vertical direction is the frequency axis; the color of a pixel (t, w) in the spectrogram represents the volume at frequency w (in Hz) at time t, and, as indicated by the color bar on the right side of FIG. 3, a darker color represents a higher volume at that point. After the spectrogram is obtained, feature extraction is performed on it with a convolutional neural network model.
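A minimal sketch of this spectrogram step, assuming librosa and OpenCV; the STFT parameters are not given in the text, so n_fft and hop_length are assumptions, and the result is simply resized to the 496 × 369 size quoted above.

import cv2
import librosa
import numpy as np

def frame_spectrogram(audio, size=(496, 369)):
    # Magnitude spectrogram in dB: the value at (t, w) reflects the volume
    # at frequency w and time t, matching the description of FIG. 3.
    spec = np.abs(librosa.stft(audio, n_fft=1024, hop_length=256))
    spec_db = librosa.amplitude_to_db(spec, ref=np.max)
    return cv2.resize(spec_db, size, interpolation=cv2.INTER_LINEAR).astype(np.float32)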
Illustratively, the inputting the normalized facial expression feature sequence and the normalized voice feature sequence into an attention-based time sequence learning model for fusion coding to obtain a time sequence fusion feature vector specifically includes:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
First, the features obtained in steps S12 and S13 are normalized. A facial expression feature sequence or a speech feature sequence (an RGB image sequence) is recorded as the time-series feature sequence L_S(i) ∈ R^(t×c×w×h), where S(i) denotes the i-th sample of modality S, t is the sequence length, w and h are the width and height of the image respectively, and c is the number of channels. The feature values in each channel range from 0 to 255 and are divided by 255 for normalization before being input into the network.
Then, the normalized time-series feature sequences are input into the ResNet-Attention-Bi-LSTM network for fusion coding, as shown in FIG. 4. The training set is denoted L_S. Each modality sequence L_S(i) in L_S is unpacked frame by frame; the feature of the j-th frame is denoted L_S(i)_j and belongs to the training set L_S. The features over all i and j are randomly shuffled and combined into a new training set M, with M_i denoting the i-th feature. The training set M is processed by convolutional deep learning based on the common ResNet18. The last convolutional layer of the network is followed by an attention mechanism, consisting of space-based attention and channel-based attention. The feature output by the last convolutional layer is recorded as F ∈ R^(c×w×h); F can be seen as a combination of c feature maps of size w × h. In the space-based attention mechanism, max pooling and average pooling are applied at the same position across all feature maps to obtain two feature maps F_max, F_avg ∈ R^(1×w×h); the two are combined into R^(2×w×h) and processed by a convolutional layer and a Sigmoid function to learn a spatial attention map α_spatial ∈ R^(w×h). Multiplying the attention value of each position in α_spatial by the value at the corresponding position of each feature map in the original feature F gives the feature F_spatial after spatial attention processing. The process is as follows:
α_spatial = Sigmoid(Conv(Maxpool(F), Avgpool(F)))
F_spatial = α_spatial ⊗ F (position-wise multiplication, broadcast over the c feature maps)
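A PyTorch sketch of the spatial attention step described above (the 7×7 convolution kernel size is an assumption; it is not specified in the text):

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, F):                        # F: (B, c, h, w)
        f_max, _ = F.max(dim=1, keepdim=True)    # max pooling across the c feature maps
        f_avg = F.mean(dim=1, keepdim=True)      # average pooling across the c feature maps
        alpha = torch.sigmoid(self.conv(torch.cat([f_max, f_avg], dim=1)))  # alpha_spatial
        return F * alpha                         # F_spatial: position-wise reweighting of F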
In the channel-based attention mechanism, max pooling and average pooling are first applied to each feature map separately to obtain F_max, F_avg ∈ R^(c×1×1), which are compressed to R^c. These two signals are then input to an MLP (multi-layer perceptron) for learning, which outputs F'_max, F'_avg ∈ R^c respectively. The two outputs are added element-wise along the channel dimension and passed through the Sigmoid function to form a channel attention vector α_channel ∈ R^c. Multiplying all values on each feature map in the original feature F by the attention value of the corresponding channel in α_channel gives the feature F_channel after channel attention processing. The process is as follows:
α_channel = Sigmoid(MLP(Maxpool(F)), MLP(Avgpool(F)))
F_channel = α_channel ⊗ F (channel-wise multiplication, broadcast over each w × h feature map)
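A PyTorch sketch of the channel attention step described above (the hidden size of the shared MLP, set by the reduction ratio, is an assumption):

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, F):                         # F: (B, c, h, w)
        b, c, _, _ = F.shape
        f_max = F.amax(dim=(2, 3))                # max pooling of each feature map -> (B, c)
        f_avg = F.mean(dim=(2, 3))                # average pooling of each feature map
        alpha = torch.sigmoid(self.mlp(f_max) + self.mlp(f_avg)).view(b, c, 1, 1)  # alpha_channel
        return F * alpha                          # F_channel: channel-wise reweighting of F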
and inputting the time sequence characteristics subjected to attention coding into Bi-LSTM processing, taking the obtained output of the last time step as the overall characteristics of the time sequence, compressing the overall characteristics into a one-dimensional vector, and recording the one-dimensional vector as a time sequence fusion characteristic vector.
Illustratively, when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to the gradient of the mean square error loss function, and when the loss value of the mean square error loss function is lower than a set threshold value, the multi-modal emotion recognition model stops training.
The time-series fusion feature vector obtained in step S14 is used as input, and a fully connected layer is used as the coding layer. The process is as follows:
L_x = relu(W_x * v + b_x);
where v is the time-series fusion feature vector obtained in step S14, L_x is the continuous-dimension emotion score output by the model, W_x and b_x are learnable parameters, the subscript x corresponds to different emotion terms, and relu() is the rectified linear unit activation function, defined as follows:
relu(x)=max(0,x);
using a mean square error loss function as an index of model convergence, updating model parameters according to the gradient of the loss function, and stopping training when a loss value is lower than a set threshold value, wherein the process is as follows:
Loss = (1/N) Σ_{i=1}^{N} (L_x^(i) − y_x^(i))^2, where y_x^(i) is the annotated continuous-dimension emotion score of the i-th sample and N is the number of samples.
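A PyTorch sketch of the coding layer and update rule described above; the optimizer, learning rate, stopping threshold, and the choice of two output terms (e.g., arousal and valence) are assumptions.

import torch
import torch.nn as nn

head = nn.Sequential(nn.Linear(128, 2), nn.ReLU())   # L_x = relu(W_x * v + b_x) for two emotion terms
criterion = nn.MSELoss()                             # mean square error as the convergence index
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
threshold = 1e-3

def train(loader, max_epochs=100):
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for v, y in loader:              # v: fusion feature vectors, y: annotated emotion scores
            optimizer.zero_grad()
            loss = criterion(head(v), y)
            loss.backward()              # update parameters along the gradient of the loss
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(loader) < threshold:
            break                        # stop training once the loss falls below the threshold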
and when the emotion recognition scene is subsequently migrated to a new emotion recognition scene, the model parameters can be synchronously migrated and updated according to the steps.
Compared with the prior art, the multi-modal face emotion recognition method provided by the embodiment of the invention collects data in a real scene for model training to obtain a normalized facial expression feature sequence and a normalized voice feature sequence, and then efficiently encodes the time-series fusion feature vector of the operator with the constructed attention-based time-series learning model. Throughout the coding process, the implicit emotion contained in the voice modality is integrated into face emotion recognition, and the different modalities are fused and aligned based on spatial attention and channel attention. Compared with the prior art, which uses the visual modality alone, the method provided by the embodiment of the invention can fully utilize the complementary information among different modalities, obtain a more stable effect when a single modality is noisy, and avoid the inconsistent recognition results caused by the prior-art approach of first performing single-frame recognition on a video and then fusing the results of different frames.
In addition, the embodiment of the invention carries out time sequence learning model training by collecting data in a real scene, solves the problem of inconsistent distribution of training data and real scene data in the prior art, and greatly improves the efficiency and accuracy rate of emotion recognition.
An embodiment of the present application provides a multi-modal face emotion recognition apparatus, including: the device comprises a data acquisition module, a frame extraction module, a visual mode module, an auditory mode module, a fusion coding module and an identification module.
And the data acquisition module is used for acquiring the video data of the operating personnel in the target scene.
The frame extracting module is used for extracting frames from the video data according to a preset time interval to obtain a multi-mode information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information.
And the visual modal module is used for extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face.
And the auditory modality module is used for processing the spectrogram corresponding to each frame of auditory modality information by using the convolutional neural network model to obtain a voice characteristic sequence.
And the fusion coding module is used for inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention to perform fusion coding to obtain a time sequence fusion feature vector.
And the recognition module is used for inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score and obtaining the face emotion of the operator according to the emotion score.
Illustratively, the visual modality module is specifically configured to:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
Illustratively, the hearing modality module is specifically configured to:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
In a possible implementation manner of the second aspect, the fusion coding module is specifically configured to:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
Illustratively, when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to the gradient of the mean square error loss function, and when the loss value of the mean square error loss function is lower than a set threshold value, the multi-modal emotion recognition model stops training.
Compared with the prior art, the multi-modal face emotion recognition device provided by the embodiment of the invention collects data in a real scene for model training to obtain a normalized facial expression feature sequence and a normalized voice feature sequence, and then efficiently encodes the time-series fusion feature vector of the operator with the constructed attention-based time-series learning model. Throughout the coding process, the implicit emotion contained in the voice modality is integrated into face emotion recognition, and the different modalities are fused and aligned based on spatial attention and channel attention. Compared with the prior art, which uses the visual modality alone, the device provided by the embodiment of the invention can fully utilize the complementary information among different modalities, obtain a more stable effect when a single modality is noisy, and avoid the inconsistent recognition results caused by the prior-art approach of first performing single-frame recognition on a video and then fusing the results of different frames.
In addition, the embodiment of the invention carries out time sequence learning model training by collecting data in a real scene, solves the problem of inconsistent distribution of training data and real scene data in the prior art, and greatly improves the efficiency and accuracy rate of emotion recognition.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A multi-modal face emotion recognition method is characterized by comprising the following steps:
acquiring video data of an operator in a target scene;
performing frame extraction on the video data according to a preset time interval to obtain a multi-mode information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information;
extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face;
processing the spectrogram corresponding to each frame of auditory modal information by using a convolutional neural network model to obtain a voice characteristic sequence;
inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention for fusion coding to obtain a time sequence fusion feature vector;
and inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score, and obtaining the face emotion of the operator according to the emotion score.
2. The multi-modal facial emotion recognition method of claim 1, wherein the extracting visual modal information of each frame to obtain a sequence of facial expression features related to the key points of the face specifically comprises:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
3. The multi-modal face emotion recognition method of claim 1, wherein the processing, using the convolutional neural network model, the spectrogram corresponding to each frame of auditory modal information to obtain the speech feature sequence specifically comprises:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
4. The multi-modal facial emotion recognition method of claim 1, wherein the fusion encoding of the normalized facial expression feature sequence and the normalized speech feature sequence with a time-series learning model based on attention to obtain a time-series fusion feature vector specifically comprises:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
5. The method for recognizing the multi-modal face emotion according to claim 1, wherein when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to a gradient of the mean square error loss function, and when a loss value of the mean square error loss function is lower than a set threshold value, the multi-modal emotion recognition model stops training.
6. A multimodal facial emotion recognition apparatus, comprising:
the data acquisition module is used for acquiring video data of the operating personnel in a target scene;
the frame extracting module is used for extracting frames from the video data according to a preset time interval to obtain a multi-mode information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information;
the visual modal module is used for extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face;
the auditory modality module is used for processing the spectrogram corresponding to each frame of auditory modality information by using a convolutional neural network model to obtain a voice characteristic sequence;
the fusion coding module is used for inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention to perform fusion coding to obtain a time sequence fusion feature vector;
and the recognition module is used for inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score and obtaining the face emotion of the operator according to the emotion score.
7. The multi-modal facial emotion recognition apparatus of claim 6, wherein the visual modality module is specifically configured to:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
8. The multi-modal facial emotion recognition apparatus of claim 6, wherein the auditory modality module is specifically configured to:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
9. The multi-modal facial emotion recognition apparatus of claim 6, wherein the fusion encoding module is specifically configured to:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
10. The multi-modal facial emotion recognition apparatus of claim 6, wherein when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to a gradient of the mean square error loss function, and when a loss value of the mean square error loss function is lower than a set threshold value, the multi-modal emotion recognition model stops training.
CN202210010258.4A 2022-01-05 2022-01-05 Multi-mode face emotion recognition method and device Pending CN114399818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210010258.4A CN114399818A (en) 2022-01-05 2022-01-05 Multi-mode face emotion recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210010258.4A CN114399818A (en) 2022-01-05 2022-01-05 Multi-mode face emotion recognition method and device

Publications (1)

Publication Number Publication Date
CN114399818A true CN114399818A (en) 2022-04-26

Family

ID=81228988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210010258.4A Pending CN114399818A (en) 2022-01-05 2022-01-05 Multi-mode face emotion recognition method and device

Country Status (1)

Country Link
CN (1) CN114399818A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035438A (en) * 2022-05-27 2022-09-09 中国科学院半导体研究所 Emotion analysis method and device and electronic equipment
CN115375809A (en) * 2022-10-25 2022-11-22 科大讯飞股份有限公司 Virtual image generation method, device, equipment and storage medium
CN115375809B (en) * 2022-10-25 2023-03-14 科大讯飞股份有限公司 Method, device and equipment for generating virtual image and storage medium
CN116129004B (en) * 2023-02-17 2023-09-15 华院计算技术(上海)股份有限公司 Digital person generating method and device, computer readable storage medium and terminal
CN116129004A (en) * 2023-02-17 2023-05-16 华院计算技术(上海)股份有限公司 Digital person generating method and device, computer readable storage medium and terminal
CN116912373A (en) * 2023-05-23 2023-10-20 苏州超次元网络科技有限公司 Animation processing method and system
CN116912373B (en) * 2023-05-23 2024-04-16 苏州超次元网络科技有限公司 Animation processing method and system
CN116458852B (en) * 2023-06-16 2023-09-01 山东协和学院 Rehabilitation training system and method based on cloud platform and lower limb rehabilitation robot
CN116458852A (en) * 2023-06-16 2023-07-21 山东协和学院 Rehabilitation training system and method based on cloud platform and lower limb rehabilitation robot
CN116561533A (en) * 2023-07-05 2023-08-08 福建天晴数码有限公司 Emotion evolution method and terminal for virtual avatar in educational element universe
CN116561533B (en) * 2023-07-05 2023-09-29 福建天晴数码有限公司 Emotion evolution method and terminal for virtual avatar in educational element universe
CN117556084A (en) * 2023-12-27 2024-02-13 环球数科集团有限公司 Video emotion analysis system based on multiple modes
CN117556084B (en) * 2023-12-27 2024-03-26 环球数科集团有限公司 Video emotion analysis system based on multiple modes

Similar Documents

Publication Publication Date Title
CN114399818A (en) Multi-mode face emotion recognition method and device
CN110516571B (en) Cross-library micro-expression recognition method and device based on optical flow attention neural network
Kuhnke et al. Two-stream aural-visual affect analysis in the wild
CN105447459B (en) A kind of unmanned plane detects target and tracking automatically
CN110287790B (en) Learning state hybrid analysis method oriented to static multi-user scene
CN110969106B (en) Multi-mode lie detection method based on expression, voice and eye movement characteristics
CN112115866A (en) Face recognition method and device, electronic equipment and computer readable storage medium
CN111985348B (en) Face recognition method and system
Wei et al. Real-time facial expression recognition for affective computing based on Kinect
CN111666845B (en) Small sample deep learning multi-mode sign language recognition method based on key frame sampling
CN111292765A (en) Bimodal emotion recognition method fusing multiple deep learning models
KR20180093632A (en) Method and apparatus of recognizing facial expression base on multi-modal
CN112241689A (en) Face recognition method and device, electronic equipment and computer readable storage medium
CN111079465A (en) Emotional state comprehensive judgment method based on three-dimensional imaging analysis
CN116226715A (en) Multi-mode feature fusion-based online polymorphic identification system for operators
Diyasa et al. Multi-face Recognition for the Detection of Prisoners in Jail using a Modified Cascade Classifier and CNN
Padhi et al. Hand Gesture Recognition using DenseNet201-Mediapipe Hybrid Modelling
CN116453024B (en) Video emotion recognition system and method
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
CN115798055A (en) Violent behavior detection method based on corersort tracking algorithm
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
CN113343773B (en) Facial expression recognition system based on shallow convolutional neural network
CN115170706A (en) Artificial intelligence neural network learning model construction system and construction method
CN114663796A (en) Target person continuous tracking method, device and system
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination