CN114399818A - Multi-mode face emotion recognition method and device - Google Patents

Multi-mode face emotion recognition method and device Download PDF

Info

Publication number
CN114399818A
Authority
CN
China
Prior art keywords
sequence
modal
frame
face
emotion recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210010258.4A
Other languages
Chinese (zh)
Inventor
刘羽中
李华亮
范圣平
沈雅利
王琪如
谢庭军
翟永昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Power Grid Co Ltd
Electric Power Research Institute of Guangdong Power Grid Co Ltd
Original Assignee
Guangdong Power Grid Co Ltd
Electric Power Research Institute of Guangdong Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Power Grid Co Ltd, Electric Power Research Institute of Guangdong Power Grid Co Ltd filed Critical Guangdong Power Grid Co Ltd
Priority to CN202210010258.4A priority Critical patent/CN114399818A/en
Publication of CN114399818A publication Critical patent/CN114399818A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L 25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L 25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a multi-modal face emotion recognition method and a device, wherein the method comprises the following steps: acquiring video data of an operator in a target scene; performing frame extraction on the video data according to a preset time interval to obtain a multi-mode information sequence; extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face; processing the spectrogram corresponding to each frame of auditory modal information by using a convolutional neural network model to obtain a voice characteristic sequence; inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention for fusion coding to obtain a time sequence fusion feature vector; and inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score, and obtaining the face emotion of the operator according to the emotion score. By adopting the embodiment provided by the invention, the emotion recognition efficiency and accuracy are greatly improved.

Description

Multi-mode face emotion recognition method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a multi-mode face emotion recognition method and device.
Background
Face emotion recognition is a major research hotspot in modern artificial intelligence, and the technology is widely applied in fields such as automatic driving, human-computer interaction, and health monitoring. In the health monitoring field, a health monitoring system captures the emotional state of an operator in real time by analyzing the operator's face and voice in live video, which helps detect and address mental health problems in a timely manner.
The field of emotion recognition mainly classifies emotional indicators into two categories: discrete dimensions and continuous dimensions. For the discrete dimensions, Ekman et al. defined 7 major expression states in early studies, including anger, fear, disgust, joy, sadness, surprise, and neutrality. For the continuous dimensions, researchers later expressed emotion in two dimensions: the arousal level is expressed by arousal, and the positivity level is expressed by valence. Because the continuous dimensions better match the continuity of human emotion, they have attracted more and more researchers in recent years. At present, the mainstream video face emotion recognition methods mainly extract frames from a video, extract face features from each image with a neural network, recognize specific emotions, and finally fuse the emotions of different video frames. Patent application No. 202010252867 (patent document 1) discloses an expression recognition method based on a lightweight convolutional neural network, which builds a lightweight convolutional network model with grouped convolution and detects and corrects the input image with a face corrector, thereby improving the recognition efficiency of the model. Patent application No. 202011345478 (patent document 2) discloses a facial expression recognition method, device and equipment based on deep learning, in which an original facial image is input to a generative adversarial network to generate a synthesized facial image, and the original facial image and the synthesized facial image are used together as a training set, thereby alleviating the over-fitting problem in the training strategy.
In the prior art, for example in patent documents 1 and 2, only a single visual modality is used for face emotion recognition, and the sound modality, which is important in a real office scene, is ignored; for example, if a person is tense and fearful, his or her voice becomes trembling and unnatural. In addition, the training data and test data in the prior art both come from the Internet; however, in practice there are significant differences in speech and language across scenes (for example, different devices and recording environments affect the quality of video collection), so the data distribution of the training data is inconsistent with that of the real test environment, and the model cannot be effectively generalized to real scenes. Moreover, the prior art mainly fuses multi-frame classifications after single-frame recognition, which is inefficient in operation and may produce incoherent emotion recognition results.
Disclosure of Invention
The embodiment of the invention provides a multi-modal face emotion recognition method and device, which fully utilize complementary information among different modalities, perform fusion alignment on the different modalities based on space attention and channel attention, and greatly improve the emotion recognition efficiency and accuracy.
In order to achieve the above object, a first aspect of embodiments of the present application provides a multimodal facial emotion recognition method, including:
acquiring video data of an operator in a target scene;
performing frame extraction on the video data according to a preset time interval to obtain a multi-mode information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information;
extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face;
processing the spectrogram corresponding to each frame of auditory modal information by using a convolutional neural network model to obtain a voice characteristic sequence;
inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention for fusion coding to obtain a time sequence fusion feature vector;
and inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score, and obtaining the face emotion of the operator according to the emotion score.
In a possible implementation manner of the first aspect, the extracting visual modality information of each frame to obtain a facial expression feature sequence related to a face key point specifically includes:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
In a possible implementation manner of the first aspect, the processing, by using the convolutional neural network model, the spectrogram corresponding to each frame of auditory modality information to obtain the speech feature sequence specifically includes:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
In a possible implementation manner of the first aspect, the inputting the normalized facial expression feature sequence and the normalized speech feature sequence into an attention-based time series learning model for fusion coding to obtain a time series fusion feature vector specifically includes:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
In a possible implementation manner of the first aspect, when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to a gradient of the mean square error loss function, and when a loss value of the mean square error loss function is lower than a set threshold, the multi-modal emotion recognition model stops training.
A second aspect of an embodiment of the present application provides a multimodal facial emotion recognition apparatus, including:
the data acquisition module is used for acquiring video data of the operating personnel in a target scene;
the frame extracting module is used for extracting frames from the video data according to a preset time interval to obtain a multi-mode information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information;
the visual modal module is used for extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face;
the auditory modality module is used for processing the spectrogram corresponding to each frame of auditory modality information by using a convolutional neural network model to obtain a voice characteristic sequence;
the fusion coding module is used for inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention to perform fusion coding to obtain a time sequence fusion feature vector;
and the recognition module is used for inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score and obtaining the face emotion of the operator according to the emotion score.
In a possible implementation manner of the second aspect, the visual modality module is specifically configured to:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
In one possible implementation form of the second aspect, the hearing modality module is specifically configured to:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
In a possible implementation manner of the second aspect, the fusion coding module is specifically configured to:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
In a possible implementation manner of the second aspect, when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to a gradient of the mean square error loss function, and when a loss value of the mean square error loss function is lower than a set threshold value, the multi-modal emotion recognition model stops training.
Compared with the prior art, the multi-modal face emotion recognition method and device provided by the embodiments of the invention collect data in a real scene for model training to obtain a normalized facial expression feature sequence and a normalized voice feature sequence, and efficiently encode the time-series fusion feature vector of the operator with the constructed attention-based time-series learning model. Throughout the coding process, the implicit emotion contained in the voice modality is integrated into face emotion recognition, and the different modalities are fused and aligned based on spatial attention and channel attention. Compared with the prior art, which uses the visual modality alone, the method and device provided by the embodiments of the invention can fully utilize the complementary information among different modalities, obtain a more stable effect when a single modality is noisy, and avoid the inconsistent recognition results caused by the prior-art approach of first performing single-frame recognition on a video and then fusing the results of different frames.
In addition, the embodiment of the invention carries out time sequence learning model training by collecting data in a real scene, solves the problem of inconsistent distribution of training data and real scene data in the prior art, and greatly improves the efficiency and accuracy rate of emotion recognition.
Drawings
Fig. 1 is a schematic flow chart of a multi-modal face emotion recognition method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a face key point marker used in an embodiment of the present invention;
FIG. 3 is a spectrogram corresponding to a single frame of auditory modality information provided by an embodiment of the present invention;
fig. 4 is a schematic diagram of an attention-based time sequence learning model according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, an embodiment of the present invention provides a method for recognizing a multi-modal face emotion, where the method includes:
and S10, acquiring the video data of the operator in the target scene.
S11, performing frame extraction on the video data according to a preset time interval to obtain a multi-modal information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information.
And S12, extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face.
And S13, processing the spectrogram corresponding to each frame of auditory modal information by using a convolutional neural network model to obtain a voice characteristic sequence.
And S14, inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention for fusion coding to obtain a time sequence fusion feature vector.
And S15, inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score, and obtaining the face emotion of the operator according to the emotion score.
The embodiment of the invention provides a multi-modal face emotion recognition method, which is a multi-modal continuous-dimension emotion recognition technique based on face video. By collecting a video data set in the real scene, it solves the prior-art problem that the distribution of the training data is inconsistent with that of the real-scene data, and for different real scenes the pre-trained model is migrated to the new scene with a continual learning method. Secondly, the embodiment of the invention encodes the information of different frames in the video with the attention-based time-series learning model, which improves the operating efficiency of the model. In addition, considering that the voice modality is also a non-negligible modality in real scenes, the embodiment of the invention integrates the implicit emotion contained in the voice modality into face emotion recognition, thereby improving the accuracy of face emotion recognition.
S12 and S13 are the processes of extracting the face features and voice features from the video data. After extraction is completed, fusion coding is carried out in S14 with the attention-based time-series learning model, so that multi-frame multi-modal information is processed at the same time instead of single-frame multi-modal information being recognized one by one, and the emotion recognition results obtained through the multi-modal emotion recognition model are therefore consistent.
Illustratively, in steps S10 and S11, the video data of workers in the real scene is collected, and each worker contributes a 3-minute speech video with audio. With a worker's speech video denoted S and the preset time interval set to 3 seconds, a picture frame is extracted from S every 3 seconds, and the audio within those 3 seconds is taken as the audio information corresponding to that picture frame. The multi-modal information sequence after video processing is S = [s_1, s_2, ..., s_n], where n is the length of the sequence; s_n = {v_n, a_n} is single-frame multi-modal information, where v_n is the visual modality information and a_n is the auditory modality information.
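As a concrete illustration of S10 and S11 (not part of the patent disclosure), the sketch below splits a speech video into such a sequence, assuming OpenCV and librosa are available and the audio track has been exported to a separate WAV file (e.g., with ffmpeg); the function name and the 3-second default interval are illustrative only.

import cv2
import librosa

def extract_multimodal_sequence(video_path, audio_path, interval_s=3.0, sr=16000):
    # Build s_1..s_n, where each s_k = (v_k, a_k) pairs one picture frame
    # with the audio segment of length interval_s starting at the same time.
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 1.0
    duration_s = cap.get(cv2.CAP_PROP_FRAME_COUNT) / fps
    sequence, t = [], 0.0
    while t + interval_s <= duration_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)        # seek to time t
        ok, frame = cap.read()                            # visual modality v_k
        if not ok:
            break
        audio, _ = librosa.load(audio_path, sr=sr,        # auditory modality a_k
                                offset=t, duration=interval_s)
        sequence.append((frame, audio))
        t += interval_s
    cap.release()
    return sequence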
Exemplarily, S12 specifically includes:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
The multi-modal frame information sequence extracted in step S11 is subjected to feature preprocessing. For the single-frame visual modality information v_n extracted in step S11, the invention uses a cascade residual regression tree model: a face-region alignment algorithm is realized by building cascaded regression trees, and a model structure is trained that can accurately mark key points along the important contours of the facial organs. Using this model structure, the face region is extracted from v_n and marked with 68 key points, covering the eyebrows, eyes, nose, lips, and the lower outline of the face, as shown in FIG. 2.
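For illustration only, a minimal sketch of this key-point step, assuming dlib's pre-trained 68-point shape predictor (an ensemble-of-regression-trees model, used here as a stand-in for the cascade residual regression tree model trained in the patent) and its standard model file name:

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(frame_bgr):
    # Returns the face rectangle and the 68 key points (eyebrows, eyes,
    # nose, lips, lower face outline) for the first detected face.
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None, None
    rect = faces[0]                       # assume one operator per frame
    shape = predictor(gray, rect)
    pts = np.array([(shape.part(i).x, shape.part(i).y) for i in range(68)],
                   dtype=np.float32)
    return rect, pts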
Let the length and width of the face region in the original image, in pixels, be H and W respectively. The coordinate set of the key points is recorded as {(x_i, y_i), i ∈ 0, 1, ..., 67}, where i = 30 is the key point at the nose; the coordinates of the other key points can be normalized and standardized on the basis of this central key point. The coordinates of each key point P_i = (x_i, y_i) are converted as follows:
[Formula given as an image in the original publication: normalization of the key point coordinates relative to the nose key point and the face region size H × W.]
Let d(i, j) denote the Euclidean distance between point P_i and point P_j, and let D(i) denote the set of key points in the face region associated with point P_i. The Euclidean distance between each pair of adjacent key points is standardized as d_nir(i, i+1), normalized as follows:
[Formula given as an image in the original publication: normalization of the inter-key-point Euclidean distances.]
Let a_1(i, j, k) denote the angle formed by line segment ij and line segment jk, and let a_2(i) denote the angle between the line connecting the nose point P_30 with point P_i and the vertical direction. The two defined angles are respectively calculated as:
[Formulas given as images in the original publication: the calculations of the angle features a_1(i, j, k) and a_2(i).]
The normalized distance-ratio features and the angle features of the main key points are spliced together, so that each set of 68 face key points can be converted into 97 features (66 + 8 + 14 + 7 + 2). That is, each facial expression can be converted either into the original one-dimensional vector P with n = 68 or into the converted one-dimensional vector P' with n = 97; the latter reflects the overall characteristics of the facial features and is closer to the way humans intuitively grasp facial expression characteristics.
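Because the exact normalization, distance-ratio, and angle formulas appear only as images in the original publication, the sketch below shows one plausible construction of such geometric features around the nose key point (index 30); it does not reproduce the patent's 97-feature layout and is an assumption for illustration.

import numpy as np

def keypoint_features(rect, pts):
    # pts: (68, 2) key point coordinates; rect: dlib face rectangle.
    H = rect.bottom() - rect.top()
    W = rect.right() - rect.left()
    nose = pts[30]
    norm = (pts - nose) / np.array([W, H], dtype=np.float32)  # nose-centred, size-normalized coords

    # distances between consecutive key points, normalized by the outer-eye-corner distance
    d = np.linalg.norm(np.diff(pts, axis=0), axis=1)
    eye_dist = np.linalg.norm(pts[45] - pts[36]) + 1e-6
    dist_ratio = d / eye_dist

    # angle between the nose-to-point line and the vertical direction
    vec = pts - nose
    angles = np.arctan2(vec[:, 0], vec[:, 1] + 1e-6)

    return np.concatenate([norm.ravel(), dist_ratio, angles]).astype(np.float32)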
Exemplarily, S13 specifically includes:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
For the single frame of auditory modality information a_n extracted in step S11, the embodiment of the invention uses the Python audio processing library librosa, which yields a spectrogram of size 496 × 369. As shown in FIG. 3, the horizontal direction is the time axis and the vertical direction is the frequency axis; the color of a pixel (t, w) in the spectrogram represents the volume at frequency w (in Hz) at time t, and, as indicated by the color bar on the right side of FIG. 3, a darker color represents a higher volume at that point. After the spectrogram is obtained, feature extraction is performed on it with a convolutional neural network model.
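A minimal sketch of this spectrogram step, assuming librosa and OpenCV; the STFT parameters are not given in the text, so n_fft and hop_length are assumptions, and the result is simply resized to the 496 × 369 size quoted above.

import cv2
import librosa
import numpy as np

def frame_spectrogram(audio, size=(496, 369)):
    # Magnitude spectrogram in dB: the value at (t, w) reflects the volume
    # at frequency w and time t, matching the description of FIG. 3.
    spec = np.abs(librosa.stft(audio, n_fft=1024, hop_length=256))
    spec_db = librosa.amplitude_to_db(spec, ref=np.max)
    return cv2.resize(spec_db, size, interpolation=cv2.INTER_LINEAR).astype(np.float32)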
Illustratively, the inputting the normalized facial expression feature sequence and the normalized voice feature sequence into an attention-based time sequence learning model for fusion coding to obtain a time sequence fusion feature vector specifically includes:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
First, the features obtained in steps S12 and S13 are normalized. A facial expression feature sequence or a speech feature sequence (an RGB image sequence) is recorded as the time-series feature sequence L_S(i) ∈ R^(t×c×w×h), where S(i) denotes the i-th sample of modality S, t is the sequence length, w and h are the width and height of the image respectively, and c is the number of channels. The feature values in each channel range from 0 to 255 and are divided by 255 for normalization before being input into the network.
Then, the normalized time-series feature sequences are input into the ResNet-Attention-Bi-LSTM network for fusion coding, as shown in FIG. 4. The training set is denoted L_S. Each modality sequence L_S(i) in L_S is unpacked frame by frame; the feature of the j-th frame is denoted L_S(i)_j and belongs to the training set L_S. The features over all i and j are randomly shuffled and combined into a new training set M, with M_i denoting the i-th feature. The training set M is processed by convolutional deep learning based on the common ResNet18. The last convolutional layer of the network is followed by an attention mechanism, consisting of space-based attention and channel-based attention. The feature output by the last convolutional layer is recorded as F ∈ R^(c×w×h); F can be seen as a combination of c feature maps of size w × h. In the space-based attention mechanism, max pooling and average pooling are applied at the same position across all feature maps to obtain two feature maps F_max, F_avg ∈ R^(1×w×h); the two are combined into R^(2×w×h) and processed by a convolutional layer and a Sigmoid function to learn a spatial attention map α_spatial ∈ R^(w×h). Multiplying the attention value of each position in α_spatial by the value at the corresponding position of each feature map in the original feature F gives the feature F_spatial after spatial attention processing. The process is as follows:
α_spatial = Sigmoid(Conv(Maxpool(F), Avgpool(F)))
F_spatial = α_spatial ⊗ F (position-wise multiplication, broadcast over the c feature maps)
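A PyTorch sketch of the spatial attention step described above (the 7×7 convolution kernel size is an assumption; it is not specified in the text):

import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, F):                        # F: (B, c, h, w)
        f_max, _ = F.max(dim=1, keepdim=True)    # max pooling across the c feature maps
        f_avg = F.mean(dim=1, keepdim=True)      # average pooling across the c feature maps
        alpha = torch.sigmoid(self.conv(torch.cat([f_max, f_avg], dim=1)))  # alpha_spatial
        return F * alpha                         # F_spatial: position-wise reweighting of F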
In the channel-based attention mechanism, max pooling and average pooling are first applied to each feature map separately to obtain F_max, F_avg ∈ R^(c×1×1), which are compressed to R^c. These two signals are then input to an MLP (multi-layer perceptron) for learning, which outputs F'_max, F'_avg ∈ R^c respectively. The two outputs are added element-wise along the channel dimension and passed through the Sigmoid function to form a channel attention vector α_channel ∈ R^c. Multiplying all values on each feature map in the original feature F by the attention value of the corresponding channel in α_channel gives the feature F_channel after channel attention processing. The process is as follows:
α_channel = Sigmoid(MLP(Maxpool(F)), MLP(Avgpool(F)))
F_channel = α_channel ⊗ F (channel-wise multiplication, broadcast over each w × h feature map)
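A PyTorch sketch of the channel attention step described above (the hidden size of the shared MLP, set by the reduction ratio, is an assumption):

import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, F):                         # F: (B, c, h, w)
        b, c, _, _ = F.shape
        f_max = F.amax(dim=(2, 3))                # max pooling of each feature map -> (B, c)
        f_avg = F.mean(dim=(2, 3))                # average pooling of each feature map
        alpha = torch.sigmoid(self.mlp(f_max) + self.mlp(f_avg)).view(b, c, 1, 1)  # alpha_channel
        return F * alpha                          # F_channel: channel-wise reweighting of F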
and inputting the time sequence characteristics subjected to attention coding into Bi-LSTM processing, taking the obtained output of the last time step as the overall characteristics of the time sequence, compressing the overall characteristics into a one-dimensional vector, and recording the one-dimensional vector as a time sequence fusion characteristic vector.
Illustratively, when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to the gradient of the mean square error loss function, and when the loss value of the mean square error loss function is lower than a set threshold value, the multi-modal emotion recognition model stops training.
The time-series fusion feature vector obtained in step S14 is used as input, and a fully connected layer is used as the coding layer. The process is as follows:
L_x = relu(W_x * v + b_x);
where v is the time-series fusion feature vector obtained in step S14, L_x is the continuous-dimension emotion score output by the model, W_x and b_x are learnable parameters, the subscript x corresponds to different emotion terms, and relu() is the rectified linear unit activation function, defined as follows:
relu(x)=max(0,x);
using a mean square error loss function as an index of model convergence, updating model parameters according to the gradient of the loss function, and stopping training when a loss value is lower than a set threshold value, wherein the process is as follows:
Loss = (1/N) Σ_{i=1}^{N} (L_x^(i) − y_x^(i))^2, where y_x^(i) is the annotated continuous-dimension emotion score of the i-th sample and N is the number of samples.
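A PyTorch sketch of the coding layer and update rule described above; the optimizer, learning rate, stopping threshold, and the choice of two output terms (e.g., arousal and valence) are assumptions.

import torch
import torch.nn as nn

head = nn.Sequential(nn.Linear(128, 2), nn.ReLU())   # L_x = relu(W_x * v + b_x) for two emotion terms
criterion = nn.MSELoss()                             # mean square error as the convergence index
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)
threshold = 1e-3

def train(loader, max_epochs=100):
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for v, y in loader:              # v: fusion feature vectors, y: annotated emotion scores
            optimizer.zero_grad()
            loss = criterion(head(v), y)
            loss.backward()              # update parameters along the gradient of the loss
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / len(loader) < threshold:
            break                        # stop training once the loss falls below the threshold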
and when the emotion recognition scene is subsequently migrated to a new emotion recognition scene, the model parameters can be synchronously migrated and updated according to the steps.
Compared with the prior art, the multi-modal face emotion recognition method provided by the embodiment of the invention collects data in a real scene for model training to obtain a normalized facial expression feature sequence and a normalized voice feature sequence, and then efficiently encodes the time-series fusion feature vector of the operator with the constructed attention-based time-series learning model. Throughout the coding process, the implicit emotion contained in the voice modality is integrated into face emotion recognition, and the different modalities are fused and aligned based on spatial attention and channel attention. Compared with the prior art, which uses the visual modality alone, the method provided by the embodiment of the invention can fully utilize the complementary information among different modalities, obtain a more stable effect when a single modality is noisy, and avoid the inconsistent recognition results caused by the prior-art approach of first performing single-frame recognition on a video and then fusing the results of different frames.
In addition, the embodiment of the invention carries out time sequence learning model training by collecting data in a real scene, solves the problem of inconsistent distribution of training data and real scene data in the prior art, and greatly improves the efficiency and accuracy rate of emotion recognition.
An embodiment of the present application provides a multi-modal face emotion recognition apparatus, including: the device comprises a data acquisition module, a frame extraction module, a visual mode module, an auditory mode module, a fusion coding module and an identification module.
And the data acquisition module is used for acquiring the video data of the operating personnel in the target scene.
The frame extracting module is used for extracting frames from the video data according to a preset time interval to obtain a multi-mode information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information.
And the visual modal module is used for extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face.
And the auditory modality module is used for processing the spectrogram corresponding to each frame of auditory modality information by using the convolutional neural network model to obtain a voice characteristic sequence.
And the fusion coding module is used for inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention to perform fusion coding to obtain a time sequence fusion feature vector.
And the recognition module is used for inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score and obtaining the face emotion of the operator according to the emotion score.
Illustratively, the visual modality module is specifically configured to:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
Illustratively, the hearing modality module is specifically configured to:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
In a possible implementation manner of the second aspect, the fusion coding module is specifically configured to:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
Illustratively, when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to the gradient of the mean square error loss function, and when the loss value of the mean square error loss function is lower than a set threshold value, the multi-modal emotion recognition model stops training.
Compared with the prior art, the multi-modal face emotion recognition device provided by the embodiment of the invention collects data in a real scene for model training to obtain a normalized facial expression feature sequence and a normalized voice feature sequence, and then efficiently encodes the time-series fusion feature vector of the operator with the constructed attention-based time-series learning model. Throughout the coding process, the implicit emotion contained in the voice modality is integrated into face emotion recognition, and the different modalities are fused and aligned based on spatial attention and channel attention. Compared with the prior art, which uses the visual modality alone, the device provided by the embodiment of the invention can fully utilize the complementary information among different modalities, obtain a more stable effect when a single modality is noisy, and avoid the inconsistent recognition results caused by the prior-art approach of first performing single-frame recognition on a video and then fusing the results of different frames.
In addition, the embodiment of the invention carries out time sequence learning model training by collecting data in a real scene, solves the problem of inconsistent distribution of training data and real scene data in the prior art, and greatly improves the efficiency and accuracy rate of emotion recognition.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims (10)

1. A multi-modal face emotion recognition method is characterized by comprising the following steps:
acquiring video data of an operator in a target scene;
performing frame extraction on the video data according to a preset time interval to obtain a multi-mode information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information;
extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face;
processing the spectrogram corresponding to each frame of auditory modal information by using a convolutional neural network model to obtain a voice characteristic sequence;
inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention for fusion coding to obtain a time sequence fusion feature vector;
and inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score, and obtaining the face emotion of the operator according to the emotion score.
2. The multi-modal facial emotion recognition method of claim 1, wherein the extracting visual modal information of each frame to obtain a sequence of facial expression features related to the key points of the face specifically comprises:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
3. The multi-modal face emotion recognition method of claim 1, wherein the processing, using the convolutional neural network model, the spectrogram corresponding to each frame of auditory modal information to obtain the speech feature sequence specifically comprises:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
4. The multi-modal facial emotion recognition method of claim 1, wherein the fusion encoding of the normalized facial expression feature sequence and the normalized speech feature sequence with a time-series learning model based on attention to obtain a time-series fusion feature vector specifically comprises:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
5. The method for recognizing the multi-modal face emotion according to claim 1, wherein when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to a gradient of the mean square error loss function, and when a loss value of the mean square error loss function is lower than a set threshold value, the multi-modal emotion recognition model stops training.
6. A multimodal facial emotion recognition apparatus, comprising:
the data acquisition module is used for acquiring video data of the operating personnel in a target scene;
the frame extracting module is used for extracting frames from the video data according to a preset time interval to obtain a multi-mode information sequence; the multi-modal information sequence comprises a plurality of frames of multi-modal information, and each frame of multi-modal information comprises a frame of visual modal information and a frame of auditory modal information;
the visual modal module is used for extracting visual modal information of each frame to obtain a facial expression feature sequence related to the key points of the human face;
the auditory modality module is used for processing the spectrogram corresponding to each frame of auditory modality information by using a convolutional neural network model to obtain a voice characteristic sequence;
the fusion coding module is used for inputting the normalized facial expression feature sequence and the normalized voice feature sequence into a time sequence learning model based on attention to perform fusion coding to obtain a time sequence fusion feature vector;
and the recognition module is used for inputting the time sequence fusion feature vector into a multi-modal emotion recognition model to obtain an emotion score and obtaining the face emotion of the operator according to the emotion score.
7. The multi-modal facial emotion recognition apparatus of claim 6, wherein the visual modality module is specifically configured to:
extracting a face region from each frame of visual modal information by using a cascade residual regression tree model, and marking face key points;
normalizing and standardizing coordinates of other face key points based on the face key points positioned at the nose;
calculating distance proportion characteristics and angle characteristics between key points according to the standardized coordinates of the key points of the human face, and splicing the standardized distance proportion characteristics and the standardized angle characteristics to obtain a facial expression characteristic sequence related to the key points of the human face.
8. The multi-modal facial emotion recognition apparatus of claim 6, wherein the auditory modality module is specifically configured to:
processing each frame of auditory modal information by adopting an audio processing library to obtain a spectrogram corresponding to each frame of auditory modal information;
and performing feature extraction on the plurality of spectrogram by using a convolutional neural network model to obtain a voice feature sequence.
9. The multi-modal facial emotion recognition apparatus of claim 6, wherein the fusion encoding module is specifically configured to:
combining the normalized sequence of facial expression features and the normalized sequence of speech features into a new training set;
performing convolution deep learning on the training set based on a ResNet18 network, and recording the output of the last convolution layer;
performing space attention processing and channel attention processing on the output of the last convolutional layer to obtain a coded time sequence characteristic;
inputting the coded time sequence characteristics into Bi-LSTM for processing to obtain overall characteristics corresponding to the coded time sequence characteristics;
and compressing the overall characteristics to obtain a one-dimensional time sequence fusion characteristic vector.
10. The multi-modal facial emotion recognition apparatus of claim 6, wherein when the multi-modal emotion recognition model is updated, a mean square error loss function is used as an index of model convergence, model parameters of the multi-modal emotion recognition model are updated according to a gradient of the mean square error loss function, and when a loss value of the mean square error loss function is lower than a set threshold value, the multi-modal emotion recognition model stops training.
CN202210010258.4A 2022-01-05 2022-01-05 Multi-mode face emotion recognition method and device Pending CN114399818A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210010258.4A CN114399818A (en) 2022-01-05 2022-01-05 Multi-mode face emotion recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210010258.4A CN114399818A (en) 2022-01-05 2022-01-05 Multi-mode face emotion recognition method and device

Publications (1)

Publication Number Publication Date
CN114399818A true CN114399818A (en) 2022-04-26

Family

ID=81228988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210010258.4A Pending CN114399818A (en) 2022-01-05 2022-01-05 Multi-mode face emotion recognition method and device

Country Status (1)

Country Link
CN (1) CN114399818A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035438A (en) * 2022-05-27 2022-09-09 中国科学院半导体研究所 Emotion analysis method and device and electronic equipment
CN115375809A (en) * 2022-10-25 2022-11-22 科大讯飞股份有限公司 Virtual image generation method, device, equipment and storage medium
CN115375809B (en) * 2022-10-25 2023-03-14 科大讯飞股份有限公司 Method, device and equipment for generating virtual image and storage medium
CN116129004B (en) * 2023-02-17 2023-09-15 华院计算技术(上海)股份有限公司 Digital person generating method and device, computer readable storage medium and terminal
CN116129004A (en) * 2023-02-17 2023-05-16 华院计算技术(上海)股份有限公司 Digital person generating method and device, computer readable storage medium and terminal
CN116912373A (en) * 2023-05-23 2023-10-20 苏州超次元网络科技有限公司 Animation processing method and system
CN116912373B (en) * 2023-05-23 2024-04-16 苏州超次元网络科技有限公司 Animation processing method and system
CN116458852B (en) * 2023-06-16 2023-09-01 山东协和学院 Rehabilitation training system and method based on cloud platform and lower limb rehabilitation robot
CN116458852A (en) * 2023-06-16 2023-07-21 山东协和学院 Rehabilitation training system and method based on cloud platform and lower limb rehabilitation robot
CN116561533A (en) * 2023-07-05 2023-08-08 福建天晴数码有限公司 Emotion evolution method and terminal for virtual avatar in educational element universe
CN116561533B (en) * 2023-07-05 2023-09-29 福建天晴数码有限公司 Emotion evolution method and terminal for virtual avatar in educational element universe
CN117556084A (en) * 2023-12-27 2024-02-13 环球数科集团有限公司 Video emotion analysis system based on multiple modes
CN117556084B (en) * 2023-12-27 2024-03-26 环球数科集团有限公司 Video emotion analysis system based on multiple modes

Similar Documents

Publication Publication Date Title
CN114399818A (en) Multi-mode face emotion recognition method and device
CN110516571B (en) Cross-library micro-expression recognition method and device based on optical flow attention neural network
Kuhnke et al. Two-stream aural-visual affect analysis in the wild
CN105447459B (en) A kind of unmanned plane detects target and tracking automatically
CN110287790B (en) Learning state hybrid analysis method oriented to static multi-user scene
CN110969106B (en) Multi-mode lie detection method based on expression, voice and eye movement characteristics
CN112115866A (en) Face recognition method and device, electronic equipment and computer readable storage medium
CN111985348B (en) Face recognition method and system
Wei et al. Real-time facial expression recognition for affective computing based on Kinect
CN111666845B (en) Small sample deep learning multi-mode sign language recognition method based on key frame sampling
CN111292765A (en) Bimodal emotion recognition method fusing multiple deep learning models
KR20180093632A (en) Method and apparatus of recognizing facial expression base on multi-modal
CN112241689A (en) Face recognition method and device, electronic equipment and computer readable storage medium
CN111079465A (en) Emotional state comprehensive judgment method based on three-dimensional imaging analysis
CN116226715A (en) Multi-mode feature fusion-based online polymorphic identification system for operators
Diyasa et al. Multi-face Recognition for the Detection of Prisoners in Jail using a Modified Cascade Classifier and CNN
Padhi et al. Hand Gesture Recognition using DenseNet201-Mediapipe Hybrid Modelling
CN116453024B (en) Video emotion recognition system and method
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
CN115798055A (en) Violent behavior detection method based on corersort tracking algorithm
CN116312512A (en) Multi-person scene-oriented audiovisual fusion wake-up word recognition method and device
CN113343773B (en) Facial expression recognition system based on shallow convolutional neural network
CN115170706A (en) Artificial intelligence neural network learning model construction system and construction method
CN114663796A (en) Target person continuous tracking method, device and system
CN114492579A (en) Emotion recognition method, camera device, emotion recognition device and storage device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination