CN113158727A - Bimodal fusion emotion recognition method based on video and voice information - Google Patents

Bimodal fusion emotion recognition method based on video and voice information

Info

Publication number
CN113158727A
CN113158727A (application CN202011613947.1A)
Authority
CN
China
Prior art keywords
voice
emotion
features
information
feature
Prior art date
Legal status
Pending
Application number
CN202011613947.1A
Other languages
Chinese (zh)
Inventor
臧景峰
史玉欢
王鑫磊
刘瑞
Current Assignee
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202011613947.1A
Publication of CN113158727A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174: Facial expression recognition
    • G06V 40/176: Dynamic expression
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a bimodal fusion emotion recognition method based on video information and voice information. Face image features and voice features are extracted from the video and voice information, the face image feature vectors and voice feature vectors are normalized, and the processed features are fed into a Bi-GRU network for training; the input features of the two single-modality sub-networks are then used to calculate the weight of the state information at each moment. The input features of the two single-modality sub-networks are fused to obtain a multimodal combined feature vector, which is taken as the input of a pre-trained deep neural network containing an emotion classifier. Different types of emotion evaluation information are obtained through the emotion classifier, so that the emotion evaluation information of the user is more objective and has reference value, and a more accurate emotion recognition result is obtained.

Description

Bimodal fusion emotion recognition method based on video and voice information
Technical Field
The application relates to the field of emotion recognition, in particular to a bimodal fusion emotion recognition method based on video and voice information.
Background
In general, the way humans naturally communicate and express emotions is multimodal: we can express emotions vocally or visually. When emotion is conveyed mainly through tone of voice, the audio data carries the main cues for emotion recognition; when emotion is conveyed mainly through facial expression, most of the cues needed to mine emotion are present in the face images. Recognizing emotion from multimodal information such as facial expressions, speech intonation and language content is therefore an interesting and challenging problem.
Traditional research on affective computing has mainly studied a single modality, such as speech emotion, video motion or face images. Although significant progress has been made in each of these fields, such single-modality emotion recognition is still largely at the research stage in practical applications. Because human emotional expression is complex and diverse, judging a person's emotion from a single form of expression yields a one-sided and unconvincing result and loses a great deal of valuable emotional information.
With the deepening development of artificial intelligence technology in the information era, more and more attention is being paid to research on affective computing. However, human emotion is complex and variable, and the accuracy of judging emotional characteristics from a single kind of information alone is low; the present invention is proposed to improve this accuracy.
Disclosure of Invention
The invention aims to provide a bimodal fusion emotion recognition method based on video and voice information, which fully utilizes bimodal fusion to obtain more emotion information and can judge the emotion state according to user voice and facial expression information.
The technical solution for realizing the purpose of the invention is as follows: a bimodal fusion emotion recognition method based on video information and voice information comprises the following steps:
Step one, video information and voice information are acquired through the camera and microphone of an external device, feature extraction is carried out on the video information and the voice information, and facial image features and voice features are extracted respectively.
Step two, after the video information and the voice information are obtained, the method further comprises preprocessing the video information and the voice information.
Step three, the preprocessing of the initial video information specifically comprises the following steps:
1) acquiring a video file to be processed, and analyzing the video file to obtain a video frame;
2) generating a histogram corresponding to the video frame based on the pixel information of the video frame, determining the definition of the video frame, and clustering the video frame according to the histogram and an edge detection operator to obtain at least one class; filtering repeated video frames in each class and video frames with the definition smaller than a definition threshold value;
3) carrying out face detection, alignment, rotation and size adjustment on the video frames by adopting a method based on a convolutional neural network to obtain the face image; a brief code sketch of the frame filtering in 2) follows this list.
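A minimal sketch of the frame filtering and sharpness screening in step three 2), assuming OpenCV; the histogram-similarity and sharpness thresholds are illustrative values, not taken from the patent.

```python
# Minimal sketch of the video-frame filtering: drop blurry frames and near-duplicates.
import cv2

def filter_frames(video_path, sim_thresh=0.98, sharp_thresh=100.0):
    cap = cv2.VideoCapture(video_path)
    kept, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Definition (sharpness) estimate: variance of the Laplacian; low value = blurry frame.
        if cv2.Laplacian(gray, cv2.CV_64F).var() < sharp_thresh:
            continue
        # Histogram comparison drops near-duplicate consecutive frames.
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None and cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) > sim_thresh:
            continue
        prev_hist = hist
        kept.append(frame)
    cap.release()
    return kept
```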
Step four, the preprocessing of the initial voice information specifically comprises the following steps:
1) pre-emphasizing the collected voice information to flatten the frequency spectrum of the signal;
2) performing frame windowing on the collected voice signals to obtain voice analysis frames;
3) carrying out short-time Fourier transform on the voice analysis frames to obtain a voice spectrogram; a short sketch of this preprocessing chain follows.
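A minimal sketch of the pre-emphasis, framing, Hamming windowing and short-time Fourier transform chain of step four, assuming NumPy; the pre-emphasis coefficient, frame length and hop size are assumed values.

```python
import numpy as np

def speech_spectrogram(signal, alpha=0.97, frame_len=400, hop=160):
    # 1) Pre-emphasis: y[n] = x[n] - alpha * x[n-1] flattens the signal spectrum.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2) Framing with a Hamming window gives the voice analysis frames.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # 3) Short-time Fourier transform -> log-magnitude spectrogram.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectrum + 1e-8).T   # shape: (freq_bins, n_frames)

# Example: one second of 16 kHz audio.
spec = speech_spectrogram(np.random.randn(16000))
```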
Step five, the extraction of the facial image features specifically comprises the following steps:
1) preparing a pre-trained deep convolutional neural network model, wherein the model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer;
2) inputting the face image into the pre-trained image feature model for processing, wherein the feature vector output by the full connection layer of the model is the face image feature vector; a feature-extraction sketch follows.
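The feature extraction in step five can be illustrated with a short sketch. The patent's own image feature model is not published, so torchvision's ImageNet-pretrained ResNet-18 is used here purely as a stand-in; the vector entering the final fully connected layer is exposed as the face image feature.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # expose the full-connection-layer input as the feature vector
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def face_feature(face_bgr):
    """face_bgr: HxWx3 uint8 face crop -> 512-dimensional feature vector."""
    x = preprocess(face_bgr[:, :, ::-1].copy())            # BGR -> RGB
    with torch.no_grad():
        return backbone(x.unsqueeze(0)).squeeze(0)          # shape: (512,)
```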
Step six, the extraction of the voice features specifically comprises the following steps:
1) pre-emphasizing the voice digital signal by using a first-order high-pass FIR digital filter, and performing frame processing on the pre-emphasized voice data by using a short-time analysis technology to obtain a voice characteristic parameter time sequence;
2) windowing the voice characteristic parameter time sequence by using a Hamming window function to obtain voice windowing data, and performing endpoint detection on the voice windowing data by using a double-threshold comparison method to obtain preprocessed voice data;
3) carrying out short-time Fourier transform on the preprocessed voice data to obtain a voice spectrogram;
4) inputting the voice spectrogram into the pre-trained AlexNet network to extract the voice feature data;
5) performing Correlation Feature Selection (CFS) on the feature data and filtering out irrelevant features with little correlation to the category label to obtain the final voice features; a simplified selection sketch follows.
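A simplified stand-in for the correlation-based selection in step six 5): features weakly correlated with the class label are dropped. This is a plain per-feature Pearson-correlation filter, not the full CFS merit search, and the correlation threshold is an assumed value.

```python
import numpy as np

def correlation_filter(features, labels, min_corr=0.1):
    """features: (n_samples, n_features); labels: (n_samples,) integer emotion classes."""
    labels = labels.astype(float)
    keep = []
    for j in range(features.shape[1]):
        col = features[:, j]
        if col.std() == 0:
            continue                       # a constant feature carries no class information
        r = np.corrcoef(col, labels)[0, 1]
        if abs(r) >= min_corr:             # keep only features with non-trivial label correlation
            keep.append(j)
    return features[:, keep], keep
```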
Step seven, the extracted face image features and voice features are normalized.
Step eight, the normalized feature vectors are respectively fed into Bi-GRU networks for training, features are extracted through the networks' maximum pooling and average pooling layers, and the correlation among the multi-modal state information and the attention distribution of each modality at each moment are then calculated; a Bi-GRU branch sketch follows.
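A minimal sketch of one single-modality Bi-GRU branch with maximum and average pooling, as described in step eight; PyTorch is assumed and the hidden size and input dimension are illustrative.

```python
import torch
import torch.nn as nn

class BiGRUBranch(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (batch, time, in_dim)
        states, _ = self.gru(x)                  # state information: (batch, time, 2*hidden)
        max_pooled = states.max(dim=1).values    # maximum pooling over time
        avg_pooled = states.mean(dim=1)          # average pooling over time
        return states, torch.cat([max_pooled, avg_pooled], dim=-1)

# The per-moment states feed the cross-modal attention of step nine;
# the pooled vector summarizes the whole branch.
branch = BiGRUBranch(in_dim=512)
states, pooled = branch(torch.randn(2, 30, 512))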
Step nine, the step of calculating the correlation among the multi-modal state information and the attention distribution of each modality at each moment specifically comprises the following steps:
1) since the state information between the modalities is taken into account, the weights focus on the state information of both modalities simultaneously, i.e. on the correlation between the state information of the two single-modality sub-networks at each moment. The correlation s_i is calculated as follows:

s_i = V \tanh(w_v h_i^v + w_a h_i^a + b_1) + b_2

wherein h_i^v is the state information output by the Bi-GRU network in the video-modality sub-network and w_v is its correlation weight; h_i^a is the state information output by the Bi-GRU network in the voice-modality sub-network and w_a is its correlation weight; b_1 is the correlation bias, b_2 is the fusion bias, V is the weight of the multimodal fusion, and tanh is the activation function.
2) From the multimodal state information s_i, the attention distribution at each moment in the multimodal setting, i.e. the weight a_i corresponding to the state information, is calculated as follows:

a_i = \mathrm{softmax}(s_i) = \exp(s_i) / \sum_k \exp(s_k)

where softmax is the normalized exponential function.
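The correlation and attention computation of step nine can be sketched as follows, following the formulas above; PyTorch is assumed, the tensor dimensions are illustrative, and the fusion bias b_2 is folded into the bias of the final linear layer.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, video_dim, audio_dim, att_dim=128):
        super().__init__()
        self.w_v = nn.Linear(video_dim, att_dim, bias=False)   # correlation weight for video states h_v
        self.w_a = nn.Linear(audio_dim, att_dim, bias=False)   # correlation weight for audio states h_a
        self.b1 = nn.Parameter(torch.zeros(att_dim))           # correlation bias b_1
        self.v = nn.Linear(att_dim, 1, bias=True)              # fusion weight V; its bias plays the role of b_2

    def forward(self, h_v, h_a):
        # h_v: (batch, time, video_dim), h_a: (batch, time, audio_dim)
        s = self.v(torch.tanh(self.w_v(h_v) + self.w_a(h_a) + self.b1))  # correlation s_i: (batch, time, 1)
        a = torch.softmax(s, dim=1)                                       # attention weights over time steps
        fused = (a * torch.cat([h_v, h_a], dim=-1)).sum(dim=1)            # attention-weighted joint representation
        return a, fused

att = CrossModalAttention(256, 256)
weights, fused = att(torch.randn(2, 30, 256), torch.randn(2, 30, 256))
```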
Step ten, feature fusion is carried out on the extracted facial expression features and the extracted voice features to obtain a combined feature vector, comprising the following steps:
1) performing dimensionality reduction on the feature vector after feature fusion by using the PCA method packaged in the sklearn tool library;
2) carrying out normalization processing on the feature vector obtained after the dimensionality reduction to obtain the combined feature vector of the two channels; a short fusion sketch follows.
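A short sketch of the fusion in step ten, assuming scikit-learn for the PCA dimensionality reduction and min-max normalization; the number of retained components is an assumed value.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def fuse_features(face_feats, voice_feats, n_components=64):
    """face_feats, voice_feats: (n_samples, d1) and (n_samples, d2) arrays."""
    joint = np.concatenate([face_feats, voice_feats], axis=1)    # feature-level fusion of the two channels
    reduced = PCA(n_components=n_components).fit_transform(joint)
    return MinMaxScaler().fit_transform(reduced)                 # combined feature vectors scaled to [0, 1]

joint_vecs = fuse_features(np.random.randn(100, 512), np.random.randn(100, 256))
```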
Step eleven, the combined feature vector is input into a pre-trained deep convolutional neural network which contains an emotion classifier; emotion evaluation information is acquired and the emotion of the user is judged.
Step twelve, the training of the pre-trained deep convolutional network comprises the following steps:
1) acquiring a face image open source emotion data set and a voice open source emotion data set, and acquiring face image emotion sample data and voice emotion sample data from the face image emotion data set and the voice emotion data set;
2) enhancing the face emotion sample data, extracting face image feature data and performing feature selection on the feature data to obtain the final face image feature data; performing short-time Fourier transform on the voice emotion sample data to obtain voice spectrograms, extracting voice feature data by using the AlexNet network, and performing feature selection on the feature data to obtain the final voice feature data;
3) respectively carrying out feature fusion on feature vectors with the same emotion label in the face image feature data and the voice feature data to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
4) the joint feature vector data set is trained using a temporal recurrent neural network; a minimal training sketch is given below.
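A compact sketch of step twelve 4), assuming PyTorch and an LSTM as the temporal recurrent neural network; the emotion label count, feature dimension and hyper-parameters are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, n_emotions=6):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, x):                 # x: (batch, time, feat_dim) joint feature sequences
        _, (h, _) = self.rnn(x)
        return self.head(h[-1])           # emotion logits

model = EmotionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy batch: 8 clips, 20 time steps of 64-dimensional joint features, 6 emotion classes.
x, y = torch.randn(8, 20, 64), torch.randint(0, 6, (8,))
for _ in range(5):                        # a few illustrative training steps
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```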
Advantageous effects: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
the invention realizes the bimodal fusion of video information and voice information, and is an innovation of emotion recognition. Emotion recognition based on video and voice information is more persuasive than a single information recognition result;
in addition, feature fusion is selected in the aspect of a bimodal fusion mode, complementary information among different modalities and mutual influence among the complementary information are effectively fused, and the obtained combined feature result can more comprehensively display the emotional state of a user;
in addition, an idea is provided for other dual-mode fusion or multi-mode fusion, and various functions can be continuously improved subsequently, so that the traditional single-mode emotion recognition technology can achieve the purpose of upgrading and updating.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a bimodal fusion emotion recognition method based on video and voice information provided by the present invention;
FIG. 2 is a schematic view of a video information processing flow provided by the present invention;
fig. 3 is a schematic diagram of a processing flow of voice information provided by the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
The invention is further illustrated by the following examples and figures of the specification.
Example 1
A bimodal fusion emotion recognition method based on video and voice information, as shown in FIG. 1, includes the following steps:
1) collecting voice signals and face images: carrying out non-contact acquisition on natural voice and a human face image by using a microphone and a camera;
the camera is a CMOS digital camera whose output electrical signal is directly amplified and converted into a digital signal;
the microphone is a digital MEMS microphone that outputs 1/2-period pulse-density-modulated digital signals;
2) signal preprocessing: respectively preprocessing signals of two modes, namely video signals and voice signals, so that the signals meet the input requirements of corresponding models of different modes;
3) and (3) emotion feature extraction: respectively extracting the features of the face image signal and the voice signal preprocessed in the step 2) to obtain corresponding feature vectors;
4) training feature vectors and calculating relevance and attention distributions: normalizing the feature vectors obtained in the step 3), then respectively transmitting the normalized feature vectors into a Bi-GRU network for training, and further extracting features through a maximum pooling layer and an average pooling layer of the network to calculate the correlation among multi-modal state information and the attention distribution of each modal at each moment;
5) and (3) emotional characteristic fusion: performing feature fusion on the facial image and the voice feature vectors extracted in the step 3) by adopting a corresponding method;
6) judging the emotion: inputting the fusion characteristics of the step 5) into a pre-trained deep convolutional neural network, wherein the deep convolutional neural network comprises an emotion classifier, acquiring emotion evaluation information through the emotion classifier, and judging emotion according to the emotion evaluation information of the user.
Example 2
The video information processing flow, as shown in fig. 2, includes the following steps:
1) acquiring a video file to be processed; analyzing the video file to obtain a video frame; filtering the video frame based on the pixel information of the video frame, and taking the video frame obtained after filtering as the image of the face emotion to be recognized;
2) generating a histogram corresponding to the video frame and determining the definition of the video frame based on the pixel information of the video frame; clustering the video frames according to the histogram and the edge detection operator to obtain at least one class; filtering repeated video frames in each class and video frames with the definition smaller than a definition threshold value;
3) based on the filtered video frame, carrying out face detection, alignment, rotation and size adjustment on the video frame by adopting a method based on a convolutional neural network to obtain a face image;
4) based on the face image, inputting the face image into an image feature extraction model obtained by pre-training for processing, and determining a feature vector output by a full connection layer in the image feature extraction model as the image feature vector;
5) normalizing the acquired face image features, then transmitting the normalized features into the Bi-GRU network for training, and further extracting features through the maximum pooling layer and average pooling layer of the network; a sketch of the CNN-based face detection and alignment in step 3) is given below.
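One possible realization of the CNN-based face detection and alignment of Example 2 step 3), using MTCNN from the facenet-pytorch package; the patent does not name a specific detector, so this library choice and the crop size are assumptions.

```python
from PIL import Image
from facenet_pytorch import MTCNN

detector = MTCNN(image_size=224, margin=20)   # detects, aligns, crops and resizes the face

def detect_face(frame_rgb: Image.Image):
    """frame_rgb: a filtered video frame as a PIL image -> (aligned face tensor or None, boxes)."""
    boxes, probs = detector.detect(frame_rgb)  # bounding boxes, useful for inspection
    face = detector(frame_rgb)                 # (3, 224, 224) aligned face crop, or None if no face found
    return face, boxes
```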
Example 3
The voice information processing flow, as shown in fig. 3, includes the following steps:
1) acquiring a human body voice signal by using a digital MEMS (micro electro mechanical system) microphone, pre-emphasizing the human body voice signal by using a first-order high-pass FIR (finite impulse response) digital filter, and outputting pre-emphasized voice data;
2) performing frame processing on the pre-emphasized voice data by using a short-time analysis technology to obtain a voice characteristic parameter time sequence;
3) windowing the voice characteristic parameter time sequence by using a Hamming window function to obtain voice windowing data;
4) carrying out endpoint detection on the voice windowing data by using a double-threshold comparison method to obtain preprocessed voice data;
5) carrying out short-time Fourier transform on the preprocessed voice data to obtain a voice spectrogram;
6) inputting the spectrogram into the pre-trained AlexNet network and taking the voice characteristic data from the fourth convolutional layer (Conv4);
7) performing feature selection on the feature data to obtain the final voice features;
8) normalizing the acquired voice features, transmitting the normalized voice features into the Bi-GRU network for training, and further extracting features through the maximum pooling layer and average pooling layer of the network; a sketch of the Conv4 feature extraction in step 6) follows.
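A sketch of Example 3 step 6), assuming torchvision's ImageNet-pretrained AlexNet; in torchvision's implementation the fourth convolutional layer (Conv4) is features[8], and the spectrogram is tiled to three channels and resized to the network's expected input, both of which are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
conv4_out = {}

def hook(module, inputs, output):
    conv4_out["feat"] = output.detach()

alexnet.features[8].register_forward_hook(hook)   # features[8] is the 4th Conv2d layer (Conv4)

def spectrogram_to_features(spec):
    """spec: (freq, time) voice spectrogram -> flattened Conv4 feature vector."""
    x = torch.as_tensor(spec, dtype=torch.float32)
    x = x.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)   # fake 3-channel image batch: (1, 3, freq, time)
    x = F.interpolate(x, size=(224, 224))             # resize to AlexNet's input size
    with torch.no_grad():
        alexnet(x)
    return conv4_out["feat"].flatten()
```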

Claims (9)

1. A bimodal fusion emotion recognition method based on video and voice information is characterized by comprising the following steps:
step 1: acquiring face information and voice information of a user with emotion to be recognized through a camera and a microphone of external equipment, inputting the face information and the voice information into a pre-trained feature extraction network, and respectively extracting face image features and voice features;
step 2: and normalizing the extracted human face image features and the extracted voice features, then transmitting the normalized human face image features and the extracted voice features into a Bi-GRU network for training, and calculating correlation and attention distribution of each mode at each moment through input features in two single-mode sub-networks.
step 3: carrying out feature fusion on the extracted face image features and the extracted voice features to obtain a combined feature vector, wherein the combined feature vector is obtained by fusing the face image features and the voice features with the same emotion labels and performing dimensionality reduction and normalization processing;
step 4: inputting the fused features into a pre-trained deep neural network, wherein the deep neural network comprises an emotion classifier and is used for acquiring different types of emotion evaluation information and finally evaluating the emotion of the user.
2. The method according to claim 1, wherein the video information is face image information.
3. The bimodal fusion emotion recognition method based on video and voice information, as claimed in claim 1, wherein said obtaining facial image information and extracting facial image features comprises the steps of:
step 1: acquiring a video file to be processed; analyzing the video file to obtain a video frame; filtering the video frame based on the pixel information of the video frame, and taking the video frame obtained after filtering as the image of the face emotion to be recognized;
step 2: generating a histogram corresponding to the video frame and determining the definition of the video frame based on the pixel information of the video frame; clustering the video frames according to the histogram and the edge detection operator to obtain at least one class; filtering repeated video frames in each class and video frames with the definition smaller than a definition threshold value;
step 3: based on the filtered video frames, carrying out face detection, alignment, rotation and size adjustment on the video frames by adopting a method based on a convolutional neural network to obtain a face image;
step 4: inputting the face image into an image feature extraction model obtained by pre-training for processing, and determining the feature vector output by the full connection layer in the image feature extraction model as the image feature vector, wherein the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
4. The method according to claim 1, wherein the voice information is obtained and the voice features are extracted by a pre-trained AlexNet network.
5. The extracting of speech features according to claim 4 comprises the steps of:
step 1: the method comprises the steps of acquiring an original voice signal of a human body by using a microphone, and preprocessing the voice signal to obtain a spectrogram.
Step 2: inputting the spectrogram into the pre-trained AlexNet network, passing through the input layer, the first and second convolution layers with their pooling layers, and the third convolution layer, and taking the obtained voice features from the fourth convolution layer (Conv4), wherein ReLU is used as the activation function at the output of each convolution layer.
6. The method for extracting the acquired voice features according to claim 5, implemented as follows: the similarity between features is measured using Correlation Feature Selection (CFS), and irrelevant features with little correlation to the category label are discarded. The evaluation criterion is:

\mathrm{Merit} = k \, \overline{r_{cf}} / \sqrt{k + k(k-1) \, \overline{r_{ff}}}

where \overline{r_{cf}} is the feature-classification relevance, k is the number of features, and \overline{r_{ff}} represents the correlation between features.
7. The bimodal fusion emotion recognition method based on video and voice information, as claimed in claim 1, wherein said normalizing said facial image features and voice features, and then transmitting them into Bi-GRU network for training, and calculating correlation and attention distribution comprises the steps of:
step 1: finding the maximum values of the face image features and the voice features, and dividing all feature vectors by the maximum value of the corresponding modality so that they fall within 0 to 1, which improves the speed of network training and convergence;
step 2: the Bi-GRU network combines the model architectures of the GRU and BRNN networks; the normalized feature vectors are respectively transmitted into the network for training, features are extracted through the maximum pooling layer and average pooling layer of the network, and the correlation among the multi-modal state information and the attention distribution of each modality at each moment are then calculated.
8. The bimodal fusion emotion recognition method based on video and voice information, as claimed in claim 1, wherein said feature fusion of said facial image features and said voice features to obtain a joint feature vector comprises the steps of:
step 1: performing dimensionality reduction on the feature vector after feature fusion by using the PCA method packaged in the sklearn tool library;
step 2: carrying out normalization processing on the feature vector obtained after the dimensionality reduction to obtain the combined feature vector of the two channels.
9. The bimodal fusion emotion recognition method based on video and voice information, as claimed in claim 1, wherein the training of the pre-trained deep convolutional network comprises the following steps:
step 1: acquiring a face image open source emotion data set and a voice open source emotion data set, and acquiring face image emotion sample data and voice emotion sample data from the face image emotion data set and the voice emotion data set;
step 2: enhancing the face emotion sample data, extracting face image feature data and performing feature selection on the feature data to obtain the final face image feature data; performing short-time Fourier transform on the voice emotion sample data to obtain voice spectrograms, extracting voice feature data by using the AlexNet network, and performing feature selection on the feature data to obtain the final voice feature data;
step 3: respectively carrying out feature fusion on the feature vectors with the same emotion label in the face image feature data and the voice feature data to obtain joint feature vectors corresponding to different emotion labels, which are used as the joint feature vector training data set of the character emotion recognition model;
step 4: the joint feature vector data set is trained using a temporal recurrent neural network.
CN202011613947.1A 2020-12-31 2020-12-31 Bimodal fusion emotion recognition method based on video and voice information Pending CN113158727A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011613947.1A CN113158727A (en) 2020-12-31 2020-12-31 Bimodal fusion emotion recognition method based on video and voice information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011613947.1A CN113158727A (en) 2020-12-31 2020-12-31 Bimodal fusion emotion recognition method based on video and voice information

Publications (1)

Publication Number Publication Date
CN113158727A 2021-07-23

Family

ID=76878273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011613947.1A Pending CN113158727A (en) 2020-12-31 2020-12-31 Bimodal fusion emotion recognition method based on video and voice information

Country Status (1)

Country Link
CN (1) CN113158727A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188343A (en) * 2019-04-22 2019-08-30 浙江工业大学 Multi-modal emotion identification method based on fusion attention network
CN111242155A (en) * 2019-10-08 2020-06-05 台州学院 Bimodal emotion recognition method based on multimode deep learning
CN111339913A (en) * 2020-02-24 2020-06-26 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing emotion of character in video
CN111563422A (en) * 2020-04-17 2020-08-21 五邑大学 Service evaluation obtaining method and device based on bimodal emotion recognition network

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807249B (en) * 2021-09-17 2024-01-12 广州大学 Emotion recognition method, system, device and medium based on multi-mode feature fusion
CN113807249A (en) * 2021-09-17 2021-12-17 广州大学 Multi-mode feature fusion based emotion recognition method, system, device and medium
CN114022192A (en) * 2021-10-20 2022-02-08 百融云创科技股份有限公司 Data modeling method and system based on intelligent marketing scene
CN113920568A (en) * 2021-11-02 2022-01-11 中电万维信息技术有限责任公司 Face and human body posture emotion recognition method based on video image
CN113742599A (en) * 2021-11-05 2021-12-03 太平金融科技服务(上海)有限公司深圳分公司 Content recommendation method, device, equipment and computer readable storage medium
CN114973490A (en) * 2022-05-26 2022-08-30 南京大学 Monitoring and early warning system based on face recognition
CN115100329A (en) * 2022-06-27 2022-09-23 太原理工大学 Multi-mode driving-based emotion controllable facial animation generation method
CN115496226A (en) * 2022-09-29 2022-12-20 中国电信股份有限公司 Multi-modal emotion analysis method, device, equipment and storage based on gradient adjustment
CN115424108A (en) * 2022-11-08 2022-12-02 四川大学 Cognitive dysfunction evaluation method based on audio-visual fusion perception
CN115424108B (en) * 2022-11-08 2023-03-28 四川大学 Cognitive dysfunction evaluation method based on audio-visual fusion perception
CN117349792A (en) * 2023-10-25 2024-01-05 中国人民解放军空军军医大学 Emotion recognition method based on facial features and voice features
CN117349792B (en) * 2023-10-25 2024-06-07 中国人民解放军空军军医大学 Emotion recognition method based on facial features and voice features
CN117152668A (en) * 2023-10-30 2023-12-01 成都方顷科技有限公司 Intelligent logistics implementation method, device and equipment based on Internet of things
CN117152668B (en) * 2023-10-30 2024-02-06 成都方顷科技有限公司 Intelligent logistics implementation method, device and equipment based on Internet of things
CN117312992A (en) * 2023-11-30 2023-12-29 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Emotion recognition method and system for fusion of multi-view face features and audio features
CN117312992B (en) * 2023-11-30 2024-03-12 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Emotion recognition method and system for fusion of multi-view face features and audio features
CN118055300A (en) * 2024-04-10 2024-05-17 深圳云天畅想信息科技有限公司 Cloud video generation method and device based on large model and computer equipment

Similar Documents

Publication Publication Date Title
CN113158727A (en) Bimodal fusion emotion recognition method based on video and voice information
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN108899050B (en) Voice signal analysis subsystem based on multi-modal emotion recognition system
CN108805087B (en) Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
CN108805089B (en) Multi-modal-based emotion recognition method
CN108877801B (en) Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system
CN112581979B (en) Speech emotion recognition method based on spectrogram
CN103413113A (en) Intelligent emotional interaction method for service robot
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN111128242B (en) Multi-mode emotion information fusion and identification method based on double-depth network
KR101910089B1 (en) Method and system for extracting Video feature vector using multi-modal correlation
CN115169507A (en) Brain-like multi-mode emotion recognition network, recognition method and emotion robot
CN112101096A (en) Suicide emotion perception method based on multi-mode fusion of voice and micro-expression
Shinde et al. Real time two way communication approach for hearing impaired and dumb person based on image processing
CN111079465A (en) Emotional state comprehensive judgment method based on three-dimensional imaging analysis
Rwelli et al. Gesture based Arabic sign language recognition for impaired people based on convolution neural network
CN111967361A (en) Emotion detection method based on baby expression recognition and crying
CN112418166A (en) Emotion distribution learning method based on multi-mode information
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
Saha et al. Towards automatic speech identification from vocal tract shape dynamics in real-time MRI
Kuang et al. Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Atkar et al. Speech Emotion Recognition using Dialogue Emotion Decoder and CNN Classifier
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Chinmayi et al. Emotion Classification Using Deep Learning

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination