CN113343860A - Bimodal fusion emotion recognition method based on video image and voice - Google Patents
Bimodal fusion emotion recognition method based on video image and voice
- Publication number
- CN113343860A (publication number); CN202110650544.2A (application number)
- Authority
- CN
- China
- Prior art keywords
- voice
- emotion
- emotion recognition
- training
- video image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The invention discloses a bimodal fusion emotion recognition method based on video images and voice. The system comprises a camera, a microphone and an emotion recognition unit, where the emotion recognition unit contains a video image modality and a voice modality. The bimodal model is trained as follows: the image training data set is input into a convolutional neural network model for training to obtain the video image modal model, and the voice training data set is input into a long short-term memory neural network model for training to obtain the voice modal model. The camera collects video images and sends them to the emotion recognition unit, which analyzes facial expression features to obtain a recognition result; the microphone collects voice data and sends it to the emotion recognition unit, which analyzes speech emotion features to obtain a recognition result. The recognition results of the two modalities are fused at the decision layer according to a weight criterion and the result is output. The recognition method adopted by the invention can improve the accuracy of emotion recognition and realize real-time detection.
Description
Technical Field
The invention relates to the field of emotion recognition, in particular to a bimodal fusion emotion recognition method based on video images and voice.
Background
With the rapid development of artificial intelligence technology, people hope for a more natural mode of interaction between AI and users, bringing a better user experience. From the perspective of engineering application value, emotion recognition is a research topic spanning fields such as machine vision, medicine and psychology; research on it not only promotes progress in these interdisciplines but also brings great commercial value and practical significance to society. According to the information analyzed, emotion recognition technology can currently be divided into two categories: one is based on physiological signals, such as electroencephalograms and electrocardiograms; the other analyzes emotional behavior such as facial expressions, body movements and speech. In practical use, two or more recognition modes are usually combined into a multimodal recognition scheme; multimodal fusion can improve recognition accuracy and offers better robustness. Because physiological parameters are difficult to acquire, that analysis mode is rarely adopted; body-movement recognition has low accuracy and usually serves only as an auxiliary mode; speech and facial expressions are comparatively easy to acquire yet give good recognition results, making them the most widely applied basis for emotion recognition.
The emotion recognition methods currently in use have the following main disadvantages: single-modality methods are the most common, and their accuracy is difficult to improve further; features must be extracted manually, so real-time processing cannot be achieved; and most fusion of different modalities uses feature-level fusion, which raises the feature dimensionality and again prevents real-time processing. These disadvantages make it difficult to raise recognition accuracy to a high level or to realize real-time emotion recognition, so improvements are necessary.
Disclosure of Invention
The invention aims to provide a bimodal fusion emotion recognition method based on video images and voice, and aims to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme: inputting the image training data set into an improved convolutional neural network model for training to obtain a video image modal model; inputting the voice training data set into an improved long-short term memory neural network model for training to obtain a voice mode model; the camera acquires real-time video image information and sends the real-time video image information to the emotion recognition unit; collecting real-time voice information by a microphone and sending the real-time voice information to an emotion recognition unit; the emotion recognition unit comprises a video image modality and a voice modality, and obtains a video emotion recognition result and a voice emotion recognition result respectively, and the recognition process is carried out on a trained neural network model; and performing decision layer fusion on the recognition result to obtain a bimodal recognition result as a final emotion recognition result.
The improved convolutional neural network model training method is specifically: after the training set images are converted to grayscale, the facial features of each image are extracted with the local binary statistical histogram method, and the emotion detail features of the face image are acquired with a sparse autoencoder; the two feature sets are fused and input into the improved convolutional neural network for training to obtain the video image modal model.
The improved long-short term memory neural network model training method specifically comprises the following steps: preprocessing and framing the training set voice data; four characteristics of the extracted data are respectively: prosodic features, mel cepstrum coefficients, nonlinear attribute features and nonlinear geometric features; and after the four features are processed by using a depth limited Boltzmann machine, the obtained depth features are input into an improved long-short term memory network and trained, so that a voice modal model can be obtained.
The improved neural network implementation is specifically as follows. For the convolutional neural network, the facial expression features acquired by the local binary statistical histogram method and by the sparse autoencoder are fused and input into the network; the depth of the network is increased, and, to eliminate the extra time consumption and overfitting caused by the increased depth, global mean pooling replaces the fully connected layer; the output layer uses Softmax to classify expressions and outputs the probability of each class, giving the recognition result of the video image modal model. For the long short-term memory network, a variable-weight back-propagation algorithm is used to optimize the network and enhance its nonlinear mapping capability; the network structure is deepened, and global mean pooling replaces the fully connected layer; the output layer uses Softmax to classify the speech emotion and outputs the probability of each class, giving the recognition result of the voice modal model.
The real-time emotion recognition scheme is specifically as follows: the camera collects video image information at a frame rate of 30 frames per second; each frame is input to the emotion recognition unit and processed by the trained video image modal model, which outputs a recognition result. The microphone collects voice information with a frame length of 33 milliseconds; each segment is input to the emotion recognition unit and processed by the trained voice modal model, which outputs a recognition result.
The emotion recognition unit specifically comprises a video image modality, a voice modality and a bimodal decision-layer fusion method. The two modalities consist of trained neural network models whose outputs are the probabilities of the different emotion types, the type with maximum probability being the recognized emotion. The decision-layer fusion method adds the outputs of the two modalities with weights, where the weight of the video image modality and the weight of the voice modality sum to 1.
Compared with the prior art, the beneficial effects of the invention are: the neural networks extract features automatically; both improved modal models replace the fully connected layer with global mean pooling, which raises the response speed of the network and enables real-time emotion recognition; the video modality fuses multiple features as input, improving the recognition accuracy of the video image modality; the voice modality uses a back-propagation algorithm to optimize the nonlinear processing capability of the model and extracts deep features, improving the recognition accuracy of the voice modality; and the bimodal fusion recognition method further improves accuracy over either single modality.
Drawings
FIG. 1 is a functional block diagram of the present invention.
FIG. 2 is a flow chart of real-time emotion recognition according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The invention provides a bimodal fusion emotion recognition method based on video images and voice, comprising: inputting the image training data set into an improved convolutional neural network model for training to obtain a video image modal model; inputting the voice training data set into an improved long short-term memory neural network model for training to obtain a voice modal model; collecting real-time video images and voice information with the camera and microphone and sending them to the emotion recognition unit; obtaining the recognition results of the two modalities in the emotion recognition unit, which comprises a video image modality and a voice modality, the recognition being carried out on the trained neural network models; and performing decision-layer fusion on the recognition results to obtain the bimodal recognition result as the final emotion recognition result.
According to the classification of the training data set, the emotion types can be divided into the same classes as the training data set; for example, into seven classes: anger, disgust, fear, happiness, sadness, surprise and neutral.
The improved convolutional neural network model is realized in the following mode: local binary histogram method (LBPH) + Sparse Autoencoder (SAE) + Convolutional Neural Network (CNN). After the image is converted into a gray level image, extracting the face features of the image by using a local binary statistical histogram method, and acquiring the emotion detail features of the face image by using a sparse automatic encoder; and after the two features are fused, inputting the fused features into an improved convolutional neural network for classification to obtain a video image modal classification result.
The local binary histogram method is realized as follows: the feature image of the Extended LBP operator is divided into local blocks and a histogram is extracted from each block. The grey value of the neighborhood center pixel serves as the threshold: each neighboring pixel is marked 1 if its grey value exceeds the center value and 0 otherwise; the block histograms are then concatenated in order to form the LBPH. For a given center point $(x_c, y_c)$, the position $(x_p, y_p)$ of the $p$-th neighborhood pixel, $0 \le p < P$, is given by equation (1):

$$x_p = x_c + R\cos\!\left(\frac{2\pi p}{P}\right), \qquad y_p = y_c - R\sin\!\left(\frac{2\pi p}{P}\right) \tag{1}$$

where R is the sampling radius, p indexes the sampling point, and P is the total number of samples.
In the local binary histogram method the computed neighbor position may not be an integer, that is, the point does not fall exactly on a pixel of the image, so bilinear interpolation is adopted to handle this case. Equation (2) is as follows:

$$f(x,y) \approx \begin{bmatrix} 1-x & x \end{bmatrix} \begin{bmatrix} f(0,0) & f(0,1) \\ f(1,0) & f(1,1) \end{bmatrix} \begin{bmatrix} 1-y \\ y \end{bmatrix} \tag{2}$$
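The neighbor-sampling and interpolation steps above can be sketched in NumPy. This is a minimal illustration, not the patent's implementation: the function name `lbp_code` and the single-pixel interface are my own, and block division plus histogram concatenation (the "H" in LBPH) are omitted.

```python
import numpy as np

def lbp_code(img, xc, yc, R=1, P=8):
    """Illustrative circular LBP code for the pixel at (xc, yc).

    Neighbor positions follow the cos/sin layout of equation (1);
    non-integer positions are resolved by bilinear interpolation
    as in equation (2).
    """
    center = img[yc, xc]
    code = 0
    for p in range(P):
        xp = xc + R * np.cos(2 * np.pi * p / P)
        yp = yc - R * np.sin(2 * np.pi * p / P)
        # bilinear interpolation of the neighbor grey value
        x0, y0 = int(np.floor(xp)), int(np.floor(yp))
        dx, dy = xp - x0, yp - y0
        val = (img[y0, x0] * (1 - dx) * (1 - dy)
               + img[y0, x0 + 1] * dx * (1 - dy)
               + img[y0 + 1, x0] * (1 - dx) * dy
               + img[y0 + 1, x0 + 1] * dx * dy)
        if val >= center:
            code |= 1 << p   # neighbor >= center -> bit set to 1
    return code
```

A bright isolated center pixel yields code 0 (all neighbors below threshold); a dark pixel in a bright field yields all bits set.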
the sparse automatic encoder is realized by the following modes: after an input image is compressed, sparse reconstruction is carried out, SAE is a 3-layer unsupervised network model, sparsity constraint is applied to a hidden layer, the number of hidden nodes is forced to be smaller than that of input nodes, and therefore the network can learn key features of the image. First, the average activity of the jth hidden neuron is calculated, and formula (3) is as follows:
in the formula, xiAnd n represents the sample and number of input layers, respectively;indicating the activation degree of the jth hidden neuron.
The sparse autoencoder: to satisfy the constraint condition, a sparsity penalty term S(x) is added to the cost function. Equation (4) is as follows:

$$S(x) = \sum_{j=1}^{m} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{m}\left[\rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right] \tag{4}$$

where $\rho$ is the sparsity target and $m$ is the number of hidden neurons.
the sparse automatic encoder comprises: after the constraint condition is satisfied, the overall cost function of the SAE network is as follows:
in equation (5), γ represents the weight of the sparsity penalty term; w and b represent the weight and offset of each layer of neurons, respectively. The parameters of the SAE network are adjusted through training, so that the total cost function is minimized, and the detailed characteristics of the input image can be captured.
The improvement of the convolutional neural network is as follows: the network mainly comprises convolutional layers, pooling layers, a global mean pooling (GAP) layer and an output layer. The global mean pooling layer replaces the fully connected layer, which effectively reduces the number of parameters; assuming the final convolutional output is an h × w × d feature map, GAP averages each h × w slice to a single value, producing a d-dimensional vector.
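The GAP operation just described is a one-liner in NumPy; the function name is illustrative.

```python
import numpy as np

def global_average_pooling(feature_map):
    """Collapse an h x w x d feature map to a d-vector by averaging
    each h x w channel slice, as the GAP layer does in place of a
    fully connected layer."""
    return feature_map.mean(axis=(0, 1))
```

Because GAP has no trainable parameters, swapping it in for a dense layer removes the h*w*d weight matrix entirely, which is the parameter saving the text refers to.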
A further improvement of the convolutional neural network: the convolution operation is a depthwise separable convolution, which greatly reduces computation. Assume the input feature map has spatial size $D_L \times D_L$ with $M$ channels, the kernel has spatial size $D_K \times D_K$, and the stride is 1. A standard convolution producing $N$ output channels costs $D_K \cdot D_K \cdot M \cdot N \cdot D_L \cdot D_L$ multiplications, while the depthwise separable form costs $D_K \cdot D_K \cdot M \cdot D_L \cdot D_L + M \cdot N \cdot D_L \cdot D_L$. Comparing the two gives equation (6):

$$\frac{D_K D_K M D_L D_L + M N D_L D_L}{D_K D_K M N D_L D_L} = \frac{1}{N} + \frac{1}{D_K^2} \tag{6}$$

where $D_L$ is the side length of the input feature map, $D_K$ is the spatial dimension of the kernel, M is the number of input channels and N is the number of output channels.
The convolutional neural network contains 6 convolutional layers with ReLU as the activation function; the convolutional layers are connected by max pooling, and the fully connected layer is replaced by global mean pooling. The input layer takes the LBPH and SAE features, and the output layer uses Softmax to classify expressions.
The improved long-short term memory neural network is realized by the following steps: a depth-limited boltzmann machine (DBM) + a long short-term memory neural network (LSTM). Processing the training set voice data by using an FFmpeg and Spleeter audio separation tool, and performing noise reduction, and then performing frame division processing by using a Hamming window function, wherein the frame length is set to be 33 ms; preprocessing and framing a voice signal, and extracting prosodic features, Mel cepstrum coefficients, nonlinear attribute features and nonlinear geometric features; and processing the four features by using a DBM to obtain depth features, inputting the depth features into an LSTM for classification, and obtaining a voice mode recognition result.
The voice signal preprocessing is implemented as follows: the FFmpeg tool extracts the audio track from the videos in the training data set, and the Spleeter source-separation tool distinguishes the human voice from background music. The extracted voice data is then denoised.
The framing operation is specifically: the Hamming window function $\omega(n)$ is multiplied with the speech signal $s(n)$ to obtain the windowed speech signal $s_\omega(n)$, completing the framing operation. The Hamming window formula is equation (7):

$$\omega(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \tag{7}$$

After framing is complete, feature extraction can be performed on the preprocessed speech segments.
The deep restricted Boltzmann machine is realized as follows: the DBM is formed by stacking several restricted Boltzmann machines (RBMs) from bottom to top, the output of each lower layer becoming the input of the layer above. The RBM is an energy-based probability-distribution model comprising a visible layer and a hidden layer; neurons within the same layer are independent, while bidirectional connections exist between neurons of different layers.
The DBM is composed of three stacked RBMs, with energy function:

$$E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) = -v^{T} W^{(1)} h^{(1)} - h^{(1)T} W^{(2)} h^{(2)} - h^{(2)T} W^{(3)} h^{(3)} \tag{8}$$
The DBM joint probability is given by equation (9):

$$P(v;\theta) = \frac{1}{Z(\theta)} \sum_{h^{(1)},h^{(2)},h^{(3)}} \exp\!\big(-E(v,h^{(1)},h^{(2)},h^{(3)};\theta)\big) \tag{9}$$

where $Z(\theta)$ is the partition function.
The DBM loss function is the negative log-likelihood over the training data, equation (10):

$$L(\theta) = -\sum_{i=1}^{n} \log P(v_i;\theta) \tag{10}$$
where the matrices W represent the weights of information flow in the network, the vectors a and b represent the biases, h and v represent the state vectors of the hidden and visible neurons, and θ denotes the parameter set consisting of W, a and b. The optimal parameters of each RBM are obtained by updating these parameters; the RBMs are then stacked into the DBM, and the output vector of the last RBM is the deep representation of the input features.
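The three-term energy of equation (8) can be evaluated directly. This sketch omits the bias terms, matching the printed formula; the function name is illustrative.

```python
import numpy as np

def dbm_energy(v, h1, h2, h3, W1, W2, W3):
    """Energy of the three-layer DBM:
    E = -v^T W1 h1 - h1^T W2 h2 - h2^T W3 h3
    (bias terms omitted, as in the printed equation)."""
    return -(v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3)
```

Lower energy corresponds to higher probability under the Boltzmann distribution, so training shapes the weights so that observed feature vectors receive low energy.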
The improvement mode of the long-short term memory neural network is as follows: the nonlinear mapping capability of the network is enhanced by using a variable weight back propagation algorithm (BP), and the processing speed of the system is improved by using a cross entropy cost function. Using GAP instead of full connectivity layer increases processing speed.
The BP algorithm uses gradient descent to adjust the weight $\omega_{ij}$ between nodes and the threshold $b_j$ of each node, as in equation (11):

$$\Delta\omega_{ij} = -\eta\,\frac{\partial E}{\partial \omega_{ij}}, \qquad \Delta b_{j} = -\eta\,\frac{\partial E}{\partial b_{j}} \tag{11}$$

where $\eta$ represents the learning rate of the neural network, $\partial$ denotes partial differentiation, and E denotes the standard error.
To solve the problem that the learning rate η keeps decreasing as the number of iterations grows, the improved BP neural network updates the learning rate according to equation (12), where m is a constant greater than 1 and less than 2, a is the iteration number, and S is the search range of the iterative learning rate.
The cross-entropy cost function, equation (13), is:

$$C = -\frac{1}{n}\sum_{i=1}^{n}\Big[y(x_i)\ln a(x_i) + \big(1-y(x_i)\big)\ln\big(1-a(x_i)\big)\Big] \tag{13}$$

where $x_i$ represents a speech datum, $y(x_i)$ denotes the label of $x_i$, $a(x_i)$ represents the output value for $x_i$, and n is the total amount of data. The cross-entropy cost adjusts weights quickly when the error is large and slowly when the error is small, improving the processing speed of the system.
The long short-term memory network is formed by stacking three LSTM layers: one input layer and two hidden layers. The input layer is connected to the DBM, and the output layer uses Softmax to classify the speech emotion.
The decision-layer fusion strategy is implemented as follows: the outputs of the video image channel and the voice channel are classification probabilities over the emotion types; they are fused according to the weight criterion formula to output the bimodal recognition result.
The weight criterion formula is shown as (14):

$$E = \arg\max_{k}\big(\alpha\, P_p(k) + \beta\, P_v(k)\big), \qquad \alpha + \beta = 1 \tag{14}$$

where E is the category of emotion, $P_p$ is the classification probability of the video image channel, $P_v$ is the classification probability of the voice channel, and α and β are the weights of the two channels; the specific weights may be adjusted according to the training data set and the model.
Table 1 below compares the recognition accuracy of the video image modality, the voice modality and the bimodal fusion provided in this embodiment; the data set selected for training and testing is CHEAVD 2.0 (a Chinese natural emotional audio-visual data set).
Table 1:
| Associated modality | Neural network model | Weight selection | Recognition accuracy |
| --- | --- | --- | --- |
| Video image | LBPH+SAE+CNN | 0.6 | 72.3% |
| Speech | DBM+LSTM | 0.4 | 62.8% |
| Bimodal | Fusion model | | 74.9% |
As can be seen from Table 1, the recognition accuracy of the bimodal method is higher than that of either the video image modality or the voice modality alone.
Fig. 1 is a schematic diagram of functional modules of the present invention, which specifically includes: the camera collects video image data, the microphone collects voice data, and the collected data are input into the emotion recognition unit; the emotion recognition unit comprises a video image modal model, a language modal model and a decision layer fusion method; and obtaining a classification result of each mode after the data is subjected to model processing, fusing the classification results in a decision layer, and outputting a final recognition result.
FIG. 2 is a flow chart of the real-time emotion recognition of the present invention, specifically: after an application program is started, a camera and a microphone collect data and input the data into an emotion recognition unit; the video image modal model analyzes the video data to obtain emotion classification, and the voice modal analyzes the voice data to obtain emotion classification; and fusing the recognition results of the two modes in a decision layer to output a bimodal recognition result.
The aspects of the present invention may be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable data processing apparatus to implement some or all of the functions of aspects described herein.
While the present invention has been described with reference to the above embodiments, its specific implementation is not limited to them; any person skilled in the art can easily conceive of changes and substitutions within the technical scope disclosed in the present application, and variations such as changing the data set, the number of emotion types or the weight parameters are all covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (7)
1. A bimodal fusion emotion recognition method based on video images and voice is characterized in that: inputting the image training data set into an improved convolutional neural network model for training to obtain a video image modal model; inputting the voice training data set into an improved long-short term memory neural network model for training to obtain a voice mode model; the camera acquires real-time video image information and sends the real-time video image information to the emotion recognition unit; collecting real-time voice information by a microphone and sending the real-time voice information to an emotion recognition unit; the emotion recognition unit comprises a video image modality and a voice modality, and obtains a video emotion recognition result and a voice emotion recognition result respectively, and the recognition process is carried out on a trained neural network model; and performing decision layer fusion on the recognition result to obtain a bimodal recognition result as a final emotion recognition result.
2. The improved convolutional neural network model training method of claim 1, wherein: after converting the training set image into a gray level image, extracting the face features of the image by using a local binary statistical histogram method, and acquiring the emotion detail features of the face image by using a sparse automatic encoder; and after the two features are fused, inputting the fused features into an improved convolutional neural network and training the fused features to obtain a video image modal model.
3. The method according to claim 1, wherein the improved long short-term memory neural network model is trained as follows: the training-set speech data are preprocessed and framed, and four kinds of features are extracted, namely: prosodic features, Mel-frequency cepstral coefficients, nonlinear attribute features, and nonlinear geometric features; after the four features are processed by a deep restricted Boltzmann machine, the resulting deep features are input into the improved long short-term memory network and trained to obtain the voice modal model.
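The "preprocessing and framing" step of claim 3 can be sketched as below. The patent gives no frame length, hop, or pre-emphasis coefficient, so the widely used 25 ms / 10 ms / 0.97 values are assumed here.

```python
import numpy as np

def preprocess_and_frame(signal, sr=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing of a speech signal;
    the windowed frames feed the later feature extraction (e.g. MFCCs)."""
    # boost high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)  # taper edges before spectral analysis

sig = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s test tone
frames = preprocess_and_frame(sig)
```

Each row of `frames` is one windowed analysis frame from which the four feature kinds named in the claim would then be computed.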
4. The method according to claim 2, wherein the improved convolutional neural network: receives the fused features as input; is made deeper, and, to avoid the extra time consumption and overfitting that the added depth would cause, replaces the fully connected layer with global average pooling; and uses Softmax at the output layer to classify expressions and output the probability of each class, yielding the recognition result of the video image modal model.
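The output head described in claim 4 (global average pooling followed by Softmax) can be sketched in a few lines. Shapes and weights here are illustrative: each channel's feature map collapses to one scalar, which is why this head needs far fewer parameters than a dense layer and is less prone to overfitting.

```python
import numpy as np

def gap_softmax_head(feature_maps, weights, bias):
    """Global average pooling over each channel, then a Softmax output layer
    producing class probabilities."""
    pooled = feature_maps.mean(axis=(1, 2))   # (channels,) one scalar per map
    logits = weights @ pooled + bias          # (n_classes,)
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()                    # probabilities over the classes

rng = np.random.default_rng(0)
maps = rng.normal(size=(64, 6, 6))            # 64 channels of 6x6 feature maps
probs = gap_softmax_head(maps, rng.normal(size=(7, 64)), np.zeros(7))
```

The 7-way output matches the common seven basic expression categories, though the patent deliberately leaves the number of emotion types variable.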
5. The method according to claim 3, wherein: the network is optimized with a variable-weight backpropagation algorithm to strengthen its nonlinear mapping capability; the network structure is deepened, and the fully connected layer is replaced with global average pooling; and the output layer classifies speech emotion with Softmax and outputs the probability of each class, yielding the recognition result of the voice modal model.
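The patent names a "variable-weight backpropagation algorithm" without defining it. One plausible reading, sketched below purely as an assumption, is standard backpropagation with an adaptive step size: grow the step while the loss keeps falling, shrink it when the loss rises.

```python
import numpy as np

def variable_lr_update(w, grad, lr, prev_loss, loss, up=1.05, down=0.7):
    """One hypothetical update step: adapt the learning rate from the loss
    trend, then apply the usual gradient-descent weight change."""
    lr = lr * (up if loss < prev_loss else down)
    return w - lr * grad, lr

w, lr = np.array([1.0, -2.0]), 0.1
w, lr = variable_lr_update(w, np.array([0.5, -0.5]), lr,
                           prev_loss=1.0, loss=0.8)  # loss fell: rate grows
```

Whatever the exact scheme, the claim's point is that the update rule, not the architecture alone, is tuned to improve the network's nonlinear mapping.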
6. The method according to claim 1, wherein real-time recognition proceeds as follows: video image information is collected by the camera, each image frame is input to the emotion recognition unit, and the trained video image modal model processes it and outputs a recognition result; voice information is collected by the microphone, each voice segment is input to the emotion recognition unit, and the trained voice modal model processes it and outputs a recognition result.
7. The method according to claim 1, wherein the emotion recognition unit comprises the video image modality, the voice modality, and a bimodal decision-layer fusion method; the two modalities consist of the trained neural network models, whose outputs are probabilities over the emotion classes, the class with the maximum probability being the recognized emotion; and the decision-layer fusion method is a weighted addition of the outputs of the two modalities, the weight of the video image modality and the weight of the voice modality summing to 1.
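The decision-layer fusion of claim 7 reduces to a weighted addition of the two probability vectors. The patent fixes only that the weights sum to 1; the value 0.6 below is illustrative, not taken from the specification.

```python
import numpy as np

def fuse_decisions(p_video, p_voice, w_video=0.6):
    """Weighted addition of the two modality outputs; the class with the
    maximum fused probability is the final emotion recognition result."""
    fused = w_video * np.asarray(p_video) + (1 - w_video) * np.asarray(p_voice)
    return fused, int(np.argmax(fused))

p_img = [0.1, 0.7, 0.2]    # video image modal model output
p_aud = [0.2, 0.3, 0.5]    # voice modal model output, disagreeing
fused, label = fuse_decisions(p_img, p_aud)  # fused probabilities, class index
```

Because the weights sum to 1, the fused vector remains a valid probability distribution, and the video modality's higher weight lets it override the voice model when the two disagree.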
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110650544.2A CN113343860A (en) | 2021-06-10 | 2021-06-10 | Bimodal fusion emotion recognition method based on video image and voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113343860A true CN113343860A (en) | 2021-09-03 |
Family
ID=77476688
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110650544.2A Pending CN113343860A (en) | 2021-06-10 | 2021-06-10 | Bimodal fusion emotion recognition method based on video image and voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113343860A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190311188A1 (en) * | 2018-12-05 | 2019-10-10 | Sichuan University | Face emotion recognition method based on dual-stream convolutional neural network |
CN110516696A (en) * | 2019-07-12 | 2019-11-29 | 东南大学 | It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression |
CN110556129A (en) * | 2019-09-09 | 2019-12-10 | 北京大学深圳研究生院 | Bimodal emotion recognition model training method and bimodal emotion recognition method |
CN110826466A (en) * | 2019-10-31 | 2020-02-21 | 南京励智心理大数据产业研究院有限公司 | Emotion identification method, device and storage medium based on LSTM audio-video fusion |
CN111242155A (en) * | 2019-10-08 | 2020-06-05 | 台州学院 | Bimodal emotion recognition method based on multimode deep learning |
WO2020173133A1 (en) * | 2019-02-27 | 2020-09-03 | 平安科技(深圳)有限公司 | Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium |
WO2020248376A1 (en) * | 2019-06-14 | 2020-12-17 | 平安科技(深圳)有限公司 | Emotion detection method and apparatus, electronic device, and storage medium |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113807249A (en) * | 2021-09-17 | 2021-12-17 | 广州大学 | Multi-mode feature fusion based emotion recognition method, system, device and medium |
CN113807249B (en) * | 2021-09-17 | 2024-01-12 | 广州大学 | Emotion recognition method, system, device and medium based on multi-mode feature fusion |
CN114533063A (en) * | 2022-02-23 | 2022-05-27 | 金华高等研究院(金华理工学院筹建工作领导小组办公室) | Multi-source monitoring combined emotion calculation system and method |
CN114533063B (en) * | 2022-02-23 | 2023-10-27 | 金华高等研究院(金华理工学院筹建工作领导小组办公室) | Multi-source monitoring combined emotion computing system and method |
CN114973490A (en) * | 2022-05-26 | 2022-08-30 | 南京大学 | Monitoring and early warning system based on face recognition |
CN115455129A (en) * | 2022-10-14 | 2022-12-09 | 阿里巴巴(中国)有限公司 | POI processing method and device, electronic equipment and storage medium |
CN115455129B (en) * | 2022-10-14 | 2023-08-25 | 阿里巴巴(中国)有限公司 | POI processing method, POI processing device, electronic equipment and storage medium |
CN117708375A (en) * | 2024-02-05 | 2024-03-15 | 北京搜狐新媒体信息技术有限公司 | Video processing method and device and related products |
CN117708375B (en) * | 2024-02-05 | 2024-05-28 | 北京搜狐新媒体信息技术有限公司 | Video processing method and device and related products |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112784798B (en) | Multi-modal emotion recognition method based on feature-time attention mechanism | |
CN110188343B (en) | Multi-mode emotion recognition method based on fusion attention network | |
CN113343860A (en) | Bimodal fusion emotion recognition method based on video image and voice | |
CN106682616B (en) | Method for recognizing neonatal pain expression based on two-channel feature deep learning | |
CN105976809B (en) | Identification method and system based on speech and facial expression bimodal emotion fusion | |
CN108960337B (en) | Multi-modal complex activity recognition method based on deep learning model | |
CN112784763B (en) | Expression recognition method and system based on local and overall feature adaptive fusion | |
CN108597541A (en) | A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying | |
CN111292765B (en) | Bimodal emotion recognition method integrating multiple deep learning models | |
CN110575663B (en) | Physical education auxiliary training method based on artificial intelligence | |
CN111128242B (en) | Multi-mode emotion information fusion and identification method based on double-depth network | |
CN113158727A (en) | Bimodal fusion emotion recognition method based on video and voice information | |
CN112818861A (en) | Emotion classification method and system based on multi-mode context semantic features | |
US20220328065A1 (en) | Speech emotion recognition method and system based on fused population information | |
CN112101096B (en) | Multi-mode fusion suicide emotion perception method based on voice and micro-expression | |
CN111242155A (en) | Bimodal emotion recognition method based on multimode deep learning | |
CN112380924B (en) | Depression tendency detection method based on facial micro expression dynamic recognition | |
CN111709284B (en) | Dance emotion recognition method based on CNN-LSTM | |
CN112989920A (en) | Electroencephalogram emotion classification system based on frame-level feature distillation neural network | |
Rwelli et al. | Gesture based Arabic sign language recognition for impaired people based on convolution neural network | |
CN114724224A (en) | Multi-mode emotion recognition method for medical care robot | |
CN111967361A (en) | Emotion detection method based on baby expression recognition and crying | |
CN115731595A (en) | Fuzzy rule-based multi-level decision fusion emotion recognition method | |
CN114209319B (en) | fNIRS emotion recognition method and system based on graph network and self-adaptive denoising | |
CN112529054B (en) | Multi-dimensional convolution neural network learner modeling method for multi-source heterogeneous data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
DD01 | Delivery of document by public notice |
Addressee: Wang Chuanyu Document name: Deemed withdrawal notice |