CN113343860A - Bimodal fusion emotion recognition method based on video image and voice - Google Patents
Bimodal fusion emotion recognition method based on video image and voice
- Publication number
- CN113343860A (publication number); CN202110650544.2A (application number)
- Authority
- CN
- China
- Prior art keywords
- voice
- emotion
- emotion recognition
- training
- video image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
Abstract
The invention discloses a bimodal fusion emotion recognition method based on video images and voice. The system comprises a camera, a microphone and an emotion recognition unit, where the emotion recognition unit contains a video image modality and a voice modality. The bimodal model is trained as follows: the image training data set is input into a convolutional neural network model for training to obtain the video image modal model, and the voice training data set is input into a long short-term memory neural network model for training to obtain the voice modal model. The camera collects video images and sends them to the emotion recognition unit, which analyzes facial expression features to obtain a recognition result; the microphone collects voice data and sends it to the emotion recognition unit, which analyzes speech emotion features to obtain a recognition result. The recognition results of the two modalities are fused at the decision layer according to a weight criterion and the result is output. The recognition method adopted by the invention can improve the accuracy of emotion recognition and realize real-time detection.
Description
Technical Field
The invention relates to the field of emotion recognition, in particular to a bimodal fusion emotion recognition method based on video images and voice.
Background
With the rapid development of artificial intelligence technology, people hope for a more natural mode of interaction between AI and users, bringing a better user experience. From the perspective of engineering application value, emotion recognition is a research topic spanning fields such as machine vision, medicine and psychology; research on it not only promotes progress in these interdisciplines but also brings great commercial value and practical significance to society. According to the information analyzed, emotion recognition technology can currently be divided into two categories: one is based on physiological signals, such as electroencephalograms and electrocardiograms; the other analyzes emotional behavior such as facial expressions, body movements and speech. In practical use, two or more recognition modes are usually combined into a multimodal recognition scheme; multimodal fusion can improve recognition accuracy and offers better robustness. Because physiological parameters are difficult to acquire, that analysis mode is rarely adopted; body-movement recognition has low accuracy and usually serves only as an auxiliary mode; speech and facial expressions are comparatively easy to acquire yet give good recognition results, making them the most widely applied basis for emotion recognition.
The emotion recognition methods currently in use have the following main disadvantages: single-modality methods are the most common, and their accuracy is difficult to improve further; features must be extracted manually, so real-time processing cannot be achieved; and most fusion of different modalities uses feature-level fusion, which raises the feature dimensionality and again prevents real-time processing. These disadvantages make it difficult to raise recognition accuracy to a high level or to realize real-time emotion recognition, so improvements are necessary.
Disclosure of Invention
The invention aims to provide a bimodal fusion emotion recognition method based on video images and voice, and aims to solve the problems in the background technology.
In order to achieve the purpose, the invention provides the following technical scheme: inputting the image training data set into an improved convolutional neural network model for training to obtain a video image modal model; inputting the voice training data set into an improved long-short term memory neural network model for training to obtain a voice mode model; the camera acquires real-time video image information and sends the real-time video image information to the emotion recognition unit; collecting real-time voice information by a microphone and sending the real-time voice information to an emotion recognition unit; the emotion recognition unit comprises a video image modality and a voice modality, and obtains a video emotion recognition result and a voice emotion recognition result respectively, and the recognition process is carried out on a trained neural network model; and performing decision layer fusion on the recognition result to obtain a bimodal recognition result as a final emotion recognition result.
The improved convolutional neural network model training method is specifically: after the training set images are converted to grayscale, the facial features of each image are extracted with the local binary statistical histogram method, and the emotion detail features of the face image are acquired with a sparse autoencoder; the two feature sets are fused and input into the improved convolutional neural network for training to obtain the video image modal model.
The improved long-short term memory neural network model training method specifically comprises the following steps: preprocessing and framing the training set voice data; four characteristics of the extracted data are respectively: prosodic features, mel cepstrum coefficients, nonlinear attribute features and nonlinear geometric features; and after the four features are processed by using a depth limited Boltzmann machine, the obtained depth features are input into an improved long-short term memory network and trained, so that a voice modal model can be obtained.
The improved neural network implementation is specifically as follows. For the convolutional neural network, the facial expression features acquired by the local binary statistical histogram method and by the sparse autoencoder are fused and input into the network; the depth of the network is increased, and, to eliminate the extra time consumption and overfitting caused by the increased depth, global mean pooling replaces the fully connected layer; the output layer uses Softmax to classify expressions and outputs the probability of each class, giving the recognition result of the video image modal model. For the long short-term memory network, a variable-weight back-propagation algorithm is used to optimize the network and enhance its nonlinear mapping capability; the network structure is deepened, and global mean pooling replaces the fully connected layer; the output layer uses Softmax to classify the speech emotion and outputs the probability of each class, giving the recognition result of the voice modal model.
The real-time emotion recognition scheme is specifically as follows: the camera collects video image information at a frame rate of 30 frames per second; each frame is input to the emotion recognition unit and processed by the trained video image modal model, which outputs a recognition result. The microphone collects voice information with a frame length of 33 milliseconds; each segment is input to the emotion recognition unit and processed by the trained voice modal model, which outputs a recognition result.
The emotion recognition unit specifically comprises a video image modality, a voice modality and a bimodal decision-layer fusion method. The two modalities consist of trained neural network models whose outputs are the probabilities of the different emotion types, the type with maximum probability being the recognized emotion. The decision-layer fusion method adds the outputs of the two modalities with weights, where the weight of the video image modality and the weight of the voice modality sum to 1.
Compared with the prior art, the beneficial effects of the invention are: the neural networks extract features automatically; both improved modal models replace the fully connected layer with global mean pooling, which raises the response speed of the network and enables real-time emotion recognition; the video modality fuses multiple features as input, improving the recognition accuracy of the video image modality; the voice modality uses a back-propagation algorithm to optimize the nonlinear processing capability of the model and extracts deep features, improving the recognition accuracy of the voice modality; and the bimodal fusion recognition method further improves accuracy over either single modality.
Drawings
FIG. 1 is a functional block diagram of the present invention.
FIG. 2 is a flow chart of real-time emotion recognition according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The invention provides a bimodal fusion emotion recognition method based on video images and voice, comprising: inputting the image training data set into an improved convolutional neural network model for training to obtain a video image modal model; inputting the voice training data set into an improved long short-term memory neural network model for training to obtain a voice modal model; collecting real-time video images and voice information with the camera and microphone and sending them to the emotion recognition unit; obtaining the recognition results of the two modalities in the emotion recognition unit, which comprises a video image modality and a voice modality, the recognition being carried out on the trained neural network models; and performing decision-layer fusion on the recognition results to obtain the bimodal recognition result as the final emotion recognition result.
According to the classification of the training data set, the emotion types can be divided into the same classes as the training data set; for example, into seven classes: anger, disgust, fear, happiness, sadness, surprise and neutral.
The improved convolutional neural network model is realized in the following mode: local binary histogram method (LBPH) + Sparse Autoencoder (SAE) + Convolutional Neural Network (CNN). After the image is converted into a gray level image, extracting the face features of the image by using a local binary statistical histogram method, and acquiring the emotion detail features of the face image by using a sparse automatic encoder; and after the two features are fused, inputting the fused features into an improved convolutional neural network for classification to obtain a video image modal classification result.
The local binary histogram method is realized as follows: the feature image of the Extended LBP operator is divided into local blocks and a histogram is extracted from each block. The grey value of the neighborhood center pixel serves as the threshold: each neighboring pixel is marked 1 if its grey value exceeds the center value and 0 otherwise; the block histograms are then concatenated in order to form the LBPH. For a given center point $(x_c, y_c)$, the position $(x_p, y_p)$ of the $p$-th neighborhood pixel, $0 \le p < P$, is given by equation (1):

$$x_p = x_c + R\cos\!\left(\frac{2\pi p}{P}\right), \qquad y_p = y_c - R\sin\!\left(\frac{2\pi p}{P}\right) \tag{1}$$

where R is the sampling radius, p indexes the sampling point, and P is the total number of samples.
In the local binary histogram method the computed neighbor position may not be an integer, that is, the point does not fall exactly on a pixel of the image, so bilinear interpolation is adopted to handle this case. Equation (2) is as follows:

$$f(x,y) \approx \begin{bmatrix} 1-x & x \end{bmatrix} \begin{bmatrix} f(0,0) & f(0,1) \\ f(1,0) & f(1,1) \end{bmatrix} \begin{bmatrix} 1-y \\ y \end{bmatrix} \tag{2}$$
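The neighbor-sampling and interpolation steps above can be sketched in NumPy. This is a minimal illustration, not the patent's implementation: the function name `lbp_code` and the single-pixel interface are my own, and block division plus histogram concatenation (the "H" in LBPH) are omitted.

```python
import numpy as np

def lbp_code(img, xc, yc, R=1, P=8):
    """Illustrative circular LBP code for the pixel at (xc, yc).

    Neighbor positions follow the cos/sin layout of equation (1);
    non-integer positions are resolved by bilinear interpolation
    as in equation (2).
    """
    center = img[yc, xc]
    code = 0
    for p in range(P):
        xp = xc + R * np.cos(2 * np.pi * p / P)
        yp = yc - R * np.sin(2 * np.pi * p / P)
        # bilinear interpolation of the neighbor grey value
        x0, y0 = int(np.floor(xp)), int(np.floor(yp))
        dx, dy = xp - x0, yp - y0
        val = (img[y0, x0] * (1 - dx) * (1 - dy)
               + img[y0, x0 + 1] * dx * (1 - dy)
               + img[y0 + 1, x0] * (1 - dx) * dy
               + img[y0 + 1, x0 + 1] * dx * dy)
        if val >= center:
            code |= 1 << p   # neighbor >= center -> bit set to 1
    return code
```

A bright isolated center pixel yields code 0 (all neighbors below threshold); a dark pixel in a bright field yields all bits set.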
the sparse automatic encoder is realized by the following modes: after an input image is compressed, sparse reconstruction is carried out, SAE is a 3-layer unsupervised network model, sparsity constraint is applied to a hidden layer, the number of hidden nodes is forced to be smaller than that of input nodes, and therefore the network can learn key features of the image. First, the average activity of the jth hidden neuron is calculated, and formula (3) is as follows:
in the formula, xiAnd n represents the sample and number of input layers, respectively;indicating the activation degree of the jth hidden neuron.
The sparse autoencoder: to satisfy the constraint condition, a sparsity penalty term S(x) is added to the cost function. Equation (4) is as follows:

$$S(x) = \sum_{j=1}^{m} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{m}\left[\rho\log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right] \tag{4}$$

where $\rho$ is the sparsity target and $m$ is the number of hidden neurons.
the sparse automatic encoder comprises: after the constraint condition is satisfied, the overall cost function of the SAE network is as follows:
in equation (5), γ represents the weight of the sparsity penalty term; w and b represent the weight and offset of each layer of neurons, respectively. The parameters of the SAE network are adjusted through training, so that the total cost function is minimized, and the detailed characteristics of the input image can be captured.
The improvement of the convolutional neural network is as follows: the network mainly comprises convolutional layers, pooling layers, a global mean pooling (GAP) layer and an output layer. The global mean pooling layer replaces the fully connected layer, which effectively reduces the number of parameters; assuming the final convolutional output is an h × w × d feature map, GAP averages each h × w slice to a single value, producing a d-dimensional vector.
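The GAP operation just described is a one-liner in NumPy; the function name is illustrative.

```python
import numpy as np

def global_average_pooling(feature_map):
    """Collapse an h x w x d feature map to a d-vector by averaging
    each h x w channel slice, as the GAP layer does in place of a
    fully connected layer."""
    return feature_map.mean(axis=(0, 1))
```

Because GAP has no trainable parameters, swapping it in for a dense layer removes the h*w*d weight matrix entirely, which is the parameter saving the text refers to.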
A further improvement of the convolutional neural network: the convolution operation is a depthwise separable convolution, which greatly reduces computation. Assume the input feature map has spatial size $D_L \times D_L$ with $M$ channels, the kernel has spatial size $D_K \times D_K$, and the stride is 1. A standard convolution producing $N$ output channels costs $D_K \cdot D_K \cdot M \cdot N \cdot D_L \cdot D_L$ multiplications, while the depthwise separable form costs $D_K \cdot D_K \cdot M \cdot D_L \cdot D_L + M \cdot N \cdot D_L \cdot D_L$. Comparing the two gives equation (6):

$$\frac{D_K D_K M D_L D_L + M N D_L D_L}{D_K D_K M N D_L D_L} = \frac{1}{N} + \frac{1}{D_K^2} \tag{6}$$

where $D_L$ is the side length of the input feature map, $D_K$ is the spatial dimension of the kernel, M is the number of input channels and N is the number of output channels.
The convolutional neural network contains 6 convolutional layers with ReLU as the activation function; the convolutional layers are connected by max pooling, and the fully connected layer is replaced by global mean pooling. The input layer takes the LBPH and SAE features, and the output layer uses Softmax to classify expressions.
The improved long-short term memory neural network is realized by the following steps: a depth-limited boltzmann machine (DBM) + a long short-term memory neural network (LSTM). Processing the training set voice data by using an FFmpeg and Spleeter audio separation tool, and performing noise reduction, and then performing frame division processing by using a Hamming window function, wherein the frame length is set to be 33 ms; preprocessing and framing a voice signal, and extracting prosodic features, Mel cepstrum coefficients, nonlinear attribute features and nonlinear geometric features; and processing the four features by using a DBM to obtain depth features, inputting the depth features into an LSTM for classification, and obtaining a voice mode recognition result.
The voice signal preprocessing is implemented as follows: the FFmpeg tool extracts the audio track from the videos in the training data set, and the Spleeter source-separation tool distinguishes the human voice from background music. The extracted voice data is then denoised.
The framing operation is specifically: the Hamming window function $\omega(n)$ is multiplied with the speech signal $s(n)$ to obtain the windowed speech signal $s_\omega(n)$, completing the framing operation. The Hamming window formula is equation (7):

$$\omega(n) = 0.54 - 0.46\cos\!\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \tag{7}$$

After framing is complete, feature extraction can be performed on the preprocessed speech segments.
The deep restricted Boltzmann machine is realized as follows: the DBM is formed by stacking several restricted Boltzmann machines (RBMs) from bottom to top, the output of each lower layer becoming the input of the layer above. The RBM is an energy-based probability-distribution model comprising a visible layer and a hidden layer; neurons within the same layer are independent, while bidirectional connections exist between neurons of different layers.
The DBM is composed of three stacked RBMs, with energy function:

$$E(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta) = -v^{T} W^{(1)} h^{(1)} - h^{(1)T} W^{(2)} h^{(2)} - h^{(2)T} W^{(3)} h^{(3)} \tag{8}$$
The DBM joint probability is given by equation (9):

$$P(v;\theta) = \frac{1}{Z(\theta)} \sum_{h^{(1)},h^{(2)},h^{(3)}} \exp\!\big(-E(v,h^{(1)},h^{(2)},h^{(3)};\theta)\big) \tag{9}$$

where $Z(\theta)$ is the partition function.
The DBM loss function is the negative log-likelihood over the training data, equation (10):

$$L(\theta) = -\sum_{i=1}^{n} \log P(v_i;\theta) \tag{10}$$
where the matrices W represent the weights of information flow in the network, the vectors a and b represent the biases, h and v represent the state vectors of the hidden and visible neurons, and θ denotes the parameter set consisting of W, a and b. The optimal parameters of each RBM are obtained by updating these parameters; the RBMs are then stacked into the DBM, and the output vector of the last RBM is the deep representation of the input features.
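The three-term energy of equation (8) can be evaluated directly. This sketch omits the bias terms, matching the printed formula; the function name is illustrative.

```python
import numpy as np

def dbm_energy(v, h1, h2, h3, W1, W2, W3):
    """Energy of the three-layer DBM:
    E = -v^T W1 h1 - h1^T W2 h2 - h2^T W3 h3
    (bias terms omitted, as in the printed equation)."""
    return -(v @ W1 @ h1 + h1 @ W2 @ h2 + h2 @ W3 @ h3)
```

Lower energy corresponds to higher probability under the Boltzmann distribution, so training shapes the weights so that observed feature vectors receive low energy.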
The improvement mode of the long-short term memory neural network is as follows: the nonlinear mapping capability of the network is enhanced by using a variable weight back propagation algorithm (BP), and the processing speed of the system is improved by using a cross entropy cost function. Using GAP instead of full connectivity layer increases processing speed.
The BP algorithm uses gradient descent to adjust the weight $\omega_{ij}$ between nodes and the threshold $b_j$ of each node, as in equation (11):

$$\Delta\omega_{ij} = -\eta\,\frac{\partial E}{\partial \omega_{ij}}, \qquad \Delta b_{j} = -\eta\,\frac{\partial E}{\partial b_{j}} \tag{11}$$

where $\eta$ represents the learning rate of the neural network, $\partial$ denotes partial differentiation, and E denotes the standard error.
To solve the problem that the learning rate η keeps decreasing as the number of iterations grows, the improved BP neural network updates the learning rate according to equation (12), where m is a constant greater than 1 and less than 2, a is the iteration number, and S is the search range of the iterative learning rate.
The cross-entropy cost function, equation (13), is:

$$C = -\frac{1}{n}\sum_{i=1}^{n}\Big[y(x_i)\ln a(x_i) + \big(1-y(x_i)\big)\ln\big(1-a(x_i)\big)\Big] \tag{13}$$

where $x_i$ represents a speech datum, $y(x_i)$ denotes the label of $x_i$, $a(x_i)$ represents the output value for $x_i$, and n is the total amount of data. The cross-entropy cost adjusts weights quickly when the error is large and slowly when the error is small, improving the processing speed of the system.
The long short-term memory network is formed by stacking three LSTM layers: one input layer and two hidden layers. The input layer is connected to the DBM, and the output layer uses Softmax to classify the speech emotion.
The decision-layer fusion strategy is implemented as follows: the outputs of the video image channel and the voice channel are classification probabilities over the emotion types; they are fused according to the weight criterion formula to output the bimodal recognition result.
The weight criterion formula is shown as (14):

$$E = \arg\max_{k}\big(\alpha\, P_p(k) + \beta\, P_v(k)\big), \qquad \alpha + \beta = 1 \tag{14}$$

where E is the category of emotion, $P_p$ is the classification probability of the video image channel, $P_v$ is the classification probability of the voice channel, and α and β are the weights of the two channels; the specific weights may be adjusted according to the training data set and the model.
Table 1 below compares the recognition accuracy of the video image modality, the voice modality and the bimodal fusion provided in this embodiment; the data set selected for training and testing is CHEAVD 2.0 (a Chinese natural emotional audio-visual data set).
Table 1:
| Associated modality | Neural network model | Weight selection | Recognition accuracy |
| --- | --- | --- | --- |
| Video image | LBPH+SAE+CNN | 0.6 | 72.3% |
| Speech | DBM+LSTM | 0.4 | 62.8% |
| Bimodal | Fusion model | | 74.9% |
As can be seen from Table 1, the recognition accuracy of the bimodal method is higher than that of either the video image modality or the voice modality alone.
Fig. 1 is a schematic diagram of functional modules of the present invention, which specifically includes: the camera collects video image data, the microphone collects voice data, and the collected data are input into the emotion recognition unit; the emotion recognition unit comprises a video image modal model, a language modal model and a decision layer fusion method; and obtaining a classification result of each mode after the data is subjected to model processing, fusing the classification results in a decision layer, and outputting a final recognition result.
FIG. 2 is a flow chart of the real-time emotion recognition of the present invention, specifically: after an application program is started, a camera and a microphone collect data and input the data into an emotion recognition unit; the video image modal model analyzes the video data to obtain emotion classification, and the voice modal analyzes the voice data to obtain emotion classification; and fusing the recognition results of the two modes in a decision layer to output a bimodal recognition result.
The aspects of the present invention may be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable data processing apparatus to implement some or all of the functions of aspects described herein.
While the present invention has been described with reference to the above embodiments, its specific implementation is not limited to them; any person skilled in the art can easily conceive of changes and substitutions within the technical scope disclosed in the present application, and variations such as changing the data set, the number of emotion types or the weight parameters are all covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (7)
1. A bimodal fusion emotion recognition method based on video images and voice is characterized in that: inputting the image training data set into an improved convolutional neural network model for training to obtain a video image modal model; inputting the voice training data set into an improved long-short term memory neural network model for training to obtain a voice mode model; the camera acquires real-time video image information and sends the real-time video image information to the emotion recognition unit; collecting real-time voice information by a microphone and sending the real-time voice information to an emotion recognition unit; the emotion recognition unit comprises a video image modality and a voice modality, and obtains a video emotion recognition result and a voice emotion recognition result respectively, and the recognition process is carried out on a trained neural network model; and performing decision layer fusion on the recognition result to obtain a bimodal recognition result as a final emotion recognition result.
2. The improved convolutional neural network model training method of claim 1, wherein: after converting the training set image into a gray level image, extracting the face features of the image by using a local binary statistical histogram method, and acquiring the emotion detail features of the face image by using a sparse automatic encoder; and after the two features are fused, inputting the fused features into an improved convolutional neural network and training the fused features to obtain a video image modal model.
3. The method according to claim 1, wherein the improved long short-term memory neural network model is trained as follows: the training-set speech data are preprocessed and framed, and four kinds of features are extracted, namely: prosodic features, Mel-frequency cepstral coefficients, nonlinear attribute features, and nonlinear geometric features; after the four features are processed by a deep restricted Boltzmann machine, the resulting deep features are input into the improved long short-term memory network and trained to obtain the voice modal model.
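The "preprocessing and framing" step of claim 3 can be sketched as below. The patent gives no frame length, hop, or pre-emphasis coefficient, so the widely used 25 ms / 10 ms / 0.97 values are assumed here.

```python
import numpy as np

def preprocess_and_frame(signal, sr=16000, frame_ms=25, hop_ms=10, alpha=0.97):
    """Pre-emphasis, framing, and Hamming windowing of a speech signal;
    the windowed frames feed the later feature extraction (e.g. MFCCs)."""
    # boost high frequencies: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    return frames * np.hamming(frame_len)  # taper edges before spectral analysis

sig = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s test tone
frames = preprocess_and_frame(sig)
```

Each row of `frames` is one windowed analysis frame from which the four feature kinds named in the claim would then be computed.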
4. The method according to claim 2, wherein the improved convolutional neural network: receives the fused features as input; is made deeper, and, to avoid the extra time consumption and overfitting that the added depth would cause, replaces the fully connected layer with global average pooling; and uses Softmax at the output layer to classify expressions and output the probability of each class, yielding the recognition result of the video image modal model.
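The output head described in claim 4 (global average pooling followed by Softmax) can be sketched in a few lines. Shapes and weights here are illustrative: each channel's feature map collapses to one scalar, which is why this head needs far fewer parameters than a dense layer and is less prone to overfitting.

```python
import numpy as np

def gap_softmax_head(feature_maps, weights, bias):
    """Global average pooling over each channel, then a Softmax output layer
    producing class probabilities."""
    pooled = feature_maps.mean(axis=(1, 2))   # (channels,) one scalar per map
    logits = weights @ pooled + bias          # (n_classes,)
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()                    # probabilities over the classes

rng = np.random.default_rng(0)
maps = rng.normal(size=(64, 6, 6))            # 64 channels of 6x6 feature maps
probs = gap_softmax_head(maps, rng.normal(size=(7, 64)), np.zeros(7))
```

The 7-way output matches the common seven basic expression categories, though the patent deliberately leaves the number of emotion types variable.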
5. The method according to claim 3, wherein: the network is optimized with a variable-weight backpropagation algorithm to strengthen its nonlinear mapping capability; the network structure is deepened, and the fully connected layer is replaced with global average pooling; and the output layer classifies speech emotion with Softmax and outputs the probability of each class, yielding the recognition result of the voice modal model.
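The patent names a "variable-weight backpropagation algorithm" without defining it. One plausible reading, sketched below purely as an assumption, is standard backpropagation with an adaptive step size: grow the step while the loss keeps falling, shrink it when the loss rises.

```python
import numpy as np

def variable_lr_update(w, grad, lr, prev_loss, loss, up=1.05, down=0.7):
    """One hypothetical update step: adapt the learning rate from the loss
    trend, then apply the usual gradient-descent weight change."""
    lr = lr * (up if loss < prev_loss else down)
    return w - lr * grad, lr

w, lr = np.array([1.0, -2.0]), 0.1
w, lr = variable_lr_update(w, np.array([0.5, -0.5]), lr,
                           prev_loss=1.0, loss=0.8)  # loss fell: rate grows
```

Whatever the exact scheme, the claim's point is that the update rule, not the architecture alone, is tuned to improve the network's nonlinear mapping.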
6. The method according to claim 1, wherein real-time recognition proceeds as follows: video image information is collected by the camera, each image frame is input to the emotion recognition unit, and the trained video image modal model processes it and outputs a recognition result; voice information is collected by the microphone, each voice segment is input to the emotion recognition unit, and the trained voice modal model processes it and outputs a recognition result.
7. The method according to claim 1, wherein the emotion recognition unit comprises the video image modality, the voice modality, and a bimodal decision-layer fusion method; the two modalities consist of the trained neural network models, whose outputs are probabilities over the emotion classes, the class with the maximum probability being the recognized emotion; and the decision-layer fusion method is a weighted addition of the outputs of the two modalities, the weight of the video image modality and the weight of the voice modality summing to 1.
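The decision-layer fusion of claim 7 reduces to a weighted addition of the two probability vectors. The patent fixes only that the weights sum to 1; the value 0.6 below is illustrative, not taken from the specification.

```python
import numpy as np

def fuse_decisions(p_video, p_voice, w_video=0.6):
    """Weighted addition of the two modality outputs; the class with the
    maximum fused probability is the final emotion recognition result."""
    fused = w_video * np.asarray(p_video) + (1 - w_video) * np.asarray(p_voice)
    return fused, int(np.argmax(fused))

p_img = [0.1, 0.7, 0.2]    # video image modal model output
p_aud = [0.2, 0.3, 0.5]    # voice modal model output, disagreeing
fused, label = fuse_decisions(p_img, p_aud)  # fused probabilities, class index
```

Because the weights sum to 1, the fused vector remains a valid probability distribution, and the video modality's higher weight lets it override the voice model when the two disagree.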
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110650544.2A CN113343860A (en) | 2021-06-10 | 2021-06-10 | Bimodal fusion emotion recognition method based on video image and voice |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113343860A true CN113343860A (en) | 2021-09-03 |
Family
ID=77476688
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110650544.2A Pending CN113343860A (en) | 2021-06-10 | 2021-06-10 | Bimodal fusion emotion recognition method based on video image and voice |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113343860A (en) |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190311188A1 (en) * | 2018-12-05 | 2019-10-10 | Sichuan University | Face emotion recognition method based on dual-stream convolutional neural network |
CN110516696A (en) * | 2019-07-12 | 2019-11-29 | 东南大学 | It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression |
CN110556129A (en) * | 2019-09-09 | 2019-12-10 | 北京大学深圳研究生院 | Bimodal emotion recognition model training method and bimodal emotion recognition method |
CN110826466A (en) * | 2019-10-31 | 2020-02-21 | 南京励智心理大数据产业研究院有限公司 | Emotion identification method, device and storage medium based on LSTM audio-video fusion |
CN111242155A (en) * | 2019-10-08 | 2020-06-05 | 台州学院 | Bimodal emotion recognition method based on multimode deep learning |
WO2020173133A1 (en) * | 2019-02-27 | 2020-09-03 | 平安科技(深圳)有限公司 | Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium |
WO2020248376A1 (en) * | 2019-06-14 | 2020-12-17 | 平安科技(深圳)有限公司 | Emotion detection method and apparatus, electronic device, and storage medium |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113807249A (en) * | 2021-09-17 | 2021-12-17 | 广州大学 | Multi-mode feature fusion based emotion recognition method, system, device and medium |
CN113807249B (en) * | 2021-09-17 | 2024-01-12 | 广州大学 | Emotion recognition method, system, device and medium based on multi-mode feature fusion |
CN114533063A (en) * | 2022-02-23 | 2022-05-27 | 金华高等研究院(金华理工学院筹建工作领导小组办公室) | Multi-source monitoring combined emotion calculation system and method |
CN114533063B (en) * | 2022-02-23 | 2023-10-27 | 金华高等研究院(金华理工学院筹建工作领导小组办公室) | Multi-source monitoring combined emotion computing system and method |
CN114973490A (en) * | 2022-05-26 | 2022-08-30 | 南京大学 | Monitoring and early warning system based on face recognition |
CN115455129A (en) * | 2022-10-14 | 2022-12-09 | 阿里巴巴(中国)有限公司 | POI processing method and device, electronic equipment and storage medium |
CN115455129B (en) * | 2022-10-14 | 2023-08-25 | 阿里巴巴(中国)有限公司 | POI processing method, POI processing device, electronic equipment and storage medium |
CN117708375A (en) * | 2024-02-05 | 2024-03-15 | 北京搜狐新媒体信息技术有限公司 | Video processing method and device and related products |
CN117708375B (en) * | 2024-02-05 | 2024-05-28 | 北京搜狐新媒体信息技术有限公司 | Video processing method and device and related products |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112784798B (en) | Multi-modal emotion recognition method based on feature-time attention mechanism | |
CN110188343B (en) | Multi-mode emotion recognition method based on fusion attention network | |
CN113343860A (en) | Bimodal fusion emotion recognition method based on video image and voice | |
CN106682616B (en) | Method for recognizing neonatal pain expression based on two-channel feature deep learning | |
CN105976809B (en) | Identification method and system based on speech and facial expression bimodal emotion fusion | |
CN108960337B (en) | Multi-modal complex activity recognition method based on deep learning model | |
CN112784763B (en) | Expression recognition method and system based on local and overall feature adaptive fusion | |
CN108597541A (en) | A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying | |
CN111292765B (en) | Bimodal emotion recognition method integrating multiple deep learning models | |
CN110575663B (en) | Physical education auxiliary training method based on artificial intelligence | |
CN111128242B (en) | Multi-mode emotion information fusion and identification method based on double-depth network | |
CN113158727A (en) | Bimodal fusion emotion recognition method based on video and voice information | |
CN112818861A (en) | Emotion classification method and system based on multi-mode context semantic features | |
US20220328065A1 (en) | Speech emotion recognition method and system based on fused population information | |
CN112101096B (en) | Multi-mode fusion suicide emotion perception method based on voice and micro-expression | |
CN111242155A (en) | Bimodal emotion recognition method based on multimode deep learning | |
CN112380924B (en) | Depression tendency detection method based on facial micro expression dynamic recognition | |
CN111709284B (en) | Dance emotion recognition method based on CNN-LSTM | |
CN112989920A (en) | Electroencephalogram emotion classification system based on frame-level feature distillation neural network | |
Rwelli et al. | Gesture based Arabic sign language recognition for impaired people based on convolution neural network | |
CN114724224A (en) | Multi-mode emotion recognition method for medical care robot | |
CN111967361A (en) | Emotion detection method based on baby expression recognition and crying | |
CN115731595A (en) | Fuzzy rule-based multi-level decision fusion emotion recognition method | |
CN114209319B (en) | fNIRS emotion recognition method and system based on graph network and self-adaptive denoising | |
CN112529054B (en) | Multi-dimensional convolution neural network learner modeling method for multi-source heterogeneous data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
DD01 | Delivery of document by public notice |
Addressee: Wang Chuanyu Document name: Deemed withdrawal notice |