CN113343860A - Bimodal fusion emotion recognition method based on video image and voice - Google Patents

Bimodal fusion emotion recognition method based on video image and voice

Info

Publication number
CN113343860A
CN113343860A (application CN202110650544.2A)
Authority
CN
China
Prior art keywords
voice
emotion
emotion recognition
training
video image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110650544.2A
Other languages
Chinese (zh)
Inventor
李为相
王传昱
程明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University filed Critical Nanjing Tech University
Priority to CN202110650544.2A priority Critical patent/CN113343860A/en
Publication of CN113343860A publication Critical patent/CN113343860A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Signal Processing (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Psychiatry (AREA)
  • Hospice & Palliative Care (AREA)
  • Child & Adolescent Psychology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a bimodal fusion emotion recognition method based on video images and voice. The system comprises a camera, a microphone and an emotion recognition unit, where the emotion recognition unit contains a video image modality and a voice modality. The bimodal models are trained as follows: an image training data set is fed into a convolutional neural network to obtain the video image modality model, and a voice training data set is fed into a long short-term memory (LSTM) neural network to obtain the voice modality model. The camera captures video images and sends them to the emotion recognition unit, which analyzes facial expression features to produce a recognition result; the microphone captures voice data and sends it to the emotion recognition unit, which analyzes speech emotion features to produce a recognition result. The recognition results of the two modalities are fused at the decision layer according to a weight criterion, and the fused result is output. The recognition method adopted by the invention improves the accuracy of emotion recognition and realizes real-time detection.

Description

Bimodal fusion emotion recognition method based on video image and voice
Technical Field
The invention relates to the field of emotion recognition, in particular to a bimodal fusion emotion recognition method based on video images and voice.
Background
With the rapid development of artificial intelligence, users increasingly expect more natural interaction between AI systems and people, and a better user experience. From the perspective of engineering application value, emotion recognition is a research topic spanning machine vision, medicine, psychology and other fields; progress in it not only advances related disciplines but also carries considerable commercial value and practical significance. Depending on the information analyzed, current emotion recognition techniques fall into two categories: those based on physiological signals, such as electroencephalograms and electrocardiograms, and those that analyze emotional behavior, such as facial expressions, body movements and speech. In practice, two or more recognition modalities are usually combined into a multimodal scheme, since multimodal fusion improves recognition accuracy and robustness. Physiological signals are difficult to acquire, so that approach is rarely used; body-movement recognition has low accuracy and usually serves only as an auxiliary modality; speech and facial expressions are comparatively easy to acquire yet give good recognition results, making them the most widely applied basis for emotion recognition.
The emotion recognition methods currently in use have the following main drawbacks: single-modality methods are still the most common, and their accuracy is difficult to improve further; features often have to be extracted manually, which prevents real-time processing; and most cross-modality fusion relies on feature-level fusion, which raises the feature dimensionality and likewise prevents real-time processing. These drawbacks make it hard to push recognition accuracy to a high level and to realize real-time emotion recognition, so improvements are necessary.
Disclosure of Invention
The invention aims to provide a bimodal fusion emotion recognition method based on video images and voice that solves the problems identified in the background above.
In order to achieve this purpose, the invention provides the following technical scheme: an image training data set is input into an improved convolutional neural network model for training to obtain a video image modality model; a voice training data set is input into an improved long short-term memory neural network model for training to obtain a voice modality model; a camera acquires real-time video image information and sends it to the emotion recognition unit; a microphone collects real-time voice information and sends it to the emotion recognition unit; the emotion recognition unit comprises a video image modality and a voice modality, which respectively yield a video emotion recognition result and a voice emotion recognition result, the recognition being performed by the trained neural network models; and decision-layer fusion of the recognition results yields the bimodal result, which is output as the final emotion recognition result.
The improved convolutional neural network model is trained as follows: the training-set images are converted to grayscale, facial features are extracted with the local binary pattern histogram method, and emotion detail features of the face images are obtained with a sparse autoencoder; the two sets of features are fused and input into the improved convolutional neural network, which is trained to obtain the video image modality model.
The improved long short-term memory neural network model is trained as follows: the training-set voice data are pre-processed and framed; four kinds of features are extracted, namely prosodic features, Mel-frequency cepstral coefficients, nonlinear attribute features and nonlinear geometric features; the four features are processed by a deep restricted Boltzmann machine, and the resulting deep features are input into the improved long short-term memory network, which is trained to obtain the voice modality model.
The improved neural networks are realized as follows. For the convolutional neural network, the facial expression features obtained by the local binary pattern histogram method and by the sparse autoencoder are fused and input into the network; the network is made deeper, and to offset the extra computation time and overfitting this brings, the fully connected layer is replaced by global average pooling; the output layer uses Softmax to classify expressions and output the probability of each class, giving the recognition result of the video image modality model. For the long short-term memory network, a variable-weight back-propagation algorithm is used to optimize the network and strengthen its nonlinear mapping ability; the network structure is deepened and the fully connected layer is likewise replaced by global average pooling; the output layer uses Softmax to classify speech emotion and output the probability of each class, giving the recognition result of the voice modality model.
The real-time emotion recognition scheme is as follows: the camera captures video at a frame rate of 30 frames per second, each frame is input to the emotion recognition unit, and the trained video image modality model processes it and outputs a recognition result; the microphone captures voice with a frame length of 33 ms, each voice segment is input to the emotion recognition unit, and the trained voice modality model processes it and outputs a recognition result.
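Illustratively, the following Python sketch shows how such a real-time loop could be wired together. It is a minimal sketch, not the patented implementation: the modality models and the microphone reader are placeholders, the 16 kHz sampling rate is assumed, and only the 30 fps frame rate, the 33 ms audio frame length, the seven emotion classes and the 0.6/0.4 fusion weights of Table 1 come from the description.

```python
import cv2
import numpy as np

NUM_CLASSES = 7  # anger, disgust, worry, happiness, sadness, surprise, neutral

def video_model(frame):
    # Placeholder for the trained video image modality model.
    return np.full(NUM_CLASSES, 1.0 / NUM_CLASSES)

def voice_model(audio_frame):
    # Placeholder for the trained voice modality model.
    return np.full(NUM_CLASSES, 1.0 / NUM_CLASSES)

def read_audio_frame(ms=33, sr=16000):
    # Placeholder for one 33 ms microphone frame (16 kHz mono assumed).
    return np.zeros(int(sr * ms / 1000), dtype=np.float32)

def fuse(p_video, p_voice, alpha=0.6, beta=0.4):
    # Decision-layer fusion: weighted sum of the two probability vectors.
    return int(np.argmax(alpha * p_video + beta * p_voice))

cap = cv2.VideoCapture(0)
cap.set(cv2.CAP_PROP_FPS, 30)  # frame rate set to 30 fps as described
while True:
    ok, frame = cap.read()
    if not ok:
        break
    label = fuse(video_model(frame), voice_model(read_audio_frame()))
    print(label)
cap.release()
```

In a real deployment the placeholders would be replaced by the trained video image and voice modality models described below.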
The emotion recognition unit is specified as follows: it comprises a video image modality, a voice modality and a bimodal decision-layer fusion method. The two modalities consist of the trained neural network models, whose outputs are the probabilities of the different emotion classes; the class with the highest probability is taken as the emotion. Decision-layer fusion adds the weighted outputs of the two modalities, the weight of the video image modality and the weight of the voice modality summing to 1.
Compared with the prior art, the invention has the following beneficial effects: the neural networks extract features automatically; in both improved modality models the fully connected layer is replaced by global average pooling, which speeds up the network response and enables real-time emotion recognition; the video modality fuses multiple features as input, improving its recognition accuracy; the voice modality uses a back-propagation algorithm to strengthen the model's nonlinear processing ability and extracts deep features, improving its recognition accuracy; and the bimodal fusion recognition method further improves accuracy over either single modality.
Drawings
FIG. 1 is a functional block diagram of the present invention.
FIG. 2 is a flow chart of real-time emotion recognition according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The invention provides a bimodal fusion emotion recognition method based on video images and voice, which comprises the following steps: an image training data set is input into an improved convolutional neural network model for training to obtain a video image modality model; a voice training data set is input into an improved long short-term memory neural network model for training to obtain a voice modality model; the camera and the microphone collect real-time video images and voice information and send them to the emotion recognition unit; the emotion recognition unit comprises a video image modality and a voice modality and obtains the recognition result of each, the recognition being performed by the trained neural network models; and decision-layer fusion of the two results yields the bimodal result, which is output as the final emotion recognition result.
The emotion classes follow the labeling of the training data set and may therefore take as many values as the data set provides; illustratively, seven classes can be used: anger, disgust, worry, happiness, sadness, surprise and neutral.
The improved convolutional neural network model is realized as local binary pattern histogram (LBPH) + sparse autoencoder (SAE) + convolutional neural network (CNN). After an image is converted to grayscale, the LBPH method extracts its facial features and the sparse autoencoder captures the emotion detail features of the face image; the two feature sets are fused and input into the improved convolutional neural network for classification, giving the video image modality classification result.
The local binary pattern histogram method is realized in the following way: the feature image produced by the extended LBP operator is divided into local blocks and a histogram is extracted from each block; the center pixel of a neighborhood is taken as the threshold, the gray value of each neighboring pixel is compared with the center, the position is marked 1 if it is larger than the center value and 0 otherwise, and the block histograms are then concatenated in order to form the LBPH. For a given center point $(x_c, y_c)$ and neighborhood sampling point $(x_p, y_p)$ with $p < P$, the position $(x_p, y_p)$ can be expressed by equation (1):

$$x_p = x_c + R\cos\left(\frac{2\pi p}{P}\right), \qquad y_p = y_c - R\sin\left(\frac{2\pi p}{P}\right) \tag{1}$$

where $R$ is the sampling radius, $p$ is the $p$-th sampling point, and $P$ is the total number of sampling points.
In the local binary pattern histogram method the computed sampling coordinates may not be integers, i.e. the sampled point may not lie exactly on the pixel grid, so bilinear interpolation is adopted to handle this case, as in equation (2):

$$f(x, y) \approx \begin{bmatrix} 1-x & x \end{bmatrix} \begin{bmatrix} f(0,0) & f(0,1) \\ f(1,0) & f(1,1) \end{bmatrix} \begin{bmatrix} 1-y \\ y \end{bmatrix} \tag{2}$$
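As a concrete illustration of the sampling and interpolation described above, the short sketch below computes a circular LBP code for one interior pixel, using equation (1) for the neighbor coordinates and bilinear interpolation for non-integer positions; the radius and number of sampling points are assumed values, and an LBPH descriptor would additionally divide the image into blocks and concatenate the per-block histograms of such codes.

```python
import numpy as np

def lbp_code(gray, xc, yc, R=1, P=8):
    """Circular LBP code of the interior pixel (xc, yc) of a grayscale image."""
    center = gray[yc, xc]
    code = 0
    for p in range(P):
        # Neighbor coordinates from equation (1).
        xp = xc + R * np.cos(2 * np.pi * p / P)
        yp = yc - R * np.sin(2 * np.pi * p / P)
        # Bilinear interpolation, equation (2), for non-integer positions.
        x0, y0 = int(np.floor(xp)), int(np.floor(yp))
        dx, dy = xp - x0, yp - y0
        patch = gray[y0:y0 + 2, x0:x0 + 2].astype(float)
        value = np.array([1 - dy, dy]) @ patch @ np.array([1 - dx, dx])
        code |= int(value > center) << p  # 1 if the neighbor is brighter than the center
    return code

img = (np.random.rand(32, 32) * 255).astype(np.uint8)  # stand-in grayscale face image
print(lbp_code(img, 10, 10))
```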
the sparse automatic encoder is realized by the following modes: after an input image is compressed, sparse reconstruction is carried out, SAE is a 3-layer unsupervised network model, sparsity constraint is applied to a hidden layer, the number of hidden nodes is forced to be smaller than that of input nodes, and therefore the network can learn key features of the image. First, the average activity of the jth hidden neuron is calculated, and formula (3) is as follows:
Figure BSA0000244521510000033
in the formula, xiAnd n represents the sample and number of input layers, respectively;
Figure BSA0000244521510000034
indicating the activation degree of the jth hidden neuron.
To satisfy the sparsity constraint, the sparse autoencoder adds a sparsity penalty term $S(x)$ to the cost function, as in equation (4):

$$S(x) = \sum_{j=1}^{m} \mathrm{KL}\left(\rho \,\|\, \hat{\rho}_j\right) = \sum_{j=1}^{m}\left[\rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j}\right] \tag{4}$$

where $\rho$ is the target sparsity and $m$ is the number of hidden neurons.
the sparse automatic encoder comprises: after the constraint condition is satisfied, the overall cost function of the SAE network is as follows:
Figure BSA0000244521510000042
in equation (5), γ represents the weight of the sparsity penalty term; w and b represent the weight and offset of each layer of neurons, respectively. The parameters of the SAE network are adjusted through training, so that the total cost function is minimized, and the detailed characteristics of the input image can be captured.
The convolutional neural network is improved as follows: it consists mainly of convolutional layers, pooling layers, a global average pooling (GAP) layer and an output layer. Replacing the fully connected layer with global average pooling effectively reduces the number of parameters: if the final convolutional output is an h × w × d feature map, GAP reduces each of the d channels' h × w values to a single average.
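Numerically, global average pooling over an h × w × d feature map is just a per-channel mean, as the short sketch below shows with arbitrary sizes.

```python
import numpy as np

fmap = np.random.rand(6, 6, 128)   # h x w x d output of the last convolutional layer
gap = fmap.mean(axis=(0, 1))       # global average pooling: one value per channel
assert gap.shape == (128,)
```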
The convolutional neural network is further improved by using depthwise separable convolution, which greatly reduces the amount of computation. Assume the input feature map has size $D_L \times D_L \times M$ and the stride is 1. A standard convolution producing the output feature map costs $D_K \cdot D_K \cdot M \cdot N \cdot D_L \cdot D_L$ operations, whereas the depthwise separable form costs $D_K \cdot D_K \cdot M \cdot D_L \cdot D_L + M \cdot N \cdot D_L \cdot D_L$. Comparing the two gives equation (6):

$$\frac{D_K \cdot D_K \cdot M \cdot D_L \cdot D_L + M \cdot N \cdot D_L \cdot D_L}{D_K \cdot D_K \cdot M \cdot N \cdot D_L \cdot D_L} = \frac{1}{N} + \frac{1}{D_K^{2}} \tag{6}$$

where $D_L$ is the spatial size of the input feature map, $D_K$ is the spatial size of the convolution kernel, $M$ is the number of input channels and $N$ is the number of output channels.
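A depthwise separable block can be written compactly in PyTorch, as in the hedged sketch below; the channel counts and kernel size are illustrative assumptions, and the final lines simply verify the cost ratio of equation (6) numerically.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise (per-channel) convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

y = DepthwiseSeparableConv(32, 64)(torch.randn(1, 32, 28, 28))

# Numerical check of equation (6) with D_K = 3, M = 32, N = 64, D_L = 28.
D_K, M, N, D_L = 3, 32, 64, 28
standard = D_K * D_K * M * N * D_L * D_L
separable = D_K * D_K * M * D_L * D_L + M * N * D_L * D_L
print(separable / standard, 1 / N + 1 / D_K ** 2)  # the two ratios agree
```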
The convolutional neural network contains 6 convolutional layers with ReLU activation; the convolutional layers are connected by max pooling, and the fully connected layer is replaced by global average pooling. The input is the fused LBPH and SAE features, and the output layer uses Softmax to classify expressions.
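The sketch below assembles a network in the spirit of that description: six 3 × 3 convolutional layers with ReLU, max pooling between stages, a 1 × 1 convolution to the class scores followed by global average pooling instead of a fully connected layer, and a Softmax output. The layer widths, input size and class count are assumptions, not the exact configuration of the invention.

```python
import torch
import torch.nn as nn

class VideoEmotionCNN(nn.Module):
    def __init__(self, in_ch=1, num_classes=7):
        super().__init__()
        chans = [in_ch, 32, 32, 64, 64, 128, 128]      # six convolutional layers
        layers = []
        for i in range(6):
            layers += [nn.Conv2d(chans[i], chans[i + 1], 3, padding=1), nn.ReLU()]
            if i % 2 == 1:
                layers.append(nn.MaxPool2d(2))          # max pooling between stages
        # 1x1 convolution to class scores, then GAP instead of a fully connected layer.
        layers.append(nn.Conv2d(chans[-1], num_classes, 1))
        self.features = nn.Sequential(*layers)
        self.gap = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):
        logits = self.gap(self.features(x)).flatten(1)
        return torch.softmax(logits, dim=1)             # probabilities over emotion classes

probs = VideoEmotionCNN()(torch.randn(1, 1, 48, 48))    # stand-in fused feature map
```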
The improved long short-term memory neural network is realized as deep restricted Boltzmann machine (DBM) + long short-term memory network (LSTM). The training-set audio is processed with the FFmpeg and Spleeter audio-separation tools and denoised, then framed with a Hamming window function, the frame length being set to 33 ms; after pre-processing and framing, prosodic features, Mel-frequency cepstral coefficients, nonlinear attribute features and nonlinear geometric features are extracted; the four feature sets are processed by the DBM, and the resulting deep features are input into the LSTM for classification, giving the voice modality recognition result.
The voice-signal pre-processing is realized as follows: the Spleeter tool extracts the sound signal from the video training data set, the FFmpeg tool further processes the sound to separate the human voice from background music, and the extracted voice data are then denoised.
The framing operation is as follows: the Hamming window function $\omega(n)$ is multiplied with the speech signal $s(n)$ to obtain the windowed speech signal $s_\omega(n)$, which completes the framing. The Hamming window is given by equation (7):

$$\omega(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \tag{7}$$

After framing, features can be extracted from the pre-processed speech segments.
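Illustratively, the framing step can be sketched as follows; only the 33 ms frame length comes from the description, while the 16 kHz sampling rate and the 50% overlap are assumed values.

```python
import numpy as np

def frame_signal(signal, sr=16000, frame_ms=33, hop_ms=16):
    """Split a 1-D speech signal into Hamming-windowed frames (equation (7))."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    n = np.arange(frame_len)
    window = 0.54 - 0.46 * np.cos(2 * np.pi * n / (frame_len - 1))  # Hamming window
    starts = range(0, len(signal) - frame_len + 1, hop)
    return np.stack([signal[s:s + frame_len] * window for s in starts])

frames = frame_signal(np.random.randn(16000))   # one second of stand-in audio
print(frames.shape)
```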
The deep restricted Boltzmann machine is realized as follows: the DBM is formed by stacking several restricted Boltzmann machines (RBMs) from bottom to top, the output of each lower layer becoming the input of the layer above. An RBM is an energy-based probability-distribution model consisting of a visible layer and a hidden layer; neurons within the same layer are independent of one another, while neurons in different layers are bidirectionally connected.
The DBM is composed of three layers of RBMs, with the energy function of equation (8):

$$E\left(v, h^{(1)}, h^{(2)}, h^{(3)}; \theta\right) = -v^{T}W^{(1)}h^{(1)} - h^{(1)T}W^{(2)}h^{(2)} - h^{(2)T}W^{(3)}h^{(3)} \tag{8}$$
the DBM joint probability is as follows (9):
Figure BSA0000244521510000052
the DBM loss function is as follows (10):
Figure BSA0000244521510000053
Here the matrices $W$ are the weights along which information flows in the network, the vectors $a$ and $b$ are the biases, $h$ and $v$ are the neuron state vectors, and $\theta$ denotes the parameter set formed by $W$, $a$ and $b$. The parameters of each RBM are updated until the optimal feature vectors $h$ and $v$ are obtained; the RBMs are then stacked into the DBM, and the output vector of the last RBM is the deep representation of the input features.
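A compact sketch of this idea is given below: three RBMs are trained greedily with one-step contrastive divergence and stacked so that the top layer's activations serve as the deep speech feature; the layer sizes, learning rate and training data are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

class RBM:
    def __init__(self, n_vis, n_hid, lr=0.05):
        self.W = rng.normal(0, 0.1, (n_vis, n_hid))
        self.a = np.zeros(n_vis)    # visible bias
        self.b = np.zeros(n_hid)    # hidden bias
        self.lr = lr

    def hidden(self, v):
        return sigmoid(v @ self.W + self.b)

    def cd1(self, v0):              # one step of contrastive divergence
        h0 = self.hidden(v0)
        v1 = sigmoid(h0 @ self.W.T + self.a)   # reconstruction of the visible layer
        h1 = self.hidden(v1)
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.a += self.lr * (v0 - v1).mean(axis=0)
        self.b += self.lr * (h0 - h1).mean(axis=0)

X = rng.random((500, 40))           # stand-in for framed speech features
sizes = [40, 64, 32, 16]            # three stacked RBMs
rbms, data = [], X
for n_vis, n_hid in zip(sizes[:-1], sizes[1:]):
    rbm = RBM(n_vis, n_hid)
    for _ in range(50):
        rbm.cd1(data)
    data = rbm.hidden(data)         # the lower layer's output feeds the next RBM
    rbms.append(rbm)

deep_features = data                # deep representation passed on to the LSTM
```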
The long short-term memory network is improved as follows: a variable-weight back-propagation (BP) algorithm strengthens the nonlinear mapping ability of the network, a cross-entropy cost function increases the processing speed of the system, and GAP replaces the fully connected layer to increase processing speed further.
The BP algorithm uses gradient descent to adjust the inter-node weights $\omega_{ij}$ and the node thresholds $b_j$, as in equation (11):

$$\omega_{ij} \leftarrow \omega_{ij} - \eta \frac{\partial E}{\partial \omega_{ij}}, \qquad b_j \leftarrow b_j - \eta \frac{\partial E}{\partial b_j} \tag{11}$$

where $\eta$ is the learning rate of the neural network, $\partial$ denotes partial differentiation, and $E$ is the standard error.
To address the problem that the learning rate $\eta$ decreases as the number of iterations grows, the improved BP neural network updates the learning rate according to equation (12), where $m$ is a constant greater than 1 and less than 2, $a$ is the iteration number, and $S$ is the search range of the iterative learning rate.
The cross-entropy cost function is given by equation (13):

$$C = -\frac{1}{n} \sum_{i} \left[ y(x_i) \ln a(x_i) + \left(1 - y(x_i)\right) \ln\left(1 - a(x_i)\right) \right] \tag{13}$$

where $x_i$ is a speech sample, $y(x_i)$ is its corresponding label, $a(x_i)$ is the output value of the network for $x_i$, and $n$ is the total amount of data. With the cross-entropy cost, the weights are adjusted quickly when the error is large and slowly when the error is small, which improves the processing speed of the system.
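A short numerical sketch of equation (13): for a sigmoid output unit, the gradient of the cross-entropy cost with respect to the pre-activation reduces to $(a(x_i) - y(x_i))/n$, so the weight update is large exactly when the error is large; the sample values below are arbitrary.

```python
import numpy as np

y = np.array([1.0, 0.0, 1.0, 0.0])   # labels y(x_i)
a = np.array([0.9, 0.2, 0.6, 0.4])   # network outputs a(x_i)
n = len(y)

cost = -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))   # equation (13)
grad = (a - y) / n                    # gradient w.r.t. the sigmoid pre-activation
print(cost, grad)                     # larger errors produce larger weight updates
```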
The long short-term memory network is formed by stacking three LSTM layers: one input layer and two hidden layers. The input layer is connected to the DBM, and the output layer uses Softmax to classify the speech emotion.
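A minimal PyTorch-style sketch of such a stacked LSTM classifier over sequences of DBM features follows; the feature dimension, hidden size, temporal pooling and class count are assumptions rather than the patented configuration.

```python
import torch
import torch.nn as nn

class SpeechEmotionLSTM(nn.Module):
    def __init__(self, feat_dim=16, hidden=64, num_classes=7):
        super().__init__()
        # Three stacked LSTM layers; the first takes the DBM deep features.
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=3, batch_first=True)
        self.out = nn.Linear(hidden, num_classes)

    def forward(self, x):                  # x: (batch, frames, feat_dim)
        h, _ = self.lstm(x)
        pooled = h.mean(dim=1)             # average over time instead of a large FC layer
        return torch.softmax(self.out(pooled), dim=1)

probs = SpeechEmotionLSTM()(torch.randn(2, 30, 16))   # 30 frames of DBM features
```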
The decision-layer fusion strategy is implemented as follows: the video image channel and the voice channel each output a classification over the emotion categories, and the two are fused according to the weight-criterion formula to output the bimodal recognition result.
The weight criterion is given by equation (14):

$$E = \arg\max_{i}\left(\alpha\, P_{p}(i) + \beta\, P_{v}(i)\right), \qquad \alpha + \beta = 1 \tag{14}$$

where $E$ is the emotion category, $P_{p}(i)$ is the probability the video image channel assigns to category $i$, $P_{v}(i)$ is the probability the voice channel assigns to category $i$, and $\alpha$ and $\beta$ are the weights of the two channels; the specific weights may be adjusted according to the training data set and the model.
Table 1 below compares the recognition accuracy of the video image modality, the voice modality and the bimodal fusion provided in this embodiment; the data set used for training and testing is CHEAVD 2.0 (a Chinese natural emotional audio-visual data set).
Table 1:

    Modality       Neural network model    Weight selection    Recognition accuracy
    Video image    LBPH + SAE + CNN        0.6                 72.3%
    Speech         DBM + LSTM              0.4                 62.8%
    Bimodal        Fusion model                                74.9%
As can be seen from Table 1, the recognition accuracy of the bimodal fusion is higher than that of either the video image modality or the voice modality alone.
Fig. 1 is a schematic diagram of the functional modules of the invention: the camera collects video image data and the microphone collects voice data, and the collected data are input into the emotion recognition unit; the emotion recognition unit comprises the video image modality model, the voice modality model and the decision-layer fusion method; after the data are processed by the models, the classification result of each modality is obtained, the classification results are fused at the decision layer, and the final recognition result is output.
FIG. 2 is the flow chart of real-time emotion recognition of the invention: after the application is started, the camera and microphone collect data and input them into the emotion recognition unit; the video image modality model analyzes the video data to obtain an emotion classification, and the voice modality model analyzes the voice data to obtain an emotion classification; the recognition results of the two modalities are fused at the decision layer and the bimodal recognition result is output.
The aspects of the present invention may be loaded onto a computer or other programmable data processing apparatus so that a series of operational steps performed on that apparatus implement some or all of the functions described herein.
While the present invention has been described with reference to the above embodiments, its specific implementation is not limited to them; any changes or substitutions that a person skilled in the art can readily conceive within the technical scope disclosed in the present application, such as changing the data set, the number of emotion classes or the weight parameters, fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (7)

1. A bimodal fusion emotion recognition method based on video images and voice, characterized in that: an image training data set is input into an improved convolutional neural network model for training to obtain a video image modality model; a voice training data set is input into an improved long short-term memory neural network model for training to obtain a voice modality model; a camera acquires real-time video image information and sends it to an emotion recognition unit; a microphone collects real-time voice information and sends it to the emotion recognition unit; the emotion recognition unit comprises a video image modality and a voice modality, which respectively yield a video emotion recognition result and a voice emotion recognition result, the recognition being performed by the trained neural network models; and decision-layer fusion of the recognition results yields the bimodal result, which is output as the final emotion recognition result.
2. The improved convolutional neural network model training method of claim 1, characterized in that: the training-set images are converted to grayscale, facial features are extracted with the local binary pattern histogram method, and emotion detail features of the face images are obtained with a sparse autoencoder; the two sets of features are fused and input into the improved convolutional neural network, which is trained to obtain the video image modality model.
3. The improved long short-term memory neural network model training method of claim 1, characterized in that: the training-set voice data are pre-processed and framed, and four kinds of features are extracted, namely prosodic features, Mel-frequency cepstral coefficients, nonlinear attribute features and nonlinear geometric features; the four features are processed by a deep restricted Boltzmann machine, and the resulting deep features are input into the improved long short-term memory network, which is trained to obtain the voice modality model.
4. The improved convolutional neural network implementation of claim 2, characterized in that: the two kinds of features are fused and then input into the network; the network depth is increased, and to eliminate the extra computation time and overfitting this brings, global average pooling replaces the fully connected layer; and the output layer uses Softmax to classify expressions and output the probability of each class, giving the recognition result of the video image modality model.
5. The method of claim 3, characterized in that: the network is optimized with a variable-weight back-propagation algorithm to strengthen its nonlinear mapping ability; the network structure is deepened, and global average pooling replaces the fully connected layer; and the output layer uses Softmax to classify the speech emotion and output the probability of each class, giving the recognition result of the voice modality model.
6. The real-time emotion recognition method of claim 1, characterized in that: the camera collects video image information, each frame is input to the emotion recognition unit, and the trained video image modality model processes it and outputs a recognition result; and the microphone collects voice information, each voice segment is input to the emotion recognition unit, and the trained voice modality model processes it and outputs a recognition result.
7. The emotion recognition unit design method of claim 1, characterized in that: the emotion recognition unit comprises a video image modality, a voice modality and a bimodal decision-layer fusion method; the two modalities consist of the trained neural network models, whose outputs are the probabilities of the different emotion classes, the class with the highest probability being the emotion; and the decision-layer fusion adds the weighted outputs of the two modalities, the weight of the video image modality and the weight of the voice modality summing to 1.
CN202110650544.2A 2021-06-10 2021-06-10 Bimodal fusion emotion recognition method based on video image and voice Pending CN113343860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110650544.2A CN113343860A (en) 2021-06-10 2021-06-10 Bimodal fusion emotion recognition method based on video image and voice

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110650544.2A CN113343860A (en) 2021-06-10 2021-06-10 Bimodal fusion emotion recognition method based on video image and voice

Publications (1)

Publication Number Publication Date
CN113343860A true CN113343860A (en) 2021-09-03

Family

ID=77476688

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110650544.2A Pending CN113343860A (en) 2021-06-10 2021-06-10 Bimodal fusion emotion recognition method based on video image and voice

Country Status (1)

Country Link
CN (1) CN113343860A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190311188A1 (en) * 2018-12-05 2019-10-10 Sichuan University Face emotion recognition method based on dual-stream convolutional neural network
WO2020173133A1 (en) * 2019-02-27 2020-09-03 平安科技(深圳)有限公司 Training method of emotion recognition model, emotion recognition method, device, apparatus, and storage medium
WO2020248376A1 (en) * 2019-06-14 2020-12-17 平安科技(深圳)有限公司 Emotion detection method and apparatus, electronic device, and storage medium
CN110516696A (en) * 2019-07-12 2019-11-29 东南大学 It is a kind of that emotion identification method is merged based on the adaptive weighting bimodal of voice and expression
CN110556129A (en) * 2019-09-09 2019-12-10 北京大学深圳研究生院 Bimodal emotion recognition model training method and bimodal emotion recognition method
CN111242155A (en) * 2019-10-08 2020-06-05 台州学院 Bimodal emotion recognition method based on multimode deep learning
CN110826466A (en) * 2019-10-31 2020-02-21 南京励智心理大数据产业研究院有限公司 Emotion identification method, device and storage medium based on LSTM audio-video fusion

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807249A (en) * 2021-09-17 2021-12-17 广州大学 Multi-mode feature fusion based emotion recognition method, system, device and medium
CN113807249B (en) * 2021-09-17 2024-01-12 广州大学 Emotion recognition method, system, device and medium based on multi-mode feature fusion
CN114533063A (en) * 2022-02-23 2022-05-27 金华高等研究院(金华理工学院筹建工作领导小组办公室) Multi-source monitoring combined emotion calculation system and method
CN114533063B (en) * 2022-02-23 2023-10-27 金华高等研究院(金华理工学院筹建工作领导小组办公室) Multi-source monitoring combined emotion computing system and method
CN114973490A (en) * 2022-05-26 2022-08-30 南京大学 Monitoring and early warning system based on face recognition
CN115455129A (en) * 2022-10-14 2022-12-09 阿里巴巴(中国)有限公司 POI processing method and device, electronic equipment and storage medium
CN115455129B (en) * 2022-10-14 2023-08-25 阿里巴巴(中国)有限公司 POI processing method, POI processing device, electronic equipment and storage medium
CN117708375A (en) * 2024-02-05 2024-03-15 北京搜狐新媒体信息技术有限公司 Video processing method and device and related products
CN117708375B (en) * 2024-02-05 2024-05-28 北京搜狐新媒体信息技术有限公司 Video processing method and device and related products

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
DD01 Delivery of document by public notice
    Addressee: Wang Chuanyu
    Document name: Deemed withdrawal notice