CN113158727A - Bimodal fusion emotion recognition method based on video and voice information - Google Patents

Bimodal fusion emotion recognition method based on video and voice information

Info

Publication number
CN113158727A
CN113158727A (application CN202011613947.1A)
Authority
CN
China
Prior art keywords
voice
emotion
features
information
feature
Prior art date
Legal status
Pending
Application number
CN202011613947.1A
Other languages
Chinese (zh)
Inventor
臧景峰
史玉欢
王鑫磊
刘瑞
Current Assignee
Changchun University of Science and Technology
Original Assignee
Changchun University of Science and Technology
Priority date
Filing date
Publication date
Application filed by Changchun University of Science and Technology filed Critical Changchun University of Science and Technology
Priority to CN202011613947.1A
Publication of CN113158727A
Legal status: Pending (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/174: Facial expression recognition
    • G06V 40/176: Dynamic expression
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00
    • G10L 25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use
    • G10L 25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination
    • G10L 25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L 15/00 - G10L 21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Child & Adolescent Psychology (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a bimodal fusion emotion recognition method based on video information and voice information. Face image features and voice features are extracted from the video and voice information, the face image feature vectors and voice feature vectors are normalized, and the processed features are fed into a Bi-GRU network for training; the input features of the two single-modality sub-networks are then used to calculate the weight of the state information at each moment. The input features of the two single-modality sub-networks are fused to obtain a multimodal combined feature vector, which is taken as the input of a pre-trained deep neural network containing an emotion classifier. Different types of emotion evaluation information are obtained through the emotion classifier, so that the emotion evaluation information of the user is more objective and has reference value, and a more accurate emotion recognition result is obtained.

Description

Bimodal fusion emotion recognition method based on video and voice information
Technical Field
The application relates to the field of emotion recognition, in particular to a bimodal fusion emotion recognition method based on video and voice information.
Background
In general, the way humans naturally communicate and express emotions is multimodal: we can express emotions vocally or visually. When emotion is conveyed mainly through tone of voice, the audio data carries the main cues for emotion recognition; when emotion is conveyed mainly through facial expression, most of the cues needed to mine emotion are present in the face images. Recognizing emotion from multimodal information such as facial expressions, speech intonation and language content is therefore an interesting and challenging problem.
Traditional research on affective computing has mainly studied a single modality, such as speech emotion, video motion or face images. Although significant progress has been made in each of these fields, such single-modality emotion recognition is still largely at the research stage in practical applications. Because human emotional expression is complex and diverse, judging a person's emotion from a single form of expression yields a one-sided and unconvincing result and loses a great deal of valuable emotional information.
With the deepening development of artificial intelligence technology in the information era, more and more attention is being paid to research on affective computing. However, human emotion is complex and variable, and the accuracy of judging emotional characteristics from a single kind of information alone is low; the present invention is proposed to improve this accuracy.
Disclosure of Invention
The invention aims to provide a bimodal fusion emotion recognition method based on video and voice information, which fully utilizes bimodal fusion to obtain more emotion information and can judge the emotion state according to user voice and facial expression information.
The technical solution for realizing the purpose of the invention is as follows: a bimodal fusion emotion recognition method based on video information and voice information comprises the following steps:
Step one, video information and voice information are acquired through the camera and microphone of an external device, feature extraction is carried out on the video information and the voice information, and facial image features and voice features are extracted respectively.
Step two, after the video information and the voice information are obtained, the method further comprises preprocessing the video information and the voice information.
Step three, the preprocessing of the initial video information specifically comprises the following steps:
1) acquiring a video file to be processed, and analyzing the video file to obtain a video frame;
2) generating a histogram corresponding to the video frame based on the pixel information of the video frame, determining the definition of the video frame, and clustering the video frame according to the histogram and an edge detection operator to obtain at least one class; filtering repeated video frames in each class and video frames with the definition smaller than a definition threshold value;
3) carrying out face detection, alignment, rotation and size adjustment on the video frames by adopting a method based on a convolutional neural network to obtain the face image; a brief code sketch of the frame filtering in 2) follows this list.
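A minimal sketch of the frame filtering and sharpness screening in step three 2), assuming OpenCV; the histogram-similarity and sharpness thresholds are illustrative values, not taken from the patent.

```python
# Minimal sketch of the video-frame filtering: drop blurry frames and near-duplicates.
import cv2

def filter_frames(video_path, sim_thresh=0.98, sharp_thresh=100.0):
    cap = cv2.VideoCapture(video_path)
    kept, prev_hist = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Definition (sharpness) estimate: variance of the Laplacian; low value = blurry frame.
        if cv2.Laplacian(gray, cv2.CV_64F).var() < sharp_thresh:
            continue
        # Histogram comparison drops near-duplicate consecutive frames.
        hist = cv2.calcHist([gray], [0], None, [64], [0, 256])
        hist = cv2.normalize(hist, hist).flatten()
        if prev_hist is not None and cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) > sim_thresh:
            continue
        prev_hist = hist
        kept.append(frame)
    cap.release()
    return kept
```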
Step four, the preprocessing of the initial voice information specifically comprises the following steps:
1) pre-emphasizing the collected voice information to flatten the frequency spectrum of the signal;
2) performing frame windowing on the collected voice signals to obtain voice analysis frames;
3) carrying out short-time Fourier transform on the voice analysis frames to obtain a voice spectrogram; a short sketch of this preprocessing chain follows.
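A minimal sketch of the pre-emphasis, framing, Hamming windowing and short-time Fourier transform chain of step four, assuming NumPy; the pre-emphasis coefficient, frame length and hop size are assumed values.

```python
import numpy as np

def speech_spectrogram(signal, alpha=0.97, frame_len=400, hop=160):
    # 1) Pre-emphasis: y[n] = x[n] - alpha * x[n-1] flattens the signal spectrum.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # 2) Framing with a Hamming window gives the voice analysis frames.
    n_frames = 1 + (len(emphasized) - frame_len) // hop
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    # 3) Short-time Fourier transform -> log-magnitude spectrogram.
    spectrum = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(spectrum + 1e-8).T   # shape: (freq_bins, n_frames)

# Example: one second of 16 kHz audio.
spec = speech_spectrogram(np.random.randn(16000))
```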
Step five, the extraction of the facial image features specifically comprises the following steps:
1) preparing a pre-trained deep convolutional neural network model, wherein the model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer;
2) inputting the face image into the pre-trained image feature model for processing, wherein the feature vector output by the full connection layer of the model is the face image feature vector; a feature-extraction sketch follows.
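The feature extraction in step five can be illustrated with a short sketch. The patent's own image feature model is not published, so torchvision's ImageNet-pretrained ResNet-18 is used here purely as a stand-in; the vector entering the final fully connected layer is exposed as the face image feature.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()   # expose the full-connection-layer input as the feature vector
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(), T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def face_feature(face_bgr):
    """face_bgr: HxWx3 uint8 face crop -> 512-dimensional feature vector."""
    x = preprocess(face_bgr[:, :, ::-1].copy())            # BGR -> RGB
    with torch.no_grad():
        return backbone(x.unsqueeze(0)).squeeze(0)          # shape: (512,)
```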
Step six, the extraction of the voice features specifically comprises the following steps:
1) pre-emphasizing the voice digital signal by using a first-order high-pass FIR digital filter, and performing frame processing on the pre-emphasized voice data by using a short-time analysis technology to obtain a voice characteristic parameter time sequence;
2) windowing the voice characteristic parameter time sequence by using a Hamming window function to obtain voice windowing data, and performing endpoint detection on the voice windowing data by using a double-threshold comparison method to obtain preprocessed voice data;
3) carrying out short-time Fourier transform on the preprocessed voice data to obtain a voice spectrogram;
4) inputting the voice spectrogram into the pre-trained AlexNet network to extract the voice feature data;
5) performing Correlation Feature Selection (CFS) on the feature data and filtering out irrelevant features with little correlation to the category label to obtain the final voice features; a simplified selection sketch follows.
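A simplified stand-in for the correlation-based selection in step six 5): features weakly correlated with the class label are dropped. This is a plain per-feature Pearson-correlation filter, not the full CFS merit search, and the correlation threshold is an assumed value.

```python
import numpy as np

def correlation_filter(features, labels, min_corr=0.1):
    """features: (n_samples, n_features); labels: (n_samples,) integer emotion classes."""
    labels = labels.astype(float)
    keep = []
    for j in range(features.shape[1]):
        col = features[:, j]
        if col.std() == 0:
            continue                       # a constant feature carries no class information
        r = np.corrcoef(col, labels)[0, 1]
        if abs(r) >= min_corr:             # keep only features with non-trivial label correlation
            keep.append(j)
    return features[:, keep], keep
```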
Step seven, the extracted face image features and voice features are normalized.
Step eight, the normalized feature vectors are respectively fed into Bi-GRU networks for training, features are extracted through the networks' maximum pooling and average pooling layers, and the correlation among the multi-modal state information and the attention distribution of each modality at each moment are then calculated; a Bi-GRU branch sketch follows.
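A minimal sketch of one single-modality Bi-GRU branch with maximum and average pooling, as described in step eight; PyTorch is assumed and the hidden size and input dimension are illustrative.

```python
import torch
import torch.nn as nn

class BiGRUBranch(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):                        # x: (batch, time, in_dim)
        states, _ = self.gru(x)                  # state information: (batch, time, 2*hidden)
        max_pooled = states.max(dim=1).values    # maximum pooling over time
        avg_pooled = states.mean(dim=1)          # average pooling over time
        return states, torch.cat([max_pooled, avg_pooled], dim=-1)

# The per-moment states feed the cross-modal attention of step nine;
# the pooled vector summarizes the whole branch.
branch = BiGRUBranch(in_dim=512)
states, pooled = branch(torch.randn(2, 30, 512))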
Step nine, the step of calculating the correlation among the multi-modal state information and the attention distribution of each modality at each moment specifically comprises the following steps:
1) since the state information between the modalities is taken into account, the weights focus on the state information of both modalities simultaneously, i.e. on the correlation between the state information of the two single-modality sub-networks at each moment. The correlation s_i is calculated as follows:

s_i = V \tanh(w_v h_i^v + w_a h_i^a + b_1) + b_2

wherein h_i^v is the state information output by the Bi-GRU network in the video-modality sub-network and w_v is its correlation weight; h_i^a is the state information output by the Bi-GRU network in the voice-modality sub-network and w_a is its correlation weight; b_1 is the correlation bias, b_2 is the fusion bias, V is the weight of the multimodal fusion, and tanh is the activation function.
2) From the multimodal state information s_i, the attention distribution at each moment in the multimodal setting, i.e. the weight a_i corresponding to the state information, is calculated as follows:

a_i = \mathrm{softmax}(s_i) = \exp(s_i) / \sum_k \exp(s_k)

where softmax is the normalized exponential function.
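The correlation and attention computation of step nine can be sketched as follows, following the formulas above; PyTorch is assumed, the tensor dimensions are illustrative, and the fusion bias b_2 is folded into the bias of the final linear layer.

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, video_dim, audio_dim, att_dim=128):
        super().__init__()
        self.w_v = nn.Linear(video_dim, att_dim, bias=False)   # correlation weight for video states h_v
        self.w_a = nn.Linear(audio_dim, att_dim, bias=False)   # correlation weight for audio states h_a
        self.b1 = nn.Parameter(torch.zeros(att_dim))           # correlation bias b_1
        self.v = nn.Linear(att_dim, 1, bias=True)              # fusion weight V; its bias plays the role of b_2

    def forward(self, h_v, h_a):
        # h_v: (batch, time, video_dim), h_a: (batch, time, audio_dim)
        s = self.v(torch.tanh(self.w_v(h_v) + self.w_a(h_a) + self.b1))  # correlation s_i: (batch, time, 1)
        a = torch.softmax(s, dim=1)                                       # attention weights over time steps
        fused = (a * torch.cat([h_v, h_a], dim=-1)).sum(dim=1)            # attention-weighted joint representation
        return a, fused

att = CrossModalAttention(256, 256)
weights, fused = att(torch.randn(2, 30, 256), torch.randn(2, 30, 256))
```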
Step ten, feature fusion is carried out on the extracted facial expression features and the extracted voice features to obtain a combined feature vector, comprising the following steps:
1) performing dimensionality reduction on the feature vector after feature fusion by using the PCA method packaged in the sklearn tool library;
2) carrying out normalization processing on the feature vector obtained after the dimensionality reduction to obtain the combined feature vector of the two channels; a short fusion sketch follows.
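A short sketch of the fusion in step ten, assuming scikit-learn for the PCA dimensionality reduction and min-max normalization; the number of retained components is an assumed value.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def fuse_features(face_feats, voice_feats, n_components=64):
    """face_feats, voice_feats: (n_samples, d1) and (n_samples, d2) arrays."""
    joint = np.concatenate([face_feats, voice_feats], axis=1)    # feature-level fusion of the two channels
    reduced = PCA(n_components=n_components).fit_transform(joint)
    return MinMaxScaler().fit_transform(reduced)                 # combined feature vectors scaled to [0, 1]

joint_vecs = fuse_features(np.random.randn(100, 512), np.random.randn(100, 256))
```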
Step eleven, the combined feature vector is input into a pre-trained deep convolutional neural network which contains an emotion classifier; emotion evaluation information is acquired and the emotion of the user is judged.
Step twelve, the training of the pre-trained deep convolutional network comprises the following steps:
1) acquiring a face image open source emotion data set and a voice open source emotion data set, and acquiring face image emotion sample data and voice emotion sample data from the face image emotion data set and the voice emotion data set;
2) enhancing the face emotion sample data, extracting face image feature data and performing feature selection on the feature data to obtain the final face image feature data; performing short-time Fourier transform on the voice emotion sample data to obtain voice spectrograms, extracting voice feature data by using the AlexNet network, and performing feature selection on the feature data to obtain the final voice feature data;
3) respectively carrying out feature fusion on feature vectors with the same emotion label in the face image feature data and the voice feature data to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
4) the joint feature vector data set is trained using a temporal recurrent neural network; a minimal training sketch is given below.
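A compact sketch of step twelve 4), assuming PyTorch and an LSTM as the temporal recurrent neural network; the emotion label count, feature dimension and hyper-parameters are assumptions, not taken from the patent.

```python
import torch
import torch.nn as nn

class EmotionClassifier(nn.Module):
    def __init__(self, feat_dim=64, hidden=128, n_emotions=6):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_emotions)

    def forward(self, x):                 # x: (batch, time, feat_dim) joint feature sequences
        _, (h, _) = self.rnn(x)
        return self.head(h[-1])           # emotion logits

model = EmotionClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# Dummy batch: 8 clips, 20 time steps of 64-dimensional joint features, 6 emotion classes.
x, y = torch.randn(8, 20, 64), torch.randint(0, 6, (8,))
for _ in range(5):                        # a few illustrative training steps
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
```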
Advantageous effects: compared with the prior art, the technical scheme of the invention has the following beneficial technical effects:
the invention realizes the bimodal fusion of video information and voice information, and is an innovation of emotion recognition. Emotion recognition based on video and voice information is more persuasive than a single information recognition result;
in addition, feature fusion is selected in the aspect of a bimodal fusion mode, complementary information among different modalities and mutual influence among the complementary information are effectively fused, and the obtained combined feature result can more comprehensively display the emotional state of a user;
in addition, an idea is provided for other dual-mode fusion or multi-mode fusion, and various functions can be continuously improved subsequently, so that the traditional single-mode emotion recognition technology can achieve the purpose of upgrading and updating.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without inventive exercise.
FIG. 1 is a schematic flow chart of a bimodal fusion emotion recognition method based on video and voice information provided by the present invention;
FIG. 2 is a schematic view of a video information processing flow provided by the present invention;
fig. 3 is a schematic diagram of a processing flow of voice information provided by the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system structures, techniques, etc. in order to provide a thorough understanding of the embodiments of the invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
The invention is further illustrated by the following examples and figures of the specification.
Example 1
A bimodal fusion emotion recognition method based on video and voice information, as shown in FIG. 1, includes the following steps:
1) collecting voice signals and face images: carrying out non-contact acquisition on natural voice and a human face image by using a microphone and a camera;
the camera is a CMOS digital camera whose output electrical signal is directly amplified and converted into a digital signal;
the microphone is a digital MEMS microphone that outputs 1/2-period pulse-density-modulated digital signals;
2) signal preprocessing: respectively preprocessing signals of two modes, namely video signals and voice signals, so that the signals meet the input requirements of corresponding models of different modes;
3) and (3) emotion feature extraction: respectively extracting the features of the face image signal and the voice signal preprocessed in the step 2) to obtain corresponding feature vectors;
4) training feature vectors and calculating relevance and attention distributions: normalizing the feature vectors obtained in the step 3), then respectively transmitting the normalized feature vectors into a Bi-GRU network for training, and further extracting features through a maximum pooling layer and an average pooling layer of the network to calculate the correlation among multi-modal state information and the attention distribution of each modal at each moment;
5) and (3) emotional characteristic fusion: performing feature fusion on the facial image and the voice feature vectors extracted in the step 3) by adopting a corresponding method;
6) judging the emotion: inputting the fusion characteristics of the step 5) into a pre-trained deep convolutional neural network, wherein the deep convolutional neural network comprises an emotion classifier, acquiring emotion evaluation information through the emotion classifier, and judging emotion according to the emotion evaluation information of the user.
Example 2
The video information processing flow, as shown in fig. 2, includes the following steps:
1) acquiring a video file to be processed; analyzing the video file to obtain a video frame; filtering the video frame based on the pixel information of the video frame, and taking the video frame obtained after filtering as the image of the face emotion to be recognized;
2) generating a histogram corresponding to the video frame and determining the definition of the video frame based on the pixel information of the video frame; clustering the video frames according to the histogram and the edge detection operator to obtain at least one class; filtering repeated video frames in each class and video frames with the definition smaller than a definition threshold value;
3) based on the filtered video frame, carrying out face detection, alignment, rotation and size adjustment on the video frame by adopting a method based on a convolutional neural network to obtain a face image;
4) based on the face image, inputting the face image into an image feature extraction model obtained by pre-training for processing, and determining a feature vector output by a full connection layer in the image feature extraction model as the image feature vector;
5) normalizing the acquired face image features, then transmitting the normalized features into the Bi-GRU network for training, and further extracting features through the maximum pooling layer and average pooling layer of the network; a sketch of the CNN-based face detection and alignment in step 3) is given below.
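One possible realization of the CNN-based face detection and alignment of Example 2 step 3), using MTCNN from the facenet-pytorch package; the patent does not name a specific detector, so this library choice and the crop size are assumptions.

```python
from PIL import Image
from facenet_pytorch import MTCNN

detector = MTCNN(image_size=224, margin=20)   # detects, aligns, crops and resizes the face

def detect_face(frame_rgb: Image.Image):
    """frame_rgb: a filtered video frame as a PIL image -> (aligned face tensor or None, boxes)."""
    boxes, probs = detector.detect(frame_rgb)  # bounding boxes, useful for inspection
    face = detector(frame_rgb)                 # (3, 224, 224) aligned face crop, or None if no face found
    return face, boxes
```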
Example 3
The voice information processing flow, as shown in fig. 3, includes the following steps:
1) acquiring a human body voice signal by using a digital MEMS (micro electro mechanical system) microphone, pre-emphasizing the human body voice signal by using a first-order high-pass FIR (finite impulse response) digital filter, and outputting pre-emphasized voice data;
2) performing frame processing on the pre-emphasized voice data by using a short-time analysis technology to obtain a voice characteristic parameter time sequence;
3) windowing the voice characteristic parameter time sequence by using a Hamming window function to obtain voice windowing data;
4) carrying out endpoint detection on the voice windowing data by using a double-threshold comparison method to obtain preprocessed voice data;
5) carrying out short-time Fourier transform on the preprocessed voice data to obtain a voice spectrogram;
6) inputting the spectrogram into the pre-trained AlexNet network and taking the voice characteristic data from the fourth convolutional layer (Conv4);
7) performing feature selection on the feature data to obtain the final voice features;
8) normalizing the acquired voice features, transmitting the normalized voice features into the Bi-GRU network for training, and further extracting features through the maximum pooling layer and average pooling layer of the network; a sketch of the Conv4 feature extraction in step 6) follows.
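A sketch of Example 3 step 6), assuming torchvision's ImageNet-pretrained AlexNet; in torchvision's implementation the fourth convolutional layer (Conv4) is features[8], and the spectrogram is tiled to three channels and resized to the network's expected input, both of which are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT).eval()
conv4_out = {}

def hook(module, inputs, output):
    conv4_out["feat"] = output.detach()

alexnet.features[8].register_forward_hook(hook)   # features[8] is the 4th Conv2d layer (Conv4)

def spectrogram_to_features(spec):
    """spec: (freq, time) voice spectrogram -> flattened Conv4 feature vector."""
    x = torch.as_tensor(spec, dtype=torch.float32)
    x = x.unsqueeze(0).repeat(3, 1, 1).unsqueeze(0)   # fake 3-channel image batch: (1, 3, freq, time)
    x = F.interpolate(x, size=(224, 224))             # resize to AlexNet's input size
    with torch.no_grad():
        alexnet(x)
    return conv4_out["feat"].flatten()
```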

Claims (9)

1. A bimodal fusion emotion recognition method based on video and voice information is characterized by comprising the following steps:
step 1: acquiring face information and voice information of a user with emotion to be recognized through a camera and a microphone of external equipment, inputting the face information and the voice information into a pre-trained feature extraction network, and respectively extracting face image features and voice features;
step 2: and normalizing the extracted human face image features and the extracted voice features, then transmitting the normalized human face image features and the extracted voice features into a Bi-GRU network for training, and calculating correlation and attention distribution of each mode at each moment through input features in two single-mode sub-networks.
step 3: carrying out feature fusion on the extracted face image features and the extracted voice features to obtain a combined feature vector, wherein the combined feature vector is obtained by fusing the face image features and the voice features with the same emotion labels and performing dimensionality reduction and normalization processing;
step 4: inputting the fused features into a pre-trained deep neural network, wherein the deep neural network comprises an emotion classifier and is used for acquiring different types of emotion evaluation information and finally evaluating the emotion of the user.
2. The method according to claim 1, wherein the video information is face image information.
3. The bimodal fusion emotion recognition method based on video and voice information, as claimed in claim 1, wherein said obtaining facial image information and extracting facial image features comprises the steps of:
step 1: acquiring a video file to be processed; analyzing the video file to obtain a video frame; filtering the video frame based on the pixel information of the video frame, and taking the video frame obtained after filtering as the image of the face emotion to be recognized;
step 2: generating a histogram corresponding to the video frame and determining the definition of the video frame based on the pixel information of the video frame; clustering the video frames according to the histogram and the edge detection operator to obtain at least one class; filtering repeated video frames in each class and video frames with the definition smaller than a definition threshold value;
step 3: based on the filtered video frames, carrying out face detection, alignment, rotation and size adjustment on the video frames by adopting a method based on a convolutional neural network to obtain a face image;
step 4: inputting the face image into an image feature extraction model obtained by pre-training for processing, and determining the feature vector output by the full connection layer in the image feature extraction model as the image feature vector, wherein the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
4. The method according to claim 1, wherein the voice information is obtained and the voice features are extracted by a pre-trained AlexNet network.
5. The extracting of speech features according to claim 4 comprises the steps of:
step 1: the method comprises the steps of acquiring an original voice signal of a human body by using a microphone, and preprocessing the voice signal to obtain a spectrogram.
Step 2: inputting the spectrogram into the pre-trained AlexNet network, passing through the input layer, the first and second convolution layers with their pooling layers, and the third convolution layer, and taking the obtained voice features from the fourth convolution layer (Conv4), wherein ReLU is used as the activation function at the output of each convolution layer.
6. The method for extracting the acquired voice features according to claim 5, implemented as follows: the similarity between features is measured using Correlation Feature Selection (CFS), and irrelevant features with little correlation to the category label are discarded. The evaluation criterion is:

\mathrm{Merit} = k \, \overline{r_{cf}} / \sqrt{k + k(k-1) \, \overline{r_{ff}}}

where \overline{r_{cf}} is the feature-classification relevance, k is the number of features, and \overline{r_{ff}} represents the correlation between features.
7. The bimodal fusion emotion recognition method based on video and voice information, as claimed in claim 1, wherein said normalizing said facial image features and voice features, and then transmitting them into Bi-GRU network for training, and calculating correlation and attention distribution comprises the steps of:
step 1: finding the maximum values of the face image features and the voice features, and dividing all feature vectors by the maximum value of the corresponding modality so that they fall within 0 to 1, which improves the speed of network training and convergence;
step 2: the Bi-GRU network combines the model architectures of the GRU and BRNN networks; the normalized feature vectors are respectively transmitted into the network for training, features are extracted through the maximum pooling layer and average pooling layer of the network, and the correlation among the multi-modal state information and the attention distribution of each modality at each moment are then calculated.
8. The bimodal fusion emotion recognition method based on video and voice information, as claimed in claim 1, wherein said feature fusion of said facial image features and said voice features to obtain a joint feature vector comprises the steps of:
step 1: performing dimensionality reduction on the feature vector after feature fusion by using the PCA method packaged in the sklearn tool library;
step 2: carrying out normalization processing on the feature vector obtained after the dimensionality reduction to obtain the combined feature vector of the two channels.
9. The bimodal fusion emotion recognition method based on video and voice information, as claimed in claim 1, wherein the training of the pre-trained deep convolutional network comprises the following steps:
step 1: acquiring a face image open source emotion data set and a voice open source emotion data set, and acquiring face image emotion sample data and voice emotion sample data from the face image emotion data set and the voice emotion data set;
step 2: enhancing the face emotion sample data, extracting face image feature data and performing feature selection on the feature data to obtain the final face image feature data; performing short-time Fourier transform on the voice emotion sample data to obtain voice spectrograms, extracting voice feature data by using the AlexNet network, and performing feature selection on the feature data to obtain the final voice feature data;
step 3: respectively carrying out feature fusion on the feature vectors with the same emotion label in the face image feature data and the voice feature data to obtain joint feature vectors corresponding to different emotion labels, which are used as the joint feature vector training data set of the character emotion recognition model;
step 4: the joint feature vector data set is trained using a temporal recurrent neural network.
CN202011613947.1A 2020-12-31 2020-12-31 Bimodal fusion emotion recognition method based on video and voice information Pending CN113158727A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011613947.1A CN113158727A (en) 2020-12-31 2020-12-31 Bimodal fusion emotion recognition method based on video and voice information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011613947.1A CN113158727A (en) 2020-12-31 2020-12-31 Bimodal fusion emotion recognition method based on video and voice information

Publications (1)

Publication Number Publication Date
CN113158727A 2021-07-23

Family

ID=76878273

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011613947.1A Pending CN113158727A (en) 2020-12-31 2020-12-31 Bimodal fusion emotion recognition method based on video and voice information

Country Status (1)

Country Link
CN (1) CN113158727A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110188343A (en) * 2019-04-22 2019-08-30 浙江工业大学 Multi-modal emotion identification method based on fusion attention network
CN111242155A (en) * 2019-10-08 2020-06-05 台州学院 Bimodal emotion recognition method based on multimode deep learning
CN111339913A (en) * 2020-02-24 2020-06-26 湖南快乐阳光互动娱乐传媒有限公司 Method and device for recognizing emotion of character in video
CN111563422A (en) * 2020-04-17 2020-08-21 五邑大学 Service evaluation obtaining method and device based on bimodal emotion recognition network

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807249B (en) * 2021-09-17 2024-01-12 广州大学 Emotion recognition method, system, device and medium based on multi-mode feature fusion
CN113807249A (en) * 2021-09-17 2021-12-17 广州大学 Multi-mode feature fusion based emotion recognition method, system, device and medium
CN114022192A (en) * 2021-10-20 2022-02-08 百融云创科技股份有限公司 Data modeling method and system based on intelligent marketing scene
CN113920568A (en) * 2021-11-02 2022-01-11 中电万维信息技术有限责任公司 Face and human body posture emotion recognition method based on video image
CN113742599A (en) * 2021-11-05 2021-12-03 太平金融科技服务(上海)有限公司深圳分公司 Content recommendation method, device, equipment and computer readable storage medium
CN114973490A (en) * 2022-05-26 2022-08-30 南京大学 Monitoring and early warning system based on face recognition
CN115100329A (en) * 2022-06-27 2022-09-23 太原理工大学 Multi-mode driving-based emotion controllable facial animation generation method
CN115496226A (en) * 2022-09-29 2022-12-20 中国电信股份有限公司 Multi-modal emotion analysis method, device, equipment and storage based on gradient adjustment
CN115424108A (en) * 2022-11-08 2022-12-02 四川大学 Cognitive dysfunction evaluation method based on audio-visual fusion perception
CN115424108B (en) * 2022-11-08 2023-03-28 四川大学 Cognitive dysfunction evaluation method based on audio-visual fusion perception
CN117349792A (en) * 2023-10-25 2024-01-05 中国人民解放军空军军医大学 Emotion recognition method based on facial features and voice features
CN117349792B (en) * 2023-10-25 2024-06-07 中国人民解放军空军军医大学 Emotion recognition method based on facial features and voice features
CN117152668A (en) * 2023-10-30 2023-12-01 成都方顷科技有限公司 Intelligent logistics implementation method, device and equipment based on Internet of things
CN117152668B (en) * 2023-10-30 2024-02-06 成都方顷科技有限公司 Intelligent logistics implementation method, device and equipment based on Internet of things
CN117312992A (en) * 2023-11-30 2023-12-29 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Emotion recognition method and system for fusion of multi-view face features and audio features
CN117312992B (en) * 2023-11-30 2024-03-12 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Emotion recognition method and system for fusion of multi-view face features and audio features
CN118055300A (en) * 2024-04-10 2024-05-17 深圳云天畅想信息科技有限公司 Cloud video generation method and device based on large model and computer equipment

Similar Documents

Publication Publication Date Title
CN113158727A (en) Bimodal fusion emotion recognition method based on video and voice information
CN105976809B (en) Identification method and system based on speech and facial expression bimodal emotion fusion
CN108899050B (en) Voice signal analysis subsystem based on multi-modal emotion recognition system
CN108805087B (en) Time sequence semantic fusion association judgment subsystem based on multi-modal emotion recognition system
CN108805089B (en) Multi-modal-based emotion recognition method
CN108877801B (en) Multi-turn dialogue semantic understanding subsystem based on multi-modal emotion recognition system
CN112581979B (en) Speech emotion recognition method based on spectrogram
CN103413113A (en) Intelligent emotional interaction method for service robot
CN112151030B (en) Multi-mode-based complex scene voice recognition method and device
CN111128242B (en) Multi-mode emotion information fusion and identification method based on double-depth network
KR101910089B1 (en) Method and system for extracting Video feature vector using multi-modal correlation
CN115169507A (en) Brain-like multi-mode emotion recognition network, recognition method and emotion robot
CN112101096A (en) Suicide emotion perception method based on multi-mode fusion of voice and micro-expression
Shinde et al. Real time two way communication approach for hearing impaired and dumb person based on image processing
CN111079465A (en) Emotional state comprehensive judgment method based on three-dimensional imaging analysis
Rwelli et al. Gesture based Arabic sign language recognition for impaired people based on convolution neural network
CN111967361A (en) Emotion detection method based on baby expression recognition and crying
CN112418166A (en) Emotion distribution learning method based on multi-mode information
CN113516990A (en) Voice enhancement method, method for training neural network and related equipment
Saha et al. Towards automatic speech identification from vocal tract shape dynamics in real-time MRI
Kuang et al. Simplified inverse filter tracked affective acoustic signals classification incorporating deep convolutional neural networks
CN113571095B (en) Speech emotion recognition method and system based on nested deep neural network
Atkar et al. Speech Emotion Recognition using Dialogue Emotion Decoder and CNN Classifier
CN111785262B (en) Speaker age and gender classification method based on residual error network and fusion characteristics
Chinmayi et al. Emotion Classification Using Deep Learning

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination