CN111339913A - Method and device for recognizing emotion of character in video - Google Patents


Info

Publication number
CN111339913A
Authority
CN
China
Prior art keywords
feature vector
sound
text
feature
face image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010111614.2A
Other languages
Chinese (zh)
Inventor
杨杰
苏敏童
宋施恩
金义彬
卢海波
Current Assignee
Hunan Happly Sunshine Interactive Entertainment Media Co Ltd
Original Assignee
Hunan Happly Sunshine Interactive Entertainment Media Co Ltd
Priority date
Filing date
Publication date
Application filed by Hunan Happly Sunshine Interactive Entertainment Media Co Ltd filed Critical Hunan Happly Sunshine Interactive Entertainment Media Co Ltd
Priority to CN202010111614.2A priority Critical patent/CN111339913A/en
Publication of CN111339913A publication Critical patent/CN111339913A/en


Classifications

    • G06V 40/168 Human faces: Feature extraction; Face representation
    • G06V 40/172 Human faces: Classification, e.g. identification
    • G06V 20/635 Overlay text, e.g. embedded captions in a TV program
    • G06F 18/253 Pattern recognition: Fusion techniques of extracted features
    • G06N 20/00 Machine learning
    • G06N 3/045 Neural networks: Combinations of networks
    • G06N 3/047 Neural networks: Probabilistic or stochastic networks
    • G06N 3/08 Neural networks: Learning methods

Abstract

The invention provides a method and a device for recognizing the emotion of a character in a video. A face image in the video, together with the sound spectrogram and subtitle text corresponding to the face image, is extracted, and an image feature vector, a sound feature vector and a text feature vector are extracted from them. The three vectors are fused into a multi-modal joint feature vector, which adds sound and text features to the face image features and thereby enriches the character emotion features. The joint feature vector is then input, for recognition processing, into a character emotion recognition model obtained in advance by training at least one machine learning model with a joint feature vector training data set, so that a more accurate character emotion recognition result can be obtained.

Description

Method and device for recognizing emotion of character in video
Technical Field
The invention relates to the technical field of video data analysis, in particular to a method and a device for recognizing emotion of a person in a video.
Background
Human emotion recognition is an important component of human-computer interaction and affective computing research. Common and important categories of emotion in video include happiness, anger, disgust, fear, sadness and surprise. Emotion is an important component of video content; recognizing it allows the emotion expressed by a video segment to be analyzed, from which emotion-related video applications can be derived.
Most existing emotion recognition technologies for video focus on recognition based on facial visual features: faces are detected and located, and the face region images are analyzed so that emotion is classified according to their visual features. These visual features reflect facial emotion most directly, but face images in video are subject to various interference factors, such as blurring, poor illumination conditions and angle deviation, so character emotion recognition based on visual features alone has low accuracy.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for recognizing emotion of a person in a video, so as to improve the accuracy of recognizing emotion of the person.
In order to achieve the above purpose, the invention provides the following specific technical scheme:
a method for recognizing emotion of a person in a video comprises the following steps:
extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram, and extracting text feature vectors from the subtitle text;
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector to obtain a character emotion recognition result, wherein the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
Optionally, the extracting the face image in the video, and the sound spectrogram and the subtitle text corresponding to the face image includes:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
Optionally, the extracting an image feature vector from the face image includes:
inputting the face image into an image feature extraction model obtained by pre-training for processing, and determining the feature vector output by a fully connected layer in the image feature extraction model as the image feature vector, wherein the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a fully connected layer, a dropout layer before the fully connected layer and a softmax layer after the fully connected layer.
Optionally, the extracting the acoustic feature vector from the acoustic spectrogram includes:
the method comprises the steps of inputting a sound frequency spectrogram into a sound feature extraction model obtained through pre-training for processing, determining feature vectors output by a full connection layer in the sound feature extraction model to be the sound feature vectors, training the sound feature extraction model to a preset deep convolutional neural network model, wherein the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the extracting a text feature vector from the subtitle text includes:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
the method comprises the steps of inputting a vector matrix into a pre-trained text feature extraction model for processing, determining feature vectors output by a full connection layer in the text feature extraction model as the text feature vectors, training a preset text convolution neural network model to obtain the text feature extraction model, wherein the preset text convolution neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the performing feature fusion on the image feature vector, the sound feature vector, and the text feature vector to obtain a joint feature vector includes:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the fused feature vector by using the PCA method provided in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
Optionally, training the character emotion recognition model includes:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
Optionally, the character emotion recognition model comprises a plurality of sub-recognition models; the step of calling a character emotion recognition model obtained through pre-training and processing the combined feature vector to obtain a character emotion recognition result comprises the following steps:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
An emotion recognition apparatus for a person in a video, comprising:
the multi-modal data extraction unit is used for extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
the multi-modal feature extraction unit is used for extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram and extracting text feature vectors from the subtitle text;
the feature fusion unit is used for performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and the emotion recognition unit is used for calling a character emotion recognition model obtained by pre-training, processing the combined feature vector to obtain a character emotion recognition result, and the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
Optionally, the multi-modal data extraction unit is specifically configured to:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
Optionally, the multi-modal feature extraction unit includes a face image feature extraction subunit, configured to input the face image into an image feature extraction model obtained by pre-training for processing and to determine the feature vector output by a fully connected layer in the image feature extraction model as the image feature vector, wherein the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a fully connected layer, a dropout layer before the fully connected layer and a softmax layer after the fully connected layer.
Optionally, the multi-modal feature extraction unit includes a sound feature extraction subunit, configured to input the sound spectrogram into a sound feature extraction model obtained by pre-training for processing and to determine the feature vector output by a fully connected layer in the sound feature extraction model as the sound feature vector, wherein the sound feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a fully connected layer, a dropout layer before the fully connected layer and a softmax layer after the fully connected layer.
Optionally, the multi-modal feature extraction unit includes a text feature extraction subunit, and the text feature extraction subunit is configured to:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
the method comprises the steps of inputting a vector matrix into a pre-trained text feature extraction model for processing, determining feature vectors output by a full connection layer in the text feature extraction model as the text feature vectors, training a preset text convolution neural network model to obtain the text feature extraction model, wherein the preset text convolution neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the feature fusion unit is specifically configured to:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the fused feature vector by using the PCA method provided in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
Optionally, the apparatus further includes a recognition model training unit, specifically configured to:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
Optionally, the emotion recognition unit is specifically configured to:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses a method for recognizing the emotion of a character in a video, which extracts a face image in the video, a sound frequency spectrogram and a subtitle text corresponding to the face image, on the basis of extracting and obtaining the image characteristic vector, the sound characteristic vector and the text characteristic vector, the image feature vector, the sound feature vector and the text feature vector are subjected to feature fusion to obtain a multi-modal combined feature vector, the multi-modal combined feature vector increases the sound feature vector and the text feature vector relative to the face image features, so that the diversity and the richness of the character emotion features are improved, the combined feature vector is input into a character emotion recognition model obtained by training at least one machine learning model in advance by using a combined feature vector training data set for recognition, and a more accurate character emotion recognition result can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for recognizing emotion of a person in a video according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for extracting a face image in a video, and a sound spectrogram and a subtitle text corresponding to the face image according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a training method of an emotion recognition model disclosed in an embodiment of the present invention;
FIG. 4 is a schematic diagram of multi-modal feature fusion as disclosed in an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a device for recognizing emotion of a person in a video according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention discloses a method for recognizing the emotion of a person in a video, which performs recognition on a multi-modal joint feature vector combining the face image, sound and subtitle text extracted from the video, and can thereby obtain a more accurate emotion recognition result.
Specifically, referring to fig. 1, the method for recognizing emotion of a person in a video disclosed in this embodiment includes the following steps:
s101: extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
referring to fig. 2, an alternative method for extracting a face image in a video, and a sound spectrogram and a subtitle text corresponding to the face image includes the following steps:
s201: splitting a video into a plurality of video frames, and recording the time of each video frame;
specifically, the video can be read by opencv, and the video is split into a plurality of video frames.
S202: sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
the face recognition model can be obtained by training a machine learning model by using a face image training sample, and can also be obtained by sequentially recognizing a plurality of video frames by using the existing face recognition model, such as a face classifier carried by an opencv.
Preferably, the video frame can be converted into a gray scale image to improve the speed of face recognition.
The face region of each video frame recognized as containing a face is cropped to a 128 × 128 format, obtaining the face image.
S203: intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
it can be understood that the video frames containing the face images in the video are a plurality of continuous video frames, each video frame corresponds to a time, the plurality of continuous video frames correspond to a time period, and then the sound segments in the time period in the video can be intercepted.
S204: carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and performing spectrum analysis on the sound segment, wherein the spectrum is quantized into 128 frequency bands, each 128 sampling points is a sampling group, the time length of each sampling segment is 0.02 seconds by 128 seconds to 2.56 seconds, and a 128-dimensional spectral response image, namely a sound spectrogram, is formed.
S205: and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
The pre-constructed caption detection model can be any one of the existing caption detection models.
S102: extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram, and extracting text feature vectors from the subtitle text;
Specifically, the face image is input into an image feature extraction model obtained by pre-training and processed, and the feature vector output by a fully connected layer in the image feature extraction model is determined as the image feature vector; the image feature extraction model is obtained by training a preset deep convolutional neural network model, which comprises a pooling layer, a fully connected layer, a dropout layer before the fully connected layer and a softmax layer after the fully connected layer.
Similarly, the sound spectrogram is input into a sound feature extraction model obtained by pre-training and processed, and the feature vector output by a fully connected layer in the sound feature extraction model is determined as the sound feature vector; the sound feature extraction model is likewise obtained by training a preset deep convolutional neural network model comprising a pooling layer, a fully connected layer, a dropout layer before the fully connected layer and a softmax layer after the fully connected layer.
Performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model and processing the words, wherein each word is represented by a K-dimensional vector, so that N words yield an N × K-dimensional vector matrix;
the method comprises the steps of inputting a vector matrix into a pre-trained text feature extraction model for processing, determining feature vectors output by a full connection layer in the text feature extraction model as the text feature vectors, training a preset text convolution neural network model to obtain the text feature extraction model, wherein the preset text convolution neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
In the feature-extraction models of this embodiment, the several fully connected layers of a traditional convolutional neural network model are replaced by a single fully connected layer, with a softmax layer added directly after it, combined with a hybrid model of residual and Inception structures. The input data is processed with batch normalization, the pooling layer uses global average pooling, and a dropout layer is added before the fully connected layer. The dropout layer effectively mitigates overfitting and acts as a regularizer: because any two neurons do not necessarily appear together in the same dropout sub-network at each step, weight updates no longer depend on hidden nodes with fixed joint relations, preventing the situation where certain features are effective only in the presence of other specific features. This forces the network to learn more robust features and increases the robustness of the model.
The image feature vector, the sound feature vector and the text feature vector are each 512-dimensional feature vectors output by the fully connected layer of the corresponding model.
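The architecture described above might be sketched as follows in PyTorch (an assumption; the patent names no framework). The tiny convolutional backbone stands in for the residual/Inception hybrid, and the output of the single 512-unit fully connected layer is taken as the feature vector:

```python
# Hedged sketch of the feature-extraction network: a convolutional backbone
# with batch normalization, global average pooling, a dropout layer before a
# single 512-unit fully connected layer, and a softmax head after it.
import torch
import torch.nn as nn

class EmotionFeatureNet(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in for the residual/
            nn.Conv2d(3, 64, 3, padding=1),      # Inception hybrid backbone
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # global average pooling
        )
        self.dropout = nn.Dropout(0.5)           # dropout before the FC layer
        self.fc = nn.Linear(64, 512)             # the single FC layer (512-d)
        self.head = nn.Linear(512, n_classes)    # followed by softmax

    def forward(self, x):
        x = self.backbone(x).flatten(1)
        feat = self.fc(self.dropout(x))          # 512-d feature vector
        return feat, torch.softmax(self.head(feat), dim=1)
```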
S103: performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
Feature fusion of the image feature vector, the sound feature vector and the text feature vector yields a 512 × 3-dimensional feature vector.
Preferably, in order to reduce the amount of data processed by the character emotion recognition model, the PCA method provided in the sklearn tool library may be used; for example, with the parameter n_components set to 768, dimension reduction of the fused feature vector yields a 768-dimensional feature vector, which is then normalized to obtain the three-channel joint feature vector.
The normalization here is max-min normalization, a linear transformation of the raw data: with min_A and max_A the minimum and maximum values of attribute A, a raw value x is mapped to a value x' in the interval [0, 1] by the formula:

x' = (x - min_A) / (max_A - min_A)
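The dimension reduction and normalization of step S103 can be sketched with scikit-learn and NumPy. The toy batch and the reduced target of 64 components (768 in the text, which would require at least 768 samples) are assumptions:

```python
# Sketch: concatenate/fuse 512 x 3-dimensional feature vectors, reduce them
# with scikit-learn's PCA, then apply per-attribute max-min normalization.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
fused = rng.random((100, 512 * 3))      # a toy batch of fused feature vectors

pca = PCA(n_components=64)              # 768 in the text; 64 fits this toy batch
reduced = pca.fit_transform(fused)

# Max-min normalization per attribute: x' = (x - min_A) / (max_A - min_A)
mins, maxs = reduced.min(axis=0), reduced.max(axis=0)
normalized = (reduced - mins) / (maxs - mins)
```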
s104: and calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector to obtain a character emotion recognition result.
The character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a human face image.
On the basis, calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector, wherein the method specifically comprises the following steps: respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models; and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
For example, the character emotion recognition model includes 3 sub-recognition models C1, C2, and C3, where the recognition result of the sub-recognition model C1 on the joint feature vector is emotion label L1, the recognition result of the sub-recognition model C2 on the joint feature vector is emotion label L1, the recognition result of the sub-recognition model C3 on the joint feature vector is emotion label L2, and the final character emotion recognition result output by the character emotion recognition model is emotion label L1.
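The voting rule in this example can be sketched in a few lines of Python; the label strings follow the L1/L2 example above.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of sub-models."""
    return Counter(predictions).most_common(1)[0][0]

# Example from the text: C1 -> L1, C2 -> L1, C3 -> L2
result = majority_vote(["L1", "L1", "L2"])
print(result)  # "L1"
```

Note that `Counter.most_common` breaks ties by insertion order; a production system might prefer a confidence-weighted tie-break.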
Further, referring to fig. 3, the embodiment also discloses a method for training a character emotion recognition model, which specifically includes the following steps:
s301: acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
the open source emotion data set of the face image can be an Asian face database (KFDB), comprising 52,000 multi-pose, multi-illumination and multi-expression face images of 1,000 people, wherein the images with pose and illumination variation are acquired under strictly controlled conditions.
The open source emotion data set of the voice can be the CASIA Chinese emotion corpus, recorded by professional speakers, which is a data set of 9,600 sound fragments with different pronunciations covering six emotions: anger, happiness, fear, sadness, surprise and neutral.
The open source emotion data set of the text can be a Chinese conversation emotion data set, covering emotion words from more than 40,000 Chinese instances.
S302: carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
image transformations used for data enhancement include, but are not limited to, scaling, rotation, flipping, warping, erasing, shearing, perspective, blurring, or combinations thereof; the amount of data is augmented by the enhancement process.
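A few of the listed transformations can be sketched with plain numpy (in practice, libraries such as torchvision or imgaug are typically used; the patch size chosen for erasing is an arbitrary illustration):

```python
import numpy as np

def augment(image, rng):
    """Produce simple augmented variants of a face image (H x W x C array)."""
    variants = [
        np.fliplr(image),                   # horizontal flip
        np.rot90(image, k=1, axes=(0, 1)),  # 90-degree rotation
    ]
    erased = image.copy()                   # random erasing: zero out a patch
    h, w = image.shape[:2]
    y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
    erased[y:y + h // 4, x:x + w // 4] = 0
    variants.append(erased)
    return variants

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)
out = augment(face, rng)
print(len(out))  # 3 augmented images
```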
S303: carrying out spectrum analysis on sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
s304: extracting an image feature vector set from a face image data set, extracting a sound feature vector set from a sound spectrogram data set, and extracting a text feature vector set from an open source emotion data set of a text;
s305: respectively carrying out feature fusion on feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
referring to fig. 4, fig. 4 is a schematic diagram of multi-modal feature fusion including face images, sounds and text.
S306: and respectively training at least one machine learning model by utilizing the joint feature vector training data set to obtain a character emotion recognition model.
According to this embodiment, automatically generating the training data set saves a large amount of manual labeling cost, and the method is flexibly extensible: transformations of the human face, including scaling, masking, rotation, shearing and the like, can be conveniently added.
Based on the method for recognizing the emotion of a person in a video disclosed in the above embodiments, this embodiment correspondingly discloses a device for recognizing the emotion of a person in a video, please refer to fig. 5, and the device includes:
the multi-modal data extraction unit 501 is configured to extract a face image in a video, and a sound spectrogram and a subtitle text corresponding to the face image;
a multi-modal feature extraction unit 502, configured to extract image feature vectors from the face image, extract sound feature vectors from the sound spectrogram, and extract text feature vectors from the subtitle text;
a feature fusion unit 503, configured to perform feature fusion on the image feature vector, the sound feature vector, and the text feature vector to obtain a joint feature vector;
and an emotion recognition unit 504, configured to invoke a pre-trained character emotion recognition model, and process the joint feature vector to obtain a character emotion recognition result, where the character emotion recognition model is obtained by training at least one machine learning model using a joint feature vector training data set including an image feature vector, a sound feature vector, and a text feature vector of a face image.
Optionally, the multi-modal data extraction unit is specifically configured to:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
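The mapping from face-bearing video frames to a sound segment can be sketched as below; the frame rate, sample rate and frame indices are illustrative assumptions, not values from the patent.

```python
import numpy as np

def sound_segment(audio, sample_rate, frame_times):
    """Cut the audio samples spanning the timestamps of face-bearing frames."""
    start = int(min(frame_times) * sample_rate)
    end = int(max(frame_times) * sample_rate)
    return audio[start:end]

sr = 16000
audio = np.zeros(10 * sr)               # 10 s of (toy) audio
fps = 25.0
face_frames = [50, 51, 52, 53]          # frame indices where a face was found
times = [i / fps for i in face_frames]  # frame index -> timestamp in seconds
segment = sound_segment(audio, sr, times)
print(len(segment) / sr)                # segment duration in seconds
```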
Optionally, the multi-modal feature extraction unit includes a face image feature extraction subunit, and the face image feature extraction subunit is configured to: input the face image into an image feature extraction model obtained by pre-training for processing, and determine feature vectors output by a full connection layer in the image feature extraction model as the image feature vectors; the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model includes a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
Optionally, the multi-modal feature extraction unit includes a sound feature extraction subunit, and the sound feature extraction subunit is configured to: input the sound spectrogram into a sound feature extraction model obtained by pre-training for processing, and determine feature vectors output by a full connection layer in the sound feature extraction model as the sound feature vectors; the sound feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model includes a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
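The description fixes only the layer order (pooling, dropout, fully connected, softmax), not the framework, the convolutional stack, or the sizes. A minimal numpy sketch of the inference-time forward pass, with the convolutional layers omitted and random illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pool(x, k=2):
    """k x k max pooling over a (H, W) feature map."""
    h, w = x.shape[0] // k * k, x.shape[1] // k * k
    return x[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

feature_map = rng.normal(size=(8, 8))   # output of the (omitted) conv stack
pooled = max_pool(feature_map).ravel()  # pooling layer -> flat vector
# dropout sits before the fully connected layer; at inference it is identity
dropped = pooled                        # training would randomly mask units
W = rng.normal(size=(6, pooled.size))   # fully connected layer, 6 emotions
fc_out = W @ dropped                    # the "feature vector output by the
                                        # full connection layer"
probs = softmax(fc_out)                 # softmax layer after the fc layer
print(probs.sum())
```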
Optionally, the multi-modal feature extraction unit includes a text feature extraction subunit, and the text feature extraction subunit is configured to:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
inputting the vector matrix into a text feature extraction model obtained by pre-training for processing, and determining the feature vectors output by a full connection layer in the text feature extraction model as the text feature vectors; the text feature extraction model is obtained by training a preset text convolutional neural network model, and the preset text convolutional neural network model includes a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
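The subtitle pipeline (word segmentation, removal of stop words and low-frequency words, word2vec lookup) can be sketched as follows. A real system would use a Chinese segmenter such as jieba and a trained word2vec model; the stop-word list, embedding table, and English tokens here are toy stand-ins.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is"}  # toy stop-word list
EMBED = {"happy": [0.9, 0.1], "smile": [0.8, 0.2], "day": [0.1, 0.3]}  # toy word2vec

def text_to_matrix(subtitles, min_freq=2):
    """Tokenize subtitles, drop stop words and low-frequency words,
    then stack the word vectors into a matrix (list of rows)."""
    tokens = [w for line in subtitles for w in line.lower().split()]
    freq = Counter(tokens)
    kept = [w for w in tokens
            if w not in STOP_WORDS and freq[w] >= min_freq and w in EMBED]
    return [EMBED[w] for w in kept]

subs = ["The happy smile", "a happy day", "happy smile day"]
matrix = text_to_matrix(subs)
print(len(matrix))  # rows of the vector matrix fed to the text CNN
```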
Optionally, the feature fusion unit is specifically configured to:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the feature vector after feature fusion by using a PCA method packaged in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
Optionally, the apparatus further includes a recognition model training unit, specifically configured to:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
Optionally, the emotion recognition unit is specifically configured to:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
The device for recognizing the emotion of a person in a video disclosed in this embodiment extracts a face image in the video together with the sound spectrogram and subtitle text corresponding to the face image. On the basis of the extracted image feature vector, sound feature vector and text feature vector, feature fusion is performed to obtain a multi-modal combined feature vector. Relative to face image features alone, the multi-modal combined feature vector adds the sound feature vector and the text feature vector, improving the diversity and richness of the character emotion features. The combined feature vector is then input, for recognition, into a character emotion recognition model obtained in advance by training at least one machine learning model with a combined feature vector training data set, so that a more accurate character emotion recognition result can be obtained.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for recognizing emotion of a person in a video, the method comprising:
extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram, and extracting text feature vectors from the subtitle text;
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector to obtain a character emotion recognition result, wherein the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
2. The method of claim 1, wherein the extracting of the face image and the audio spectrogram and subtitle text corresponding to the face image in the video comprises:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
3. The method of claim 1, wherein the extracting image feature vectors from the face image comprises:
inputting the face image into an image feature extraction model obtained by pre-training for processing, and determining the feature vectors output by a full connection layer in the image feature extraction model as the image feature vectors, wherein the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
4. The method of claim 1, wherein the extracting of the acoustic feature vector from the acoustic spectrogram comprises:
inputting the sound spectrogram into a sound feature extraction model obtained by pre-training for processing, and determining the feature vectors output by a full connection layer in the sound feature extraction model as the sound feature vectors, wherein the sound feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
5. The method of claim 1, wherein extracting text feature vectors from the subtitle text comprises:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
inputting the vector matrix into a text feature extraction model obtained by pre-training for processing, and determining the feature vectors output by a full connection layer in the text feature extraction model as the text feature vectors, wherein the text feature extraction model is obtained by training a preset text convolutional neural network model, and the preset text convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
6. The method according to claim 1, wherein the feature fusing the image feature vector, the sound feature vector and the text feature vector to obtain a joint feature vector comprises:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the feature vector after feature fusion by using a PCA method packaged in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
7. The method of claim 1, wherein training the character emotion recognition model comprises:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
8. The method of claim 1, wherein the character emotion recognition model comprises a plurality of sub-recognition models; the step of calling a character emotion recognition model obtained through pre-training and processing the combined feature vector to obtain a character emotion recognition result comprises the following steps:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
9. An apparatus for recognizing emotion of a person in a video, comprising:
the multi-modal data extraction unit is used for extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
the multi-modal feature extraction unit is used for extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram and extracting text feature vectors from the subtitle text;
the feature fusion unit is used for performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and the emotion recognition unit is used for calling a character emotion recognition model obtained by pre-training, processing the combined feature vector to obtain a character emotion recognition result, and the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
10. The apparatus according to claim 9, wherein the multimodal data extraction unit is specifically configured to:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
CN202010111614.2A 2020-02-24 2020-02-24 Method and device for recognizing emotion of character in video Pending CN111339913A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010111614.2A CN111339913A (en) 2020-02-24 2020-02-24 Method and device for recognizing emotion of character in video


Publications (1)

Publication Number Publication Date
CN111339913A true CN111339913A (en) 2020-06-26

Family

ID=71185495


Country Status (1)

Country Link
CN (1) CN111339913A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798849A (en) * 2020-07-06 2020-10-20 广东工业大学 Robot instruction identification method and device, electronic equipment and storage medium
CN112183334A (en) * 2020-09-28 2021-01-05 南京大学 Video depth relation analysis method based on multi-modal feature fusion
CN112233698A (en) * 2020-10-09 2021-01-15 中国平安人寿保险股份有限公司 Character emotion recognition method and device, terminal device and storage medium
CN112364829A (en) * 2020-11-30 2021-02-12 北京有竹居网络技术有限公司 Face recognition method, device, equipment and storage medium
CN112464958A (en) * 2020-12-11 2021-03-09 沈阳芯魂科技有限公司 Multi-modal neural network information processing method and device, electronic equipment and medium
CN112487937A (en) * 2020-11-26 2021-03-12 北京有竹居网络技术有限公司 Video identification method and device, storage medium and electronic equipment
CN112669876A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN112699774A (en) * 2020-12-28 2021-04-23 深延科技(北京)有限公司 Method and device for recognizing emotion of person in video, computer equipment and medium
CN112861949A (en) * 2021-01-29 2021-05-28 成都视海芯图微电子有限公司 Face and voice-based emotion prediction method and system
CN113139525A (en) * 2021-05-21 2021-07-20 国家康复辅具研究中心 Multi-source information fusion-based emotion recognition method and man-machine interaction system
CN113158727A (en) * 2020-12-31 2021-07-23 长春理工大学 Bimodal fusion emotion recognition method based on video and voice information
CN113420733A (en) * 2021-08-23 2021-09-21 北京黑马企服科技有限公司 Efficient distributed big data acquisition implementation method and system
CN113536999A (en) * 2021-07-01 2021-10-22 汇纳科技股份有限公司 Character emotion recognition method, system, medium and electronic device
CN113837072A (en) * 2021-09-24 2021-12-24 厦门大学 Method for sensing emotion of speaker by fusing multidimensional information
WO2024000867A1 (en) * 2022-06-30 2024-01-04 浪潮电子信息产业股份有限公司 Emotion recognition method and apparatus, device, and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010133661A1 (en) * 2009-05-20 2010-11-25 Tessera Technologies Ireland Limited Identifying facial expressions in acquired digital images
CN105740758A (en) * 2015-12-31 2016-07-06 上海极链网络科技有限公司 Internet video face recognition method based on deep learning
CN108255307A (en) * 2018-02-08 2018-07-06 竹间智能科技(上海)有限公司 Man-machine interaction method, system based on multi-modal mood and face's Attribute Recognition
CN108764010A (en) * 2018-03-23 2018-11-06 姜涵予 Emotional state determines method and device
US20180330152A1 (en) * 2017-05-11 2018-11-15 Kodak Alaris Inc. Method for identifying, ordering, and presenting images according to expressions
CN109308466A (en) * 2018-09-18 2019-02-05 宁波众鑫网络科技股份有限公司 The method that a kind of pair of interactive language carries out Emotion identification
CN109614895A (en) * 2018-10-29 2019-04-12 山东大学 A method of the multi-modal emotion recognition based on attention Fusion Features
CN109902632A (en) * 2019-03-02 2019-06-18 西安电子科技大学 A kind of video analysis device and video analysis method towards old man's exception
CN110084266A (en) * 2019-03-11 2019-08-02 中国地质大学(武汉) A kind of dynamic emotion identification method based on audiovisual features depth integration
CN110414323A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Mood detection method, device, electronic equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination