CN111339913A - Method and device for recognizing emotion of character in video - Google Patents


Info

Publication number
CN111339913A
Authority
CN
China
Prior art keywords
feature vector
sound
text
feature
face image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010111614.2A
Other languages
Chinese (zh)
Inventor
杨杰
苏敏童
宋施恩
金义彬
卢海波
Current Assignee
Hunan Happly Sunshine Interactive Entertainment Media Co Ltd
Original Assignee
Hunan Happly Sunshine Interactive Entertainment Media Co Ltd
Priority date
Filing date
Publication date
Application filed by Hunan Happly Sunshine Interactive Entertainment Media Co Ltd filed Critical Hunan Happly Sunshine Interactive Entertainment Media Co Ltd
Priority to CN202010111614.2A priority Critical patent/CN111339913A/en
Publication of CN111339913A publication Critical patent/CN111339913A/en


Classifications

    • G06V 40/168 Human faces: Feature extraction; Face representation
    • G06V 40/172 Human faces: Classification, e.g. identification
    • G06V 20/635 Overlay text, e.g. embedded captions in a TV program
    • G06F 18/253 Pattern recognition: Fusion techniques of extracted features
    • G06N 20/00 Machine learning
    • G06N 3/045 Neural networks: Combinations of networks
    • G06N 3/047 Neural networks: Probabilistic or stochastic networks
    • G06N 3/08 Neural networks: Learning methods

Abstract

The invention provides a method and a device for recognizing the emotion of a character in a video. A face image in the video, together with the sound spectrogram and subtitle text corresponding to the face image, is extracted, and an image feature vector, a sound feature vector and a text feature vector are extracted from them. The three vectors are fused into a multi-modal joint feature vector, which adds sound and text features to the face image features and thereby enriches the character emotion features. The joint feature vector is then input, for recognition processing, into a character emotion recognition model obtained in advance by training at least one machine learning model with a joint feature vector training data set, so that a more accurate character emotion recognition result can be obtained.

Description

Method and device for recognizing emotion of character in video
Technical Field
The invention relates to the technical field of video data analysis, in particular to a method and a device for recognizing emotion of a person in a video.
Background
Human emotion recognition is an important component of human-computer interaction and affective computing research. Common and important categories of emotion in video include happiness, anger, disgust, fear, sadness and surprise. Emotion is an important component of video content; recognizing it allows the emotion expressed by a video segment to be analyzed, from which emotion-related video applications can be derived.
Most existing emotion recognition technologies for video focus on recognition based on facial visual features: faces are detected and located, and the face region images are analyzed so that emotion is classified according to their visual features. These visual features reflect facial emotion most directly, but face images in video are subject to various interference factors, such as blurring, poor illumination conditions and angle deviation, so character emotion recognition based on visual features alone has low accuracy.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for recognizing emotion of a person in a video, so as to improve the accuracy of recognizing emotion of the person.
In order to achieve the above purpose, the invention provides the following specific technical scheme:
a method for recognizing emotion of a person in a video comprises the following steps:
extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram, and extracting text feature vectors from the subtitle text;
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector to obtain a character emotion recognition result, wherein the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
Optionally, the extracting the face image in the video, and the sound spectrogram and the subtitle text corresponding to the face image includes:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
Optionally, the extracting an image feature vector from the face image includes:
inputting the face image into an image feature extraction model obtained by pre-training for processing, and determining the feature vector output by a fully connected layer in the image feature extraction model as the image feature vector, wherein the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a fully connected layer, a dropout layer before the fully connected layer and a softmax layer after the fully connected layer.
Optionally, the extracting the acoustic feature vector from the acoustic spectrogram includes:
the method comprises the steps of inputting a sound frequency spectrogram into a sound feature extraction model obtained through pre-training for processing, determining feature vectors output by a full connection layer in the sound feature extraction model to be the sound feature vectors, training the sound feature extraction model to a preset deep convolutional neural network model, wherein the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the extracting a text feature vector from the subtitle text includes:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
the method comprises the steps of inputting a vector matrix into a pre-trained text feature extraction model for processing, determining feature vectors output by a full connection layer in the text feature extraction model as the text feature vectors, training a preset text convolution neural network model to obtain the text feature extraction model, wherein the preset text convolution neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the performing feature fusion on the image feature vector, the sound feature vector, and the text feature vector to obtain a joint feature vector includes:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the fused feature vector by using the PCA method provided in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
Optionally, training the character emotion recognition model includes:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
Optionally, the character emotion recognition model comprises a plurality of sub-recognition models; the step of calling a character emotion recognition model obtained through pre-training and processing the combined feature vector to obtain a character emotion recognition result comprises the following steps:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
An emotion recognition apparatus for a person in a video, comprising:
the multi-modal data extraction unit is used for extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
the multi-modal feature extraction unit is used for extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram and extracting text feature vectors from the subtitle text;
the feature fusion unit is used for performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and the emotion recognition unit is used for calling a character emotion recognition model obtained by pre-training, processing the combined feature vector to obtain a character emotion recognition result, and the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
Optionally, the multi-modal data extraction unit is specifically configured to:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
Optionally, the multi-modal feature extraction unit includes a face image feature extraction subunit, configured to input the face image into an image feature extraction model obtained by pre-training for processing and to determine the feature vector output by a fully connected layer in the image feature extraction model as the image feature vector, wherein the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a fully connected layer, a dropout layer before the fully connected layer and a softmax layer after the fully connected layer.
Optionally, the multi-modal feature extraction unit includes a sound feature extraction subunit, configured to input the sound spectrogram into a sound feature extraction model obtained by pre-training for processing and to determine the feature vector output by a fully connected layer in the sound feature extraction model as the sound feature vector, wherein the sound feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a fully connected layer, a dropout layer before the fully connected layer and a softmax layer after the fully connected layer.
Optionally, the multi-modal feature extraction unit includes a text feature extraction subunit, and the text feature extraction subunit is configured to:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
the method comprises the steps of inputting a vector matrix into a pre-trained text feature extraction model for processing, determining feature vectors output by a full connection layer in the text feature extraction model as the text feature vectors, training a preset text convolution neural network model to obtain the text feature extraction model, wherein the preset text convolution neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
Optionally, the feature fusion unit is specifically configured to:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the fused feature vector by using the PCA method provided in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
Optionally, the apparatus further includes a recognition model training unit, specifically configured to:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
Optionally, the emotion recognition unit is specifically configured to:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
Compared with the prior art, the invention has the following beneficial effects:
the invention discloses a method for recognizing the emotion of a character in a video, which extracts a face image in the video, a sound frequency spectrogram and a subtitle text corresponding to the face image, on the basis of extracting and obtaining the image characteristic vector, the sound characteristic vector and the text characteristic vector, the image feature vector, the sound feature vector and the text feature vector are subjected to feature fusion to obtain a multi-modal combined feature vector, the multi-modal combined feature vector increases the sound feature vector and the text feature vector relative to the face image features, so that the diversity and the richness of the character emotion features are improved, the combined feature vector is input into a character emotion recognition model obtained by training at least one machine learning model in advance by using a combined feature vector training data set for recognition, and a more accurate character emotion recognition result can be obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a method for recognizing emotion of a person in a video according to an embodiment of the present invention;
fig. 2 is a schematic flowchart of a method for extracting a face image in a video, and a sound spectrogram and a subtitle text corresponding to the face image according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart of a training method of an emotion recognition model disclosed in an embodiment of the present invention;
FIG. 4 is a schematic diagram of multi-modal feature fusion as disclosed in an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a device for recognizing emotion of a person in a video according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention discloses a method for recognizing the emotion of a person in a video, which performs recognition on a multi-modal joint feature vector combining the face image, sound and subtitle text extracted from the video, and can thereby obtain a more accurate emotion recognition result.
Specifically, referring to fig. 1, the method for recognizing emotion of a person in a video disclosed in this embodiment includes the following steps:
s101: extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
referring to fig. 2, an alternative method for extracting a face image in a video, and a sound spectrogram and a subtitle text corresponding to the face image includes the following steps:
s201: splitting a video into a plurality of video frames, and recording the time of each video frame;
specifically, the video can be read by opencv, and the video is split into a plurality of video frames.
S202: sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
the face recognition model can be obtained by training a machine learning model by using a face image training sample, and can also be obtained by sequentially recognizing a plurality of video frames by using the existing face recognition model, such as a face classifier carried by an opencv.
Preferably, the video frame can be converted into a gray scale image to improve the speed of face recognition.
The face region of each video frame recognized as containing a face is cropped to a 128 × 128 format, obtaining the face image.
S203: intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
it can be understood that the video frames containing the face images in the video are a plurality of continuous video frames, each video frame corresponds to a time, the plurality of continuous video frames correspond to a time period, and then the sound segments in the time period in the video can be intercepted.
S204: carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and performing spectrum analysis on the sound segment, wherein the spectrum is quantized into 128 frequency bands, each 128 sampling points is a sampling group, the time length of each sampling segment is 0.02 seconds by 128 seconds to 2.56 seconds, and a 128-dimensional spectral response image, namely a sound spectrogram, is formed.
S205: and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
The pre-constructed caption detection model can be any one of the existing caption detection models.
S102: extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram, and extracting text feature vectors from the subtitle text;
Specifically, the face image is input into an image feature extraction model obtained by pre-training and processed, and the feature vector output by a fully connected layer in the image feature extraction model is determined as the image feature vector; the image feature extraction model is obtained by training a preset deep convolutional neural network model, which comprises a pooling layer, a fully connected layer, a dropout layer before the fully connected layer and a softmax layer after the fully connected layer.
Similarly, the sound spectrogram is input into a sound feature extraction model obtained by pre-training and processed, and the feature vector output by a fully connected layer in the sound feature extraction model is determined as the sound feature vector; the sound feature extraction model is likewise obtained by training a preset deep convolutional neural network model comprising a pooling layer, a fully connected layer, a dropout layer before the fully connected layer and a softmax layer after the fully connected layer.
Performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model and processing the words, wherein each word is represented by a K-dimensional vector, so that N words yield an N × K-dimensional vector matrix;
the method comprises the steps of inputting a vector matrix into a pre-trained text feature extraction model for processing, determining feature vectors output by a full connection layer in the text feature extraction model as the text feature vectors, training a preset text convolution neural network model to obtain the text feature extraction model, wherein the preset text convolution neural network model comprises a pooling layer, a full connection layer, a dropout layer in front of the full connection layer and a softmax layer behind the full connection layer.
In the feature-extraction models of this embodiment, the several fully connected layers of a traditional convolutional neural network model are replaced by a single fully connected layer, with a softmax layer added directly after it, combined with a hybrid model of residual and Inception structures. The input data is processed with batch normalization, the pooling layer uses global average pooling, and a dropout layer is added before the fully connected layer. The dropout layer effectively mitigates overfitting and acts as a regularizer: because any two neurons do not necessarily appear together in the same dropout sub-network at each step, weight updates no longer depend on hidden nodes with fixed joint relations, preventing the situation where certain features are effective only in the presence of other specific features. This forces the network to learn more robust features and increases the robustness of the model.
The image feature vector, the sound feature vector and the text feature vector are each 512-dimensional feature vectors output by the fully connected layer of the corresponding model.
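The architecture described above might be sketched as follows in PyTorch (an assumption; the patent names no framework). The tiny convolutional backbone stands in for the residual/Inception hybrid, and the output of the single 512-unit fully connected layer is taken as the feature vector:

```python
# Hedged sketch of the feature-extraction network: a convolutional backbone
# with batch normalization, global average pooling, a dropout layer before a
# single 512-unit fully connected layer, and a softmax head after it.
import torch
import torch.nn as nn

class EmotionFeatureNet(nn.Module):
    def __init__(self, n_classes=6):
        super().__init__()
        self.backbone = nn.Sequential(           # stand-in for the residual/
            nn.Conv2d(3, 64, 3, padding=1),      # Inception hybrid backbone
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),             # global average pooling
        )
        self.dropout = nn.Dropout(0.5)           # dropout before the FC layer
        self.fc = nn.Linear(64, 512)             # the single FC layer (512-d)
        self.head = nn.Linear(512, n_classes)    # followed by softmax

    def forward(self, x):
        x = self.backbone(x).flatten(1)
        feat = self.fc(self.dropout(x))          # 512-d feature vector
        return feat, torch.softmax(self.head(feat), dim=1)
```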
S103: performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
Feature fusion of the image feature vector, the sound feature vector and the text feature vector yields a 512 × 3-dimensional feature vector.
Preferably, in order to reduce the amount of data processed by the character emotion recognition model, the PCA method provided in the sklearn tool library may be used; for example, with the parameter n_components set to 768, dimension reduction of the fused feature vector yields a 768-dimensional feature vector, which is then normalized to obtain the three-channel joint feature vector.
The normalization here is max-min normalization, a linear transformation of the raw data: with min_A and max_A the minimum and maximum values of attribute A, a raw value x is mapped to a value x' in the interval [0, 1] by the formula:

x' = (x - min_A) / (max_A - min_A)
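The dimension reduction and normalization of step S103 can be sketched with scikit-learn and NumPy. The toy batch and the reduced target of 64 components (768 in the text, which would require at least 768 samples) are assumptions:

```python
# Sketch: concatenate/fuse 512 x 3-dimensional feature vectors, reduce them
# with scikit-learn's PCA, then apply per-attribute max-min normalization.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
fused = rng.random((100, 512 * 3))      # a toy batch of fused feature vectors

pca = PCA(n_components=64)              # 768 in the text; 64 fits this toy batch
reduced = pca.fit_transform(fused)

# Max-min normalization per attribute: x' = (x - min_A) / (max_A - min_A)
mins, maxs = reduced.min(axis=0), reduced.max(axis=0)
normalized = (reduced - mins) / (maxs - mins)
```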
s104: and calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector to obtain a character emotion recognition result.
The character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a human face image.
On the basis, calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector, wherein the method specifically comprises the following steps: respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models; and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
For example, the character emotion recognition model includes 3 sub-recognition models C1, C2, and C3, where the recognition result of the sub-recognition model C1 on the joint feature vector is emotion label L1, the recognition result of the sub-recognition model C2 on the joint feature vector is emotion label L1, the recognition result of the sub-recognition model C3 on the joint feature vector is emotion label L2, and the final character emotion recognition result output by the character emotion recognition model is emotion label L1.
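The voting rule in this example can be sketched in a few lines of Python; the label strings follow the L1/L2 example above.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the label predicted by the largest number of sub-models."""
    return Counter(predictions).most_common(1)[0][0]

# Example from the text: C1 -> L1, C2 -> L1, C3 -> L2
result = majority_vote(["L1", "L1", "L2"])
print(result)  # "L1"
```

Note that `Counter.most_common` breaks ties by insertion order; a production system might prefer a confidence-weighted tie-break.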
Further, referring to fig. 3, the embodiment also discloses a method for training a character emotion recognition model, which specifically includes the following steps:
s301: acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
the open source emotion data set of the face image can be an Asian face database (KFDB), comprising 52,000 multi-pose, multi-illumination and multi-expression face images of 1,000 people, wherein the images with pose and illumination variation are acquired under strictly controlled conditions.
The open source emotion data set of the voice can be the CASIA Chinese emotion corpus, recorded by professional speakers, which is a data set of 9,600 sound fragments with different pronunciations covering six emotions: anger, happiness, fear, sadness, surprise and neutral.
The open source emotion data set of the text can be a Chinese conversation emotion data set, covering emotion words from more than 40,000 Chinese instances.
S302: carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
image transformations used for data enhancement include, but are not limited to, scaling, rotation, flipping, warping, erasing, shearing, perspective, blurring, or combinations thereof; the amount of data is augmented by the enhancement process.
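A few of the listed transformations can be sketched with plain numpy (in practice, libraries such as torchvision or imgaug are typically used; the patch size chosen for erasing is an arbitrary illustration):

```python
import numpy as np

def augment(image, rng):
    """Produce simple augmented variants of a face image (H x W x C array)."""
    variants = [
        np.fliplr(image),                   # horizontal flip
        np.rot90(image, k=1, axes=(0, 1)),  # 90-degree rotation
    ]
    erased = image.copy()                   # random erasing: zero out a patch
    h, w = image.shape[:2]
    y, x = rng.integers(0, h // 2), rng.integers(0, w // 2)
    erased[y:y + h // 4, x:x + w // 4] = 0
    variants.append(erased)
    return variants

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(64, 48, 3), dtype=np.uint8)
out = augment(face, rng)
print(len(out))  # 3 augmented images
```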
S303: carrying out spectrum analysis on sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
s304: extracting an image feature vector set from a face image data set, extracting a sound feature vector set from a sound spectrogram data set, and extracting a text feature vector set from an open source emotion data set of a text;
s305: respectively carrying out feature fusion on feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
referring to fig. 4, fig. 4 is a schematic diagram of multi-modal feature fusion including face images, sounds and text.
S306: and respectively training at least one machine learning model by utilizing the joint feature vector training data set to obtain a character emotion recognition model.
According to this embodiment, automatically generating the training data set saves a large amount of manual labeling cost, and the method is flexibly extensible: transformations of the human face, including scaling, masking, rotation, shearing and the like, can be conveniently added.
Based on the method for recognizing the emotion of a person in a video disclosed in the above embodiments, this embodiment correspondingly discloses a device for recognizing the emotion of a person in a video, please refer to fig. 5, and the device includes:
the multi-modal data extraction unit 501 is configured to extract a face image in a video, and a sound spectrogram and a subtitle text corresponding to the face image;
a multi-modal feature extraction unit 502, configured to extract image feature vectors from the face image, extract sound feature vectors from the sound spectrogram, and extract text feature vectors from the subtitle text;
a feature fusion unit 503, configured to perform feature fusion on the image feature vector, the sound feature vector, and the text feature vector to obtain a joint feature vector;
and an emotion recognition unit 504, configured to invoke a pre-trained character emotion recognition model, and process the joint feature vector to obtain a character emotion recognition result, where the character emotion recognition model is obtained by training at least one machine learning model using a joint feature vector training data set including an image feature vector, a sound feature vector, and a text feature vector of a face image.
Optionally, the multi-modal data extraction unit is specifically configured to:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
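The mapping from face-bearing video frames to a sound segment can be sketched as below; the frame rate, sample rate and frame indices are illustrative assumptions, not values from the patent.

```python
import numpy as np

def sound_segment(audio, sample_rate, frame_times):
    """Cut the audio samples spanning the timestamps of face-bearing frames."""
    start = int(min(frame_times) * sample_rate)
    end = int(max(frame_times) * sample_rate)
    return audio[start:end]

sr = 16000
audio = np.zeros(10 * sr)               # 10 s of (toy) audio
fps = 25.0
face_frames = [50, 51, 52, 53]          # frame indices where a face was found
times = [i / fps for i in face_frames]  # frame index -> timestamp in seconds
segment = sound_segment(audio, sr, times)
print(len(segment) / sr)                # segment duration in seconds
```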
Optionally, the multi-modal feature extraction unit includes a face image feature extraction subunit, and the face image feature extraction subunit is configured to: input the face image into an image feature extraction model obtained by pre-training for processing, and determine feature vectors output by a full connection layer in the image feature extraction model as the image feature vectors; the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model includes a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
Optionally, the multi-modal feature extraction unit includes a sound feature extraction subunit, and the sound feature extraction subunit is configured to: input the sound spectrogram into a sound feature extraction model obtained by pre-training for processing, and determine feature vectors output by a full connection layer in the sound feature extraction model as the sound feature vectors; the sound feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model includes a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
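The description fixes only the layer order (pooling, dropout, fully connected, softmax), not the framework, the convolutional stack, or the sizes. A minimal numpy sketch of the inference-time forward pass, with the convolutional layers omitted and random illustrative weights:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_pool(x, k=2):
    """k x k max pooling over a (H, W) feature map."""
    h, w = x.shape[0] // k * k, x.shape[1] // k * k
    return x[:h, :w].reshape(h // k, k, w // k, k).max(axis=(1, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

feature_map = rng.normal(size=(8, 8))   # output of the (omitted) conv stack
pooled = max_pool(feature_map).ravel()  # pooling layer -> flat vector
# dropout sits before the fully connected layer; at inference it is identity
dropped = pooled                        # training would randomly mask units
W = rng.normal(size=(6, pooled.size))   # fully connected layer, 6 emotions
fc_out = W @ dropped                    # the "feature vector output by the
                                        # full connection layer"
probs = softmax(fc_out)                 # softmax layer after the fc layer
print(probs.sum())
```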
Optionally, the multi-modal feature extraction unit includes a text feature extraction subunit, and the text feature extraction subunit is configured to:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
inputting the vector matrix into a text feature extraction model obtained by pre-training for processing, and determining the feature vectors output by a full connection layer in the text feature extraction model as the text feature vectors; the text feature extraction model is obtained by training a preset text convolutional neural network model, and the preset text convolutional neural network model includes a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
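The subtitle pipeline (word segmentation, removal of stop words and low-frequency words, word2vec lookup) can be sketched as follows. A real system would use a Chinese segmenter such as jieba and a trained word2vec model; the stop-word list, embedding table, and English tokens here are toy stand-ins.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "is"}  # toy stop-word list
EMBED = {"happy": [0.9, 0.1], "smile": [0.8, 0.2], "day": [0.1, 0.3]}  # toy word2vec

def text_to_matrix(subtitles, min_freq=2):
    """Tokenize subtitles, drop stop words and low-frequency words,
    then stack the word vectors into a matrix (list of rows)."""
    tokens = [w for line in subtitles for w in line.lower().split()]
    freq = Counter(tokens)
    kept = [w for w in tokens
            if w not in STOP_WORDS and freq[w] >= min_freq and w in EMBED]
    return [EMBED[w] for w in kept]

subs = ["The happy smile", "a happy day", "happy smile day"]
matrix = text_to_matrix(subs)
print(len(matrix))  # rows of the vector matrix fed to the text CNN
```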
Optionally, the feature fusion unit is specifically configured to:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the feature vector after feature fusion by using a PCA method packaged in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
Optionally, the apparatus further includes a recognition model training unit, specifically configured to:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
Optionally, the emotion recognition unit is specifically configured to:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
The device for recognizing the emotion of a person in a video disclosed in this embodiment extracts a face image in the video together with the sound spectrogram and subtitle text corresponding to the face image. On the basis of the extracted image feature vector, sound feature vector and text feature vector, feature fusion is performed to obtain a multi-modal combined feature vector. Relative to face image features alone, the multi-modal combined feature vector adds the sound feature vector and the text feature vector, improving the diversity and richness of the character emotion features. The combined feature vector is then input, for recognition, into a character emotion recognition model obtained in advance by training at least one machine learning model with a combined feature vector training data set, so that a more accurate character emotion recognition result can be obtained.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for recognizing emotion of a person in a video, the method comprising:
extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram, and extracting text feature vectors from the subtitle text;
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and calling a character emotion recognition model obtained by pre-training, and processing the combined feature vector to obtain a character emotion recognition result, wherein the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
2. The method of claim 1, wherein the extracting of the face image and the audio spectrogram and subtitle text corresponding to the face image in the video comprises:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
3. The method of claim 1, wherein the extracting image feature vectors from the face image comprises:
inputting the face image into an image feature extraction model obtained by pre-training for processing, and determining the feature vectors output by a full connection layer in the image feature extraction model as the image feature vectors, wherein the image feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
4. The method of claim 1, wherein the extracting of the acoustic feature vector from the acoustic spectrogram comprises:
inputting the sound spectrogram into a sound feature extraction model obtained by pre-training for processing, and determining the feature vectors output by a full connection layer in the sound feature extraction model as the sound feature vectors, wherein the sound feature extraction model is obtained by training a preset deep convolutional neural network model, and the preset deep convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
5. The method of claim 1, wherein extracting text feature vectors from the subtitle text comprises:
performing word segmentation on the subtitle text, and removing low-frequency words and stop words in word segmentation results to obtain a plurality of words;
calling a word2vec model, and processing the words to obtain a vector matrix;
inputting the vector matrix into a text feature extraction model obtained by pre-training for processing, and determining the feature vectors output by a full connection layer in the text feature extraction model as the text feature vectors, wherein the text feature extraction model is obtained by training a preset text convolutional neural network model, and the preset text convolutional neural network model comprises a pooling layer, a full connection layer, a dropout layer before the full connection layer, and a softmax layer after the full connection layer.
6. The method according to claim 1, wherein the feature fusing the image feature vector, the sound feature vector and the text feature vector to obtain a joint feature vector comprises:
performing feature fusion on the image feature vector, the sound feature vector and the text feature vector;
performing dimensionality reduction on the feature vector after feature fusion by using a PCA method packaged in the sklearn tool library;
and carrying out normalization processing on the feature vector obtained after the dimension reduction processing to obtain a three-channel combined feature vector.
7. The method of claim 1, wherein training the character emotion recognition model comprises:
acquiring an open source emotion data set of a face image, an open source emotion data set of sound and an open source emotion data set of a text;
carrying out data enhancement processing on the open source emotion data set of the face image to obtain a face image data set;
carrying out spectrum analysis on the sound segments in the open source emotion data set of the sound to obtain a sound spectrogram data set;
extracting an image feature vector set from the face image data set, extracting a sound feature vector set from the sound spectrogram data set, and extracting a text feature vector set from the open source emotion data set of the text;
respectively carrying out feature fusion on the feature vectors with the same emotion label in the image feature vector set, the sound feature vector set and the text feature vector set to obtain joint feature vectors corresponding to different emotion labels, and using the joint feature vectors as a joint feature vector training data set of the character emotion recognition model;
and respectively training at least one machine learning model by utilizing a joint feature vector training data set to obtain the character emotion recognition model.
8. The method of claim 1, wherein the character emotion recognition model comprises a plurality of sub-recognition models; the step of calling a character emotion recognition model obtained through pre-training and processing the combined feature vector to obtain a character emotion recognition result comprises the following steps:
respectively inputting the combined feature vectors into a plurality of sub-recognition models for recognition processing to obtain character emotion recognition results of the sub-recognition models;
and determining the character emotion recognition result with the largest number of same results in the character emotion recognition results of the plurality of sub-recognition models as a final character emotion recognition result output by the character emotion recognition model.
9. An apparatus for recognizing emotion of a person in a video, comprising:
the multi-modal data extraction unit is used for extracting a face image in a video, and a sound frequency spectrogram and a subtitle text which correspond to the face image;
the multi-modal feature extraction unit is used for extracting image feature vectors from the face image, extracting sound feature vectors from the sound spectrogram and extracting text feature vectors from the subtitle text;
the feature fusion unit is used for performing feature fusion on the image feature vector, the sound feature vector and the text feature vector to obtain a combined feature vector;
and the emotion recognition unit is used for calling a character emotion recognition model obtained by pre-training, processing the combined feature vector to obtain a character emotion recognition result, and the character emotion recognition model is obtained by training at least one machine learning model by using a combined feature vector training data set comprising an image feature vector, a sound feature vector and a text feature vector of a face image.
10. The apparatus according to claim 9, wherein the multimodal data extraction unit is specifically configured to:
splitting a video into a plurality of video frames, and recording the time of each video frame;
sequentially identifying a plurality of video frames by using a preset face identification model to obtain a face image, and recording the time of the video frames containing the face image;
intercepting sound segments in a corresponding time period in the video according to the time of a video frame containing the face image;
carrying out spectrum analysis on the sound fragment to obtain the sound spectrogram;
and calling a pre-constructed subtitle detection model to process the video frame containing the face image to obtain a subtitle text of the video frame containing the face image.
CN202010111614.2A 2020-02-24 2020-02-24 Method and device for recognizing emotion of character in video Pending CN111339913A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010111614.2A CN111339913A (en) 2020-02-24 2020-02-24 Method and device for recognizing emotion of character in video


Publications (1)

Publication Number Publication Date
CN111339913A true CN111339913A (en) 2020-06-26

Family

ID=71185495


Country Status (1)

Country Link
CN (1) CN111339913A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798849A (en) * 2020-07-06 2020-10-20 广东工业大学 Robot instruction identification method and device, electronic equipment and storage medium
CN112183334A (en) * 2020-09-28 2021-01-05 南京大学 Video depth relation analysis method based on multi-modal feature fusion
CN112233698A (en) * 2020-10-09 2021-01-15 中国平安人寿保险股份有限公司 Character emotion recognition method and device, terminal device and storage medium
CN112364829A (en) * 2020-11-30 2021-02-12 北京有竹居网络技术有限公司 Face recognition method, device, equipment and storage medium
CN112464958A (en) * 2020-12-11 2021-03-09 沈阳芯魂科技有限公司 Multi-modal neural network information processing method and device, electronic equipment and medium
CN112487937A (en) * 2020-11-26 2021-03-12 北京有竹居网络技术有限公司 Video identification method and device, storage medium and electronic equipment
CN112669876A (en) * 2020-12-18 2021-04-16 平安科技(深圳)有限公司 Emotion recognition method and device, computer equipment and storage medium
CN112699774A (en) * 2020-12-28 2021-04-23 深延科技(北京)有限公司 Method and device for recognizing emotion of person in video, computer equipment and medium
CN112861949A (en) * 2021-01-29 2021-05-28 成都视海芯图微电子有限公司 Face and voice-based emotion prediction method and system
CN113139525A (en) * 2021-05-21 2021-07-20 国家康复辅具研究中心 Multi-source information fusion-based emotion recognition method and man-machine interaction system
CN113158727A (en) * 2020-12-31 2021-07-23 长春理工大学 Bimodal fusion emotion recognition method based on video and voice information
CN113420733A (en) * 2021-08-23 2021-09-21 北京黑马企服科技有限公司 Efficient distributed big data acquisition implementation method and system
CN113536999A (en) * 2021-07-01 2021-10-22 汇纳科技股份有限公司 Character emotion recognition method, system, medium and electronic device
CN113837072A (en) * 2021-09-24 2021-12-24 厦门大学 Method for sensing emotion of speaker by fusing multidimensional information
WO2024000867A1 (en) * 2022-06-30 2024-01-04 浪潮电子信息产业股份有限公司 Emotion recognition method and apparatus, device, and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010133661A1 (en) * 2009-05-20 2010-11-25 Tessera Technologies Ireland Limited Identifying facial expressions in acquired digital images
CN105740758A (en) * 2015-12-31 2016-07-06 上海极链网络科技有限公司 Internet video face recognition method based on deep learning
CN108255307A (en) * 2018-02-08 2018-07-06 竹间智能科技(上海)有限公司 Man-machine interaction method, system based on multi-modal mood and face's Attribute Recognition
CN108764010A (en) * 2018-03-23 2018-11-06 姜涵予 Emotional state determines method and device
US20180330152A1 (en) * 2017-05-11 2018-11-15 Kodak Alaris Inc. Method for identifying, ordering, and presenting images according to expressions
CN109308466A (en) * 2018-09-18 2019-02-05 宁波众鑫网络科技股份有限公司 The method that a kind of pair of interactive language carries out Emotion identification
CN109614895A (en) * 2018-10-29 2019-04-12 山东大学 A method of the multi-modal emotion recognition based on attention Fusion Features
CN109902632A (en) * 2019-03-02 2019-06-18 西安电子科技大学 A kind of video analysis device and video analysis method towards old man's exception
CN110084266A (en) * 2019-03-11 2019-08-02 中国地质大学(武汉) A kind of dynamic emotion identification method based on audiovisual features depth integration
CN110414323A (en) * 2019-06-14 2019-11-05 平安科技(深圳)有限公司 Mood detection method, device, electronic equipment and storage medium




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination