CN113254713B - Multi-source emotion calculation system and method for generating emotion curve based on video content - Google Patents


Info

Publication number
CN113254713B
Authority
CN
China
Prior art keywords
emotion
video
feature
auditory
input
Prior art date
Legal status
Active
Application number
CN202110533941.1A
Other languages
Chinese (zh)
Other versions
CN113254713A (en)
Inventor
牛建伟
杨森
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University
Priority to CN202110533941.1A
Publication of CN113254713A
Application granted
Publication of CN113254713B
Active legal-status
Anticipated expiration legal-status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/75 Clustering; Classification
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-source emotion calculation system and method for generating an emotion curve based on video content, relating to deep learning, computer vision, emotion calculation, audio processing, image processing and related technologies. The system comprises a video content feature extraction module, an audio content feature extraction module, a feature fusion regression module and a long-video segmentation and processing module. The method converts the video and audio data of each short video segment into visual features and auditory features using the corresponding feature extraction convolutional neural networks; the visual and auditory features are then fused and regressed to obtain the emotion value of each video segment; finally the emotion values are combined into the emotion sequence of the long video, and a smooth emotion curve is generated with an interpolation algorithm. The invention realizes an automatic method and system for computing the emotion change curve of a video on a computer; it retains the characteristics of viewers' manual emotion annotation, and its output is smooth and natural, visually appealing, and valuable for subsequent analysis and use.

Description

Multi-source emotion calculation system and method for generating emotion curve based on video content
Technical Field
The invention relates to deep learning, computer vision and video processing technologies, and in particular to a multi-source emotion calculation system and method for generating an emotion curve based on video content, i.e. a technique for generating an emotion curve from video content.
Background
The video referred to in the present invention is specifically long video with a duration of more than 1 minute, which usually contains video content and corresponding audio data. The emotion curve refers to the change over time of the emotional feedback that the video evokes in the viewer. Emotion is represented as a 2-dimensional value of Valence and Arousal: valence indicates how positive or negative the emotion is, and arousal indicates its intensity. Computing the emotion curve of a video is a video-to-emotion-curve conversion task whose purpose is to convert the input video into an emotion curve. In recent years, video content understanding based on deep learning has advanced significantly, and recent research has proposed a series of systematic approaches, for example methods based on Convolutional Neural Networks (CNNs) and methods based on Recurrent Neural Networks (RNNs).
However, existing methods for computing a video emotion curve usually operate on either the video content or the audio content alone, which makes it difficult to exploit the combined information and characteristics. The resulting emotion curve does not match the emotional fluctuation the video evokes in the audience and cannot be used directly as the emotion representation of the video for further processing.
Disclosure of Invention
The invention aims to provide an automatic method and system that generates an emotion curve from the visual and auditory content of a video based on two-dimensional and three-dimensional convolutional neural networks, so as to solve the problem that emotion representations generated from video in the prior art perform poorly overall.
The invention discloses a multi-source emotion calculation system for generating an emotion curve based on video content, comprising a video content feature extraction module, an audio content feature extraction module, a feature fusion regression module and a long-video segmentation and processing module. The video content feature extraction module extracts visual features from the input video. The audio content feature extraction module computes auditory features of the input video. The feature fusion regression module fuses the visual and auditory features and regresses the emotion value of each short video. The long-video segmentation and processing module segments the input long video into short videos of equal length; the emotion value of each short video is computed by the video content feature extraction module, the audio content feature extraction module and the feature fusion regression module, the values are spliced into the emotion sequence of the whole long video, and the spliced sequence is smoothed to obtain the emotion curve of the original input video.
For the above multi-source emotion calculation system for generating an emotion curve based on video content, the multi-source emotion calculation method comprises the following steps:
Step 1: cut the long video V into short video segments of equal length with a video cutting tool.
Step 2: extract video sample frames from each short video segment, then extract the visual feature Feature_visual of the short video content from the consecutive sample frames with a three-dimensional residual network.
Step 3: compute the Mel-frequency cepstral coefficients (MFCCs) of the audio in each short video segment, and extract the auditory feature Feature_auditory of the short video using the MFCCs as input.
Step 4: for each short video segment, fuse the extracted Feature_visual and Feature_auditory into a unified input vector Feature, and feed it into a regressor to obtain the emotion value of each short video segment.
Step 5: splice the emotion values of the short video segments obtained in steps 2-4 into the emotion sequence of the long video V and smooth it.
Compared with the prior art, the method and system of the invention have the following advantages and positive effects:
1. The multi-source emotion calculation system and method for generating an emotion curve based on video content compute the spatio-temporal characteristics of the video from video data of different modalities (visual and auditory), then fuse the features of the two modalities and train a regressor to obtain the emotion value of each short video. The long video is then automatically segmented and emotion is computed for each segment to obtain an emotion sequence. Because the emotion sequence is discontinuous in time, the method interpolates the sequence with third-order (cubic) spline interpolation and outputs the resulting smooth emotion curve.
2. In the design of the video visual feature extraction network, the invention uses a three-dimensional deep convolutional network, which effectively extracts spatio-temporal features that capture the context of video frames.
3. In the design of the video auditory feature extraction network, the invention provides a preprocessing method based on Mel-frequency cepstral coefficients, so that the extracted auditory features better match the characteristics of the human ear.
4. During network parameter training, a large-scale manually annotated video emotion dataset is used, so the generated video emotion curve is closer to the real experience of human viewers and more useful for further processing in subsequent video analysis.
Drawings
FIG. 1 is a schematic diagram of a multi-source emotion calculation system for generating an emotion curve based on video content according to the method of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings and examples.
The multi-source emotion calculation system for generating an emotion curve based on video content, as shown in FIG. 1, comprises the following functional modules: a video content feature extraction module, an audio content feature extraction module, a feature fusion regression module, and a long-video segmentation and processing module.
The video content feature extraction module extracts sample frames from the input video, then extracts spatio-temporal features (visual features) of the video content from the consecutive sample frames with a three-dimensional residual network, and passes the features to the feature fusion regression module.
The audio content feature extraction module takes the audio of the input video, computes the Mel-Frequency Cepstral Coefficients (MFCC), feeds them into a deep residual network to extract audio features (auditory features), and passes them to the feature fusion regression module.
The feature fusion regression module fuses the visual features extracted by the video content feature extraction module with the auditory features extracted by the audio content feature extraction module, and regresses the emotion value of the corresponding short video with a fully connected network.
The long-video segmentation and processing module segments the input long video into equal-length short videos; the emotion values of the short videos computed by the video content feature extraction module, the audio content feature extraction module and the feature fusion regression module are spliced into the emotion sequence of the whole long video, and the spliced sequence is then smoothed with third-order spline interpolation to obtain the emotion curve of the original input video.
For the above multi-source emotion calculation system for generating an emotion curve based on video content, as shown in FIG. 1, the multi-source emotion calculation method is as follows:
Step 1: cut the long video V into short video segments of equal length with a video cutting tool (FFmpeg). In the design of the invention, the long video V is divided equally into 8-second short video segments, and the leftover remainder is ignored.
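As an illustration only (not the patent's exact command), the 8-second equal-length segmentation of step 1 can be reproduced with FFmpeg's segment muxer invoked from Python; the function name, file names and output pattern below are placeholders.

```python
# Sketch (assumption): split a long video V into 8-second segments with FFmpeg.
import subprocess

def split_into_segments(long_video_path: str,
                        out_pattern: str = "segment_%04d.mp4",
                        segment_seconds: int = 8) -> None:
    """Cut the long video into equal-length short segments; the trailing
    remainder shorter than segment_seconds ends up in the last file."""
    subprocess.run(
        [
            "ffmpeg", "-i", long_video_path,
            "-c", "copy",                 # stream copy: no re-encoding (cuts snap to keyframes;
                                          # drop this to re-encode for exact 8 s boundaries)
            "-map", "0",                  # keep both the video and the audio streams
            "-f", "segment",              # FFmpeg segment muxer
            "-segment_time", str(segment_seconds),
            "-reset_timestamps", "1",
            out_pattern,
        ],
        check=True,
    )

# Example: split_into_segments("long_video.mp4")
```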
Step 2: obtaining spatio-temporal Feature representing visual information in each short video segmentvisual. The visual feature extraction method is not limited, and includes, but is not limited to, artificially designed features, convolutional neural networks, cyclic neural networks, long-short term memory networks, and attention mechanism.
In the embodiment of the invention, one frame is taken as a key frame every 4 frames by frame sampling. Because video is continuous and changes dynamically, a conventional convolutional neural network can only process single frames and cannot effectively use the context of consecutive frames. Feature_visual is therefore extracted mainly with an 18-layer three-dimensional deep residual network (3-Dimensional ResNet, R3D); the 3-dimensional convolutional neural network processes spatial and temporal information jointly and propagates it through the network. The input tensor z_i is 4-dimensional with size 3 × T × H × W, where 3 is the number of channels per video frame (typically RGB), T is the number of frames in a video segment, and H and W are the height and width of the frames. In the embodiment of the invention, each frame is scaled to 112 × 112 (height and width). The receptive field of the deep residual network moves over the input tensor along space (height H and width W) and time (T frames); after the convolution operation and the ReLU activation function, the output tensor is produced. The three-dimensional deep residual network adopts the R3D structure, which generally performs best. The output of the i-th 3D convolution block is:
z_i = z_{i-1} + F(z_{i-1}; θ_i)
where F(z_{i-1}; θ_i) implements the convolution operations with weights θ_i followed by the ReLU function, z_{i-1} is the output of the previous 3D convolution block, and z_i is the output of the i-th 3D convolution block. The output of the 18 3D convolution blocks passes through a spatio-temporal pooling layer and one fully connected layer to produce the 128-dimensional Feature_visual representing the visual information. R3D is a 3-dimensional spatio-temporal convolutional network whose basic component is the 3D convolution block; for implementation details see Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, Manohar Paluri: A Closer Look at Spatiotemporal Convolutions for Action Recognition. CVPR 2018: 6450-6459. The ReLU function is a neural network activation function; see Xavier Glorot, Antoine Bordes, Yoshua Bengio: Deep Sparse Rectifier Neural Networks. AISTATS 2011: 315-323.
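A minimal sketch of the visual branch follows, assuming PyTorch and torchvision's r3d_18 as the 18-layer R3D backbone; replacing its classification head with a 128-dimensional fully connected layer mirrors the description above, but the class name is hypothetical and the training procedure is not reproduced.

```python
# Sketch (assumption): 18-layer R3D backbone producing the 128-dimensional Feature_visual.
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

class VisualFeatureExtractor(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        backbone = r3d_18()                           # 18-layer 3D ResNet (R3D)
        in_dim = backbone.fc.in_features              # 512 after spatio-temporal pooling
        backbone.fc = nn.Linear(in_dim, feature_dim)  # final FC layer -> 128-d Feature_visual
        self.backbone = backbone

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, T, 112, 112) tensor of RGB key frames (one frame every 4 frames)
        return self.backbone(clip)                    # (batch, 128)

# Example: feat = VisualFeatureExtractor()(torch.randn(2, 3, 16, 112, 112))
```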
And step 3: feature representing auditory information in each short video segment is obtainedauditory. Firstly, the Mel frequency cepstrum coefficient of the audio is calculated, and then the Mel frequency cepstrum coefficient of the audio is used as input to extract auditory characteristics. The determination method of the auditory feature extraction is not limited, and includes but is not limited to artificial design features, neural networks and other machine learning methods.
In the embodiment of the invention, to reduce the input size and the model size, the sampling rate of the audio signal is reduced to 2000 Hz using sine interpolation. Feature_auditory is extracted from the Mel-frequency cepstral coefficients of the audio mainly on the principle of the deep residual network (ResNet). The invention trains an 18-layer ResNet initialized with parameters pre-trained on ImageNet, and changes the input of the first convolutional layer to 2 × 64, i.e. from the three color channels of natural images to the two channels of stereo audio. The parameters of the 18-layer ResNet model are then trained and fine-tuned on a video emotion analysis dataset; the fine-tuned model is better suited to the emotion analysis task. The network finally outputs the 128-dimensional Feature_auditory representing the auditory information. ResNet is a convolutional neural network; for implementation details see Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun: Deep Residual Learning for Image Recognition. CVPR 2016: 770-778.
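A comparable sketch of the auditory branch, assuming librosa for resampling to 2000 Hz and MFCC computation and torchvision's resnet18 as the 18-layer ResNet whose first convolution is widened to 2 input channels; the function and class names are hypothetical, and the MFCC parameters (n_mfcc and hop-length defaults) are illustrative choices not specified by the patent.

```python
# Sketch (assumption): MFCC front end + 2-channel ResNet-18 producing the 128-d Feature_auditory.
import librosa
import numpy as np
import torch
import torch.nn as nn
from torchvision.models import resnet18

def stereo_mfcc(audio_path: str, target_sr: int = 2000, n_mfcc: int = 64) -> torch.Tensor:
    """Load the segment's audio track (e.g. pre-extracted with FFmpeg), downsample it
    to 2000 Hz and return a (2, n_mfcc, frames) MFCC tensor; n_mfcc=64 is illustrative."""
    wav, sr = librosa.load(audio_path, sr=None, mono=False)      # keep both channels
    if wav.ndim == 1:                                            # mono input: duplicate channel
        wav = np.stack([wav, wav])
    wav = librosa.resample(wav, orig_sr=sr, target_sr=target_sr)
    mfcc = np.stack([librosa.feature.mfcc(y=ch, sr=target_sr, n_mfcc=n_mfcc) for ch in wav])
    return torch.from_numpy(mfcc).float()

class AuditoryFeatureExtractor(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        backbone = resnet18()                                    # 18-layer ResNet
        # first convolution takes 2 audio channels instead of 3 RGB channels (2 x 64 filters)
        backbone.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
        backbone.fc = nn.Linear(backbone.fc.in_features, feature_dim)
        self.backbone = backbone

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, 2, n_mfcc, frames) -> (batch, 128) Feature_auditory
        return self.backbone(mfcc)
```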
Step 4: for each short video segment, as shown in FIG. 1, fuse the obtained Feature_visual and Feature_auditory into a unified input vector Feature and feed it into a regressor to obtain the emotion value of each short video segment. The fusion method includes, but is not limited to, machine learning methods such as neural networks and support vector machines and data fusion techniques; the regressor includes, but is not limited to, machine learning methods such as support vector regression and neural networks; the emotion value includes, but is not limited to, 2-dimensional valence-arousal emotion, discrete emotion classes and other video emotion representations.
In the embodiment of the invention, the 128-dimensional feature vector Feature_visual and the 128-dimensional feature vector Feature_auditory are first normalized to unify their scale and distribution. The normalized Feature_visual and Feature_auditory are then concatenated into a unified 256-dimensional input feature vector Feature. This feature vector is fed into a 2-layer fully connected network: the input is 256-dimensional, the output of the first layer is 64-dimensional, and the output is a 2-dimensional vector representing the arousal value and the valence value, i.e. [Arousal, Valence]. The fully connected network uses the ReLU activation function.
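A sketch of the feature fusion regression module as described: each 128-dimensional feature is normalized (L2 normalization is an assumption; the patent only states that the features are normalized), the two are concatenated into a 256-dimensional vector, and a 2-layer fully connected network (256 to 64 to 2) with ReLU outputs [Arousal, Valence]. The class name is hypothetical.

```python
# Sketch (assumption): fuse Feature_visual and Feature_auditory and regress [Arousal, Valence].
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionRegressor(nn.Module):
    def __init__(self, feat_dim: int = 128, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(2 * feat_dim, hidden_dim)   # 256 -> 64
        self.fc2 = nn.Linear(hidden_dim, 2)              # 64 -> [Arousal, Valence]

    def forward(self, feat_visual: torch.Tensor, feat_auditory: torch.Tensor) -> torch.Tensor:
        # normalize both features to unify their scale and distribution (L2 norm assumed)
        v = F.normalize(feat_visual, dim=-1)
        a = F.normalize(feat_auditory, dim=-1)
        fused = torch.cat([v, a], dim=-1)                # unified 256-d Feature vector
        return self.fc2(F.relu(self.fc1(fused)))         # (batch, 2) emotion value

# Example: FusionRegressor()(torch.randn(4, 128), torch.randn(4, 128))
```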
Step 5: splice the emotion values of the short video segments obtained in steps 2-4 into the emotion sequence of the long video V, and smooth it.
The emotion value of each short video obtained with steps 2-4 is a point in the 2-dimensional emotion space. These discrete points are connected into a polyline by conventional splicing. The polyline represents the emotion sequences of the long video V; each emotion sequence is a sequence of two-dimensional points. Compared with the prior art, the emotion sequence predicted by the deep learning model preserves the independence between the valence and arousal dimensions of the 2-dimensional emotion. Third-order spline interpolation is then applied to the emotion sequence to form a smooth emotion curve, which is output.
Step 6: output the smoothed emotion sequence, obtained with the interpolation algorithm, as the emotion curve.
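Steps 5-6 can be sketched with SciPy's CubicSpline, i.e. third-order spline interpolation over the per-segment emotion values; placing each value at the center of its 8-second interval and sampling the curve once per second are illustrative choices, and the function name is hypothetical.

```python
# Sketch (assumption): turn the per-segment emotion values into a smooth emotion curve.
import numpy as np
from scipy.interpolate import CubicSpline

def emotion_curve(segment_emotions: np.ndarray, segment_seconds: float = 8.0,
                  step: float = 1.0) -> tuple[np.ndarray, np.ndarray]:
    """segment_emotions: (n_segments, 2) array of [arousal, valence] per 8-second clip.
    Returns (timestamps, curve) where the curve is sampled every `step` seconds."""
    n = segment_emotions.shape[0]
    # place each segment's emotion value at the center of its time interval
    t = (np.arange(n) + 0.5) * segment_seconds
    spline = CubicSpline(t, segment_emotions, axis=0)   # third-order spline per dimension
    t_dense = np.arange(t[0], t[-1] + 1e-9, step)
    return t_dense, spline(t_dense)                     # smooth [arousal, valence] curve
```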
To verify the effectiveness of the generated emotion curve, the invention also provides a quantitative emotion curve verification method. Correlation analysis is performed between the generated emotion curve and the emotion sequence computed from viewer annotations, and Spearman's Rank Correlation Coefficient (SRCC) between the two is calculated to quantitatively measure the effectiveness of the emotion curve. The calculation is as follows:
SRCC = 1 - (6 Σ d_i^2) / (n(n^2 - 1))
where d_i = rg(X_i) - rg(Y_i) is the difference between the ranks in the 2 sequences, n is the length of the sequences, X is the emotion sequence computed by the invention and X_i its i-th value, Y is the corresponding viewer-annotated emotion sequence and Y_i its i-th value, and rg(X_i) is the rank of X_i in X, i.e. its position when sorted by size; rg(Y_i) is defined likewise. The 2 sequences are the emotion (Valence and Arousal) sequences output by the long-video segmentation and processing module and the corresponding viewer-annotated emotion sequences.
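For the quantitative check, scipy.stats.spearmanr implements the same rank-difference formula; computing the SRCC separately for the arousal and valence dimensions is shown below with placeholder data, and the function name is hypothetical.

```python
# Sketch: SRCC between the generated emotion sequence X and the viewer-annotated sequence Y.
import numpy as np
from scipy.stats import spearmanr

def srcc_per_dimension(X: np.ndarray, Y: np.ndarray) -> dict:
    """X, Y: (n, 2) sequences of [arousal, valence]; returns the SRCC of each dimension."""
    rho_arousal, _ = spearmanr(X[:, 0], Y[:, 0])
    rho_valence, _ = spearmanr(X[:, 1], Y[:, 1])
    return {"arousal": rho_arousal, "valence": rho_valence}

# Example with placeholder data:
# X = np.random.rand(100, 2); Y = np.random.rand(100, 2); print(srcc_per_dimension(X, Y))
```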

Claims (3)

1. A multi-source emotion calculation system for generating an emotion curve based on video content, characterized in that: the system comprises a video content feature extraction module, an audio content feature extraction module, a feature fusion regression module and a long-video segmentation and processing module;
the video content feature extraction module is used for extracting visual features from an input video;
the audio content feature extraction module is used for calculating auditory features of input video;
the feature fusion regression module is used for performing fusion regression on the visual features and the auditory features and performing regression prediction on emotion values corresponding to the short videos;
the long video segmentation and processing module segments an input original long video into short videos with equal length, the emotion values of each short video calculated by the video content feature extraction module, the audio content feature extraction module and the feature fusion regression module are spliced to form a whole long video emotion sequence, and the spliced long video emotion sequence is smoothed to obtain an emotion curve of the original input video;
the multi-source emotion calculation method comprises the following steps:
step 1: cutting the long video V into short video segments of equal length with a video cutting tool;
step 2: extracting video sample frames from each short video segment, and then extracting the visual feature Feature_visual of the short video content from the consecutive sample frames with a three-dimensional residual network; the three-dimensional deep residual network adopts the R3D structure, and each frame is scaled to 112 × 112; the receptive field of the deep residual network moves over the input tensor along space and time, and the output tensor is produced after the convolution operation and the ReLU activation function;
step 3: calculating the Mel-frequency cepstral coefficients of the audio in each short video segment, and extracting the auditory feature Feature_auditory of the short video using the Mel-frequency cepstral coefficients as input; specifically, the sampling rate of the audio signal is reduced to 2000 Hz by sine interpolation, an 18-layer deep residual network is trained with parameters pre-trained on ImageNet, and the input of the first convolutional layer is changed to 2 × 64, i.e. from the three color channels of natural images to the two channels of stereo audio; the parameters of the 18-layer ResNet model are then trained and fine-tuned on a video emotion analysis dataset to obtain new parameters, and the 128-dimensional Feature_auditory representing the auditory information is output;
step 4: for each short video segment, fusing the extracted Feature_visual and Feature_auditory into a unified input vector Feature and inputting it into a regressor to obtain the emotion value of each short video segment;
step 5: splicing the emotion values of the short video segments obtained in steps 2-4 into the emotion sequence of the long video V, and smoothing it;
step 6: outputting the smoothed emotion sequence as an emotion curve with an interpolation algorithm.
2. The multi-source emotion calculation system for generating an emotion curve based on video content of claim 1, wherein: in step 4, the 128-dimensional feature vector Feature_visual and the 128-dimensional feature vector Feature_auditory are first normalized to unify their scale and distribution; the normalized Feature_visual and Feature_auditory are then concatenated into a unified 256-dimensional input feature vector Feature; this feature vector is fed into a 2-layer fully connected network, whose input is 256-dimensional and whose first-layer output is 64-dimensional; the output is a 2-dimensional vector representing the arousal value and the valence value, i.e. [Arousal, Valence]; the fully connected network uses the ReLU activation function.
3. The multi-source emotion calculation system for generating an emotion curve based on video content of claim 1, wherein: correlation analysis is performed between the generated emotion curve and the emotion sequence computed from viewer annotations, and the Spearman rank correlation coefficient between them is calculated to quantitatively measure the effectiveness of the emotion curve.
CN202110533941.1A 2021-05-17 2021-05-17 Multi-source emotion calculation system and method for generating emotion curve based on video content Active CN113254713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110533941.1A CN113254713B (en) 2021-05-17 2021-05-17 Multi-source emotion calculation system and method for generating emotion curve based on video content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110533941.1A CN113254713B (en) 2021-05-17 2021-05-17 Multi-source emotion calculation system and method for generating emotion curve based on video content

Publications (2)

Publication Number Publication Date
CN113254713A CN113254713A (en) 2021-08-13
CN113254713B (en) 2022-05-24

Family

ID=77183212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110533941.1A Active CN113254713B (en) 2021-05-17 2021-05-17 Multi-source emotion calculation system and method for generating emotion curve based on video content

Country Status (1)

Country Link
CN (1) CN113254713B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116456262B (en) * 2023-03-30 2024-01-23 青岛城市轨道交通科技有限公司 Dual-channel audio generation method based on multi-modal sensing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN110852215A (en) * 2019-10-30 2020-02-28 国网江苏省电力有限公司电力科学研究院 Multi-mode emotion recognition method and system and storage medium
CN111382677A (en) * 2020-02-25 2020-07-07 华南理工大学 Human behavior identification method and system based on 3D attention residual error model
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102660124B1 (en) * 2018-03-08 2024-04-23 한국전자통신연구원 Method for generating data for learning emotion in video, method for determining emotion in video, and apparatus using the methods

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109508375A (en) * 2018-11-19 2019-03-22 重庆邮电大学 A kind of social affective classification method based on multi-modal fusion
CN110852215A (en) * 2019-10-30 2020-02-28 国网江苏省电力有限公司电力科学研究院 Multi-mode emotion recognition method and system and storage medium
CN111382677A (en) * 2020-02-25 2020-07-07 华南理工大学 Human behavior identification method and system based on 3D attention residual error model
CN112766172A (en) * 2021-01-21 2021-05-07 北京师范大学 Face continuous expression recognition method based on time sequence attention mechanism

Also Published As

Publication number Publication date
CN113254713A (en) 2021-08-13

Similar Documents

Publication Publication Date Title
US11281945B1 (en) Multimodal dimensional emotion recognition method
CN109886225B (en) Image gesture action online detection and recognition method based on deep learning
CN110909658A (en) Method for recognizing human body behaviors in video based on double-current convolutional network
CN109635676B (en) Method for positioning sound source from video
CN105787458A (en) Infrared behavior identification method based on adaptive fusion of artificial design feature and depth learning feature
CN112183240B (en) Double-current convolution behavior identification method based on 3D time stream and parallel space stream
WO2022199215A1 (en) Crowd-information-fused speech emotion recognition method and system
CN108921032B (en) Novel video semantic extraction method based on deep learning model
CN113591770A (en) Multimode fusion obstacle detection method and device based on artificial intelligence blind guiding
WO2022262098A1 (en) Video emotion semantic analysis method based on graph neural network
CN110929762B (en) Limb language detection and behavior analysis method and system based on deep learning
CN110688927A (en) Video action detection method based on time sequence convolution modeling
CN113254713B (en) Multi-source emotion calculation system and method for generating emotion curve based on video content
CN111625661A (en) Audio and video segment classification method and device
Bulzomi et al. End-to-end neuromorphic lip-reading
Jin et al. Speech separation and emotion recognition for multi-speaker scenarios
CN117409121A (en) Fine granularity emotion control speaker face video generation method, system, equipment and medium based on audio frequency and single image driving
CN116167015A (en) Dimension emotion analysis method based on joint cross attention mechanism
CN116758451A (en) Audio-visual emotion recognition method and system based on multi-scale and global cross attention
CN113269068B (en) Gesture recognition method based on multi-modal feature adjustment and embedded representation enhancement
Zhang et al. Modeling temporal information using discrete fourier transform for recognizing emotions in user-generated videos
CN114329070A (en) Video feature extraction method and device, computer equipment and storage medium
WO2021147084A1 (en) Systems and methods for emotion recognition in user-generated video(ugv)
CN112183727A (en) Countermeasure generation network model, and shot effect rendering method and system based on countermeasure generation network model
KR20210035535A (en) Method of learning brain connectivity and system threrfor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant