CN111143615A

CN111143615A - Short video emotion classification recognition device

Info

Publication number: CN111143615A
Application number: CN201911293473.4A
Authority: CN
Inventors: 陈实; 余米; 王禹溪; 鲁雨佳; 杨昌源; 马春阳
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2019-12-12
Filing date: 2019-12-12
Publication date: 2020-05-12
Anticipated expiration: 2039-12-12
Also published as: CN111143615B

Abstract

The invention discloses a device for identifying short video emotion categories, which comprises: (1) acquiring a target short video to be identified, dividing the target short video into a plurality of shot segments, extracting the frame picture characteristics of each shot segment, and extracting the shot characteristics and the dynamic characteristics of the target short video; (2) calling the emotion valence model to calculate the input frame picture characteristics and outputting the emotion valence value of the target short video; (3) calling the emotional excitation model to calculate the input combined characteristics consisting of the frame picture characteristics, the lens characteristics and the dynamic characteristics, and outputting an emotional excitation value of the target short video; (4) and calculating Euclidean distances between the V-A emotion space constructed by the emotion effective value and the emotion incentive value and the coordinate centers of all emotion types, and determining the emotion types of the target short video according to the Euclidean distances. The identification device can quickly and accurately identify the emotion type of the short video.

Description

Short video emotion classification recognition device

Technical Field

The invention belongs to the technical field of video emotion recognition, and particularly relates to a recognition device for short video emotion categories.

Background

In the internet information era of recent years, videos replace ways of characters and pictures in a large number due to the characteristic of high information transmission efficiency, and are important forms of information publishing and social sharing. Among the numerous forms of information distribution, there are studies that have shown that the completion rate and interaction rate of advertisements can be significantly improved when using the form of short video. The problem of how to produce short videos with excellent promotional effects has been solved by short video producers in the past with accumulated subjective experience, and such work is very complicated. Research shows that in the short video production process, effective identification of short video emotion categories can provide guidance on expression effects, so that short videos can achieve the advertising effect expected by producers. The emotion type identification of the short video is as follows: and identifying the emotion category of the short video based on the content of the short video.

Due to the short video time, the content is generally not dramatic, the content types are complex and various, and the emotion expressed by the short video is determined by the picture characteristics and the video characteristics. The existing video emotion category identification methods are generally two, namely a traditional machine learning method which uses feature extraction and then trains a model; and the other method is to identify the target video emotion by using a complex framework such as a convolutional neural network, both the two methods are end-to-end emotion type identification on the target video, and a very large amount of training data sets and a relatively long training time are required for achieving a certain accuracy of identification.

In the psychological research, the emotion categories can be mapped to two emotion dimensions, namely emotion Valence and emotion incentive Arousal, wherein a higher value represents higher emotional pleasure, a higher Arousal value represents higher emotional intensity, and different emotion categories are distributed in different areas on a V-A (Valence-Arousal) graph. The various features of the video contribute differently to the two dimensions of emotion, which is ignored by the end-to-end emotion recognition method.

After the objects of emotion classification from the V-A space are determined, what features and what methods are used to calculate the emotion excitation value and emotion Valence Arousal of a video more accurately respectively becomes the next problem to be solved. Firstly, in the aspect of feature selection, in view of some characteristics of a short video, selection is mainly performed on a non-semantic level of the video, the emotional Valence value is more influenced by color information of a frame picture, and the emotional excitation Arousal is more influenced by dynamic information of the frame picture; secondly, in a calculation mode, in order to draw the distribution situation in the V-A space more finely, Valence and Arousal are quantized to a floating point number one-dimensional space from-1 to 1 in most researches, so that the initial emotion classification problem can be converted into two regression calculation problems, and the regression calculation has more available methods, such as a multiple linear regression method and a gradient lifting tree method, and even some regression methods of a neural network can be used.

Disclosure of Invention

The invention aims to provide a short video emotion classification recognition device which can quickly and accurately recognize emotion classifications of short videos.

The technical scheme of the invention is as follows:

an apparatus for identifying short video emotion classes, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the memory stores a set of coordinate centers for emotion class based on a multiple linear regression model for constructing emotion valence and emotion excitation models, and the processor executes the computer program to perform the following steps:

(1) acquiring a target short video to be identified, dividing the target short video into a plurality of shot segments, extracting the frame picture characteristics of each shot segment, and extracting the shot characteristics and the dynamic characteristics of the target short video;

(2) calling the emotion valence model to calculate the input frame picture characteristics and outputting the emotion valence value of the target short video;

(3) calling the emotional excitation model to calculate the input combined characteristics consisting of the frame picture characteristics, the lens characteristics and the dynamic characteristics, and outputting an emotional excitation value of the target short video;

(4) and calculating Euclidean distances between the V-A emotion space constructed by the emotion effective value and the emotion incentive value and the coordinate centers of all emotion types, and determining the emotion types of the target short video according to the Euclidean distances.

In the short video emotion classification recognition device, the process of dividing the target short video is as follows:

and calculating the difference degree between two frames of pictures by adopting a dHASH algorithm in a perceptual hash algorithm, and dividing the target short video into a plurality of shot segments according to the difference degree and the size of a set threshold value.

In the short video emotion category identification device, the extraction process of the frame picture features of the shot section is as follows:

extracting sampling frames of the shot segments at intervals of equal frame number, and extracting picture characteristic information of the sampling frames, wherein the picture characteristic information comprises color richness, color cold and warm degree, color weight, color liveness, color softness, dark color proportion, light color proportion, color saturation, color energy, color variance characteristics related to colors, and contrast, homogeneity and energy characteristics of a gray level co-occurrence matrix related to textures;

and averaging the picture characteristics of all sampling frames corresponding to the lens segment, and taking the average as the frame picture characteristic of the lens segment.

In the short video emotion category recognition device, the process of extracting shot features of the target short video is as follows:

and extracting the shot length of the shot section, calculating the average shot length and the shot switching frequency of the target short video according to the shot lengths of all the shot sections corresponding to the target short video, and taking the average shot length and the shot switching frequency as shot characteristics.

In the short video emotion classification recognition device, the extraction process of the dynamic features of the target short video is as follows:

extracting adjacent sampling frames of the target short video, calculating a picture characteristic difference value of the adjacent sampling frames to obtain a dynamic characteristic between every two adjacent sampling frames, and taking an average value of the dynamic characteristics between all the adjacent sampling frames as a first dynamic characteristic of the target short video;

extracting sampling frames of the shot section at intervals of equidistant frame numbers, calculating dynamic characteristics between the front sampling frame and the rear sampling frame by adopting an optical flow method and a visual excitation algorithm, and taking an average value of the dynamic characteristics between the front sampling frame and the rear sampling frame as a second dynamic characteristic of the target short video;

the first dynamic feature and the second dynamic feature constitute a dynamic feature of the target short video.

In the identification device for short video emotion categories, the construction process of the coordinate center of each emotion category is as follows:

acquiring the emotion category, emotion effectiveness value and emotion incentive value of the short video;

and screening all the short videos under the emotion category aiming at each type of emotion category, respectively calculating the average value of the emotion effectiveness values and the average value of the emotion incentive values of the videos of the emotion category according to the emotion effectiveness values and the emotion incentive values of the short videos, and taking the average values as the coordinate center of the emotion category in a V-A space formed by the emotion effectiveness values and the emotion incentive values.

In the apparatus for identifying emotion types of short videos, the determining an emotion type of the target short video according to the euclidean distance includes:

and taking the reciprocal of the Euclidean distance as the matching degree, taking the ratio of the matching degree corresponding to each emotion category to the sum of all the matching degrees as the probability of the target short video under the emotion category, and screening the emotion category with the maximum probability to determine the emotion category of the target short video.

In the short video emotion category identification device, the short video emotion category, emotion validity value and emotion excitation value acquisition process comprises the following steps:

and collecting short videos corresponding to the target emotion categories, carrying out crowdsourcing and labeling on the collected short videos, filling an emotion questionnaire after watching the short videos by a volunteer, and counting the emotion effectiveness value, the emotion incentive value and the emotion category of each short video through the questionnaire.

In the short video emotion category identification device, the construction method of the emotion valence model comprises the following steps:

the method comprises the steps of taking frame picture characteristics of shot segments and emotional effectiveness values of short videos where the shot segments are located as a sample, forming a sample set of an emotional effectiveness model, constructing a linear regression model according to the frame picture characteristics and the emotional effectiveness values, training the linear regression model by using the sample set of the emotional effectiveness model and an XGboost algorithm, and determining model parameters to obtain the emotional effectiveness model.

In the short video emotion category identification device, the construction method of the emotion excitation model comprises the following steps:

the method comprises the steps of taking the frame picture characteristics of a shot segment, the shot characteristics and the dynamic characteristics of a short video and the emotional incentive value of the short video as a sample to form a sample set of an emotional incentive model, constructing a linear regression model according to the frame picture characteristics, the shot characteristics and the dynamic characteristics of the short video and the emotional incentive value, training the linear regression model by using the sample set of the emotional incentive model and adopting an XGboost algorithm, and determining model parameters to obtain the emotional incentive model.

Compared with the prior art, the invention has the beneficial effects that at least:

the method carries out the identification of the short video emotion category by substituting the emotion valence model and the emotion exciting model obtained by pre-training and the set containing the coordinate centers of all emotion categories, thereby greatly improving the identification rate and accuracy.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of a short video emotion classification recognition method provided by the present invention;

FIG. 2 is a process diagram for obtaining a Valence model provided in the present invention;

FIG. 3 is a process diagram for obtaining an Arousal model provided by the present invention;

FIG. 4 is a diagram of a process for identifying emotion classes in a target video according to the present invention;

FIG. 5 is a schematic diagram of V-A coordinate emotion category matching provided by the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating the scope of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.

As shown in fig. 1, fig. 2, fig. 3, fig. 4, and fig. 5, the method for identifying a short video emotion category according to the present invention includes:

s1, obtaining the short video and the emotion type, emotion Valence value and emotion incentive Arousal value of the short video.

After selecting a specific category requiring emotion category identification, short videos under the category are collected from shopping websites (such as Taobao). After the short videos are obtained, short video marking is carried out on each short video in a crowdsourcing marking mode, a PANAS emotion scale questionnaire can be filled in the marking mode, so that volunteers can fill in the questionnaire of the short videos after watching the short videos, and each short video needs a plurality of volunteers to mark. Through the questionnaire, the emotional Valence value, the emotional incentive Arousal value and the emotional classification of each short video are calculated.

PANAS (Positive and Negative influence schedule) is a psychometric scale proposed by David Watson, Lee AnnaClark and Auke Telegen in 1988 for assessing the positive and Negative emotions of persons filling out a questionnaire at that time, the content generally consisting of 20 5-point questions.

And S2, respectively constructing a training set and a testing set of the Valence model and the Arousal model according to the short video.

After emotion marking data of the short video are obtained, feature extraction and sample division operations are carried out, and the process is separately explained by sample sets divided into two models.

The emotion Valence value is more influenced by the picture information content, so that on the aspect of selecting the characteristics of the Valence model, the invention starts with the picture frame information of the short video and establishes the model on the basis that the same short video is approximately the same on the emotion expression pitch, and the construction process of the sample set of the model is as follows:

(a) setting a threshold value as a dividing rule of picture frames between the shots, dividing the short video into a plurality of shot segments, wherein a dHASH algorithm in a perceptual hash algorithm is adopted to calculate the difference degree between two frames of pictures, and dividing the target short video into a plurality of shot segments according to the difference degree and the size of the set threshold value;

(b) extracting sampling frames of the shot segments at intervals of equidistant frame numbers, and extracting picture characteristic information of the sampling frames;

(c) the extracted picture characteristic information comprises characteristics such as color richness, color cold and warm degree, color lightness and heaviness, color liveness, color softness, dark color proportion, brightness color proportion, color saturation, color energy, color variance and the like related to colors, and characteristics such as contrast, homogeneity, energy and the like of a gray level co-occurrence matrix related to textures;

(d) averaging the characteristic information of each picture of a plurality of sampling frames in the shot to serve as the characteristic information of one shot section;

(e) constructing a sample by combining the characteristic information of the shot section and the value of the short video;

the statistical method of the gray level co-occurrence matrix is a comprehensive texture analysis method provided on the premise that the spatial distribution relation among pixels in an image includes image texture information. The gray level co-occurrence matrix is defined as the probability that the gray level value is at a point away from a fixed position from a pixel point with a gray level of a certain value, that is, all estimated values can be expressed in the form of a matrix, and the matrix is called as the gray level co-occurrence matrix. Due to the large data volume of the gray level co-occurrence matrix, the gray level co-occurrence matrix is generally not directly used as a feature for distinguishing textures, but some statistics constructed based on the gray level co-occurrence matrix are used as texture classification features, such as contrast, homogeneity, energy and the like.

The above operation is performed on each video in the short video library, so that a sample set of the value model can be obtained, and the sample set is divided into a training set and a test set according to a certain proportion.

The emotional excitation Arousal value is related to the content of picture information and also related to dynamic information among the picture information, so that on the aspect of Arousal model feature selection, the method starts from inter-frame dynamic information, lens information, dynamic features extracted by an optical flow method and a visual excitation algorithm in addition to picture frame information, and the construction process of a sample set of the model is as follows:

(a) extracting sampling frames of the shot segments at intervals of equidistant frame numbers, and extracting picture characteristic information of the sampling frames;

(b) the extracted picture characteristic information is the same as the step (b) of the Valence model sample set construction process and is not repeated;

(c) taking an average value of each picture characteristic information of a plurality of sampling frames in the short video as the picture characteristic information of the short video;

(d) taking difference to picture characteristic information of adjacent sampling frames in the short video to obtain absolute value, obtaining dynamic characteristic information between each sampling frame, and taking average value as dynamic characteristic information of the short video;

(e) extracting sampling frames of the short video from the frames at equal intervals of the short video, calculating dynamic characteristic information of adjacent sampling frames by using an optical flow method and a visual excitation algorithm, and finally taking an average value as the supplement of the dynamic characteristic information in the step (d);

(f) extracting the shot length information of each shot, and calculating shot characteristic information such as the average shot length, shot switching frequency and the like of a short video where the shot is located;

(g) ' constructing feature information of a short video together with its Arousal value into one sample;

the optical flow method utilizes the change of pixels in an image sequence on a time domain and the correlation between adjacent frames, and calculates and obtains the motion information of an object between the adjacent frames according to the corresponding relation between the previous frame and the current frame.

The vision excitement algorithm calculates the vision difference of two frames by comparing the difference of the two frames in the LUV color space, wherein the LUV color space aims to establish a color space unified with vision, the algorithm used in the invention converts the RGB color space of the two frames into the LUV color space, then calculates the difference degree of the pixel points in the same space of the two frames in the L, U, V space, and calculates the difference degree in the square difference mode.

The shot characteristic information is the length of each shot and the switching frequency of the shot after the shot is divided by the short video.

The above operation is performed on each video in the short video library, so that a sample set of the Arousal model can be obtained, and the sample set is divided into a training set and a testing set according to a certain proportion.

And S3, training the multiple regression model by using the training set and the test set.

And training the Valence model and the Arousal model by using XGboost, wherein a training model target parameter in the training parameters in the XGboost is set as a training regression model.

XGboost is one of Boosting algorithms, and the idea of Boosting algorithm is to integrate many weak classifiers together to form a strong classifier. Because the XGboost is a lifting tree model, a plurality of tree models are integrated together to form a strong classifier. The tree model used is the CART regression tree model. The XGboost is improved on the basis of GBDT, so that the XGboost is stronger and is suitable for a larger range. The XGboost is used for training the multiple linear regression model, so that the method can obtain a better regression model more quickly.

And S4, calculating the coordinate center of each emotion type according to the value and the Arousal value of the short video and the emotion type classification.

And screening all the short videos under the category for each emotion category of the short video data, and respectively calculating a value average value of the emotion effectiveness value and an Arousal average value of the emotion excitation value of the emotion category video according to the value and the Arousal value of the short video to serve as the central point coordinate of the emotion category in the V-A space.

And S5, calculating the value and the Arousal value of the target short video by using the pre-trained value model and the Arousal model.

The operation process of calculating the value of the short video to be recognized by using the value model is as follows:

s5-1, dividing the short video to be recognized into a plurality of shot segments;

s5-2, extracting the picture characteristic information of each shot;

s5-3, calculating the Valence value of the lens according to the feature information of the lens by using a pre-trained Valence model;

and S5-4, calculating the value of the short video in a weighted summation mode by taking the length ratio of each shot as a weight.

The operation process of calculating the Arousal value of the short video to be recognized by using the Arousal model is as follows:

s5-1', extracting picture characteristic information, lens characteristic information, dynamic characteristic information of the short video and supplementary dynamic characteristic information calculated by an optical flow method and a visual excitement algorithm;

and S5-2', calculating the Arousal value of the short video according to the feature information of the short video by using the pre-trained Arousal model.

And S6, calculating the matching degree of each emotion category by using the emotion category center point according to the calculated value and Arousal value.

After knowing the value and the value of Arousal of the short video to be recognized, the operation of calculating the matching of each emotion category is as follows:

s6-1, calculating the distance between the short video to be recognized and the center point of each emotion category in the V-A space, wherein the Euclidean distance, namely the linear distance on a two-dimensional plane, is used in the distance calculation method;

s6-2, calculating the reciprocal of each distance result to be used as the similarity between the short video to be recognized and each emotion category;

s6-3, the percentage calculation mode of the matching degree between the short video to be recognized and each emotion category is as follows: the degree of closeness is divided by the sum of the degrees of closeness of all categories.

The percentage value of the matching degree is the probability of the short video under each emotion category, and the emotion category with the maximum probability is the emotion category of the short video.

The embodiment further provides a device for identifying short video emotion categories, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and is characterized in that a set including coordinate centers of each emotion category and an emotion valence model and an emotion excitation model constructed based on a multiple linear regression model are stored in the memory, and the methods for constructing the emotion valence model, the emotion excitation model, and the emotion excitation model including the emotion category coordinate centers are the same as those in the method for identifying short video emotion categories, and are not described herein again.

The processor, when executing the computer program, implements the steps of:

By the identification method and the identification device, the short video emotion category can be accurately identified, and the requirement of effective prediction of video expression emotion in the short video production process is met.

The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims

1. An apparatus for identifying short video emotion classes, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the memory stores a set of coordinate centers for emotion classes, wherein the set of coordinate centers includes an emotion valence model and an emotion excitation model constructed based on a multiple linear regression model, and the processor implements the following steps when executing the computer program:

2. The apparatus for identifying emotion category in short video according to claim 1, wherein the process of dividing the target short video is:

3. The apparatus for identifying short video emotion category as recited in claim 1, wherein the extraction process of the frame picture features of the shot is as follows:

4. The apparatus for identifying short video emotion category according to claim 1, wherein the extraction process of shot features of the target short video is as follows:

5. The apparatus for identifying emotion category in short video according to claim 1, wherein the extraction process of dynamic features of target short video is as follows:

6. The apparatus for identifying short video emotion categories as recited in claim 1, wherein the construction process of the coordinate center of each emotion category is as follows:

7. The apparatus for identifying emotion classification of short video in claim 1 or 6, wherein said determining emotion classification of the target short video according to Euclidean distance comprises:

8. The apparatus for identifying emotion category in short video according to claim 6, wherein the emotion category, emotion valence value and emotion excitation value of short video are acquired by:

9. The apparatus for identifying short video emotion classification as claimed in claim 1, wherein the emotion valence model is constructed by:

10. The apparatus for identifying short video emotion categories as recited in claim 1, wherein said emotion activation model is constructed by: