CN111860237B - Video emotion fragment identification method and device

Video emotion fragment identification method and device

Info

Publication number
CN111860237B
Authority
CN
China
Prior art keywords
emotion
video
analyzed
bullet screen
fragments
Prior art date
Legal status
Active
Application number
CN202010645824.XA
Other languages
Chinese (zh)
Other versions
CN111860237A
Inventor
陈恩红
徐童
曹卫
张琨
吕广弈
何明
武晗
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN202010645824.XA
Publication of CN111860237A
Application granted
Publication of CN111860237B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes (G Physics > G06 Computing; Calculating or Counting > G06V Image or Video Recognition or Understanding > G06V 20/00 Scenes; Scene-specific elements > G06V 20/40 in video content)
    • G06F 16/353: Information retrieval of unstructured textual data; clustering; classification into predefined classes (G06F Electric Digital Data Processing > G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor)
    • G06F 18/24: Pattern recognition; analysing; classification techniques
    • G06F 40/242: Handling natural language data; natural language analysis; lexical tools; dictionaries
    • G06V 20/40: Scenes; scene-specific elements in video content


Abstract

The invention discloses a method for identifying emotion fragments in a video, which comprises the following steps: determining the bullet-screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating a segment emotion vector and an emotion entropy for each video segment to be analyzed according to the bullet-screen emotion labels within that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and the emotion entropies. In this identification method, the video to be analyzed is divided into several video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified from the segment emotion vectors obtained from the bullet screens and from the emotion entropies. The identification cycle of emotion fragments is thereby shortened, which solves the problem that the long annotation time of manually labeled emotion tags makes the identification cycle of emotion fragments long.

Description

Video emotion fragment identification method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for identifying video emotion fragments.
Background
With the development of multimedia technology, the volume of multimedia video data has grown explosively and attracted a large number of users. People increasingly watch videos to relieve stress and boredom, and watching videos has become a new way of meeting people's emotional needs. However, there is a contradiction between the huge scale of videos and the limited time of users: audiences often only want to watch some emotional segments of a video rather than the whole video. Therefore, videos need time-synchronized emotion labels (five emotion categories: happiness, surprise, dislike, sadness and fear) so that the emotion fragments in a video can be identified and the personalized emotional needs of audiences can be better met.
The first challenge of this work is that videos lack time-sequenced emotion labels. At present, emotion labels are mainly annotated manually for each frame of a video, and emotion fragments are identified based on the annotated emotion labels. Because manual annotation of emotion labels is time-consuming, the identification cycle of emotion fragments is long.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for identifying video emotion fragments, so as to solve the problem in the prior art that emotion labels are mainly annotated manually for each frame of a video and emotion fragments are identified based on the annotated labels, where the long annotation time of manual labeling leads to a long identification cycle of emotion fragments. The specific scheme is as follows:
a video emotion fragment identification method comprises the following steps:
determining the bullet screen emotion labels of all emotion bullet screens in a video to be analyzed;
segmenting the video to be analyzed to obtain each video segment to be analyzed;
calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
Optionally, the method for determining the bullet screen emotion tag of each emotion bullet screen in the video to be analyzed includes:
acquiring each bullet screen of the video to be analyzed;
screening the bullet screens to obtain all emotion bullet screens;
and determining the bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
Optionally, the method for determining the bullet screen emotion tag of each emotion bullet screen according to the preset neural network model includes:
determining a target semantic representation of each emotion bullet screen, wherein the target semantic representation is obtained by splicing a fine-granularity semantic representation and a coarse-granularity semantic representation of the corresponding emotion bullet screen;
determining visual vector representation of scene image data at each emotion bullet screen generation moment;
and transmitting the target semantic representation and the visual vector representation to the preset neural network model to obtain a bullet screen emotion label corresponding to the emotion bullet screen.
Optionally, in the method, the segmenting the video to be analyzed to obtain each video segment to be analyzed includes:
determining the visual semantics of each frame in the video to be analyzed;
sequentially comparing the visual semantics of the adjacent frames, and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and if so, segmenting the adjacent frames as segmentation points to obtain each video segment to be analyzed.
The above method, optionally, further includes:
acquiring the bullet screen semanteme of the adjacent frames;
and correcting the segmentation points according to the bullet screen semantics.
Optionally, in the method, identifying the emotion fragment in each video fragment to be analyzed according to the fragment emotion vector and the emotion entropy includes:
judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
if yes, determining that the current video segment to be analyzed contains one emotion; or,
if not, judging whether the ratio of the largest component to the second-largest component of the current segment emotion vector is greater than a preset proportion threshold;
if yes, determining that the current video segment to be analyzed contains one emotion, or otherwise, determining that the current video segment to be analyzed contains two emotions.
An apparatus for identifying emotion fragments in a video, comprising:
the determining module is used for determining the barrage emotion tags of all emotion barrages in the video to be analyzed;
the segmentation module is used for segmenting the video to be analyzed to obtain each video segment to be analyzed;
the computing module is used for computing the segment emotion vectors and the emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and the identification module is used for identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
The above apparatus, optionally, the determining module includes:
the acquisition unit is used for acquiring each bullet screen of the video to be analyzed;
the screening unit is used for screening the bullet screens to obtain all emotion bullet screens;
and the label determining unit is used for determining the bullet screen emotion labels of each emotion bullet screen according to the preset neural network model.
The above apparatus, optionally, the dividing module includes:
the semantic determining unit is used for determining the visual semantics of each frame in the video to be analyzed;
the first judgment unit is used for sequentially comparing the visual semantics of the adjacent frames and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and the segmentation unit is used for, if the result is yes, segmenting the video at the adjacent frames serving as segmentation points to obtain each video segment to be analyzed.
The above apparatus, optionally, the identification module includes:
the second judgment unit is used for judging whether the emotion entropy is smaller than a preset emotion entropy threshold value;
the first judging unit is used for judging, if the result is yes, that the current video segment to be analyzed contains one emotion;
the third judging unit is used for judging, if the result is no, whether the ratio of the largest component to the second-largest component of the current segment emotion vector is greater than a preset proportion threshold;
and the second judging unit is used for judging that the current video segment to be analyzed contains one emotion if the ratio is greater than the threshold, or that the current video segment to be analyzed contains two emotions otherwise.
Compared with the prior art, the invention has the following advantages:
the invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining a bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain each video segment to be analyzed; calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed; and identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy. In the identification method, a video to be analyzed is divided into a plurality of video segments to be analyzed, and segment emotion vectors and emotion entropies of the video segments to be analyzed are calculated; and recognizing the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors of the fragments obtained by the barrage and the emotion entropy, so that the recognition period of the emotion fragments is shortened, and the problem of long recognition period of the emotion fragments caused by long marking time of the emotion labels manually identified is solved.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart of a method for identifying video emotion fragments disclosed in an embodiment of the present application;
fig. 2 is a block diagram of a structure of an apparatus for recognizing video emotion fragments according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a method and a device for identifying video emotion fragments, which are applied to the process of identifying the emotion fragments in a video. In the prior art, emotion labels are annotated manually and the emotion fragments in a video are identified based on those labels; because manual annotation is time-consuming, the identification cycle is long. To solve this problem, the embodiment of the invention provides a video emotion fragment identification method. Many video sharing platforms widely support time-synchronized comments known as "bullet screens" (danmaku). These comments are the immediate reactions of the audience while watching, contain rich emotional expressions, and are consistent with the emotional development of the video, so they can be used for video emotion analysis. The identification method therefore analyzes the video to be analyzed based on its bullet screens. The execution flow of the identification method is shown in FIG. 1 and comprises the following steps:
s101, determining a bullet screen emotion label of each emotion bullet screen in a video to be analyzed;
In the embodiment of the invention, each bullet screen of the video to be analyzed is obtained. Because bullet screens are the instant reactions of video users, not all viewers participate in the bullet-screen interaction throughout the whole video, and check-in or sign-in bullet screens unrelated to the video semantics and video emotion often appear; the topics of the bullet screens are therefore loose and their semantic noise is high. Preferably, denoising is performed first.
Further, an emotion label of each emotion barrage is determined according to a preset neural network model, wherein the preset neural network model is a barrage emotion model, the barrage emotion model needs to be trained in advance, and the training process of the barrage emotion model is as follows:
Firstly, a training data set is constructed: a bullet-screen emotion data set C_e with emotion labels is built from the bullet-screen data set C without emotion labels, and the bullet-screen emotion model is trained on C_e. Considering the high cost of manual annotation, the emotion labels of the bullet-screen emotion data set C_e are obtained through a two-stage word-matching method, whose basic idea rests on the fact that emotional expression is very common in bullet screens. Bullet screens contain rich emotional expressions, and bullet screens with explicit emotional expressions can be automatically recognized through a two-stage emotion dictionary matching method. In the first stage, emotion polarity recognition is performed on all video bullet screens with a comprehensive emotion polarity dictionary that integrates a general emotion dictionary and a bullet-screen emotion dictionary, and the bullet screens that contain explicit emotional expressions and whose positive or negative polarity can be recognized are selected. In the second stage, fine-grained emotion recognition (five emotion categories: happiness, surprise, dislike, sadness and fear) is performed with a fine-grained emotion dictionary on the bullet screens whose polarity was obtained in the first stage, and the emotion bullet screens with emotion labels are finally obtained through the two-stage dictionary matching. The bullet-screen data set C and the emotion bullet-screen data set C_e are expressed as follows:
C = {(C_1, T_1, I_1), ..., (C_i, T_i, I_i), ..., (C_N, T_N, I_N)}   (1)
C_e = {(C_1^e, T_1^e, I_1^e, E_1^e), ..., (C_j^e, T_j^e, I_j^e, E_j^e), ..., (C_M^e, T_M^e, I_M^e, E_M^e)}   (2)
wherein any element (C_i, T_i, I_i) of the bullet-screen data set C denotes the bullet screen C_i posted at time T_i together with the scene image data I_i of the corresponding video key frame. Any element (C_j^e, T_j^e, I_j^e, E_j^e) of the emotion bullet-screen data set C_e denotes the bullet screen C_j^e posted at time T_j^e, the scene image data I_j^e of the corresponding video key frame, and the five-class emotion label E_j^e of the bullet screen C_j^e. N and M denote the number of bullet-screen sentences and the number of emotion bullet-screen sentences, respectively.
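For concreteness, the following sketch illustrates the two-stage dictionary matching described above. It is a minimal illustration only: the dictionary entries and helper names (POLARITY_DICT, FINE_GRAINED_DICT) are assumptions, while the five emotion categories (happiness, surprise, dislike, sadness and fear) follow the text.

```python
# A minimal sketch of the two-stage dictionary matching used to build the emotion
# bullet-screen data set C_e. The dictionary entries and helper names below are
# illustrative assumptions; the five emotion categories follow the text.

POLARITY_DICT = {"开心": "pos", "难过": "neg"}                 # assumed general + bullet-screen polarity words
FINE_GRAINED_DICT = {"开心": "happiness", "难过": "sadness"}   # assumed fine-grained emotion words

def two_stage_label(danmaku: str):
    """Return one of the five emotion labels for a bullet screen, or None."""
    # Stage 1: keep only bullet screens containing an explicit positive/negative polarity word.
    if not any(word in danmaku for word in POLARITY_DICT):
        return None
    # Stage 2: map the polarity-bearing bullet screen to one of the five emotion categories.
    for word, emotion in FINE_GRAINED_DICT.items():
        if word in danmaku:
            return emotion
    return None

# Bullet screens without an explicit emotion (e.g. check-in comments) are filtered out.
labeled = [(c, two_stage_label(c)) for c in ["今天好开心", "签到"]]
emotion_danmaku = [(c, e) for c, e in labeled if e is not None]
```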
In the embodiment of the invention, based on the emotion bullet screen data set C e Training the barrage emotion model, wherein the input of the barrage emotion model is an emotion barrage data set C e Of which any one element is
Figure BDA0002572985120000068
Namely:
Figure BDA0002572985120000069
bullet screen text corresponding to moment
Figure BDA00025729851200000610
And emotion tag text
Figure BDA00025729851200000611
And visual data information of the video key frame at that time
Figure BDA00025729851200000612
The characterization process of the input data is as follows: obtaining bullet screen text by using pre-training language model Bert
Figure BDA00025729851200000613
Sentence vector characterization of
Figure BDA00025729851200000614
Sum word vector characterization
Figure BDA00025729851200000615
Barrage emotion label obtained by using pre-training language model Bert
Figure BDA00025729851200000616
Sentence vector characterization of
Figure BDA00025729851200000617
Processing visual image information of video key frame by using existing deep network model VGG
Figure BDA00025729851200000618
Extracting the last convolution layer of the VGG model as the result
Figure BDA00025729851200000619
Vector characterization of
Figure BDA00025729851200000620
The correlation formula is as follows:
Figure BDA00025729851200000621
Figure BDA0002572985120000071
Figure BDA0002572985120000072
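One possible realization of formulas (3)-(5) is sketched below with off-the-shelf models. The concrete model choices (bert-base-chinese, VGG16 pretrained on ImageNet) and the way the final convolutional feature map is flattened into region vectors are assumptions; the text only states that BERT provides the sentence and word vectors and that the last convolutional layer of VGG provides the visual representation.

```python
# One possible realization of formulas (3)-(5): BERT sentence/word vectors for the
# bullet-screen text and its emotion-label text, and the feature map after the last
# convolutional block of VGG for the key-frame image. The model names and the region
# flattening are assumptions.
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()
vgg_conv = models.vgg16(weights="IMAGENET1K_V1").features.eval()  # convolutional part of VGG16

def encode_text(sentence: str):
    """Return (sentence vector s, word vectors W) for a piece of text, as in formulas (3)/(4)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.pooler_output[0], out.last_hidden_state[0]

def encode_frame(path: str):
    """Return region vectors v of the key frame, as in formula (5)."""
    pre = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    img = pre(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = vgg_conv(img)             # (1, 512, 7, 7) after the final convolutional block
    return fmap.flatten(2).squeeze(0).T  # 49 region vectors of dimension 512
```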
Considering that the text semantics of a bullet screen are related to the semantics of the video scene at the corresponding moment, in the embodiment of the invention the scene visual information v_j^e of the bullet screen is fused into the word vectors W_j^e of the bullet-screen text through an attention mechanism, which helps the model focus on the words related to the visual scene of the bullet screen and yields the bullet-screen word vectors Ŵ_j^e with visual attention. The relevant formulas of the attention mechanism are as follows:
M = tanh(W_1 v_j^e + W_2 W_j^e)   (6)
α = softmax(W_3 M)   (7)
Ŵ_j^e = α ⊙ W_j^e   (8)
wherein W_1, W_2 and W_3 are the training parameters of the attention unit, which may be set empirically or according to the specific situation; tanh denotes the activation function of the deep neural network; M is an intermediate quantity; softmax denotes the normalization operation; and α denotes the visual attention weight of each word of the bullet-screen text with respect to the visual information. The visual attention weight α is applied to the word vectors W_j^e of the bullet-screen text to obtain the bullet-screen word vectors Ŵ_j^e with visual attention.
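A compact sketch of the visual attention of formulas (6)-(8) is given below; the hidden sizes are illustrative assumptions, and the module simply scores each word of the bullet screen against the pooled key-frame vector and reweights the word vectors accordingly.

```python
# A sketch of the visual attention of formulas (6)-(8): the key-frame vector scores each
# word of the bullet screen, and the resulting weights reweight the word vectors. The
# hidden sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    def __init__(self, word_dim=768, vis_dim=512, att_dim=256):
        super().__init__()
        self.W1 = nn.Linear(vis_dim, att_dim, bias=False)   # W_1 of formula (6)
        self.W2 = nn.Linear(word_dim, att_dim, bias=False)  # W_2 of formula (6)
        self.W3 = nn.Linear(att_dim, 1, bias=False)         # W_3 of formula (7)

    def forward(self, words, visual):
        # words: (seq_len, word_dim) bullet-screen word vectors; visual: (vis_dim,) pooled key-frame vector
        M = torch.tanh(self.W1(visual).unsqueeze(0) + self.W2(words))  # formula (6)
        alpha = torch.softmax(self.W3(M).squeeze(-1), dim=0)           # formula (7)
        return alpha.unsqueeze(-1) * words                             # formula (8): visually attended words
```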
Considering the sequence information of the words contained in the sentence text, the invention models the bullet-screen word vectors Ŵ_j^e fused with visual information by means of the recurrent neural network BiLSTM and a self-attention mechanism, and obtains the fine-grained semantic representation h_j^e of the bullet screen:
h_j^e = SelfAttention(BiLSTM(Ŵ_j^e))   (9)
As shown in formula (3), the bullet-screen sentence vector s_j^e obtained from the BERT model is the coarse-grained semantic representation of the whole bullet-screen sentence. The coarse-grained representation s_j^e of the whole sentence and the fine-grained representation h_j^e of the bullet screen are spliced with a weight to obtain the target semantic representation r_j^e of the bullet screen, see the following formula:
r_j^e = γ · h_j^e + s_j^e   (10)
wherein γ is a weight adjustment parameter, and the sign '+' denotes the splicing (concatenation) operation of tensors.
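The fine-grained encoding of formulas (9)-(10) can be sketched as follows. The BiLSTM hidden size, the form of the self-attention pooling, and the default value of γ are assumptions; the output concatenates the weighted fine-grained representation with the BERT sentence vector, matching the splicing described above.

```python
# A sketch of formulas (9)-(10): a BiLSTM with self-attention pooling over the visually
# attended word vectors gives the fine-grained representation, which is concatenated
# (with weight gamma) to the BERT sentence vector. Hidden size, pooling form and the
# default gamma are assumptions.
import torch
import torch.nn as nn

class FineGrainedEncoder(nn.Module):
    def __init__(self, word_dim=768, hidden=256, gamma=0.5):
        super().__init__()
        self.bilstm = nn.LSTM(word_dim, hidden, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)   # self-attention scoring
        self.gamma = gamma

    def forward(self, attended_words, sentence_vec):
        # attended_words: (seq_len, word_dim); sentence_vec: (768,) BERT sentence vector
        h, _ = self.bilstm(attended_words.unsqueeze(1))        # (seq_len, 1, 2*hidden)
        h = h.squeeze(1)
        weights = torch.softmax(self.att(h).squeeze(-1), dim=0)
        fine = (weights.unsqueeze(-1) * h).sum(dim=0)          # fine-grained representation, formula (9)
        return torch.cat([self.gamma * fine, sentence_vec])    # target semantic representation, formula (10)
```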
Subsequently, the target semantic representation r_j^e is trained and output through the fully connected layer FC, and the emotion probability P of the bullet screen is obtained:
P(y | C_j^e, I_j^e) = softmax(FC(r_j^e))   (11)
wherein y denotes the emotion category to which the bullet screen belongs, and P(y | C_j^e, I_j^e) denotes the emotion category probability of the bullet screen C_j^e computed from the input (C_j^e, I_j^e). FC is a single-layer fully connected network structure; through the output of the fully connected layer, the emotion probability P of each emotion bullet screen is obtained, and the emotion model is trained by minimizing the following objective function:
L = Σ_j softmax_cross_entropy(E_j^e, P(y | C_j^e, I_j^e))   (12)
wherein E_j^e is the original emotion label of the emotion bullet screen C_j^e, and P(y | C_j^e, I_j^e) is the emotion probability output by the model for the emotion bullet screen C_j^e after training; softmax_cross_entropy is the cross-entropy loss function, which computes the cross-entropy loss between the original emotion label E_j^e of each bullet screen and its emotion prediction result. To minimize the objective function, an Adam optimizer is adopted to iteratively update each parameter of the model (implemented with TensorFlow automatic differentiation), thereby training the bullet-screen emotion recognition model.
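A minimal training-step sketch for formulas (11)-(12) follows. The feature dimension, batch handling and learning rate are assumptions, and although the text mentions TensorFlow automatic differentiation with Adam, an equivalent PyTorch step is shown here for consistency with the other sketches.

```python
# A minimal training-step sketch for formulas (11)-(12): a single fully connected layer maps
# the target semantic representation to five emotion logits, and the cross-entropy against the
# dictionary-derived labels is minimized with Adam. The feature dimension and learning rate
# are assumptions; PyTorch autograd stands in for the TensorFlow automatic differentiation
# mentioned in the text.
import torch
import torch.nn as nn

fc = nn.Linear(2 * 256 + 768, 5)                   # FC over the target semantic representation
optimizer = torch.optim.Adam(fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()                  # softmax cross-entropy of formula (12)

def train_step(features, labels):
    """features: (batch, dim) target semantic representations; labels: (batch,) in {0..4}."""
    optimizer.zero_grad()
    logits = fc(features)                          # formula (11) before the softmax
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```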
Finally, the trained bullet-screen emotion recognition model performs emotion prediction on any bullet screen C_k: (C_k, T_k, I_k) in the bullet-screen data set C and outputs P(y | C_k, I_k), from which the emotion probability vector e_k is further obtained:
e_k = softmax(P(y | C_k, I_k))   (13)
e_k = (e_k^1, e_k^2, e_k^3, e_k^4, e_k^5)   (14)
wherein P(y | C_k, I_k) is the model prediction result for the bullet screen C_k, and softmax computes the proportion of each class among the multiple classes so that the proportions sum to 1. In the invention, softmax further processes the prediction result of the bullet screen C_k to obtain the emotion probability vector e_k of the bullet screen C_k. It is a five-dimensional emotion vector; its value in each dimension can be regarded as the emotion semantic distribution of the bullet screen C_k in that dimension, measures the emotion semantic value of the bullet screen C_k in each emotion dimension, and also represents the emotion label of the bullet screen C_k.
S102, segmenting the video to be analyzed to obtain video segments to be analyzed;
In the embodiment of the invention, because bullet-screen comments are the instant responses of the audience, the emotions they contain are always momentary. Therefore, video emotion analysis based on one continuous period is most suitable. In fact, a video contains many relatively independent scene segments; the content of these segments is usually relatively independent and has its own topics, which evolve with the development of the video scenes. That is, a change of the video plot is generally consistent with a switch of the video scene, so changes of the video scene can be used as the basis for segmenting video clips. Compared with conventional equal-length video clip segmentation, segmentation from the viewpoint of scene switching is more suitable for this application.
First, object recognition is performed on the visual data information I_i of each video key frame by using an object recognition method based on bottom-up and top-down attention, and the visual words O_i of each frame are obtained. The visual words O_i can be regarded as the visual semantics of the frame I_i, describing the visual scene of the frame I_i. When the visual-word text of two adjacent frames changes remarkably, it means that the described scene has changed, and that moment can be used as a video segmentation point.
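A sketch of this scene-switch segmentation is given below: each key frame is reduced to its set of visual words, and a large change between adjacent frames is taken as a candidate segmentation point. The Jaccard-distance criterion and the threshold value are assumptions standing in for the unspecified "difference degree" of the text.

```python
# A sketch of the scene-switch segmentation: each key frame is reduced to its set of visual
# words, and a large change between adjacent frames marks a candidate segmentation point. The
# Jaccard-distance criterion and the threshold stand in for the unspecified "difference degree".
def segment_by_scene(visual_words, diff_threshold=0.7):
    """visual_words: list of sets of object labels, one per key frame; returns split indices."""
    splits = []
    for i in range(1, len(visual_words)):
        a, b = visual_words[i - 1], visual_words[i]
        union = a | b
        diff = (1 - len(a & b) / len(union)) if union else 0.0
        if diff > diff_threshold:          # adjacent frames describe clearly different scenes
            splits.append(i)
    return splits

splits = segment_by_scene([{"man", "car"}, {"man", "car"}, {"beach", "sea"}])  # -> [2]
```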
Furthermore, in order to improve the segmentation accuracy, the invention also corrects the segmentation points from the perspective of video semantics. This operation is implemented by means of the bullet screens, which can reflect the video semantics: a segmentation point serves as a plot transition point of the video, so the bullet-screen semantics at that moment are relatively loose; if the bullet-screen semantics at that moment are concentrated and consistent, the segmentation point is corrected. That is, for any video segment S_i obtained in the previous stage, the pairwise cosine similarities of all its bullet screens are computed to construct the semantic similarity matrix of segment S_i, from which the average bullet-screen semantic similarity of the video segment S_i is obtained. The average bullet-screen semantic similarity of each video segment S_i is examined, video segments with a very high average semantic similarity are discarded (judged with an empirical threshold determined through practical experiments), and a set of video segments to be analyzed {s_p} with relatively independent and natural plots is finally obtained.
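The correction step can be sketched as follows: the pairwise cosine similarities of a segment's bullet-screen sentence vectors form its similarity matrix, the mean of the upper triangle gives the segment's average semantic similarity, and segments whose average exceeds an empirical threshold are discarded. The key name danmaku_vectors and the threshold value are assumptions.

```python
# A sketch of the correction step: cosine-similarity matrix of a segment's bullet-screen
# sentence vectors, upper-triangle average, and a filter that discards segments whose bullet
# screens are semantically too concentrated. Key name and threshold are assumptions.
import numpy as np

def average_similarity(sentence_vectors):
    X = np.asarray(sentence_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                                   # pairwise cosine similarity matrix
    upper = sim[np.triu_indices(len(X), k=1)]       # pairwise values above the diagonal
    return float(upper.mean()) if upper.size else 0.0

def filter_segments(segments, threshold=0.9):
    """Discard candidate segments whose bullet screens are semantically too concentrated."""
    return [s for s in segments if average_similarity(s["danmaku_vectors"]) <= threshold]
```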
S103, calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
In the embodiment of the invention, the video to be analyzed contains complex multi-modal content and its emotion is complex; the emotion bullet screens of the video audience can be regarded as an indirect reflection of the video emotion and are therefore suitable for video emotion analysis. For any segment s_i of the set of video segments to be analyzed {s_p}, the set of emotion bullet screens of segment s_i is {C_1^e, ..., C_u^e}, the set of emotion vectors corresponding to these emotion bullet screens is {e_1, ..., e_u}, and the emotion vector of each bullet screen C_k^e is e_k. Summing the emotion vectors of all bullet screens of the segment dimension by dimension yields the segment emotion vector e_{s_i} of segment s_i, as shown in the following formula:
e_{s_i} = Σ_{k=1}^{u} e_k   (15)
wherein u is the number of emotion bullet screens in segment s_i, and the vector e_{s_i} is the segment emotion vector of segment s_i, representing the emotion label of segment s_i in each emotion dimension.
In information theory, entropy is the quantity that describes the disorder of a system: the greater the entropy, the more disordered the system and the less information it carries; the smaller the entropy, the more ordered the system and the more information it carries. In the segment emotion vector e_{s_i}, the concentration degree of the distribution of the emotion semantic information over the emotion dimensions can likewise be measured by entropy and then used to judge the emotional tendency of the video segment to be analyzed s_i (also referred to in the invention as segment s_i). According to the entropy formula, the emotion entropy of segment s_i is obtained as shown in the following formula, where p_d is the proportion of the d-th dimension of e_{s_i}:
H(e_{s_i}) = - Σ_{d=1}^{5} p_d log p_d,   p_d = e_{s_i}^d / Σ_{d'=1}^{5} e_{s_i}^{d'}   (16)
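Formulas (15)-(16) reduce to a few lines of code, sketched below; normalizing the summed vector before taking the entropy is the assumption made explicit in formula (16).

```python
# Formulas (15)-(16) in a few lines: the segment emotion vector is the per-dimension sum of
# the emotion probability vectors of the segment's bullet screens, and the emotion entropy
# measures how concentrated that distribution is (the vector is normalized before the
# entropy is taken).
import numpy as np

def segment_emotion(bullet_vectors):
    """bullet_vectors: (u, 5) emotion probability vectors of the segment's emotion bullet screens."""
    e_s = np.asarray(bullet_vectors, dtype=float).sum(axis=0)   # segment emotion vector, formula (15)
    p = e_s / e_s.sum()                                         # normalize to a distribution
    entropy = float(-np.sum(p * np.log(p + 1e-12)))             # emotion entropy, formula (16)
    return e_s, entropy
```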
S104, identifying the emotion fragments in the video segments to be analyzed according to the segment emotion vectors and the emotion entropies.
In the embodiment of the invention, because video emotion is complex, the emotional tendencies of emotion video segments do not always fall into a single emotion category, and video segments with complex emotion are common. The present application aims to find video segments containing no more than two distinct emotional tendencies: one kind is a video emotion segment with only one obvious emotional tendency; the other is a video emotion segment with two obvious emotional tendencies.
The processing of a video emotion segment with a single emotional tendency is as follows: when the emotion entropy H(e_{s_i}) of the video segment to be analyzed s_i is very small, i.e., smaller than the emotion entropy threshold H(e)_threshold, the emotion semantics of all dimensions of the emotion bullet screens of segment s_i tend to be consistent, which means that the video segment to be analyzed s_i contains only one obvious emotional tendency:
H(e_{s_i}) < H(e)_threshold   (17)
On this basis, when the emotion entropy H(e_{s_i}) of segment s_i is only slightly above the threshold H(e)_threshold, the emotional tendency of segment s_i is not necessarily unique, and a further judgment is needed: in the segment emotion vector e_{s_i} of segment s_i, when the largest component max1(e_{s_i}) is far greater than the second-largest component max2(e_{s_i}), segment s_i still has only one emotional tendency, namely the emotion category of the dimension to which the largest component max1(e_{s_i}) belongs, as shown in the following formula:
max1(e_{s_i}) / max2(e_{s_i}) > ratio_threshold   (18)
The processing of a video emotion segment to be analyzed containing two emotional tendencies is as follows: as shown in formula (18), when the emotion entropy H(e_{s_i}) of the video emotion segment to be analyzed s_i is only slightly above the threshold H(e)_threshold, the emotional tendency of the video emotion segment to be analyzed s_i is not necessarily unique. In the emotion vector e_{s_i} of the video emotion segment to be analyzed s_i, when the difference between the largest component max1(e_{s_i}) and the second-largest component max2(e_{s_i}) is small, the emotion categories of the dimensions to which these two components belong can both be regarded as emotional tendencies of the video emotion segment to be analyzed s_i, i.e., the video emotion segment to be analyzed s_i has two main emotional tendencies:
max1(e_{s_i}) / max2(e_{s_i}) ≤ ratio_threshold   (19)
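The decision rules of formulas (17)-(19) can be sketched as follows. The dimension order of the emotion vector and the two threshold values are assumptions; the text leaves the thresholds to be set empirically.

```python
# A sketch of the decision rules of formulas (17)-(19): entropy below the threshold means one
# emotion; otherwise the segment has one emotion if the largest component dominates the
# second-largest, and two emotions otherwise. The dimension order and both threshold values
# are assumptions.
import numpy as np

EMOTIONS = ["happiness", "surprise", "dislike", "sadness", "fear"]   # assumed dimension order

def identify(e_s, entropy, entropy_threshold=0.8, ratio_threshold=2.0):
    e_s = np.asarray(e_s, dtype=float)
    top = np.argsort(e_s)[::-1]
    if entropy < entropy_threshold:
        return [EMOTIONS[top[0]]]                                    # formula (17): one clear emotion
    if e_s[top[0]] / max(e_s[top[1]], 1e-12) > ratio_threshold:
        return [EMOTIONS[top[0]]]                                    # formula (18): still one emotion
    return [EMOTIONS[top[0]], EMOTIONS[top[1]]]                      # formula (19): two emotions
```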
Considering that the topics of bullet screens are loose and their semantic noise is high, the invention also performs noise reduction from the semantic perspective by using the text vectors of the emotion bullet screens of each video segment to be analyzed s_i. The emotion semantic similarity matrix of segment s_i is a symmetric matrix, each value of which represents the pairwise semantic relevance of the emotion bullet screens. The upper triangular part of the matrix is analyzed: if the semantic similarity of two emotion bullet screens is lower than the in-segment bullet-screen semantic similarity threshold (the semantic similarity threshold is determined according to actual experiments), they are regarded as semantic distortion points and the corresponding emotion bullet screens are deleted. Through this operation, the embodiment of the present invention achieves better robustness.
The invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining the bullet-screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating a segment emotion vector and an emotion entropy for each video segment to be analyzed according to the bullet-screen emotion labels within that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and the emotion entropies. In this identification method, the video to be analyzed is divided into several video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified from the segment emotion vectors obtained from the bullet screens and from the emotion entropies. The identification cycle of emotion fragments is thereby shortened, which solves the problem that the long annotation time of manually labeled emotion tags makes the identification cycle of emotion fragments long.
Based on the identification method, for the emotionally rich bullet screens in a video, the attention mechanism and the idea of multi-modal fusion are applied to fuse the bullet-screen emotion semantics with the visual information of the video scene along the time sequence, thereby obtaining an enhanced representation of the bullet-screen emotion semantics. The representation that integrates text and visual information is then used to judge the emotion of the video segments obtained by scene-switching-based segmentation, accurately identify the emotion fragments in the video, and make up for the fact that existing video emotion understanding does not identify video emotion fragments.
Based on the foregoing identification method for video emotion fragments, an embodiment of the present invention further provides an identification apparatus for video emotion fragments, where a structural block diagram of the identification apparatus is shown in fig. 2, and the identification apparatus includes:
a determination module 201, a segmentation module 202, a calculation module 203 and a recognition module 204.
Wherein:
the determining module 201 is configured to determine a bullet screen emotion tag of each emotion bullet screen in a video to be analyzed;
the segmentation module 202 is configured to segment the video to be analyzed to obtain each video segment to be analyzed;
the calculating module 203 is configured to calculate segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion tags in each video segment to be analyzed;
the identifying module 204 is configured to identify an emotion fragment in each to-be-analyzed video fragment according to the fragment emotion vector and the emotion entropy.
The invention discloses a device for identifying video emotion fragments. The device determines the bullet-screen emotion label of each emotion bullet screen in a video to be analyzed; segments the video to be analyzed to obtain video segments to be analyzed; calculates a segment emotion vector and an emotion entropy for each video segment to be analyzed according to the bullet-screen emotion labels within that segment; and identifies the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and the emotion entropies. In this identification device, the video to be analyzed is divided into several video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified from the segment emotion vectors obtained from the bullet screens and from the emotion entropies. The identification cycle of emotion fragments is thereby shortened, which solves the problem that the long annotation time of manually labeled emotion tags makes the identification cycle of emotion fragments long.
In this embodiment of the present invention, the determining module 201 includes:
an acquisition unit 205, a screening unit 206 and a tag determination unit 207.
Wherein:
the obtaining unit 205 is configured to obtain each barrage of the video to be analyzed;
the screening unit 206 is configured to screen the bullet screens to obtain emotion bullet screens;
the label determining unit 207 is configured to determine a bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
In this embodiment of the present invention, the dividing module 202 includes:
a semantic determination unit 208, a first judgment unit 209, and a slicing unit 210.
Wherein:
the semantic determining unit 208 is configured to determine visual semantics of each frame in the video to be analyzed;
the first judging unit 209 is configured to sequentially compare the visual semantics of the adjacent frames, and judge whether a difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold;
and the segmentation unit 210 is configured to, if the result is yes, segment the video at the adjacent frames as segmentation points, so as to obtain each video segment to be analyzed.
In an embodiment of the present invention, the identifying module 204 includes:
a second determination unit 211, a first determination unit 212, a third determination unit 213, and a second determination unit 214.
Wherein:
the second determining unit 211 is configured to determine whether the emotion entropy is smaller than a preset emotion entropy threshold;
the first determining unit 212 is configured to determine, if the result is yes, that the current video segment to be analyzed contains one emotion;
the third determining unit 213 is configured to determine, if the result is no, whether the ratio of the largest component to the second-largest component of the current segment emotion vector is greater than a preset proportion threshold;
the second determining unit 214 is configured to determine that the current video segment to be analyzed contains one emotion if that ratio is greater than the threshold, or that the current video segment to be analyzed contains two emotions otherwise.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in one or more of software and/or hardware in implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The method and the device for identifying the video emotion fragments provided by the invention are described in detail, a specific example is applied in the text to explain the principle and the implementation mode of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (8)

1. A method for identifying video emotion fragments is characterized by comprising the following steps:
determining the bullet screen emotion labels of all emotion bullet screens in a video to be analyzed;
segmenting the video to be analyzed to obtain each video segment to be analyzed;
calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy;
identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy, wherein the identification comprises the following steps:
judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
if yes, determining that the current video segment to be analyzed contains one emotion; or,
if not, judging whether the ratio of the largest component to the second-largest component of the current segment emotion vector is greater than a preset proportion threshold;
if yes, determining that the current video segment to be analyzed contains one emotion, or otherwise, determining that the current video segment to be analyzed contains two emotions.
2. The method of claim 1, wherein determining the bullet screen emotion label for each emotion bullet screen in the video to be analyzed comprises:
acquiring each bullet screen of the video to be analyzed;
screening the bullet screens to obtain all emotion bullet screens;
and determining the bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
3. The method of claim 2, wherein determining the bullet screen sentiment label of each sentiment bullet screen according to a preset neural network model comprises:
determining a target semantic representation of each emotion bullet screen, wherein the target semantic representation is obtained by splicing the fine-granularity semantic representation and the coarse-granularity semantic representation of the corresponding emotion bullet screen;
determining visual vector representation of scene image data at each emotion bullet screen generation moment;
and transmitting the target semantic representation and the visual vector representation to the preset barrage emotion recognition neural network model to obtain an emotion label corresponding to the emotion barrage.
4. The method of claim 1, wherein segmenting the video to be analyzed to obtain each video segment to be analyzed comprises:
determining the visual semantics of each frame in the video to be analyzed;
sequentially comparing the visual semantics of the adjacent frames, and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and if so, segmenting the adjacent frames as segmentation points to obtain each video segment to be analyzed.
5. The method of claim 4, further comprising:
acquiring the bullet screen semanteme of the adjacent frames;
and correcting the segmentation points according to the bullet screen semantics.
6. An apparatus for identifying emotion fragments in a video, comprising:
the determining module is used for determining the bullet screen emotion labels of all emotion bullet screens in the video to be analyzed;
the segmentation module is used for segmenting the video to be analyzed to obtain each video segment to be analyzed;
the computing module is used for computing the segment emotion vectors and the emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
the identification module is used for identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy;
wherein the identification module comprises:
the second judgment unit is used for judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
the first judging unit is used for judging, if the result is yes, that the current video segment to be analyzed contains one emotion;
the third judging unit is used for judging, if the result is no, whether the ratio of the largest component to the second-largest component of the current segment emotion vector is greater than a preset proportion threshold;
and the second judging unit is used for judging that the current video segment to be analyzed contains one emotion if the ratio is greater than the threshold, or that the current video segment to be analyzed contains two emotions otherwise.
7. The apparatus of claim 6, wherein the determining module comprises:
the acquisition unit is used for acquiring each bullet screen of the video to be analyzed;
the screening unit is used for screening each bullet screen to obtain each emotion bullet screen;
and the label determining unit is used for determining the bullet screen emotion labels of each emotion bullet screen according to the preset neural network model.
8. The apparatus of claim 6, wherein the segmentation module comprises:
the semantic determining unit is used for determining the visual semantics of each frame in the video to be analyzed;
the first judgment unit is used for sequentially comparing the visual semantics of the adjacent frames and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and the segmentation unit is used for, if the result is yes, segmenting the video at the adjacent frames serving as segmentation points to obtain each video segment to be analyzed.
CN202010645824.XA 2020-07-07 2020-07-07 Video emotion fragment identification method and device Active CN111860237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010645824.XA CN111860237B (en) 2020-07-07 2020-07-07 Video emotion fragment identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010645824.XA CN111860237B (en) 2020-07-07 2020-07-07 Video emotion fragment identification method and device

Publications (2)

Publication Number Publication Date
CN111860237A CN111860237A (en) 2020-10-30
CN111860237B true CN111860237B (en) 2022-09-06

Family

ID=73153438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010645824.XA Active CN111860237B (en) 2020-07-07 2020-07-07 Video emotion fragment identification method and device

Country Status (1)

Country Link
CN (1) CN111860237B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699831B (en) * 2021-01-07 2022-04-01 重庆邮电大学 Video hotspot segment detection method and device based on barrage emotion and storage medium
CN113221689B (en) * 2021-04-27 2022-07-29 苏州工业职业技术学院 Video multi-target emotion degree prediction method
CN114339375B (en) * 2021-08-17 2024-04-02 腾讯科技(深圳)有限公司 Video playing method, method for generating video catalogue and related products
CN113656643B (en) * 2021-08-20 2024-05-03 珠海九松科技有限公司 Method for analyzing film viewing mood by using AI


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108513176A (en) * 2017-12-06 2018-09-07 北京邮电大学 A kind of socialization video subject extraction system and method based on topic model
CN108537139A (en) * 2018-03-20 2018-09-14 校宝在线(杭州)科技股份有限公司 A kind of Online Video wonderful analysis method based on barrage information
CN108737859A (en) * 2018-05-07 2018-11-02 华东师范大学 Video recommendation method based on barrage and device
CN109862397A (en) * 2019-02-02 2019-06-07 广州虎牙信息科技有限公司 A kind of video analysis method, apparatus, equipment and storage medium
CN110020437A (en) * 2019-04-11 2019-07-16 江南大学 The sentiment analysis and method for visualizing that a kind of video and barrage combine
CN110198482A (en) * 2019-04-11 2019-09-03 华东理工大学 A kind of video emphasis bridge section mask method, terminal and storage medium
CN110113659A (en) * 2019-04-19 2019-08-09 北京大米科技有限公司 Generate method, apparatus, electronic equipment and the medium of video
CN110263215A (en) * 2019-05-09 2019-09-20 众安信息技术服务有限公司 A kind of video feeling localization method and system
CN110569354A (en) * 2019-07-22 2019-12-13 中国农业大学 Barrage emotion analysis method and device
CN110852360A (en) * 2019-10-30 2020-02-28 腾讯科技(深圳)有限公司 Image emotion recognition method, device, equipment and storage medium
CN111163366A (en) * 2019-12-30 2020-05-15 厦门市美亚柏科信息股份有限公司 Video processing method and terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chenchen Li et al., "Visual-Texual Emotion Analysis With Deep Coupled Video and Danmu Neural Networks", IEEE Transactions on Multimedia, vol. 22, no. 6, pp. 1634-1646, June 2020. *
Deng Yang et al., "Video clip recommendation model based on bullet-screen sentiment analysis" (基于弹幕情感分析的视频片段推荐模型), Journal of Computer Applications (计算机应用), vol. 37, no. 4, pp. 1065-1070, 1134, April 2017. *

Also Published As

Publication number Publication date
CN111860237A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
Yang et al. Video captioning by adversarial LSTM
CN111860237B (en) Video emotion fragment identification method and device
Cheng et al. Fully convolutional networks for continuous sign language recognition
Poria et al. Context-dependent sentiment analysis in user-generated videos
CN106878632B (en) Video data processing method and device
CN112199956B (en) Entity emotion analysis method based on deep representation learning
CN110825867B (en) Similar text recommendation method and device, electronic equipment and storage medium
Xu et al. Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature
US11727915B1 (en) Method and terminal for generating simulated voice of virtual teacher
WO2023124647A1 (en) Summary determination method and related device thereof
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
CN115408488A (en) Segmentation method and system for novel scene text
CN115830610A (en) Multi-mode advertisement recognition method and system, electronic equipment and storage medium
CN114880496A (en) Multimedia information topic analysis method, device, equipment and storage medium
CN113268592B (en) Short text object emotion classification method based on multi-level interactive attention mechanism
Zaoad et al. An attention-based hybrid deep learning approach for bengali video captioning
CN113761377A (en) Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium
US20240037941A1 (en) Search results within segmented communication session content
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
Zhang et al. Recognition of emotions in user-generated videos through frame-level adaptation and emotion intensity learning
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
Gomes Jr et al. Framework for knowledge discovery in educational video repositories
Angrave et al. Creating tiktoks, memes, accessible content, and books from engineering videos? first solve the scene detection problem
Wang et al. Multimodal Cross-Attention Bayesian Network for Social News Emotion Recognition

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant