CN111860237A - Video emotion fragment identification method and device - Google Patents
Video emotion fragment identification method and device
- Publication number
- CN111860237A (application number CN202010645824.XA)
- Authority
- CN
- China
- Prior art keywords
- emotion
- video
- analyzed
- bullet screen
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Abstract
The invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining the bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating the segment emotion vector and emotion entropy of each video segment to be analyzed according to the bullet screen emotion labels in that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and emotion entropies. In this identification method, the video to be analyzed is divided into a plurality of video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified according to the segment emotion vectors and emotion entropies obtained from the bullet screens. This shortens the identification period of emotion fragments and solves the problem that manual annotation of emotion labels is time-consuming and therefore makes the emotion fragment identification period long.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for identifying video emotion fragments.
Background
With the development of multimedia technology, the volume of multimedia video data has grown explosively and attracted a large number of users. People increasingly watch videos to relieve pressure and boredom, and watching videos has become a new way of meeting people's emotional needs. However, there is a contradiction between the huge scale of videos and the limited time of users: audiences often only want to watch certain emotion segments of a video rather than the whole video. Therefore, emotion labels synchronized with time (five emotion categories: happiness, surprise, disgust, sadness and fear) are needed for the video, so that the emotion fragments in the video can be identified and the personalized emotional needs of audiences can be better met.
The first challenge of this work is that videos lack time-series emotion labels. At present, emotion labels are mainly annotated manually on each frame of a video, and emotion fragments are identified based on the annotated emotion labels; because manual annotation of emotion labels is time-consuming, the emotion fragment identification period is long.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for identifying video emotion fragments, so as to solve the problem in the prior art that emotion labels are mainly annotated manually on each frame of a video and emotion fragments are identified based on the annotated labels, which makes the emotion fragment identification period long because manual annotation is time-consuming. The specific scheme is as follows:
A video emotion fragment identification method comprises the following steps:
determining the bullet screen emotion labels of all emotion bullet screens in a video to be analyzed;
segmenting the video to be analyzed to obtain each video segment to be analyzed;
calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
Optionally, the method for determining the bullet screen emotion tag of each emotion bullet screen in the video to be analyzed includes:
acquiring each bullet screen of the video to be analyzed;
screening the bullet screens to obtain all emotion bullet screens;
and determining the bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
Optionally, the method for determining the bullet screen emotion tag of each emotion bullet screen according to the preset neural network model includes:
determining a target semantic representation of each emotion bullet screen, wherein the target semantic representation is obtained by splicing a fine-granularity semantic representation and a coarse-granularity semantic representation of the corresponding emotion bullet screen;
Determining visual vector representation of scene image data at each emotion bullet screen generation moment;
and transmitting the target semantic representation and the visual vector representation to the preset neural network model to obtain a bullet screen emotion label corresponding to the emotion bullet screen.
Optionally, in the method, the segmenting the video to be analyzed to obtain each video segment to be analyzed includes:
determining the visual semantics of each frame in the video to be analyzed;
sequentially comparing the visual semantics of the adjacent frames, and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and if so, segmenting the adjacent frames as segmentation points to obtain each video segment to be analyzed.
The above method, optionally, further includes:
acquiring the bullet screen semantics of the adjacent frames;
and correcting the segmentation points according to the bullet screen semantics.
Optionally, in the method, identifying the emotion fragment in each video fragment to be analyzed according to the fragment emotion vector and the emotion entropy includes:
judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
if yes, judging that the current video segment to be analyzed contains one emotion; or,
if not, judging whether the ratio of the maximum component to the second-largest component in the current segment emotion vector is larger than a preset proportional threshold;
if so, judging that the current video segment to be analyzed contains one emotion, or, if not, judging that the current video segment to be analyzed contains two emotions.
An apparatus for identifying emotion fragments in a video, comprising:
the determining module is used for determining the bullet screen emotion labels of all emotion bullet screens in the video to be analyzed;
the segmentation module is used for segmenting the video to be analyzed to obtain each video segment to be analyzed;
the computing module is used for computing the segment emotion vectors and the emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and the identification module is used for identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
The above apparatus, optionally, the determining module includes:
the acquisition unit is used for acquiring each bullet screen of the video to be analyzed;
the screening unit is used for screening the bullet screens to obtain all emotion bullet screens;
and the label determining unit is used for determining the bullet screen emotion labels of each emotion bullet screen according to the preset neural network model.
The above apparatus, optionally, the dividing module includes:
the semantic determining unit is used for determining the visual semantics of each frame in the video to be analyzed;
the first judgment unit is used for sequentially comparing the visual semantics of the adjacent frames and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and the segmentation unit is used for, if so, segmenting the video at the adjacent frames as segmentation points to obtain each video segment to be analyzed.
The above apparatus, optionally, the identification module includes:
the second judgment unit is used for judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
the first judging unit is used for judging, if so, that the current video segment to be analyzed contains one emotion;
the third judging unit is used for judging, if not, whether the ratio of the maximum component to the second-largest component in the current segment emotion vector is larger than a preset proportional threshold;
and the second judging unit is used for judging that the current video segment to be analyzed contains one emotion if the judgment result is positive, or judging that the current video segment to be analyzed contains two emotions if the judgment result is negative.
Compared with the prior art, the invention has the following advantages:
The invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining the bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating the segment emotion vector and emotion entropy of each video segment to be analyzed according to the bullet screen emotion labels in that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and emotion entropies. In this identification method, the video to be analyzed is divided into a plurality of video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified according to the segment emotion vectors and emotion entropies obtained from the bullet screens. This shortens the identification period of emotion fragments and solves the problem that manual annotation of emotion labels is time-consuming and therefore makes the emotion fragment identification period long.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a method for identifying video emotion fragments disclosed in an embodiment of the present application;
fig. 2 is a block diagram of a structure of an apparatus for recognizing video emotion fragments according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a method and a device for identifying video emotion fragments, which are applied in the process of identifying the emotion fragments in a video. In the prior art, emotion labels are annotated manually and the emotion fragments in a video are identified based on those labels; however, manual annotation of emotion labels is time-consuming, so the identification period is long. To solve this problem, an embodiment of the invention provides a video emotion fragment identification method. Many video sharing platforms widely provide time-synchronized comments known as the "bullet screen" (danmaku). These comments are the instant feelings of the audience while viewing, contain rich emotional expressions, are consistent with the development of video emotion, and can therefore be used for video emotion analysis. The identification method accordingly identifies the video to be analyzed based on the bullet screens. The execution flow of the identification method is shown in FIG. 1 and comprises the following steps:
s101, determining a bullet screen emotion label of each emotion bullet screen in a video to be analyzed;
in the embodiment of the invention, each bullet screen of the video to be analyzed is obtained. Because bullet screens are the instant feelings of video users, not all viewers take part in the bullet screen interaction throughout the whole video, and check-in or sign-in bullet screens irrelevant to the video semantics and video emotions often appear; the topics of the bullet screens are therefore loose and the semantic noise is considerable. Preferably, denoising is performed first.
Further, the bullet screen emotion label of each emotion bullet screen is determined according to a preset neural network model. The preset neural network model is a bullet screen emotion model which needs to be trained in advance; the training process of the bullet screen emotion model is as follows:
first, a training data set is constructed: from the bullet screen data set C, which does not contain emotion labels, a bullet screen emotion data set C_e with emotion labels is constructed, and the bullet screen emotion model is trained based on C_e. Considering the high cost of manual annotation, the emotion labels of the bullet screen emotion data set C_e are obtained through a two-stage word matching method, whose basic idea rests on the fact that emotional expression in bullet screens is very common. Bullet screens contain rich emotional expressions, and bullet screens with explicit emotional expression can be recognized automatically through a two-stage emotion dictionary matching method. In the first stage, emotion polarity recognition is performed on all video bullet screens using a comprehensive emotion polarity dictionary that integrates a general emotion dictionary and a bullet screen emotion dictionary, and bullet screens that contain explicit emotional expression and whose positive or negative polarity can be recognized are selected. In the second stage, fine-grained emotion recognition (five emotion categories: happiness, surprise, disgust, sadness and fear) is performed with a fine-grained emotion dictionary on the bullet screens with recognized polarity obtained in the first stage. Through these two stages of emotion dictionary matching, emotion bullet screens with emotion labels are finally obtained. The bullet screen data set C and the emotion bullet screen data set C_e are expressed as follows:
C = {(C_1, T_1, I_1), …, (C_i, T_i, I_i), …, (C_N, T_N, I_N)}   (1)
C_e = {(C_1^e, T_1^e, I_1^e, y_1^e), …, (C_j^e, T_j^e, I_j^e, y_j^e), …, (C_M^e, T_M^e, I_M^e, y_M^e)}   (2)
wherein any element (C_i, T_i, I_i) of the bullet screen data set C represents the bullet screen C_i posted at moment T_i together with the scene image data I_i of the corresponding video key frame; any element (C_j^e, T_j^e, I_j^e, y_j^e) of the emotion bullet screen data set C_e represents the bullet screen C_j^e posted at moment T_j^e, the scene image data I_j^e of the corresponding video key frame, and the five-class emotion label y_j^e of the bullet screen C_j^e; N and M represent the number of bullet screen sentences and the number of emotion bullet screen sentences, respectively.
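As an illustration of the two-stage dictionary matching described above, the following Python sketch shows the control flow only; the dictionary structures (polarity_dict, fine_grained_dict), the whitespace tokenization and the majority-vote tie-breaking are assumptions for illustration, not the patent's exact procedure.

```python
# Minimal sketch of the two-stage emotion-dictionary labelling (assumed data layout).
# polarity_dict maps a word to +1/-1; fine_grained_dict maps a word to one of the
# five emotion classes. Both dictionaries are hypothetical placeholders here.

FIVE_EMOTIONS = ["happiness", "surprise", "disgust", "sadness", "fear"]

def label_danmaku(comments, polarity_dict, fine_grained_dict):
    """Return (comment, emotion_label) pairs for comments that pass both stages."""
    labelled = []
    for text in comments:
        words = text.split()  # a real implementation would use a Chinese tokenizer
        # Stage 1: keep only comments containing an explicit positive/negative polarity word.
        if not any(w in polarity_dict for w in words):
            continue
        # Stage 2: fine-grained matching against the five emotion categories.
        hits = [fine_grained_dict[w] for w in words if w in fine_grained_dict]
        if hits:
            # Majority vote over matched words as a simple tie-breaking rule (assumption).
            label = max(set(hits), key=hits.count)
            labelled.append((text, label))
    return labelled
```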
In the embodiment of the invention, based on the emotion bullet screen data set CeTraining the barrage emotion model, wherein the input of the barrage emotion model is an emotion barrage data set CeOf which any one element isNamely:bullet screen text corresponding to momentAnd emotion tag textAnd visual data information of the video key frame at that timeThe characterization process of the input data is as follows: obtaining bullet screen text by using pre-training language model BertSentence vector characterization ofSum word vector characterizationBarrage emotion label obtained by using pre-training language model BertSentence vector characterization ofProcessing visual image information of video key frame by using existing deep network model VGGExtracting the last convolution layer of the VGG model as the result Vector characterization ofThe correlation formula is as follows:
in view of the fact that the text semantics of a bullet screen are related to the semantics of the video scene at the corresponding moment, in the embodiment of the invention the scene visual information v_j^e of the bullet screen is fused, by way of attention, into the word vectors E_j^e of the bullet screen text to help the model focus on the words related to the visual scene of the bullet screen, obtaining the bullet screen word vectors E_j^{e,v} with visual attention. The relevant formulas of the attention mechanism are as follows:

M = tanh(W_1 E_j^e + W_2 v_j^e)
α = softmax(W_3 M)   (7)
E_j^{e,v} = α ⊙ E_j^e

wherein W_1, W_2 and W_3 are the training parameters of the attention unit (which may be set empirically or on a case-by-case basis), tanh denotes the activation function of the deep neural network, M is an intermediate quantity, softmax denotes the normalization operation, and α denotes the visual attention weight of each word of the bullet screen text with respect to the visual information. The visual attention weight α acts on the word vectors E_j^e of the bullet screen text to obtain the bullet screen word vectors E_j^{e,v} with visual attention.
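A minimal PyTorch sketch of this visual attention fusion follows; the projection sizes, the use of nn.Linear for W_1–W_3 and the element-wise application of α are assumptions consistent with the description above, not the patent's exact formulation.

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Fuse a frame-level visual vector into bullet-screen word vectors via attention."""
    def __init__(self, d_word, d_visual, d_att):
        super().__init__()
        self.W1 = nn.Linear(d_word, d_att, bias=False)
        self.W2 = nn.Linear(d_visual, d_att, bias=False)
        self.W3 = nn.Linear(d_att, 1, bias=False)

    def forward(self, word_vecs, visual_vec):
        # word_vecs: (seq_len, d_word); visual_vec: (d_visual,)
        m = torch.tanh(self.W1(word_vecs) + self.W2(visual_vec))   # intermediate quantity M
        alpha = torch.softmax(self.W3(m).squeeze(-1), dim=0)       # visual attention weights, eq. (7)
        return word_vecs * alpha.unsqueeze(-1)                     # visually attended word vectors
```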
Considering the sequential information of the words contained in the sentence text, the invention uses the recurrent neural network BiLSTM and a self-attention mechanism to model the bullet screen word vectors E_j^{e,v} fused with visual information, obtaining the fine-grained semantic representation h_j^e of the bullet screen:

h_j^e = SelfAtt(BiLSTM(E_j^{e,v}))

As shown in formula (3), the bullet screen sentence vector representation s_j^e obtained from the Bert model is the coarse-grained semantic representation of the whole bullet screen sentence. It is spliced, with a weight, to the fine-grained semantic representation h_j^e of the bullet screen to obtain the target semantic representation r_j^e of the bullet screen, see the following formula:

r_j^e = (γ · s_j^e) ⊕ h_j^e

wherein γ is a weight adjustment parameter, and the sign '⊕' denotes the splicing (concatenation) operation of tensors.
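The fine-grained encoder and the weighted splice can be sketched as follows; the hidden sizes, the additive form of self-attention, the weighted pooling and the placement of γ on the sentence-vector branch are assumptions, since the patent only names BiLSTM, self-attention and a weighted splice.

```python
import torch
import torch.nn as nn

class FineGrainedEncoder(nn.Module):
    """BiLSTM + self-attention over visually attended word vectors, spliced with the sentence vector."""
    def __init__(self, d_word, d_hidden, gamma=0.5):
        super().__init__()
        self.bilstm = nn.LSTM(d_word, d_hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * d_hidden, 1)
        self.gamma = gamma                       # weight adjustment parameter γ (assumed scalar)

    def forward(self, word_vecs_v, sentence_vec):
        # word_vecs_v: (1, seq_len, d_word); sentence_vec: (2*d_hidden,) -- assumed same size
        h, _ = self.bilstm(word_vecs_v)                    # (1, seq_len, 2*d_hidden)
        alpha = torch.softmax(self.att(h), dim=1)          # self-attention weights
        fine = (alpha * h).sum(dim=1).squeeze(0)           # fine-grained semantic representation
        # Weighted splice of coarse- and fine-grained representations (γ on the sentence branch).
        return torch.cat([self.gamma * sentence_vec, fine], dim=-1)
```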
Subsequently, the target semantic representation r_j^e is passed through a fully connected layer FC for training and output, and the emotion probability P of the bullet screen is obtained:

P(ŷ_j^e | C_j^e, I_j^e) = softmax(FC(r_j^e))

wherein y represents the emotion category to which the bullet screen belongs, and ŷ_j^e represents the emotion category probability of the bullet screen C_j^e calculated from the input. FC is a single-layer fully connected network structure; through the output of the fully connected layer, the emotion probability P of each emotion bullet screen is obtained, and the emotion model is trained by minimizing the following objective function:

L = softmax_cross_entropy(y_j^e, ŷ_j^e)

wherein y_j^e is the original emotion label of the emotion bullet screen C_j^e, ŷ_j^e is the emotion probability output by the model after training, and softmax_cross_entropy is the cross-entropy loss function, which computes the cross-entropy loss between the original emotion label y_j^e of each bullet screen and the emotion prediction result ŷ_j^e. To minimize the objective function, an Adam optimizer is used to iteratively update each parameter in the model (implemented with TensorFlow automatic differentiation), thereby training the emotion recognition model for bullet screens.
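A sketch of the classification head and the training objective follows; the patent specifies a single fully connected layer, softmax cross-entropy and the Adam optimizer (originally implemented with TensorFlow automatic differentiation), so this PyTorch version is an assumed equivalent that, for brevity, trains only the head on precomputed representations.

```python
import torch
import torch.nn as nn

def train_classifier_head(samples, d_repr, num_classes=5, epochs=10):
    """Train the single fully connected output layer on precomputed target semantic representations.

    samples: list of (representation_tensor, label_index); in the full model the encoder
    parameters would also be updated end-to-end -- omitted here for brevity.
    """
    fc = nn.Linear(d_repr, num_classes)                        # single-layer fully connected output
    criterion = nn.CrossEntropyLoss()                          # softmax cross-entropy loss
    optimizer = torch.optim.Adam(fc.parameters(), lr=1e-4)     # Adam optimizer, as in the patent
    for _ in range(epochs):
        for repr_vec, label in samples:
            logits = fc(repr_vec.detach()).unsqueeze(0)        # emotion scores before softmax
            loss = criterion(logits, torch.tensor([label]))    # compare with the original emotion label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return fc
```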
Finally, the trained bullet screen emotion recognition model performs emotion prediction on any bullet screen C_k: (C_k, T_k, I_k) in the bullet screen data set C, outputs P(y | C_k, I_k), and further obtains the emotion probability vector e_k:

e_k = softmax(P(y | C_k, I_k))

wherein P(y | C_k, I_k) is the model output for the bullet screen C_k, and the softmax operation computes the proportion of each of the multiple classes and ensures that the proportions sum to 1. The invention applies softmax to further process the prediction result for the bullet screen C_k and obtain its emotion probability vector e_k. It is a five-dimensional emotion vector whose value in each dimension can be regarded as the emotional semantic distribution of the bullet screen C_k in that dimension; it measures the emotional semantic value of the bullet screen C_k in each emotion dimension and also represents the emotion label of the bullet screen C_k.
S102, segmenting the video to be analyzed to obtain each video segment to be analyzed;
in the embodiment of the invention, because bullet screen comments are the instant responses of the audience, the emotions they contain are always momentary. Therefore, video emotion analysis based on a continuous period of time is most suitable. In fact, a video contains many relatively independent scene segments whose content usually has relative independence and its own topics, and these evolve with the development of the video scenes. That is, changes of the video plot are generally consistent with the switching of video scenes, so a change of video scene can be used as the basis for segmenting video clips. Compared with conventional equal-length video clip segmentation, segmenting video clips from the perspective of scene switching is more suitable for this application.
Firstly, an object recognition method based on bottom-up and top-down attention is used to perform object recognition on the visual data information of each video key frame, obtaining the visual words of each frame. The visual words can be viewed as the visual semantics of the frame and describe its visual scene. When the visual-word text of two adjacent frames changes remarkably, the described scene has changed, and that moment can be used as a video segmentation point.
Furthermore, in order to improve the segmentation accuracy, the invention also corrects the segmentation points from the perspective of video semantics. This operation is implemented by means of the bullet screens, which can reflect the video semantics: a segment segmentation point acts as a plot transition point of the video, so the bullet screen semantics at that moment should be relatively loose; if the bullet screen semantics at that moment are instead concentrated and consistent, the segmentation point is corrected. That is, for any video segment S_i obtained in the previous stage, the pairwise semantic similarity of all its bullet screens is computed to construct the semantic similarity matrix of segment S_i, from which the average bullet screen semantic similarity of the video segment S_i is further obtained. The average bullet screen semantic similarity of each video segment S_i is then examined, and video segments with a very high average semantic similarity are discarded (judged by an empirical threshold determined through practical experiments); finally, a set of video segments to be analyzed {s_p} with relatively independent and natural plots is obtained.
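The scene-switch segmentation and the bullet-screen-based correction can be sketched as follows; the Jaccard-style measure of visual-word difference and the sentence-embedding similarity callback are illustrative stand-ins, since the patent does not fix these choices beyond naming bottom-up and top-down attention object recognition.

```python
def segment_by_scene(frame_visual_words, diff_threshold):
    """frame_visual_words: list of sets of visual words, one set per key frame."""
    cut_points = []
    for t in range(1, len(frame_visual_words)):
        prev, cur = frame_visual_words[t - 1], frame_visual_words[t]
        union = prev | cur
        # Jaccard-style difference between adjacent frames (assumed measure).
        diff = 1.0 - (len(prev & cur) / len(union)) if union else 0.0
        if diff > diff_threshold:
            cut_points.append(t)            # adjacent frames taken as a segmentation point
    return cut_points

def correct_cut_points(cut_points, danmaku_sim_at, high_sim_threshold):
    """Drop cut points where the bullet-screen semantics around the cut are concentrated and consistent.

    danmaku_sim_at(t) is assumed to return the average pairwise semantic similarity of the
    bullet screens around frame t (e.g. cosine similarity of sentence embeddings).
    """
    return [t for t in cut_points if danmaku_sim_at(t) < high_sim_threshold]
```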
S103, calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
in the embodiment of the invention, the video to be analyzed contains complex multi-modal content and complex emotion, and the emotion bullet screens of the video audience can be regarded as an indirect reflection of the video emotion, which makes them suitable for video emotion analysis. For any segment s_i in the set of video segments to be analyzed {s_p}, let the emotion bullet screen set of segment s_i be {c_1, …, c_u} and the corresponding set of emotion vectors be {e_1, …, e_u}, where the emotion vector of each bullet screen c_k is e_k. Summing the emotion vectors of all bullet screens of the segment dimension by dimension gives the emotion vector e(s_i) of segment s_i, as shown in the following formula:

e(s_i) = Σ_{k=1}^{u} e_k

wherein u is the number of emotion bullet screens in segment s_i, and the vector e(s_i) is the emotion vector of segment s_i, representing the emotion labels of segment s_i in each emotion dimension.

In information theory, entropy is a quantity describing the disorder of a system: the greater the entropy, the more disordered the system and the less information it carries; the smaller the entropy, the more ordered the system and the more information it carries. In the emotion vector e(s_i) of segment s_i, the concentration of the distribution of emotional semantic information across the emotion dimensions can likewise be measured by entropy and used to judge the emotional tendency of the segment s_i to be analyzed; the entropy of the segment s_i to be analyzed is here also called the emotion entropy of segment s_i. According to the entropy formula in information theory, the emotion entropy of segment s_i is obtained as shown in the following formula:

H(e_i) = −Σ_{d=1}^{5} p_d log p_d,   where p_d = e(s_i)_d / Σ_{d'=1}^{5} e(s_i)_{d'}
and S104, identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
In the embodiment of the invention, because video emotion is complex, the emotional tendencies of emotion video segments do not all fall into a single emotion category, and video segments with complex emotions are common. The present application aims to find video segments containing no more than two significant emotional tendencies: one kind is a video emotion segment with only one obvious emotional tendency; the other is a video emotion segment with two obvious emotional tendencies.
The processing of a video emotion segment with a single emotional tendency is as follows: when the emotion entropy H(e_i) of the video segment s_i to be analyzed is very small, i.e. smaller than the emotion entropy threshold H(e)_threshold, the emotional semantics of the emotion bullet screens of segment s_i tend to be consistent across all dimensions, which means that the video segment s_i to be analyzed contains only one significant emotional tendency.
On this basis, when the emotion entropy H(e_i) of segment s_i is only slightly above the threshold H(e)_threshold, the emotional tendency of segment s_i is not necessarily single, and a further judgment is needed: in the emotion vector e(s_i) of segment s_i, when the maximum component is far greater than the second-largest component, segment s_i still has only one emotional tendency, namely the emotion category of the dimension of the maximum component, as shown in the following formula:

label(s_i) = argmax_d e(s_i)_d,   if max_d e(s_i)_d / secmax_d e(s_i)_d > λ   (18)

wherein λ is the preset proportional threshold and secmax denotes the second-largest component.
The processing of a video emotion segment to be analyzed containing two emotional tendencies is as follows: as shown in formula (18), when the emotion entropy H(e_i) of the video emotion segment s_i to be analyzed is only slightly above the threshold H(e)_threshold, its emotional tendency is not necessarily single. When, in the emotion vector e(s_i) of the video emotion segment s_i to be analyzed, the difference between the maximum component and the second-largest component is small, the emotion categories of the dimensions of these two components can both be regarded as the emotional tendency of the video emotion segment s_i to be analyzed, i.e. the segment s_i has two main emotional tendencies.
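Combining the single-emotion and two-emotion cases, the identification rule of step S104 can be sketched as follows; representing "far greater than" / "the difference is small" by a single ratio test is a simplifying assumption.

```python
def identify_segment_emotions(segment_vector, entropy, entropy_threshold, ratio_threshold,
                              emotions=("happiness", "surprise", "disgust", "sadness", "fear")):
    """Return the one or two dominant emotion labels of a segment, per the entropy/ratio rules."""
    ranked = sorted(range(len(segment_vector)), key=lambda d: segment_vector[d], reverse=True)
    top, second = ranked[0], ranked[1]
    if entropy < entropy_threshold:
        return [emotions[top]]                               # single significant emotional tendency
    if segment_vector[second] == 0 or \
       segment_vector[top] / segment_vector[second] > ratio_threshold:
        return [emotions[top]]                               # maximum component clearly dominates
    return [emotions[top], emotions[second]]                 # two main emotional tendencies
```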
Considering that the topics of bullet screens are loose and the semantic noise is considerable, the invention also performs noise reduction from the semantic perspective, using the text vectors of the emotion bullet screens of each video segment s_i to be analyzed. For segment s_i, the emotion semantic similarity matrix is a symmetric matrix in which each value represents the pairwise semantic relevance of the emotion bullet screens; its upper triangular part is analyzed, and if the semantic similarity of two emotion bullet screens is lower than the in-segment bullet screen semantic similarity threshold (the threshold is determined according to practical experiments), they are regarded as semantic distortion points and the corresponding emotion bullet screens are deleted. Through this operation, the embodiment of the invention achieves better robustness.
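A sketch of this in-segment semantic denoising follows; the cosine similarity of text vectors is an assumed measure, and dropping bullet screens whose best pairwise similarity falls below the threshold is one plausible reading of the "semantic distortion point" criterion.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def drop_semantic_outliers(danmaku_text_vectors, sim_threshold):
    """Keep only emotion bullet screens that are semantically close to at least one other one."""
    n = len(danmaku_text_vectors)
    keep = []
    for i in range(n):
        sims = [cosine(danmaku_text_vectors[i], danmaku_text_vectors[j]) for j in range(n) if j != i]
        if sims and max(sims) >= sim_threshold:
            keep.append(i)                    # otherwise treated as a semantic distortion point
    return keep
```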
The invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining the bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating the segment emotion vector and emotion entropy of each video segment to be analyzed according to the bullet screen emotion labels in that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and emotion entropies. In this identification method, the video to be analyzed is divided into a plurality of video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified according to the segment emotion vectors and emotion entropies obtained from the bullet screens, so that the identification period of emotion fragments is shortened, solving the problem that manual annotation of emotion labels is time-consuming and therefore makes the emotion fragment identification period long.
Based on this identification method, for the emotionally rich bullet screens in the video, the attention mechanism and the idea of multi-modal fusion are applied to fuse the bullet screen emotion semantics with the visual information of the video scene along the time sequence, thereby realizing an enhanced representation of the bullet screen emotion semantics. This representation, integrating text and visual information, is used to judge the emotion of the video segments obtained by segmentation based on visual semantics and scene switching, so that the emotion fragments in the video are accurately identified, making up for the deficiency that existing video emotion understanding does not identify video emotion fragments.
Based on the foregoing identification method for video emotion fragments, an embodiment of the present invention further provides an identification apparatus for video emotion fragments, where a structural block diagram of the identification apparatus is shown in fig. 2, and the identification apparatus includes:
a determination module 201, a segmentation module 202, a calculation module 203 and a recognition module 204.
Wherein,
the determining module 201 is configured to determine a bullet screen emotion tag of each emotion bullet screen in a video to be analyzed;
the segmentation module 202 is configured to segment the video to be analyzed to obtain each video segment to be analyzed;
the calculating module 203 is configured to calculate segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion tags in each video segment to be analyzed;
The identifying module 204 is configured to identify an emotion fragment in each to-be-analyzed video fragment according to the fragment emotion vector and the emotion entropy.
The invention discloses a device for identifying video emotion fragments, which is configured to: determine the bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segment the video to be analyzed to obtain video segments to be analyzed; calculate the segment emotion vector and emotion entropy of each video segment to be analyzed according to the bullet screen emotion labels in that segment; and identify the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and emotion entropies. In this identification device, the video to be analyzed is divided into a plurality of video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified according to the segment emotion vectors and emotion entropies obtained from the bullet screens, so that the identification period of emotion fragments is shortened, solving the problem that manual annotation of emotion labels is time-consuming and therefore makes the emotion fragment identification period long.
In this embodiment of the present invention, the determining module 201 includes:
An acquisition unit 205, a screening unit 206 and a tag determination unit 207.
Wherein,
the obtaining unit 205 is configured to obtain each barrage of the video to be analyzed;
the screening unit 206 is configured to screen the bullet screens to obtain emotion bullet screens;
the label determining unit 207 is configured to determine a bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
In this embodiment of the present invention, the dividing module 202 includes:
a semantic determination unit 208, a first judgment unit 209, and a slicing unit 210.
Wherein,
the semantic determining unit 208 is configured to determine visual semantics of each frame in the video to be analyzed;
the first judging unit 209 is configured to sequentially compare the visual semantics of the adjacent frames, and judge whether a difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold;
and the segmentation unit 210 is configured to, if so, segment the video at the adjacent frames as segmentation points to obtain each video segment to be analyzed.
In this embodiment of the present invention, the identifying module 204 includes:
a second judging unit 211, a first judging unit 212, a third judging unit 213, and a second judging unit 214.
Wherein,
The second determining unit 211 is configured to determine whether the emotion entropy is smaller than a preset emotion entropy threshold;
the first determining unit 212 is configured to determine that the current video segment to be analyzed includes an emotion, or;
the third determining unit 213 is configured to determine whether a ratio of a maximum component to a next largest component in the current segment emotion vector is greater than a preset ratio threshold if the ratio is not greater than the preset ratio threshold;
the second determining unit 214 is configured to determine that the current video segment to be analyzed includes one emotion if the current video segment to be analyzed includes one emotion, or determine that the current video segment to be analyzed includes two emotions if the current video segment to be analyzed includes one emotion.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in the same software and/or hardware or in a plurality of software and/or hardware when implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The method and the device for identifying the video emotion fragments provided by the invention are described in detail, a specific example is applied in the text to explain the principle and the implementation mode of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (10)
1. A method for identifying video emotion fragments is characterized by comprising the following steps:
determining the bullet screen emotion labels of all emotion bullet screens in a video to be analyzed;
segmenting the video to be analyzed to obtain each video segment to be analyzed;
calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
2. The method of claim 1, wherein determining the bullet screen emotion label for each emotion bullet screen in the video to be analyzed comprises:
acquiring each bullet screen of the video to be analyzed;
screening the bullet screens to obtain all emotion bullet screens;
and determining the bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
3. The method of claim 2, wherein determining the bullet screen emotion label for each emotion bullet screen according to a preset neural network model comprises:
determining a target semantic representation of each emotion bullet screen, wherein the target semantic representation is obtained by splicing a fine-granularity semantic representation and a coarse-granularity semantic representation of the corresponding emotion bullet screen;
Determining visual vector representation of scene image data at each emotion bullet screen generation moment;
and transmitting the target semantic representation and the visual vector representation to the preset barrage emotion recognition neural network model to obtain an emotion label corresponding to the emotion barrage.
4. The method of claim 1, wherein segmenting the video to be analyzed to obtain each video segment to be analyzed comprises:
determining the visual semantics of each frame in the video to be analyzed;
sequentially comparing the visual semantics of the adjacent frames, and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and if so, segmenting the adjacent frames as segmentation points to obtain each video segment to be analyzed.
5. The method of claim 4, further comprising:
acquiring the bullet screen semantics of the adjacent frames;
and correcting the segmentation points according to the bullet screen semantics.
6. The method of claim 1, wherein identifying the emotion fragments in the respective video fragments to be analyzed according to the fragment emotion vector and the emotion entropy comprises:
judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
if yes, judging that the current video segment to be analyzed contains one emotion; or,
if not, judging whether the ratio of the maximum component to the second-largest component in the current segment emotion vector is larger than a preset proportional threshold;
if so, judging that the current video segment to be analyzed contains one emotion, or, if not, judging that the current video segment to be analyzed contains two emotions.
7. An apparatus for identifying emotion fragments in a video, comprising:
the determining module is used for determining the bullet screen emotion labels of all emotion bullet screens in the video to be analyzed;
the segmentation module is used for segmenting the video to be analyzed to obtain each video segment to be analyzed;
the computing module is used for computing the segment emotion vectors and the emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and the identification module is used for identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
8. The apparatus of claim 7, wherein the determining module comprises:
the acquisition unit is used for acquiring each bullet screen of the video to be analyzed;
The screening unit is used for screening the bullet screens to obtain all emotion bullet screens;
and the label determining unit is used for determining the bullet screen emotion labels of each emotion bullet screen according to the preset neural network model.
9. The apparatus of claim 7, wherein the segmentation module comprises:
the semantic determining unit is used for determining the visual semantics of each frame in the video to be analyzed;
the first judgment unit is used for sequentially comparing the visual semantics of the adjacent frames and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and the segmentation unit is used for, if so, segmenting the video at the adjacent frames as segmentation points to obtain each video segment to be analyzed.
10. The apparatus of claim 7, wherein the identification module comprises:
the second judgment unit is used for judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
the first judging unit is used for judging, if so, that the current video segment to be analyzed contains one emotion;
the third judging unit is used for judging, if not, whether the ratio of the maximum component to the second-largest component in the current segment emotion vector is larger than a preset proportional threshold;
And the second judging unit is used for judging that the current video segment to be analyzed contains one emotion if the judgment result is positive, or judging that the current video segment to be analyzed contains two emotions if the judgment result is negative.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010645824.XA CN111860237B (en) | 2020-07-07 | 2020-07-07 | Video emotion fragment identification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010645824.XA CN111860237B (en) | 2020-07-07 | 2020-07-07 | Video emotion fragment identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111860237A true CN111860237A (en) | 2020-10-30 |
CN111860237B CN111860237B (en) | 2022-09-06 |
Family
ID=73153438
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010645824.XA Active CN111860237B (en) | 2020-07-07 | 2020-07-07 | Video emotion fragment identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111860237B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364743A (en) * | 2020-11-02 | 2021-02-12 | 北京工商大学 | Video classification method based on semi-supervised learning and bullet screen analysis |
CN112699831A (en) * | 2021-01-07 | 2021-04-23 | 重庆邮电大学 | Video hotspot segment detection method and device based on barrage emotion and storage medium |
CN113221689A (en) * | 2021-04-27 | 2021-08-06 | 苏州工业职业技术学院 | Video multi-target emotion prediction method and system |
CN113656643A (en) * | 2021-08-20 | 2021-11-16 | 珠海九松科技有限公司 | Algorithm for analyzing film-watching mood by using AI (artificial intelligence) |
CN114339375A (en) * | 2021-08-17 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Video playing method, method for generating video directory and related product |
CN117710777A (en) * | 2024-02-06 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Model training method, key frame extraction method and device |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108513176A (en) * | 2017-12-06 | 2018-09-07 | 北京邮电大学 | A kind of socialization video subject extraction system and method based on topic model |
CN108537139A (en) * | 2018-03-20 | 2018-09-14 | 校宝在线(杭州)科技股份有限公司 | A kind of Online Video wonderful analysis method based on barrage information |
CN108737859A (en) * | 2018-05-07 | 2018-11-02 | 华东师范大学 | Video recommendation method based on barrage and device |
CN109862397A (en) * | 2019-02-02 | 2019-06-07 | 广州虎牙信息科技有限公司 | A kind of video analysis method, apparatus, equipment and storage medium |
CN110020437A (en) * | 2019-04-11 | 2019-07-16 | 江南大学 | The sentiment analysis and method for visualizing that a kind of video and barrage combine |
CN110113659A (en) * | 2019-04-19 | 2019-08-09 | 北京大米科技有限公司 | Generate method, apparatus, electronic equipment and the medium of video |
CN110198482A (en) * | 2019-04-11 | 2019-09-03 | 华东理工大学 | A kind of video emphasis bridge section mask method, terminal and storage medium |
CN110263215A (en) * | 2019-05-09 | 2019-09-20 | 众安信息技术服务有限公司 | A kind of video feeling localization method and system |
CN110569354A (en) * | 2019-07-22 | 2019-12-13 | 中国农业大学 | Barrage emotion analysis method and device |
CN110852360A (en) * | 2019-10-30 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Image emotion recognition method, device, equipment and storage medium |
CN111163366A (en) * | 2019-12-30 | 2020-05-15 | 厦门市美亚柏科信息股份有限公司 | Video processing method and terminal |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108513176A (en) * | 2017-12-06 | 2018-09-07 | 北京邮电大学 | A kind of socialization video subject extraction system and method based on topic model |
CN108537139A (en) * | 2018-03-20 | 2018-09-14 | 校宝在线(杭州)科技股份有限公司 | A kind of Online Video wonderful analysis method based on barrage information |
CN108737859A (en) * | 2018-05-07 | 2018-11-02 | 华东师范大学 | Video recommendation method based on barrage and device |
CN109862397A (en) * | 2019-02-02 | 2019-06-07 | 广州虎牙信息科技有限公司 | A kind of video analysis method, apparatus, equipment and storage medium |
CN110020437A (en) * | 2019-04-11 | 2019-07-16 | 江南大学 | The sentiment analysis and method for visualizing that a kind of video and barrage combine |
CN110198482A (en) * | 2019-04-11 | 2019-09-03 | 华东理工大学 | A kind of video emphasis bridge section mask method, terminal and storage medium |
CN110113659A (en) * | 2019-04-19 | 2019-08-09 | 北京大米科技有限公司 | Generate method, apparatus, electronic equipment and the medium of video |
CN110263215A (en) * | 2019-05-09 | 2019-09-20 | 众安信息技术服务有限公司 | A kind of video feeling localization method and system |
CN110569354A (en) * | 2019-07-22 | 2019-12-13 | 中国农业大学 | Barrage emotion analysis method and device |
CN110852360A (en) * | 2019-10-30 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Image emotion recognition method, device, equipment and storage medium |
CN111163366A (en) * | 2019-12-30 | 2020-05-15 | 厦门市美亚柏科信息股份有限公司 | Video processing method and terminal |
Non-Patent Citations (2)
Title |
---|
CHENCHEN LI等: "《Visual-Texual Emotion Analysis With Deep Coupled Video and Danmu Neural Networks》", 《IEEE TRANSACTIONS ON MULTIMEDIA》 * |
邓扬等: "《基于弹幕情感分析的视频片段推荐模型》", 《计算机应用》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364743A (en) * | 2020-11-02 | 2021-02-12 | 北京工商大学 | Video classification method based on semi-supervised learning and bullet screen analysis |
CN112699831A (en) * | 2021-01-07 | 2021-04-23 | 重庆邮电大学 | Video hotspot segment detection method and device based on barrage emotion and storage medium |
CN112699831B (en) * | 2021-01-07 | 2022-04-01 | 重庆邮电大学 | Video hotspot segment detection method and device based on barrage emotion and storage medium |
CN113221689A (en) * | 2021-04-27 | 2021-08-06 | 苏州工业职业技术学院 | Video multi-target emotion prediction method and system |
CN114339375A (en) * | 2021-08-17 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Video playing method, method for generating video directory and related product |
CN114339375B (en) * | 2021-08-17 | 2024-04-02 | 腾讯科技(深圳)有限公司 | Video playing method, method for generating video catalogue and related products |
CN113656643A (en) * | 2021-08-20 | 2021-11-16 | 珠海九松科技有限公司 | Algorithm for analyzing film-watching mood by using AI (artificial intelligence) |
CN113656643B (en) * | 2021-08-20 | 2024-05-03 | 珠海九松科技有限公司 | Method for analyzing film viewing mood by using AI |
CN117710777A (en) * | 2024-02-06 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Model training method, key frame extraction method and device |
CN117710777B (en) * | 2024-02-06 | 2024-06-04 | 腾讯科技(深圳)有限公司 | Model training method, key frame extraction method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111860237B (en) | 2022-09-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111860237B (en) | Video emotion fragment identification method and device | |
Yang et al. | Video captioning by adversarial LSTM | |
Poria et al. | Context-dependent sentiment analysis in user-generated videos | |
CN109117777B (en) | Method and device for generating information | |
CN113255755B (en) | Multi-modal emotion classification method based on heterogeneous fusion network | |
CN106878632B (en) | Video data processing method and device | |
CN110825867B (en) | Similar text recommendation method and device, electronic equipment and storage medium | |
Dinkov et al. | Predicting the leading political ideology of YouTube channels using acoustic, textual, and metadata information | |
US11727915B1 (en) | Method and terminal for generating simulated voice of virtual teacher | |
Xu et al. | Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature | |
Singh et al. | An encoder-decoder based framework for hindi image caption generation | |
CN113761377A (en) | Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium | |
WO2023124647A1 (en) | Summary determination method and related device thereof | |
CN115169472A (en) | Music matching method and device for multimedia data and computer equipment | |
CN115830610A (en) | Multi-mode advertisement recognition method and system, electronic equipment and storage medium | |
CN116049557A (en) | Educational resource recommendation method based on multi-mode pre-training model | |
Zaoad et al. | An attention-based hybrid deep learning approach for bengali video captioning | |
Li et al. | An attention-based, context-aware multimodal fusion method for sarcasm detection using inter-modality inconsistency | |
CN113268592A (en) | Short text object emotion classification method based on multi-level interactive attention mechanism | |
US20240037941A1 (en) | Search results within segmented communication session content | |
Attai et al. | A survey on arabic image captioning systems using deep learning models | |
Elabora et al. | Evaluating citizens’ sentiments in smart cities: A deep learning approach | |
Zhang et al. | Recognition of emotions in user-generated videos through frame-level adaptation and emotion intensity learning | |
Gomes Jr et al. | Framework for knowledge discovery in educational video repositories | |
CN114676699A (en) | Entity emotion analysis method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |