CN111860237B - Video emotion fragment identification method and device

Video emotion fragment identification method and device

Info

Publication number
CN111860237B
Authority
CN
China
Prior art keywords
emotion
video
analyzed
bullet screen
fragments
Prior art date
Legal status
Active
Application number
CN202010645824.XA
Other languages
Chinese (zh)
Other versions
CN111860237A
Inventor
陈恩红
徐童
曹卫
张琨
吕广弈
何明
武晗
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC
Priority to CN202010645824.XA
Publication of CN111860237A
Application granted
Publication of CN111860237B
Legal status: Active
Anticipated expiration

Classifications

    • G06V 20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes (G Physics > G06 Computing; Calculating or Counting > G06V Image or Video Recognition or Understanding > G06V 20/00 Scenes; Scene-specific elements > G06V 20/40 in video content)
    • G06F 16/353: Information retrieval of unstructured textual data; clustering; classification into predefined classes (G06F Electric Digital Data Processing > G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor)
    • G06F 18/24: Pattern recognition; analysing; classification techniques
    • G06F 40/242: Handling natural language data; natural language analysis; lexical tools; dictionaries
    • G06V 20/40: Scenes; scene-specific elements in video content


Abstract

The invention discloses a method for identifying emotion fragments in a video, which comprises the following steps: determining the bullet-screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating a segment emotion vector and an emotion entropy for each video segment to be analyzed according to the bullet-screen emotion labels within that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and the emotion entropies. In this identification method, the video to be analyzed is divided into several video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified from the segment emotion vectors obtained from the bullet screens and from the emotion entropies. The identification cycle of emotion fragments is thereby shortened, which solves the problem that the long annotation time of manually labeled emotion tags makes the identification cycle of emotion fragments long.

Description

Video emotion fragment identification method and device
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for identifying video emotion fragments.
Background
With the development of multimedia technology, the volume of multimedia video data has grown explosively and attracted a large number of users. People increasingly watch videos to relieve stress and boredom, and watching videos has become a new way of meeting people's emotional needs. However, there is a contradiction between the huge scale of videos and the limited time of users: audiences often only want to watch some emotional segments of a video rather than the whole video. Therefore, videos need time-synchronized emotion labels (five emotion categories: happiness, surprise, dislike, sadness and fear) so that the emotion fragments in a video can be identified and the personalized emotional needs of audiences can be better met.
The first challenge of this work is that videos lack time-sequenced emotion labels. At present, emotion labels are mainly annotated manually for each frame of a video, and emotion fragments are identified based on the annotated emotion labels. Because manual annotation of emotion labels is time-consuming, the identification cycle of emotion fragments is long.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for identifying video emotion fragments, so as to solve the problem in the prior art that emotion labels are mainly annotated manually for each frame of a video and emotion fragments are identified based on the annotated labels, where the long annotation time of manual labeling leads to a long identification cycle of emotion fragments. The specific scheme is as follows:
a video emotion fragment identification method comprises the following steps:
determining the bullet screen emotion labels of all emotion bullet screens in a video to be analyzed;
segmenting the video to be analyzed to obtain each video segment to be analyzed;
calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
Optionally, the method for determining the bullet screen emotion tag of each emotion bullet screen in the video to be analyzed includes:
acquiring each bullet screen of the video to be analyzed;
screening the bullet screens to obtain all emotion bullet screens;
and determining the bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
Optionally, the method for determining the bullet screen emotion tag of each emotion bullet screen according to the preset neural network model includes:
determining a target semantic representation of each emotion bullet screen, wherein the target semantic representation is obtained by splicing a fine-granularity semantic representation and a coarse-granularity semantic representation of the corresponding emotion bullet screen;
determining visual vector representation of scene image data at each emotion bullet screen generation moment;
and transmitting the target semantic representation and the visual vector representation to the preset neural network model to obtain a bullet screen emotion label corresponding to the emotion bullet screen.
Optionally, in the method, the segmenting the video to be analyzed to obtain each video segment to be analyzed includes:
determining the visual semantics of each frame in the video to be analyzed;
sequentially comparing the visual semantics of the adjacent frames, and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and if so, segmenting the adjacent frames as segmentation points to obtain each video segment to be analyzed.
The above method, optionally, further includes:
acquiring the bullet screen semanteme of the adjacent frames;
and correcting the segmentation points according to the bullet screen semantics.
Optionally, in the method, identifying the emotion fragment in each video fragment to be analyzed according to the fragment emotion vector and the emotion entropy includes:
judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
if yes, determining that the current video segment to be analyzed contains one emotion; or,
if not, judging whether the ratio of the largest component to the second-largest component of the current segment emotion vector is greater than a preset proportion threshold;
if yes, determining that the current video segment to be analyzed contains one emotion, or otherwise, determining that the current video segment to be analyzed contains two emotions.
An apparatus for identifying emotion fragments in a video, comprising:
the determining module is used for determining the barrage emotion tags of all emotion barrages in the video to be analyzed;
the segmentation module is used for segmenting the video to be analyzed to obtain each video segment to be analyzed;
the computing module is used for computing the segment emotion vectors and the emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and the identification module is used for identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
The above apparatus, optionally, the determining module includes:
the acquisition unit is used for acquiring each bullet screen of the video to be analyzed;
the screening unit is used for screening the bullet screens to obtain all emotion bullet screens;
and the label determining unit is used for determining the bullet screen emotion labels of each emotion bullet screen according to the preset neural network model.
The above apparatus, optionally, the dividing module includes:
the semantic determining unit is used for determining the visual semantics of each frame in the video to be analyzed;
the first judgment unit is used for sequentially comparing the visual semantics of the adjacent frames and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and the segmentation unit is used for, if the result is yes, segmenting the video at the adjacent frames serving as segmentation points to obtain each video segment to be analyzed.
The above apparatus, optionally, the identification module includes:
the second judgment unit is used for judging whether the emotion entropy is smaller than a preset emotion entropy threshold value;
the first judging unit is used for judging, if the result is yes, that the current video segment to be analyzed contains one emotion;
the third judging unit is used for judging, if the result is no, whether the ratio of the largest component to the second-largest component of the current segment emotion vector is greater than a preset proportion threshold;
and the second judging unit is used for judging that the current video segment to be analyzed contains one emotion if the ratio is greater than the threshold, or that the current video segment to be analyzed contains two emotions otherwise.
Compared with the prior art, the invention has the following advantages:
the invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining a bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain each video segment to be analyzed; calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed; and identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy. In the identification method, a video to be analyzed is divided into a plurality of video segments to be analyzed, and segment emotion vectors and emotion entropies of the video segments to be analyzed are calculated; and recognizing the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors of the fragments obtained by the barrage and the emotion entropy, so that the recognition period of the emotion fragments is shortened, and the problem of long recognition period of the emotion fragments caused by long marking time of the emotion labels manually identified is solved.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart of a method for identifying video emotion fragments disclosed in an embodiment of the present application;
fig. 2 is a block diagram of a structure of an apparatus for recognizing video emotion fragments according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a method and a device for identifying video emotion fragments, which are applied to the process of identifying the emotion fragments in a video. In the prior art, emotion labels are annotated manually and the emotion fragments in a video are identified based on those labels; because manual annotation is time-consuming, the identification cycle is long. To solve this problem, the embodiment of the invention provides a video emotion fragment identification method. Many video sharing platforms widely support time-synchronized comments known as "bullet screens" (danmaku). These comments are the immediate reactions of the audience while watching, contain rich emotional expressions, and are consistent with the emotional development of the video, so they can be used for video emotion analysis. The identification method therefore analyzes the video to be analyzed based on its bullet screens. The execution flow of the identification method is shown in FIG. 1 and comprises the following steps:
s101, determining a bullet screen emotion label of each emotion bullet screen in a video to be analyzed;
In the embodiment of the invention, each bullet screen of the video to be analyzed is obtained. Because bullet screens are the instant reactions of video users, not all viewers participate in the bullet-screen interaction throughout the whole video, and check-in or sign-in bullet screens unrelated to the video semantics and video emotion often appear; the topics of the bullet screens are therefore loose and their semantic noise is high. Preferably, denoising is performed first.
Further, an emotion label of each emotion barrage is determined according to a preset neural network model, wherein the preset neural network model is a barrage emotion model, the barrage emotion model needs to be trained in advance, and the training process of the barrage emotion model is as follows:
Firstly, a training data set is constructed: a bullet-screen emotion data set C_e with emotion labels is built from the bullet-screen data set C without emotion labels, and the bullet-screen emotion model is trained on C_e. Considering the high cost of manual annotation, the emotion labels of the bullet-screen emotion data set C_e are obtained through a two-stage word-matching method, whose basic idea rests on the fact that emotional expression is very common in bullet screens. Bullet screens contain rich emotional expressions, and bullet screens with explicit emotional expressions can be automatically recognized through a two-stage emotion dictionary matching method. In the first stage, emotion polarity recognition is performed on all video bullet screens with a comprehensive emotion polarity dictionary that integrates a general emotion dictionary and a bullet-screen emotion dictionary, and the bullet screens that contain explicit emotional expressions and whose positive or negative polarity can be recognized are selected. In the second stage, fine-grained emotion recognition (five emotion categories: happiness, surprise, dislike, sadness and fear) is performed with a fine-grained emotion dictionary on the bullet screens whose polarity was obtained in the first stage, and the emotion bullet screens with emotion labels are finally obtained through the two-stage dictionary matching. The bullet-screen data set C and the emotion bullet-screen data set C_e are expressed as follows:
C = {(C_1, T_1, I_1), ..., (C_i, T_i, I_i), ..., (C_N, T_N, I_N)}   (1)
C_e = {(C_1^e, T_1^e, I_1^e, E_1^e), ..., (C_j^e, T_j^e, I_j^e, E_j^e), ..., (C_M^e, T_M^e, I_M^e, E_M^e)}   (2)
wherein any element (C_i, T_i, I_i) of the bullet-screen data set C denotes the bullet screen C_i posted at time T_i together with the scene image data I_i of the corresponding video key frame. Any element (C_j^e, T_j^e, I_j^e, E_j^e) of the emotion bullet-screen data set C_e denotes the bullet screen C_j^e posted at time T_j^e, the scene image data I_j^e of the corresponding video key frame, and the five-class emotion label E_j^e of the bullet screen C_j^e. N and M denote the number of bullet-screen sentences and the number of emotion bullet-screen sentences, respectively.
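For concreteness, the following sketch illustrates the two-stage dictionary matching described above. It is a minimal illustration only: the dictionary entries and helper names (POLARITY_DICT, FINE_GRAINED_DICT) are assumptions, while the five emotion categories (happiness, surprise, dislike, sadness and fear) follow the text.

```python
# A minimal sketch of the two-stage dictionary matching used to build the emotion
# bullet-screen data set C_e. The dictionary entries and helper names below are
# illustrative assumptions; the five emotion categories follow the text.

POLARITY_DICT = {"开心": "pos", "难过": "neg"}                 # assumed general + bullet-screen polarity words
FINE_GRAINED_DICT = {"开心": "happiness", "难过": "sadness"}   # assumed fine-grained emotion words

def two_stage_label(danmaku: str):
    """Return one of the five emotion labels for a bullet screen, or None."""
    # Stage 1: keep only bullet screens containing an explicit positive/negative polarity word.
    if not any(word in danmaku for word in POLARITY_DICT):
        return None
    # Stage 2: map the polarity-bearing bullet screen to one of the five emotion categories.
    for word, emotion in FINE_GRAINED_DICT.items():
        if word in danmaku:
            return emotion
    return None

# Bullet screens without an explicit emotion (e.g. check-in comments) are filtered out.
labeled = [(c, two_stage_label(c)) for c in ["今天好开心", "签到"]]
emotion_danmaku = [(c, e) for c, e in labeled if e is not None]
```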
In the embodiment of the invention, based on the emotion bullet screen data set C e Training the barrage emotion model, wherein the input of the barrage emotion model is an emotion barrage data set C e Of which any one element is
Figure BDA0002572985120000068
Namely:
Figure BDA0002572985120000069
bullet screen text corresponding to moment
Figure BDA00025729851200000610
And emotion tag text
Figure BDA00025729851200000611
And visual data information of the video key frame at that time
Figure BDA00025729851200000612
The characterization process of the input data is as follows: obtaining bullet screen text by using pre-training language model Bert
Figure BDA00025729851200000613
Sentence vector characterization of
Figure BDA00025729851200000614
Sum word vector characterization
Figure BDA00025729851200000615
Barrage emotion label obtained by using pre-training language model Bert
Figure BDA00025729851200000616
Sentence vector characterization of
Figure BDA00025729851200000617
Processing visual image information of video key frame by using existing deep network model VGG
Figure BDA00025729851200000618
Extracting the last convolution layer of the VGG model as the result
Figure BDA00025729851200000619
Vector characterization of
Figure BDA00025729851200000620
The correlation formula is as follows:
Figure BDA00025729851200000621
Figure BDA0002572985120000071
Figure BDA0002572985120000072
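One possible realization of formulas (3)-(5) is sketched below with off-the-shelf models. The concrete model choices (bert-base-chinese, VGG16 pretrained on ImageNet) and the way the final convolutional feature map is flattened into region vectors are assumptions; the text only states that BERT provides the sentence and word vectors and that the last convolutional layer of VGG provides the visual representation.

```python
# One possible realization of formulas (3)-(5): BERT sentence/word vectors for the
# bullet-screen text and its emotion-label text, and the feature map after the last
# convolutional block of VGG for the key-frame image. The model names and the region
# flattening are assumptions.
import torch
from PIL import Image
from torchvision import models, transforms
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
bert = BertModel.from_pretrained("bert-base-chinese").eval()
vgg_conv = models.vgg16(weights="IMAGENET1K_V1").features.eval()  # convolutional part of VGG16

def encode_text(sentence: str):
    """Return (sentence vector s, word vectors W) for a piece of text, as in formulas (3)/(4)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    return out.pooler_output[0], out.last_hidden_state[0]

def encode_frame(path: str):
    """Return region vectors v of the key frame, as in formula (5)."""
    pre = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
    img = pre(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        fmap = vgg_conv(img)             # (1, 512, 7, 7) after the final convolutional block
    return fmap.flatten(2).squeeze(0).T  # 49 region vectors of dimension 512
```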
Considering that the text semantics of a bullet screen are related to the semantics of the video scene at the corresponding moment, in the embodiment of the invention the scene visual information v_j^e of the bullet screen is fused into the word vectors W_j^e of the bullet-screen text through an attention mechanism, which helps the model focus on the words related to the visual scene of the bullet screen and yields the bullet-screen word vectors Ŵ_j^e with visual attention. The relevant formulas of the attention mechanism are as follows:
M = tanh(W_1 v_j^e + W_2 W_j^e)   (6)
α = softmax(W_3 M)   (7)
Ŵ_j^e = α ⊙ W_j^e   (8)
wherein W_1, W_2 and W_3 are the training parameters of the attention unit, which may be set empirically or according to the specific situation; tanh denotes the activation function of the deep neural network; M is an intermediate quantity; softmax denotes the normalization operation; and α denotes the visual attention weight of each word of the bullet-screen text with respect to the visual information. The visual attention weight α is applied to the word vectors W_j^e of the bullet-screen text to obtain the bullet-screen word vectors Ŵ_j^e with visual attention.
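A compact sketch of the visual attention of formulas (6)-(8) is given below; the hidden sizes are illustrative assumptions, and the module simply scores each word of the bullet screen against the pooled key-frame vector and reweights the word vectors accordingly.

```python
# A sketch of the visual attention of formulas (6)-(8): the key-frame vector scores each
# word of the bullet screen, and the resulting weights reweight the word vectors. The
# hidden sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    def __init__(self, word_dim=768, vis_dim=512, att_dim=256):
        super().__init__()
        self.W1 = nn.Linear(vis_dim, att_dim, bias=False)   # W_1 of formula (6)
        self.W2 = nn.Linear(word_dim, att_dim, bias=False)  # W_2 of formula (6)
        self.W3 = nn.Linear(att_dim, 1, bias=False)         # W_3 of formula (7)

    def forward(self, words, visual):
        # words: (seq_len, word_dim) bullet-screen word vectors; visual: (vis_dim,) pooled key-frame vector
        M = torch.tanh(self.W1(visual).unsqueeze(0) + self.W2(words))  # formula (6)
        alpha = torch.softmax(self.W3(M).squeeze(-1), dim=0)           # formula (7)
        return alpha.unsqueeze(-1) * words                             # formula (8): visually attended words
```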
Considering the sequence information of the words contained in the sentence text, the invention models the bullet-screen word vectors Ŵ_j^e fused with visual information by means of the recurrent neural network BiLSTM and a self-attention mechanism, and obtains the fine-grained semantic representation h_j^e of the bullet screen:
h_j^e = SelfAttention(BiLSTM(Ŵ_j^e))   (9)
As shown in formula (3), the bullet-screen sentence vector s_j^e obtained from the BERT model is the coarse-grained semantic representation of the whole bullet-screen sentence. The coarse-grained representation s_j^e of the whole sentence and the fine-grained representation h_j^e of the bullet screen are spliced with a weight to obtain the target semantic representation r_j^e of the bullet screen, see the following formula:
r_j^e = γ · h_j^e + s_j^e   (10)
wherein γ is a weight adjustment parameter, and the sign '+' denotes the splicing (concatenation) operation of tensors.
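The fine-grained encoding of formulas (9)-(10) can be sketched as follows. The BiLSTM hidden size, the form of the self-attention pooling, and the default value of γ are assumptions; the output concatenates the weighted fine-grained representation with the BERT sentence vector, matching the splicing described above.

```python
# A sketch of formulas (9)-(10): a BiLSTM with self-attention pooling over the visually
# attended word vectors gives the fine-grained representation, which is concatenated
# (with weight gamma) to the BERT sentence vector. Hidden size, pooling form and the
# default gamma are assumptions.
import torch
import torch.nn as nn

class FineGrainedEncoder(nn.Module):
    def __init__(self, word_dim=768, hidden=256, gamma=0.5):
        super().__init__()
        self.bilstm = nn.LSTM(word_dim, hidden, bidirectional=True)
        self.att = nn.Linear(2 * hidden, 1)   # self-attention scoring
        self.gamma = gamma

    def forward(self, attended_words, sentence_vec):
        # attended_words: (seq_len, word_dim); sentence_vec: (768,) BERT sentence vector
        h, _ = self.bilstm(attended_words.unsqueeze(1))        # (seq_len, 1, 2*hidden)
        h = h.squeeze(1)
        weights = torch.softmax(self.att(h).squeeze(-1), dim=0)
        fine = (weights.unsqueeze(-1) * h).sum(dim=0)          # fine-grained representation, formula (9)
        return torch.cat([self.gamma * fine, sentence_vec])    # target semantic representation, formula (10)
```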
Subsequently, the target semantic representation r_j^e is trained and output through the fully connected layer FC, and the emotion probability P of the bullet screen is obtained:
P(y | C_j^e, I_j^e) = softmax(FC(r_j^e))   (11)
wherein y denotes the emotion category to which the bullet screen belongs, and P(y | C_j^e, I_j^e) denotes the emotion category probability of the bullet screen C_j^e computed from the input (C_j^e, I_j^e). FC is a single-layer fully connected network structure; through the output of the fully connected layer, the emotion probability P of each emotion bullet screen is obtained, and the emotion model is trained by minimizing the following objective function:
L = Σ_j softmax_cross_entropy(E_j^e, P(y | C_j^e, I_j^e))   (12)
wherein E_j^e is the original emotion label of the emotion bullet screen C_j^e, and P(y | C_j^e, I_j^e) is the emotion probability output by the model for the emotion bullet screen C_j^e after training; softmax_cross_entropy is the cross-entropy loss function, which computes the cross-entropy loss between the original emotion label E_j^e of each bullet screen and its emotion prediction result. To minimize the objective function, an Adam optimizer is adopted to iteratively update each parameter of the model (implemented with TensorFlow automatic differentiation), thereby training the bullet-screen emotion recognition model.
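A minimal training-step sketch for formulas (11)-(12) follows. The feature dimension, batch handling and learning rate are assumptions, and although the text mentions TensorFlow automatic differentiation with Adam, an equivalent PyTorch step is shown here for consistency with the other sketches.

```python
# A minimal training-step sketch for formulas (11)-(12): a single fully connected layer maps
# the target semantic representation to five emotion logits, and the cross-entropy against the
# dictionary-derived labels is minimized with Adam. The feature dimension and learning rate
# are assumptions; PyTorch autograd stands in for the TensorFlow automatic differentiation
# mentioned in the text.
import torch
import torch.nn as nn

fc = nn.Linear(2 * 256 + 768, 5)                   # FC over the target semantic representation
optimizer = torch.optim.Adam(fc.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()                  # softmax cross-entropy of formula (12)

def train_step(features, labels):
    """features: (batch, dim) target semantic representations; labels: (batch,) in {0..4}."""
    optimizer.zero_grad()
    logits = fc(features)                          # formula (11) before the softmax
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```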
Finally, the trained bullet-screen emotion recognition model performs emotion prediction on any bullet screen C_k: (C_k, T_k, I_k) in the bullet-screen data set C and outputs P(y | C_k, I_k), from which the emotion probability vector e_k is further obtained:
e_k = softmax(P(y | C_k, I_k))   (13)
e_k = (e_k^1, e_k^2, e_k^3, e_k^4, e_k^5)   (14)
wherein P(y | C_k, I_k) is the model prediction result for the bullet screen C_k, and softmax computes the proportion of each class among the multiple classes so that the proportions sum to 1. In the invention, softmax further processes the prediction result of the bullet screen C_k to obtain the emotion probability vector e_k of the bullet screen C_k. It is a five-dimensional emotion vector; its value in each dimension can be regarded as the emotion semantic distribution of the bullet screen C_k in that dimension, measures the emotion semantic value of the bullet screen C_k in each emotion dimension, and also represents the emotion label of the bullet screen C_k.
S102, segmenting the video to be analyzed to obtain video segments to be analyzed;
In the embodiment of the invention, because bullet-screen comments are the instant responses of the audience, the emotions they contain are always momentary. Therefore, video emotion analysis based on one continuous period is most suitable. In fact, a video contains many relatively independent scene segments; the content of these segments is usually relatively independent and has its own topics, which evolve with the development of the video scenes. That is, a change of the video plot is generally consistent with a switch of the video scene, so changes of the video scene can be used as the basis for segmenting video clips. Compared with conventional equal-length video clip segmentation, segmentation from the viewpoint of scene switching is more suitable for this application.
First, object recognition is performed on the visual data information I_i of each video key frame by using an object recognition method based on bottom-up and top-down attention, and the visual words O_i of each frame are obtained. The visual words O_i can be regarded as the visual semantics of the frame I_i, describing the visual scene of the frame I_i. When the visual-word text of two adjacent frames changes remarkably, it means that the described scene has changed, and that moment can be used as a video segmentation point.
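A sketch of this scene-switch segmentation is given below: each key frame is reduced to its set of visual words, and a large change between adjacent frames is taken as a candidate segmentation point. The Jaccard-distance criterion and the threshold value are assumptions standing in for the unspecified "difference degree" of the text.

```python
# A sketch of the scene-switch segmentation: each key frame is reduced to its set of visual
# words, and a large change between adjacent frames marks a candidate segmentation point. The
# Jaccard-distance criterion and the threshold stand in for the unspecified "difference degree".
def segment_by_scene(visual_words, diff_threshold=0.7):
    """visual_words: list of sets of object labels, one per key frame; returns split indices."""
    splits = []
    for i in range(1, len(visual_words)):
        a, b = visual_words[i - 1], visual_words[i]
        union = a | b
        diff = (1 - len(a & b) / len(union)) if union else 0.0
        if diff > diff_threshold:          # adjacent frames describe clearly different scenes
            splits.append(i)
    return splits

splits = segment_by_scene([{"man", "car"}, {"man", "car"}, {"beach", "sea"}])  # -> [2]
```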
Furthermore, in order to improve the segmentation accuracy, the invention also corrects the segmentation points from the perspective of video semantics. This operation is implemented by means of the bullet screens, which can reflect the video semantics: a segmentation point serves as a plot transition point of the video, so the bullet-screen semantics at that moment are relatively loose; if the bullet-screen semantics at that moment are concentrated and consistent, the segmentation point is corrected. That is, for any video segment S_i obtained in the previous stage, the pairwise cosine similarities of all its bullet screens are computed to construct the semantic similarity matrix of segment S_i, from which the average bullet-screen semantic similarity of the video segment S_i is obtained. The average bullet-screen semantic similarity of each video segment S_i is examined, video segments with a very high average semantic similarity are discarded (judged with an empirical threshold determined through practical experiments), and a set of video segments to be analyzed {s_p} with relatively independent and natural plots is finally obtained.
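The correction step can be sketched as follows: the pairwise cosine similarities of a segment's bullet-screen sentence vectors form its similarity matrix, the mean of the upper triangle gives the segment's average semantic similarity, and segments whose average exceeds an empirical threshold are discarded. The key name danmaku_vectors and the threshold value are assumptions.

```python
# A sketch of the correction step: cosine-similarity matrix of a segment's bullet-screen
# sentence vectors, upper-triangle average, and a filter that discards segments whose bullet
# screens are semantically too concentrated. Key name and threshold are assumptions.
import numpy as np

def average_similarity(sentence_vectors):
    X = np.asarray(sentence_vectors, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = X @ X.T                                   # pairwise cosine similarity matrix
    upper = sim[np.triu_indices(len(X), k=1)]       # pairwise values above the diagonal
    return float(upper.mean()) if upper.size else 0.0

def filter_segments(segments, threshold=0.9):
    """Discard candidate segments whose bullet screens are semantically too concentrated."""
    return [s for s in segments if average_similarity(s["danmaku_vectors"]) <= threshold]
```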
S103, calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
In the embodiment of the invention, the video to be analyzed contains complex multi-modal content and its emotion is complex; the emotion bullet screens of the video audience can be regarded as an indirect reflection of the video emotion and are therefore suitable for video emotion analysis. For any segment s_i of the set of video segments to be analyzed {s_p}, the set of emotion bullet screens of segment s_i is {C_1^e, ..., C_u^e}, the set of emotion vectors corresponding to these emotion bullet screens is {e_1, ..., e_u}, and the emotion vector of each bullet screen C_k^e is e_k. Summing the emotion vectors of all bullet screens of the segment dimension by dimension yields the segment emotion vector e_{s_i} of segment s_i, as shown in the following formula:
e_{s_i} = Σ_{k=1}^{u} e_k   (15)
wherein u is the number of emotion bullet screens in segment s_i, and the vector e_{s_i} is the segment emotion vector of segment s_i, representing the emotion label of segment s_i in each emotion dimension.
In information theory, entropy is the quantity that describes the disorder of a system: the greater the entropy, the more disordered the system and the less information it carries; the smaller the entropy, the more ordered the system and the more information it carries. In the segment emotion vector e_{s_i}, the concentration degree of the distribution of the emotion semantic information over the emotion dimensions can likewise be measured by entropy and then used to judge the emotional tendency of the video segment to be analyzed s_i (also referred to in the invention as segment s_i). According to the entropy formula, the emotion entropy of segment s_i is obtained as shown in the following formula, where p_d is the proportion of the d-th dimension of e_{s_i}:
H(e_{s_i}) = - Σ_{d=1}^{5} p_d log p_d,   p_d = e_{s_i}^d / Σ_{d'=1}^{5} e_{s_i}^{d'}   (16)
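Formulas (15)-(16) reduce to a few lines of code, sketched below; normalizing the summed vector before taking the entropy is the assumption made explicit in formula (16).

```python
# Formulas (15)-(16) in a few lines: the segment emotion vector is the per-dimension sum of
# the emotion probability vectors of the segment's bullet screens, and the emotion entropy
# measures how concentrated that distribution is (the vector is normalized before the
# entropy is taken).
import numpy as np

def segment_emotion(bullet_vectors):
    """bullet_vectors: (u, 5) emotion probability vectors of the segment's emotion bullet screens."""
    e_s = np.asarray(bullet_vectors, dtype=float).sum(axis=0)   # segment emotion vector, formula (15)
    p = e_s / e_s.sum()                                         # normalize to a distribution
    entropy = float(-np.sum(p * np.log(p + 1e-12)))             # emotion entropy, formula (16)
    return e_s, entropy
```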
S104, identifying the emotion fragments in the video segments to be analyzed according to the segment emotion vectors and the emotion entropies.
In the embodiment of the invention, because video emotion is complex, the emotional tendencies of emotion video segments do not always fall into a single emotion category, and video segments with complex emotion are common. The present application aims to find video segments containing no more than two distinct emotional tendencies: one kind is a video emotion segment with only one obvious emotional tendency; the other is a video emotion segment with two obvious emotional tendencies.
The processing of a video emotion segment with a single emotional tendency is as follows: when the emotion entropy H(e_{s_i}) of the video segment to be analyzed s_i is very small, i.e., smaller than the emotion entropy threshold H(e)_threshold, the emotion semantics of all dimensions of the emotion bullet screens of segment s_i tend to be consistent, which means that the video segment to be analyzed s_i contains only one obvious emotional tendency:
H(e_{s_i}) < H(e)_threshold   (17)
On this basis, when the emotion entropy H(e_{s_i}) of segment s_i is only slightly above the threshold H(e)_threshold, the emotional tendency of segment s_i is not necessarily unique, and a further judgment is needed: in the segment emotion vector e_{s_i} of segment s_i, when the largest component max1(e_{s_i}) is far greater than the second-largest component max2(e_{s_i}), segment s_i still has only one emotional tendency, namely the emotion category of the dimension to which the largest component max1(e_{s_i}) belongs, as shown in the following formula:
max1(e_{s_i}) / max2(e_{s_i}) > ratio_threshold   (18)
The processing of a video emotion segment to be analyzed containing two emotional tendencies is as follows: as shown in formula (18), when the emotion entropy H(e_{s_i}) of the video emotion segment to be analyzed s_i is only slightly above the threshold H(e)_threshold, the emotional tendency of the video emotion segment to be analyzed s_i is not necessarily unique. In the emotion vector e_{s_i} of the video emotion segment to be analyzed s_i, when the difference between the largest component max1(e_{s_i}) and the second-largest component max2(e_{s_i}) is small, the emotion categories of the dimensions to which these two components belong can both be regarded as emotional tendencies of the video emotion segment to be analyzed s_i, i.e., the video emotion segment to be analyzed s_i has two main emotional tendencies:
max1(e_{s_i}) / max2(e_{s_i}) ≤ ratio_threshold   (19)
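The decision rules of formulas (17)-(19) can be sketched as follows. The dimension order of the emotion vector and the two threshold values are assumptions; the text leaves the thresholds to be set empirically.

```python
# A sketch of the decision rules of formulas (17)-(19): entropy below the threshold means one
# emotion; otherwise the segment has one emotion if the largest component dominates the
# second-largest, and two emotions otherwise. The dimension order and both threshold values
# are assumptions.
import numpy as np

EMOTIONS = ["happiness", "surprise", "dislike", "sadness", "fear"]   # assumed dimension order

def identify(e_s, entropy, entropy_threshold=0.8, ratio_threshold=2.0):
    e_s = np.asarray(e_s, dtype=float)
    top = np.argsort(e_s)[::-1]
    if entropy < entropy_threshold:
        return [EMOTIONS[top[0]]]                                    # formula (17): one clear emotion
    if e_s[top[0]] / max(e_s[top[1]], 1e-12) > ratio_threshold:
        return [EMOTIONS[top[0]]]                                    # formula (18): still one emotion
    return [EMOTIONS[top[0]], EMOTIONS[top[1]]]                      # formula (19): two emotions
```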
Considering that the topics of bullet screens are loose and their semantic noise is high, the invention also performs noise reduction from the semantic perspective by using the text vectors of the emotion bullet screens of each video segment to be analyzed s_i. The emotion semantic similarity matrix of segment s_i is a symmetric matrix, each value of which represents the pairwise semantic relevance of the emotion bullet screens. The upper triangular part of the matrix is analyzed: if the semantic similarity of two emotion bullet screens is lower than the in-segment bullet-screen semantic similarity threshold (the semantic similarity threshold is determined according to actual experiments), they are regarded as semantic distortion points and the corresponding emotion bullet screens are deleted. Through this operation, the embodiment of the present invention achieves better robustness.
The invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining the bullet-screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating a segment emotion vector and an emotion entropy for each video segment to be analyzed according to the bullet-screen emotion labels within that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and the emotion entropies. In this identification method, the video to be analyzed is divided into several video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified from the segment emotion vectors obtained from the bullet screens and from the emotion entropies. The identification cycle of emotion fragments is thereby shortened, which solves the problem that the long annotation time of manually labeled emotion tags makes the identification cycle of emotion fragments long.
Based on the identification method, for the emotionally rich bullet screens in a video, the attention mechanism and the idea of multi-modal fusion are applied to fuse the bullet-screen emotion semantics with the visual information of the video scene along the time sequence, thereby obtaining an enhanced representation of the bullet-screen emotion semantics. The representation that integrates text and visual information is then used to judge the emotion of the video segments obtained by scene-switching-based segmentation, accurately identify the emotion fragments in the video, and make up for the fact that existing video emotion understanding does not identify video emotion fragments.
Based on the foregoing identification method for video emotion fragments, an embodiment of the present invention further provides an identification apparatus for video emotion fragments, where a structural block diagram of the identification apparatus is shown in fig. 2, and the identification apparatus includes:
a determination module 201, a segmentation module 202, a calculation module 203 and a recognition module 204.
Wherein:
the determining module 201 is configured to determine a bullet screen emotion tag of each emotion bullet screen in a video to be analyzed;
the segmentation module 202 is configured to segment the video to be analyzed to obtain each video segment to be analyzed;
the calculating module 203 is configured to calculate segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion tags in each video segment to be analyzed;
the identifying module 204 is configured to identify an emotion fragment in each to-be-analyzed video fragment according to the fragment emotion vector and the emotion entropy.
The invention discloses a device for identifying video emotion fragments. The device determines the bullet-screen emotion label of each emotion bullet screen in a video to be analyzed; segments the video to be analyzed to obtain video segments to be analyzed; calculates a segment emotion vector and an emotion entropy for each video segment to be analyzed according to the bullet-screen emotion labels within that segment; and identifies the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and the emotion entropies. In this identification device, the video to be analyzed is divided into several video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified from the segment emotion vectors obtained from the bullet screens and from the emotion entropies. The identification cycle of emotion fragments is thereby shortened, which solves the problem that the long annotation time of manually labeled emotion tags makes the identification cycle of emotion fragments long.
In this embodiment of the present invention, the determining module 201 includes:
an acquisition unit 205, a screening unit 206 and a tag determination unit 207.
Wherein:
the obtaining unit 205 is configured to obtain each barrage of the video to be analyzed;
the screening unit 206 is configured to screen the bullet screens to obtain emotion bullet screens;
the label determining unit 207 is configured to determine a bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
In this embodiment of the present invention, the dividing module 202 includes:
a semantic determination unit 208, a first judgment unit 209, and a slicing unit 210.
Wherein:
the semantic determining unit 208 is configured to determine visual semantics of each frame in the video to be analyzed;
the first judging unit 209 is configured to sequentially compare the visual semantics of the adjacent frames, and judge whether a difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold;
and the segmentation unit 210 is configured to, if the result is yes, segment the video at the adjacent frames as segmentation points, so as to obtain each video segment to be analyzed.
In an embodiment of the present invention, the identifying module 204 includes:
a second determination unit 211, a first determination unit 212, a third determination unit 213, and a second determination unit 214.
Wherein:
the second determining unit 211 is configured to determine whether the emotion entropy is smaller than a preset emotion entropy threshold;
the first determining unit 212 is configured to determine, if the result is yes, that the current video segment to be analyzed contains one emotion;
the third determining unit 213 is configured to determine, if the result is no, whether the ratio of the largest component to the second-largest component of the current segment emotion vector is greater than a preset proportion threshold;
the second determining unit 214 is configured to determine that the current video segment to be analyzed contains one emotion if that ratio is greater than the threshold, or that the current video segment to be analyzed contains two emotions otherwise.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in one or more of software and/or hardware in implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The method and the device for identifying the video emotion fragments provided by the invention are described in detail, a specific example is applied in the text to explain the principle and the implementation mode of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (8)

1. A method for identifying video emotion fragments is characterized by comprising the following steps:
determining the bullet screen emotion labels of all emotion bullet screens in a video to be analyzed;
segmenting the video to be analyzed to obtain each video segment to be analyzed;
calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy;
identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy, wherein the identification comprises the following steps:
judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
if yes, determining that the current video segment to be analyzed contains one emotion; or,
if not, judging whether the ratio of the largest component to the second-largest component of the current segment emotion vector is greater than a preset proportion threshold;
if yes, determining that the current video segment to be analyzed contains one emotion, or otherwise, determining that the current video segment to be analyzed contains two emotions.
2. The method of claim 1, wherein determining the bullet screen emotion label for each emotion bullet screen in the video to be analyzed comprises:
acquiring each bullet screen of the video to be analyzed;
screening the bullet screens to obtain all emotion bullet screens;
and determining the bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
3. The method of claim 2, wherein determining the bullet screen sentiment label of each sentiment bullet screen according to a preset neural network model comprises:
determining a target semantic representation of each emotion bullet screen, wherein the target semantic representation is obtained by splicing the fine-granularity semantic representation and the coarse-granularity semantic representation of the corresponding emotion bullet screen;
determining visual vector representation of scene image data at each emotion bullet screen generation moment;
and transmitting the target semantic representation and the visual vector representation to the preset barrage emotion recognition neural network model to obtain an emotion label corresponding to the emotion barrage.
4. The method of claim 1, wherein segmenting the video to be analyzed to obtain each video segment to be analyzed comprises:
determining the visual semantics of each frame in the video to be analyzed;
sequentially comparing the visual semantics of the adjacent frames, and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and if so, segmenting the adjacent frames as segmentation points to obtain each video segment to be analyzed.
5. The method of claim 4, further comprising:
acquiring the bullet screen semanteme of the adjacent frames;
and correcting the segmentation points according to the bullet screen semantics.
6. An apparatus for identifying emotion fragments in a video, comprising:
the determining module is used for determining the bullet screen emotion labels of all emotion bullet screens in the video to be analyzed;
the segmentation module is used for segmenting the video to be analyzed to obtain each video segment to be analyzed;
the computing module is used for computing the segment emotion vectors and the emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
the identification module is used for identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy;
wherein the identification module comprises:
the second judgment unit is used for judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
the first judging unit is used for judging, if the result is yes, that the current video segment to be analyzed contains one emotion;
the third judging unit is used for judging, if the result is no, whether the ratio of the largest component to the second-largest component of the current segment emotion vector is greater than a preset proportion threshold;
and the second judging unit is used for judging that the current video segment to be analyzed contains one emotion if the ratio is greater than the threshold, or that the current video segment to be analyzed contains two emotions otherwise.
7. The apparatus of claim 6, wherein the determining module comprises:
the acquisition unit is used for acquiring each bullet screen of the video to be analyzed;
the screening unit is used for screening each bullet screen to obtain each emotion bullet screen;
and the label determining unit is used for determining the bullet screen emotion labels of each emotion bullet screen according to the preset neural network model.
8. The apparatus of claim 6, wherein the segmentation module comprises:
the semantic determining unit is used for determining the visual semantics of each frame in the video to be analyzed;
the first judgment unit is used for sequentially comparing the visual semantics of the adjacent frames and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and the segmentation unit is used for, if the result is yes, segmenting the video at the adjacent frames serving as segmentation points to obtain each video segment to be analyzed.
CN202010645824.XA 2020-07-07 2020-07-07 Video emotion fragment identification method and device Active CN111860237B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010645824.XA CN111860237B (en) 2020-07-07 2020-07-07 Video emotion fragment identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010645824.XA CN111860237B (en) 2020-07-07 2020-07-07 Video emotion fragment identification method and device

Publications (2)

Publication Number Publication Date
CN111860237A CN111860237A (en) 2020-10-30
CN111860237B true CN111860237B (en) 2022-09-06

Family

ID=73153438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010645824.XA Active CN111860237B (en) 2020-07-07 2020-07-07 Video emotion fragment identification method and device

Country Status (1)

Country Link
CN (1) CN111860237B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112699831B (en) * 2021-01-07 2022-04-01 重庆邮电大学 Video hotspot segment detection method and device based on barrage emotion and storage medium
CN113221689B (en) * 2021-04-27 2022-07-29 苏州工业职业技术学院 Video multi-target emotion degree prediction method
CN114339375B (en) * 2021-08-17 2024-04-02 腾讯科技(深圳)有限公司 Video playing method, method for generating video catalogue and related products
CN113656643B (en) * 2021-08-20 2024-05-03 珠海九松科技有限公司 Method for analyzing film viewing mood by using AI


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108513176A (en) * 2017-12-06 2018-09-07 北京邮电大学 A kind of socialization video subject extraction system and method based on topic model
CN108537139A (en) * 2018-03-20 2018-09-14 校宝在线(杭州)科技股份有限公司 A kind of Online Video wonderful analysis method based on barrage information
CN108737859A (en) * 2018-05-07 2018-11-02 华东师范大学 Video recommendation method based on barrage and device
CN109862397A (en) * 2019-02-02 2019-06-07 广州虎牙信息科技有限公司 A kind of video analysis method, apparatus, equipment and storage medium
CN110020437A (en) * 2019-04-11 2019-07-16 江南大学 The sentiment analysis and method for visualizing that a kind of video and barrage combine
CN110198482A (en) * 2019-04-11 2019-09-03 华东理工大学 A kind of video emphasis bridge section mask method, terminal and storage medium
CN110113659A (en) * 2019-04-19 2019-08-09 北京大米科技有限公司 Generate method, apparatus, electronic equipment and the medium of video
CN110263215A (en) * 2019-05-09 2019-09-20 众安信息技术服务有限公司 A kind of video feeling localization method and system
CN110569354A (en) * 2019-07-22 2019-12-13 中国农业大学 Barrage emotion analysis method and device
CN110852360A (en) * 2019-10-30 2020-02-28 腾讯科技(深圳)有限公司 Image emotion recognition method, device, equipment and storage medium
CN111163366A (en) * 2019-12-30 2020-05-15 厦门市美亚柏科信息股份有限公司 Video processing method and terminal

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chenchen Li et al., "Visual-Texual Emotion Analysis With Deep Coupled Video and Danmu Neural Networks", IEEE Transactions on Multimedia, vol. 22, no. 6, pp. 1634-1646, June 2020. *
Deng Yang et al., "Video clip recommendation model based on bullet-screen sentiment analysis" (基于弹幕情感分析的视频片段推荐模型), Journal of Computer Applications (计算机应用), vol. 37, no. 4, pp. 1065-1070, 1134, April 2017. *

Also Published As

Publication number Publication date
CN111860237A (en) 2020-10-30

Similar Documents

Publication Publication Date Title
Yang et al. Video captioning by adversarial LSTM
CN111860237B (en) Video emotion fragment identification method and device
Cheng et al. Fully convolutional networks for continuous sign language recognition
Poria et al. Context-dependent sentiment analysis in user-generated videos
CN106878632B (en) Video data processing method and device
CN112199956B (en) Entity emotion analysis method based on deep representation learning
CN110825867B (en) Similar text recommendation method and device, electronic equipment and storage medium
Xu et al. Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature
US11727915B1 (en) Method and terminal for generating simulated voice of virtual teacher
WO2023124647A1 (en) Summary determination method and related device thereof
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN115580758A (en) Video content generation method and device, electronic equipment and storage medium
CN115408488A (en) Segmentation method and system for novel scene text
CN115830610A (en) Multi-mode advertisement recognition method and system, electronic equipment and storage medium
CN114880496A (en) Multimedia information topic analysis method, device, equipment and storage medium
CN113268592B (en) Short text object emotion classification method based on multi-level interactive attention mechanism
Zaoad et al. An attention-based hybrid deep learning approach for bengali video captioning
CN113761377A (en) Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium
US20240037941A1 (en) Search results within segmented communication session content
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
Zhang et al. Recognition of emotions in user-generated videos through frame-level adaptation and emotion intensity learning
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
Gomes Jr et al. Framework for knowledge discovery in educational video repositories
Angrave et al. Creating tiktoks, memes, accessible content, and books from engineering videos? first solve the scene detection problem
Wang et al. Multimodal Cross-Attention Bayesian Network for Social News Emotion Recognition

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant