CN111860237A - Video emotion fragment identification method and device - Google Patents
Video emotion fragment identification method and device
- Publication number
- CN111860237A (application number CN202010645824.XA)
- Authority
- CN
- China
- Prior art keywords
- emotion
- video
- analyzed
- bullet screen
- segment
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Abstract
The invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining the bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating the segment emotion vector and emotion entropy of each video segment to be analyzed according to the bullet screen emotion labels in that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and emotion entropies. In this identification method, the video to be analyzed is divided into a plurality of video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified according to the segment emotion vectors and emotion entropies obtained from the bullet screens. This shortens the identification period of emotion fragments and solves the problem that manual annotation of emotion labels is time-consuming and therefore makes the emotion fragment identification period long.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for identifying video emotion fragments.
Background
With the development of multimedia technology, the volume of multimedia video data has grown explosively and attracted a large number of users. People increasingly watch videos to relieve pressure and boredom, and watching videos has become a new way of meeting people's emotional needs. However, there is a contradiction between the huge scale of videos and the limited time of users: audiences often only want to watch certain emotion segments of a video rather than the whole video. Therefore, emotion labels synchronized with time (five emotion categories: happiness, surprise, disgust, sadness and fear) are needed for the video, so that the emotion fragments in the video can be identified and the personalized emotional needs of audiences can be better met.
The first challenge of this work is that videos lack time-series emotion labels. At present, emotion labels are mainly annotated manually on each frame of a video, and emotion fragments are identified based on the annotated emotion labels; because manual annotation of emotion labels is time-consuming, the emotion fragment identification period is long.
Disclosure of Invention
In view of the above, the present invention provides a method and an apparatus for identifying video emotion fragments, so as to solve the problem in the prior art that emotion labels are mainly annotated manually on each frame of a video and emotion fragments are identified based on the annotated labels, which makes the emotion fragment identification period long because manual annotation is time-consuming. The specific scheme is as follows:
A video emotion fragment identification method comprises the following steps:
determining the bullet screen emotion labels of all emotion bullet screens in a video to be analyzed;
segmenting the video to be analyzed to obtain each video segment to be analyzed;
calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
Optionally, the method for determining the bullet screen emotion tag of each emotion bullet screen in the video to be analyzed includes:
acquiring each bullet screen of the video to be analyzed;
screening the bullet screens to obtain all emotion bullet screens;
and determining the bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
Optionally, the method for determining the bullet screen emotion tag of each emotion bullet screen according to the preset neural network model includes:
determining a target semantic representation of each emotion bullet screen, wherein the target semantic representation is obtained by splicing a fine-granularity semantic representation and a coarse-granularity semantic representation of the corresponding emotion bullet screen;
Determining visual vector representation of scene image data at each emotion bullet screen generation moment;
and transmitting the target semantic representation and the visual vector representation to the preset neural network model to obtain a bullet screen emotion label corresponding to the emotion bullet screen.
Optionally, in the method, the segmenting the video to be analyzed to obtain each video segment to be analyzed includes:
determining the visual semantics of each frame in the video to be analyzed;
sequentially comparing the visual semantics of the adjacent frames, and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and if so, segmenting the adjacent frames as segmentation points to obtain each video segment to be analyzed.
The above method, optionally, further includes:
acquiring the bullet screen semantics of the adjacent frames;
and correcting the segmentation points according to the bullet screen semantics.
Optionally, in the method, identifying the emotion fragment in each video fragment to be analyzed according to the fragment emotion vector and the emotion entropy includes:
judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
if yes, judging that the current video segment to be analyzed contains one emotion; or,
if not, judging whether the ratio of the maximum component to the second-largest component in the current segment emotion vector is larger than a preset proportional threshold;
if so, judging that the current video segment to be analyzed contains one emotion, or, if not, judging that the current video segment to be analyzed contains two emotions.
An apparatus for identifying emotion fragments in a video, comprising:
the determining module is used for determining the bullet screen emotion labels of all emotion bullet screens in the video to be analyzed;
the segmentation module is used for segmenting the video to be analyzed to obtain each video segment to be analyzed;
the computing module is used for computing the segment emotion vectors and the emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and the identification module is used for identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
The above apparatus, optionally, the determining module includes:
the acquisition unit is used for acquiring each bullet screen of the video to be analyzed;
the screening unit is used for screening the bullet screens to obtain all emotion bullet screens;
and the label determining unit is used for determining the bullet screen emotion labels of each emotion bullet screen according to the preset neural network model.
The above apparatus, optionally, the dividing module includes:
the semantic determining unit is used for determining the visual semantics of each frame in the video to be analyzed;
the first judgment unit is used for sequentially comparing the visual semantics of the adjacent frames and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and the segmentation unit is used for, if so, segmenting the video at the adjacent frames as segmentation points to obtain each video segment to be analyzed.
The above apparatus, optionally, the identification module includes:
the second judgment unit is used for judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
the first judging unit is used for judging, if so, that the current video segment to be analyzed contains one emotion;
the third judging unit is used for judging, if not, whether the ratio of the maximum component to the second-largest component in the current segment emotion vector is larger than a preset proportional threshold;
and the second judging unit is used for judging that the current video segment to be analyzed contains one emotion if the judgment result is positive, or judging that the current video segment to be analyzed contains two emotions if the judgment result is negative.
Compared with the prior art, the invention has the following advantages:
The invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining the bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating the segment emotion vector and emotion entropy of each video segment to be analyzed according to the bullet screen emotion labels in that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and emotion entropies. In this identification method, the video to be analyzed is divided into a plurality of video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified according to the segment emotion vectors and emotion entropies obtained from the bullet screens. This shortens the identification period of emotion fragments and solves the problem that manual annotation of emotion labels is time-consuming and therefore makes the emotion fragment identification period long.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flowchart of a method for identifying video emotion fragments disclosed in an embodiment of the present application;
fig. 2 is a block diagram of a structure of an apparatus for recognizing video emotion fragments according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a method and a device for identifying video emotion fragments, which are applied in the process of identifying the emotion fragments in a video. In the prior art, emotion labels are annotated manually and the emotion fragments in a video are identified based on those labels; however, manual annotation of emotion labels is time-consuming, so the identification period is long. To solve this problem, an embodiment of the invention provides a video emotion fragment identification method. Many video sharing platforms widely provide time-synchronized comments known as the "bullet screen" (danmaku). These comments are the instant feelings of the audience while viewing, contain rich emotional expressions, are consistent with the development of video emotion, and can therefore be used for video emotion analysis. The identification method accordingly identifies the video to be analyzed based on the bullet screens. The execution flow of the identification method is shown in FIG. 1 and comprises the following steps:
s101, determining a bullet screen emotion label of each emotion bullet screen in a video to be analyzed;
in the embodiment of the invention, each bullet screen of the video to be analyzed is obtained. Because bullet screens are the instant feelings of video users, not all viewers take part in the bullet screen interaction throughout the whole video, and check-in or sign-in bullet screens irrelevant to the video semantics and video emotions often appear; the topics of the bullet screens are therefore loose and the semantic noise is considerable. Preferably, denoising is performed first.
Further, the bullet screen emotion label of each emotion bullet screen is determined according to a preset neural network model. The preset neural network model is a bullet screen emotion model which needs to be trained in advance; the training process of the bullet screen emotion model is as follows:
first, a training data set is constructed: from the bullet screen data set C, which does not contain emotion labels, a bullet screen emotion data set C_e with emotion labels is constructed, and the bullet screen emotion model is trained based on C_e. Considering the high cost of manual annotation, the emotion labels of the bullet screen emotion data set C_e are obtained through a two-stage word matching method, whose basic idea rests on the fact that emotional expression in bullet screens is very common. Bullet screens contain rich emotional expressions, and bullet screens with explicit emotional expression can be recognized automatically through a two-stage emotion dictionary matching method. In the first stage, emotion polarity recognition is performed on all video bullet screens using a comprehensive emotion polarity dictionary that integrates a general emotion dictionary and a bullet screen emotion dictionary, and bullet screens that contain explicit emotional expression and whose positive or negative polarity can be recognized are selected. In the second stage, fine-grained emotion recognition (five emotion categories: happiness, surprise, disgust, sadness and fear) is performed with a fine-grained emotion dictionary on the bullet screens with recognized polarity obtained in the first stage. Through these two stages of emotion dictionary matching, emotion bullet screens with emotion labels are finally obtained. The bullet screen data set C and the emotion bullet screen data set C_e are expressed as follows:
C = {(C_1, T_1, I_1), …, (C_i, T_i, I_i), …, (C_N, T_N, I_N)}   (1)
C_e = {(C_1^e, T_1^e, I_1^e, y_1^e), …, (C_j^e, T_j^e, I_j^e, y_j^e), …, (C_M^e, T_M^e, I_M^e, y_M^e)}   (2)
wherein any element (C_i, T_i, I_i) of the bullet screen data set C represents the bullet screen C_i posted at moment T_i together with the scene image data I_i of the corresponding video key frame; any element (C_j^e, T_j^e, I_j^e, y_j^e) of the emotion bullet screen data set C_e represents the bullet screen C_j^e posted at moment T_j^e, the scene image data I_j^e of the corresponding video key frame, and the five-class emotion label y_j^e of the bullet screen C_j^e; N and M represent the number of bullet screen sentences and the number of emotion bullet screen sentences, respectively.
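As an illustration of the two-stage dictionary matching described above, the following Python sketch shows the control flow only; the dictionary structures (polarity_dict, fine_grained_dict), the whitespace tokenization and the majority-vote tie-breaking are assumptions for illustration, not the patent's exact procedure.

```python
# Minimal sketch of the two-stage emotion-dictionary labelling (assumed data layout).
# polarity_dict maps a word to +1/-1; fine_grained_dict maps a word to one of the
# five emotion classes. Both dictionaries are hypothetical placeholders here.

FIVE_EMOTIONS = ["happiness", "surprise", "disgust", "sadness", "fear"]

def label_danmaku(comments, polarity_dict, fine_grained_dict):
    """Return (comment, emotion_label) pairs for comments that pass both stages."""
    labelled = []
    for text in comments:
        words = text.split()  # a real implementation would use a Chinese tokenizer
        # Stage 1: keep only comments containing an explicit positive/negative polarity word.
        if not any(w in polarity_dict for w in words):
            continue
        # Stage 2: fine-grained matching against the five emotion categories.
        hits = [fine_grained_dict[w] for w in words if w in fine_grained_dict]
        if hits:
            # Majority vote over matched words as a simple tie-breaking rule (assumption).
            label = max(set(hits), key=hits.count)
            labelled.append((text, label))
    return labelled
```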
In the embodiment of the invention, based on the emotion bullet screen data set CeTraining the barrage emotion model, wherein the input of the barrage emotion model is an emotion barrage data set CeOf which any one element isNamely:bullet screen text corresponding to momentAnd emotion tag textAnd visual data information of the video key frame at that timeThe characterization process of the input data is as follows: obtaining bullet screen text by using pre-training language model BertSentence vector characterization ofSum word vector characterizationBarrage emotion label obtained by using pre-training language model BertSentence vector characterization ofProcessing visual image information of video key frame by using existing deep network model VGGExtracting the last convolution layer of the VGG model as the result Vector characterization ofThe correlation formula is as follows:
in view of the fact that the text semantics of a bullet screen are related to the semantics of the video scene at the corresponding moment, in the embodiment of the invention the scene visual information v_j^e of the bullet screen is fused, by way of attention, into the word vectors E_j^e of the bullet screen text to help the model focus on the words related to the visual scene of the bullet screen, obtaining the bullet screen word vectors E_j^{e,v} with visual attention. The relevant formulas of the attention mechanism are as follows:

M = tanh(W_1 E_j^e + W_2 v_j^e)
α = softmax(W_3 M)   (7)
E_j^{e,v} = α ⊙ E_j^e

wherein W_1, W_2 and W_3 are the training parameters of the attention unit (which may be set empirically or on a case-by-case basis), tanh denotes the activation function of the deep neural network, M is an intermediate quantity, softmax denotes the normalization operation, and α denotes the visual attention weight of each word of the bullet screen text with respect to the visual information. The visual attention weight α acts on the word vectors E_j^e of the bullet screen text to obtain the bullet screen word vectors E_j^{e,v} with visual attention.
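A minimal PyTorch sketch of this visual attention fusion follows; the projection sizes, the use of nn.Linear for W_1–W_3 and the element-wise application of α are assumptions consistent with the description above, not the patent's exact formulation.

```python
import torch
import torch.nn as nn

class VisualAttention(nn.Module):
    """Fuse a frame-level visual vector into bullet-screen word vectors via attention."""
    def __init__(self, d_word, d_visual, d_att):
        super().__init__()
        self.W1 = nn.Linear(d_word, d_att, bias=False)
        self.W2 = nn.Linear(d_visual, d_att, bias=False)
        self.W3 = nn.Linear(d_att, 1, bias=False)

    def forward(self, word_vecs, visual_vec):
        # word_vecs: (seq_len, d_word); visual_vec: (d_visual,)
        m = torch.tanh(self.W1(word_vecs) + self.W2(visual_vec))   # intermediate quantity M
        alpha = torch.softmax(self.W3(m).squeeze(-1), dim=0)       # visual attention weights, eq. (7)
        return word_vecs * alpha.unsqueeze(-1)                     # visually attended word vectors
```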
Considering the sequential information of the words contained in the sentence text, the invention uses the recurrent neural network BiLSTM and a self-attention mechanism to model the bullet screen word vectors E_j^{e,v} fused with visual information, obtaining the fine-grained semantic representation h_j^e of the bullet screen:

h_j^e = SelfAtt(BiLSTM(E_j^{e,v}))

As shown in formula (3), the bullet screen sentence vector representation s_j^e obtained from the Bert model is the coarse-grained semantic representation of the whole bullet screen sentence. It is spliced, with a weight, to the fine-grained semantic representation h_j^e of the bullet screen to obtain the target semantic representation r_j^e of the bullet screen, see the following formula:

r_j^e = (γ · s_j^e) ⊕ h_j^e

wherein γ is a weight adjustment parameter, and the sign '⊕' denotes the splicing (concatenation) operation of tensors.
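The fine-grained encoder and the weighted splice can be sketched as follows; the hidden sizes, the additive form of self-attention, the weighted pooling and the placement of γ on the sentence-vector branch are assumptions, since the patent only names BiLSTM, self-attention and a weighted splice.

```python
import torch
import torch.nn as nn

class FineGrainedEncoder(nn.Module):
    """BiLSTM + self-attention over visually attended word vectors, spliced with the sentence vector."""
    def __init__(self, d_word, d_hidden, gamma=0.5):
        super().__init__()
        self.bilstm = nn.LSTM(d_word, d_hidden, batch_first=True, bidirectional=True)
        self.att = nn.Linear(2 * d_hidden, 1)
        self.gamma = gamma                       # weight adjustment parameter γ (assumed scalar)

    def forward(self, word_vecs_v, sentence_vec):
        # word_vecs_v: (1, seq_len, d_word); sentence_vec: (2*d_hidden,) -- assumed same size
        h, _ = self.bilstm(word_vecs_v)                    # (1, seq_len, 2*d_hidden)
        alpha = torch.softmax(self.att(h), dim=1)          # self-attention weights
        fine = (alpha * h).sum(dim=1).squeeze(0)           # fine-grained semantic representation
        # Weighted splice of coarse- and fine-grained representations (γ on the sentence branch).
        return torch.cat([self.gamma * sentence_vec, fine], dim=-1)
```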
Subsequently, the target semantic representation r_j^e is passed through a fully connected layer FC for training and output, and the emotion probability P of the bullet screen is obtained:

P(ŷ_j^e | C_j^e, I_j^e) = softmax(FC(r_j^e))

wherein y represents the emotion category to which the bullet screen belongs, and ŷ_j^e represents the emotion category probability of the bullet screen C_j^e calculated from the input. FC is a single-layer fully connected network structure; through the output of the fully connected layer, the emotion probability P of each emotion bullet screen is obtained, and the emotion model is trained by minimizing the following objective function:

L = softmax_cross_entropy(y_j^e, ŷ_j^e)

wherein y_j^e is the original emotion label of the emotion bullet screen C_j^e, ŷ_j^e is the emotion probability output by the model after training, and softmax_cross_entropy is the cross-entropy loss function, which computes the cross-entropy loss between the original emotion label y_j^e of each bullet screen and the emotion prediction result ŷ_j^e. To minimize the objective function, an Adam optimizer is used to iteratively update each parameter in the model (implemented with TensorFlow automatic differentiation), thereby training the emotion recognition model for bullet screens.
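A sketch of the classification head and the training objective follows; the patent specifies a single fully connected layer, softmax cross-entropy and the Adam optimizer (originally implemented with TensorFlow automatic differentiation), so this PyTorch version is an assumed equivalent that, for brevity, trains only the head on precomputed representations.

```python
import torch
import torch.nn as nn

def train_classifier_head(samples, d_repr, num_classes=5, epochs=10):
    """Train the single fully connected output layer on precomputed target semantic representations.

    samples: list of (representation_tensor, label_index); in the full model the encoder
    parameters would also be updated end-to-end -- omitted here for brevity.
    """
    fc = nn.Linear(d_repr, num_classes)                        # single-layer fully connected output
    criterion = nn.CrossEntropyLoss()                          # softmax cross-entropy loss
    optimizer = torch.optim.Adam(fc.parameters(), lr=1e-4)     # Adam optimizer, as in the patent
    for _ in range(epochs):
        for repr_vec, label in samples:
            logits = fc(repr_vec.detach()).unsqueeze(0)        # emotion scores before softmax
            loss = criterion(logits, torch.tensor([label]))    # compare with the original emotion label
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return fc
```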
Finally, the trained bullet screen emotion recognition model performs emotion prediction on any bullet screen C_k: (C_k, T_k, I_k) in the bullet screen data set C, outputs P(y | C_k, I_k), and further obtains the emotion probability vector e_k:

e_k = softmax(P(y | C_k, I_k))

wherein P(y | C_k, I_k) is the model output for the bullet screen C_k, and the softmax operation computes the proportion of each of the multiple classes and ensures that the proportions sum to 1. The invention applies softmax to further process the prediction result for the bullet screen C_k and obtain its emotion probability vector e_k. It is a five-dimensional emotion vector whose value in each dimension can be regarded as the emotional semantic distribution of the bullet screen C_k in that dimension; it measures the emotional semantic value of the bullet screen C_k in each emotion dimension and also represents the emotion label of the bullet screen C_k.
S102, segmenting the video to be analyzed to obtain each video segment to be analyzed;
in the embodiment of the invention, because bullet screen comments are the instant responses of the audience, the emotions they contain are always momentary. Therefore, video emotion analysis based on a continuous period of time is most suitable. In fact, a video contains many relatively independent scene segments whose content usually has relative independence and its own topics, and these evolve with the development of the video scenes. That is, changes of the video plot are generally consistent with the switching of video scenes, so a change of video scene can be used as the basis for segmenting video clips. Compared with conventional equal-length video clip segmentation, segmenting video clips from the perspective of scene switching is more suitable for this application.
Firstly, an object recognition method based on bottom-up and top-down attention is used to perform object recognition on the visual data information of each video key frame, obtaining the visual words of each frame. The visual words can be viewed as the visual semantics of the frame and describe its visual scene. When the visual-word text of two adjacent frames changes remarkably, the described scene has changed, and that moment can be used as a video segmentation point.
Furthermore, in order to improve the segmentation accuracy, the invention also corrects the segmentation points from the perspective of video semantics. This operation is implemented by means of the bullet screens, which can reflect the video semantics: a segment segmentation point acts as a plot transition point of the video, so the bullet screen semantics at that moment should be relatively loose; if the bullet screen semantics at that moment are instead concentrated and consistent, the segmentation point is corrected. That is, for any video segment S_i obtained in the previous stage, the pairwise semantic similarity of all its bullet screens is computed to construct the semantic similarity matrix of segment S_i, from which the average bullet screen semantic similarity of the video segment S_i is further obtained. The average bullet screen semantic similarity of each video segment S_i is then examined, and video segments with a very high average semantic similarity are discarded (judged by an empirical threshold determined through practical experiments); finally, a set of video segments to be analyzed {s_p} with relatively independent and natural plots is obtained.
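The scene-switch segmentation and the bullet-screen-based correction can be sketched as follows; the Jaccard-style measure of visual-word difference and the sentence-embedding similarity callback are illustrative stand-ins, since the patent does not fix these choices beyond naming bottom-up and top-down attention object recognition.

```python
def segment_by_scene(frame_visual_words, diff_threshold):
    """frame_visual_words: list of sets of visual words, one set per key frame."""
    cut_points = []
    for t in range(1, len(frame_visual_words)):
        prev, cur = frame_visual_words[t - 1], frame_visual_words[t]
        union = prev | cur
        # Jaccard-style difference between adjacent frames (assumed measure).
        diff = 1.0 - (len(prev & cur) / len(union)) if union else 0.0
        if diff > diff_threshold:
            cut_points.append(t)            # adjacent frames taken as a segmentation point
    return cut_points

def correct_cut_points(cut_points, danmaku_sim_at, high_sim_threshold):
    """Drop cut points where the bullet-screen semantics around the cut are concentrated and consistent.

    danmaku_sim_at(t) is assumed to return the average pairwise semantic similarity of the
    bullet screens around frame t (e.g. cosine similarity of sentence embeddings).
    """
    return [t for t in cut_points if danmaku_sim_at(t) < high_sim_threshold]
```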
S103, calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
in the embodiment of the invention, the video to be analyzed contains complex multi-modal content and complex emotion, and the emotion bullet screens of the video audience can be regarded as an indirect reflection of the video emotion, which makes them suitable for video emotion analysis. For any segment s_i in the set of video segments to be analyzed {s_p}, let the emotion bullet screen set of segment s_i be {c_1, …, c_u} and the corresponding set of emotion vectors be {e_1, …, e_u}, where the emotion vector of each bullet screen c_k is e_k. Summing the emotion vectors of all bullet screens of the segment dimension by dimension gives the emotion vector e(s_i) of segment s_i, as shown in the following formula:

e(s_i) = Σ_{k=1}^{u} e_k

wherein u is the number of emotion bullet screens in segment s_i, and the vector e(s_i) is the emotion vector of segment s_i, representing the emotion labels of segment s_i in each emotion dimension.

In information theory, entropy is a quantity describing the disorder of a system: the greater the entropy, the more disordered the system and the less information it carries; the smaller the entropy, the more ordered the system and the more information it carries. In the emotion vector e(s_i) of segment s_i, the concentration of the distribution of emotional semantic information across the emotion dimensions can likewise be measured by entropy and used to judge the emotional tendency of the segment s_i to be analyzed; the entropy of the segment s_i to be analyzed is here also called the emotion entropy of segment s_i. According to the entropy formula in information theory, the emotion entropy of segment s_i is obtained as shown in the following formula:

H(e_i) = −Σ_{d=1}^{5} p_d log p_d,   where p_d = e(s_i)_d / Σ_{d'=1}^{5} e(s_i)_{d'}
and S104, identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
In the embodiment of the invention, because video emotion is complex, the emotional tendencies of emotion video segments do not all fall into a single emotion category, and video segments with complex emotions are common. The present application aims to find video segments containing no more than two significant emotional tendencies: one kind is a video emotion segment with only one obvious emotional tendency; the other is a video emotion segment with two obvious emotional tendencies.
The processing of a video emotion segment with a single emotional tendency is as follows: when the emotion entropy H(e_i) of the video segment s_i to be analyzed is very small, i.e. smaller than the emotion entropy threshold H(e)_threshold, the emotional semantics of the emotion bullet screens of segment s_i tend to be consistent across all dimensions, which means that the video segment s_i to be analyzed contains only one significant emotional tendency.
On this basis, when the emotion entropy H(e_i) of segment s_i is only slightly above the threshold H(e)_threshold, the emotional tendency of segment s_i is not necessarily single, and a further judgment is needed: in the emotion vector e(s_i) of segment s_i, when the maximum component is far greater than the second-largest component, segment s_i still has only one emotional tendency, namely the emotion category of the dimension of the maximum component, as shown in the following formula:

label(s_i) = argmax_d e(s_i)_d,   if max_d e(s_i)_d / secmax_d e(s_i)_d > λ   (18)

wherein λ is the preset proportional threshold and secmax denotes the second-largest component.
The processing of a video emotion segment to be analyzed containing two emotional tendencies is as follows: as shown in formula (18), when the emotion entropy H(e_i) of the video emotion segment s_i to be analyzed is only slightly above the threshold H(e)_threshold, its emotional tendency is not necessarily single. When, in the emotion vector e(s_i) of the video emotion segment s_i to be analyzed, the difference between the maximum component and the second-largest component is small, the emotion categories of the dimensions of these two components can both be regarded as the emotional tendency of the video emotion segment s_i to be analyzed, i.e. the segment s_i has two main emotional tendencies.
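Combining the single-emotion and two-emotion cases, the identification rule of step S104 can be sketched as follows; representing "far greater than" / "the difference is small" by a single ratio test is a simplifying assumption.

```python
def identify_segment_emotions(segment_vector, entropy, entropy_threshold, ratio_threshold,
                              emotions=("happiness", "surprise", "disgust", "sadness", "fear")):
    """Return the one or two dominant emotion labels of a segment, per the entropy/ratio rules."""
    ranked = sorted(range(len(segment_vector)), key=lambda d: segment_vector[d], reverse=True)
    top, second = ranked[0], ranked[1]
    if entropy < entropy_threshold:
        return [emotions[top]]                               # single significant emotional tendency
    if segment_vector[second] == 0 or \
       segment_vector[top] / segment_vector[second] > ratio_threshold:
        return [emotions[top]]                               # maximum component clearly dominates
    return [emotions[top], emotions[second]]                 # two main emotional tendencies
```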
Considering that the topics of bullet screens are loose and the semantic noise is considerable, the invention also performs noise reduction from the semantic perspective, using the text vectors of the emotion bullet screens of each video segment s_i to be analyzed. For segment s_i, the emotion semantic similarity matrix is a symmetric matrix in which each value represents the pairwise semantic relevance of the emotion bullet screens; its upper triangular part is analyzed, and if the semantic similarity of two emotion bullet screens is lower than the in-segment bullet screen semantic similarity threshold (the threshold is determined according to practical experiments), they are regarded as semantic distortion points and the corresponding emotion bullet screens are deleted. Through this operation, the embodiment of the invention achieves better robustness.
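A sketch of this in-segment semantic denoising follows; the cosine similarity of text vectors is an assumed measure, and dropping bullet screens whose best pairwise similarity falls below the threshold is one plausible reading of the "semantic distortion point" criterion.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na, nb = math.sqrt(sum(x * x for x in a)), math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def drop_semantic_outliers(danmaku_text_vectors, sim_threshold):
    """Keep only emotion bullet screens that are semantically close to at least one other one."""
    n = len(danmaku_text_vectors)
    keep = []
    for i in range(n):
        sims = [cosine(danmaku_text_vectors[i], danmaku_text_vectors[j]) for j in range(n) if j != i]
        if sims and max(sims) >= sim_threshold:
            keep.append(i)                    # otherwise treated as a semantic distortion point
    return keep
```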
The invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining the bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating the segment emotion vector and emotion entropy of each video segment to be analyzed according to the bullet screen emotion labels in that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and emotion entropies. In this identification method, the video to be analyzed is divided into a plurality of video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified according to the segment emotion vectors and emotion entropies obtained from the bullet screens, so that the identification period of emotion fragments is shortened, solving the problem that manual annotation of emotion labels is time-consuming and therefore makes the emotion fragment identification period long.
Based on this identification method, for the emotionally rich bullet screens in the video, the attention mechanism and the idea of multi-modal fusion are applied to fuse the bullet screen emotion semantics with the visual information of the video scene along the time sequence, thereby realizing an enhanced representation of the bullet screen emotion semantics. This representation, integrating text and visual information, is used to judge the emotion of the video segments obtained by segmentation based on visual semantics and scene switching, so that the emotion fragments in the video are accurately identified, making up for the deficiency that existing video emotion understanding does not identify video emotion fragments.
Based on the foregoing identification method for video emotion fragments, an embodiment of the present invention further provides an identification apparatus for video emotion fragments, where a structural block diagram of the identification apparatus is shown in fig. 2, and the identification apparatus includes:
a determination module 201, a segmentation module 202, a calculation module 203 and a recognition module 204.
Wherein,
the determining module 201 is configured to determine a bullet screen emotion tag of each emotion bullet screen in a video to be analyzed;
the segmentation module 202 is configured to segment the video to be analyzed to obtain each video segment to be analyzed;
the calculating module 203 is configured to calculate segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion tags in each video segment to be analyzed;
The identifying module 204 is configured to identify an emotion fragment in each to-be-analyzed video fragment according to the fragment emotion vector and the emotion entropy.
The invention discloses a device for identifying video emotion fragments, which is configured to: determine the bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segment the video to be analyzed to obtain video segments to be analyzed; calculate the segment emotion vector and emotion entropy of each video segment to be analyzed according to the bullet screen emotion labels in that segment; and identify the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and emotion entropies. In this identification device, the video to be analyzed is divided into a plurality of video segments to be analyzed, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified according to the segment emotion vectors and emotion entropies obtained from the bullet screens, so that the identification period of emotion fragments is shortened, solving the problem that manual annotation of emotion labels is time-consuming and therefore makes the emotion fragment identification period long.
In this embodiment of the present invention, the determining module 201 includes:
An acquisition unit 205, a screening unit 206 and a tag determination unit 207.
Wherein,
the obtaining unit 205 is configured to obtain each barrage of the video to be analyzed;
the screening unit 206 is configured to screen the bullet screens to obtain emotion bullet screens;
the label determining unit 207 is configured to determine a bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
In this embodiment of the present invention, the dividing module 202 includes:
a semantic determination unit 208, a first judgment unit 209, and a slicing unit 210.
Wherein,
the semantic determining unit 208 is configured to determine visual semantics of each frame in the video to be analyzed;
the first judging unit 209 is configured to sequentially compare the visual semantics of the adjacent frames, and judge whether a difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold;
and the segmentation unit 210 is configured to, if so, segment the video at the adjacent frames as segmentation points to obtain each video segment to be analyzed.
In this embodiment of the present invention, the identifying module 204 includes:
a second judging unit 211, a first judging unit 212, a third judging unit 213, and a second judging unit 214.
Wherein,
The second determining unit 211 is configured to determine whether the emotion entropy is smaller than a preset emotion entropy threshold;
the first determining unit 212 is configured to determine that the current video segment to be analyzed includes an emotion, or;
the third determining unit 213 is configured to determine whether a ratio of a maximum component to a next largest component in the current segment emotion vector is greater than a preset ratio threshold if the ratio is not greater than the preset ratio threshold;
the second determining unit 214 is configured to determine that the current video segment to be analyzed includes one emotion if the current video segment to be analyzed includes one emotion, or determine that the current video segment to be analyzed includes two emotions if the current video segment to be analyzed includes one emotion.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the units may be implemented in the same software and/or hardware or in a plurality of software and/or hardware when implementing the invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The method and the device for identifying the video emotion fragments provided by the invention are described in detail, a specific example is applied in the text to explain the principle and the implementation mode of the invention, and the description of the embodiment is only used for helping to understand the method and the core idea of the invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
Claims (10)
1. A method for identifying video emotion fragments is characterized by comprising the following steps:
determining the bullet screen emotion labels of all emotion bullet screens in a video to be analyzed;
segmenting the video to be analyzed to obtain each video segment to be analyzed;
calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
2. The method of claim 1, wherein determining the bullet screen emotion label for each emotion bullet screen in the video to be analyzed comprises:
acquiring each bullet screen of the video to be analyzed;
screening the bullet screens to obtain all emotion bullet screens;
and determining the bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
3. The method of claim 2, wherein determining the bullet screen emotion label for each emotion bullet screen according to a preset neural network model comprises:
determining a target semantic representation of each emotion bullet screen, wherein the target semantic representation is obtained by splicing a fine-granularity semantic representation and a coarse-granularity semantic representation of the corresponding emotion bullet screen;
Determining visual vector representation of scene image data at each emotion bullet screen generation moment;
and transmitting the target semantic representation and the visual vector representation to the preset barrage emotion recognition neural network model to obtain an emotion label corresponding to the emotion barrage.
4. The method of claim 1, wherein segmenting the video to be analyzed to obtain each video segment to be analyzed comprises:
determining the visual semantics of each frame in the video to be analyzed;
sequentially comparing the visual semantics of the adjacent frames, and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and if so, segmenting the adjacent frames as segmentation points to obtain each video segment to be analyzed.
5. The method of claim 4, further comprising:
acquiring the bullet screen semantics of the adjacent frames;
and correcting the segmentation points according to the bullet screen semantics.
6. The method of claim 1, wherein identifying the emotion fragments in the respective video fragments to be analyzed according to the fragment emotion vector and the emotion entropy comprises:
judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
if yes, judging that the current video segment to be analyzed contains one emotion; or,
if not, judging whether the ratio of the maximum component to the second-largest component in the current segment emotion vector is larger than a preset proportional threshold;
if so, judging that the current video segment to be analyzed contains one emotion, or, if not, judging that the current video segment to be analyzed contains two emotions.
7. An apparatus for identifying emotion fragments in a video, comprising:
the determining module is used for determining the bullet screen emotion labels of all emotion bullet screens in the video to be analyzed;
the segmentation module is used for segmenting the video to be analyzed to obtain each video segment to be analyzed;
the computing module is used for computing the segment emotion vectors and the emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and the identification module is used for identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
8. The apparatus of claim 7, wherein the determining module comprises:
the acquisition unit is used for acquiring each bullet screen of the video to be analyzed;
The screening unit is used for screening the bullet screens to obtain all emotion bullet screens;
and the label determining unit is used for determining the bullet screen emotion labels of each emotion bullet screen according to the preset neural network model.
9. The apparatus of claim 7, wherein the segmentation module comprises:
the semantic determining unit is used for determining the visual semantics of each frame in the video to be analyzed;
the first judgment unit is used for sequentially comparing the visual semantics of the adjacent frames and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and the segmentation unit is used for, if so, segmenting the video at the adjacent frames as segmentation points to obtain each video segment to be analyzed.
10. The apparatus of claim 7, wherein the identification module comprises:
the second judgment unit is used for judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
the first judging unit is used for judging, if so, that the current video segment to be analyzed contains one emotion;
the third judging unit is used for judging, if not, whether the ratio of the maximum component to the second-largest component in the current segment emotion vector is larger than a preset proportional threshold;
And the second judging unit is used for judging that the current video segment to be analyzed contains one emotion if the judgment result is positive, or judging that the current video segment to be analyzed contains two emotions if the judgment result is negative.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010645824.XA CN111860237B (en) | 2020-07-07 | 2020-07-07 | Video emotion fragment identification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010645824.XA CN111860237B (en) | 2020-07-07 | 2020-07-07 | Video emotion fragment identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111860237A true CN111860237A (en) | 2020-10-30 |
CN111860237B CN111860237B (en) | 2022-09-06 |
Family
ID=73153438
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010645824.XA Active CN111860237B (en) | 2020-07-07 | 2020-07-07 | Video emotion fragment identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111860237B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364743A (en) * | 2020-11-02 | 2021-02-12 | 北京工商大学 | Video classification method based on semi-supervised learning and bullet screen analysis |
CN112699831A (en) * | 2021-01-07 | 2021-04-23 | 重庆邮电大学 | Video hotspot segment detection method and device based on barrage emotion and storage medium |
CN113221689A (en) * | 2021-04-27 | 2021-08-06 | 苏州工业职业技术学院 | Video multi-target emotion prediction method and system |
CN113656643A (en) * | 2021-08-20 | 2021-11-16 | 珠海九松科技有限公司 | Algorithm for analyzing film-watching mood by using AI (artificial intelligence) |
CN114339375A (en) * | 2021-08-17 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Video playing method, method for generating video directory and related product |
CN117710777A (en) * | 2024-02-06 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Model training method, key frame extraction method and device |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108513176A (en) * | 2017-12-06 | 2018-09-07 | 北京邮电大学 | A kind of socialization video subject extraction system and method based on topic model |
CN108537139A (en) * | 2018-03-20 | 2018-09-14 | 校宝在线(杭州)科技股份有限公司 | A kind of Online Video wonderful analysis method based on barrage information |
CN108737859A (en) * | 2018-05-07 | 2018-11-02 | 华东师范大学 | Video recommendation method based on barrage and device |
CN109862397A (en) * | 2019-02-02 | 2019-06-07 | 广州虎牙信息科技有限公司 | A kind of video analysis method, apparatus, equipment and storage medium |
CN110020437A (en) * | 2019-04-11 | 2019-07-16 | 江南大学 | The sentiment analysis and method for visualizing that a kind of video and barrage combine |
CN110113659A (en) * | 2019-04-19 | 2019-08-09 | 北京大米科技有限公司 | Generate method, apparatus, electronic equipment and the medium of video |
CN110198482A (en) * | 2019-04-11 | 2019-09-03 | 华东理工大学 | A kind of video emphasis bridge section mask method, terminal and storage medium |
CN110263215A (en) * | 2019-05-09 | 2019-09-20 | 众安信息技术服务有限公司 | A kind of video feeling localization method and system |
CN110569354A (en) * | 2019-07-22 | 2019-12-13 | 中国农业大学 | Barrage emotion analysis method and device |
CN110852360A (en) * | 2019-10-30 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Image emotion recognition method, device, equipment and storage medium |
CN111163366A (en) * | 2019-12-30 | 2020-05-15 | 厦门市美亚柏科信息股份有限公司 | Video processing method and terminal |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108513176A (en) * | 2017-12-06 | 2018-09-07 | 北京邮电大学 | A kind of socialization video subject extraction system and method based on topic model |
CN108537139A (en) * | 2018-03-20 | 2018-09-14 | 校宝在线(杭州)科技股份有限公司 | A kind of Online Video wonderful analysis method based on barrage information |
CN108737859A (en) * | 2018-05-07 | 2018-11-02 | 华东师范大学 | Video recommendation method based on barrage and device |
CN109862397A (en) * | 2019-02-02 | 2019-06-07 | 广州虎牙信息科技有限公司 | A kind of video analysis method, apparatus, equipment and storage medium |
CN110020437A (en) * | 2019-04-11 | 2019-07-16 | 江南大学 | The sentiment analysis and method for visualizing that a kind of video and barrage combine |
CN110198482A (en) * | 2019-04-11 | 2019-09-03 | 华东理工大学 | A kind of video emphasis bridge section mask method, terminal and storage medium |
CN110113659A (en) * | 2019-04-19 | 2019-08-09 | 北京大米科技有限公司 | Generate method, apparatus, electronic equipment and the medium of video |
CN110263215A (en) * | 2019-05-09 | 2019-09-20 | 众安信息技术服务有限公司 | A kind of video feeling localization method and system |
CN110569354A (en) * | 2019-07-22 | 2019-12-13 | 中国农业大学 | Barrage emotion analysis method and device |
CN110852360A (en) * | 2019-10-30 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Image emotion recognition method, device, equipment and storage medium |
CN111163366A (en) * | 2019-12-30 | 2020-05-15 | 厦门市美亚柏科信息股份有限公司 | Video processing method and terminal |
Non-Patent Citations (2)
Title |
---|
CHENCHEN LI等: "《Visual-Texual Emotion Analysis With Deep Coupled Video and Danmu Neural Networks》", 《IEEE TRANSACTIONS ON MULTIMEDIA》 * |
邓扬等: "《基于弹幕情感分析的视频片段推荐模型》", 《计算机应用》 * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364743A (en) * | 2020-11-02 | 2021-02-12 | 北京工商大学 | Video classification method based on semi-supervised learning and bullet screen analysis |
CN112699831A (en) * | 2021-01-07 | 2021-04-23 | 重庆邮电大学 | Video hotspot segment detection method and device based on barrage emotion and storage medium |
CN112699831B (en) * | 2021-01-07 | 2022-04-01 | 重庆邮电大学 | Video hotspot segment detection method and device based on barrage emotion and storage medium |
CN113221689A (en) * | 2021-04-27 | 2021-08-06 | 苏州工业职业技术学院 | Video multi-target emotion prediction method and system |
CN114339375A (en) * | 2021-08-17 | 2022-04-12 | 腾讯科技(深圳)有限公司 | Video playing method, method for generating video directory and related product |
CN114339375B (en) * | 2021-08-17 | 2024-04-02 | 腾讯科技(深圳)有限公司 | Video playing method, method for generating video catalogue and related products |
CN113656643A (en) * | 2021-08-20 | 2021-11-16 | 珠海九松科技有限公司 | Algorithm for analyzing film-watching mood by using AI (artificial intelligence) |
CN113656643B (en) * | 2021-08-20 | 2024-05-03 | 珠海九松科技有限公司 | Method for analyzing film viewing mood by using AI |
CN117710777A (en) * | 2024-02-06 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Model training method, key frame extraction method and device |
CN117710777B (en) * | 2024-02-06 | 2024-06-04 | 腾讯科技(深圳)有限公司 | Model training method, key frame extraction method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111860237B (en) | 2022-09-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111860237B (en) | Video emotion fragment identification method and device | |
Yang et al. | Video captioning by adversarial LSTM | |
Poria et al. | Context-dependent sentiment analysis in user-generated videos | |
CN109117777B (en) | Method and device for generating information | |
CN113255755B (en) | Multi-modal emotion classification method based on heterogeneous fusion network | |
CN106878632B (en) | Video data processing method and device | |
CN110825867B (en) | Similar text recommendation method and device, electronic equipment and storage medium | |
Dinkov et al. | Predicting the leading political ideology of YouTube channels using acoustic, textual, and metadata information | |
US11727915B1 (en) | Method and terminal for generating simulated voice of virtual teacher | |
Xu et al. | Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature | |
Singh et al. | An encoder-decoder based framework for hindi image caption generation | |
CN113761377A (en) | Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium | |
WO2023124647A1 (en) | Summary determination method and related device thereof | |
CN115169472A (en) | Music matching method and device for multimedia data and computer equipment | |
CN115830610A (en) | Multi-mode advertisement recognition method and system, electronic equipment and storage medium | |
CN116049557A (en) | Educational resource recommendation method based on multi-mode pre-training model | |
Zaoad et al. | An attention-based hybrid deep learning approach for bengali video captioning | |
Li et al. | An attention-based, context-aware multimodal fusion method for sarcasm detection using inter-modality inconsistency | |
CN113268592A (en) | Short text object emotion classification method based on multi-level interactive attention mechanism | |
US20240037941A1 (en) | Search results within segmented communication session content | |
Attai et al. | A survey on arabic image captioning systems using deep learning models | |
Elabora et al. | Evaluating citizens’ sentiments in smart cities: A deep learning approach | |
Zhang et al. | Recognition of emotions in user-generated videos through frame-level adaptation and emotion intensity learning | |
Gomes Jr et al. | Framework for knowledge discovery in educational video repositories | |
CN114676699A (en) | Entity emotion analysis method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |