CN111860237B - Video emotion fragment identification method and device - Google Patents
- Publication number
- CN111860237B (granted publication); application CN202010645824.XA (CN202010645824A)
- Authority
- CN
- China
- Prior art keywords
- emotion
- video
- analyzed
- bullet screen
- fragments
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
Abstract
The invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining the bullet-screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain video segments to be analyzed; calculating a segment emotion vector and an emotion entropy for each video segment according to the bullet-screen emotion labels within it; and identifying the emotion fragments among the video segments according to the segment emotion vectors and emotion entropies. In this identification method, the video to be analyzed is divided into a plurality of video segments, and the segment emotion vectors and emotion entropies of those segments are calculated from the bullet screens rather than from manually annotated labels. This shortens the recognition period of emotion fragments and solves the problem of the long identification period caused by the long labeling time of manually annotated emotion labels.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for identifying video emotion fragments.
Background
With the development of multimedia technology, the volume of multimedia video data has grown explosively and attracted a large number of users; people tend to watch videos to relieve pressure and boredom, and video watching has become a new way of meeting people's emotional needs. A contradiction exists between the huge scale of videos and the limited time of users: audiences often want to watch only the emotional segments of a video rather than the whole video. Therefore, videos need time-synchronized emotion labels (five emotion categories: happiness, surprise, disgust, sadness and fear) so that the emotion fragments in a video can be identified and the personalized emotional needs of audiences better met.
The first challenge of this work is that videos lack time-sequenced emotion labels. At present, emotion labels are mainly annotated manually on each frame of a video, and emotion fragments are identified from the annotated labels; because manual annotation takes a long time, the emotion-fragment identification period is long.
Disclosure of Invention
In view of this, the present invention provides a method and an apparatus for identifying video emotion fragments, so as to solve the problem in the prior art that emotion labels are mainly annotated manually on each frame of a video and emotion fragments are identified from the annotated labels, the long annotation time making the identification period long. The specific scheme is as follows:
a video emotion fragment identification method comprises the following steps:
determining the bullet screen emotion labels of all emotion bullet screens in a video to be analyzed;
segmenting the video to be analyzed to obtain each video segment to be analyzed;
calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
Optionally, the method for determining the bullet screen emotion tag of each emotion bullet screen in the video to be analyzed includes:
acquiring each bullet screen of the video to be analyzed;
screening the bullet screens to obtain all emotion bullet screens;
and determining the bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
Optionally, the method for determining the bullet screen emotion tag of each emotion bullet screen according to the preset neural network model includes:
determining a target semantic representation of each emotion bullet screen, wherein the target semantic representation is obtained by splicing a fine-granularity semantic representation and a coarse-granularity semantic representation of the corresponding emotion bullet screen;
determining visual vector representation of scene image data at each emotion bullet screen generation moment;
and transmitting the target semantic representation and the visual vector representation to the preset neural network model to obtain a bullet screen emotion label corresponding to the emotion bullet screen.
Optionally, in the method, the segmenting the video to be analyzed to obtain each video segment to be analyzed includes:
determining the visual semantics of each frame in the video to be analyzed;
sequentially comparing the visual semantics of the adjacent frames, and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and if so, segmenting the adjacent frames as segmentation points to obtain each video segment to be analyzed.
The above method, optionally, further includes:
acquiring the bullet screen semanteme of the adjacent frames;
and correcting the segmentation points according to the bullet screen semantics.
Optionally, in the method, identifying the emotion fragment in each video fragment to be analyzed according to the fragment emotion vector and the emotion entropy includes:
judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
if so, judging that the current video segment to be analyzed contains one emotion; or,
if not, judging whether the ratio of the maximum component to the second-largest component in the current segment emotion vector is greater than a preset ratio threshold;
if so, judging that the current video segment to be analyzed contains one emotion, or, if not, judging that the current video segment to be analyzed contains two emotions.
An apparatus for identifying emotion fragments in a video, comprising:
the determining module is used for determining the barrage emotion tags of all emotion barrages in the video to be analyzed;
the segmentation module is used for segmenting the video to be analyzed to obtain each video segment to be analyzed;
the computing module is used for computing the segment emotion vectors and the emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
and the identification module is used for identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
The above apparatus, optionally, the determining module includes:
the acquisition unit is used for acquiring each bullet screen of the video to be analyzed;
the screening unit is used for screening the bullet screens to obtain all emotion bullet screens;
and the label determining unit is used for determining the bullet screen emotion labels of each emotion bullet screen according to the preset neural network model.
The above apparatus, optionally, the dividing module includes:
the semantic determining unit is used for determining the visual semantics of each frame in the video to be analyzed;
the first judgment unit is used for sequentially comparing the visual semantics of the adjacent frames and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and the segmentation unit is used for, if so, segmenting at the adjacent frames as segmentation points to obtain each video segment to be analyzed.
The above apparatus, optionally, the identification module includes:
the second judgment unit is used for judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
the first judging unit is used for judging, if so, that the current video segment to be analyzed contains one emotion;
the third judging unit is used for judging, if not, whether the ratio of the maximum component to the second-largest component in the current segment emotion vector is greater than a preset ratio threshold;
and the second judging unit is used for judging that the current video segment to be analyzed contains one emotion if so, or two emotions if not.
Compared with the prior art, the invention has the following advantages:
the invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining a bullet screen emotion label of each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain each video segment to be analyzed; calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed; and identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy. In the identification method, a video to be analyzed is divided into a plurality of video segments to be analyzed, and segment emotion vectors and emotion entropies of the video segments to be analyzed are calculated; and recognizing the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors of the fragments obtained by the barrage and the emotion entropy, so that the recognition period of the emotion fragments is shortened, and the problem of long recognition period of the emotion fragments caused by long marking time of the emotion labels manually identified is solved.
Of course, it is not necessary for any product in which the invention is practiced to achieve all of the above-described advantages at the same time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for identifying video emotion fragments disclosed in an embodiment of the present application;
fig. 2 is a block diagram of a structure of an apparatus for recognizing video emotion fragments according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The invention discloses a method and a device for identifying video emotion fragments, applied in the process of identifying the emotion fragments in a video. In the prior art, emotion labels are annotated manually and the emotion fragments in a video are identified from those labels; but because manual annotation takes a long time, the identification period is long. To solve this problem, the embodiment of the invention provides a video emotion-fragment identification method. Many video-sharing platforms widely feature time-synchronized comments called 'bullet screens'; these comments are the audience's instant feelings while viewing, contain rich emotional expression, are consistent with the development of the video's emotion, and can therefore be used for video emotion analysis. The identification method accordingly analyzes the video to be analyzed on the basis of its bullet screens. The execution flow of the identification method is shown in figure 1 and comprises the following steps:
s101, determining a bullet screen emotion label of each emotion bullet screen in a video to be analyzed;
In the embodiment of the invention, each bullet screen of the video to be analyzed is acquired. Because bullet screens are the instant feelings of video users, not all audience members participate in bullet-screen interaction throughout the whole video, and check-in or sign-in bullet screens irrelevant to the video's semantics and emotions often appear; the topics of the bullet screens are therefore loose and the semantic noise considerable. Preferably, denoising is carried out first.
Further, the bullet-screen emotion label of each emotion bullet screen is determined according to a preset neural network model, where the preset neural network model is a bullet-screen emotion model that needs to be trained in advance. The training process of the bullet-screen emotion model is as follows:
First, a training data set is constructed: a bullet-screen emotion data set C_e with emotion labels is built from the unlabeled bullet-screen data set C, and the bullet-screen emotion model is trained on C_e. Considering the high cost of manual annotation, the emotion labels of C_e are obtained by a two-stage word-matching method, whose basic idea rests on the fact that explicit emotion expression is very common in bullet screens: bullet screens contain rich emotional expression, and those with explicit emotion expression can be recognized automatically by two-stage emotion-dictionary matching. In the first stage, emotion-polarity recognition is performed on all video bullet screens with a comprehensive polarity dictionary that merges a general emotion dictionary and a bullet-screen emotion dictionary, selecting the bullet screens that contain explicit emotion expression and whose positive or negative polarity can be recognized. In the second stage, fine-grained emotion recognition (five emotion categories: happiness, surprise, disgust, sadness and fear) is performed with a fine-grained emotion dictionary on the polarity-bearing bullet screens obtained in the first stage, finally yielding the emotion bullet screens with emotion labels. The bullet-screen data set C and the emotion bullet-screen data set C_e are expressed as follows:
C = {(C_1, T_1, I_1), …, (C_i, T_i, I_i), …, (C_N, T_N, I_N)}    (1)

C_e = {(C_1^e, T_1^e, I_1^e, y_1^e), …, (C_j^e, T_j^e, I_j^e, y_j^e), …, (C_M^e, T_M^e, I_M^e, y_M^e)}    (2)
where any element (C_i, T_i, I_i) of the bullet-screen data set C represents the bullet screen C_i posted at time T_i together with the scene image data I_i of the corresponding video key frame; any element (C_j^e, T_j^e, I_j^e, y_j^e) of the emotion bullet-screen data set C_e represents the bullet screen C_j^e posted at time T_j^e, the scene image data I_j^e of the corresponding video key frame, and the five-class emotion label y_j^e of bullet screen C_j^e. N and M denote the number of bullet-screen sentences and the number of emotion-bullet-screen sentences, respectively.
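The two-stage dictionary matching that produces C_e can be sketched roughly as follows; the dictionaries, their entries and the function name are illustrative stand-ins, since the patent does not disclose its lexicons:

```python
# Stage-1 polarity dictionary (illustrative): word -> "positive" / "negative"
POLARITY_DICT = {"love": "positive", "great": "positive",
                 "hate": "negative", "scary": "negative"}

# Stage-2 fine-grained dictionary (illustrative): word -> one of five emotions
FINE_DICT = {"love": "happiness", "great": "happiness",
             "hate": "disgust", "scary": "fear"}

def label_bullet_screens(bullet_screens):
    """Return (text, emotion_label) pairs for screens with explicit emotion."""
    labelled = []
    for text in bullet_screens:
        words = text.lower().split()
        # Stage 1: keep only screens whose emotion polarity is recognizable.
        if not any(w in POLARITY_DICT for w in words):
            continue
        # Stage 2: fine-grained label from the first matched emotion word.
        for w in words:
            if w in FINE_DICT:
                labelled.append((text, FINE_DICT[w]))
                break
    return labelled

pairs = label_bullet_screens(["I love this scene", "checking in", "so scary"])
```

Screens such as "checking in" carry no explicit emotion word and are filtered out in stage 1, mirroring the patent's exclusion of sign-in bullet screens.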
In the embodiment of the invention, the bullet-screen emotion model is trained on the emotion bullet-screen data set C_e. The input of the model is any element (C_j^e, T_j^e, I_j^e, y_j^e) of C_e, namely the bullet-screen text C_j^e posted at time T_j^e, the emotion label text y_j^e, and the visual data information I_j^e of the video key frame at that moment. The input data are characterized as follows: the pre-trained language model Bert yields the sentence-vector representation s_j and the word-vector representation W_j of the bullet-screen text, and likewise the sentence-vector representation of the bullet-screen emotion label; the visual image information I_j^e of the video key frame is processed by the existing deep network model VGG, and the output of the last convolution layer of the VGG model is extracted as its vector representation v_j. The relevant formulas are as follows:

s_j = Bert_sent(C_j^e)    (3)
W_j = Bert_word(C_j^e)    (4)
v_j = VGG(I_j^e)    (5)
Since the text semantics of a bullet screen are related to the semantics of the video scene at the corresponding moment, in the embodiment of the invention the scene visual representation v_j of the bullet screen is fused into the word vectors W_j of the bullet-screen text through an attention mechanism, helping the model focus on the words related to the bullet screen's visual scene and yielding bullet-screen word vectors with visual attention. The relevant formulas of the attention mechanism are as follows:
M = tanh(W_1 W_j + W_2 v_j)    (6)
α = softmax(W_3 M)    (7)
W̃_j = α · W_j    (8)

where W_1, W_2 and W_3 are the training parameters of the attention unit, which may be initialized empirically or on a case-by-case basis; tanh is the activation function of the deep neural network; M is an intermediate quantity; softmax denotes the normalization operation; and α expresses the visual-attention weight of each word of the bullet-screen text with respect to the visual information. Applying the weight α to the word vectors W_j of the bullet-screen text yields the bullet-screen word vectors with visual attention, W̃_j.
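The visual-attention weighting described above can be illustrated with a minimal sketch; for brevity the trainable matrices W_1, W_2 and W_3 are omitted (treated as identity), so this shows only the shape of the computation, not the trained model:

```python
import math

def softmax(xs):
    """Normalize a list of scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def visual_attention(word_vecs, visual_vec):
    """Weight each word vector by its attention score against the visual vector.

    Score = tanh(word . visual); alpha = softmax over the scores; output is
    each word vector scaled by its attention weight.
    """
    scores = [math.tanh(sum(w * v for w, v in zip(wv, visual_vec)))
              for wv in word_vecs]
    alpha = softmax(scores)                       # one weight per word
    return [[a * x for x in wv] for a, wv in zip(alpha, word_vecs)]

words = [[1.0, 0.0], [0.0, 1.0]]                  # toy word vectors
scene = [1.0, 0.0]                                # toy visual vector
weighted = visual_attention(words, scene)         # first word aligns with scene
```

The word aligned with the scene vector receives the larger attention weight, which is the intended effect of fusing visual information into the word representations.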
Considering the sequence information of the words contained in a sentence, the invention models the visually fused bullet-screen word vectors W̃_j with the recurrent neural network BiLSTM and a self-attention mechanism, obtaining the fine-grained semantic representation r_j^f of the bullet screen.
As shown in formula (3), the bullet-screen sentence vector s_j obtained by the Bert model is the coarse-grained semantic representation of the whole bullet-screen sentence. It is concatenated with the fine-grained representation r_j^f under a weight to obtain the target semantic representation r_j of the bullet screen:

r_j = (γ · s_j) ⊕ r_j^f

where γ is a weight-adjustment parameter and the sign '⊕' denotes the concatenation operation of tensors.
Subsequently, the target semantic representation r_j is trained and output through the fully connected layer FC, obtaining the emotion probability P of the bullet screen:

P(y | C_j^e, I_j^e) = softmax(FC(r_j))

where y denotes the emotion category of the bullet screen and P(y | C_j^e, I_j^e) is the emotion-category probability of bullet screen C_j^e computed from the input. FC is a single-layer fully connected network structure; its output yields the emotion probability P of each emotion bullet screen. The emotion model is trained by minimizing the following objective function:
L = softmax_cross_entropy(y_j^e, ŷ_j^e)

where y_j^e is the original emotion label of emotion bullet screen C_j^e and ŷ_j^e is the emotion probability output by the model after training; softmax_cross_entropy is the cross-entropy loss function, which computes the cross-entropy loss between the original emotion label and the emotion prediction of each bullet screen. To minimize the objective function, an Adam optimizer is adopted to iteratively update each parameter in the model (implemented with TensorFlow automatic differentiation), thereby training the bullet-screen emotion recognition model.
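A minimal sketch of the softmax cross-entropy computation behind the objective function, in pure Python with toy logits; the Adam parameter updates are omitted, since in practice they are left to the framework's automatic differentiation:

```python
import math

EMOTIONS = ["happiness", "surprise", "disgust", "sadness", "fear"]

def softmax(logits):
    """Turn raw model scores into class probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, true_index):
    """Cross-entropy between softmaxed logits and a one-hot label."""
    probs = softmax(logits)
    return -math.log(probs[true_index])

# A bullet screen whose model output favours "happiness" (index 0):
logits = [2.0, 0.1, 0.1, 0.1, 0.1]
loss = cross_entropy(logits, 0)   # low loss when prediction matches the label
```

When the true label matches the dominant logit the loss is small; a mismatched label yields a larger loss, which is the gradient signal Adam minimizes.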
Finally, the trained bullet-screen emotion recognition model performs emotion prediction on any bullet screen (C_k, T_k, I_k) in the bullet-screen data set C, outputting P(y | C_k, I_k) and from it the emotion probability vector e_k.

Here P(y | C_k, I_k) is the model's prediction result for bullet screen C_k, normalized so that the proportions of the five classes sum to 1. The invention further processes the prediction result for bullet screen C_k to obtain its emotion probability vector e_k, a five-dimensional emotion vector whose value in each dimension can be regarded as the emotion-semantic distribution of bullet screen C_k in that dimension; it measures the emotion-semantic value of C_k in each emotion dimension and also represents the emotion label of C_k.
S102, segmenting the video to be analyzed to obtain video segments to be analyzed;
In the embodiment of the invention, since bullet-screen comments are the instant responses of the audience, the emotions they contain are likewise instantaneous; video emotion analysis based on continuous time periods is therefore most suitable. In fact, a video contains many relatively independent scene segments whose content and topics are relatively self-contained and evolve with the development of the video's scenes; that is, changes in the video plot are generally consistent with switches of the video scene, so scene changes can serve as the basis for segmenting video clips. Compared with conventional equal-length segmentation, segmentation from the viewpoint of scene switching is more suitable for this application.
First, an object-recognition method based on bottom-up and top-down attention is applied to the visual data of each video key frame, obtaining the visual words of each frame. These visual words can be regarded as the frame's visual semantics, describing its visual scene. A significant change in the visual-word text of two adjacent frames means the described scene has changed, and that moment can serve as a video segmentation point.
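A rough sketch of taking a significant change in adjacent frames' visual words as a segmentation point; the Jaccard-based difference measure and its 0.5 threshold are illustrative assumptions, since the patent only specifies a preset difference threshold:

```python
def jaccard(a, b):
    """Set overlap of two visual-word lists (1.0 = identical, 0.0 = disjoint)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

def find_cut_points(frame_words, diff_threshold=0.5):
    """Indices where adjacent frames' visual words differ enough to cut."""
    cuts = []
    for i in range(1, len(frame_words)):
        difference = 1.0 - jaccard(frame_words[i - 1], frame_words[i])
        if difference > diff_threshold:
            cuts.append(i)
    return cuts

frames = [["man", "car", "street"],
          ["man", "car", "road"],
          ["beach", "sea", "sun"]]      # scene switches before index 2
cuts = find_cut_points(frames)
```

Only the transition into the beach scene exceeds the difference threshold, so a single segmentation point is produced there.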
Furthermore, to improve segmentation accuracy, the invention also corrects the segmentation points from the perspective of video semantics. This operation is implemented by means of the bullet screens, which reflect those semantics: a segmentation point is a plot-transition point of the video, so the bullet-screen semantics at that moment should be relatively loose; if the bullet-screen semantics there are instead concentrated and consistent, the segmentation point is corrected. That is, for any video segment S_i obtained in the previous stage, the pairwise semantic similarities of all its bullet screens are computed to build the semantic-similarity matrix of segment S_i, from which the average bullet-screen semantic similarity of S_i is obtained. The average similarity of each video segment S_i is examined, segments with very high average semantic similarity are discarded (using an empirical threshold determined through practical experiments), and finally a set {s_p} of video segments to be analyzed with relatively independent, natural plots is obtained.
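The correction step can be sketched as follows: compute the average pairwise cosine similarity of a segment's bullet-screen vectors and discard segments whose bullet-screen semantics are too uniform. The cosine measure and the 0.9 threshold are illustrative stand-ins for the patent's empirical threshold:

```python
import math

def cosine(u, v):
    """Cosine similarity of two bullet-screen text vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def avg_pairwise_similarity(vectors):
    """Mean cosine similarity over all bullet-screen pairs in one segment."""
    pairs = [(i, j) for i in range(len(vectors))
             for j in range(i + 1, len(vectors))]
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

def keep_segments(segments, sim_threshold=0.9):
    """Drop segments whose bullet-screen semantics are too uniform
    (a cut placed where the discussion is still concentrated)."""
    return [s for s in segments if avg_pairwise_similarity(s) <= sim_threshold]

uniform = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]   # toy bullet-screen vectors
mixed = [[1.0, 0.0], [0.0, 1.0]]
kept = keep_segments([uniform, mixed])
```

The segment whose bullet screens all point the same way is rejected as a mis-placed cut; the one with loose semantics survives.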
S103, calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
In the embodiment of the invention, the video to be analyzed contains complex multimodal content and its emotions are complex; the emotion bullet screens of the video's audience can be regarded as an indirect reflection of the video's emotion and are therefore suitable for video emotion analysis. Take any segment s_i of the set {s_p} of video segments to be analyzed, let its emotion bullet screens be {c_1, …, c_u}, and let the corresponding emotion vectors be {e_1, …, e_u}, where each e_k is the five-dimensional emotion probability vector of bullet screen c_k. Summing the emotion vectors of all bullet screens of the segment dimension by dimension yields the segment emotion vector e(s_i), as shown in the following formula:

e(s_i) = Σ_{k=1}^{u} e_k    (16)

where u is the number of emotion bullet screens in segment s_i; the vector e(s_i) is the segment emotion vector of s_i and represents the segment's emotion label in each emotion dimension.
In information theory, entropy describes the disorder of a system: the greater the entropy, the more disordered the system and the less information it carries; the smaller the entropy, the more ordered the system and the more information it carries. In the segment emotion vector e(s_i), the concentration of the emotion-semantic distribution over the emotion dimensions can likewise be measured by entropy, which is then used to judge the emotional tendency of the segment s_i to be analyzed; this quantity is called the emotion entropy of segment s_i in the present invention. Normalizing e(s_i) into proportions p_j over the five emotion dimensions and applying the entropy formula yields the emotion entropy of segment s_i, as shown in the following formula:

H(e) = − Σ_{j=1}^{5} p_j log p_j    (17)
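A small sketch of the segment emotion vector (dimension-wise sum of bullet-screen emotion vectors) and its Shannon entropy, with toy five-dimensional vectors:

```python
import math

def segment_emotion_vector(bullet_vectors):
    """Sum the five-dimensional bullet-screen emotion vectors dimension-wise."""
    return [sum(dims) for dims in zip(*bullet_vectors)]

def emotion_entropy(vector):
    """Shannon entropy of the normalized segment emotion vector."""
    total = sum(vector)
    probs = [v / total for v in vector if v > 0]
    return -sum(p * math.log(p) for p in probs)

# Two toy emotion bullet screens, both dominated by "happiness" (dimension 0):
screens = [[0.9, 0.05, 0.02, 0.02, 0.01],
           [0.8, 0.10, 0.05, 0.03, 0.02]]
e_si = segment_emotion_vector(screens)
h = emotion_entropy(e_si)
```

Because both bullet screens concentrate on the same dimension, the entropy is well below the uniform-distribution maximum of log 5, signalling an ordered, single-emotion segment.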
and S104, identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy.
In the embodiment of the invention, because the video emotion is complex, the emotional tendency of the emotion video segments is not all in a single emotion category, and the video segments with complex emotion are common. The present application is directed to finding video segments that contain no more than two distinct emotional tendencies: one is a video emotion fragment with only one obvious emotional tendency; the other is a video emotion fragment with two obvious emotional tendencies.
The processing of video emotion fragments with a single emotional tendency is as follows: when the emotion entropy H(e) of the video segment s_i to be analyzed is very small, less than the emotion-entropy threshold H(e)_threshold, the emotion semantics of all dimensions of the segment's emotion bullet screens tend to be consistent, which means the video segment s_i to be analyzed contains only one significant emotional tendency.

On this basis, when the emotion entropy of segment s_i is only slightly above the threshold H(e)_threshold, the segment's emotional tendency is not necessarily unique, and a further judgment is needed: in the segment emotion vector e(s_i), when the maximum component of e(s_i) is far greater than its second-largest component, segment s_i still has only one emotional tendency, namely the emotion category of the dimension to which the maximum component belongs, as shown in the following formula:

max(e(s_i)) / second_max(e(s_i)) > θ    (18)
The processing of video emotion fragments containing two emotional tendencies is as follows: as shown in formula (18), when the emotion entropy of the video segment s_i to be analyzed is only slightly above the threshold H(e)_threshold, the segment's emotional tendency is not necessarily unique. When, in the emotion vector e(s_i) of segment s_i, the difference between the maximum component and the second-largest component is small, the emotion categories of the dimensions to which these two components belong can both be regarded as emotional tendencies of segment s_i; that is, the video emotion segment s_i to be analyzed has two main emotional tendencies.
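The decision rule above — one emotion when the entropy is below threshold or the max/second-max ratio is large, two emotions otherwise — can be sketched as follows; both threshold values are illustrative, as the patent leaves them as preset values:

```python
import math

EMOTIONS = ["happiness", "surprise", "disgust", "sadness", "fear"]

def emotion_entropy(vector):
    """Shannon entropy of the normalized segment emotion vector."""
    total = sum(vector)
    probs = [v / total for v in vector if v > 0]
    return -sum(p * math.log(p) for p in probs)

def classify_segment(e_si, h_threshold=0.6, ratio_threshold=2.0):
    """Return the segment's dominant emotion(s) per the entropy/ratio rule."""
    ranked = sorted(range(len(e_si)), key=lambda i: e_si[i], reverse=True)
    first, second = ranked[0], ranked[1]
    if emotion_entropy(e_si) < h_threshold:
        return [EMOTIONS[first]]                    # one clear tendency
    if e_si[first] / e_si[second] > ratio_threshold:
        return [EMOTIONS[first]]                    # still one tendency
    return [EMOTIONS[first], EMOTIONS[second]]      # two tendencies

single = classify_segment([9.0, 0.2, 0.2, 0.2, 0.2])   # concentrated segment
dual = classify_segment([4.0, 3.5, 0.5, 0.5, 0.5])     # two close components
```

The concentrated segment passes the entropy test and yields one label; the segment with two comparable components fails both tests and yields two labels.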
Considering that bullet-screen topics are loose and semantic noise is common, the invention also denoises from the semantic angle, using the text vectors of the emotion bullet screens in each video segment s_i to be analyzed. The emotion-semantic similarity matrix of segment s_i is a symmetric matrix whose entries represent the pairwise semantic relevance of the emotion bullet screens. Its upper-triangular part is analyzed, and if the semantic similarity of two emotion bullet screens is lower than the in-segment bullet-screen semantic-similarity threshold (determined according to actual experiments), they are regarded as semantic distortion points and the corresponding emotion bullet screens are deleted. Through this operation, the embodiment of the invention achieves better robustness.
The invention discloses a method for identifying video emotion fragments, which comprises the following steps: determining a bullet screen emotion label for each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain each video segment to be analyzed; calculating a segment emotion vector and an emotion entropy for each video segment to be analyzed according to the bullet screen emotion labels in that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and the emotion entropies. In this identification method, the video to be analyzed is divided into a plurality of video segments, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified from the segment emotion vectors obtained from the bullet screens and from the emotion entropies. This shortens the recognition period of emotion fragments, solving the problem of long recognition periods caused by the long time needed to mark emotion labels in manual identification.
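The third step, computing a segment emotion vector from the bullet screen emotion labels that fall inside a segment, can be sketched as follows. The timestamped-pair input format is assumed for illustration and is not the patent's data representation.

```python
from collections import Counter

def segment_emotion_vector(bullets, start, end, num_emotions):
    """Emotion distribution of one video segment.

    bullets: (timestamp, emotion_label) pairs for the whole video, where
    emotion_label is an int in [0, num_emotions). Component k of the
    result is the share of emotion bullet screens in [start, end) that
    carry emotion label k.
    """
    counts = Counter(lab for t, lab in bullets if start <= t < end)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_emotions
    return [counts.get(k, 0) / total for k in range(num_emotions)]
```

The emotion entropy of the resulting vector can then be computed with the usual Shannon formula, and the two values drive the identification step.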
Based on the identification method, for the emotion-rich bullet screens in the video, an attention mechanism and a multi-modal fusion idea are applied to fuse the bullet screen emotion semantics with the visual information of the video scene in time sequence, realizing an enhanced representation of the bullet screen emotion semantics. This representation, which integrates text and visual information, is then used to judge the emotion of the video segments obtained by segmentation based on visual semantics and scene switching, accurately identifying the emotion fragments in the video and making up for the deficiency that existing video emotion understanding does not identify video emotion fragments.
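A minimal, self-contained sketch of what such attention-based fusion of a bullet screen's text vector with its scene's visual vector might look like is given below. The single-head formulation and the weight matrices are assumptions for illustration; they stand in for the patent's trained multi-modal model, not reproduce it.

```python
import numpy as np

def fuse_text_visual(text_vec, visual_vec, w_q, w_k, w_v):
    """Single-head attention fusion of one bullet screen's text vector
    with the visual vector of the scene at its generation moment.

    w_q, w_k, w_v would be learned projection matrices in practice;
    here they are plain (d, d) arrays supplied by the caller.
    Returns the attention-enhanced text representation.
    """
    x = np.stack([text_vec, visual_vec])            # (2, d) token sequence
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[1])          # scaled dot products
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax
    fused = weights @ v                             # attended representations
    return fused[0]                                 # enhanced text row
```

With identity projections, the result is a convex combination of the text and visual vectors, weighted toward whichever modality the text vector attends to more strongly.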
Based on the foregoing identification method for video emotion fragments, an embodiment of the present invention further provides an identification apparatus for video emotion fragments, where a structural block diagram of the identification apparatus is shown in fig. 2, and the identification apparatus includes:
a determination module 201, a segmentation module 202, a calculation module 203 and a recognition module 204.
Wherein:
the determining module 201 is configured to determine a bullet screen emotion tag of each emotion bullet screen in a video to be analyzed;
the segmentation module 202 is configured to segment the video to be analyzed to obtain each video segment to be analyzed;
the calculating module 203 is configured to calculate segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion tags in each video segment to be analyzed;
the identifying module 204 is configured to identify an emotion fragment in each to-be-analyzed video fragment according to the fragment emotion vector and the emotion entropy.
The invention discloses a video emotion fragment recognition device, which operates as follows: determining a bullet screen emotion label for each emotion bullet screen in a video to be analyzed; segmenting the video to be analyzed to obtain each video segment to be analyzed; calculating a segment emotion vector and an emotion entropy for each video segment to be analyzed according to the bullet screen emotion labels in that segment; and identifying the emotion fragments among the video segments to be analyzed according to the segment emotion vectors and the emotion entropies. In this identification device, the video to be analyzed is divided into a plurality of video segments, and the segment emotion vector and emotion entropy of each segment are calculated; the emotion fragments are then identified from the segment emotion vectors obtained from the bullet screens and from the emotion entropies. This shortens the recognition period of emotion fragments, solving the problem of long recognition periods caused by the long time needed to mark emotion labels in manual identification.
In this embodiment of the present invention, the determining module 201 includes:
an acquisition unit 205, a screening unit 206 and a tag determination unit 207.
Wherein:
the obtaining unit 205 is configured to obtain each barrage of the video to be analyzed;
the screening unit 206 is configured to screen the bullet screens to obtain emotion bullet screens;
the label determining unit 207 is configured to determine a bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
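The label determining unit relies on a preset (pre-trained) neural network model. A minimal stand-in for the output layer of such a model might look like the following; the linear-softmax head and all parameter names are assumptions, since the model's actual architecture is described elsewhere in the specification.

```python
import numpy as np

def predict_emotion_label(text_vec, weight, bias, label_names):
    """Label one emotion bullet screen with a pre-trained classifier head.

    text_vec: semantic vector of the bullet screen text, shape (d,).
    weight, bias: parameters of the (preset, already trained) model's
    output layer, shapes (d, k) and (k,). This sketch stands in for the
    patent's neural network model and is not its real architecture.
    """
    logits = text_vec @ weight + bias
    probs = np.exp(logits - logits.max())   # numerically stable softmax
    probs /= probs.sum()
    return label_names[int(np.argmax(probs))]
```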
In this embodiment of the present invention, the dividing module 202 includes:
a semantic determination unit 208, a first judgment unit 209, and a slicing unit 210.
Wherein:
the semantic determining unit 208 is configured to determine visual semantics of each frame in the video to be analyzed;
the first judging unit 209 is configured to sequentially compare the visual semantics of the adjacent frames, and judge whether a difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold;
and the segmentation unit 210 is configured to, if so, use the adjacent frames as segmentation points and segment the video there, so as to obtain each video segment to be analyzed.
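The segmentation performed by units 208-210, comparing the visual semantics of adjacent frames against a difference threshold, can be sketched as follows. The cosine-distance difference measure is an assumption; the patent only speaks of a "degree of difference" between the visual semantics of adjacent frames.

```python
import numpy as np

def find_cut_points(frame_features, diff_threshold):
    """Find segmentation points by comparing adjacent frames.

    frame_features: (n, d) array of per-frame visual-semantic vectors
    (e.g. embeddings from a vision model; the representation is assumed).
    Returns indices i where a boundary falls between frame i-1 and i.
    """
    f = np.asarray(frame_features, dtype=float)
    unit = f / np.clip(np.linalg.norm(f, axis=1, keepdims=True), 1e-12, None)
    cos_sim = (unit[:-1] * unit[1:]).sum(axis=1)   # neighbor similarity
    diff = 1.0 - cos_sim                           # cosine distance
    return [int(i) + 1 for i in np.nonzero(diff > diff_threshold)[0]]

def split_segments(n_frames, cuts):
    """Turn cut indices into (start, end) frame spans."""
    bounds = [0] + list(cuts) + [n_frames]
    return list(zip(bounds[:-1], bounds[1:]))
```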
In an embodiment of the present invention, the identifying module 204 includes:
a second judging unit 211, a first determining unit 212, a third judging unit 213, and a second determining unit 214.
Wherein:
the second judging unit 211 is configured to judge whether the emotion entropy is smaller than a preset emotion entropy threshold;

the first determining unit 212 is configured to, if so, determine that the current video segment to be analyzed contains one emotion; otherwise,

the third judging unit 213 is configured to judge whether the ratio of the maximum component to the second largest component in the current segment emotion vector is greater than a preset ratio threshold;

the second determining unit 214 is configured to, if so, determine that the current video segment to be analyzed contains one emotion, or otherwise determine that the current video segment to be analyzed contains two emotions.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. For the device-like embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Finally, it should also be noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
For convenience of description, the above devices are described as being divided into various units by function, each described separately. Of course, when implementing the invention, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The method and the device for identifying video emotion fragments provided by the invention have been described in detail above. A specific example is applied herein to explain the principle and the implementation of the invention, and the description of the embodiments is only intended to help understand the method and its core idea. Meanwhile, a person skilled in the art may, following the idea of the present invention, vary the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.
Claims (8)
1. A method for identifying video emotion fragments is characterized by comprising the following steps:
determining the bullet screen emotion labels of all emotion bullet screens in a video to be analyzed;
segmenting the video to be analyzed to obtain each video segment to be analyzed;
calculating segment emotion vectors and emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy;
identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy, wherein the identification comprises the following steps:
judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
if yes, judging that the current video segment to be analyzed contains one emotion; otherwise,
if not, judging whether the ratio of the maximum component to the second maximum component in the current segment emotion vector is larger than a preset proportion threshold value or not;
if yes, judging that the current video segment to be analyzed contains one emotion; otherwise, judging that the current video segment to be analyzed contains two emotions.
2. The method of claim 1, wherein determining the bullet screen emotion label for each emotion bullet screen in the video to be analyzed comprises:
acquiring each bullet screen of the video to be analyzed;
screening the bullet screens to obtain all emotion bullet screens;
and determining the bullet screen emotion label of each emotion bullet screen according to a preset neural network model.
3. The method of claim 2, wherein determining the bullet screen sentiment label of each sentiment bullet screen according to a preset neural network model comprises:
determining a target semantic representation of each emotion bullet screen, wherein the target semantic representation is obtained by splicing the fine-granularity semantic representation and the coarse-granularity semantic representation of the corresponding emotion bullet screen;
determining visual vector representation of scene image data at each emotion bullet screen generation moment;
and transmitting the target semantic representation and the visual vector representation to the preset barrage emotion recognition neural network model to obtain an emotion label corresponding to the emotion barrage.
4. The method of claim 1, wherein segmenting the video to be analyzed to obtain each video segment to be analyzed comprises:
determining the visual semantics of each frame in the video to be analyzed;
sequentially comparing the visual semantics of the adjacent frames, and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and if so, segmenting the adjacent frames as segmentation points to obtain each video segment to be analyzed.
5. The method of claim 4, further comprising:
acquiring the bullet screen semanteme of the adjacent frames;
and correcting the segmentation points according to the bullet screen semantics.
6. An apparatus for identifying emotion fragments in a video, comprising:
the determining module is used for determining the bullet screen emotion labels of all emotion bullet screens in the video to be analyzed;
the segmentation module is used for segmenting the video to be analyzed to obtain each video segment to be analyzed;
the computing module is used for computing the segment emotion vectors and the emotion entropies of the video segments to be analyzed according to the bullet screen emotion labels in the video segments to be analyzed;
the identification module is used for identifying the emotion fragments in the video fragments to be analyzed according to the fragment emotion vectors and the emotion entropy;
wherein the identification module comprises:
the second judgment unit is used for judging whether the emotion entropy is smaller than a preset emotion entropy threshold value or not;
the first judging unit is used for, if so, judging that the current video segment to be analyzed contains one emotion; otherwise,

the third judging unit is used for judging whether the ratio of the maximum component to the second largest component in the current segment emotion vector is greater than a preset proportional threshold;

and the second judging unit is used for, if so, judging that the current video segment to be analyzed contains one emotion, or otherwise judging that the current video segment to be analyzed contains two emotions.
7. The apparatus of claim 6, wherein the determining module comprises:
the acquisition unit is used for acquiring each bullet screen of the video to be analyzed;
the screening unit is used for screening each bullet screen to obtain each emotion bullet screen;
and the label determining unit is used for determining the bullet screen emotion labels of each emotion bullet screen according to the preset neural network model.
8. The apparatus of claim 6, wherein the segmentation module comprises:
the semantic determining unit is used for determining the visual semantics of each frame in the video to be analyzed;
the first judgment unit is used for sequentially comparing the visual semantics of the adjacent frames and judging whether the difference degree of the visual semantics of the adjacent frames is greater than a preset difference degree threshold value;
and the segmentation unit is used for, if so, using the adjacent frames as segmentation points to segment the video, so as to obtain each video segment to be analyzed.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010645824.XA CN111860237B (en) | 2020-07-07 | 2020-07-07 | Video emotion fragment identification method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111860237A CN111860237A (en) | 2020-10-30 |
CN111860237B true CN111860237B (en) | 2022-09-06 |
Family
ID=73153438
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010645824.XA Active CN111860237B (en) | 2020-07-07 | 2020-07-07 | Video emotion fragment identification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111860237B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112364743A (en) * | 2020-11-02 | 2021-02-12 | 北京工商大学 | Video classification method based on semi-supervised learning and bullet screen analysis |
CN112699831B (en) * | 2021-01-07 | 2022-04-01 | 重庆邮电大学 | Video hotspot segment detection method and device based on barrage emotion and storage medium |
CN113221689B (en) * | 2021-04-27 | 2022-07-29 | 苏州工业职业技术学院 | Video multi-target emotion degree prediction method |
CN114339375B (en) * | 2021-08-17 | 2024-04-02 | 腾讯科技(深圳)有限公司 | Video playing method, method for generating video catalogue and related products |
CN113656643B (en) * | 2021-08-20 | 2024-05-03 | 珠海九松科技有限公司 | Method for analyzing film viewing mood by using AI |
CN117710777B (en) * | 2024-02-06 | 2024-06-04 | 腾讯科技(深圳)有限公司 | Model training method, key frame extraction method and device |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108513176A (en) * | 2017-12-06 | 2018-09-07 | 北京邮电大学 | A kind of socialization video subject extraction system and method based on topic model |
CN108537139A (en) * | 2018-03-20 | 2018-09-14 | 校宝在线(杭州)科技股份有限公司 | A kind of Online Video wonderful analysis method based on barrage information |
CN108737859A (en) * | 2018-05-07 | 2018-11-02 | 华东师范大学 | Video recommendation method based on barrage and device |
CN109862397A (en) * | 2019-02-02 | 2019-06-07 | 广州虎牙信息科技有限公司 | A kind of video analysis method, apparatus, equipment and storage medium |
CN110020437A (en) * | 2019-04-11 | 2019-07-16 | 江南大学 | The sentiment analysis and method for visualizing that a kind of video and barrage combine |
CN110113659A (en) * | 2019-04-19 | 2019-08-09 | 北京大米科技有限公司 | Generate method, apparatus, electronic equipment and the medium of video |
CN110198482A (en) * | 2019-04-11 | 2019-09-03 | 华东理工大学 | A kind of video emphasis bridge section mask method, terminal and storage medium |
CN110263215A (en) * | 2019-05-09 | 2019-09-20 | 众安信息技术服务有限公司 | A kind of video feeling localization method and system |
CN110569354A (en) * | 2019-07-22 | 2019-12-13 | 中国农业大学 | Barrage emotion analysis method and device |
CN110852360A (en) * | 2019-10-30 | 2020-02-28 | 腾讯科技(深圳)有限公司 | Image emotion recognition method, device, equipment and storage medium |
CN111163366A (en) * | 2019-12-30 | 2020-05-15 | 厦门市美亚柏科信息股份有限公司 | Video processing method and terminal |
- 2020-07-07 CN CN202010645824.XA patent/CN111860237B/en active Active
Non-Patent Citations (2)
Title |
---|
Chenchen Li et al., "Visual-Texual Emotion Analysis With Deep Coupled Video and Danmu Neural Networks", IEEE Transactions on Multimedia, vol. 22, no. 6, pp. 1634-1646, 30 Jun. 2020 *
Deng Yang et al., "Video segment recommendation model based on bullet screen sentiment analysis" (《基于弹幕情感分析的视频片段推荐模型》), Journal of Computer Applications (《计算机应用》), vol. 37, no. 4, pp. 1065-1070, 1134, 10 Apr. 2017 *
Also Published As
Publication number | Publication date |
---|---|
CN111860237A (en) | 2020-10-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111860237B (en) | Video emotion fragment identification method and device | |
Yang et al. | Video captioning by adversarial LSTM | |
Poria et al. | Context-dependent sentiment analysis in user-generated videos | |
CN106878632B (en) | Video data processing method and device | |
CN113255755B (en) | Multi-modal emotion classification method based on heterogeneous fusion network | |
CN112199956B (en) | Entity emotion analysis method based on deep representation learning | |
CN110825867B (en) | Similar text recommendation method and device, electronic equipment and storage medium | |
US11727915B1 (en) | Method and terminal for generating simulated voice of virtual teacher | |
WO2023124647A1 (en) | Summary determination method and related device thereof | |
CN113657115A (en) | Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion | |
CN113761377A (en) | Attention mechanism multi-feature fusion-based false information detection method and device, electronic equipment and storage medium | |
CN115580758A (en) | Video content generation method and device, electronic equipment and storage medium | |
CN115408488A (en) | Segmentation method and system for novel scene text | |
CN115830610A (en) | Multi-mode advertisement recognition method and system, electronic equipment and storage medium | |
CN114880496A (en) | Multimedia information topic analysis method, device, equipment and storage medium | |
CN113268592B (en) | Short text object emotion classification method based on multi-level interactive attention mechanism | |
Zaoad et al. | An attention-based hybrid deep learning approach for bengali video captioning | |
US20240037941A1 (en) | Search results within segmented communication session content | |
CN115169472A (en) | Music matching method and device for multimedia data and computer equipment | |
Zhang et al. | Recognition of emotions in user-generated videos through frame-level adaptation and emotion intensity learning | |
CN114676699A (en) | Entity emotion analysis method and device, computer equipment and storage medium | |
Gomes Jr et al. | Framework for knowledge discovery in educational video repositories | |
Wang et al. | RSRNeT: a novel multi-modal network framework for named entity recognition and relation extraction | |
Wang et al. | Multimodal Cross-Attention Bayesian Network for Social News Emotion Recognition | |
EP4248415A1 (en) | Automated video and audio annotation techniques |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |