CN116226347A - Fine granularity video emotion content question-answering method and system based on multi-mode data - Google Patents

Fine granularity video emotion content question-answering method and system based on multi-mode data

Info

Publication number
CN116226347A
CN116226347A; CN202310184746.1A; CN202310184746A
Authority
CN
China
Prior art keywords
video
features
question
answer
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310184746.1A
Other languages
Chinese (zh)
Inventor
马翠霞
秦航宇
杜肖兵
邓小明
王宏安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Publication of CN116226347A publication Critical patent/CN116226347A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention belongs to the field of video question answering, and particularly relates to a fine-grained video emotional content question-answering method and system based on multimodal data. A video emotion reasoning baseline model is built on an episodic memory network, a multi-branch processing module for visual, audio and text data is designed, and temporal dependencies in the multimodal data are encoded with a Transformer encoder, so that the extracted multimodal features contain emotional content from multiple perspectives and the fine-grained video emotional content question-answering task can be completed accurately. The invention uses a Transformer encoder to learn temporal association relations over the video, audio and text sequences and extracts high-dimensional multimodal features related to emotion classification; these temporal relations are important for analyzing the emotional information contained in the video. The method effectively improves the accuracy of the multimodal fine-grained video emotional content question-answering task.

Description

Fine granularity video emotion content question-answering method and system based on multi-mode data
Technical Field
The invention belongs to the field of video question answering, and particularly relates to a fine-grained video emotional content question-answering method and system based on multimodal data.
Background
In recent years, emotion analysis in movies and television programs has gained increasing attention in fields such as affective computing and artificial intelligence system design. Movies and television programs contain rich interactive scenes and character relationships, and the characters in such videos experience the same emotions as humans in the real world, such as excitement caused by a reward or sadness caused by a separation. Human-centered video scenes are closely related to social scenes in real life, which provides a platform for training artificial intelligence systems to understand the high-level semantic information behind emotions, namely the causes and intentions that induce an emotion, the behavioral motivations of the characters, and so on, and the rich emotions contained in video provide data support for studying the emotional content of video.
Intelligent systems need the ability to understand video scenes at a fine granularity: not only recognizing emotion categories, but also inferring the causes behind an emotion, the intentions, and the behavioral motivations of the characters in an interpretable way. In work that studies the emotional content of video, the multimodal information of the video is usually used as the source of analysis data, as in video emotion recognition. Video emotion recognition methods based on multimodal information understand the emotions conveyed by video content mainly through audio, text, visual, and other information. For example, emotion in video has been recognized with a cascading structural model composed of RNNs based on video context information and facial expression information of the characters; video emotion has been recognized with a dual-stream coding model that integrates facial expression features of the characters with video background information; a collaborative memory network has been adopted to recognize emotion in multimodal information based on the interaction between visual content and textual vocabulary; and a multimodal transformation network has been used to map visual and auditory representations into the same feature space for video emotion recognition. Research on video emotion understanding has focused mainly on emotion recognition methods, and research on emotion reasoning in video is scarce. With the development of intelligent interactive applications, researchers have begun exploring methods for inferring the potential causes behind emotions from multimodal scenes, such as extracting emotion-cause pairs from the multimodal information in video conversations and inferring the causes behind emotions through understanding of the multimodal information. Therefore, on the basis of video emotion recognition from multimodal information, further research on video emotion reasoning methods provides support for a deep understanding of the causes that induce emotions in multimodal video content.
To improve the intelligence of human-computer interaction, the emotional needs and intentions of users must be understood correctly on the basis of recognizing video emotions; this is the main research content of video emotion reasoning. Thus, in further research on video emotion understanding technology, it is necessary to study how to infer emotion in video interactive scenes in an interpretable way, i.e. to understand the causes, intentions, and behavioral motivations behind the emotion in the video. Video emotion reasoning needs to be carried out on human-centered video interactive scenes; only when people are placed in specific situations can the multimodal information of the interaction process be fully exploited to understand emotion and to reason about the multiple aspects related to it.
1) Prior art: emotion cause extraction and emotion reasoning
Discovering the potential causes behind a particular emotional expression in a conversational text context or a multimodal scenario has been a popular topic in the affective computing field. Emotion cause extraction (ECE) is a refinement of emotion analysis that aims to explore the potential causes behind certain emotional expressions in a dialogue. Rui Xia et al. (reference: Xia R, Zhang M, Ding Z. RTHN: A RNN-Transformer hierarchical network for emotion cause extraction [C]// Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. ijcai.org, 2019: 5285-5291.) consider the relationships between multiple clauses in a conversation for emotion cause extraction. Since dialogue in video is itself composed of multimodal information, Rui Xia et al. (reference: Wang F, Ding Z, Xia R, et al. Multimodal emotion-cause pair extraction in conversations [J]. CoRR, 2021, abs/2110.08020.) further propose a multimodal emotion-cause pair extraction task and jointly extract emotions and the causes that evoke them from the multimodal information of the dialogue. Multimodal emotion-cause pair extraction is a further development of the ECE task; it requires a model with strong video multimodal understanding and emotion inference capabilities, because emotion causes do not necessarily come from the text information alone and may come from the visual scene.
In addition, opinion mining is an important problem in text emotion analysis, where emotion inference is a subtask dealing with the question of who holds an opinion and why. Although research on video emotion reasoning tasks is still relatively scarce in the field of video emotion understanding, the related work shows that emotion reasoning on video with multimodal information has profound research significance and application value. Therefore, an interpretable way to infer emotion in video is needed in order to fully understand the causes and intentions of emotion in multimodal video content and the behavioral motivations of the characters.
2) Prior art: multimodal emotion analysis
Multimodal emotion analysis in video aims to understand the emotions conveyed by video content through audio, text, and visual information. Sun Man-Chin et al. (reference: Sun M C, Hsu S H, Yang M C, et al. Context-aware cascade attention-based RNN for video emotion recognition [C]// 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia). IEEE, 2018: 1-6.) propose a cascading structural model consisting of two RNNs that uses video context information and facial expression information of the characters to identify emotion in the video. Jiyoung Lee et al. (reference: Lee J, Kim S, Kim S, et al. Context-aware emotion recognition networks [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 10143-10152.) design a deep network that integrates facial expression features of the characters and video background information in a joint and boosted manner to recognize video emotion. Nan Xu et al. (reference: Xu N, Mao W, Chen G. A co-memory network for multimodal sentiment analysis [C]// The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018. ACM, 2018: 929-932.) iteratively model the interactions between visual content and textual vocabulary using a collaborative memory network for multimodal sentiment analysis. Qi Fan et al. (reference: Qi F, Yang X, Xu C. Zero-shot video emotion recognition via multimodal protagonist-aware transformer network [C]// MM '21: ACM Multimedia Conference, Virtual Event, China, October 20-24, 2021. ACM, 2021: 1074-1083.) propose a multimodal transformation network that uniformly maps visual and auditory representations to the same feature space for video emotion recognition.
However, the multimodal emotion analysis studies described above focus mainly on emotion recognition tasks supervised by emotion labels. In recent years, some work has begun to focus on a further understanding of video emotions. Unlike direct identification of the emotion categories of characters in video, Guangyao Shen et al. (reference: Shen G, Wang X, Duan X, et al. MEmoR: A dataset for multimodal emotion reasoning in videos [C]// MM '20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020. ACM, 2020: 493-502.) propose character emotion reasoning based on video multimodal information; for the problem of recognizing the emotion of characters lacking multimodal information in video interactive scenes, they infer emotion with the help of the emotions of other characters in the same scene, a process that does not involve understanding the causes, intentions, and behavioral motivations behind the emotion. Zadeh et al. (reference: Zadeh A, Chan M, Liang P, et al. Social-IQ: A question answering benchmark for artificial social intelligence [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 8807-8817.) construct the Social-IQ dataset and label part of the question-answer pairs with emotion information to support social intelligence systems in learning the mental state of users by answering inference questions. On the basis of multimodal emotion analysis, video emotion reasoning based on multimodal information conforms to the development trend of the field and is the problem that needs to be addressed next.
3) Prior art: question-answering technology
Unlike traditional emotion recognition supervised by emotion labels, artificial intelligence systems need emotion understanding that goes beyond emotion-label supervision and enables reasoning about emotion in video in an interpretable manner. Question answering is an interpretable way of probing the degree to which latent phenomena are understood; it has been used in different research fields with good results, such as natural language processing, vision and language, and commonsense reasoning. Video question answering is the task of answering a question about a given video in natural language, and it has attracted considerable attention in recent years due to its applicability in social intelligence systems, cognitive robots, and related fields. At present there are many methods in the video question-answering field, including attention mechanisms, multimodal fusion methods, dynamic memory networks, and multimodal relation learning, and good results have been obtained on public video question-answering datasets. The invention therefore relies on the question-answering task to realize the model's reasoning about video emotion.
Disclosure of Invention
Video emotion reasoning is currently a research hotspot in the field of video emotion understanding, and an interpretable reasoning mode in question-answering form provides an efficient way to understand emotion in video. In general, information such as the causes, intentions, and behavioral motivations behind the emotions in a video is contained in the video's multimodal data, so a fine-grained understanding of the video scene is required, and it is necessary to extract information useful for emotion reasoning from the multimodal information. Moreover, in emotion reasoning based on video multimodal information, a single reasoning pass can rarely learn sufficient effective information, so multi-step reasoning is key to improving the model's effect.
To solve the above problems, the technical scheme provided by the invention is as follows:
A fine-grained video emotional content question-answering method based on multimodal data comprises the following steps:
1) Segmenting the long video in units of several sentences of dialogue, and segmenting the corresponding subtitle text and audio, to obtain a number of video clips;
2) Extracting multimodal features, including visual features, audio features, and text features, for a video clip, and encoding the question-answer pair corresponding to the video clip to obtain a question encoding q^T and answer encodings a_i^T;
3) Performing temporal encoding on each of the extracted multimodal features;
4) Extracting question-related information from the multimodal features of the video based on a visual branch, an audio branch, and a text branch, enhancing the visual branch with the facial features of the characters in the video, and enhancing the text branch with the storyline information in the video's story outline, to obtain enhanced multimodal features;
5) Inputting the enhanced multimodal features into an episodic memory network, which updates and stores the emotion reasoning clues extracted from the multimodal features and captures the key multimodal information in the emotion reasoning process, to obtain video context representations C^{v,a,t};
6) Inputting the question encoding q^T, the answer encodings a_i^T, and the video context representations C^{v,a,t} into an answer prediction module, which learns context-aware attention for the question encoding and the answer encodings respectively, to obtain the final emotion question-answering result P.
Further, the long video is segmented in units of n sentences of dialogue, where n is less than or equal to 20.
Further, the visual features include global visual features and facial features.
Further, the method for extracting the global visual features comprises: using a ResNet-152 model pre-trained on the ImageNet dataset.
Further, the method for extracting facial features comprises: detecting the face regions in the video frames with the pre-trained MTCNN model, and fine-tuning a FaceNet model pre-trained on VGGFace2 with the face region data of the main characters in the video (for example, the six main characters in the "Friends" video) to obtain face recognition features; extracting facial expression features with a FaceNet model pre-trained on the FER2013 dataset; and concatenating the face recognition features and the facial expression features to form the facial features of the characters in the video.
Further, the method for extracting the audio features comprises: using an openSMILE audio feature extractor.
Further, the method for extracting text features and encoding the question-answer pairs comprises: using pre-trained GloVe word embeddings.
Further, the method for encoding the multimodal information comprises: using a Transformer encoder.
Further, extracting question-related information from the multimodal features of the video means using question-guided attention to obtain question-related feature representations, through the following steps:
1) Computing the dot product of the facial features V_f^T and the question encoding q^T to obtain the similarity s between the question and the features;
2) Processing the dot-product result s with a softmax function to obtain the spatial attention a_f over the facial features V_f^T;
3) Computing the dot product of the spatial attention a_f and the facial features V_f^T to obtain the question-related feature representation V_{f,q}^T.
Further, the video context representations C^{v,a,t} output by the episodic memory network are obtained as follows:
1) Attention mechanism: compute the gate-based attention score for the t-th update, g_i^t = F_attn(f_i, m^{t-1}, q), where F_attn is the attention function (reference: Xiong C, Merity S, Socher R. Dynamic memory networks for visual and textual question answering [C]// JMLR Workshop and Conference Proceedings: volume 48, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. JMLR.org, 2016: 2397-2406.), f_i is the i-th fact vector in the input fact sequence, m^{t-1} is the state after the (t-1)-th update of the memory network module, and q is the question encoding vector;
2) Memory update mechanism: compute the hidden state of the i-th GRU (gated recurrent unit) element in the memory network module, h_i = g_i^t · GRU(f_i, h_{i-1}) + (1 - g_i^t) · h_{i-1}, where h_i is the hidden state of the i-th GRU cell; the last hidden state of the GRU serves as the video multimodal context representation of the t-th memory update, c_t^t (the subscript t denotes the t-th update and the superscript t denotes the text modality). Finally, the memory state at step t is updated as m^t = F_mem(m^{t-1}, c^t, q), where F_mem is the memory update function.
Further, the video question-answering result of the answer prediction module is obtained through the following steps:
1) Using a context matching module (reference: Seo M J, Kembhavi A, Farhadi A, et al. Bidirectional attention flow for machine comprehension [C]// 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.), compute the fused representation of each modality feature with the question and answer representations, where E^m is the modality feature output by the memory network module, q^{v,a,t} is the context-aware question representation, and a_i^{v,a,t} is the context-aware answer representation;
2) Process the fused representation with an FC layer and a softmax function to obtain the answer prediction probability distribution of each branch, P^v, P^a, P^t;
3) Concatenate the answer prediction distributions obtained by the three modalities and process them with a linear layer and a softmax function to obtain the final answer prediction probability distribution P = softmax(Linear([P^v; P^a; P^t])).
Further, steps 3)-6) are implemented with a video emotional content question-answering model that is trained end-to-end. The loss function of the model is the cross-entropy between the predicted answer distribution and the sample label, where P = [p_0, ..., p_4] is the prediction distribution over the five candidate answers, each element p_i denoting the probability that the answer to the sample is a_i, and y = [y_0, ..., y_4] is the one-hot encoding of the sample label, with y_i = 1 when the answer to the sample is a_i and y_i = 0 otherwise.
A fine-grained video emotional content question-answering system based on multimodal data, comprising:
a video segmentation module for segmenting the long video in units of several sentences of dialogue and segmenting the corresponding subtitle text and audio to obtain a number of video clips;
a multimodal feature extraction module for extracting multimodal features, including visual features, audio features, and text features, for a video clip, and encoding the corresponding question-answer pairs to obtain question encodings and answer encodings;
an encoding module for performing temporal encoding on each of the extracted multimodal features;
a multimodal feature enhancement module for extracting question-related information from the multimodal features of the video based on the visual branch, the audio branch, and the text branch, enhancing the visual branch with the facial features of the characters in the video, and enhancing the text branch with the storyline information in the video's story outline, to obtain enhanced multimodal features;
an episodic memory network module for inputting the enhanced multimodal features into an episodic memory network, which updates and stores the emotion reasoning clues extracted from the multimodal features and captures the key multimodal information in the emotion reasoning process, to obtain video context representations;
and an answer prediction module for taking the question encodings, the answer encodings, and the video context representations as input and learning context-aware attention for the question encodings and the answer encodings respectively, to obtain the final emotion question-answering result.
A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method described above.
A computer readable storage medium storing a computer program which, when executed by a computer, implements the method described above.
In summary, compared with the prior art, the invention has the following advantages and positive effects:
1. The video emotion reasoning baseline model is built on an episodic memory network (Episodic Memory Network), a multi-branch processing module for visual, audio, and text data is designed, and temporal dependencies in the multimodal data are encoded with a Transformer encoder, so that the extracted multimodal features contain emotional content from multiple perspectives and the fine-grained video emotional content question-answering task can be completed accurately.
2. The invention uses a Transformer encoder to learn temporal association relations over the video, audio, and text sequences and extracts high-dimensional multimodal features related to emotion classification; these temporal relations are important for analyzing the emotional information contained in the video.
3. The invention uses question-guided attention (Question-guided Attention) in the visual branch and the text branch respectively to focus on the facial expression features in the visual features and the story outline information in the text features during emotion reasoning about a character, which effectively improves the reasoning precision and question-answering accuracy of the method.
4. To address the problem that single-step reasoning learns little effective information, the multimodal clues used for reasoning are accumulated over the multi-step update process of the memory units of the episodic memory network and used for the final prediction of answers to emotion-related questions, which effectively improves the results of the multimodal fine-grained video emotional content question-answering task.
Drawings
FIG. 1 is a schematic diagram of the multimodal fine-grained video emotional content question-answering task.
FIG. 2 is a network framework flow chart of the fine-grained video emotional content question-answering method based on multimodal data.
Detailed Description
In order to enable those skilled in the art to better understand the present invention, the fine-grained video emotional content question-answering method based on multimodal data provided by the present invention is described in further detail below with reference to the accompanying drawings, without limiting the present invention.
Referring to FIG. 1 and FIG. 2, the present invention proposes a fine-grained video emotional content question-answering method based on multimodal information. Question-related features are extracted from the video multimodal information through a visual branch, an audio branch, and a text branch, and a Transformer encoder is used in each branch to learn the temporal dependencies of the multimodal data; question-guided attention is used to extract the facial expression information in the visual branch and the story outline information in the text branch; and emotion-related clues are captured through multi-step iterative updates of the episodic memory units, realizing multi-step emotion reasoning, accumulating reasoning clues, and learning an effective video context representation for answer prediction.
1. Multimodal feature extraction
1) Visual feature extraction
Video frames are first extracted from the video at 3 fps for subsequent visual feature extraction. The invention mainly extracts two kinds of visual features from the video frame data: global visual features of the video frames and facial features.
Global visual features: the video frames are processed with a ResNet-152 model pre-trained on the ImageNet dataset to obtain video frame visual features with a feature dimension of 2048. The visual features of the video frames corresponding to one video clip are then stacked to obtain the visual features representing the whole clip, V_clp, where n_clp denotes the number of video frames in the video clip.
Facial features: a pre-trained MTCNN model is first used to detect the face regions in the video frames, and two kinds of features are extracted from the face regions of the characters in the video: face recognition features and facial expression features. For example, the face recognition features are obtained by fine-tuning a FaceNet model pre-trained on VGGFace2 with the face region data of the six main characters in the "Friends" video; the facial expression features are extracted with a FaceNet model pre-trained on the FER2013 dataset. Finally, the face recognition features and the facial expression features are concatenated to form the facial features of the characters in the video, V_f, where n_f denotes the number of face regions in each video clip.
The input data of the visual branch of the method are V_clp and V_f.
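The following is an illustrative sketch of the global visual feature extraction described above, assuming torchvision's pre-trained ResNet-152 is used; the function name extract_clip_features and the frame-loading details are assumptions, not part of the original disclosure.

```python
# Minimal sketch of global visual feature extraction with a pre-trained ResNet-152
# (illustrative only; frame sampling at 3 fps and file handling are assumed).
import torch
import torchvision.models as models
import torchvision.transforms as T

# Load ResNet-152 pre-trained on ImageNet and drop the classification head,
# keeping the 2048-d globally pooled feature per frame.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_clip_features(frames):
    """frames: list of PIL images sampled at 3 fps from one video clip.
    Returns V_clp with shape (n_clp, 2048): stacked per-frame features."""
    batch = torch.stack([preprocess(f) for f in frames])
    return resnet(batch)   # (n_clp, 2048)
```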
2) Audio feature extraction
First, in order to align the audio data with the subtitle text, the audio corresponding to a video clip is segmented according to the timestamps of the subtitle text into 20 pieces corresponding to the subtitle sentences, which supports the model in learning the contextual relations between audio segments. For the 20 audio pieces corresponding to each video clip, 6373-dimensional acoustic features are extracted with the openSMILE feature extractor under the ComParE_2016 configuration file. Finally, the acoustic features corresponding to each video clip can be expressed as A, where n_a denotes the number of audio segments corresponding to the video clip.
The input data of the audio branch of the method is A.
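An illustrative sketch of the acoustic feature extraction step, assuming the openSMILE Python wrapper with the ComParE_2016 functional set (6373 dimensions per segment); the helper extract_audio_features and the segment format are assumptions.

```python
# Sketch of acoustic feature extraction with the openSMILE Python wrapper;
# segment boundaries come from the subtitle timestamps.
import opensmile
import numpy as np

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_audio_features(wav_path, segments):
    """segments: list of (start_sec, end_sec) for the 20 subtitle-aligned pieces.
    Returns A with shape (n_a, 6373)."""
    feats = [smile.process_file(wav_path, start=s, end=e).values[0]
             for s, e in segments]
    return np.stack(feats)   # (n_a, 6373)
```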
3) Text feature extraction
Word vector features of the input text data are first extracted: 300-dimensional word encoding features are obtained with GloVe vectors trained on Wikipedia 2014 and Gigaword 5, giving the feature representation S of the subtitle text corresponding to each video clip, where n_set denotes the number of subtitle sentences corresponding to the video clip and n_wrd denotes the number of words in each sentence. In addition, the story outline text features corresponding to each video clip can be expressed as K, where n_ks denotes the number of sentences in the story outline corresponding to the video clip and n_kw denotes the number of words in each sentence.
The input data of the text branch of the method are S and K.
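An illustrative sketch of the word-level text encoding, assuming pre-trained GloVe vectors (Wikipedia 2014 + Gigaword 5, 300-d) loaded through gensim; the function encode_sentences and the simple whitespace tokenization are assumptions.

```python
# Sketch of subtitle / story-outline text encoding with pre-trained GloVe vectors.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")   # 300-d word vectors

def encode_sentences(sentences):
    """Returns S with shape (n_set, n_wrd, 300); out-of-vocabulary words map to zeros."""
    n_wrd = max(len(s.split()) for s in sentences)
    S = np.zeros((len(sentences), n_wrd, 300), dtype=np.float32)
    for i, sent in enumerate(sentences):
        for j, w in enumerate(sent.lower().split()):
            if w in glove:
                S[i, j] = glove[w]
    return S
```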
2. Transformer-based multimodal information encoding
The Transformer performs excellently on a large number of natural language processing problems and is able to learn long-range dependencies in sequence data; this ability mainly relies on the self-attention mechanism. Furthermore, in order to learn information about the sequence data in different feature subspaces, the invention uses the multi-head attention mechanism of the Transformer, which processes multiple self-attention mechanisms in parallel and is expressed as:
MHA(Q, K, V) = Concat(head_1, ..., head_k) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where Q denotes the query matrix, K the key matrix, V the value matrix, and W a weight matrix; Attention(·) denotes the self-attention computation, and head_i denotes the i-th attention head in the multi-head attention mechanism. The method uses three independent Transformer encoders to perform temporal encoding of the sequence input data of each branch; the number of Transformer encoder layers is 2 and the number of attention heads in the multi-head attention is 6.
For the visual branch, the input data include V_clp and V_f. Before further encoding, a linear transformation layer is first applied to the visual features V_clp and V_f separately to unify the dimension of the two visual features to 300, giving the linearly transformed visual features V'_clp and V'_f. These are then fed as input to the Transformer encoder to learn the temporal dependency information of the video visual sequence features, and the output of the last Transformer encoder layer is taken as the further-encoded visual features V_clp^T and V_f^T.
For the audio branch, the same method of linear transformation and Transformer encoding is used to obtain the further-encoded audio features A^T.
For the text branch, the same method of linear transformation and Transformer encoding is used to obtain the text features S^T and K^T. For the question text features and the answer text features in the data, an independent Transformer encoder is used for encoding, giving the question text representation q^T and the answer text representations a_i^T, where n_q is the number of words in the question sentence and n_{a_i} is the number of words in the answer sentence a_i.
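An illustrative sketch of the per-branch temporal encoding described above (linear projection to 300 dimensions followed by a 2-layer Transformer encoder with 6 attention heads), written in PyTorch; the class name BranchTemporalEncoder and the batch-first layout are assumptions.

```python
# Minimal sketch of one branch's temporal encoder: linear projection + Transformer.
import torch
import torch.nn as nn

class BranchTemporalEncoder(nn.Module):
    def __init__(self, in_dim, d_model=300, n_heads=6, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)          # unify feature dimension to 300
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        # x: (batch, seq_len, in_dim) -> encoded sequence (batch, seq_len, 300)
        return self.encoder(self.proj(x))

# e.g. visual clip features (batch, n_clp, 2048) -> encoded V_clp^T
visual_encoder = BranchTemporalEncoder(in_dim=2048)
v_clp_T = visual_encoder(torch.randn(1, 24, 2048))
```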
3. Question-guided attention
Inspired by the cognitive process by which humans infer the underlying cause behind a certain emotion, the invention uses the facial features of the characters in the video to enhance the visual branch and uses the storyline information in the video's story outline to enhance the text branch. Facial expressions are a direct reflection of human emotion, and people tend to focus on the facial expression of a character when understanding and reasoning about emotion in a video. The first step in using facial expressions to enhance the visual branch is to extract the information in the facial features that is related to the question by means of an attention mechanism. Taking the visual branch as an example, the question-guided attention mechanism is as follows:
s = V_f^T · q^T
a_f = softmax(s)
V_{f,q}^T = a_f · V_f^T
where s denotes the similarity between the question and the facial features, and a_f denotes the spatial attention scores over the facial features V_f^T; the features related to the question are extracted according to the spatial attention scores to obtain the question-related facial feature representation V_{f,q}^T. The enhanced visual features of the visual branch can then be expressed as the concatenation of V_clp^T and V_{f,q}^T, where n_v is the number of video frames corresponding to the video clip and ";" denotes the concatenation operator.
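An illustrative sketch of the question-guided attention described above; the mean-pooling of the question words into a single vector and the element-wise re-weighting are simplifying assumptions, not part of the original disclosure.

```python
# Sketch of question-guided attention over the facial features.
import torch
import torch.nn.functional as F

def question_guided_attention(v_f, q):
    """v_f: (n_f, d) Transformer-encoded facial features V_f^T
       q:   (n_q, d) question encoding q^T
       Returns the question-related facial representation of shape (n_f, d)."""
    q_vec = q.mean(dim=0)                 # pool the question words into one vector
    s = v_f @ q_vec                       # (n_f,) similarity between question and features
    a_f = F.softmax(s, dim=0)             # spatial attention over face regions
    return a_f.unsqueeze(-1) * v_f        # attention-weighted facial features
```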
4. Multimodal episodic memory network
The episodic memory network is designed to retrieve, from the fact sequence input to the network, the information needed to answer a question, and it is particularly suitable for questions that require reasoning over the video context. The invention uses the episodic memory network to update and store the emotion reasoning clues extracted from the multimodal features.
In the invention, the memory network modules corresponding to the three independent branches have the same structure as the episodic memory network. Each memory network module is updated 3 times, and each modality feature is mapped into an input fact matrix F.
Visual-M is the visual memory network module corresponding to the visual branch. The visual features are first organized into a visual fact matrix that is input to the visual memory module, with every 10 seconds of the video clip treated as one processing unit. Since video frames are extracted at 3 fps, the video clip is segmented every 30 frames to obtain the video visual representation V_s = {s_1, ..., s_{n_s}}, where s_i denotes the i-th segment and n_s denotes the number of video segments. Therefore F_v = {f_1, ..., f_{n_s}} denotes the visual fact matrix input to the visual memory module, with f_i = s_i. After processing by the visual memory module, the final visual context representation C^v is obtained, which contains the visual information related to the question.
Audio-M is the audio memory network module corresponding to the audio branch, which extracts question-related features from the audio features through multi-step updates. The audio features A^T encoded by the Transformer encoder in the audio branch serve as the input to the audio memory module, i.e. the audio fact matrix of the audio memory module is F_a = {f_1, ..., f_{n_a}}, where n_a is the number of audio segments. Through a process similar to that of the visual memory module, the state of the audio memory module after the last update is obtained as the final audio context representation C^a.
Text-M is the text memory network module corresponding to the text branch; it learns and stores the emotional information in the text features by learning the interaction between the text features and the question encoding. The text representations S^T and K^T in the text branch form the input text fact matrix F_t = {sn_1, ..., sn_{l_t}}, where sn_i denotes a text sentence vector and l_t denotes the number of sentences in the text features. Through the multi-step updates of the text memory network, the text features related to the question are learned, and the final output text context representation C^t is used for answer prediction.
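An illustrative sketch of one episodic memory branch as described above: a gated attention score selects fact vectors, an attention-weighted GRU pass produces the context c^t, and the memory is updated for a fixed number of steps. The concrete forms of F_attn and F_mem below are simplified assumptions, not the exact functions of the disclosure.

```python
# Minimal sketch of an episodic memory branch (attention gate + attention-weighted GRU).
import torch
import torch.nn as nn

class EpisodicMemory(nn.Module):
    def __init__(self, d=300, n_updates=3):
        super().__init__()
        self.n_updates = n_updates
        self.gru = nn.GRUCell(d, d)
        self.attn = nn.Sequential(nn.Linear(4 * d, d), nn.Tanh(), nn.Linear(d, 1))
        self.mem_update = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU())

    def forward(self, facts, q):
        # facts: (n, d) fact vectors f_i; q: (d,) pooled question vector
        m = q.clone()
        for _ in range(self.n_updates):
            # g_i^t = F_attn(f_i, m^{t-1}, q): gate score from fact/question/memory interactions
            z = torch.cat([facts * q, facts * m,
                           (facts - q).abs(), (facts - m).abs()], dim=-1)
            g = torch.sigmoid(self.attn(z)).squeeze(-1)        # (n,)
            # h_i = g_i * GRU(f_i, h_{i-1}) + (1 - g_i) * h_{i-1}
            h = torch.zeros_like(m)
            for i in range(facts.size(0)):
                h = g[i] * self.gru(facts[i:i+1], h.unsqueeze(0)).squeeze(0) \
                    + (1 - g[i]) * h
            c = h                                               # context c^t of this pass
            # m^t = F_mem(m^{t-1}, c^t, q)
            m = self.mem_update(torch.cat([m, c, q], dim=-1))
        return c   # final context representation for this branch
```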
5. Answer prediction
The answer prediction module aims to jointly model the video multimodal feature representations and the question-answer pair encodings to predict the answer to the question. The core of the answer prediction module is the context matching module, which takes the final multimodal context representations C^{v,a,t}, the question encoding q^T, and the answer encodings a_i^T as input and learns context-aware attention for the question encoding and the answer encodings respectively, yielding the context-aware question representations q^{v,a,t} and the context-aware answer representations a_i^{v,a,t}.
Since the context matching module works in the same way in the three branches of the invention, differing only in the video modality features it processes, its working process is described below using the visual branch as an example.
The output of the visual memory network module serves as the input to the context matching module of the visual branch. Computing visual-feature-aware attention over the question encoding yields the visually-aware question representation q^v; similarly, computing visual-feature-aware attention over the answer encodings yields the visually-aware answer representations a_i^v. The answer prediction module of the visual branch then fuses the visual feature C^v, the question representation q^v, and the answer representations a_i^v, with element-wise multiplication used in the fusion, and the answer prediction probability distribution of the visual branch, P^v, is obtained by processing the fused representation with an FC layer and a softmax function. For the audio branch and the text branch, the same computation yields the answer prediction probability distributions P^a and P^t. The final answer prediction probability distribution is computed as
P = softmax(Linear([P^v; P^a; P^t]))
and P is used for the final answer prediction.
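An illustrative sketch of the per-branch answer scoring and the three-branch fusion described above; the concrete fusion of the context, question, and answer representations is a simplified assumption, not the exact formula of the disclosure.

```python
# Sketch of per-branch answer scoring and final fusion across the three branches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPredictor(nn.Module):
    def __init__(self, d=300, n_answers=5):
        super().__init__()
        self.fc = nn.Linear(3 * d, 1)                  # scores one candidate per branch
        self.final = nn.Linear(3 * n_answers, n_answers)

    def branch_scores(self, c, q, answers):
        # c: (d,) branch context; q: (d,) context-aware question; answers: (5, d)
        fused = torch.cat([answers, answers * q, answers * c], dim=-1)   # (5, 3d)
        return F.softmax(self.fc(fused).squeeze(-1), dim=0)              # P^v / P^a / P^t

    def forward(self, branch_inputs):
        # branch_inputs: list of three (c, q, answers) tuples for the v, a, t branches
        p = [self.branch_scores(c, q, a) for c, q, a in branch_inputs]
        return F.softmax(self.final(torch.cat(p, dim=-1)), dim=-1)       # final P
```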
6. Training and verification of fine-grained video emotional content question-answering model based on multimodal data
Further, the fine-grained video emotional content question-answering deep learning model based on multimodal data is trained and verified. The loss function of the model is the cross-entropy between the predicted answer distribution P and the one-hot sample label y. The overall training objective of the model is to minimize this loss over all the sample data X_R of the dataset with respect to the parameters θ_v, θ_a, θ_t of the visual branch, the audio branch, and the text branch; the parameters θ_v, θ_a, θ_t are updated through the visual, audio, and text branches respectively.
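An illustrative sketch of one end-to-end training step with a cross-entropy objective over the five candidate answers; the optimizer handling and single-sample batching are assumptions.

```python
# Sketch of an end-to-end training step with cross-entropy over five candidate answers.
import torch
import torch.nn.functional as F

def training_step(model, batch, optimizer):
    """batch: (branch_inputs, label) where label is the index of the correct answer a_i."""
    branch_inputs, label = batch
    p = model(branch_inputs)                              # predicted distribution P
    loss = F.nll_loss(torch.log(p + 1e-12).unsqueeze(0),  # -sum_i y_i log p_i
                      torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```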
Another embodiment of the present invention provides a fine-grained video emotional content question-answering system based on multimodal data, comprising:
a video segmentation module for segmenting the long video in units of several sentences of dialogue and segmenting the corresponding subtitle text and audio to obtain a number of video clips;
a multimodal feature extraction module for extracting multimodal features, including visual features, audio features, and text features, for a video clip, and encoding the corresponding question-answer pairs to obtain question encodings and answer encodings;
an encoding module for performing temporal encoding on each of the extracted multimodal features;
a multimodal feature enhancement module for extracting question-related information from the multimodal features of the video based on the visual branch, the audio branch, and the text branch, enhancing the visual branch with the facial features of the characters in the video, and enhancing the text branch with the storyline information in the video's story outline, to obtain enhanced multimodal features;
an episodic memory network module for inputting the enhanced multimodal features into an episodic memory network, which updates and stores the emotion reasoning clues extracted from the multimodal features and captures the key multimodal information in the emotion reasoning process, to obtain video context representations;
and an answer prediction module for taking the question encodings, the answer encodings, and the video context representations as input and learning context-aware attention for the question encodings and the answer encodings respectively, to obtain the final emotion question-answering result.
The specific implementation of each module refers to the foregoing description of the method of the invention.
Experimental data: the comparison results of the proposed method with other methods are shown in Table 1.
TABLE 1
No. | Method | Image modality | Audio modality | Text modality | Accuracy (%)
1 | Random | - | - | - | 20.00
2 | Longest Answer | - | - | - | 32.24
3 | Shortest Answer | - | - | - | 16.27
4 | HRCN | ✓ | - | - | 47.41
5 | HGA | ✓ | - | ✓ | 57.99
6 | Two-stream | ✓ | ✓ | - | 59.90
7 | Two-stream | ✓ | - | ✓ | 58.46
8 | Two-stream | - | ✓ | ✓ | 58.59
9 | Two-stream | ✓ | ✓ | ✓ | 61.16
10 | The method of the invention (bimodal) | ✓ | ✓ | - | 61.88
11 | The method of the invention (bimodal) | ✓ | - | ✓ | 69.07
12 | The method of the invention (bimodal) | - | ✓ | ✓ | 58.91
13 | The method of the invention | ✓ | ✓ | ✓ | 65.62
(✓ indicates that the modality is used; - indicates that it is not.)
Another embodiment of the invention provides a computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the above-described method.
Another embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements the method described above.
The fine-grained video emotional content question-answering method based on multimodal data according to the present invention has been described in detail above, but it is obvious that the specific implementation of the present invention is not limited thereto. Various obvious modifications that do not depart from the spirit of the method of the invention and the scope of the claims will fall within the protection scope of the invention, as will be apparent to those skilled in the art.

Claims (10)

1. A fine-grained video emotional content question-answering method based on multimodal data, characterized by comprising the following steps:
1) segmenting the long video in units of several sentences of dialogue, and segmenting the corresponding subtitle text and audio, to obtain a number of video clips;
2) extracting multimodal features, including visual features, audio features, and text features, for a video clip, and encoding the corresponding question-answer pairs to obtain question encodings and answer encodings;
3) performing temporal encoding on each of the extracted multimodal features;
4) extracting question-related information from the multimodal features of the video based on a visual branch, an audio branch, and a text branch, enhancing the visual branch with the facial features of the characters in the video, and enhancing the text branch with the storyline information in the video's story outline, to obtain enhanced multimodal features;
5) inputting the enhanced multimodal features into an episodic memory network, which updates and stores the emotion reasoning clues extracted from the multimodal features and captures the key multimodal information in the emotion reasoning process, to obtain video context representations;
6) inputting the question encodings, the answer encodings, and the video context representations into an answer prediction module, which learns context-aware attention for the question encodings and the answer encodings respectively, to obtain the final emotion question-answering result.
2. The method of claim 1, wherein the visual features comprise global visual features and facial features; the global visual features are extracted with a ResNet-152 model pre-trained on the ImageNet dataset; and the method for extracting facial features comprises: detecting the face regions in the video frames with the pre-trained MTCNN model, and fine-tuning a FaceNet model pre-trained on VGGFace2 with the face region data of the main characters in the video to obtain face recognition features; extracting facial expression features with a FaceNet model pre-trained on the FER2013 dataset; and concatenating the face recognition features and the facial expression features to form the facial features of the characters in the video.
3. The method of claim 1, wherein the audio features are extracted with an openSMILE audio feature extractor; the text features are extracted and the question-answer pairs are encoded with pre-trained GloVe word embeddings; and the extracted multimodal features are each temporally encoded with a Transformer encoder.
4. The method of claim 1, wherein extracting question-related information from the multimodal features of the video means using question-guided attention to obtain question-related feature representations, comprising the steps of:
1) computing the dot product of the facial features V_f^T and the question encoding q^T to obtain the similarity s between the question and the features;
2) processing the dot-product result s with a softmax function to obtain the spatial attention a_f over the facial features V_f^T;
3) computing the dot product of the spatial attention a_f and the features V_f^T to obtain the question-related feature representation V_{f,q}^T.
5. The method of claim 1, wherein the video context representations C^{v,a,t} output by the episodic memory network are obtained as follows:
1) attention mechanism: computing the gate-based attention score for the t-th update, g_i^t = F_attn(f_i, m^{t-1}, q), where F_attn is the attention function, f_i is the i-th fact vector in the input fact sequence, m^{t-1} is the state after the (t-1)-th update of the memory network module, and q is the question encoding vector;
2) memory update mechanism: computing the hidden state of the i-th GRU element in the memory network module, h_i = g_i^t · GRU(f_i, h_{i-1}) + (1 - g_i^t) · h_{i-1}, where h_i is the hidden state of the i-th GRU cell, and the last hidden state of the GRU serves as the video context representation of the t-th memory update c^t; finally, the memory state at step t is updated as m^t = F_mem(m^{t-1}, c^t, q), where F_mem is the memory update function.
6. The method of claim 1, wherein the video question-answering result of the answer prediction module is obtained by:
calculating the fused representation of each modality feature with the question and answer representations by using a context matching module;
processing the fused representation with an FC layer and a softmax function to obtain the answer prediction probability distribution of each branch;
and concatenating the answer prediction probability distributions obtained by the three modalities and processing them with a linear layer and a softmax function to obtain the final answer prediction probability distribution.
7. The method of claim 1, wherein steps 3)-6) are implemented with a video emotional content question-answering model that is trained end-to-end, and the loss function of the video emotional content question-answering model is the cross-entropy between the predicted answer distribution and the sample label, wherein P = [p_0, ..., p_4] is the prediction distribution over the five candidate answers, each element p_i denoting the probability that the answer to the sample is a_i, and y = [y_0, ..., y_4] is the one-hot encoding of the sample label, with y_i = 1 when the answer to the sample is a_i and y_i = 0 otherwise.
8. A fine-grained video emotional content question-answering system based on multimodal data, comprising:
a video segmentation module for segmenting the long video in units of several sentences of dialogue and segmenting the corresponding subtitle text and audio to obtain a number of video clips;
a multimodal feature extraction module for extracting multimodal features, including visual features, audio features, and text features, for a video clip, and encoding the corresponding question-answer pairs to obtain question encodings and answer encodings;
an encoding module for performing temporal encoding on each of the extracted multimodal features;
a multimodal feature enhancement module for extracting question-related information from the multimodal features of the video based on the visual branch, the audio branch, and the text branch, enhancing the visual branch with the facial features of the characters in the video, and enhancing the text branch with the storyline information in the video's story outline, to obtain enhanced multimodal features;
an episodic memory network module for inputting the enhanced multimodal features into an episodic memory network, which updates and stores the emotion reasoning clues extracted from the multimodal features and captures the key multimodal information in the emotion reasoning process, to obtain video context representations;
and an answer prediction module for taking the question encodings, the answer encodings, and the video context representations as input and learning context-aware attention for the question encodings and the answer encodings respectively, to obtain the final emotion question-answering result.
9. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-7.
CN202310184746.1A 2022-10-26 2023-03-01 Fine granularity video emotion content question-answering method and system based on multi-mode data Pending CN116226347A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2022113201198 2022-10-26
CN202211320119 2022-10-26

Publications (1)

Publication Number Publication Date
CN116226347A true CN116226347A (en) 2023-06-06

Family

ID=86580223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310184746.1A Pending CN116226347A (en) 2022-10-26 2023-03-01 Fine granularity video emotion content question-answering method and system based on multi-mode data

Country Status (1)

Country Link
CN (1) CN116226347A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891913A (en) * 2023-12-26 2024-04-16 大湾区大学(筹) Answer prediction method for multi-mode audio-visual questions, electronic equipment and medium
CN117635785A (en) * 2024-01-24 2024-03-01 卓世科技(海南)有限公司 Method and system for generating worker protection digital person
CN117635785B (en) * 2024-01-24 2024-05-28 卓世科技(海南)有限公司 Method and system for generating worker protection digital person

Similar Documents

Publication Publication Date Title
Liu et al. Multi-modal fusion network with complementarity and importance for emotion recognition
Huang et al. Image captioning with end-to-end attribute detection and subsequent attributes prediction
Shou et al. Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis
Saha et al. Towards emotion-aided multi-modal dialogue act classification
CN110888980B (en) Knowledge enhancement-based implicit chapter relation recognition method for attention neural network
Huang et al. Multimodal continuous emotion recognition with data augmentation using recurrent neural networks
CN116226347A (en) Fine granularity video emotion content question-answering method and system based on multi-mode data
Islam et al. Exploring video captioning techniques: A comprehensive survey on deep learning methods
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN113392265A (en) Multimedia processing method, device and equipment
Gan et al. DHF-Net: A hierarchical feature interactive fusion network for dialogue emotion recognition
Wu et al. Research on the Application of Deep Learning-based BERT Model in Sentiment Analysis
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
Chaudhary et al. Signnet ii: A transformer-based two-way sign language translation model
Xu et al. Gar-net: A graph attention reasoning network for conversation understanding
Yang et al. Self-adaptive context and modal-interaction modeling for multimodal emotion recognition
Du et al. Multimodal emotion recognition based on feature fusion and residual connection
Manousaki et al. Vlmah: Visual-linguistic modeling of action history for effective action anticipation
Weng et al. A survey of artificial intelligence techniques on MOOC of legal education
He et al. VGSG: Vision-Guided Semantic-Group Network for Text-Based Person Search
Xu et al. Humor Detection System for MuSE 2023: Contextual Modeling, Pesudo Labelling, and Post-smoothing
Yuhan et al. Sensory Features in Affective Analysis: A Study Based on Neural Network Models
Xie et al. Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning
Yang et al. GME-dialogue-NET: gated multimodal sentiment analysis model based on fusion mechanism
CN115983280B (en) Multi-mode emotion analysis method and system for uncertain mode deletion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination