CN116226347A - Fine granularity video emotion content question-answering method and system based on multi-mode data - Google Patents

Fine granularity video emotion content question-answering method and system based on multi-mode data

Info

Publication number
CN116226347A
CN116226347A; CN202310184746.1A; CN202310184746A
Authority
CN
China
Prior art keywords
video
features
question
answer
emotion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310184746.1A
Other languages
Chinese (zh)
Inventor
马翠霞
秦航宇
杜肖兵
邓小明
王宏安
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Software of CAS
Original Assignee
Institute of Software of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Software of CAS filed Critical Institute of Software of CAS
Publication of CN116226347A publication Critical patent/CN116226347A/en
Pending legal-status Critical Current


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention belongs to the field of video question answering, and particularly relates to a fine-grained video emotional content question-answering method and system based on multimodal data. A video emotion reasoning baseline model is built on an episodic memory network, a multi-branch processing module for visual, audio and text data is designed, and temporal dependencies in the multimodal data are encoded with a Transformer encoder, so that the extracted multimodal features contain emotional content from multiple perspectives and the fine-grained video emotional content question-answering task can be completed accurately. The invention uses a Transformer encoder to learn temporal association relations over the video, audio and text sequences and extracts high-dimensional multimodal features related to emotion classification; these temporal relations are important for analyzing the emotional information contained in the video. The method effectively improves the accuracy of the multimodal fine-grained video emotional content question-answering task.

Description

Fine granularity video emotion content question-answering method and system based on multi-mode data
Technical Field
The invention belongs to the field of video question answering, and particularly relates to a fine-grained video emotional content question-answering method and system based on multimodal data.
Background
In recent years, emotion analysis in movies and television programs has gained increasing attention in fields such as affective computing and artificial intelligence system design. Movies and television programs contain rich interactive scenes and character relationships, and the characters in such videos experience the same emotions as humans in the real world, such as excitement caused by a reward or sadness caused by a separation. Human-centered video scenes are closely related to social scenes in real life, which provides a platform for training artificial intelligence systems to understand the high-level semantic information behind emotions, namely the causes and intentions that induce an emotion, the behavioral motivations of the characters, and so on, and the rich emotions contained in video provide data support for studying the emotional content of video.
Intelligent systems need the ability to understand video scenes at a fine granularity: not only recognizing emotion categories, but also inferring the causes behind an emotion, the intentions, and the behavioral motivations of the characters in an interpretable way. In work that studies the emotional content of video, the multimodal information of the video is usually used as the source of analysis data, as in video emotion recognition. Video emotion recognition methods based on multimodal information understand the emotions conveyed by video content mainly through audio, text, visual, and other information. For example, emotion in video has been recognized with a cascading structural model composed of RNNs based on video context information and facial expression information of the characters; video emotion has been recognized with a dual-stream coding model that integrates facial expression features of the characters with video background information; a collaborative memory network has been adopted to recognize emotion in multimodal information based on the interaction between visual content and textual vocabulary; and a multimodal transformation network has been used to map visual and auditory representations into the same feature space for video emotion recognition. Research on video emotion understanding has focused mainly on emotion recognition methods, and research on emotion reasoning in video is scarce. With the development of intelligent interactive applications, researchers have begun exploring methods for inferring the potential causes behind emotions from multimodal scenes, such as extracting emotion-cause pairs from the multimodal information in video conversations and inferring the causes behind emotions through understanding of the multimodal information. Therefore, on the basis of video emotion recognition from multimodal information, further research on video emotion reasoning methods provides support for a deep understanding of the causes that induce emotions in multimodal video content.
To improve the intelligence of human-computer interaction, the emotional needs and intentions of users must be understood correctly on the basis of recognizing video emotions; this is the main research content of video emotion reasoning. Thus, in further research on video emotion understanding technology, it is necessary to study how to infer emotion in video interactive scenes in an interpretable way, i.e. to understand the causes, intentions, and behavioral motivations behind the emotion in the video. Video emotion reasoning needs to be carried out on human-centered video interactive scenes; only when people are placed in specific situations can the multimodal information of the interaction process be fully exploited to understand emotion and to reason about the multiple aspects related to it.
1) Prior art: emotion cause extraction and emotion reasoning
Discovering the potential causes behind a particular emotional expression in a conversational text context or a multimodal scenario has been a popular topic in the affective computing field. Emotion cause extraction (ECE) is a refinement of emotion analysis that aims to explore the potential causes behind certain emotional expressions in a dialogue. Rui Xia et al. (reference: Xia R, Zhang M, Ding Z. RTHN: A RNN-Transformer hierarchical network for emotion cause extraction [C]// Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019. ijcai.org, 2019: 5285-5291.) consider the relationships between multiple clauses in a conversation for emotion cause extraction. Since dialogue in video is itself composed of multimodal information, Rui Xia et al. (reference: Wang F, Ding Z, Xia R, et al. Multimodal emotion-cause pair extraction in conversations [J]. CoRR, 2021, abs/2110.08020.) further propose a multimodal emotion-cause pair extraction task and jointly extract emotions and the causes that evoke them from the multimodal information of the dialogue. Multimodal emotion-cause pair extraction is a further development of the ECE task; it requires a model with strong video multimodal understanding and emotion inference capabilities, because emotion causes do not necessarily come from the text information alone and may come from the visual scene.
In addition, opinion mining is an important problem in text emotion analysis, where emotion inference is a subtask dealing with the question of who holds an opinion and why. Although research on video emotion reasoning tasks is still relatively scarce in the field of video emotion understanding, the related work shows that emotion reasoning on video with multimodal information has profound research significance and application value. Therefore, an interpretable way to infer emotion in video is needed in order to fully understand the causes and intentions of emotion in multimodal video content and the behavioral motivations of the characters.
2) Prior art: multimodal emotion analysis
Multimodal emotion analysis in video aims to understand the emotions conveyed by video content through audio, text, and visual information. Sun Man-Chin et al. (reference: Sun M C, Hsu S H, Yang M C, et al. Context-aware cascade attention-based RNN for video emotion recognition [C]// 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia). IEEE, 2018: 1-6.) propose a cascading structural model consisting of two RNNs that uses video context information and facial expression information of the characters to identify emotion in the video. Jiyoung Lee et al. (reference: Lee J, Kim S, Kim S, et al. Context-aware emotion recognition networks [C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 10143-10152.) design a deep network that integrates facial expression features of the characters and video background information in a joint and boosted manner to recognize video emotion. Nan Xu et al. (reference: Xu N, Mao W, Chen G. A co-memory network for multimodal sentiment analysis [C]// The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR 2018, Ann Arbor, MI, USA, July 08-12, 2018. ACM, 2018: 929-932.) iteratively model the interactions between visual content and textual vocabulary using a collaborative memory network for multimodal sentiment analysis. Qi Fan et al. (reference: Qi F, Yang X, Xu C. Zero-shot video emotion recognition via multimodal protagonist-aware transformer network [C]// MM '21: ACM Multimedia Conference, Virtual Event, China, October 20-24, 2021. ACM, 2021: 1074-1083.) propose a multimodal transformation network that uniformly maps visual and auditory representations to the same feature space for video emotion recognition.
However, the multimodal emotion analysis studies described above focus mainly on emotion recognition tasks supervised by emotion labels. In recent years, some work has begun to focus on a further understanding of video emotions. Unlike direct identification of the emotion categories of characters in video, Guangyao Shen et al. (reference: Shen G, Wang X, Duan X, et al. MEmoR: A dataset for multimodal emotion reasoning in videos [C]// MM '20: The 28th ACM International Conference on Multimedia, Virtual Event / Seattle, WA, USA, October 12-16, 2020. ACM, 2020: 493-502.) propose character emotion reasoning based on video multimodal information; for the problem of recognizing the emotion of characters lacking multimodal information in video interactive scenes, they infer emotion with the help of the emotions of other characters in the same scene, a process that does not involve understanding the causes, intentions, and behavioral motivations behind the emotion. Zadeh et al. (reference: Zadeh A, Chan M, Liang P, et al. Social-IQ: A question answering benchmark for artificial social intelligence [C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 8807-8817.) construct the Social-IQ dataset and label part of the question-answer pairs with emotion information to support social intelligence systems in learning the mental state of users by answering inference questions. On the basis of multimodal emotion analysis, video emotion reasoning based on multimodal information conforms to the development trend of the field and is the problem that needs to be addressed next.
3) Prior art: question-answering technology
Unlike traditional emotion recognition supervised by emotion labels, artificial intelligence systems need emotion understanding that goes beyond emotion-label supervision and enables reasoning about emotion in video in an interpretable manner. Question answering is an interpretable way of probing the degree to which latent phenomena are understood; it has been used in different research fields with good results, such as natural language processing, vision and language, and commonsense reasoning. Video question answering is the task of answering a question about a given video in natural language, and it has attracted considerable attention in recent years due to its applicability in social intelligence systems, cognitive robots, and related fields. At present there are many methods in the video question-answering field, including attention mechanisms, multimodal fusion methods, dynamic memory networks, and multimodal relation learning, and good results have been obtained on public video question-answering datasets. The invention therefore relies on the question-answering task to realize the model's reasoning about video emotion.
Disclosure of Invention
Video emotion reasoning is currently a research hotspot in the field of video emotion understanding, and an interpretable reasoning mode in question-answering form provides an efficient way to understand emotion in video. In general, information such as the causes, intentions, and behavioral motivations behind the emotions in a video is contained in the video's multimodal data, so a fine-grained understanding of the video scene is required, and it is necessary to extract information useful for emotion reasoning from the multimodal information. Moreover, in emotion reasoning based on video multimodal information, a single reasoning pass can rarely learn sufficient effective information, so multi-step reasoning is key to improving the model's effect.
To solve the above problems, the technical scheme provided by the invention is as follows:
A fine-grained video emotional content question-answering method based on multimodal data comprises the following steps:
1) Segmenting the long video in units of several sentences of dialogue, and segmenting the corresponding subtitle text and audio, to obtain a number of video clips;
2) Extracting multimodal features, including visual features, audio features, and text features, for a video clip, and encoding the question-answer pair corresponding to the video clip to obtain a question encoding q^T and answer encodings a_i^T;
3) Performing temporal encoding on each of the extracted multimodal features;
4) Extracting question-related information from the multimodal features of the video based on a visual branch, an audio branch, and a text branch, enhancing the visual branch with the facial features of the characters in the video, and enhancing the text branch with the storyline information in the video's story outline, to obtain enhanced multimodal features;
5) Inputting the enhanced multimodal features into an episodic memory network, which updates and stores the emotion reasoning clues extracted from the multimodal features and captures the key multimodal information in the emotion reasoning process, to obtain video context representations C^{v,a,t};
6) Inputting the question encoding q^T, the answer encodings a_i^T, and the video context representations C^{v,a,t} into an answer prediction module, which learns context-aware attention for the question encoding and the answer encodings respectively, to obtain the final emotion question-answering result P.
Further, the long video is segmented in units of n sentences of dialogue, where n is less than or equal to 20.
Further, the visual features include global visual features and facial features.
Further, the method for extracting the global visual features comprises: using a ResNet-152 model pre-trained on the ImageNet dataset.
Further, the method for extracting facial features comprises: detecting the face regions in the video frames with the pre-trained MTCNN model, and fine-tuning a FaceNet model pre-trained on VGGFace2 with the face region data of the main characters in the video (for example, the six main characters in the "Friends" video) to obtain face recognition features; extracting facial expression features with a FaceNet model pre-trained on the FER2013 dataset; and concatenating the face recognition features and the facial expression features to form the facial features of the characters in the video.
Further, the method for extracting the audio features comprises: using an openSMILE audio feature extractor.
Further, the method for extracting text features and encoding the question-answer pairs comprises: using pre-trained GloVe word embeddings.
Further, the method for encoding the multimodal information comprises: using a Transformer encoder.
Further, extracting question-related information from the multimodal features of the video means using question-guided attention to obtain question-related feature representations, through the following steps:
1) Computing the dot product of the facial features V_f^T and the question encoding q^T to obtain the similarity s between the question and the features;
2) Processing the dot-product result s with a softmax function to obtain the spatial attention a_f over the facial features V_f^T;
3) Computing the dot product of the spatial attention a_f and the facial features V_f^T to obtain the question-related feature representation V_{f,q}^T.
Further, the video context representations C^{v,a,t} output by the episodic memory network are obtained as follows:
1) Attention mechanism: compute the gate-based attention score for the t-th update, g_i^t = F_attn(f_i, m^{t-1}, q), where F_attn is the attention function (reference: Xiong C, Merity S, Socher R. Dynamic memory networks for visual and textual question answering [C]// JMLR Workshop and Conference Proceedings: volume 48, Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016. JMLR.org, 2016: 2397-2406.), f_i is the i-th fact vector in the input fact sequence, m^{t-1} is the state after the (t-1)-th update of the memory network module, and q is the question encoding vector;
2) Memory update mechanism: compute the hidden state of the i-th GRU (gated recurrent unit) element in the memory network module, h_i = g_i^t · GRU(f_i, h_{i-1}) + (1 - g_i^t) · h_{i-1}, where h_i is the hidden state of the i-th GRU cell; the last hidden state of the GRU serves as the video multimodal context representation of the t-th memory update, c_t^t (the subscript t denotes the t-th update and the superscript t denotes the text modality). Finally, the memory state at step t is updated as m^t = F_mem(m^{t-1}, c^t, q), where F_mem is the memory update function.
Further, the video question-answering result of the answer prediction module is obtained through the following steps:
1) Using a context matching module (reference: Seo M J, Kembhavi A, Farhadi A, et al. Bidirectional attention flow for machine comprehension [C]// 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017.), compute the fused representation of each modality feature with the question and answer representations, where E^m is the modality feature output by the memory network module, q^{v,a,t} is the context-aware question representation, and a_i^{v,a,t} is the context-aware answer representation;
2) Process the fused representation with an FC layer and a softmax function to obtain the answer prediction probability distribution of each branch, P^v, P^a, P^t;
3) Concatenate the answer prediction distributions obtained by the three modalities and process them with a linear layer and a softmax function to obtain the final answer prediction probability distribution P = softmax(Linear([P^v; P^a; P^t])).
Further, steps 3)-6) are implemented with a video emotional content question-answering model that is trained end-to-end. The loss function of the model is the cross-entropy between the predicted answer distribution and the sample label, where P = [p_0, ..., p_4] is the prediction distribution over the five candidate answers, each element p_i denoting the probability that the answer to the sample is a_i, and y = [y_0, ..., y_4] is the one-hot encoding of the sample label, with y_i = 1 when the answer to the sample is a_i and y_i = 0 otherwise.
A fine-grained video emotional content question-answering system based on multimodal data, comprising:
a video segmentation module for segmenting the long video in units of several sentences of dialogue and segmenting the corresponding subtitle text and audio to obtain a number of video clips;
a multimodal feature extraction module for extracting multimodal features, including visual features, audio features, and text features, for a video clip, and encoding the corresponding question-answer pairs to obtain question encodings and answer encodings;
an encoding module for performing temporal encoding on each of the extracted multimodal features;
a multimodal feature enhancement module for extracting question-related information from the multimodal features of the video based on the visual branch, the audio branch, and the text branch, enhancing the visual branch with the facial features of the characters in the video, and enhancing the text branch with the storyline information in the video's story outline, to obtain enhanced multimodal features;
an episodic memory network module for inputting the enhanced multimodal features into an episodic memory network, which updates and stores the emotion reasoning clues extracted from the multimodal features and captures the key multimodal information in the emotion reasoning process, to obtain video context representations;
and an answer prediction module for taking the question encodings, the answer encodings, and the video context representations as input and learning context-aware attention for the question encodings and the answer encodings respectively, to obtain the final emotion question-answering result.
A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method described above.
A computer readable storage medium storing a computer program which, when executed by a computer, implements the method described above.
In summary, compared with the prior art, the invention has the following advantages and positive effects:
1. The video emotion reasoning baseline model is built on an episodic memory network (Episodic Memory Network), a multi-branch processing module for visual, audio, and text data is designed, and temporal dependencies in the multimodal data are encoded with a Transformer encoder, so that the extracted multimodal features contain emotional content from multiple perspectives and the fine-grained video emotional content question-answering task can be completed accurately.
2. The invention uses a Transformer encoder to learn temporal association relations over the video, audio, and text sequences and extracts high-dimensional multimodal features related to emotion classification; these temporal relations are important for analyzing the emotional information contained in the video.
3. The invention uses question-guided attention (Question-guided Attention) in the visual branch and the text branch respectively to focus on the facial expression features in the visual features and the story outline information in the text features during emotion reasoning about a character, which effectively improves the reasoning precision and question-answering accuracy of the method.
4. To address the problem that single-step reasoning learns little effective information, the multimodal clues used for reasoning are accumulated over the multi-step update process of the memory units of the episodic memory network and used for the final prediction of answers to emotion-related questions, which effectively improves the results of the multimodal fine-grained video emotional content question-answering task.
Drawings
FIG. 1 is a schematic diagram of the multimodal fine-grained video emotional content question-answering task.
FIG. 2 is a network framework flow chart of the fine-grained video emotional content question-answering method based on multimodal data.
Detailed Description
In order to enable those skilled in the art to better understand the present invention, the fine-grained video emotional content question-answering method based on multimodal data provided by the present invention is described in further detail below with reference to the accompanying drawings, without limiting the present invention.
Referring to FIG. 1 and FIG. 2, the present invention proposes a fine-grained video emotional content question-answering method based on multimodal information. Question-related features are extracted from the video multimodal information through a visual branch, an audio branch, and a text branch, and a Transformer encoder is used in each branch to learn the temporal dependencies of the multimodal data; question-guided attention is used to extract the facial expression information in the visual branch and the story outline information in the text branch; and emotion-related clues are captured through multi-step iterative updates of the episodic memory units, realizing multi-step emotion reasoning, accumulating reasoning clues, and learning an effective video context representation for answer prediction.
1. Multimodal feature extraction
1) Visual feature extraction
Video frames are first extracted from the video at 3 fps for subsequent visual feature extraction. The invention mainly extracts two kinds of visual features from the video frame data: global visual features of the video frames and facial features.
Global visual features: the video frames are processed with a ResNet-152 model pre-trained on the ImageNet dataset to obtain video frame visual features with a feature dimension of 2048. The visual features of the video frames corresponding to one video clip are then stacked to obtain the visual features representing the whole clip, V_clp, where n_clp denotes the number of video frames in the video clip.
Facial features: a pre-trained MTCNN model is first used to detect the face regions in the video frames, and two kinds of features are extracted from the face regions of the characters in the video: face recognition features and facial expression features. For example, the face recognition features are obtained by fine-tuning a FaceNet model pre-trained on VGGFace2 with the face region data of the six main characters in the "Friends" video; the facial expression features are extracted with a FaceNet model pre-trained on the FER2013 dataset. Finally, the face recognition features and the facial expression features are concatenated to form the facial features of the characters in the video, V_f, where n_f denotes the number of face regions in each video clip.
The input data of the visual branch of the method are V_clp and V_f.
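The following is an illustrative sketch of the global visual feature extraction described above, assuming torchvision's pre-trained ResNet-152 is used; the function name extract_clip_features and the frame-loading details are assumptions, not part of the original disclosure.

```python
# Minimal sketch of global visual feature extraction with a pre-trained ResNet-152
# (illustrative only; frame sampling at 3 fps and file handling are assumed).
import torch
import torchvision.models as models
import torchvision.transforms as T

# Load ResNet-152 pre-trained on ImageNet and drop the classification head,
# keeping the 2048-d globally pooled feature per frame.
resnet = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
resnet.fc = torch.nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_clip_features(frames):
    """frames: list of PIL images sampled at 3 fps from one video clip.
    Returns V_clp with shape (n_clp, 2048): stacked per-frame features."""
    batch = torch.stack([preprocess(f) for f in frames])
    return resnet(batch)   # (n_clp, 2048)
```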
2) Audio feature extraction
First, in order to align the audio data with the subtitle text, the audio corresponding to a video clip is segmented according to the timestamps of the subtitle text into 20 pieces corresponding to the subtitle sentences, which supports the model in learning the contextual relations between audio segments. For the 20 audio pieces corresponding to each video clip, 6373-dimensional acoustic features are extracted with the openSMILE feature extractor under the ComParE_2016 configuration file. Finally, the acoustic features corresponding to each video clip can be expressed as A, where n_a denotes the number of audio segments corresponding to the video clip.
The input data of the audio branch of the method is A.
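An illustrative sketch of the acoustic feature extraction step, assuming the openSMILE Python wrapper with the ComParE_2016 functional set (6373 dimensions per segment); the helper extract_audio_features and the segment format are assumptions.

```python
# Sketch of acoustic feature extraction with the openSMILE Python wrapper;
# segment boundaries come from the subtitle timestamps.
import opensmile
import numpy as np

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_audio_features(wav_path, segments):
    """segments: list of (start_sec, end_sec) for the 20 subtitle-aligned pieces.
    Returns A with shape (n_a, 6373)."""
    feats = [smile.process_file(wav_path, start=s, end=e).values[0]
             for s, e in segments]
    return np.stack(feats)   # (n_a, 6373)
```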
3) Text feature extraction
Word vector features of the input text data are first extracted: 300-dimensional word encoding features are obtained with GloVe vectors trained on Wikipedia 2014 and Gigaword 5, giving the feature representation S of the subtitle text corresponding to each video clip, where n_set denotes the number of subtitle sentences corresponding to the video clip and n_wrd denotes the number of words in each sentence. In addition, the story outline text features corresponding to each video clip can be expressed as K, where n_ks denotes the number of sentences in the story outline corresponding to the video clip and n_kw denotes the number of words in each sentence.
The input data of the text branch of the method are S and K.
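An illustrative sketch of the word-level text encoding, assuming pre-trained GloVe vectors (Wikipedia 2014 + Gigaword 5, 300-d) loaded through gensim; the function encode_sentences and the simple whitespace tokenization are assumptions.

```python
# Sketch of subtitle / story-outline text encoding with pre-trained GloVe vectors.
import numpy as np
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")   # 300-d word vectors

def encode_sentences(sentences):
    """Returns S with shape (n_set, n_wrd, 300); out-of-vocabulary words map to zeros."""
    n_wrd = max(len(s.split()) for s in sentences)
    S = np.zeros((len(sentences), n_wrd, 300), dtype=np.float32)
    for i, sent in enumerate(sentences):
        for j, w in enumerate(sent.lower().split()):
            if w in glove:
                S[i, j] = glove[w]
    return S
```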
2. Transformer-based multimodal information encoding
The Transformer performs excellently on a large number of natural language processing problems and is able to learn long-range dependencies in sequence data; this ability mainly relies on the self-attention mechanism. Furthermore, in order to learn information about the sequence data in different feature subspaces, the invention uses the multi-head attention mechanism of the Transformer, which processes multiple self-attention mechanisms in parallel and is expressed as:
MHA(Q, K, V) = Concat(head_1, ..., head_k) W^O
head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V)
Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
where Q denotes the query matrix, K the key matrix, V the value matrix, and W a weight matrix; Attention(·) denotes the self-attention computation, and head_i denotes the i-th attention head in the multi-head attention mechanism. The method uses three independent Transformer encoders to perform temporal encoding of the sequence input data of each branch; the number of Transformer encoder layers is 2 and the number of attention heads in the multi-head attention is 6.
For the visual branch, the input data include V_clp and V_f. Before further encoding, a linear transformation layer is first applied to the visual features V_clp and V_f separately to unify the dimension of the two visual features to 300, giving the linearly transformed visual features V'_clp and V'_f. These are then fed as input to the Transformer encoder to learn the temporal dependency information of the video visual sequence features, and the output of the last Transformer encoder layer is taken as the further-encoded visual features V_clp^T and V_f^T.
For the audio branch, the same method of linear transformation and Transformer encoding is used to obtain the further-encoded audio features A^T.
For the text branch, the same method of linear transformation and Transformer encoding is used to obtain the text features S^T and K^T. For the question text features and the answer text features in the data, an independent Transformer encoder is used for encoding, giving the question text representation q^T and the answer text representations a_i^T, where n_q is the number of words in the question sentence and n_{a_i} is the number of words in the answer sentence a_i.
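An illustrative sketch of the per-branch temporal encoding described above (linear projection to 300 dimensions followed by a 2-layer Transformer encoder with 6 attention heads), written in PyTorch; the class name BranchTemporalEncoder and the batch-first layout are assumptions.

```python
# Minimal sketch of one branch's temporal encoder: linear projection + Transformer.
import torch
import torch.nn as nn

class BranchTemporalEncoder(nn.Module):
    def __init__(self, in_dim, d_model=300, n_heads=6, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)          # unify feature dimension to 300
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        # x: (batch, seq_len, in_dim) -> encoded sequence (batch, seq_len, 300)
        return self.encoder(self.proj(x))

# e.g. visual clip features (batch, n_clp, 2048) -> encoded V_clp^T
visual_encoder = BranchTemporalEncoder(in_dim=2048)
v_clp_T = visual_encoder(torch.randn(1, 24, 2048))
```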
3. Question-guided attention
Inspired by the cognitive process by which humans infer the underlying cause behind a certain emotion, the invention uses the facial features of the characters in the video to enhance the visual branch and uses the storyline information in the video's story outline to enhance the text branch. Facial expressions are a direct reflection of human emotion, and people tend to focus on the facial expression of a character when understanding and reasoning about emotion in a video. The first step in using facial expressions to enhance the visual branch is to extract the information in the facial features that is related to the question by means of an attention mechanism. Taking the visual branch as an example, the question-guided attention mechanism is as follows:
s = V_f^T · q^T
a_f = softmax(s)
V_{f,q}^T = a_f · V_f^T
where s denotes the similarity between the question and the facial features, and a_f denotes the spatial attention scores over the facial features V_f^T; the features related to the question are extracted according to the spatial attention scores to obtain the question-related facial feature representation V_{f,q}^T. The enhanced visual features of the visual branch can then be expressed as the concatenation of V_clp^T and V_{f,q}^T, where n_v is the number of video frames corresponding to the video clip and ";" denotes the concatenation operator.
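An illustrative sketch of the question-guided attention described above; the mean-pooling of the question words into a single vector and the element-wise re-weighting are simplifying assumptions, not part of the original disclosure.

```python
# Sketch of question-guided attention over the facial features.
import torch
import torch.nn.functional as F

def question_guided_attention(v_f, q):
    """v_f: (n_f, d) Transformer-encoded facial features V_f^T
       q:   (n_q, d) question encoding q^T
       Returns the question-related facial representation of shape (n_f, d)."""
    q_vec = q.mean(dim=0)                 # pool the question words into one vector
    s = v_f @ q_vec                       # (n_f,) similarity between question and features
    a_f = F.softmax(s, dim=0)             # spatial attention over face regions
    return a_f.unsqueeze(-1) * v_f        # attention-weighted facial features
```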
4. Multimodal episodic memory network
The episodic memory network is designed to retrieve, from the fact sequence input to the network, the information needed to answer a question, and it is particularly suitable for questions that require reasoning over the video context. The invention uses the episodic memory network to update and store the emotion reasoning clues extracted from the multimodal features.
In the invention, the memory network modules corresponding to the three independent branches have the same structure as the episodic memory network. Each memory network module is updated 3 times, and each modality feature is mapped into an input fact matrix F.
Visual-M is the visual memory network module corresponding to the visual branch. The visual features are first organized into a visual fact matrix that is input to the visual memory module, with every 10 seconds of the video clip treated as one processing unit. Since video frames are extracted at 3 fps, the video clip is segmented every 30 frames to obtain the video visual representation V_s = {s_1, ..., s_{n_s}}, where s_i denotes the i-th segment and n_s denotes the number of video segments. Therefore F_v = {f_1, ..., f_{n_s}} denotes the visual fact matrix input to the visual memory module, with f_i = s_i. After processing by the visual memory module, the final visual context representation C^v is obtained, which contains the visual information related to the question.
Audio-M is the audio memory network module corresponding to the audio branch, which extracts question-related features from the audio features through multi-step updates. The audio features A^T encoded by the Transformer encoder in the audio branch serve as the input to the audio memory module, i.e. the audio fact matrix of the audio memory module is F_a = {f_1, ..., f_{n_a}}, where n_a is the number of audio segments. Through a process similar to that of the visual memory module, the state of the audio memory module after the last update is obtained as the final audio context representation C^a.
Text-M is the text memory network module corresponding to the text branch; it learns and stores the emotional information in the text features by learning the interaction between the text features and the question encoding. The text representations S^T and K^T in the text branch form the input text fact matrix F_t = {sn_1, ..., sn_{l_t}}, where sn_i denotes a text sentence vector and l_t denotes the number of sentences in the text features. Through the multi-step updates of the text memory network, the text features related to the question are learned, and the final output text context representation C^t is used for answer prediction.
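An illustrative sketch of one episodic memory branch as described above: a gated attention score selects fact vectors, an attention-weighted GRU pass produces the context c^t, and the memory is updated for a fixed number of steps. The concrete forms of F_attn and F_mem below are simplified assumptions, not the exact functions of the disclosure.

```python
# Minimal sketch of an episodic memory branch (attention gate + attention-weighted GRU).
import torch
import torch.nn as nn

class EpisodicMemory(nn.Module):
    def __init__(self, d=300, n_updates=3):
        super().__init__()
        self.n_updates = n_updates
        self.gru = nn.GRUCell(d, d)
        self.attn = nn.Sequential(nn.Linear(4 * d, d), nn.Tanh(), nn.Linear(d, 1))
        self.mem_update = nn.Sequential(nn.Linear(3 * d, d), nn.ReLU())

    def forward(self, facts, q):
        # facts: (n, d) fact vectors f_i; q: (d,) pooled question vector
        m = q.clone()
        for _ in range(self.n_updates):
            # g_i^t = F_attn(f_i, m^{t-1}, q): gate score from fact/question/memory interactions
            z = torch.cat([facts * q, facts * m,
                           (facts - q).abs(), (facts - m).abs()], dim=-1)
            g = torch.sigmoid(self.attn(z)).squeeze(-1)        # (n,)
            # h_i = g_i * GRU(f_i, h_{i-1}) + (1 - g_i) * h_{i-1}
            h = torch.zeros_like(m)
            for i in range(facts.size(0)):
                h = g[i] * self.gru(facts[i:i+1], h.unsqueeze(0)).squeeze(0) \
                    + (1 - g[i]) * h
            c = h                                               # context c^t of this pass
            # m^t = F_mem(m^{t-1}, c^t, q)
            m = self.mem_update(torch.cat([m, c, q], dim=-1))
        return c   # final context representation for this branch
```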
5. Answer prediction
The answer prediction module aims to jointly model the video multimodal feature representations and the question-answer pair encodings to predict the answer to the question. The core of the answer prediction module is the context matching module, which takes the final multimodal context representations C^{v,a,t}, the question encoding q^T, and the answer encodings a_i^T as input and learns context-aware attention for the question encoding and the answer encodings respectively, yielding the context-aware question representations q^{v,a,t} and the context-aware answer representations a_i^{v,a,t}.
Since the context matching module works in the same way in the three branches of the invention, differing only in the video modality features it processes, its working process is described below using the visual branch as an example.
The output of the visual memory network module serves as the input to the context matching module of the visual branch. Computing visual-feature-aware attention over the question encoding yields the visually-aware question representation q^v; similarly, computing visual-feature-aware attention over the answer encodings yields the visually-aware answer representations a_i^v. The answer prediction module of the visual branch then fuses the visual feature C^v, the question representation q^v, and the answer representations a_i^v, with element-wise multiplication used in the fusion, and the answer prediction probability distribution of the visual branch, P^v, is obtained by processing the fused representation with an FC layer and a softmax function. For the audio branch and the text branch, the same computation yields the answer prediction probability distributions P^a and P^t. The final answer prediction probability distribution is computed as
P = softmax(Linear([P^v; P^a; P^t]))
and P is used for the final answer prediction.
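An illustrative sketch of the per-branch answer scoring and the three-branch fusion described above; the concrete fusion of the context, question, and answer representations is a simplified assumption, not the exact formula of the disclosure.

```python
# Sketch of per-branch answer scoring and final fusion across the three branches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerPredictor(nn.Module):
    def __init__(self, d=300, n_answers=5):
        super().__init__()
        self.fc = nn.Linear(3 * d, 1)                  # scores one candidate per branch
        self.final = nn.Linear(3 * n_answers, n_answers)

    def branch_scores(self, c, q, answers):
        # c: (d,) branch context; q: (d,) context-aware question; answers: (5, d)
        fused = torch.cat([answers, answers * q, answers * c], dim=-1)   # (5, 3d)
        return F.softmax(self.fc(fused).squeeze(-1), dim=0)              # P^v / P^a / P^t

    def forward(self, branch_inputs):
        # branch_inputs: list of three (c, q, answers) tuples for the v, a, t branches
        p = [self.branch_scores(c, q, a) for c, q, a in branch_inputs]
        return F.softmax(self.final(torch.cat(p, dim=-1)), dim=-1)       # final P
```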
6. Training and verification of fine-grained video emotional content question-answering model based on multimodal data
Further, the fine-grained video emotional content question-answering deep learning model based on multimodal data is trained and verified. The loss function of the model is the cross-entropy between the predicted answer distribution P and the one-hot sample label y. The overall training objective of the model is to minimize this loss over all the sample data X_R of the dataset with respect to the parameters θ_v, θ_a, θ_t of the visual branch, the audio branch, and the text branch; the parameters θ_v, θ_a, θ_t are updated through the visual, audio, and text branches respectively.
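An illustrative sketch of one end-to-end training step with a cross-entropy objective over the five candidate answers; the optimizer handling and single-sample batching are assumptions.

```python
# Sketch of an end-to-end training step with cross-entropy over five candidate answers.
import torch
import torch.nn.functional as F

def training_step(model, batch, optimizer):
    """batch: (branch_inputs, label) where label is the index of the correct answer a_i."""
    branch_inputs, label = batch
    p = model(branch_inputs)                              # predicted distribution P
    loss = F.nll_loss(torch.log(p + 1e-12).unsqueeze(0),  # -sum_i y_i log p_i
                      torch.tensor([label]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```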
Another embodiment of the present invention provides a fine-grained video emotional content question-answering system based on multimodal data, comprising:
a video segmentation module for segmenting the long video in units of several sentences of dialogue and segmenting the corresponding subtitle text and audio to obtain a number of video clips;
a multimodal feature extraction module for extracting multimodal features, including visual features, audio features, and text features, for a video clip, and encoding the corresponding question-answer pairs to obtain question encodings and answer encodings;
an encoding module for performing temporal encoding on each of the extracted multimodal features;
a multimodal feature enhancement module for extracting question-related information from the multimodal features of the video based on the visual branch, the audio branch, and the text branch, enhancing the visual branch with the facial features of the characters in the video, and enhancing the text branch with the storyline information in the video's story outline, to obtain enhanced multimodal features;
an episodic memory network module for inputting the enhanced multimodal features into an episodic memory network, which updates and stores the emotion reasoning clues extracted from the multimodal features and captures the key multimodal information in the emotion reasoning process, to obtain video context representations;
and an answer prediction module for taking the question encodings, the answer encodings, and the video context representations as input and learning context-aware attention for the question encodings and the answer encodings respectively, to obtain the final emotion question-answering result.
The specific implementation of each module refers to the foregoing description of the method of the invention.
Experimental data: the comparison results of the proposed method with other methods are shown in Table 1.
TABLE 1
No. | Method | Image modality | Audio modality | Text modality | Accuracy (%)
1 | Random | - | - | - | 20.00
2 | Longest Answer | - | - | - | 32.24
3 | Shortest Answer | - | - | - | 16.27
4 | HRCN | ✓ | - | - | 47.41
5 | HGA | ✓ | - | ✓ | 57.99
6 | Two-stream | ✓ | ✓ | - | 59.90
7 | Two-stream | ✓ | - | ✓ | 58.46
8 | Two-stream | - | ✓ | ✓ | 58.59
9 | Two-stream | ✓ | ✓ | ✓ | 61.16
10 | The method of the invention (bimodal) | ✓ | ✓ | - | 61.88
11 | The method of the invention (bimodal) | ✓ | - | ✓ | 69.07
12 | The method of the invention (bimodal) | - | ✓ | ✓ | 58.91
13 | The method of the invention | ✓ | ✓ | ✓ | 65.62
(✓ indicates that the modality is used; - indicates that it is not.)
Another embodiment of the invention provides a computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the above-described method.
Another embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a computer, implements the method described above.
The fine-grained video emotional content question-answering method based on multimodal data according to the present invention has been described in detail above, but it is obvious that the specific implementation of the present invention is not limited thereto. Various obvious modifications that do not depart from the spirit of the method of the invention and the scope of the claims will fall within the protection scope of the invention, as will be apparent to those skilled in the art.

Claims (10)

1. A fine-grained video emotional content question-answering method based on multimodal data, characterized by comprising the following steps:
1) segmenting the long video in units of several sentences of dialogue, and segmenting the corresponding subtitle text and audio, to obtain a number of video clips;
2) extracting multimodal features, including visual features, audio features, and text features, for a video clip, and encoding the corresponding question-answer pairs to obtain question encodings and answer encodings;
3) performing temporal encoding on each of the extracted multimodal features;
4) extracting question-related information from the multimodal features of the video based on a visual branch, an audio branch, and a text branch, enhancing the visual branch with the facial features of the characters in the video, and enhancing the text branch with the storyline information in the video's story outline, to obtain enhanced multimodal features;
5) inputting the enhanced multimodal features into an episodic memory network, which updates and stores the emotion reasoning clues extracted from the multimodal features and captures the key multimodal information in the emotion reasoning process, to obtain video context representations;
6) inputting the question encodings, the answer encodings, and the video context representations into an answer prediction module, which learns context-aware attention for the question encodings and the answer encodings respectively, to obtain the final emotion question-answering result.
2. The method of claim 1, wherein the visual features comprise global visual features and facial features; the global visual features are extracted with a ResNet-152 model pre-trained on the ImageNet dataset; and the method for extracting facial features comprises: detecting the face regions in the video frames with the pre-trained MTCNN model, and fine-tuning a FaceNet model pre-trained on VGGFace2 with the face region data of the main characters in the video to obtain face recognition features; extracting facial expression features with a FaceNet model pre-trained on the FER2013 dataset; and concatenating the face recognition features and the facial expression features to form the facial features of the characters in the video.
3. The method of claim 1, wherein the audio features are extracted with an openSMILE audio feature extractor; the text features are extracted and the question-answer pairs are encoded with pre-trained GloVe word embeddings; and the extracted multimodal features are each temporally encoded with a Transformer encoder.
4. The method of claim 1, wherein extracting question-related information from the multimodal features of the video means using question-guided attention to obtain question-related feature representations, comprising the steps of:
1) computing the dot product of the facial features V_f^T and the question encoding q^T to obtain the similarity s between the question and the features;
2) processing the dot-product result s with a softmax function to obtain the spatial attention a_f over the facial features V_f^T;
3) computing the dot product of the spatial attention a_f and the features V_f^T to obtain the question-related feature representation V_{f,q}^T.
5. The method of claim 1, wherein the video context representations C^{v,a,t} output by the episodic memory network are obtained as follows:
1) attention mechanism: computing the gate-based attention score for the t-th update, g_i^t = F_attn(f_i, m^{t-1}, q), where F_attn is the attention function, f_i is the i-th fact vector in the input fact sequence, m^{t-1} is the state after the (t-1)-th update of the memory network module, and q is the question encoding vector;
2) memory update mechanism: computing the hidden state of the i-th GRU element in the memory network module, h_i = g_i^t · GRU(f_i, h_{i-1}) + (1 - g_i^t) · h_{i-1}, where h_i is the hidden state of the i-th GRU cell, and the last hidden state of the GRU serves as the video context representation of the t-th memory update c^t; finally, the memory state at step t is updated as m^t = F_mem(m^{t-1}, c^t, q), where F_mem is the memory update function.
6. The method of claim 1, wherein the video question-answering result of the answer prediction module is obtained by:
calculating the fused representation of each modality feature with the question and answer representations by using a context matching module;
processing the fused representation with an FC layer and a softmax function to obtain the answer prediction probability distribution of each branch;
and concatenating the answer prediction probability distributions obtained by the three modalities and processing them with a linear layer and a softmax function to obtain the final answer prediction probability distribution.
7. The method of claim 1, wherein steps 3)-6) are implemented with a video emotional content question-answering model that is trained end-to-end, and the loss function of the video emotional content question-answering model is the cross-entropy between the predicted answer distribution and the sample label, wherein P = [p_0, ..., p_4] is the prediction distribution over the five candidate answers, each element p_i denoting the probability that the answer to the sample is a_i, and y = [y_0, ..., y_4] is the one-hot encoding of the sample label, with y_i = 1 when the answer to the sample is a_i and y_i = 0 otherwise.
8. A fine-grained video emotional content question-answering system based on multimodal data, comprising:
a video segmentation module for segmenting the long video in units of several sentences of dialogue and segmenting the corresponding subtitle text and audio to obtain a number of video clips;
a multimodal feature extraction module for extracting multimodal features, including visual features, audio features, and text features, for a video clip, and encoding the corresponding question-answer pairs to obtain question encodings and answer encodings;
an encoding module for performing temporal encoding on each of the extracted multimodal features;
a multimodal feature enhancement module for extracting question-related information from the multimodal features of the video based on the visual branch, the audio branch, and the text branch, enhancing the visual branch with the facial features of the characters in the video, and enhancing the text branch with the storyline information in the video's story outline, to obtain enhanced multimodal features;
an episodic memory network module for inputting the enhanced multimodal features into an episodic memory network, which updates and stores the emotion reasoning clues extracted from the multimodal features and captures the key multimodal information in the emotion reasoning process, to obtain video context representations;
and an answer prediction module for taking the question encodings, the answer encodings, and the video context representations as input and learning context-aware attention for the question encodings and the answer encodings respectively, to obtain the final emotion question-answering result.
9. A computer device comprising a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program which, when executed by a computer, implements the method of any one of claims 1-7.
CN202310184746.1A 2022-10-26 2023-03-01 Fine granularity video emotion content question-answering method and system based on multi-mode data Pending CN116226347A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2022113201198 2022-10-26
CN202211320119 2022-10-26

Publications (1)

Publication Number Publication Date
CN116226347A true CN116226347A (en) 2023-06-06

Family

ID=86580223

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310184746.1A Pending CN116226347A (en) 2022-10-26 2023-03-01 Fine granularity video emotion content question-answering method and system based on multi-mode data

Country Status (1)

Country Link
CN (1) CN116226347A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117891913A (en) * 2023-12-26 2024-04-16 大湾区大学(筹) Answer prediction method for multi-mode audio-visual questions, electronic equipment and medium
CN117635785A (en) * 2024-01-24 2024-03-01 卓世科技(海南)有限公司 Method and system for generating worker protection digital person
CN117635785B (en) * 2024-01-24 2024-05-28 卓世科技(海南)有限公司 Method and system for generating worker protection digital person

Similar Documents

Publication Publication Date Title
Liu et al. Multi-modal fusion network with complementarity and importance for emotion recognition
Huang et al. Image captioning with end-to-end attribute detection and subsequent attributes prediction
Shou et al. Conversational emotion recognition studies based on graph convolutional neural networks and a dependent syntactic analysis
Saha et al. Towards emotion-aided multi-modal dialogue act classification
CN110888980B (en) Knowledge enhancement-based implicit chapter relation recognition method for attention neural network
Huang et al. Multimodal continuous emotion recognition with data augmentation using recurrent neural networks
CN116226347A (en) Fine granularity video emotion content question-answering method and system based on multi-mode data
Islam et al. Exploring video captioning techniques: A comprehensive survey on deep learning methods
CN110991290A (en) Video description method based on semantic guidance and memory mechanism
CN113392265A (en) Multimedia processing method, device and equipment
Gan et al. DHF-Net: A hierarchical feature interactive fusion network for dialogue emotion recognition
Wu et al. Research on the Application of Deep Learning-based BERT Model in Sentiment Analysis
Zeng et al. Robust multimodal sentiment analysis via tag encoding of uncertain missing modalities
Chaudhary et al. Signnet ii: A transformer-based two-way sign language translation model
Xu et al. Gar-net: A graph attention reasoning network for conversation understanding
Yang et al. Self-adaptive context and modal-interaction modeling for multimodal emotion recognition
Du et al. Multimodal emotion recognition based on feature fusion and residual connection
Manousaki et al. Vlmah: Visual-linguistic modeling of action history for effective action anticipation
Weng et al. A survey of artificial intelligence techniques on MOOC of legal education
He et al. VGSG: Vision-Guided Semantic-Group Network for Text-Based Person Search
Xu et al. Humor Detection System for MuSE 2023: Contextual Modeling, Pesudo Labelling, and Post-smoothing
Yuhan et al. Sensory Features in Affective Analysis: A Study Based on Neural Network Models
Xie et al. Global-Shared Text Representation Based Multi-Stage Fusion Transformer Network for Multi-Modal Dense Video Captioning
Yang et al. GME-dialogue-NET: gated multimodal sentiment analysis model based on fusion mechanism
CN115983280B (en) Multi-mode emotion analysis method and system for uncertain mode deletion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination