CN115712709A - Multi-modal dialog question-answer generation method based on multi-relationship graph model


Info

Publication number
CN115712709A
CN115712709A (application CN202211451009.5A)
Authority
CN
China
Prior art keywords
video
graph
text
model
sequence
Prior art date
Legal status
Pending
Application number
CN202211451009.5A
Other languages
Chinese (zh)
Inventor
吕姚嘉
朱文轩
刘铭
徐洁馨
李秋霞
秦兵
Current Assignee
Harbin Institute of Technology
China Merchants Bank Co Ltd
Original Assignee
Harbin Institute of Technology
China Merchants Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology, China Merchants Bank Co Ltd filed Critical Harbin Institute of Technology
Priority to CN202211451009.5A priority Critical patent/CN115712709A/en
Publication of CN115712709A publication Critical patent/CN115712709A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

A multi-modal dialog question-answer generation method based on a multi-relationship graph model, relating to the field of multi-modal dialog question answering. The invention aims to solve the problem that existing multi-modal dialog systems consider only serialized scene information, which leads to mediocre model performance. First, the video is serially segmented into a plurality of video clips; the color feature, optical flow feature and audio feature of each clip are extracted and concatenated, and position information and modality information are added to obtain the sequence representation of each video clip. Each video clip is then taken as a vertex to construct a video graph based on the full-link relation, which is input into a graph convolutional neural network to obtain a video hidden-layer sequence and its fused representation with the original video sequence. Next, the word vectors corresponding to the audio-visual scene title and the dialog history are processed in a similar way to obtain the corresponding text hidden-layer sequences and their fused representation with the original text sequence. Finally, an answer is generated with the neural network model.

Description

Multi-modal dialog question-answer generation method based on multi-relationship graph model
Technical Field
The invention belongs to the technical field of dialog question answering, and in particular relates to a multi-modal dialog question-answer generation method.
Background
Current research on dialog question-answering systems is mainly divided into two branches: text-based and multimodal. The text dialog question-answering task has two major difficulties: answer generation requires contextual reasoning over the dialog, and large-scale dialog datasets are lacking. Because pre-trained language models (LMs) have learned rich semantic information from other text data, they can perform a certain degree of reasoning and effectively compensate for the insufficient amount of dialog data, so that a system can still obtain good results in a low-resource setting. Therefore, introducing a pre-trained language model into the dialog question-answering task can deepen the system's understanding of the text and allow the current user question to be handled by reasoning over historical dialog turns, thereby improving the quality of the generated answers. In ISCA2020, Whang et al. applied a pre-trained language model in open-domain dialog to select candidate answers, where the output of the pre-trained language model (e.g., the [CLS] token in BERT) is used as the context representation of each dialog-context and candidate-answer pair. In WNGT2019, Budzianowski et al. assumed that the true dialog state is available and combined the inputs into a single sequence to generate responses for task-oriented dialog; since the dialog state and the database state can be viewed as raw text input, the system can be fine-tuned with pre-trained language models. In ICASSP2020, Lai et al. introduced a GPT-2 model and used its output to represent the predicted slot values, thereby tracking the dialog state.
Hierarchical pointer networks are also widely used in text dialog systems (e.g., in ACL2020). In ICLR2019, Wu et al. combined a global encoder with a local decoder, enabling external knowledge to be shared in a task-oriented dialog setting. In NAACL2019, Reddy et al. designed a multi-level memory framework for task-oriented dialog. In ACL2019, Tian et al. explored how to extract valuable information during training and built a memory-augmented architecture. In addition, multi-task learning has also been shown to improve the quality of natural-language answers. In ACL2019, Chen et al. introduced a working memory that interacts sufficiently with two long-term memories, which capture the dialog history and the knowledge-base tuples, to generate high-quality answers. In EMNLP2019, Lin et al. likewise applied heterogeneous memory networks to this field, given their ability to simultaneously exploit the dialog context, user questions, and knowledge-base information.
Multimodal dialog question answering opens up a new paradigm for building powerful dialog systems. Current research focuses on bridging the gap between language and vision with multi-granularity complementary information between static images and text. For example, the Visual Dialog task proposed by Das et al. in CVPR2017 provides a picture and multiple rounds of associated dialog, and requires the model to correctly answer the questioner's questions in natural language according to the given image and dialog history. Although this task has been important in advancing multi-modal dialog question-answering systems, dialog grounded in static images has inherent limitations that greatly restrict the question-answering system's dynamic perception of spatio-temporal changes, making it unable to handle many applications that require understanding a specific context to make reasonable inferences. Therefore, to improve the spatio-temporal intelligence of question-answering systems, a new task, Audio-Visual Scene-Aware Dialog (AVSD), was introduced; it can be regarded as a generalized form of Visual Dialog, i.e., visual dialog grounded in continuous picture frames and audio information, and has broader application prospects than Visual Dialog. However, existing solutions mainly use separate encoders to encode the different modalities and then fuse their representations with an attention mechanism to generate response sentences. Such late-fusion schemes consider only the serialized features of scenes and dialogs and neglect the multi-granularity semantic complementarity among the modalities, so the performance of existing models is unsatisfactory. Therefore, exploring semantic representations of dialog scenes and modality-fusion schemes is of great significance for realizing higher-performance multi-modal dialog question-answering systems.
Meanwhile, the focus of research in the field of multimodal dialog is shifting toward how to adequately integrate multi-source heterogeneous information, including images, audio, video, and text. Compared with text dialog question answering, the multimodal task additionally introduces audio-visual features related to the dialog, so the problem of fine-grained interaction among different modalities must be solved.
The attention mechanism is the mainstream research method in this field; it can reduce the gap between visual and language modality representations. In CVPR2018, the CoAtt model designed by Wu et al. contains a sequential co-attention encoder in which each input feature is attended by the other features in a sequential manner. The ReDAN model proposed by Gan et al. in ACL2019 and the DMRM model proposed by Chen et al. in AAAI2020 answer a series of questions about images through multi-step reasoning based on a dual attention mechanism. The LTMI model designed by Nguyen et al. in ECCV2020 uses a multi-head attention mechanism to let the modalities of interest interact.
Pre-trained language models built on the Transformer architecture, itself based on the attention mechanism, also perform well in learning cross-modal representations for vision-language tasks. In the image-captioning task, Li et al. in AAAI2020 constructed a BERT-based architecture to improve text and visual representations; in NIPS2019, Lu et al. handled the visual question-answering task with a similar approach, with the difference that, when processing multimodal input, the visual and textual representations are kept separate rather than merged into a single sequence. In IJCNLP2019, Alberti et al. noted the important role of early-fusion and late-fusion methods in enriching cross-modal representations. In ICCV2019, Sun et al. proposed the VideoBERT model, which uses BERT to generate video descriptions; it abandons representing video frames with visual features and instead converts frame-level features into visual tokens as the raw input to the model.
Recent research has also explored higher-level semantic representations of pictures or dialog histories, especially modeling them with graph structures. In AAAI2020, the DualVD model proposed by Jiang et al. describes picture features from both visual and semantic perspectives: the visual graph module helps extract surface information, including entities and relations, while the semantic graph module facilitates the transition of the dialog question-answering system from global to local visual-semantic understanding. In CVPR2020, the CAG model designed by Guo et al. builds graph nodes from entity-related visual representations and history-related contextual representations, updates the corresponding edge weights with an adaptive top-K message-passing mechanism, and establishes a vision-semantics dynamic graph for subsequent reasoning. In ACL2021, the GoG model proposed by Chen et al. takes into account the interaction between different relations and therefore models both a graph of the current question's dependencies on the dialog history and a graph of object (region) relations grounded in the current question.
Disclosure of Invention
The invention aims to solve the problem that existing multi-modal dialog systems consider only serialized scene information, which leads to mediocre model performance, and provides a multi-modal dialog question-answer generation method based on a multi-relationship graph model.
A multi-modal dialog question-answer generation method based on a multi-relationship graph model comprises the following steps:
S1. Segment the video into a plurality of video clips with a fixed-size sliding window, and for each clip obtain its color feature V_t^{rgb}, optical flow feature V_t^{flow} and audio feature V_t^{audio}; concatenate the color feature V_t^{rgb}, the optical flow feature V_t^{flow} and the audio feature V_t^{audio} to obtain V_t^{feature}, then add position information V_t^{pos} and modality information V_t^{mod} to obtain the sequence representation V_t of each video clip; the expressions are:
V_t^{feature} = [V_t^{rgb}, V_t^{flow}, V_t^{audio}],
V_t = V_t^{feature} + V_t^{mod} + V_t^{pos},
where the position information V_t^{pos} uses numbers to indicate the order in which each video clip appears, the modality information V_t^{mod} uses the identifier [video] to uniformly mark video features, and both are converted into vectors of fixed dimension in actual computation;
S2. For the audio-visual scene representation V = (V_1, V_2, ..., V_m), where V_1, V_2, ..., V_m are the sequence representations of the video clips, take each video clip as a vertex and construct a video graph 𝒢_V = (ν_V, ε_V) based on the full-link relation, where ν_V = {V_1, V_2, ..., V_m} is the vertex set and ε_V is the set of directed dependency edges; for each directed dependency edge (V_i, V_j, l_ij), l_ij denotes the label of the edge from V_i to V_j and is set to 1;
input the video graph into a graph convolutional neural network and output the video hidden-layer sequence G_V;
S3. Input the video hidden-layer sequence G_V and the original video sequence representation V into a linear layer to obtain the fused video representation M_V, which serves as part of the input to the subsequent multi-layer Transformer model based on the GPT-2 architecture;
S4. Obtain the word-vector representations C_feature and H_feature of the audio-visual scene title C and the dialog history H; concatenate the title word vectors C_feature and the dialog-history word vectors H_feature, then add position information T_pos and modality information T_mod to obtain the text sequence representation T; the expressions are:
T_feature = [C_feature, H_feature],
T = T_feature + T_mod + T_pos,
where the position information T_pos uses numbers to indicate the order in which words appear in the title and in each question-answer pair, and the modality information T_mod uses the identifier [cap] to uniformly mark the audio-visual scene title, the identifier [usr1] to mark the questioner and the identifier [usr2] to mark the respondent; both are converted into vectors of fixed dimension in actual computation;
S5. Take each word vector in the text sequence representation T obtained in S4 as a vertex and construct a graph structure 𝒢_D based on sentence-level dependency relations and/or a graph structure 𝒢_C based on full-dialog coreference relations; then input the sentence-level dependency graph 𝒢_D and/or the full-dialog coreference graph 𝒢_C into graph convolutional neural networks respectively to obtain the corresponding text hidden-layer sequences;
S6. Input the text hidden-layer sequences corresponding to the sentence-level dependency graph 𝒢_D and/or the full-dialog coreference graph 𝒢_C, together with the original text sequence representation T, into a linear layer to obtain the fused text representation N_T, which serves as part of the input to the subsequent multi-layer Transformer model based on the GPT-2 architecture;
S7. Concatenate M_V and N_T to obtain the enhanced multimodal input, and generate the answer with the multi-layer Transformer model based on the GPT-2 architecture.
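As a high-level illustration of S1 to S7, the following minimal PyTorch sketch shows how the pieces fit together. It is illustrative only: the dimensions, the placeholder adjacency matrices, the single simplified graph-convolution step, and the symbols M_V and N_T used for the fused representations are assumptions made for the sketch, not the exact implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions only; real feature sizes are not fixed by this sketch.
m, n_words, d = 5, 12, 768

# S1: per-clip features (color + flow + audio already concatenated and projected
# to d), plus position and modality embeddings.
V = torch.randn(m, d) + torch.randn(m, d) + torch.randn(m, d)   # V_feature + V_pos + V_mod

# S2-S3: fully linked video graph -> simplified graph convolution -> fuse with V.
A_v = torch.ones(m, m)                                  # full-link adjacency, l_ij = 1
G_V = torch.relu((A_v / m) @ V @ torch.randn(d, d))     # one un-normalized GCN step
M_V = nn.Linear(2 * d, d)(torch.cat([G_V, V], dim=-1))  # fused video representation

# S4-S6: text sequence (title + history) -> dependency/coreference graph -> fuse.
T = torch.randn(n_words, d)                             # T_feature + T_mod + T_pos
A_d = torch.eye(n_words)                                # placeholder text-graph adjacency
G_D = torch.relu(A_d @ T @ torch.randn(d, d))
N_T = nn.Linear(2 * d, d)(torch.cat([G_D, T], dim=-1))  # fused text representation

# S7: concatenate along the sequence axis and hand the result to a GPT-2 style decoder.
multimodal_input = torch.cat([M_V, N_T], dim=0)         # shape (m + n_words, d)
print(multimodal_input.shape)
```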
Further, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing the graph structure 𝒢_D based on sentence-level dependency relations, and obtaining the corresponding text hidden-layer sequence is:
first, use the GPT2 Tokenizer to obtain the word-vector representation of each word and use the Stanford CoreNLP text analysis tool to analyze the syntactic dependencies of the sentence; take each word vector as a vertex and model the graph structure according to the syntactic dependencies; then input the graph into the graph convolutional neural network and output the text hidden-layer sequence G_D;
Alternatively, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing the graph structure 𝒢_C based on full-dialog coreference relations, and obtaining the corresponding text hidden-layer sequence is:
first, use the GPT2 Tokenizer to obtain the word-vector representation of each word and use the Stanford CoreNLP text analysis tool to analyze the coreference relations of the sentences; take each word vector as a vertex and model the graph structure according to the coreference relations; then input the graph into the graph convolutional neural network and output the text hidden-layer sequence G_C;
Alternatively, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing both the graph structure 𝒢_D based on sentence-level dependency relations and the graph structure 𝒢_C based on full-dialog coreference relations, and obtaining the corresponding text hidden-layer sequences is:
first, use the GPT2 Tokenizer to obtain the word-vector representation of each word and use the Stanford CoreNLP text analysis tool to analyze the syntactic dependencies and the coreference relations of the sentences respectively; take each word vector as a vertex and construct the sentence-level dependency graph 𝒢_D and the full-dialog coreference graph 𝒢_C respectively; then input the two graph structures into graph convolutional neural networks respectively and output the text hidden-layer sequences G_D and G_C.
Further, each graph-convolution layer of the graph convolutional neural network in S5 is computed as:
H_d^{(l+1)} = f(H_d^{(l)}, A_d) = σ( D̃_d^{-1/2} Ã_d D̃_d^{-1/2} H_d^{(l)} W_d^{(l)} ),
where f(·) denotes one graph-convolution layer; for 𝒢_D or 𝒢_C, A_d denotes the corresponding adjacency matrix and D_d the corresponding degree matrix; the identity matrix I_d is added to obtain Ã_d = A_d + I_d, and the corresponding degree matrix D̃_d is obtained accordingly to facilitate the normalization; l is the layer index of the graph convolutional neural network, H_d^{(l)} is the hidden state of the l-th graph-convolution layer, H_d^{(0)} = T, and W_d^{(l)} is a trainable weight.
further, the expression calculated by each layer of the graph convolution neural network in S2 is:
Figure BDA0003949948550000059
wherein, f (H) v (l+1) ,A v ) Representing each layer of graph convolution; a. The v Is composed of
Figure BDA00039499485500000510
I, j respectively represent
Figure BDA00039499485500000511
The number of nodes i, j,
Figure BDA00039499485500000512
is A v The value of the ith row and jth column;
Figure BDA00039499485500000513
is composed of
Figure BDA00039499485500000514
The degree matrix of (c) is,
Figure BDA00039499485500000515
is D v The value of the ith row and ith column; in order to make the model capable of considering the node self-expression, an identity matrix I is added v To obtain
Figure BDA00039499485500000516
Accordingly, a corresponding degree matrix is obtained
Figure BDA00039499485500000517
To facilitate normalization operations; l is the number of layers of the graph convolution neural network,
Figure BDA00039499485500000518
convolution of the hidden state of the neural network for the l-th layer,
Figure BDA00039499485500000519
v is the representation of the original video sequence,
Figure BDA00039499485500000520
are trainable weights.
Furthermore, when the multi-layer Transformer model based on the GPT-2 architecture processes the inputs, the fused video representation M_V from S3 is fed into a linear fully-connected layer so that its output is projected into the same vector space as the fused text representation N_T from S6; the two are concatenated to obtain the complete multimodal input representation, which is then fed into the pre-trained language model GPT-2.
Preferably, the multi-layer Transformer model based on the GPT-2 architecture in S7 is formed by stacking 12 Transformer decoder modules with a masked multi-head attention mechanism.
Further, the multi-layer Transformer model based on the GPT-2 architecture is trained with a negative log-likelihood loss function; the training process is:
based on the audio-visual features V, the title C, the dialog history H_{<n} and the current question Q_n, generate the answer R_n = {r_1, r_2, ..., r_K}; by minimizing the negative log-likelihood loss, the probability that the next output word is the corresponding word of the source sequence is maximized:
L(θ) = -E_{(V,C,H,Q,R)~D} [ Σ_j log P(r_j | r_{<j}, V, C, H_{<n}, Q_n; θ) ],
where r_{<j} denotes the first j-1 words of the answer R_n, θ denotes the trainable model parameters, the tuples (V, C, H, Q, R) are sampled from the whole training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation.
Alternatively, the multi-layer Transformer model based on the GPT-2 architecture is jointly trained with an answer prediction task (RPT) based on the audio-visual, title and dialog-history features, a title prediction task (CPT) and an audio-video-text matching task (VTMT); the training process is:
the RPT part aims to generate the answer R_n = {r_1, r_2, ..., r_K} based on the audio-visual features V, the title C, the dialog history H_{<n} and the current question Q_n; by minimizing the negative log-likelihood loss, the probability that the next word output by the model is the corresponding word of the source sequence is maximized:
L_RPT(θ) = -E_{(V,C,H,Q,R)~D} [ Σ_j log P(r_j | r_{<j}, V, C, H_{<n}, Q_n; θ) ],
where r_{<j} denotes the first j-1 words of the answer R_n, θ denotes the trainable model parameters, the tuples (V, C, H, Q, R) are sampled from the whole training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation;
the CPT part is similar to the RPT part: for a given audio-visual feature V, the title C = {c_1, c_2, ..., c_L} is generated by minimizing a negative log-likelihood loss:
L_CPT(θ) = -E_{(V,C)~D} [ Σ_{i=1}^{L} log P(c_i | c_{<i}, V; θ) ],
where c_{<i} denotes the first i-1 words of the title C;
the VTMT part aims to determine whether the given audio-visual features V match the given text features, where the text features comprise the title C, the dialog history H_{<n}, the current question Q_n and the generated response R_n; a certain proportion of the training data is selected and the corresponding original input is randomly replaced with incorrect audio-visual features; the final output of the hidden state of the GPT-2 module is passed through a linear fully-connected layer to obtain the matching probability, and the loss is then computed with a binary cross entropy:
L_VTMT(θ) = -E_{(X,Y)~D} [ Y log P(Y | X; θ) + (1 - Y) log(1 - P(Y | X; θ)) ],
where X = (V, C, H, Q, R) and Y is the label indicating whether the audio-visual features and the text features match.
Beneficial effects:
the invention relates to a multi-modal dialog question-answer generation method based on a multi-relationship graph model, which constructs the multi-relationship graph model according to the characteristics of different modes to enrich multi-modal characteristic representation. By modeling the corresponding relation of entities in the continuous video clips and the syntactic and semantic relations implied in the continuous conversation, the understanding of the system to the scenes and the conversation is further deepened, the defect that the existing method only considers the coding of the time sequence and the language sequence is improved, and the quality of generated answers is further improved.
The invention aims to improve the situation in which current pre-trained model architectures use only the serialized information of videos or texts to obtain the embedded representation of each video clip or word, which makes multi-modal dialog question-answering systems generate unsatisfactory answers. A graph convolutional neural network is additionally added so that the model can effectively encode the structural information of the multi-relationship graphs. Specifically, the model introduces the graph convolutional neural network and assigns weights to the current node by computing the data distribution over its set of adjacent nodes; after stacking multiple graph-convolution layers, the model can reason about nodes several hops away from the current node and thus capture the interactions between the current node and distant nodes, so that the syntactic or semantic relations among all words can be represented within one framework.
The invention can effectively improve various metrics of the system-generated responses, including BLEU, METEOR, ROUGE-L and CIDEr. On audio-visual scene-aware dialog datasets, the answers generated by the method are compared with manually annotated answers on multiple metrics; the experimental results show that the generated answers conform to basic habits of human expression, largely match the manually annotated results, and outperform all existing models on this task. The improvement on the CIDEr metric, which reflects the naturalness of sentences, is especially obvious: an average improvement of 1% over the strongest baseline results, which fully demonstrates the effectiveness and superiority of the method.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a block diagram of the overall model architecture of the present invention;
FIG. 3 is a block diagram of the basic elements of a pre-trained language model;
FIG. 4 is an example of a graph model construction based on dependencies;
FIG. 5 is an example graph model construction based on coreference relationships.
Detailed Description
The first embodiment is as follows: this embodiment is described with reference to FIG. 1.
The embodiment is a multi-modal dialog question-answer generation method based on a multi-relationship graph model, which comprises the following steps:
Step one: use a fixed-size sliding window to serially segment the video into a plurality of video clips; for each clip, use an I3D model to obtain its color feature V_t^{rgb} and optical flow feature V_t^{flow}, and use a VGGish model to obtain its audio feature V_t^{audio}; concatenate the color feature V_t^{rgb}, the optical flow feature V_t^{flow} and the audio feature V_t^{audio} to obtain V_t^{feature}, then add position information V_t^{pos} and modality information V_t^{mod} to obtain the sequence representation V_t of each video clip; the expressions are:
V_t^{feature} = [V_t^{rgb}, V_t^{flow}, V_t^{audio}],
V_t = V_t^{feature} + V_t^{mod} + V_t^{pos},
where the position information V_t^{pos} uses numbers to indicate the order in which each video clip appears, and the modality information V_t^{mod} uses the identifier [video] to uniformly mark video features; both are converted into vectors of fixed dimension in actual computation. As shown in FIG. 2, the clip representations are denoted V1, V2, V3, V4, V5, and the corresponding modality tokens are all marked [video].
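A minimal sketch of step one is given below, assuming PyTorch, typical I3D/VGGish feature sizes and a GPT-2-sized hidden dimension; none of these dimensions are specified above and they are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

# Assumed sizes: 2048-d I3D streams, 128-d VGGish audio, 768-d model dimension.
d_rgb, d_flow, d_audio, d_model, num_clips = 2048, 2048, 128, 768, 5

rgb   = torch.randn(num_clips, d_rgb)     # I3D color features per clip
flow  = torch.randn(num_clips, d_flow)    # I3D optical-flow features per clip
audio = torch.randn(num_clips, d_audio)   # VGGish audio features per clip

proj = nn.Linear(d_rgb + d_flow + d_audio, d_model)
V_feature = proj(torch.cat([rgb, flow, audio], dim=-1))   # concatenated clip features

# Position ids 0..m-1 give the order of the clips; a single [video] modality id
# is shared by every clip, mirroring the identifier described above.
pos_emb = nn.Embedding(512, d_model)
mod_emb = nn.Embedding(4, d_model)        # e.g. {[video], [cap], [usr1], [usr2]} (assumed)
V_pos = pos_emb(torch.arange(num_clips))
V_mod = mod_emb(torch.zeros(num_clips, dtype=torch.long))  # index 0 stands for [video]

V = V_feature + V_mod + V_pos             # sequence representation of the clips
print(V.shape)                            # (num_clips, d_model)
```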
Step two: given the audio-visual scene representation V = (V_1, V_2, ..., V_m), where V_1, V_2, ..., V_m are the sequence representations of the video clips, take each video clip as a vertex and construct a video graph 𝒢_V = (ν_V, ε_V) based on the full-link relation, where ν_V = {V_1, V_2, ..., V_m} is the vertex set and ε_V is the set of directed dependency edges; for each directed dependency edge (V_i, V_j, l_ij), l_ij denotes the label of the edge from V_i to V_j and is set to 1.
Input the video graph into the graph convolutional neural network and output the video hidden-layer sequence G_V; each graph-convolution layer is computed as:
H_v^{(l+1)} = f(H_v^{(l)}, A_v) = σ( D̃_v^{-1/2} Ã_v D̃_v^{-1/2} H_v^{(l)} W_v^{(l)} ),
where f(·) denotes one graph-convolution layer; A_v is the adjacency matrix of 𝒢_V, i and j index the nodes of 𝒢_V, and (A_v)_{ij} is the value in the i-th row and j-th column of A_v; D_v is the degree matrix of 𝒢_V, and (D_v)_{ii} is the value in its i-th row and i-th column; so that the model can take each node's own representation into account, the identity matrix I_v is added to obtain Ã_v = A_v + I_v, and the corresponding degree matrix D̃_v is obtained accordingly to facilitate the normalization; l is the layer index of the graph convolutional neural network, H_v^{(l)} is the hidden state of the l-th graph-convolution layer, H_v^{(0)} = V, where V is the original video sequence representation, and W_v^{(l)} is a trainable weight.
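The graph-convolution computation above can be sketched as follows; the ReLU activation and the depth of two stacked layers are assumptions, while the self-loop addition and symmetric normalization follow the reconstructed propagation rule.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: H^(l+1) = act( D~^-1/2 (A + I) D~^-1/2 H^(l) W^(l) ).
    The choice of ReLU as the activation is an assumption of this sketch."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Linear(d_in, d_out, bias=False)    # trainable W^(l)

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        A_tilde = A + torch.eye(A.size(0))                  # add self-loops: A + I
        D_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))
        A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt           # symmetric normalization
        return torch.relu(A_hat @ self.weight(H))

# Full-link video graph over m clips (every directed edge label l_ij is 1).
m, d = 5, 768
A_v = torch.ones(m, m)
V = torch.randn(m, d)                                       # original clip sequence, H^(0) = V

H = V
for layer in [GCNLayer(d, d), GCNLayer(d, d)]:              # two stacked layers (assumed depth)
    H = layer(H, A_v)
G_V = H                                                     # video hidden-layer sequence
print(G_V.shape)
```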
Step three: input the video hidden-layer sequence G_V and the original video sequence representation V into a linear layer to obtain the fused video representation M_V, which serves as part of the input to the subsequent GPT-2 model; the expression is:
M_V = W_M G_V + W_V V,
where W_M and W_V are trainable weights.
Step four: use the GPT2 Tokenizer to obtain the WordPiece-based word-vector representations C_feature and H_feature of the audio-visual scene title C and the dialog history H. Concatenate the title word vectors C_feature and the dialog-history word vectors H_feature, then add position information T_pos and modality information T_mod to obtain the text sequence representation T; the expressions are:
T_feature = [C_feature, H_feature],
T = T_feature + T_mod + T_pos,
where the position information T_pos uses numbers to indicate the order in which words appear in the title and in each question-answer pair, and the modality information T_mod uses the identifier [cap] to uniformly mark the audio-visual scene title, the identifier [usr1] to mark the questioner and the identifier [usr2] to mark the respondent; each is converted into a vector of fixed dimension in actual computation. A concrete example is given in FIG. 2: the feature layer uses the GPT2 Tokenizer to encode the text "[cap] a woman … [eos] [usr1] is the woman … [eos] …" as T_feature; according to the source of each sentence, the T_mod of the text "[cap] a woman … [eos]" is marked [cap], the T_mod of the text "[usr1] is the woman …" is marked [usr1], and the T_mod of the texts "[usr2] yes she is …" and "[usr2] nothing much … [eos]" is marked [usr2].
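A hedged sketch of step four is shown below, using the Hugging Face GPT-2 tokenizer and embedding tables as stand-ins; the example sentences, the special-token handling and the simplified (constant) modality ids are illustrative assumptions.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[cap]", "[usr1]", "[usr2]", "[video]", "[eos]"]})

model = GPT2Model.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

caption = "[cap] a woman is cleaning a fridge [eos]"
history = "[usr1] is the woman eating or drinking anything ? [eos] [usr2] no she is not [eos]"
ids = tokenizer.encode(caption + " " + history, return_tensors="pt")   # (1, L)

# T = T_feature + T_mod + T_pos: token, modality and position embeddings are summed.
T_feature = model.wte(ids)                                   # token embeddings
T_pos = model.wpe(torch.arange(ids.size(1)).unsqueeze(0))    # position embeddings
cap_id = tokenizer.convert_tokens_to_ids("[cap]")
T_mod = model.wte(torch.full_like(ids, cap_id))              # simplified: every token tagged [cap];
T = T_feature + T_mod + T_pos                                # real ids would vary per segment
print(T.shape)
```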
Step five: take each word vector as a vertex and construct a text graph 𝒢_D based on sentence-level dependency relations with the Stanford CoreNLP text analysis tool; a concrete example is given in FIG. 4. For the current text "does the woman eat or drink anything", first use the GPT2 Tokenizer to obtain the word-vector representation of each word and use the Stanford CoreNLP text analysis tool to analyze the syntactic dependencies of the sentence (the links between the word vectors in the figure); with each word vector as a vertex, the graph structure can be modeled according to the dependencies. Input the graph into the graph convolutional neural network and output the text hidden-layer sequence G_D; the expression is:
H_d^{(l+1)} = f(H_d^{(l)}, A_d) = σ( D̃_d^{-1/2} Ã_d D̃_d^{-1/2} H_d^{(l)} W_d^{(l)} ),
where A_d is the adjacency matrix of 𝒢_D and D_d is its degree matrix; so that the model can take each node's own representation into account, the identity matrix I_d is added to obtain Ã_d = A_d + I_d, and the corresponding degree matrix D̃_d is obtained accordingly to facilitate the normalization; l is the layer index of the graph convolutional neural network, H_d^{(l)} is the hidden state of the l-th graph-convolution layer, H_d^{(0)} = T, and W_d^{(l)} is a trainable weight.
Step six: input the text hidden-layer sequence G_D and the original text sequence representation T into a linear layer to obtain the fused text representation N_T, which serves as part of the input to the subsequent GPT-2 model; the expression is:
N_T = W_N G_D + W_D T,
where W_N and W_D are trainable weights.
Step seven: construct the GPT-2 model for multi-modal dialog question answering (see FIG. 2 and FIG. 3 for details). FIG. 2 shows the overall model architecture, which is a multi-layer Transformer model based on the GPT-2 architecture. The model is stacked from 12 Transformer decoder modules with a masked multi-head attention mechanism. To give the model the capability of fusing multi-modal features and generating reasonable answers, the generation-oriented GPT-2 model is modified to a certain extent so that it better fits the requirements of the multi-modal dialog question-answering task. Specifically, the model feeds the result M_V of step three into a linear fully-connected layer, projects the output into the same vector space as the result N_T of step six, concatenates the two to obtain the complete multimodal input representation, and then feeds it into the pre-trained language model GPT-2.
FIG. 3 shows the specific architecture of each Transformer decoder module in the GPT-2 model. The module mainly consists of a masked multi-head self-attention mechanism and a feed-forward neural network. The masked self-attention can detect fine-grained long-term dependencies among the modal inputs, including the spatio-temporal relations of video objects, the coreference relations within the dialog history, local video features, and the reference relations of the text vocabulary, so as to generate reasonable answers that are grounded in the visual and auditory features and conform to the user's question.
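A minimal sketch of feeding the projected multimodal representations into GPT-2 is given below, using the Hugging Face implementation as a stand-in; the projection layer and all dimensions are assumptions.

```python
import torch
from transformers import GPT2LMHeadModel

d = 768
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

M_V = torch.randn(1, 5, d)        # fused video representation from step three (toy values)
N_T = torch.randn(1, 40, d)       # fused text representation from step six (toy values)

# Project the video part into the GPT-2 embedding space and concatenate with the text part,
# then feed the complete multimodal representation through `inputs_embeds`.
video_proj = torch.nn.Linear(d, gpt2.config.n_embd)
multimodal = torch.cat([video_proj(M_V), N_T], dim=1)

out = gpt2(inputs_embeds=multimodal)
print(out.logits.shape)            # (1, 45, vocab_size): next-token distributions
```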
Step eight: concatenate M_V and N_T to obtain the enhanced multimodal input representation as the complete input of the GPT-2 model for multi-modal dialog question answering constructed in step seven; design loss functions for this model and perform joint training to obtain a trained model, and then generate reasonable answers for a given audio-visual scene and user question.
A negative log-likelihood loss function is used during training so that the model can predict answers based on the audio-visual, title and dialog-history features. Formally, the model generates the answer R_n = {r_1, r_2, ..., r_K} based on the audio-visual features V, the title C, the dialog history H_{<n} and the current question Q_n; by minimizing the negative log-likelihood loss, the probability that the next output word is the corresponding word of the source sequence is maximized:
L(θ) = -E_{(V,C,H,Q,R)~D} [ Σ_j log P(r_j | r_{<j}, V, C, H_{<n}, Q_n; θ) ],
where r_{<j} denotes the first j-1 words of the answer R_n, θ denotes the trainable model parameters, the tuples (V, C, H, Q, R) are sampled from the whole training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation.
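The loss computation can be sketched as follows; the masking convention (ignore index -100 for non-answer positions) and the toy shapes reflect common practice and are assumptions, not details taken from this description.

```python
import torch
import torch.nn.functional as F

vocab, seq_len = 50257, 45
logits = torch.randn(1, seq_len, vocab)          # GPT-2 outputs for the full input sequence
labels = torch.full((1, seq_len), -100)          # -100 = ignore (video, title, history, question)
labels[0, 38:] = torch.randint(0, vocab, (seq_len - 38,))   # answer tokens r_1..r_K

# Standard causal-LM shift: position t predicts token t+1; only answer positions contribute.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()
loss = F.cross_entropy(shift_logits.view(-1, vocab), shift_labels.view(-1),
                       ignore_index=-100)
print(loss.item())
```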
The second embodiment is as follows:
This embodiment is a multi-modal dialog question-answer generation method based on a multi-relationship graph model, and differs from the first embodiment in that:
In step five, take each word vector in the text sequence representation T obtained in step four as a vertex and construct a text graph 𝒢_C based on the full-dialog coreference relations with the Stanford CoreNLP text analysis tool. FIG. 5 shows a concrete example. For the current text "a woman … a fridge … the woman … she … it …", first use the GPT2 Tokenizer to obtain the word-vector representation of each word and use the Stanford CoreNLP text analysis tool to analyze the coreference relations of the text, namely that "a woman", "the woman" and "she" corefer and that "a fridge" and "it" corefer; with each word vector as a vertex, the graph structure can be modeled according to the coreference relations, i.e., edges are built among "a woman", "the woman" and "she", and an edge is built between "a fridge" and "it". Input the graph into the graph convolutional neural network and output the text hidden-layer sequence G_C; the specific calculation formula is the same as the expression in step five of the first embodiment.
In step six, input the text hidden-layer sequence G_C obtained in steps four and five and the original text sequence representation T into a linear layer to obtain the fused text representation N_T, which serves as part of the input to the subsequent GPT-2 model; the expression is:
N_T = W_N G_C + W_D T,
where W_N and W_D are trainable weights.
Other steps and parameters are the same as those in the first embodiment.
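The coreference-graph construction of this embodiment can be sketched as follows; the function assumes that mention clusters have already been produced by a coreference resolver (such as Stanford CoreNLP), and the toy token spans for the example sentence are illustrative only.

```python
import torch

def coreference_adjacency(num_tokens: int, clusters) -> torch.Tensor:
    """Full-dialog coreference graph: every pair of tokens belonging to mentions of the
    same entity cluster is connected. `clusters` is a list of clusters, each a list of
    (start, end) token spans; producing them with a coreference resolver is not shown."""
    A = torch.zeros(num_tokens, num_tokens)
    for cluster in clusters:
        tokens = [t for (s, e) in cluster for t in range(s, e)]
        for i in tokens:
            for j in tokens:
                if i != j:
                    A[i, j] = 1.0
    return A

# "a woman ... a fridge ... the woman ... she ... it": two clusters over toy indices.
clusters = [[(0, 2), (5, 7), (8, 9)],    # {a woman, the woman, she}
            [(3, 5), (10, 11)]]          # {a fridge, it}
A_c = coreference_adjacency(12, clusters)
print(int(A_c.sum()))
```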
The third embodiment is as follows:
This embodiment is a multi-modal dialog question-answer generation method based on a multi-relationship graph model, and differs from the first or second embodiment in that:
In step five, take each word vector in the text sequence representation T obtained in step four as a vertex and, with the Stanford CoreNLP text analysis tool, construct both a text graph 𝒢_D based on sentence-level dependency relations and a text graph 𝒢_C based on full-dialog coreference relations; input the two graphs into graph convolutional neural networks respectively and output the text hidden-layer sequences G_D and G_C; the specific calculation formula is the same as the expression in step five of the first embodiment.
In step six, input the text hidden-layer sequences G_D and G_C obtained in steps four and five and the original text sequence representation T into a linear layer to obtain the fused text representation N_T, which serves as part of the input to the subsequent GPT-2 model; the expression is:
N_T = W_N [G_D, G_C] + W_D T,
where W_N and W_D are trainable weights.
Other steps and parameters are the same as those in the first or second embodiment.
The fourth embodiment is as follows:
This embodiment is a multi-modal dialog question-answer generation method based on a multi-relationship graph model, and differs from one of the first to third embodiments in that:
In step eight, to promote the fusion of information from different modalities, three tasks are introduced for fine-tuning during model training: an answer prediction task (RPT) based on the audio-visual, title and dialog-history features, a title prediction task (CPT) based on the audio-visual features, and a video-text matching task (VTMT). The first three embodiments use only one loss function, i.e., a single-task learning mode; this embodiment designs three loss functions and adopts a multi-task learning mode to strengthen the model's understanding of the different modal information.
The RPT part aims to generate the answer R_n = {r_1, r_2, ..., r_K} based on the audio-visual features V, the title C, the dialog history H_{<n} and the current question Q_n; by minimizing the negative log-likelihood loss, the probability that the next word output by the model is the corresponding word of the source sequence is maximized:
L_RPT(θ) = -E_{(V,C,H,Q,R)~D} [ Σ_j log P(r_j | r_{<j}, V, C, H_{<n}, Q_n; θ) ],
where r_{<j} denotes the first j-1 words of the answer R_n, θ denotes the trainable model parameters, the tuples (V, C, H, Q, R) are sampled from the whole training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation.
The CPT part is similar to the RPT part: for a given audio-visual feature V, the title C = {c_1, c_2, ..., c_L} is generated by minimizing a negative log-likelihood loss:
L_CPT(θ) = -E_{(V,C)~D} [ Σ_{i=1}^{L} log P(c_i | c_{<i}, V; θ) ],
where c_{<i} denotes the first i-1 words of the title C.
The VTMT part aims to determine whether the given audio-visual features V match the given text features (including the title C, the dialog history H_{<n}, the current question Q_n and the generated response R_n), so that the pre-trained language model can be successfully applied to dialog-domain tasks through fine-tuning. Specifically, the task selects about 15% of the training data and randomly replaces the corresponding original input with incorrect audio-visual features; the final output of the hidden state of the GPT-2 module is passed through a linear fully-connected layer to obtain the matching probability, and the loss is then computed with a binary cross entropy to strengthen the system's understanding of the scene; the calculation formula is:
L_VTMT(θ) = -E_{(X,Y)~D} [ Y log P(Y | X; θ) + (1 - Y) log(1 - P(Y | X; θ)) ],
where X = (V, C, H, Q, R) and Y is the label indicating whether the audio-visual features and the text features match.
Other steps and parameters are the same as those in one of the first to third embodiments.
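A hedged sketch of the joint objective is given below; summing the three losses with equal weights and the toy shapes are assumptions, since the description does not state how the losses are combined, and in practice non-answer positions would additionally be masked out of the generation losses.

```python
import torch
import torch.nn.functional as F

vocab = 50257

def nll(logits, labels):
    """Shared shifted causal-LM loss used by both RPT and CPT."""
    return F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                           labels[:, 1:].reshape(-1), ignore_index=-100)

logits_r = torch.randn(2, 45, vocab); labels_r = torch.randint(0, vocab, (2, 45))  # answers
logits_c = torch.randn(2, 20, vocab); labels_c = torch.randint(0, vocab, (2, 20))  # titles

# VTMT: a linear head on the final hidden state predicts whether the audio-visual features
# match the text; roughly 15% of the examples would get mismatched (negative) video input.
hidden = torch.randn(2, 768)
match_head = torch.nn.Linear(768, 1)
match_labels = torch.tensor([1.0, 0.0])            # 1 = matched, 0 = replaced video
loss_vtmt = F.binary_cross_entropy_with_logits(match_head(hidden).squeeze(-1), match_labels)

loss = nll(logits_r, labels_r) + nll(logits_c, labels_c) + loss_vtmt
print(loss.item())
```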
The following examples were used to demonstrate the beneficial effects of the present invention:
the first embodiment is as follows:
the data set selects an audio-visual scene perception dialogue data set of a seventh dialogue System Technology challenge race (DSTC7) issued by Hori et al in ICASSP2019 to evaluate The System performance, and in order to ensure The fairness and rationality for measuring performance differences among different models, the dividing mode of The data set is consistent with The task setting in The challenge race. The data set size and partitioning is shown in table 1.
TABLE 1 DSTC7-AVSD data set overview
The evaluation metrics are those commonly used in natural language generation tasks, including BLEU, METEOR, ROUGE-L and CIDEr; they measure the semantic similarity and language fluency between the predicted answers and the reference answers from different angles, and thus reflect system performance scientifically.
The experimental parameter settings are shown in Table 2. Specifically, during encoding, the learning rate of the Adam optimizer is set to 6.25e-5, at most 3 rounds of dialog history are used, the hidden state of the Transformer module is 768-dimensional, and the batch size is 8. During decoding, a beam-search algorithm is adopted with a beam width of 5, a maximum sentence length of 20, and a length penalty of 0.3.
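The decoding settings can be reproduced with the Hugging Face generate API as sketched below; plain GPT-2 is used as a stand-in for the trained multimodal model, and interpreting the maximum sentence length of 20 as the number of newly generated tokens is an assumption.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = tokenizer.encode("is the woman eating or drinking anything ?", return_tensors="pt")
answer_ids = model.generate(prompt,
                            num_beams=5,           # beam width 5
                            max_new_tokens=20,     # maximum answer length 20
                            length_penalty=0.3,    # length penalty 0.3
                            early_stopping=True,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```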
Table 2 experimental parameter settings
Table 3 compares the responses generated by the baseline model and by the present invention on DSTC7-AVSD. It can be seen that in this example the "television" mentioned by the questioner does not appear in the title or the dialog history, so the system needs to incorporate the audio-visual information and make simple inferences to answer the question correctly. The baseline model's answer fully indicates that it does not really understand the intent of the questioner and lacks reasoning ability; for questions whose specific information cannot be found in the title, summary or dialog history, it cannot give correct answers and may even give irrelevant ones.
Compared with the baseline model, the present method lets the given video and text fully interact, thereby capturing the hidden complex dependencies between the inputs of different modalities, extracting richer feature representations, and generating high-quality, natural answers based on reasoning.
TABLE 3 DSTC7-AVSD samples generated by VGPT model
In order to objectively and comprehensively verify the effectiveness of the invention, comparisons with the relevant baseline methods are made on the DSTC7-AVSD dataset; the specific results are shown in Table 4, where the best result for each metric is shown in bold:
(1) The Naive Fusion model proposed by Hori et al. in ICASSP2019 is the multimodal baseline method provided by the DSTC7 organizers; it uses LSTM models with question guidance to extract video and audio features respectively, uses hierarchical LSTMs to encode the dialog history, and finally combines all modalities through a projection matrix to generate answers.
(2) The Hierarchical Attention model (HA) proposed by Sanabria et al. in AAAI2019 introduces transfer learning from the video summarization task to obtain more visual details, and won first place in the DSTC7-AVSD challenge.
(3) The Multimodal Transformer Network (MTN) proposed by Le et al. in ACL2019 was the state-of-the-art system before the DSTC8-AVSD challenge; it uses a Transformer-based auto-encoding module to attend to visual features under question guidance.
(4) The Universal Multimodal Transformer (UMT) proposed by Li et al. in TASLP2021 is currently the most advanced dialog question-answering system for this task; it introduces a pre-trained GPT-2 model to learn a fused representation of the audio-visual scene in a multi-task learning manner.
TABLE 4 Objective assessment results based on DSTC7-AVSD dataset
The experimental results show that, using the third embodiment, the invention outperforms existing methods on almost all automatic metrics of the DSTC7-AVSD test set, and improves over UMT, the current most advanced model for this task, by 1% on metrics such as BLEU-2 and CIDEr. This shows that introducing multi-relationship graph structure encoding enables the dialog system to generate higher-quality answers and significantly improves model performance. Owing to the structural characteristics of the graph convolutional neural network, various syntactic and semantic relations among all words can be represented within one framework. Compared with a multilayer perceptron (MLP), the information of neighboring nodes can be considered comprehensively when computing the representation of the current node, and nodes connected over long distances can be reached by stacking multiple graph-convolution layers, which enlarges the receptive field.
Example two:
the data set selects The audio-visual scene perception dialogue data set of The eighth dialogue System Technology challenge (DSTC8) issued by Kim et al in TASLP2021 to evaluate The System performance, and The dividing mode of The data set is consistent with The task setting in The challenge in order to ensure The fairness and rationality for measuring The performance difference among different models.
The data set size and partitioning is shown in table 5.
TABLE 5 DSTC8-AVSD data set overview
The experimental parameter settings are consistent with Table 2.
In order to objectively and comprehensively verify the effectiveness of the invention, comparisons with the relevant baseline methods are made on the DSTC8-AVSD dataset; the specific results are shown in Table 6, where the best result for each metric is shown in bold:
(1) The Multi-step Joint-Modality Attention Network (JMAN) proposed by Chu et al. in arXiv2020 designs a model architecture based on a recurrent neural network and applies a multi-step attention mechanism, considering both visual and textual representations in each reasoning step so as to better integrate the information of the two modalities.
(2) Compared with the conventional Transformer architecture, the Multimodal Semantic Transformer Network (MSTN) proposed by Lee et al. in arXiv2020 adds an attention-based word-embedding layer so that the model can better take word meanings into account in the generation stage.
TABLE 6 Objective assessment results based on DSTC8-AVSD dataset
The experimental results show that, using the first embodiment, the invention outperforms existing models on almost all automatic metrics of the DSTC8-AVSD test set. The improvement on the CIDEr metric, which reflects the naturalness of sentences, is especially obvious, an increase of 0.012 (1.240 vs. 1.252); this shows that the local dependency relations and the global coreference relations encode textual information from different angles that reflect the functional similarity of the text, thereby improving on existing models.
The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it is therefore intended that all such changes and modifications be considered as within the spirit and scope of the appended claims.

Claims (10)

1. A multi-modal dialog question-answer generating method based on a multi-relationship graph model is characterized by comprising the following steps:
S1. Using a fixed-size sliding window to serially segment a video into a plurality of video clips, and acquiring for each clip its color feature V_t^{rgb}, optical flow feature V_t^{flow} and audio feature V_t^{audio}; concatenating the color feature V_t^{rgb}, the optical flow feature V_t^{flow} and the audio feature V_t^{audio} to obtain V_t^{feature}, then adding position information V_t^{pos} and modality information V_t^{mod} to obtain the sequence representation V_t of each video clip; the expressions are:
V_t^{feature} = [V_t^{rgb}, V_t^{flow}, V_t^{audio}],
V_t = V_t^{feature} + V_t^{mod} + V_t^{pos},
wherein the position information V_t^{pos} uses numbers to indicate the order in which each video clip appears, the modality information V_t^{mod} uses the identifier [video] to uniformly mark video features, and both are converted into vectors of fixed dimension in actual computation;
S2. For the audio-visual scene representation V = (V_1, V_2, ..., V_m), wherein V_1, V_2, ..., V_m are the sequence representations of the video clips, taking each video clip as a vertex and constructing a video graph 𝒢_V = (ν_V, ε_V) based on the full-link relation, wherein ν_V = {V_1, V_2, ..., V_m} is the vertex set and ε_V is the set of directed dependency edges; for each directed dependency edge (V_i, V_j, l_ij), l_ij denotes the label of the edge from V_i to V_j and is set to 1;
inputting the video graph into a graph convolutional neural network and outputting the video hidden-layer sequence G_V;
S3. Inputting the video hidden-layer sequence G_V and the original video sequence representation V into a linear layer to obtain the fused video representation M_V, which serves as part of the input to a subsequent multi-layer Transformer model based on the GPT-2 architecture;
S4. Obtaining the word-vector representations C_feature and H_feature of the audio-visual scene title C and the dialog history H; concatenating the title word vectors C_feature and the dialog-history word vectors H_feature, then adding position information T_pos and modality information T_mod to obtain the text sequence representation T; the expressions are:
T_feature = [C_feature, H_feature],
T = T_feature + T_mod + T_pos,
wherein the position information T_pos uses numbers to indicate the order in which words appear in the title and in each question-answer pair, and the modality information T_mod uses the identifier [cap] to uniformly mark the audio-visual scene title, the identifier [usr1] to mark the questioner and the identifier [usr2] to mark the respondent; both are converted into vectors of fixed dimension in actual computation;
S5. Taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing a graph structure 𝒢_D based on sentence-level dependency relations and/or a graph structure 𝒢_C based on full-dialog coreference relations, and then inputting the sentence-level dependency graph 𝒢_D and/or the full-dialog coreference graph 𝒢_C into graph convolutional neural networks respectively to obtain the corresponding text hidden-layer sequences;
S6. Inputting the text hidden-layer sequences corresponding to the sentence-level dependency graph 𝒢_D and/or the full-dialog coreference graph 𝒢_C, together with the original text sequence representation T, into a linear layer to obtain the fused text representation N_T, which serves as part of the input to the subsequent multi-layer Transformer model based on the GPT-2 architecture;
S7. Concatenating M_V and N_T to obtain the enhanced multimodal input, and generating an answer with the multi-layer Transformer model based on the GPT-2 architecture.
2. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 1, wherein, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing the graph structure based on sentence-level dependency relations 𝒢_D and obtaining the corresponding text hidden-layer sequence comprises the following steps:

firstly, the GPT2 Tokenizer is used to obtain the word vector representation corresponding to each word, the Stanford CoreNLP text analysis tool is used to parse the syntactic dependency relations of the sentences, each word vector is taken as a vertex, and the graph structure is modelled according to the syntactic dependencies; the graph is then input into the graph convolutional neural network, and the text hidden-layer sequence G_D is output.
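A possible implementation sketch of claim 2 is shown below; it uses the stanza library as a Python stand-in for the Stanford CoreNLP dependency parser named in the claim, and omits the alignment between parser tokens and GPT-2 BPE sub-tokens. The example sentence and variable names are illustrative.

```python
# Sketch of building the sentence-level dependency graph (claim 2); stanza stands in
# for Stanford CoreNLP, and token/BPE alignment is omitted for brevity.
import torch
import stanza

stanza.download("en")                                # one-time model download
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = nlp("What is the man holding in his right hand?")

words = [w for s in doc.sentences for w in s.words]
A_dep = torch.zeros(len(words), len(words))

offset = 0
for sent in doc.sentences:
    for w in sent.words:
        if w.head > 0:                               # head == 0 marks the syntactic root
            head_idx = offset + w.head - 1           # 1-based head index -> 0-based position
            dep_idx = offset + w.id - 1
            A_dep[head_idx, dep_idx] = 1.0           # directed edge: head -> dependent
    offset += len(sent.words)

# A_dep, together with the corresponding word vectors, is then passed through the
# graph convolutional network to obtain the text hidden-layer sequence G_D.
```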
3. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 1, wherein, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing the graph structure based on full-dialogue coreference relations 𝒢_C and obtaining the corresponding text hidden-layer sequence comprises the following steps:

firstly, the GPT2 Tokenizer is used to obtain the word vector representation corresponding to each word, the Stanford CoreNLP text analysis tool is used to resolve the coreference relations of the sentences, each word vector is taken as a vertex, and the graph structure is modelled according to the coreference relations; the graph is then input into the graph convolutional neural network, and the text hidden-layer sequence G_C is output.
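For claim 3, a sketch of turning coreference chains into a graph follows; the mention clusters are assumed to have already been produced by the Stanford CoreNLP coreference annotator and reduced to word positions, so the concrete indices here are hypothetical.

```python
# Sketch of the full-dialogue coreference graph (claim 3); the clusters below are
# hypothetical word positions that would come from Stanford CoreNLP's coref annotator.
import torch
from itertools import combinations

n_words = 30                                         # length of the tokenized dialogue history
coref_clusters = [[2, 11, 19], [5, 24]]              # each cluster = positions of co-referring mentions

A_coref = torch.zeros(n_words, n_words)
for cluster in coref_clusters:
    for i, j in combinations(cluster, 2):            # link every pair of mentions in a chain
        A_coref[i, j] = A_coref[j, i] = 1.0

# A_coref and the word vectors are then fed to a graph convolutional network to
# obtain the text hidden-layer sequence G_C.
```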
4. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 1, wherein, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing both the graph structure based on sentence-level dependency relations 𝒢_D and the graph structure based on full-dialogue coreference relations 𝒢_C and obtaining the corresponding text hidden-layer sequences comprises the following steps:

firstly, the GPT2 Tokenizer is used to obtain the word vector representation corresponding to each word, and the Stanford CoreNLP text analysis tool is used to parse the syntactic dependency relations and the coreference relations of the sentences respectively; each word vector is taken as a vertex, and the graph structure based on sentence-level dependency relations 𝒢_D and the graph structure based on full-dialogue coreference relations 𝒢_C are constructed respectively; the two graph structures are then respectively input into graph convolutional neural networks, and the text hidden-layer sequences G_D and G_C are output.
5. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 1, 2, 3 or 4, wherein the expression computed by each layer of the graph convolutional neural network in S5 is as follows:

$H_d^{(l+1)} = f(H_d^{(l)}, A_d) = \sigma(\tilde{D}_d^{-1/2}\,\tilde{A}_d\,\tilde{D}_d^{-1/2}\,H_d^{(l)}\,W_d^{(l)})$,

wherein f(H_d^{(l)}, A_d) represents one layer of graph convolution and σ(·) is the activation function; for 𝒢_D or 𝒢_C, A_d denotes the corresponding adjacency matrix and D_d the corresponding degree matrix; the identity matrix I_d is added to obtain $\tilde{A}_d = A_d + I_d$, and the corresponding degree matrix $\tilde{D}_d$ is obtained accordingly so as to facilitate the normalization operation; l is the layer index of the graph convolutional neural network, H_d^{(l)} is the hidden state of the l-th layer, with the text sequence representation T as the initial input H_d^{(0)}, and W_d^{(l)} are the trainable weights.
6. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 5, wherein the expression computed by each layer of the graph convolutional neural network in S2 is as follows:

$H_v^{(l+1)} = f(H_v^{(l)}, A_v) = \sigma(\tilde{D}_v^{-1/2}\,\tilde{A}_v\,\tilde{D}_v^{-1/2}\,H_v^{(l)}\,W_v^{(l)})$,

wherein f(H_v^{(l)}, A_v) represents one layer of graph convolution; A_v is the adjacency matrix of the video graph 𝒢_V, i and j denote the i-th and j-th nodes of 𝒢_V, and (A_v)_{ij} is the value in the i-th row and j-th column of A_v; D_v is the degree matrix of A_v, and (D_v)_{ii} is the value in the i-th row and i-th column of D_v; in order to enable the model to take each node's own representation into account, the identity matrix I_v is added to obtain $\tilde{A}_v = A_v + I_v$, and the corresponding degree matrix $\tilde{D}_v$ is obtained accordingly so as to facilitate the normalization operation; l is the layer index of the graph convolutional neural network, H_v^{(l)} is the hidden state of the l-th layer, H_v^{(0)} = V, where V is the representation of the original video sequence, and W_v^{(l)} are the trainable weights.
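The layer computation of claims 5 and 6 can be sketched as a single PyTorch module, shown below under the assumption that σ is a ReLU; the same code serves the text graphs (adjacency A_d, initial state T) and the video graph (adjacency A_v, initial state V).

```python
# Sketch of one graph-convolution layer, H^(l+1) = sigma(D~^-1/2 (A+I) D~^-1/2 H^(l) W^(l)),
# assuming sigma = ReLU; usable for the text graphs (claim 5) and the video graph (claim 6).
import torch
import torch.nn as nn


class GraphConvLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)           # trainable W^(l)

    def forward(self, h, adj):
        a_tilde = adj + torch.eye(adj.size(0))                  # A~ = A + I, so each node sees itself
        d_inv_sqrt = torch.diag(a_tilde.sum(dim=1).pow(-0.5))   # D~^{-1/2} from the degree matrix
        a_norm = d_inv_sqrt @ a_tilde @ d_inv_sqrt              # symmetric normalization
        return torch.relu(a_norm @ self.weight(h))              # sigma(...) with sigma = ReLU


# Example: two layers over the full-link video graph, starting from H^(0) = V.
m, dim = 8, 768
V, A_v = torch.randn(m, dim), torch.ones(m, m)
h = V
for layer in (GraphConvLayer(dim), GraphConvLayer(dim)):
    h = layer(h, A_v)
G_V = h                                                         # video hidden-layer sequence
```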
7. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 6, wherein, in S3, the fused video representation V_fuse is further input into a linear fully-connected layer, the output of which is projected into the same vector space as the fused text representation T_fuse obtained in S6; the two are concatenated to obtain the complete multi-modal input representation, which is then input into the pre-trained language model GPT-2.
8. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 7, wherein the multi-layer Transformer model based on the GPT-2 architecture in S7 is formed by stacking 12 Transformer decoder modules with a masked multi-head attention mechanism.
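A sketch of claims 7 and 8 follows, assuming the HuggingFace Transformers library: the fused video representation is projected through a linear fully-connected layer into the text vector space, concatenated with the fused text representation, and passed to a 12-layer GPT-2 decoder through the inputs_embeds argument; all tensor shapes are illustrative.

```python
# Sketch of claims 7-8: project V_fuse into the space of T_fuse, concatenate, and feed
# the result to a 12-layer GPT-2 model; shapes and the random tensors are placeholders.
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

dim = 768
V_fuse = torch.randn(1, 8, dim)                  # fused video representation (batch, clips, dim)
T_fuse = torch.randn(1, 20, dim)                 # fused text representation (batch, tokens, dim)

project = nn.Linear(dim, dim)                    # linear fully-connected projection layer
V_proj = project(V_fuse)                         # video features mapped into the text space

inputs = torch.cat([V_proj, T_fuse], dim=1)      # complete multi-modal input representation

config = GPT2Config(n_layer=12, n_head=12, n_embd=dim)   # 12 masked multi-head attention decoder blocks
model = GPT2LMHeadModel(config)                  # or GPT2LMHeadModel.from_pretrained("gpt2") for pre-trained weights
outputs = model(inputs_embeds=inputs)            # next-word logits at every position
```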
9. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 8, wherein the multi-layer Transformer model based on the GPT-2 architecture is trained with a negative log-likelihood loss function, and the training process comprises the following steps:

generating the answer R_n = (r_1, r_2, ..., r_K) based on the audio-video features V, the title C, the dialogue history H_{<n} and the current question Q_n, and maximizing the probability that the next word output by the model is the corresponding word of the reference sequence by minimizing the negative log-likelihood loss function:

$L(\theta) = -E_{(V,C,H,Q,R)\sim D}\big[\textstyle\sum_{j} \log P(r_j \mid r_{<j}, V, C, H_{<n}, Q_n; \theta)\big]$,

wherein r_{<j} represents the first j-1 words of the answer R_n, θ refers to the trainable model parameters, the tuple (V, C, H, Q, R) is sampled from the entire training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation.
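The loss of claim 9 can be computed with the standard language-modelling loss of the Transformers library, as sketched below; the video features are omitted and the context/answer strings are placeholders, so this only illustrates how the negative log-likelihood is restricted to the answer tokens.

```python
# Sketch of the negative log-likelihood objective of claim 9 (text inputs only; the
# strings and the -100 masking of context positions are illustrative).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

context = "caption , dialogue history ... current question ?"   # stands for (V, C, H_<n, Q_n)
answer = " the man is holding a cup ."                          # reference answer R_n

ctx_ids = tokenizer(context, return_tensors="pt").input_ids
ans_ids = tokenizer(answer, return_tensors="pt").input_ids
input_ids = torch.cat([ctx_ids, ans_ids], dim=1)

labels = input_ids.clone()
labels[:, : ctx_ids.size(1)] = -100              # positions set to -100 are ignored by the loss

loss = model(input_ids=input_ids, labels=labels).loss   # mean -log P(r_j | r_<j, context; theta)
loss.backward()                                         # gradients w.r.t. the trainable parameters theta
```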
10. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 8, wherein, during training, the multi-layer Transformer model based on the GPT-2 architecture is jointly trained on an answer prediction task RPT, an audio-video title prediction task CPT and an audio-video-text matching task VTMT based on the audio-video, title and dialogue history features, and the training process comprises the following steps:
the RPT part aims to generate the answer R_n based on the audio-video features V, the title C, the dialogue history H_{<n} and the current question Q_n, and maximizes the probability that the next word output by the model is the corresponding word of the reference sequence by minimizing the negative log-likelihood loss function:

$L_{RPT}(\theta) = -E_{(V,C,H,Q,R)\sim D}\big[\textstyle\sum_{j} \log P(r_j \mid r_{<j}, V, C, H_{<n}, Q_n; \theta)\big]$,

wherein r_{<j} represents the first j-1 words of the answer R_n, θ refers to the trainable model parameters, the tuple (V, C, H, Q, R) is sampled from the entire training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation;
the CPT part is similar to the RPT part: for given audio-video features V, the title C = {c_1, c_2, ..., c_L} is generated by minimizing the negative log-likelihood loss function, which is as follows:

$L_{CPT}(\theta) = -E_{(V,C)\sim D}\big[\textstyle\sum_{i} \log P(c_i \mid c_{<i}, V; \theta)\big]$,

wherein c_{<i} represents the first i-1 words of the title C;
the VTMT part aims to determine whether the given audio-video features V match the given text features, the given text features comprising the title C, the dialogue history H_{<n}, the current question Q_n and the generated answer R_n; a certain proportion of the training data is selected, and the corresponding correct original audio-video input is replaced with randomly sampled audio-video features; the probability of whether the two match is obtained by passing the final hidden-state output of the GPT-2 module through a linear fully-connected layer, and the loss function is then computed with the binary cross-entropy:

$L_{VTMT}(\theta) = -E_{(X,Y)\sim D}\big[Y \log P(Y=1 \mid X; \theta) + (1-Y)\log\big(1 - P(Y=1 \mid X; \theta)\big)\big]$,

wherein X = (V, C, H, Q, R), and Y is the label characterizing whether the audio-video features and the text features match.
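A sketch of how the three objectives of claim 10 could be combined in one training step is given below; the loss values, hidden states, matching labels and the use of the last position's hidden state for the matching head are all assumptions made for illustration.

```python
# Sketch of the joint objective of claim 10: L = L_RPT + L_CPT + L_VTMT. The LM losses
# and hidden states are placeholders; in a real step they come from GPT-2 forward passes,
# and the 0-labelled samples have had their video features replaced by random ones.
import torch
import torch.nn as nn

dim = 768
match_head = nn.Linear(dim, 1)                     # linear fully-connected matching layer

loss_rpt = torch.tensor(2.1)                       # answer-prediction LM loss (as in claim 9)
loss_cpt = torch.tensor(1.7)                       # title-prediction LM loss
last_hidden = torch.randn(4, 30, dim)              # GPT-2 hidden states for a batch of 4
match_labels = torch.tensor([1.0, 1.0, 0.0, 1.0])  # 0 = video features were randomly swapped

logits = match_head(last_hidden[:, -1, :]).squeeze(-1)   # match / mismatch score per sample
loss_vtmt = nn.functional.binary_cross_entropy_with_logits(logits, match_labels)

total_loss = loss_rpt + loss_cpt + loss_vtmt       # joint loss optimized over theta
```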
CN202211451009.5A 2022-11-18 2022-11-18 Multi-modal dialog question-answer generation method based on multi-relationship graph model Pending CN115712709A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211451009.5A CN115712709A (en) 2022-11-18 2022-11-18 Multi-modal dialog question-answer generation method based on multi-relationship graph model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211451009.5A CN115712709A (en) 2022-11-18 2022-11-18 Multi-modal dialog question-answer generation method based on multi-relationship graph model

Publications (1)

Publication Number Publication Date
CN115712709A true CN115712709A (en) 2023-02-24

Family

ID=85233794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211451009.5A Pending CN115712709A (en) 2022-11-18 2022-11-18 Multi-modal dialog question-answer generation method based on multi-relationship graph model

Country Status (1)

Country Link
CN (1) CN115712709A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108206A (en) * 2023-04-13 2023-05-12 中南大学 Combined extraction method of financial data entity relationship and related equipment
CN116757460A (en) * 2023-08-23 2023-09-15 南京争锋信息科技有限公司 Emergency command scheduling platform construction method and system based on deep learning
CN116757460B (en) * 2023-08-23 2024-01-09 南京争锋信息科技有限公司 Emergency command scheduling platform construction method and system based on deep learning
CN117708307A (en) * 2024-02-06 2024-03-15 西北工业大学 Method and device for fusing micro-tuning and Adapter of large language model
CN117708307B (en) * 2024-02-06 2024-05-14 西北工业大学 Method and device for fusing micro-tuning and Adapter of large language model

Similar Documents

Publication Publication Date Title
Uppal et al. Multimodal research in vision and language: A review of current and emerging trends
Gao et al. Hierarchical representation network with auxiliary tasks for video captioning and video question answering
CN115712709A (en) Multi-modal dialog question-answer generation method based on multi-relationship graph model
US20220398486A1 (en) Learning content recommendation system based on artificial intelligence learning and operating method thereof
Liu et al. Cross-attentional spatio-temporal semantic graph networks for video question answering
CN112069781B (en) Comment generation method and device, terminal equipment and storage medium
CN114020891A (en) Double-channel semantic positioning multi-granularity attention mutual enhancement video question-answering method and system
Xiao et al. Exploring diverse and fine-grained caption for video by incorporating convolutional architecture into LSTM-based model
CN113127623A (en) Knowledge base problem generation method based on hybrid expert model and joint learning
CN115391511A (en) Video question-answering method, device, system and storage medium
CN110309360A (en) A kind of the topic label personalized recommendation method and system of short-sighted frequency
Jhunjhunwala et al. Multi-action dialog policy learning with interactive human teaching
CN111741236A (en) Method and device for generating positioning natural image subtitles based on consensus diagram characteristic reasoning
Varghese et al. Towards participatory video 2.0
CN115687638A (en) Entity relation combined extraction method and system based on triple forest
CN115659279A (en) Multi-mode data fusion method based on image-text interaction
Nagao Artificial intelligence accelerates human learning: Discussion data analytics
Wang et al. SCANET: Improving multimodal representation and fusion with sparse‐and cross‐attention for multimodal sentiment analysis
CN116977701A (en) Video classification model training method, video classification method and device
Wang et al. How to make a BLT sandwich? learning to reason towards understanding web instructional videos
CN113780209B (en) Attention mechanism-based human face attribute editing method
CN115273856A (en) Voice recognition method and device, electronic equipment and storage medium
Wang et al. What is the competence boundary of Algorithms? An institutional perspective on AI-based video generation
Dean Altering screenwriting frameworks through practice-based research: a methodological approach
CN115422329A (en) Knowledge-driven multi-channel screening fusion dialogue generation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination