CN115712709A - Multi-modal dialog question-answer generation method based on multi-relationship graph model


Info

Publication number
CN115712709A
CN115712709A (application CN202211451009.5A)
Authority
CN
China
Prior art keywords
video
graph
text
model
sequence
Prior art date
Legal status
Pending
Application number
CN202211451009.5A
Other languages
Chinese (zh)
Inventor
吕姚嘉
朱文轩
刘铭
徐洁馨
李秋霞
秦兵
Current Assignee
Harbin Institute of Technology
China Merchants Bank Co Ltd
Original Assignee
Harbin Institute of Technology
China Merchants Bank Co Ltd
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology, China Merchants Bank Co Ltd filed Critical Harbin Institute of Technology
Priority to CN202211451009.5A priority Critical patent/CN115712709A/en
Publication of CN115712709A publication Critical patent/CN115712709A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

A multi-modal dialog question-answer generation method based on a multi-relationship graph model, relating to the field of multi-modal dialog question answering. The invention aims to solve the problem that existing multi-modal dialog systems consider only serialized scene information, which leads to mediocre model performance. First, the video is serially segmented into a plurality of video clips; the color feature, optical flow feature and audio feature of each clip are extracted and concatenated, and position information and modality information are added to obtain the sequence representation of each video clip. Each video clip is then taken as a vertex to construct a video graph based on the full-link relation, which is input into a graph convolutional neural network to obtain a video hidden-layer sequence and its fused representation with the original video sequence. Next, the word vectors corresponding to the audio-visual scene title and the dialog history are processed in a similar way to obtain the corresponding text hidden-layer sequences and their fused representation with the original text sequence. Finally, an answer is generated with the neural network model.

Description

Multi-modal dialog question-answer generation method based on multi-relationship graph model
Technical Field
The invention belongs to the technical field of dialog question answering, and in particular relates to a multi-modal dialog question-answer generation method.
Background
Current research on dialog question-answering systems is mainly divided into two branches: text-based and multimodal. The text dialog question-answering task has two major difficulties: answer generation requires contextual reasoning over the dialog, and large-scale dialog datasets are lacking. Because pre-trained language models (LMs) have learned rich semantic information from other text data, they can perform a certain degree of reasoning and effectively compensate for the insufficient amount of dialog data, so that a system can still obtain good results in a low-resource setting. Therefore, introducing a pre-trained language model into the dialog question-answering task can deepen the system's understanding of the text and allow the current user question to be handled by reasoning over historical dialog turns, thereby improving the quality of the generated answers. In ISCA2020, Whang et al. applied a pre-trained language model in open-domain dialog to select candidate answers, where the output of the pre-trained language model (e.g., the [CLS] token in BERT) is used as the context representation of each dialog-context and candidate-answer pair. In WNGT2019, Budzianowski et al. assumed that the true dialog state is available and combined the inputs into a single sequence to generate responses for task-oriented dialog; since the dialog state and the database state can be viewed as raw text input, the system can be fine-tuned with pre-trained language models. In ICASSP2020, Lai et al. introduced a GPT-2 model and used its output to represent the predicted slot values, thereby tracking the dialog state.
Hierarchical pointer networks are also widely used in text dialog systems (e.g., in ACL2020). In ICLR2019, Wu et al. combined a global encoder with a local decoder, enabling external knowledge to be shared in a task-oriented dialog setting. In NAACL2019, Reddy et al. designed a multi-level memory framework for task-oriented dialog. In ACL2019, Tian et al. explored how to extract valuable information during training and built a memory-augmented architecture. In addition, multi-task learning has also been shown to improve the quality of natural-language answers. In ACL2019, Chen et al. introduced a working memory that interacts sufficiently with two long-term memories, which capture the dialog history and the knowledge-base tuples, to generate high-quality answers. In EMNLP2019, Lin et al. likewise applied heterogeneous memory networks to this field, given their ability to simultaneously exploit the dialog context, user questions, and knowledge-base information.
Multimodal dialog question answering opens up a new paradigm for building powerful dialog systems. Current research focuses on bridging the gap between language and vision with multi-granularity complementary information between static images and text. For example, the Visual Dialog task proposed by Das et al. in CVPR2017 provides a picture and multiple rounds of associated dialog, and requires the model to correctly answer the questioner's questions in natural language according to the given image and dialog history. Although this task has been important in advancing multi-modal dialog question-answering systems, dialog grounded in static images has inherent limitations that greatly restrict the question-answering system's dynamic perception of spatio-temporal changes, making it unable to handle many applications that require understanding a specific context to make reasonable inferences. Therefore, to improve the spatio-temporal intelligence of question-answering systems, a new task, Audio-Visual Scene-Aware Dialog (AVSD), was introduced; it can be regarded as a generalized form of Visual Dialog, i.e., visual dialog grounded in continuous picture frames and audio information, and has broader application prospects than Visual Dialog. However, existing solutions mainly use separate encoders to encode the different modalities and then fuse their representations with an attention mechanism to generate response sentences. Such late-fusion schemes consider only the serialized features of scenes and dialogs and neglect the multi-granularity semantic complementarity among the modalities, so the performance of existing models is unsatisfactory. Therefore, exploring semantic representations of dialog scenes and modality-fusion schemes is of great significance for realizing higher-performance multi-modal dialog question-answering systems.
Meanwhile, the focus of research in the field of multimodal dialog is shifting toward how to adequately integrate multi-source heterogeneous information, including images, audio, video, and text. Compared with text dialog question answering, the multimodal task additionally introduces audio-visual features related to the dialog, so the problem of fine-grained interaction among different modalities must be solved.
The attention mechanism is the mainstream research method in this field; it can reduce the gap between visual and language modality representations. In CVPR2018, the CoAtt model designed by Wu et al. contains a sequential co-attention encoder in which each input feature is attended by the other features in a sequential manner. The ReDAN model proposed by Gan et al. in ACL2019 and the DMRM model proposed by Chen et al. in AAAI2020 answer a series of questions about images through multi-step reasoning based on a dual attention mechanism. The LTMI model designed by Nguyen et al. in ECCV2020 uses a multi-head attention mechanism to let the modalities of interest interact.
Pre-trained language models built on the Transformer architecture, itself based on the attention mechanism, also perform well in learning cross-modal representations for vision-language tasks. In the image-captioning task, Li et al. in AAAI2020 constructed a BERT-based architecture to improve text and visual representations; in NIPS2019, Lu et al. handled the visual question-answering task with a similar approach, with the difference that, when processing multimodal input, the visual and textual representations are kept separate rather than merged into a single sequence. In IJCNLP2019, Alberti et al. noted the important role of early-fusion and late-fusion methods in enriching cross-modal representations. In ICCV2019, Sun et al. proposed the VideoBERT model, which uses BERT to generate video descriptions; it abandons representing video frames with visual features and instead converts frame-level features into visual tokens as the raw input to the model.
Recent research has also explored higher-level semantic representations of pictures or dialog histories, especially modeling them with graph structures. In AAAI2020, the DualVD model proposed by Jiang et al. describes picture features from both visual and semantic perspectives: the visual graph module helps extract surface information, including entities and relations, while the semantic graph module facilitates the transition of the dialog question-answering system from global to local visual-semantic understanding. In CVPR2020, the CAG model designed by Guo et al. builds graph nodes from entity-related visual representations and history-related contextual representations, updates the corresponding edge weights with an adaptive top-K message-passing mechanism, and establishes a vision-semantics dynamic graph for subsequent reasoning. In ACL2021, the GoG model proposed by Chen et al. takes into account the interaction between different relations and therefore models both a graph of the current question's dependencies on the dialog history and a graph of object (region) relations grounded in the current question.
Disclosure of Invention
The invention aims to solve the problem that existing multi-modal dialog systems consider only serialized scene information, which leads to mediocre model performance, and provides a multi-modal dialog question-answer generation method based on a multi-relationship graph model.
A multi-modal dialog question-answer generation method based on a multi-relationship graph model comprises the following steps:
S1. Segment the video into a plurality of video clips with a fixed-size sliding window, and for each clip obtain its color feature V_t^{rgb}, optical flow feature V_t^{flow} and audio feature V_t^{audio}; concatenate the color feature V_t^{rgb}, the optical flow feature V_t^{flow} and the audio feature V_t^{audio} to obtain V_t^{feature}, then add position information V_t^{pos} and modality information V_t^{mod} to obtain the sequence representation V_t of each video clip; the expressions are:
V_t^{feature} = [V_t^{rgb}, V_t^{flow}, V_t^{audio}],
V_t = V_t^{feature} + V_t^{mod} + V_t^{pos},
where the position information V_t^{pos} uses numbers to indicate the order in which each video clip appears, the modality information V_t^{mod} uses the identifier [video] to uniformly mark video features, and both are converted into vectors of fixed dimension in actual computation;
S2. For the audio-visual scene representation V = (V_1, V_2, ..., V_m), where V_1, V_2, ..., V_m are the sequence representations of the video clips, take each video clip as a vertex and construct a video graph 𝒢_V = (ν_V, ε_V) based on the full-link relation, where ν_V = {V_1, V_2, ..., V_m} is the vertex set and ε_V is the set of directed dependency edges; for each directed dependency edge (V_i, V_j, l_ij), l_ij denotes the label of the edge from V_i to V_j and is set to 1;
input the video graph into a graph convolutional neural network and output the video hidden-layer sequence G_V;
S3. Input the video hidden-layer sequence G_V and the original video sequence representation V into a linear layer to obtain the fused video representation M_V, which serves as part of the input to the subsequent multi-layer Transformer model based on the GPT-2 architecture;
S4. Obtain the word-vector representations C_feature and H_feature of the audio-visual scene title C and the dialog history H; concatenate the title word vectors C_feature and the dialog-history word vectors H_feature, then add position information T_pos and modality information T_mod to obtain the text sequence representation T; the expressions are:
T_feature = [C_feature, H_feature],
T = T_feature + T_mod + T_pos,
where the position information T_pos uses numbers to indicate the order in which words appear in the title and in each question-answer pair, and the modality information T_mod uses the identifier [cap] to uniformly mark the audio-visual scene title, the identifier [usr1] to mark the questioner and the identifier [usr2] to mark the respondent; both are converted into vectors of fixed dimension in actual computation;
S5. Take each word vector in the text sequence representation T obtained in S4 as a vertex and construct a graph structure 𝒢_D based on sentence-level dependency relations and/or a graph structure 𝒢_C based on full-dialog coreference relations; then input the sentence-level dependency graph 𝒢_D and/or the full-dialog coreference graph 𝒢_C into graph convolutional neural networks respectively to obtain the corresponding text hidden-layer sequences;
S6. Input the text hidden-layer sequences corresponding to the sentence-level dependency graph 𝒢_D and/or the full-dialog coreference graph 𝒢_C, together with the original text sequence representation T, into a linear layer to obtain the fused text representation N_T, which serves as part of the input to the subsequent multi-layer Transformer model based on the GPT-2 architecture;
S7. Concatenate M_V and N_T to obtain the enhanced multimodal input, and generate the answer with the multi-layer Transformer model based on the GPT-2 architecture.
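As a high-level illustration of S1 to S7, the following minimal PyTorch sketch shows how the pieces fit together. It is illustrative only: the dimensions, the placeholder adjacency matrices, the single simplified graph-convolution step, and the symbols M_V and N_T used for the fused representations are assumptions made for the sketch, not the exact implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions only; real feature sizes are not fixed by this sketch.
m, n_words, d = 5, 12, 768

# S1: per-clip features (color + flow + audio already concatenated and projected
# to d), plus position and modality embeddings.
V = torch.randn(m, d) + torch.randn(m, d) + torch.randn(m, d)   # V_feature + V_pos + V_mod

# S2-S3: fully linked video graph -> simplified graph convolution -> fuse with V.
A_v = torch.ones(m, m)                                  # full-link adjacency, l_ij = 1
G_V = torch.relu((A_v / m) @ V @ torch.randn(d, d))     # one un-normalized GCN step
M_V = nn.Linear(2 * d, d)(torch.cat([G_V, V], dim=-1))  # fused video representation

# S4-S6: text sequence (title + history) -> dependency/coreference graph -> fuse.
T = torch.randn(n_words, d)                             # T_feature + T_mod + T_pos
A_d = torch.eye(n_words)                                # placeholder text-graph adjacency
G_D = torch.relu(A_d @ T @ torch.randn(d, d))
N_T = nn.Linear(2 * d, d)(torch.cat([G_D, T], dim=-1))  # fused text representation

# S7: concatenate along the sequence axis and hand the result to a GPT-2 style decoder.
multimodal_input = torch.cat([M_V, N_T], dim=0)         # shape (m + n_words, d)
print(multimodal_input.shape)
```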
Further, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing the graph structure 𝒢_D based on sentence-level dependency relations, and obtaining the corresponding text hidden-layer sequence is:
first, use the GPT2 Tokenizer to obtain the word-vector representation of each word and use the Stanford CoreNLP text analysis tool to analyze the syntactic dependencies of the sentence; take each word vector as a vertex and model the graph structure according to the syntactic dependencies; then input the graph into the graph convolutional neural network and output the text hidden-layer sequence G_D;
Alternatively, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing the graph structure 𝒢_C based on full-dialog coreference relations, and obtaining the corresponding text hidden-layer sequence is:
first, use the GPT2 Tokenizer to obtain the word-vector representation of each word and use the Stanford CoreNLP text analysis tool to analyze the coreference relations of the sentences; take each word vector as a vertex and model the graph structure according to the coreference relations; then input the graph into the graph convolutional neural network and output the text hidden-layer sequence G_C;
Alternatively, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing both the graph structure 𝒢_D based on sentence-level dependency relations and the graph structure 𝒢_C based on full-dialog coreference relations, and obtaining the corresponding text hidden-layer sequences is:
first, use the GPT2 Tokenizer to obtain the word-vector representation of each word and use the Stanford CoreNLP text analysis tool to analyze the syntactic dependencies and the coreference relations of the sentences respectively; take each word vector as a vertex and construct the sentence-level dependency graph 𝒢_D and the full-dialog coreference graph 𝒢_C respectively; then input the two graph structures into graph convolutional neural networks respectively and output the text hidden-layer sequences G_D and G_C.
Further, each graph-convolution layer of the graph convolutional neural network in S5 is computed as:
H_d^{(l+1)} = f(H_d^{(l)}, A_d) = σ( D̃_d^{-1/2} Ã_d D̃_d^{-1/2} H_d^{(l)} W_d^{(l)} ),
where f(·) denotes one graph-convolution layer; for 𝒢_D or 𝒢_C, A_d denotes the corresponding adjacency matrix and D_d the corresponding degree matrix; the identity matrix I_d is added to obtain Ã_d = A_d + I_d, and the corresponding degree matrix D̃_d is obtained accordingly to facilitate the normalization; l is the layer index of the graph convolutional neural network, H_d^{(l)} is the hidden state of the l-th graph-convolution layer, H_d^{(0)} = T, and W_d^{(l)} is a trainable weight.
further, the expression calculated by each layer of the graph convolution neural network in S2 is:
Figure BDA0003949948550000059
wherein, f (H) v (l+1) ,A v ) Representing each layer of graph convolution; a. The v Is composed of
Figure BDA00039499485500000510
I, j respectively represent
Figure BDA00039499485500000511
The number of nodes i, j,
Figure BDA00039499485500000512
is A v The value of the ith row and jth column;
Figure BDA00039499485500000513
is composed of
Figure BDA00039499485500000514
The degree matrix of (c) is,
Figure BDA00039499485500000515
is D v The value of the ith row and ith column; in order to make the model capable of considering the node self-expression, an identity matrix I is added v To obtain
Figure BDA00039499485500000516
Accordingly, a corresponding degree matrix is obtained
Figure BDA00039499485500000517
To facilitate normalization operations; l is the number of layers of the graph convolution neural network,
Figure BDA00039499485500000518
convolution of the hidden state of the neural network for the l-th layer,
Figure BDA00039499485500000519
v is the representation of the original video sequence,
Figure BDA00039499485500000520
are trainable weights.
Furthermore, when the multi-layer Transformer model based on the GPT-2 architecture processes the inputs, the fused video representation M_V from S3 is fed into a linear fully-connected layer so that its output is projected into the same vector space as the fused text representation N_T from S6; the two are concatenated to obtain the complete multimodal input representation, which is then fed into the pre-trained language model GPT-2.
Preferably, the multi-layer Transformer model based on the GPT-2 architecture in S7 is formed by stacking 12 Transformer decoder modules with a masked multi-head attention mechanism.
Further, the multi-layer Transformer model based on the GPT-2 architecture is trained with a negative log-likelihood loss function; the training process is:
based on the audio-visual features V, the title C, the dialog history H_{<n} and the current question Q_n, generate the answer R_n = {r_1, r_2, ..., r_K}; by minimizing the negative log-likelihood loss, the probability that the next output word is the corresponding word of the source sequence is maximized:
L(θ) = -E_{(V,C,H,Q,R)~D} [ Σ_j log P(r_j | r_{<j}, V, C, H_{<n}, Q_n; θ) ],
where r_{<j} denotes the first j-1 words of the answer R_n, θ denotes the trainable model parameters, the tuples (V, C, H, Q, R) are sampled from the whole training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation.
Alternatively, the multi-layer Transformer model based on the GPT-2 architecture is jointly trained with an answer prediction task (RPT) based on the audio-visual, title and dialog-history features, a title prediction task (CPT) and an audio-video-text matching task (VTMT); the training process is:
the RPT part aims to generate the answer R_n = {r_1, r_2, ..., r_K} based on the audio-visual features V, the title C, the dialog history H_{<n} and the current question Q_n; by minimizing the negative log-likelihood loss, the probability that the next word output by the model is the corresponding word of the source sequence is maximized:
L_RPT(θ) = -E_{(V,C,H,Q,R)~D} [ Σ_j log P(r_j | r_{<j}, V, C, H_{<n}, Q_n; θ) ],
where r_{<j} denotes the first j-1 words of the answer R_n, θ denotes the trainable model parameters, the tuples (V, C, H, Q, R) are sampled from the whole training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation;
the CPT part is similar to the RPT part: for a given audio-visual feature V, the title C = {c_1, c_2, ..., c_L} is generated by minimizing a negative log-likelihood loss:
L_CPT(θ) = -E_{(V,C)~D} [ Σ_{i=1}^{L} log P(c_i | c_{<i}, V; θ) ],
where c_{<i} denotes the first i-1 words of the title C;
the VTMT part aims to determine whether the given audio-visual features V match the given text features, where the text features comprise the title C, the dialog history H_{<n}, the current question Q_n and the generated response R_n; a certain proportion of the training data is selected and the corresponding original input is randomly replaced with incorrect audio-visual features; the final output of the hidden state of the GPT-2 module is passed through a linear fully-connected layer to obtain the matching probability, and the loss is then computed with a binary cross entropy:
L_VTMT(θ) = -E_{(X,Y)~D} [ Y log P(Y | X; θ) + (1 - Y) log(1 - P(Y | X; θ)) ],
where X = (V, C, H, Q, R) and Y is the label indicating whether the audio-visual features and the text features match.
Beneficial effects:
the invention relates to a multi-modal dialog question-answer generation method based on a multi-relationship graph model, which constructs the multi-relationship graph model according to the characteristics of different modes to enrich multi-modal characteristic representation. By modeling the corresponding relation of entities in the continuous video clips and the syntactic and semantic relations implied in the continuous conversation, the understanding of the system to the scenes and the conversation is further deepened, the defect that the existing method only considers the coding of the time sequence and the language sequence is improved, and the quality of generated answers is further improved.
The invention aims to improve the situation in which current pre-trained model architectures use only the serialized information of videos or texts to obtain the embedded representation of each video clip or word, which makes multi-modal dialog question-answering systems generate unsatisfactory answers. A graph convolutional neural network is additionally added so that the model can effectively encode the structural information of the multi-relationship graphs. Specifically, the model introduces the graph convolutional neural network and assigns weights to the current node by computing the data distribution over its set of adjacent nodes; after stacking multiple graph-convolution layers, the model can reason about nodes several hops away from the current node and thus capture the interactions between the current node and distant nodes, so that the syntactic or semantic relations among all words can be represented within one framework.
The invention can effectively improve various metrics of the system-generated responses, including BLEU, METEOR, ROUGE-L and CIDEr. On audio-visual scene-aware dialog datasets, the answers generated by the method are compared with manually annotated answers on multiple metrics; the experimental results show that the generated answers conform to basic habits of human expression, largely match the manually annotated results, and outperform all existing models on this task. The improvement on the CIDEr metric, which reflects the naturalness of sentences, is especially obvious: an average improvement of 1% over the strongest baseline results, which fully demonstrates the effectiveness and superiority of the method.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a block diagram of the overall model architecture of the present invention;
FIG. 3 is a block diagram of the basic elements of a pre-trained language model;
FIG. 4 is an example of a graph model construction based on dependencies;
FIG. 5 is an example graph model construction based on coreference relationships.
Detailed Description
The first embodiment is as follows: this embodiment is described with reference to FIG. 1.
The embodiment is a multi-modal dialog question-answer generation method based on a multi-relationship graph model, which comprises the following steps:
Step one: use a fixed-size sliding window to serially segment the video into a plurality of video clips; for each clip, use an I3D model to obtain its color feature V_t^{rgb} and optical flow feature V_t^{flow}, and use a VGGish model to obtain its audio feature V_t^{audio}; concatenate the color feature V_t^{rgb}, the optical flow feature V_t^{flow} and the audio feature V_t^{audio} to obtain V_t^{feature}, then add position information V_t^{pos} and modality information V_t^{mod} to obtain the sequence representation V_t of each video clip; the expressions are:
V_t^{feature} = [V_t^{rgb}, V_t^{flow}, V_t^{audio}],
V_t = V_t^{feature} + V_t^{mod} + V_t^{pos},
where the position information V_t^{pos} uses numbers to indicate the order in which each video clip appears, and the modality information V_t^{mod} uses the identifier [video] to uniformly mark video features; both are converted into vectors of fixed dimension in actual computation. As shown in FIG. 2, the clip representations are denoted V1, V2, V3, V4, V5, and the corresponding modality tokens are all marked [video].
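A minimal sketch of step one is given below, assuming PyTorch, typical I3D/VGGish feature sizes and a GPT-2-sized hidden dimension; none of these dimensions are specified above and they are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

# Assumed sizes: 2048-d I3D streams, 128-d VGGish audio, 768-d model dimension.
d_rgb, d_flow, d_audio, d_model, num_clips = 2048, 2048, 128, 768, 5

rgb   = torch.randn(num_clips, d_rgb)     # I3D color features per clip
flow  = torch.randn(num_clips, d_flow)    # I3D optical-flow features per clip
audio = torch.randn(num_clips, d_audio)   # VGGish audio features per clip

proj = nn.Linear(d_rgb + d_flow + d_audio, d_model)
V_feature = proj(torch.cat([rgb, flow, audio], dim=-1))   # concatenated clip features

# Position ids 0..m-1 give the order of the clips; a single [video] modality id
# is shared by every clip, mirroring the identifier described above.
pos_emb = nn.Embedding(512, d_model)
mod_emb = nn.Embedding(4, d_model)        # e.g. {[video], [cap], [usr1], [usr2]} (assumed)
V_pos = pos_emb(torch.arange(num_clips))
V_mod = mod_emb(torch.zeros(num_clips, dtype=torch.long))  # index 0 stands for [video]

V = V_feature + V_mod + V_pos             # sequence representation of the clips
print(V.shape)                            # (num_clips, d_model)
```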
Step two: given the audio-visual scene representation V = (V_1, V_2, ..., V_m), where V_1, V_2, ..., V_m are the sequence representations of the video clips, take each video clip as a vertex and construct a video graph 𝒢_V = (ν_V, ε_V) based on the full-link relation, where ν_V = {V_1, V_2, ..., V_m} is the vertex set and ε_V is the set of directed dependency edges; for each directed dependency edge (V_i, V_j, l_ij), l_ij denotes the label of the edge from V_i to V_j and is set to 1.
Input the video graph into the graph convolutional neural network and output the video hidden-layer sequence G_V; each graph-convolution layer is computed as:
H_v^{(l+1)} = f(H_v^{(l)}, A_v) = σ( D̃_v^{-1/2} Ã_v D̃_v^{-1/2} H_v^{(l)} W_v^{(l)} ),
where f(·) denotes one graph-convolution layer; A_v is the adjacency matrix of 𝒢_V, i and j index the nodes of 𝒢_V, and (A_v)_{ij} is the value in the i-th row and j-th column of A_v; D_v is the degree matrix of 𝒢_V, and (D_v)_{ii} is the value in its i-th row and i-th column; so that the model can take each node's own representation into account, the identity matrix I_v is added to obtain Ã_v = A_v + I_v, and the corresponding degree matrix D̃_v is obtained accordingly to facilitate the normalization; l is the layer index of the graph convolutional neural network, H_v^{(l)} is the hidden state of the l-th graph-convolution layer, H_v^{(0)} = V, where V is the original video sequence representation, and W_v^{(l)} is a trainable weight.
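The graph-convolution computation above can be sketched as follows; the ReLU activation and the depth of two stacked layers are assumptions, while the self-loop addition and symmetric normalization follow the reconstructed propagation rule.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: H^(l+1) = act( D~^-1/2 (A + I) D~^-1/2 H^(l) W^(l) ).
    The choice of ReLU as the activation is an assumption of this sketch."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Linear(d_in, d_out, bias=False)    # trainable W^(l)

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        A_tilde = A + torch.eye(A.size(0))                  # add self-loops: A + I
        D_inv_sqrt = torch.diag(A_tilde.sum(dim=1).pow(-0.5))
        A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt           # symmetric normalization
        return torch.relu(A_hat @ self.weight(H))

# Full-link video graph over m clips (every directed edge label l_ij is 1).
m, d = 5, 768
A_v = torch.ones(m, m)
V = torch.randn(m, d)                                       # original clip sequence, H^(0) = V

H = V
for layer in [GCNLayer(d, d), GCNLayer(d, d)]:              # two stacked layers (assumed depth)
    H = layer(H, A_v)
G_V = H                                                     # video hidden-layer sequence
print(G_V.shape)
```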
Step three: input the video hidden-layer sequence G_V and the original video sequence representation V into a linear layer to obtain the fused video representation M_V, which serves as part of the input to the subsequent GPT-2 model; the expression is:
M_V = W_M G_V + W_V V,
where W_M and W_V are trainable weights.
Step four: use the GPT2 Tokenizer to obtain the WordPiece-based word-vector representations C_feature and H_feature of the audio-visual scene title C and the dialog history H. Concatenate the title word vectors C_feature and the dialog-history word vectors H_feature, then add position information T_pos and modality information T_mod to obtain the text sequence representation T; the expressions are:
T_feature = [C_feature, H_feature],
T = T_feature + T_mod + T_pos,
where the position information T_pos uses numbers to indicate the order in which words appear in the title and in each question-answer pair, and the modality information T_mod uses the identifier [cap] to uniformly mark the audio-visual scene title, the identifier [usr1] to mark the questioner and the identifier [usr2] to mark the respondent; each is converted into a vector of fixed dimension in actual computation. A concrete example is given in FIG. 2: the feature layer uses the GPT2 Tokenizer to encode the text "[cap] a woman … [eos] [usr1] is the woman … [eos] …" as T_feature; according to the source of each sentence, the T_mod of the text "[cap] a woman … [eos]" is marked [cap], the T_mod of the text "[usr1] is the woman …" is marked [usr1], and the T_mod of the texts "[usr2] yes she is …" and "[usr2] nothing much … [eos]" is marked [usr2].
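A hedged sketch of step four is shown below, using the Hugging Face GPT-2 tokenizer and embedding tables as stand-ins; the example sentences, the special-token handling and the simplified (constant) modality ids are illustrative assumptions.

```python
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["[cap]", "[usr1]", "[usr2]", "[video]", "[eos]"]})

model = GPT2Model.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

caption = "[cap] a woman is cleaning a fridge [eos]"
history = "[usr1] is the woman eating or drinking anything ? [eos] [usr2] no she is not [eos]"
ids = tokenizer.encode(caption + " " + history, return_tensors="pt")   # (1, L)

# T = T_feature + T_mod + T_pos: token, modality and position embeddings are summed.
T_feature = model.wte(ids)                                   # token embeddings
T_pos = model.wpe(torch.arange(ids.size(1)).unsqueeze(0))    # position embeddings
cap_id = tokenizer.convert_tokens_to_ids("[cap]")
T_mod = model.wte(torch.full_like(ids, cap_id))              # simplified: every token tagged [cap];
T = T_feature + T_mod + T_pos                                # real ids would vary per segment
print(T.shape)
```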
Step five: take each word vector as a vertex and construct a text graph 𝒢_D based on sentence-level dependency relations with the Stanford CoreNLP text analysis tool; a concrete example is given in FIG. 4. For the current text "does the woman eat or drink anything", first use the GPT2 Tokenizer to obtain the word-vector representation of each word and use the Stanford CoreNLP text analysis tool to analyze the syntactic dependencies of the sentence (the links between the word vectors in the figure); with each word vector as a vertex, the graph structure can be modeled according to the dependencies. Input the graph into the graph convolutional neural network and output the text hidden-layer sequence G_D; the expression is:
H_d^{(l+1)} = f(H_d^{(l)}, A_d) = σ( D̃_d^{-1/2} Ã_d D̃_d^{-1/2} H_d^{(l)} W_d^{(l)} ),
where A_d is the adjacency matrix of 𝒢_D and D_d is its degree matrix; so that the model can take each node's own representation into account, the identity matrix I_d is added to obtain Ã_d = A_d + I_d, and the corresponding degree matrix D̃_d is obtained accordingly to facilitate the normalization; l is the layer index of the graph convolutional neural network, H_d^{(l)} is the hidden state of the l-th graph-convolution layer, H_d^{(0)} = T, and W_d^{(l)} is a trainable weight.
Step six: input the text hidden-layer sequence G_D and the original text sequence representation T into a linear layer to obtain the fused text representation N_T, which serves as part of the input to the subsequent GPT-2 model; the expression is:
N_T = W_N G_D + W_D T,
where W_N and W_D are trainable weights.
Step seven: construct the GPT-2 model for multi-modal dialog question answering (see FIG. 2 and FIG. 3 for details). FIG. 2 shows the overall model architecture, which is a multi-layer Transformer model based on the GPT-2 architecture. The model is stacked from 12 Transformer decoder modules with a masked multi-head attention mechanism. To give the model the capability of fusing multi-modal features and generating reasonable answers, the generation-oriented GPT-2 model is modified to a certain extent so that it better fits the requirements of the multi-modal dialog question-answering task. Specifically, the model feeds the result M_V of step three into a linear fully-connected layer, projects the output into the same vector space as the result N_T of step six, concatenates the two to obtain the complete multimodal input representation, and then feeds it into the pre-trained language model GPT-2.
FIG. 3 shows the specific architecture of each Transformer decoder module in the GPT-2 model. The module mainly consists of a masked multi-head self-attention mechanism and a feed-forward neural network. The masked self-attention can detect fine-grained long-term dependencies among the modal inputs, including the spatio-temporal relations of video objects, the coreference relations within the dialog history, local video features, and the reference relations of the text vocabulary, so as to generate reasonable answers that are grounded in the visual and auditory features and conform to the user's question.
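A minimal sketch of feeding the projected multimodal representations into GPT-2 is given below, using the Hugging Face implementation as a stand-in; the projection layer and all dimensions are assumptions.

```python
import torch
from transformers import GPT2LMHeadModel

d = 768
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")

M_V = torch.randn(1, 5, d)        # fused video representation from step three (toy values)
N_T = torch.randn(1, 40, d)       # fused text representation from step six (toy values)

# Project the video part into the GPT-2 embedding space and concatenate with the text part,
# then feed the complete multimodal representation through `inputs_embeds`.
video_proj = torch.nn.Linear(d, gpt2.config.n_embd)
multimodal = torch.cat([video_proj(M_V), N_T], dim=1)

out = gpt2(inputs_embeds=multimodal)
print(out.logits.shape)            # (1, 45, vocab_size): next-token distributions
```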
Step eight: concatenate M_V and N_T to obtain the enhanced multimodal input representation as the complete input of the GPT-2 model for multi-modal dialog question answering constructed in step seven; design loss functions for this model and perform joint training to obtain a trained model, and then generate reasonable answers for a given audio-visual scene and user question.
A negative log-likelihood loss function is used during training so that the model can predict answers based on the audio-visual, title and dialog-history features. Formally, the model generates the answer R_n = {r_1, r_2, ..., r_K} based on the audio-visual features V, the title C, the dialog history H_{<n} and the current question Q_n; by minimizing the negative log-likelihood loss, the probability that the next output word is the corresponding word of the source sequence is maximized:
L(θ) = -E_{(V,C,H,Q,R)~D} [ Σ_j log P(r_j | r_{<j}, V, C, H_{<n}, Q_n; θ) ],
where r_{<j} denotes the first j-1 words of the answer R_n, θ denotes the trainable model parameters, the tuples (V, C, H, Q, R) are sampled from the whole training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation.
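The loss computation can be sketched as follows; the masking convention (ignore index -100 for non-answer positions) and the toy shapes reflect common practice and are assumptions, not details taken from this description.

```python
import torch
import torch.nn.functional as F

vocab, seq_len = 50257, 45
logits = torch.randn(1, seq_len, vocab)          # GPT-2 outputs for the full input sequence
labels = torch.full((1, seq_len), -100)          # -100 = ignore (video, title, history, question)
labels[0, 38:] = torch.randint(0, vocab, (seq_len - 38,))   # answer tokens r_1..r_K

# Standard causal-LM shift: position t predicts token t+1; only answer positions contribute.
shift_logits = logits[:, :-1, :].contiguous()
shift_labels = labels[:, 1:].contiguous()
loss = F.cross_entropy(shift_logits.view(-1, vocab), shift_labels.view(-1),
                       ignore_index=-100)
print(loss.item())
```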
The second embodiment is as follows:
This embodiment is a multi-modal dialog question-answer generation method based on a multi-relationship graph model, and differs from the first embodiment in that:
In step five, take each word vector in the text sequence representation T obtained in step four as a vertex and construct a text graph 𝒢_C based on the full-dialog coreference relations with the Stanford CoreNLP text analysis tool. FIG. 5 shows a concrete example. For the current text "a woman … a fridge … the woman … she … it …", first use the GPT2 Tokenizer to obtain the word-vector representation of each word and use the Stanford CoreNLP text analysis tool to analyze the coreference relations of the text, namely that "a woman", "the woman" and "she" corefer and that "a fridge" and "it" corefer; with each word vector as a vertex, the graph structure can be modeled according to the coreference relations, i.e., edges are built among "a woman", "the woman" and "she", and an edge is built between "a fridge" and "it". Input the graph into the graph convolutional neural network and output the text hidden-layer sequence G_C; the specific calculation formula is the same as the expression in step five of the first embodiment.
In step six, input the text hidden-layer sequence G_C obtained in steps four and five and the original text sequence representation T into a linear layer to obtain the fused text representation N_T, which serves as part of the input to the subsequent GPT-2 model; the expression is:
N_T = W_N G_C + W_D T,
where W_N and W_D are trainable weights.
Other steps and parameters are the same as those in the first embodiment.
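The coreference-graph construction of this embodiment can be sketched as follows; the function assumes that mention clusters have already been produced by a coreference resolver (such as Stanford CoreNLP), and the toy token spans for the example sentence are illustrative only.

```python
import torch

def coreference_adjacency(num_tokens: int, clusters) -> torch.Tensor:
    """Full-dialog coreference graph: every pair of tokens belonging to mentions of the
    same entity cluster is connected. `clusters` is a list of clusters, each a list of
    (start, end) token spans; producing them with a coreference resolver is not shown."""
    A = torch.zeros(num_tokens, num_tokens)
    for cluster in clusters:
        tokens = [t for (s, e) in cluster for t in range(s, e)]
        for i in tokens:
            for j in tokens:
                if i != j:
                    A[i, j] = 1.0
    return A

# "a woman ... a fridge ... the woman ... she ... it": two clusters over toy indices.
clusters = [[(0, 2), (5, 7), (8, 9)],    # {a woman, the woman, she}
            [(3, 5), (10, 11)]]          # {a fridge, it}
A_c = coreference_adjacency(12, clusters)
print(int(A_c.sum()))
```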
The third embodiment is as follows:
This embodiment is a multi-modal dialog question-answer generation method based on a multi-relationship graph model, and differs from the first or second embodiment in that:
In step five, take each word vector in the text sequence representation T obtained in step four as a vertex and, with the Stanford CoreNLP text analysis tool, construct both a text graph 𝒢_D based on sentence-level dependency relations and a text graph 𝒢_C based on full-dialog coreference relations; input the two graphs into graph convolutional neural networks respectively and output the text hidden-layer sequences G_D and G_C; the specific calculation formula is the same as the expression in step five of the first embodiment.
In step six, input the text hidden-layer sequences G_D and G_C obtained in steps four and five and the original text sequence representation T into a linear layer to obtain the fused text representation N_T, which serves as part of the input to the subsequent GPT-2 model; the expression is:
N_T = W_N [G_D, G_C] + W_D T,
where W_N and W_D are trainable weights.
Other steps and parameters are the same as those in the first or second embodiment.
The fourth embodiment is as follows:
This embodiment is a multi-modal dialog question-answer generation method based on a multi-relationship graph model, and differs from one of the first to third embodiments in that:
In step eight, to promote the fusion of information from different modalities, three tasks are introduced for fine-tuning during model training: an answer prediction task (RPT) based on the audio-visual, title and dialog-history features, a title prediction task (CPT) based on the audio-visual features, and a video-text matching task (VTMT). The first three embodiments use only one loss function, i.e., a single-task learning mode; this embodiment designs three loss functions and adopts a multi-task learning mode to strengthen the model's understanding of the different modal information.
The RPT part aims to generate the answer R_n = {r_1, r_2, ..., r_K} based on the audio-visual features V, the title C, the dialog history H_{<n} and the current question Q_n; by minimizing the negative log-likelihood loss, the probability that the next word output by the model is the corresponding word of the source sequence is maximized:
L_RPT(θ) = -E_{(V,C,H,Q,R)~D} [ Σ_j log P(r_j | r_{<j}, V, C, H_{<n}, Q_n; θ) ],
where r_{<j} denotes the first j-1 words of the answer R_n, θ denotes the trainable model parameters, the tuples (V, C, H, Q, R) are sampled from the whole training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation.
The CPT part is similar to the RPT part: for a given audio-visual feature V, the title C = {c_1, c_2, ..., c_L} is generated by minimizing a negative log-likelihood loss:
L_CPT(θ) = -E_{(V,C)~D} [ Σ_{i=1}^{L} log P(c_i | c_{<i}, V; θ) ],
where c_{<i} denotes the first i-1 words of the title C.
The VTMT part aims to determine whether the given audio-visual features V match the given text features (including the title C, the dialog history H_{<n}, the current question Q_n and the generated response R_n), so that the pre-trained language model can be successfully applied to dialog-domain tasks through fine-tuning. Specifically, the task selects about 15% of the training data and randomly replaces the corresponding original input with incorrect audio-visual features; the final output of the hidden state of the GPT-2 module is passed through a linear fully-connected layer to obtain the matching probability, and the loss is then computed with a binary cross entropy to strengthen the system's understanding of the scene; the calculation formula is:
L_VTMT(θ) = -E_{(X,Y)~D} [ Y log P(Y | X; θ) + (1 - Y) log(1 - P(Y | X; θ)) ],
where X = (V, C, H, Q, R) and Y is the label indicating whether the audio-visual features and the text features match.
Other steps and parameters are the same as those in one of the first to third embodiments.
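A hedged sketch of the joint objective is given below; summing the three losses with equal weights and the toy shapes are assumptions, since the description does not state how the losses are combined, and in practice non-answer positions would additionally be masked out of the generation losses.

```python
import torch
import torch.nn.functional as F

vocab = 50257

def nll(logits, labels):
    """Shared shifted causal-LM loss used by both RPT and CPT."""
    return F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                           labels[:, 1:].reshape(-1), ignore_index=-100)

logits_r = torch.randn(2, 45, vocab); labels_r = torch.randint(0, vocab, (2, 45))  # answers
logits_c = torch.randn(2, 20, vocab); labels_c = torch.randint(0, vocab, (2, 20))  # titles

# VTMT: a linear head on the final hidden state predicts whether the audio-visual features
# match the text; roughly 15% of the examples would get mismatched (negative) video input.
hidden = torch.randn(2, 768)
match_head = torch.nn.Linear(768, 1)
match_labels = torch.tensor([1.0, 0.0])            # 1 = matched, 0 = replaced video
loss_vtmt = F.binary_cross_entropy_with_logits(match_head(hidden).squeeze(-1), match_labels)

loss = nll(logits_r, labels_r) + nll(logits_c, labels_c) + loss_vtmt
print(loss.item())
```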
The following examples were used to demonstrate the beneficial effects of the present invention:
the first embodiment is as follows:
the data set selects an audio-visual scene perception dialogue data set of a seventh dialogue System Technology challenge race (DSTC7) issued by Hori et al in ICASSP2019 to evaluate The System performance, and in order to ensure The fairness and rationality for measuring performance differences among different models, the dividing mode of The data set is consistent with The task setting in The challenge race. The data set size and partitioning is shown in table 1.
TABLE 1 DSTC7-AVSD data set overview
The evaluation metrics are those commonly used in natural language generation tasks, including BLEU, METEOR, ROUGE-L and CIDEr; they measure the semantic similarity and language fluency between the predicted answers and the reference answers from different angles, and thus reflect system performance scientifically.
The experimental parameter settings are shown in Table 2. Specifically, during encoding, the learning rate of the Adam optimizer is set to 6.25e-5, at most 3 rounds of dialog history are used, the hidden state of the Transformer module is 768-dimensional, and the batch size is 8. During decoding, a beam-search algorithm is adopted with a beam width of 5, a maximum sentence length of 20, and a length penalty of 0.3.
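The decoding settings can be reproduced with the Hugging Face generate API as sketched below; plain GPT-2 is used as a stand-in for the trained multimodal model, and interpreting the maximum sentence length of 20 as the number of newly generated tokens is an assumption.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = tokenizer.encode("is the woman eating or drinking anything ?", return_tensors="pt")
answer_ids = model.generate(prompt,
                            num_beams=5,           # beam width 5
                            max_new_tokens=20,     # maximum answer length 20
                            length_penalty=0.3,    # length penalty 0.3
                            early_stopping=True,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```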
Table 2 experimental parameter settings
Table 3 compares the responses generated by the baseline model and by the present invention on DSTC7-AVSD. It can be seen that in this example the "television" mentioned by the questioner does not appear in the title or the dialog history, so the system needs to incorporate the audio-visual information and make simple inferences to answer the question correctly. The baseline model's answer fully indicates that it does not really understand the intent of the questioner and lacks reasoning ability; for questions whose specific information cannot be found in the title, summary or dialog history, it cannot give correct answers and may even give irrelevant ones.
Compared with the baseline model, the present method lets the given video and text fully interact, thereby capturing the hidden complex dependencies between the inputs of different modalities, extracting richer feature representations, and generating high-quality, natural answers based on reasoning.
TABLE 3 DSTC7-AVSD samples generated by VGPT model
In order to objectively and comprehensively verify the effectiveness of the invention, comparisons with the relevant baseline methods are made on the DSTC7-AVSD dataset; the specific results are shown in Table 4, where the best result for each metric is shown in bold:
(1) The Naive Fusion model proposed by Hori et al. in ICASSP2019 is the multimodal baseline method provided by the DSTC7 organizers; it uses LSTM models with question guidance to extract video and audio features respectively, uses hierarchical LSTMs to encode the dialog history, and finally combines all modalities through a projection matrix to generate answers.
(2) The Hierarchical Attention model (HA) proposed by Sanabria et al. in AAAI2019 introduces transfer learning from the video summarization task to obtain more visual details, and won first place in the DSTC7-AVSD challenge.
(3) The Multimodal Transformer Network (MTN) proposed by Le et al. in ACL2019 was the state-of-the-art system before the DSTC8-AVSD challenge; it uses a Transformer-based auto-encoding module to attend to visual features under question guidance.
(4) The Universal Multimodal Transformer (UMT) proposed by Li et al. in TASLP2021 is currently the most advanced dialog question-answering system for this task; it introduces a pre-trained GPT-2 model to learn a fused representation of the audio-visual scene in a multi-task learning manner.
TABLE 4 Objective assessment results based on DSTC7-AVSD dataset
The experimental results show that, using the third embodiment, the invention outperforms existing methods on almost all automatic metrics of the DSTC7-AVSD test set, and improves over UMT, the current most advanced model for this task, by 1% on metrics such as BLEU-2 and CIDEr. This shows that introducing multi-relationship graph structure encoding enables the dialog system to generate higher-quality answers and significantly improves model performance. Owing to the structural characteristics of the graph convolutional neural network, various syntactic and semantic relations among all words can be represented within one framework. Compared with a multilayer perceptron (MLP), the information of neighboring nodes can be considered comprehensively when computing the representation of the current node, and nodes connected over long distances can be reached by stacking multiple graph-convolution layers, which enlarges the receptive field.
Example two:
the data set selects The audio-visual scene perception dialogue data set of The eighth dialogue System Technology challenge (DSTC8) issued by Kim et al in TASLP2021 to evaluate The System performance, and The dividing mode of The data set is consistent with The task setting in The challenge in order to ensure The fairness and rationality for measuring The performance difference among different models.
The data set size and partitioning is shown in table 5.
TABLE 5 DSTC8-AVSD data set overview
The experimental parameter settings are consistent with Table 2.
In order to objectively and comprehensively verify the effectiveness of the invention, comparisons with the relevant baseline methods are made on the DSTC8-AVSD dataset; the specific results are shown in Table 6, where the best result for each metric is shown in bold:
(1) The Multi-step Joint-Modality Attention Network (JMAN) proposed by Chu et al. in arXiv2020 designs a model architecture based on a recurrent neural network and applies a multi-step attention mechanism, considering both visual and textual representations in each reasoning step so as to better integrate the information of the two modalities.
(2) Compared with the conventional Transformer architecture, the Multimodal Semantic Transformer Network (MSTN) proposed by Lee et al. in arXiv2020 adds an attention-based word-embedding layer so that the model can better take word meanings into account in the generation stage.
TABLE 6 Objective assessment results based on DSTC8-AVSD dataset
The experimental results show that, using the first embodiment, the invention outperforms existing models on almost all automatic metrics of the DSTC8-AVSD test set. The improvement on the CIDEr metric, which reflects the naturalness of sentences, is especially obvious, an increase of 0.012 (1.240 vs. 1.252); this shows that the local dependency relations and the global coreference relations encode textual information from different angles that reflect the functional similarity of the text, thereby improving on existing models.
The present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof, and it is therefore intended that all such changes and modifications be considered as within the spirit and scope of the appended claims.

Claims (10)

1. A multi-modal dialog question-answer generating method based on a multi-relationship graph model is characterized by comprising the following steps:
S1. Using a fixed-size sliding window to serially segment a video into a plurality of video clips, and acquiring for each clip its color feature V_t^{rgb}, optical flow feature V_t^{flow} and audio feature V_t^{audio}; concatenating the color feature V_t^{rgb}, the optical flow feature V_t^{flow} and the audio feature V_t^{audio} to obtain V_t^{feature}, then adding position information V_t^{pos} and modality information V_t^{mod} to obtain the sequence representation V_t of each video clip; the expressions are:
V_t^{feature} = [V_t^{rgb}, V_t^{flow}, V_t^{audio}],
V_t = V_t^{feature} + V_t^{mod} + V_t^{pos},
wherein the position information V_t^{pos} uses numbers to indicate the order in which each video clip appears, the modality information V_t^{mod} uses the identifier [video] to uniformly mark video features, and both are converted into vectors of fixed dimension in actual computation;
S2. For the audio-visual scene representation V = (V_1, V_2, ..., V_m), wherein V_1, V_2, ..., V_m are the sequence representations of the video clips, taking each video clip as a vertex and constructing a video graph 𝒢_V = (ν_V, ε_V) based on the full-link relation, wherein ν_V = {V_1, V_2, ..., V_m} is the vertex set and ε_V is the set of directed dependency edges; for each directed dependency edge (V_i, V_j, l_ij), l_ij denotes the label of the edge from V_i to V_j and is set to 1;
inputting the video graph into a graph convolutional neural network and outputting the video hidden-layer sequence G_V;
S3. Inputting the video hidden-layer sequence G_V and the original video sequence representation V into a linear layer to obtain the fused video representation M_V, which serves as part of the input to a subsequent multi-layer Transformer model based on the GPT-2 architecture;
S4. Obtaining the word-vector representations C_feature and H_feature of the audio-visual scene title C and the dialog history H; concatenating the title word vectors C_feature and the dialog-history word vectors H_feature, then adding position information T_pos and modality information T_mod to obtain the text sequence representation T; the expressions are:
T_feature = [C_feature, H_feature],
T = T_feature + T_mod + T_pos,
wherein the position information T_pos uses numbers to indicate the order in which words appear in the title and in each question-answer pair, and the modality information T_mod uses the identifier [cap] to uniformly mark the audio-visual scene title, the identifier [usr1] to mark the questioner and the identifier [usr2] to mark the respondent; both are converted into vectors of fixed dimension in actual computation;
S5. Taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing a graph structure 𝒢_D based on sentence-level dependency relations and/or a graph structure 𝒢_C based on full-dialog coreference relations, and then inputting the sentence-level dependency graph 𝒢_D and/or the full-dialog coreference graph 𝒢_C into graph convolutional neural networks respectively to obtain the corresponding text hidden-layer sequences;
S6. Inputting the text hidden-layer sequences corresponding to the sentence-level dependency graph 𝒢_D and/or the full-dialog coreference graph 𝒢_C, together with the original text sequence representation T, into a linear layer to obtain the fused text representation N_T, which serves as part of the input to the subsequent multi-layer Transformer model based on the GPT-2 architecture;
S7. Concatenating M_V and N_T to obtain the enhanced multimodal input, and generating an answer with the multi-layer Transformer model based on the GPT-2 architecture.
2. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 1, wherein, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing the graph structure based on sentence-level dependency relations 𝒢_D and obtaining the corresponding text hidden-layer sequence comprises the following steps:

firstly, the GPT2 Tokenizer is used to obtain the word vector representation corresponding to each word, the Stanford CoreNLP text analysis tool is used to parse the syntactic dependency relations of the sentences, each word vector is taken as a vertex, and the graph structure is modelled according to the syntactic dependencies; the graph is then input into the graph convolutional neural network, and the text hidden-layer sequence G_D is output.
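A possible implementation sketch of claim 2 is shown below; it uses the stanza library as a Python stand-in for the Stanford CoreNLP dependency parser named in the claim, and omits the alignment between parser tokens and GPT-2 BPE sub-tokens. The example sentence and variable names are illustrative.

```python
# Sketch of building the sentence-level dependency graph (claim 2); stanza stands in
# for Stanford CoreNLP, and token/BPE alignment is omitted for brevity.
import torch
import stanza

stanza.download("en")                                # one-time model download
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")
doc = nlp("What is the man holding in his right hand?")

words = [w for s in doc.sentences for w in s.words]
A_dep = torch.zeros(len(words), len(words))

offset = 0
for sent in doc.sentences:
    for w in sent.words:
        if w.head > 0:                               # head == 0 marks the syntactic root
            head_idx = offset + w.head - 1           # 1-based head index -> 0-based position
            dep_idx = offset + w.id - 1
            A_dep[head_idx, dep_idx] = 1.0           # directed edge: head -> dependent
    offset += len(sent.words)

# A_dep, together with the corresponding word vectors, is then passed through the
# graph convolutional network to obtain the text hidden-layer sequence G_D.
```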
3. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 1, wherein, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing the graph structure based on full-dialogue coreference relations 𝒢_C and obtaining the corresponding text hidden-layer sequence comprises the following steps:

firstly, the GPT2 Tokenizer is used to obtain the word vector representation corresponding to each word, the Stanford CoreNLP text analysis tool is used to resolve the coreference relations of the sentences, each word vector is taken as a vertex, and the graph structure is modelled according to the coreference relations; the graph is then input into the graph convolutional neural network, and the text hidden-layer sequence G_C is output.
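For claim 3, a sketch of turning coreference chains into a graph follows; the mention clusters are assumed to have already been produced by the Stanford CoreNLP coreference annotator and reduced to word positions, so the concrete indices here are hypothetical.

```python
# Sketch of the full-dialogue coreference graph (claim 3); the clusters below are
# hypothetical word positions that would come from Stanford CoreNLP's coref annotator.
import torch
from itertools import combinations

n_words = 30                                         # length of the tokenized dialogue history
coref_clusters = [[2, 11, 19], [5, 24]]              # each cluster = positions of co-referring mentions

A_coref = torch.zeros(n_words, n_words)
for cluster in coref_clusters:
    for i, j in combinations(cluster, 2):            # link every pair of mentions in a chain
        A_coref[i, j] = A_coref[j, i] = 1.0

# A_coref and the word vectors are then fed to a graph convolutional network to
# obtain the text hidden-layer sequence G_C.
```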
4. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 1, wherein, in S5, the process of taking each word vector in the text sequence representation T obtained in S4 as a vertex, constructing both the graph structure based on sentence-level dependency relations 𝒢_D and the graph structure based on full-dialogue coreference relations 𝒢_C and obtaining the corresponding text hidden-layer sequences comprises the following steps:

firstly, the GPT2 Tokenizer is used to obtain the word vector representation corresponding to each word, and the Stanford CoreNLP text analysis tool is used to parse the syntactic dependency relations and the coreference relations of the sentences respectively; each word vector is taken as a vertex, and the graph structure based on sentence-level dependency relations 𝒢_D and the graph structure based on full-dialogue coreference relations 𝒢_C are constructed respectively; the two graph structures are then respectively input into graph convolutional neural networks, and the text hidden-layer sequences G_D and G_C are output.
5. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 1, 2, 3 or 4, wherein the expression computed by each layer of the graph convolutional neural network in S5 is as follows:

$H_d^{(l+1)} = f(H_d^{(l)}, A_d) = \sigma(\tilde{D}_d^{-1/2}\,\tilde{A}_d\,\tilde{D}_d^{-1/2}\,H_d^{(l)}\,W_d^{(l)})$,

wherein f(H_d^{(l)}, A_d) represents one layer of graph convolution and σ(·) is the activation function; for 𝒢_D or 𝒢_C, A_d denotes the corresponding adjacency matrix and D_d the corresponding degree matrix; the identity matrix I_d is added to obtain $\tilde{A}_d = A_d + I_d$, and the corresponding degree matrix $\tilde{D}_d$ is obtained accordingly so as to facilitate the normalization operation; l is the layer index of the graph convolutional neural network, H_d^{(l)} is the hidden state of the l-th layer, with the text sequence representation T as the initial input H_d^{(0)}, and W_d^{(l)} are the trainable weights.
6. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 5, wherein the expression computed by each layer of the graph convolutional neural network in S2 is as follows:

$H_v^{(l+1)} = f(H_v^{(l)}, A_v) = \sigma(\tilde{D}_v^{-1/2}\,\tilde{A}_v\,\tilde{D}_v^{-1/2}\,H_v^{(l)}\,W_v^{(l)})$,

wherein f(H_v^{(l)}, A_v) represents one layer of graph convolution; A_v is the adjacency matrix of the video graph 𝒢_V, i and j denote the i-th and j-th nodes of 𝒢_V, and (A_v)_{ij} is the value in the i-th row and j-th column of A_v; D_v is the degree matrix of A_v, and (D_v)_{ii} is the value in the i-th row and i-th column of D_v; in order to enable the model to take each node's own representation into account, the identity matrix I_v is added to obtain $\tilde{A}_v = A_v + I_v$, and the corresponding degree matrix $\tilde{D}_v$ is obtained accordingly so as to facilitate the normalization operation; l is the layer index of the graph convolutional neural network, H_v^{(l)} is the hidden state of the l-th layer, H_v^{(0)} = V, where V is the representation of the original video sequence, and W_v^{(l)} are the trainable weights.
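The layer computation of claims 5 and 6 can be sketched as a single PyTorch module, shown below under the assumption that σ is a ReLU; the same code serves the text graphs (adjacency A_d, initial state T) and the video graph (adjacency A_v, initial state V).

```python
# Sketch of one graph-convolution layer, H^(l+1) = sigma(D~^-1/2 (A+I) D~^-1/2 H^(l) W^(l)),
# assuming sigma = ReLU; usable for the text graphs (claim 5) and the video graph (claim 6).
import torch
import torch.nn as nn


class GraphConvLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.weight = nn.Linear(dim, dim, bias=False)           # trainable W^(l)

    def forward(self, h, adj):
        a_tilde = adj + torch.eye(adj.size(0))                  # A~ = A + I, so each node sees itself
        d_inv_sqrt = torch.diag(a_tilde.sum(dim=1).pow(-0.5))   # D~^{-1/2} from the degree matrix
        a_norm = d_inv_sqrt @ a_tilde @ d_inv_sqrt              # symmetric normalization
        return torch.relu(a_norm @ self.weight(h))              # sigma(...) with sigma = ReLU


# Example: two layers over the full-link video graph, starting from H^(0) = V.
m, dim = 8, 768
V, A_v = torch.randn(m, dim), torch.ones(m, m)
h = V
for layer in (GraphConvLayer(dim), GraphConvLayer(dim)):
    h = layer(h, A_v)
G_V = h                                                         # video hidden-layer sequence
```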
7. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 6, wherein, in S3, the fused video representation V_fuse is further input into a linear fully-connected layer, the output of which is projected into the same vector space as the fused text representation T_fuse obtained in S6; the two are concatenated to obtain the complete multi-modal input representation, which is then input into the pre-trained language model GPT-2.
8. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 7, wherein the multi-layer Transformer model based on the GPT-2 architecture in S7 is formed by stacking 12 Transformer decoder modules with a masked multi-head attention mechanism.
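A sketch of claims 7 and 8 follows, assuming the HuggingFace Transformers library: the fused video representation is projected through a linear fully-connected layer into the text vector space, concatenated with the fused text representation, and passed to a 12-layer GPT-2 decoder through the inputs_embeds argument; all tensor shapes are illustrative.

```python
# Sketch of claims 7-8: project V_fuse into the space of T_fuse, concatenate, and feed
# the result to a 12-layer GPT-2 model; shapes and the random tensors are placeholders.
import torch
import torch.nn as nn
from transformers import GPT2Config, GPT2LMHeadModel

dim = 768
V_fuse = torch.randn(1, 8, dim)                  # fused video representation (batch, clips, dim)
T_fuse = torch.randn(1, 20, dim)                 # fused text representation (batch, tokens, dim)

project = nn.Linear(dim, dim)                    # linear fully-connected projection layer
V_proj = project(V_fuse)                         # video features mapped into the text space

inputs = torch.cat([V_proj, T_fuse], dim=1)      # complete multi-modal input representation

config = GPT2Config(n_layer=12, n_head=12, n_embd=dim)   # 12 masked multi-head attention decoder blocks
model = GPT2LMHeadModel(config)                  # or GPT2LMHeadModel.from_pretrained("gpt2") for pre-trained weights
outputs = model(inputs_embeds=inputs)            # next-word logits at every position
```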
9. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 8, wherein the multi-layer Transformer model based on the GPT-2 architecture is trained with a negative log-likelihood loss function, and the training process comprises the following steps:

generating the answer R_n = (r_1, r_2, ..., r_K) based on the audio-video features V, the title C, the dialogue history H_{<n} and the current question Q_n, and maximizing the probability that the next word output by the model is the corresponding word of the reference sequence by minimizing the negative log-likelihood loss function:

$L(\theta) = -E_{(V,C,H,Q,R)\sim D}\big[\textstyle\sum_{j} \log P(r_j \mid r_{<j}, V, C, H_{<n}, Q_n; \theta)\big]$,

wherein r_{<j} represents the first j-1 words of the answer R_n, θ refers to the trainable model parameters, the tuple (V, C, H, Q, R) is sampled from the entire training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation.
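The loss of claim 9 can be computed with the standard language-modelling loss of the Transformers library, as sketched below; the video features are omitted and the context/answer strings are placeholders, so this only illustrates how the negative log-likelihood is restricted to the answer tokens.

```python
# Sketch of the negative log-likelihood objective of claim 9 (text inputs only; the
# strings and the -100 masking of context positions are illustrative).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

context = "caption , dialogue history ... current question ?"   # stands for (V, C, H_<n, Q_n)
answer = " the man is holding a cup ."                          # reference answer R_n

ctx_ids = tokenizer(context, return_tensors="pt").input_ids
ans_ids = tokenizer(answer, return_tensors="pt").input_ids
input_ids = torch.cat([ctx_ids, ans_ids], dim=1)

labels = input_ids.clone()
labels[:, : ctx_ids.size(1)] = -100              # positions set to -100 are ignored by the loss

loss = model(input_ids=input_ids, labels=labels).loss   # mean -log P(r_j | r_<j, context; theta)
loss.backward()                                         # gradients w.r.t. the trainable parameters theta
```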
10. The multi-modal dialog question-answer generation method based on the multi-relationship graph model according to claim 8, wherein, during training, the multi-layer Transformer model based on the GPT-2 architecture is jointly trained on an answer prediction task RPT, an audio-video title prediction task CPT and an audio-video-text matching task VTMT based on the audio-video, title and dialogue history features, and the training process comprises the following steps:
the RPT part aims to generate the answer R_n based on the audio-video features V, the title C, the dialogue history H_{<n} and the current question Q_n, and maximizes the probability that the next word output by the model is the corresponding word of the reference sequence by minimizing the negative log-likelihood loss function:

$L_{RPT}(\theta) = -E_{(V,C,H,Q,R)\sim D}\big[\textstyle\sum_{j} \log P(r_j \mid r_{<j}, V, C, H_{<n}, Q_n; \theta)\big]$,

wherein r_{<j} represents the first j-1 words of the answer R_n, θ refers to the trainable model parameters, the tuple (V, C, H, Q, R) is sampled from the entire training set D, and E_{(V,C,H,Q,R)~D} denotes the expectation;
the CPT part is similar to the RPT part: for given audio-video features V, the title C = {c_1, c_2, ..., c_L} is generated by minimizing the negative log-likelihood loss function, which is as follows:

$L_{CPT}(\theta) = -E_{(V,C)\sim D}\big[\textstyle\sum_{i} \log P(c_i \mid c_{<i}, V; \theta)\big]$,

wherein c_{<i} represents the first i-1 words of the title C;
the VTMT part aims to determine whether the given audio-video features V match the given text features, the given text features comprising the title C, the dialogue history H_{<n}, the current question Q_n and the generated answer R_n; a certain proportion of the training data is selected, and the corresponding correct original audio-video input is replaced with randomly sampled audio-video features; the probability of whether the two match is obtained by passing the final hidden-state output of the GPT-2 module through a linear fully-connected layer, and the loss function is then computed with the binary cross-entropy:

$L_{VTMT}(\theta) = -E_{(X,Y)\sim D}\big[Y \log P(Y=1 \mid X; \theta) + (1-Y)\log\big(1 - P(Y=1 \mid X; \theta)\big)\big]$,

wherein X = (V, C, H, Q, R), and Y is the label characterizing whether the audio-video features and the text features match.
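A sketch of how the three objectives of claim 10 could be combined in one training step is given below; the loss values, hidden states, matching labels and the use of the last position's hidden state for the matching head are all assumptions made for illustration.

```python
# Sketch of the joint objective of claim 10: L = L_RPT + L_CPT + L_VTMT. The LM losses
# and hidden states are placeholders; in a real step they come from GPT-2 forward passes,
# and the 0-labelled samples have had their video features replaced by random ones.
import torch
import torch.nn as nn

dim = 768
match_head = nn.Linear(dim, 1)                     # linear fully-connected matching layer

loss_rpt = torch.tensor(2.1)                       # answer-prediction LM loss (as in claim 9)
loss_cpt = torch.tensor(1.7)                       # title-prediction LM loss
last_hidden = torch.randn(4, 30, dim)              # GPT-2 hidden states for a batch of 4
match_labels = torch.tensor([1.0, 1.0, 0.0, 1.0])  # 0 = video features were randomly swapped

logits = match_head(last_hidden[:, -1, :]).squeeze(-1)   # match / mismatch score per sample
loss_vtmt = nn.functional.binary_cross_entropy_with_logits(logits, match_labels)

total_loss = loss_rpt + loss_cpt + loss_vtmt       # joint loss optimized over theta
```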
CN202211451009.5A 2022-11-18 2022-11-18 Multi-modal dialog question-answer generation method based on multi-relationship graph model Pending CN115712709A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211451009.5A CN115712709A (en) 2022-11-18 2022-11-18 Multi-modal dialog question-answer generation method based on multi-relationship graph model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211451009.5A CN115712709A (en) 2022-11-18 2022-11-18 Multi-modal dialog question-answer generation method based on multi-relationship graph model

Publications (1)

Publication Number Publication Date
CN115712709A true CN115712709A (en) 2023-02-24

Family

ID=85233794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211451009.5A Pending CN115712709A (en) 2022-11-18 2022-11-18 Multi-modal dialog question-answer generation method based on multi-relationship graph model

Country Status (1)

Country Link
CN (1) CN115712709A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116108206A (en) * 2023-04-13 2023-05-12 中南大学 Combined extraction method of financial data entity relationship and related equipment
CN116757460A (en) * 2023-08-23 2023-09-15 南京争锋信息科技有限公司 Emergency command scheduling platform construction method and system based on deep learning
CN116757460B (en) * 2023-08-23 2024-01-09 南京争锋信息科技有限公司 Emergency command scheduling platform construction method and system based on deep learning
CN117708307A (en) * 2024-02-06 2024-03-15 西北工业大学 Method and device for fusing micro-tuning and Adapter of large language model
CN117708307B (en) * 2024-02-06 2024-05-14 西北工业大学 Method and device for fusing micro-tuning and Adapter of large language model

Similar Documents

Publication Publication Date Title
Uppal et al. Multimodal research in vision and language: A review of current and emerging trends
Gao et al. Hierarchical representation network with auxiliary tasks for video captioning and video question answering
CN115712709A (en) Multi-modal dialog question-answer generation method based on multi-relationship graph model
US20220398486A1 (en) Learning content recommendation system based on artificial intelligence learning and operating method thereof
Liu et al. Cross-attentional spatio-temporal semantic graph networks for video question answering
CN112069781B (en) Comment generation method and device, terminal equipment and storage medium
CN114020891A (en) Double-channel semantic positioning multi-granularity attention mutual enhancement video question-answering method and system
Xiao et al. Exploring diverse and fine-grained caption for video by incorporating convolutional architecture into LSTM-based model
CN113127623A (en) Knowledge base problem generation method based on hybrid expert model and joint learning
CN115391511A (en) Video question-answering method, device, system and storage medium
CN110309360A (en) A kind of the topic label personalized recommendation method and system of short-sighted frequency
Jhunjhunwala et al. Multi-action dialog policy learning with interactive human teaching
CN111741236A (en) Method and device for generating positioning natural image subtitles based on consensus diagram characteristic reasoning
Varghese et al. Towards participatory video 2.0
CN115687638A (en) Entity relation combined extraction method and system based on triple forest
CN115659279A (en) Multi-mode data fusion method based on image-text interaction
Nagao Artificial intelligence accelerates human learning: Discussion data analytics
Wang et al. SCANET: Improving multimodal representation and fusion with sparse‐and cross‐attention for multimodal sentiment analysis
CN116977701A (en) Video classification model training method, video classification method and device
Wang et al. How to make a BLT sandwich? learning to reason towards understanding web instructional videos
CN113780209B (en) Attention mechanism-based human face attribute editing method
CN115273856A (en) Voice recognition method and device, electronic equipment and storage medium
Wang et al. What is the competence boundary of Algorithms? An institutional perspective on AI-based video generation
Dean Altering screenwriting frameworks through practice-based research: a methodological approach
CN115422329A (en) Knowledge-driven multi-channel screening fusion dialogue generation method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination