CN113177112A - Neural network visual dialogue model and method based on KR-product fusion of multi-modal information - Google Patents

Neural network visual dialogue model and method based on KR-product fusion of multi-modal information

Info

Publication number
CN113177112A
Authority
CN
China
Prior art keywords
information
fusion
visual
feature
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110448254.XA
Other languages
Chinese (zh)
Other versions
CN113177112B (en)
Inventor
骆克
张鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University
Priority to CN202110448254.XA
Publication of CN113177112A
Application granted
Publication of CN113177112B
Expired - Fee Related

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a neural network visual dialogue model and method based on KR-product fusion of multi-modal information. The modal feature extraction module extracts the features of the question text and of the history information through an LSTM network, extracts the entity features of the picture with a Faster R-CNN network, and uses an attention mechanism to extract the visual features related to the question; the different-modality information fusion module captures the feature information within each modality with a late fusion method, captures the association information between different modalities with a KR-product-based feature fusion method, and fuses the intra-modality information with the inter-modality information; the candidate answer prediction module predicts the answer with a fusion vector that combines the intra-modality and inter-modality information, so that the relevant answer can be found more accurately. The method overcomes the shortcoming of traditional visual dialogue models, in which late fusion does not adequately capture the association information between different modalities.

Description

Neural network visual dialogue model and method based on KR-product fusion of multi-modal information
Technical Field
The invention relates to the technical field of visual dialogue and multi-modal fusion, and in particular to a model and a method for identifying the true answer among candidate answers given a picture, historical dialogue information and a corresponding question.
Background
Visual dialogue is a challenging task at the intersection of language and vision: it requires considering the history of multiple dialogue rounds together with the relevant information in the image to find the best candidate answer to the current question. Visual dialogue arises in many application scenarios, such as helping the blind understand their surroundings, interactive search and indoor navigation. In the visual dialogue task, in order to capture the information relevant to the answer, a model needs to understand the question, capture the visual and history information related to the question, and capture the latent association information among the question, the vision and the history. With the rise of neural networks, early visual dialogue work encoded the question, the picture and the history information with three encoder methods, LF [1], HRE [1] and MN [1]; the visual and history features extracted in this way are not strongly related to the question, so the answers predicted after fusing the question vector, the visual vector and the history vector are not accurate enough. To predict answers more accurately, some methods use an attention mechanism to obtain the visual and history features related to the question, such as Bottom-Up [2], Attention Memory [3], RvA [4] and sequential co-attention [5]. Bottom-Up proposes a combined bottom-up and top-down attention mechanism; for the visual dialogue task, the entity features in the image are obtained through a Faster R-CNN network and the attention mechanism is used to obtain the visual features related to the question. Attention Memory proposes an attention memory mechanism that uses the visual attention distributions of past dialogue rounds to compute the visual attention distribution related to the current question, so as to better select the visual content related to the question. RvA proposes a recursive visual attention mechanism that gradually refines the visual attention distribution related to the current question using the relevant visual information in the dialogue history. Sequential co-attention proposes a sequential co-attention mechanism in which the information of two modalities is used to compute the attention scores of the third, selectively focusing on the image and the content of the dialogue history. These methods perform late fusion on the acquired visual features, history features and question features and predict the candidate answers with the fused vector. However, after the features of the different modalities are extracted separately, fusing them by late fusion cannot capture the association information between the different modalities well, so the model is not accurate enough when predicting candidate answers. To obtain more useful information, the way in which the information of the different modalities is fused needs to be improved.
To achieve a better-performing visual dialogue model, the main challenge at present is as follows: the visual dialogue task needs to model the image content and the dialogue history and to obtain the information useful for answer prediction from the question vector, the visual vector and the history vector. Answer prediction needs to attend not only to the feature information within each modality but also to the association information between the modalities; capturing both the intra-modality feature information and the inter-modality association information allows the answer to be predicted better. Therefore, how to better capture the association information between the different modalities is the problem that the visual dialogue model needs to solve.
References:
[1] Das, A., Kottur, S., et al. "Visual Dialog." IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[2] Anderson, Peter, et al. "Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[3] Sigal, Leonid, et al. "Visual Reference Resolution Using Attention Memory for Visual Dialog." Advances in Neural Information Processing Systems, vol. 30, 2018, pp. 3719–3729.
[4] Niu, Yulei, et al. "Recursive Visual Attention in Visual Dialog." IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6679–6688.
[5] Wu, Qi, et al. "Are You Talking to Me? Reasoned Visual Dialog Generation Through Adversarial Learning." IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6106–6115.
Disclosure of the Invention
The invention aims to overcome the above deficiencies of the prior art and to provide a neural network visual dialogue model and method based on KR-product fusion of multi-modal information.
The purpose of the invention is achieved through the following technical scheme:
a neural network visual dialogue model based on KR product fusion multi-modal information comprises a modal feature extraction module, a different modal information fusion module and a candidate answer prediction module;
the modal feature extraction module extracts semantic features of the problem, visual features of the image and historical features of historical conversation; finally, obtaining problem characteristics, visual characteristics and historical characteristics;
the different-modality information fusion module is used for fusing the problem features, the visual features and the historical features to obtain a final fusion vector, and comprehensively considering the information of different modalities, so that the information related to the candidate answers can be conveniently captured; the method specifically comprises three steps of obtaining characteristic information in different modes, extracting correlation information among the different modes and fusing information between the modes;
the candidate answer prediction module is used for predicting candidate answers, the input of the candidate answer prediction module is fusion vectors obtained by the different-mode information fusion module, and then inner product calculation is carried out on the fusion vectors and the feature vectors of the candidate answers to obtain scores corresponding to the candidate answers.
Furthermore, the modal feature extraction module segments the question text into words, uses the GloVe tool to learn the word2vec-style embedded representation of the words, where each word is represented by a dense vector, and then uses an LSTM model to obtain the semantic features of the question, namely the question features;
the entity features in the image are extracted with a pre-trained Faster R-CNN network, where each entity is represented by a dense vector; the attention distribution related to the question is obtained through an attention mechanism, and the entity features are weighted and summed to obtain the visual features of the image;
the historical dialogue information is concatenated and segmented into words, the GloVe tool is used to learn the embedded representation of the words, and an LSTM model is used to obtain the history features of the dialogue history.
Further, in the method for obtaining the feature information within the different modalities, the question features, visual features and history features are first concatenated, then fused through a fully connected layer, and finally the feature information within the different modalities is obtained;
the method for extracting the association information between the different modalities obtains two interaction matrices through two KR-product calculations: in the first KR-product calculation, the question features, visual features and history features are concatenated along dimension 0 to generate a feature matrix, and the KR product of this feature matrix is computed column by column to generate a third-order tensor; a Mask calculation is applied to the third-order tensor, and its information is captured through a convolution operation to obtain an interacted feature matrix; in the second KR-product calculation, the KR product of the interacted feature matrix and the original feature matrix is computed over the corresponding columns, and a convolution operation is applied to the resulting third-order tensor to generate a second interacted feature matrix; the two interaction matrices are then compressed and concatenated to obtain the final features, which contain the association information between the modalities;
finally, the intra-modality feature information obtained by late fusion and the inter-modality association information obtained by KR-product fusion are concatenated to obtain the final fusion vector, which contains both the intra-modality feature information and the inter-modality association information.
Furthermore, the candidate answer prediction module predicts a number of candidate answers to obtain the score of each candidate answer, computes the loss by comparison with the true answer, and then continuously updates the parameters of the neural network visual dialogue model with a back-propagation algorithm, finally obtaining the score corresponding to each candidate answer.
The invention also provides a neural network visual dialogue method based on KR-product fusion of multi-modal information, which is based on the above neural network visual dialogue model and comprises the following steps: capturing the feature information within the different modalities through a late fusion method, capturing the association information between the different modalities through a KR-product calculation, and finally concatenating the intra-modality feature information and the inter-modality association information to obtain the final fusion vector; the late fusion method concatenates the captured question features, visual features and history features and fuses them through a linear layer.
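For illustration, a minimal sketch of how the three modules could be wired together in a single forward pass is given below. It assumes a PyTorch implementation; the class and argument names (VisualDialogModel, feature_extractor, late_fusion, kr_fusion, answer_encoder) are illustrative assumptions rather than part of the disclosure, while the concatenation of the two fusion outputs and the inner-product scoring over the 100 candidate answers follow the description above.

```python
# Illustrative sketch only; module names and signatures are assumptions, not the patented implementation.
import torch
import torch.nn as nn

class VisualDialogModel(nn.Module):
    def __init__(self, feature_extractor, late_fusion, kr_fusion, answer_encoder):
        super().__init__()
        self.feature_extractor = feature_extractor  # question / visual / history features
        self.late_fusion = late_fusion              # intra-modality feature information
        self.kr_fusion = kr_fusion                  # inter-modality association information (KR product)
        self.answer_encoder = answer_encoder        # LSTM encoding of the 100 candidate answers

    def forward(self, question, image_regions, history, candidates):
        q_t, v_t, h_t = self.feature_extractor(question, image_regions, history)
        f_late = self.late_fusion(q_t, v_t, h_t)        # feature information within the modalities
        f_kr = self.kr_fusion(q_t, v_t, h_t)            # association information between the modalities
        f_t = torch.cat([f_late, f_kr], dim=-1)         # final fusion vector
        a = self.answer_encoder(candidates)             # (batch, 100, dim of f_t)
        scores = (a * f_t.unsqueeze(1)).sum(dim=-1)     # inner product with every candidate answer
        return scores
```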
Compared with the prior art, the technical scheme of the invention has the following beneficial effects:
1. A different-modality information fusion module based on the KR product is proposed to capture the association information between different modalities. Most previous related work focuses on capturing the visual features and history features related to the current question separately and then fusing the visual, question and history features through a late fusion method, but late fusion does not consider the association between the different modalities in depth. The KR-product-based method combines the information of the different modalities through a tensor-product calculation and captures the association information between them.
2. A method for reducing redundant features based on a Mask mechanism is proposed on top of the KR product. In the first layer of KR-product feature combination, because the two input feature matrices are the same and the tensor calculation is performed column by column, the resulting matrix is symmetric and contains repeated combination information. To reduce the redundant information generated when combining information across modalities, the repeated combinations are removed with a Mask mechanism. After the redundant features are removed from the first-layer combined features, the association information between the modalities can be captured more accurately when these features are used to combine the second-layer features.
3. The neural network visual dialogue model captures the intra-modality feature information through a late fusion method and captures the inter-modality association information through the combination of KR product, Mask mechanism and convolutional neural network, and the two parts of information are fused by concatenation. The KR-product calculation requires mapping the visual, question and history features into the same vector space, after which the inter-modality association information is captured by the KR product, the Mask mechanism and the convolutional neural network; such an approach captures the latent association information between the modalities but loses the feature information within each modality. To retain the feature information within the different modalities, the intra-modality information is captured with a late fusion method and then concatenated with the inter-modality association information captured by the KR-based method, so that the final fusion vector contains both the intra-modality feature information and the inter-modality association information. Compared with related work, this way of incorporating the inter-modality association information improves the answer prediction results.
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a diagram of the visual dialogue model based on KR-product fusion of multi-modal information.
Detailed Description
The invention is described in further detail below with reference to the figures and specific embodiments. It should be understood that the specific embodiments described here are intended only to illustrate the invention and not to limit it.
A neural network visual dialogue model based on KR-product fusion of multi-modal information comprises a modal feature extraction module, a different-modality information fusion module and a candidate answer prediction module;
the modal feature extraction module extracts the semantic features of the question, the visual features of the image and the history features of the dialogue history. First, the vector representation of the question is obtained through an LSTM network, a set of entity feature vectors of the image is obtained with a Faster R-CNN network, and the dialogue history (taken either as a whole or with each dialogue round as a unit) is encoded into a vector representation through an LSTM network; then the visual attention distribution and the history attention distribution related to the question are computed through an attention mechanism, and the visual vector and the history vector are obtained by weighted summation.
The different-modality information fusion module fuses the question features, visual features and history features. The question vector, visual vector and history vector are obtained from the modal feature extraction module; their information is fused with a late fusion method to capture the feature information within the different modalities, their features are combined with the KR-product-based multi-modal information fusion method to capture the association information between the different modalities, and finally the intra-modality information and the inter-modality information are concatenated to obtain the final fusion vector.
The candidate answer prediction module obtains the answer vectors of the 100 candidate answers through an LSTM network and computes their inner products with the fusion vector obtained by the different-modality information fusion module, yielding the scores of the 100 candidate answers.
Fig. 1 shows the flow of the visual dialogue method proposed here; Fig. 2 shows the neural network visual dialogue model designed in this embodiment. The specific operation steps are as follows:
(1) All the questions in the VisDial v1.0 data set are traversed (the VisDial v1.0 data set is downloaded from the visual dialogue official website, https://visualdialog.org/data) to determine a fixed length for the input questions (generally the maximum sentence length in the data set); sentences shorter than the fixed length are padded with 0 for the remaining positions, so that all sentences have the same fixed length, which facilitates the computation of the neural network and the attention mechanism.
(2) All the answers in the data set are traversed to determine a fixed length for the answer text; sentences shorter than the fixed length are padded with 0 for the remaining positions, so that all sentences have the same fixed length, which facilitates the computation of the neural network.
(3) A 300-dimensional word vector w_i is obtained for each word with the GloVe tool and used to construct a word vector matrix E = [w_1, ..., w_n], in which each row represents one word. This matrix is used directly as the word vector matrix at model initialization and is set to be trainable, so that the word vectors are further optimized during model training.
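As a concrete illustration of steps (1)–(3), the sketch below (assumed PyTorch code; the reservation of index 0 for padding and the dictionary format of the GloVe vectors are assumptions) builds a trainable embedding matrix initialized from 300-dimensional GloVe vectors and pads sentences to the fixed length:

```python
import torch
import torch.nn as nn

def build_embedding(vocab, glove_vectors, dim=300):
    """vocab: list of words; glove_vectors: assumed dict mapping word -> 300-d vector."""
    weight = torch.zeros(len(vocab) + 1, dim)               # row 0 is reserved for the padding token
    for i, word in enumerate(vocab, start=1):
        if word in glove_vectors:
            weight[i] = torch.as_tensor(glove_vectors[word])
    emb = nn.Embedding(len(vocab) + 1, dim, padding_idx=0)  # word vector matrix E
    emb.weight.data.copy_(weight)
    emb.weight.requires_grad = True                         # optimized further during training
    return emb

def pad_to_fixed_length(token_ids, fixed_len):
    """Pad a token-id list with 0 up to the fixed sentence length used in steps (1)-(2)."""
    return (token_ids + [0] * fixed_len)[:fixed_len]
```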
(4) The feature vectors of each picture are extracted with a pre-trained Faster R-CNN network, generating for each picture a 36 × 2048 entity feature matrix, where 36 is the number of entities identified in the picture and 2048 is the length of each entity feature vector.
(5) The question vector q_t of round t is obtained through an LSTM network. The history information consists of the picture caption and the question-answer pairs of every round in the history; the caption information is concatenated with all the question-answer information, and the history vector h_t is obtained through an LSTM network.
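A minimal sketch of step (5), assuming a standard single-layer PyTorch LSTM whose final hidden state is taken as the sentence feature; the hidden size of 512 follows the dimensions given in step (6), the remaining choices are assumptions:

```python
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """Encodes a padded word-id sequence into one feature vector with an LSTM."""
    def __init__(self, embedding, hidden_size=512):
        super().__init__()
        self.embedding = embedding
        self.lstm = nn.LSTM(embedding.embedding_dim, hidden_size, batch_first=True)

    def forward(self, token_ids):            # token_ids: (batch, seq_len)
        x = self.embedding(token_ids)        # (batch, seq_len, 300)
        _, (h_n, _) = self.lstm(x)           # h_n: (1, batch, 512)
        return h_n.squeeze(0)                # (batch, 512)

# q_t = question_encoder(question_ids)       # question feature of round t
# h_t = history_encoder(history_ids)         # history feature (caption + all previous QA pairs)
```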
(6) The question-related visual vector v_t is obtained with an attention mechanism, calculated as follows:
z_{t,i} = W_2^T ( dropout( tanh( W_1 v_i^I ) ) ∘ q_t )
α_t = softmax( [z_{t,1}, ..., z_{t,36}] )
v_t = Σ_{i=1}^{36} α_{t,i} v_i^I
where v_i^I ∈ R^2048 denotes an entity vector of the picture, the matrix W_1 ∈ R^{512×2048}, the vector q_t ∈ R^512 denotes the question feature of round t, dropout(·) means that values in the vector are discarded with a certain probability, ∘ denotes element-wise multiplication, the matrix W_2 ∈ R^{512×1}, softmax(·) is used to compute the correlation between the different entity features of the picture and the question feature, the vector α_t ∈ R^36 is the attention distribution, and the vector v_t ∈ R^2048 is the visual feature related to the question of round t, obtained as the weighted sum of the different entity features in the image.
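The following is a sketch of the question-guided attention of step (6), with W_1 ∈ R^{512×2048}, W_2 ∈ R^{512×1} and 36 entity features per picture; the exact position of dropout and the dropout rate are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuestionGuidedAttention(nn.Module):
    def __init__(self, region_dim=2048, q_dim=512, dropout=0.5):
        super().__init__()
        self.w1 = nn.Linear(region_dim, q_dim, bias=False)  # W_1
        self.w2 = nn.Linear(q_dim, 1, bias=False)           # W_2
        self.dropout = nn.Dropout(dropout)

    def forward(self, regions, q_t):
        # regions: (batch, 36, 2048) entity features; q_t: (batch, 512) question feature
        joint = self.dropout(torch.tanh(self.w1(regions))) * q_t.unsqueeze(1)  # (batch, 36, 512)
        z = self.w2(joint).squeeze(-1)                       # (batch, 36) attention logits
        alpha = F.softmax(z, dim=-1)                         # attention distribution over entities
        v_t = (alpha.unsqueeze(-1) * regions).sum(dim=1)     # (batch, 2048) question-related visual feature
        return v_t, alpha
```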
(7) The obtained question vector, visual vector and history vector are fused with a late fusion method to obtain a vector f_t^late containing the intra-modality feature information. The calculation is as follows:
f_t^late = W_3^T [ q_t ; v_t ; h_t ]
where [ · ; · ] denotes the concatenation operation, the vector [q_t; v_t; h_t] ∈ R^3072, the matrix W_3 ∈ R^{3072×512}, and the vector f_t^late ∈ R^512.
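A sketch of the late-fusion step (7): the three modality vectors are concatenated and passed through a single fully connected layer corresponding to W_3 ∈ R^{3072×512}; whether an activation function follows the layer is not specified, so none is added here:

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, q_dim=512, v_dim=2048, h_dim=512, out_dim=512):
        super().__init__()
        self.fc = nn.Linear(q_dim + v_dim + h_dim, out_dim, bias=False)  # W_3: 3072 -> 512

    def forward(self, q_t, v_t, h_t):
        # intra-modality fusion vector f_t^late
        return self.fc(torch.cat([q_t, v_t, h_t], dim=-1))
```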
(8) The association information among the question vector, visual vector and history vector is acquired with the KR-product-based multi-modal information fusion method, giving a vector f_t^KR that contains the inter-modality association information. First, the question vector, visual vector and history vector are mapped into the same vector space and concatenated along dimension 0 to obtain a feature matrix X^0 ∈ R^{3×D_s}. The KR product of this feature matrix is then computed column by column to obtain a third-order tensor T^1; the lower-triangular entries of every slice of T^1 are set to 0 through a Mask mechanism; finally, the combination information among the features is captured through a convolutional neural network to obtain a feature matrix X^1. Next, using the feature matrix X^0 and the feature matrix X^1, the KR product of the corresponding columns of X^1 and X^0 is computed to obtain a third-order tensor T^2, and a feature matrix X^2 is obtained through a convolutional neural network. The feature matrices X^1 and X^2 are each compressed along dimension 1 to obtain vectors x^1 and x^2, which are then concatenated to obtain the final vector f_t^KR. The calculation formulas are as follows:
T^1_{:,:,j} = ( X^0_{:,j} ⊗ X^0_{:,j} ) ∘ W_Mask
X^1_{i,j} = sum( W_i^4 ∘ T^1_{:,:,j} )
T^2_{:,:,j} = X^1_{:,j} ⊗ X^0_{:,j}
X^2_{i,j} = sum( W_i^5 ∘ T^2_{:,:,j} )
x^1_i = mean( X^1_{i,:} ),  x^2_i = mean( X^2_{i,:} )
f_t^KR = [ x^1 ; x^2 ]
where W_Mask ∈ R^{3×3} is a mask matrix whose upper-triangular values are 1 and whose remaining values are 0, ∘ denotes element-wise multiplication, ⊗ denotes the tensor operation (mathematically, the outer product), D_s denotes the feature dimension, X^0_{:,j} is a column vector of the matrix X^0, the matrix T^1_{:,:,j} is a slice of the third-order tensor T^1, the matrix W_i^4 ∈ R^{3×3} is the i-th convolution kernel of the first layer, sum(·) is an addition function that sums all values, X^1_{i,j} is a scalar, the j-th value of the i-th row vector of the matrix X^1, and x^1_i is a scalar obtained by average pooling over the vector X^1_{i,:}; X^1_{:,j} is a column vector of the matrix X^1, X^0_{:,j} is a column vector of the matrix X^0, the matrix T^2_{:,:,j} is a slice of the third-order tensor T^2, the matrix W_i^5 ∈ R^{M×3} is the i-th convolution kernel of the second layer, X^2_{i,j} is a scalar, the j-th value of the i-th row vector of the matrix X^2, and x^2_i is a scalar obtained by average pooling over the vector X^2_{i,:}; M and N denote the numbers of convolution kernels in the first and second layers, respectively.
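The sketch below follows the formulas of step (8): the three modality vectors are first projected into a common D_s-dimensional space (the linear projections are an assumption implied by "mapping into the same vector space"), column-wise outer (KR) products build the third-order tensors, an upper-triangular mask removes the redundant symmetric combinations, learnable 3×3 and M×3 kernels applied per slice play the role of the convolution layers, and average pooling plus concatenation yields f_t^KR. The kernel counts M and N and the value of D_s are free hyper-parameters:

```python
import torch
import torch.nn as nn

class KRFusion(nn.Module):
    def __init__(self, q_dim=512, v_dim=2048, h_dim=512, d_s=512, m=16, n=16):
        super().__init__()
        # map the question / visual / history features into the same D_s-dimensional space (assumed linear maps)
        self.proj_q = nn.Linear(q_dim, d_s)
        self.proj_v = nn.Linear(v_dim, d_s)
        self.proj_h = nn.Linear(h_dim, d_s)
        self.w4 = nn.Parameter(torch.randn(m, 3, 3) * 0.01)         # first-layer kernels W^4_i in R^{3x3}
        self.w5 = nn.Parameter(torch.randn(n, m, 3) * 0.01)         # second-layer kernels W^5_i in R^{Mx3}
        self.register_buffer("mask", torch.triu(torch.ones(3, 3)))  # W_Mask: upper triangle 1, rest 0

    def forward(self, q_t, v_t, h_t):
        # X^0: (batch, 3, D_s), the three projected modality vectors stacked along dimension 0
        x0 = torch.stack([self.proj_q(q_t), self.proj_v(v_t), self.proj_h(h_t)], dim=1)
        # T^1_{:,:,j} = (X^0_{:,j} (x) X^0_{:,j}) o W_Mask  ->  (batch, D_s, 3, 3)
        t1 = torch.einsum("bid,bjd->bdij", x0, x0) * self.mask
        # X^1_{i,j} = sum(W^4_i o T^1_{:,:,j})              ->  (batch, M, D_s)
        x1 = torch.einsum("bdij,mij->bmd", t1, self.w4)
        # T^2_{:,:,j} = X^1_{:,j} (x) X^0_{:,j}             ->  (batch, D_s, M, 3)
        t2 = torch.einsum("bmd,bjd->bdmj", x1, x0)
        # X^2_{i,j} = sum(W^5_i o T^2_{:,:,j})              ->  (batch, N, D_s)
        x2 = torch.einsum("bdmj,nmj->bnd", t2, self.w5)
        # average pooling over D_s, then concatenation: f_t^KR in R^{M+N}
        return torch.cat([x1.mean(dim=-1), x2.mean(dim=-1)], dim=-1)
```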
(9) The outputs of step (7) and step (8) are concatenated into the fusion vector f_t = [ f_t^late ; f_t^KR ]. The vector a_i of each candidate answer is obtained through an LSTM network, and the scores of the 100 candidate answers are computed by the candidate answer module as follows:
score_i = sum( f_t ∘ a_i )
where the vector f_t is obtained by concatenation, the vector a_i represents the features of one of the 100 candidate answers, ∘ denotes element-wise multiplication, sum(·) denotes the addition of all values of the vector, and score_i is a scalar, the relevance score of the i-th candidate answer.
The cross entropy between the candidate answer scores and the true answer is computed, and the parameters of the model are then continuously updated through back-propagation.
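A sketch of step (9) and the training objective, assuming the 100 candidate answers of each example are encoded with an LSTM into features of the same dimension as the fusion vector and that the index of the true answer is available (as in the discriminative setting of VisDial v1.0); the cross entropy over the 100 scores is one interpretation of "the cross entropy between the candidate answer scores and the true answer":

```python
import torch.nn.functional as F

def answer_scores(f_t, answer_feats):
    """f_t: (batch, D) fusion vector; answer_feats: (batch, 100, D) candidate answer features.
    score_i = sum(f_t o a_i): element-wise multiplication followed by summation."""
    return (answer_feats * f_t.unsqueeze(1)).sum(dim=-1)   # (batch, 100)

def training_step(f_t, answer_feats, gt_index, optimizer):
    scores = answer_scores(f_t, answer_feats)
    loss = F.cross_entropy(scores, gt_index)               # compare with the true answer
    optimizer.zero_grad()
    loss.backward()                                        # back-propagation updates the model parameters
    optimizer.step()
    return loss.item()
```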
(10) Finally, the method is applied to the RvA and LF models and trained and tested on the VisDial v1.0 data set; the evaluation metrics are R@1, R@5, R@10, Mean (mean rank), MRR (mean reciprocal rank) and NDCG (normalized discounted cumulative gain). As can be seen from the experimental results in Table 1, the method of obtaining multi-modal information based on late fusion is insufficient for capturing the association information between different modalities; the invention uses the KR-product-based multi-modal information fusion method to capture the association information between the different modalities and concatenates it with the feature information within the different modalities to obtain the final fusion vector, and the experiments show that capturing the association information between the different modalities improves the overall performance of the visual dialogue model.
TABLE 1 Comparison of model results on the VisDial v1.0 data set
The present invention is not limited to the above-described embodiments. The foregoing description of the specific embodiments is intended to describe and illustrate the technical solutions of the present invention, and the above specific embodiments are merely illustrative and not restrictive. Those skilled in the art can make many changes and modifications to the invention without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (5)

1. A neural network visual dialogue model based on KR-product fusion of multi-modal information, characterized by comprising a modal feature extraction module, a different-modality information fusion module and a candidate answer prediction module;
the modal feature extraction module extracts the semantic features of the question, the visual features of the image and the history features of the dialogue history, finally obtaining the question features, visual features and history features;
the different-modality information fusion module fuses the question features, visual features and history features to obtain a final fusion vector, considering the information of the different modalities comprehensively so that the information related to the candidate answers can be captured conveniently; it specifically comprises three steps: obtaining the feature information within the different modalities, extracting the association information between the different modalities, and fusing the intra-modality and inter-modality information;
the candidate answer prediction module predicts the candidate answers; its input is the fusion vector obtained by the different-modality information fusion module, and the inner product of the fusion vector with the feature vector of each candidate answer is computed to obtain the score corresponding to each candidate answer.
2. The neural network visual dialogue model based on KR-product fusion of multi-modal information according to claim 1, characterized in that the modal feature extraction module segments the question text into words, uses the GloVe tool to learn the word2vec-style embedded representation of the words, where each word is represented by a dense vector, and then uses an LSTM model to obtain the semantic features of the question, namely the question features;
the entity features in the image are extracted with a pre-trained Faster R-CNN network, where each entity is represented by a dense vector; the attention distribution related to the question is obtained through an attention mechanism, and the entity features are weighted and summed to obtain the visual features of the image;
the historical dialogue information is concatenated and segmented into words, the GloVe tool is used to learn the embedded representation of the words, and an LSTM model is used to obtain the history features of the dialogue history.
3. The neural network visual dialogue model based on KR-product fusion of multi-modal information according to claim 1, characterized in that, in the method for obtaining the feature information within the different modalities, the question features, visual features and history features are concatenated and then fused through a fully connected layer, so as to obtain the feature information within the different modalities;
the method for extracting the association information between the different modalities obtains two interaction matrices through two KR-product calculations: in the first KR-product calculation, the question features, visual features and history features are concatenated along dimension 0 to generate a feature matrix, and the KR product of this feature matrix is computed column by column to generate a third-order tensor; a Mask calculation is applied to the third-order tensor, and its information is captured through a convolution operation to obtain an interacted feature matrix; in the second KR-product calculation, the KR product of the interacted feature matrix and the original feature matrix is computed over the corresponding columns, and a convolution operation is applied to the resulting third-order tensor to generate a second interacted feature matrix; the two interaction matrices are then compressed and concatenated to obtain the final features, which contain the association information between the modalities;
finally, the intra-modality feature information obtained by late fusion and the inter-modality association information obtained by KR-product fusion are concatenated to obtain the final fusion vector, which contains both the intra-modality feature information and the inter-modality association information.
4. The neural network visual dialogue model based on KR-product fusion of multi-modal information according to claim 1, characterized in that the candidate answer prediction module predicts a number of candidate answers to obtain the score of each candidate answer, computes the loss by comparison with the true answer, and then continuously updates the parameters of the neural network visual dialogue model with a back-propagation algorithm, finally obtaining the score corresponding to each candidate answer.
5. A neural network visual dialogue method based on KR-product fusion of multi-modal information, based on the neural network visual dialogue model, characterized by comprising the following steps: capturing the feature information within the different modalities through a late fusion method, capturing the association information between the different modalities through a KR-product calculation, and finally concatenating the intra-modality feature information and the inter-modality association information to obtain the final fusion vector; the late fusion method concatenates the captured question features, visual features and history features and fuses them through a linear layer.
CN202110448254.XA 2021-04-25 2021-04-25 Neural network visual conversation device and method based on KR product fusion multi-mode information Expired - Fee Related CN113177112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110448254.XA CN113177112B (en) 2021-04-25 2021-04-25 Neural network visual conversation device and method based on KR product fusion multi-mode information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110448254.XA CN113177112B (en) 2021-04-25 2021-04-25 Neural network visual conversation device and method based on KR product fusion multi-mode information

Publications (2)

Publication Number Publication Date
CN113177112A true CN113177112A (en) 2021-07-27
CN113177112B CN113177112B (en) 2022-07-01

Family

ID=76925652

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110448254.XA Expired - Fee Related CN113177112B (en) 2021-04-25 2021-04-25 Neural network visual conversation device and method based on KR product fusion multi-mode information

Country Status (1)

Country Link
CN (1) CN113177112B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868451A (en) * 2021-09-02 2021-12-31 天津大学 Cross-modal social network conversation method and device based on context cascade perception
CN114357968A (en) * 2021-11-29 2022-04-15 阿里巴巴达摩院(杭州)科技有限公司 Conversation method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408624A (en) * 2018-11-06 2019-03-01 江西师范大学 A kind of vision chat robots session generation method
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
US10769374B1 (en) * 2019-04-24 2020-09-08 Honghui CHEN Answer selection method for question answering system and the system
CN111859005A (en) * 2020-07-01 2020-10-30 江西理工大学 Cross-layer multi-model feature fusion and image description method based on convolutional decoding
CN111897940A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model
CN111897939A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109408624A (en) * 2018-11-06 2019-03-01 江西师范大学 A kind of vision chat robots session generation method
US10769374B1 (en) * 2019-04-24 2020-09-08 Honghui CHEN Answer selection method for question answering system and the system
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN111859005A (en) * 2020-07-01 2020-10-30 江西理工大学 Cross-layer multi-model feature fusion and image description method based on convolutional decoding
CN111897940A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model
CN111897939A (en) * 2020-08-12 2020-11-06 腾讯科技(深圳)有限公司 Visual dialogue method, training device and training equipment of visual dialogue model

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
QI WU: "Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning", 《2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION》 *
廖文雄: "Research on a generative question answering system based on multi-modal information fusion", Wanfang Data Knowledge Service Platform *
王嫱: "Research on visual dialogue methods based on visual semantic representation", Wanfang Data Knowledge Service Platform *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113868451A (en) * 2021-09-02 2021-12-31 天津大学 Cross-modal social network conversation method and device based on context cascade perception
CN113868451B (en) * 2021-09-02 2024-06-11 天津大学 Cross-modal social network conversation method and device based on context cascade perception
CN114357968A (en) * 2021-11-29 2022-04-15 阿里巴巴达摩院(杭州)科技有限公司 Conversation method and device

Also Published As

Publication number Publication date
CN113177112B (en) 2022-07-01

Similar Documents

Publication Publication Date Title
US20220180202A1 (en) Text processing model training method, and text processing method and apparatus
CN110737801B (en) Content classification method, apparatus, computer device, and storage medium
JP6351689B2 (en) Attention based configurable convolutional neural network (ABC-CNN) system and method for visual question answering
US11776269B2 (en) Action classification in video clips using attention-based neural networks
CN111460121B (en) Visual semantic conversation method and system
CN115329779B (en) Multi-person dialogue emotion recognition method
CN109766557B (en) Emotion analysis method and device, storage medium and terminal equipment
CN110704601A (en) Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN113792177B (en) Scene character visual question-answering method based on knowledge-guided deep attention network
CN113177112B (en) Neural network visual conversation device and method based on KR product fusion multi-mode information
CN115223020B (en) Image processing method, apparatus, device, storage medium, and computer program product
CN114339450B (en) Video comment generation method, system, device and storage medium
CN111597341A (en) Document level relation extraction method, device, equipment and storage medium
CN110705490A (en) Visual emotion recognition method
CN116385937A (en) Method and system for solving video question and answer based on multi-granularity cross-mode interaction framework
Huang et al. Learning long-term temporal contexts using skip RNN for continuous emotion recognition
CN112560440A (en) Deep learning-based syntax dependence method for aspect-level emotion analysis
CN114168769B (en) Visual question-answering method based on GAT relation reasoning
CN115906861A (en) Statement emotion analysis method and device based on interaction aspect information fusion
CN112765955B (en) Cross-modal instance segmentation method under Chinese finger representation
Fenghour et al. Contour mapping for speaker-independent lip reading system
CN116824461B (en) Question understanding guiding video question answering method and system
CN117521674B (en) Method, device, computer equipment and storage medium for generating countermeasure information
US11526781B2 (en) Automatic sentence inferencing network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220701