CN114661874B - Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels - Google Patents
- Publication number: CN114661874B
- Authority: CN (China)
- Legal status: Active (an assumption by Google Patents, not a legal conclusion)
Classifications
- G06F16/3329—Natural language query formulation or dialogue systems
- G06F18/2415—Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio
- G06F40/35—Discourse or dialogue representation
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/045—Combinations of networks
- G06N3/08—Learning methods
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
Abstract
The invention belongs to the technical field of cross-modal tasks, combining computer vision and natural language processing. The technical scheme is a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels, comprising the following steps. Step 1: preprocess the input image, and extract the visual features and geometric features of the salient regions in the input image with an object detection module. Step 2: for question-text embedding, split sentences into words using spaces and punctuation marks (a numeric or number-based token is also treated as one word); represent each word as a vector with a pre-trained word-vector model; finally, pass the word-vector sequence through a long short-term memory (LSTM) network and take the state of the last time step as the question feature. The method makes the trained model robust, generalizes better to more complex visual scenes, improves the semantic quality of the answers, and improves the accuracy of the visual question-answering model.
Description
Technical Field
The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing, and particularly relates to a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels.
Background
Visual question answering is a task that requires simultaneous understanding of visual content, semantic information, and cross-modal relationships. In the past, a great deal of work developed single-modality models within machine vision or natural language processing alone, with profound influence. By combining these two fields, visual question answering, as one branch of the cross-modal field, has great potential impact on wide applications such as visual navigation and remote monitoring.
Currently, various image algorithms have been applied to visual question answering with excellent performance. Mainstream methods fall roughly into two categories: algorithms based on multi-modal fusion and algorithms based on attention mechanisms. Multi-modal fusion algorithms build on a CNN-RNN structure and fuse visual and text features into a unified representation for predicting answers. Attention-mechanism algorithms distinguish the question-relevant information in the image and address the interaction between vision and language. However, neither multi-modal fusion nor attention mechanisms effectively combine text information with image information. Existing visual question-answering models also fail to attend to the object-relation information of the picture and lack the ability to acquire high-level semantic information, so the task faces the challenges of answering different types of questions and of extracting effective semantic information from the picture. A model should attend more to the object-relation information of the picture, so that it can match questions forward to the corresponding answers from the captions, and it should attend more to the picture's high-level semantic information, so that it is more robust when matching answers from the captions.
Disclosure of Invention
The invention aims to overcome the defects of the background art and provide a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels, which makes the trained model more robust, generalizes better to more complex visual scenes, and improves both the semantic quality of the answers and the accuracy of the visual question-answering model.
The technical scheme adopted by the invention is a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels, comprising the following steps:
Step 1: preprocess the input image, and extract the visual features and geometric features of the salient regions in the input image with an object detection module.
Step 2: for question-text embedding, split sentences into words using spaces and punctuation marks (a numeric or number-based token is also treated as one word); represent each word as a vector with a pre-trained word-vector model; finally, pass the word-vector sequence through a long short-term memory (LSTM) network and take the state of the last time step as the question feature.
Step 3: for the embedding of image-caption and close-caption text, likewise split the sentences into words using spaces and punctuation; then concatenate the obtained caption features and convert them into text-paragraph form; finally, encode the text paragraphs with an LSTM network, the output of the last layer being the encoded word-vector sequence.
Step 4: apply an attention mechanism to the visual features and question features obtained in steps 1 and 2 to obtain question-related attention features; feed the visual, geometric, and question features obtained in steps 1 and 2 to a relation reasoning module, which outputs relation features; finally, fuse the attention features and the relation features to generate the visual feature representation.
Step 5: input the word-vector sequences and question features obtained in steps 2 and 3 into a multi-angle semantic module to generate multi-angle semantic features.
Step 6: send the visual features and multi-angle semantic features generated in steps 4 and 5 to a visual-semantic selection gate, which controls, by feature fusion, the contributions of the visual channel and the semantic channel to the predicted answer; in answer prediction, a multi-class classifier selects the answer with the highest probability as the final answer.
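The visual-semantic selection gate of step 6 can be sketched as a convex combination of the two channels. The sigmoid gate form, the scalar gate score, and the function names below are illustrative assumptions, not the patent's exact formulation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(visual_feat, semantic_feat, gate_score):
    """Visual-semantic selection gate (sketch): a gate g in (0, 1) weights
    the visual channel and 1 - g weights the semantic channel."""
    g = sigmoid(gate_score)
    return [g * v + (1.0 - g) * s for v, s in zip(visual_feat, semantic_feat)]

def predict_answer(class_scores):
    """Answer prediction: select the index with the highest probability."""
    return max(range(len(class_scores)), key=lambda k: class_scores[k])
```

With a neutral gate score of 0 (g = 0.5), `gated_fuse([2.0, 4.0], [0.0, 0.0], 0.0)` averages the channels to `[1.0, 2.0]`.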
The invention is also characterized in that:
In step 1, using the object detection module specifically means: obtain object detection boxes with a Faster R-CNN model, and select the K most relevant detection boxes (typically K = 36) as important visual regions. For each selected region i, v_i is a d-dimensional visual object vector, and the input image is finally represented as V = {v_1, v_2, …, v_K}^T. In addition, the geometric features of the input image are also recorded as B = {b_1, b_2, …, b_K}^T, where b_i = (x_i/w, y_i/h, w_i/w, h_i/h); (x_i, y_i), w_i, and h_i denote the center coordinates, width, and height of the selected region i, respectively, and w and h denote the width and height of the input image.
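The normalized geometric feature b_i can be sketched as follows; the exact normalization is an assumption reconstructed from the definitions of (x_i, y_i), w_i, h_i and w, h, since the original formula image is not reproduced here:

```python
def geometric_feature(cx, cy, rw, rh, img_w, img_h):
    """4-d geometry for one detected region: center and size normalized
    by the input image's width and height (assumed form of b_i)."""
    return [cx / img_w, cy / img_h, rw / img_w, rh / img_h]

def geometry_matrix(boxes, img_w, img_h):
    """Stack b_i for all K regions into B = {b_1, ..., b_K}."""
    return [geometric_feature(cx, cy, rw, rh, img_w, img_h)
            for (cx, cy, rw, rh) in boxes]
```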
Step 2 is specifically implemented as follows:
First, truncate each input question Q to at most 14 words, simply discarding words beyond the 14th, and pad questions shorter than 14 words with zero vectors. Then convert the 14-word question into GloVe word vectors, giving a word-embedding sequence of size 14 × 300, which is passed in order through a long short-term memory (LSTM) network whose hidden layer has dimension d_q. Finally, the final hidden state of the LSTM is used as the question embedding of the input question Q.
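The truncate-and-pad preprocessing above can be sketched in plain Python; the tokenizer regex and the `<pad>` marker (which would map to the zero vector at embedding time) are illustrative assumptions:

```python
import re

MAX_LEN = 14   # questions are truncated/padded to 14 words
EMB_DIM = 300  # GloVe dimension used in the text

def tokenize(question):
    """Split on spaces and punctuation; a number counts as one word."""
    return re.findall(r"[A-Za-z0-9']+", question.lower())

def pad_or_truncate(tokens, max_len=MAX_LEN):
    """Discard words beyond max_len; shorter questions are padded (the
    pad token stands in for the zero vector used in the text)."""
    tokens = tokens[:max_len]
    return tokens + ["<pad>"] * (max_len - len(tokens))
```

For example, a six-word question is padded out to a fixed-length sequence of 14 tokens before embedding.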
The text-embedding procedure in step 3 is the same as that of step 2, except for the additional step of concatenating the image captions and close captions.
The attention mechanism in step 4 specifically refers to: introducing a top-down attention mechanism and using soft attention as the attention module, so that question-related visual objects are introduced into the network structure and attention features are output. All visual regions and the corresponding attention feature V_at are expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
The relation reasoning module in step 4 specifically refers to: encoding the relations between image regions with a dual convolution stream, generating two different types of relation features, namely binary relation features and multivariate relation features. The relation reasoning module consists of three parts: feature fusion, binary relation reasoning, and multivariate relation reasoning. The feature-fusion part fuses the visual, geometric, and question features by raising and reducing dimensions to generate pairwise combinations of visual-region features. The binary relation reasoning part mines pairwise visual relations between visual regions through three consecutive 1 × 1 convolution layers to generate binary relation features. The multivariate relation reasoning part mines intra-group visual relations between visual regions through three consecutive 3 × 3 dilated convolution layers to generate multivariate relation features. Finally, the binary and multivariate relation features are combined to obtain the relation features.
The feature fusion comprises the following steps. First, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual-region features V_co = concat[V, B]. Second, the visual-region features V_co and the question feature are mapped into a low-dimensional subspace, where W_v and W_q are learnable parameters, b_v and b_q are biases, and d_s is the dimension of the subspace. To combine visual regions in pairs, the mapped visual-region features are expanded and added to their transpose, yielding the pairwise combination of visual-region features V_fu.
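The pairwise combination V_fu can be sketched in plain Python. Interpreting "expand and add the transpose" as an element-wise sum over every pair of region features is an assumption, since the original formula images are not reproduced here:

```python
def pairwise_combine(regions):
    """V_fu[i][j]: element-wise sum of the mapped features of regions i
    and j, giving a K x K grid of pairwise combined region features."""
    K = len(regions)
    return [[[a + b for a, b in zip(regions[i], regions[j])]
             for j in range(K)]
            for i in range(K)]
```

Note the grid is symmetric by construction: the (i, j) and (j, i) entries are identical, which matches the later symmetrization in the relation reasoning branches.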
The binary relation reasoning comprises the following steps: three consecutive 1 × 1 convolution layers are used, each followed by a ReLU activation layer, the first having d_s channels. The visual-region combination feature V_fu is input into the binary relation reasoning module; denoting the output of the last layer by M_p, M_p is added to its transpose to obtain a symmetric matrix, and the binary relation R_p is finally generated through softmax: R_p = softmax(M_p + M_p^T).
The multivariate relation reasoning comprises the following steps: three consecutive 3 × 3 dilated convolution layers are used, each followed by a ReLU activation layer, with dilation rates of 1, 2, and 4, respectively. All convolutions have stride 1, and zero padding is used so that the output of each convolution has the same size as its input. The visual-region pairwise combination V_fu is input into the multivariate relation reasoning module; as in the binary relation reasoning, the output M_g of the last convolution and ReLU activation layer is added to its transpose to obtain a symmetric matrix, and the multivariate relation R_g is finally generated through softmax: R_g = softmax(M_g + M_g^T).
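The final step shared by both reasoning branches (symmetrize by adding the transpose, then normalize with softmax) can be sketched as:

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def relation_from_scores(M):
    """R = softmax(M + M^T): add the K x K score matrix to its transpose
    to make it symmetric, then normalize each row with softmax."""
    K = len(M)
    sym = [[M[i][j] + M[j][i] for j in range(K)] for i in range(K)]
    return [softmax(r) for r in sym]
```

Each row of R then sums to 1, so R[i] can be read as a distribution over the regions related to region i.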
Step 4 is specifically implemented as follows. First, multi-modal fusion is applied, where 1 ∈ R^d is a vector whose elements are all 1 and ∘ denotes element-wise multiplication. Second, the same mapping matrices are used for all image regions, where P ∈ R^d is a learnable parameter; the attention mapping matrix is obtained from the attention weight ω_i of each image region i. Thus all visual regions and the corresponding attention feature V_at are expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
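The weighted sum V_at = A^T · V can be sketched as follows (the weights are assumed to be already softmax-normalized, as produced by the attention module):

```python
def attention_pool(weights, regions):
    """V_at = A^T V: sum of the K region feature vectors, each weighted
    by its attention weight omega_i."""
    d = len(regions[0])
    return [sum(w * r[k] for w, r in zip(weights, regions))
            for k in range(d)]
```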
The multi-angle semantic module in step 5 associates the question features with the caption features. The specific method is as follows. First, the relevance between each caption t_i and the question q_j is computed by traversal using cosine similarity, and the text features most relevant to the question q_j are selected. Second, the weight coefficient R_i is combined with the caption feature t_i so that semantic information more relevant to the question receives more attention, yielding the weighted caption feature. Each word of the caption is then encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM. Finally, four fusion methods (complete fusion, average-pooling fusion, attention fusion, and maximum-attention fusion) are adopted to improve the model's generalization in understanding semantic information.
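The caption-question relevance step can be sketched as follows. Taking the maximum cosine similarity over question features as the weight coefficient R_i, and combining it with t_i by scalar multiplication, are assumptions, since the original formulas are not reproduced here:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def caption_weights(caption_vecs, question_vecs):
    """R_i: relevance of caption feature t_i to the question, obtained by
    traversing the question features with cosine similarity (max assumed)."""
    return [max(cosine(t, q) for q in question_vecs) for t in caption_vecs]

def weight_captions(caption_vecs, weights):
    """Weighted caption features: captions more relevant to the question
    receive more attention."""
    return [[r * x for x in t] for t, r in zip(caption_vecs, weights)]
```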
Step 5 is specifically implemented as follows.
Step 5.1: associate the question features with the caption features. First, compute the relevance between each caption t_i and the question q_j by traversal using cosine similarity, and select the text features most relevant to the question q_j. Second, combine the weight coefficient R_i with the caption feature t_i so that semantic information more relevant to the question receives more attention, yielding the weighted caption feature. Then encode each word of the caption with a bidirectional LSTM (BiLSTM), and likewise encode each word of the question with a BiLSTM. Finally, adopt the four methods of complete fusion, average-pooling fusion, attention fusion, and maximum-attention fusion to improve the model's generalization in understanding semantic information.
Step 5.2: encode each word of the caption with a bidirectional LSTM (BiLSTM), and likewise encode each word of the question with a BiLSTM; the resulting hidden states of the forward and reverse LSTMs of the caption at the i-th time step, and of the question at the j-th time step, serve as the context embeddings.
Step 5.3: adopt the four fusion strategies of complete fusion, average-pooling fusion, attention fusion, and maximum-attention fusion, respectively, to capture high-level semantic information.
Complete fusion passes each forward and reverse word vector of the caption paragraph, together with the forward and reverse final states of the whole question, into the function F for fusion; the results are l-dimensional vectors representing the forward and reverse complete-fusion features of the i-th caption word vector.
Average-pooling fusion passes the forward (or reverse) caption word-vector features and the forward (or reverse) question features at each time step into the function F for fusion, and then applies average pooling; the results are l-dimensional vectors representing the forward and reverse average-pooling fusion features of the i-th caption word vector.
Attention fusion first computes a similarity coefficient between the caption's context embedding and the question's context embedding through a cosine similarity function. The similarity coefficient is then treated as a weight, multiplied with each forward (or reverse) word-vector embedding of the question, and averaged. The forward and reverse similarity coefficients yield the forward and reverse attention vectors corresponding to the i-th caption word vector, representing the relevance of the whole question to that word. Finally, the attention vector and the caption context embedding are passed into the function F for fusion, giving the forward and reverse attention-fusion features of the i-th caption word vector.
Maximum-attention fusion directly takes the question embedding with the maximum similarity coefficient as the attention vector, and finally passes the attention vector and the caption embedding into the function F for fusion.
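The attention-fusion and maximum-attention-fusion strategies can be sketched together. The cosine weighting and averaging follow the description above, but the exact fusion function F is not reproduced in this text, so these helpers are illustrative assumptions:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def attention_fusion_vector(cap_vec, question_vecs):
    """Attention fusion (sketch): cosine similarities act as weights over
    the question word embeddings, which are then averaged."""
    sims = [cosine(cap_vec, q) for q in question_vecs]
    n, d = len(question_vecs), len(cap_vec)
    return [sum(s * q[k] for s, q in zip(sims, question_vecs)) / n
            for k in range(d)]

def max_attention_vector(cap_vec, question_vecs):
    """Maximum-attention fusion (sketch): take the question embedding with
    the largest similarity coefficient directly as the attention vector."""
    sims = [cosine(cap_vec, q) for q in question_vecs]
    return question_vecs[sims.index(max(sims))]
```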
With the four fusion strategies of step 5, the eight generated feature vectors (four strategies in two directions) are concatenated into the comprehensive fusion feature of the i-th caption word. The comprehensive fusion feature is input into a bidirectional LSTM (BiLSTM), and the final hidden states in both directions are obtained. Next, the final hidden states at the head and tail are concatenated to generate the multi-angle semantic feature. Finally, to facilitate multi-modal feature fusion, the multi-angle semantic feature is mapped to the same dimension as the visual representation, where W_s is a learnable weight matrix and b_s is a bias.
The invention has the following beneficial effects:
1. Based on a multi-angle semantic understanding and adaptive dual-channel model, the invention can capture visual cues and semantic cues of images simultaneously, and adds gating in the late fusion stage to adaptively select visual and semantic information for answering questions, so that the trained model is robust.
2. The invention adopts a visual relation reasoning module in the visual channel, comprising binary relation reasoning and multivariate relation reasoning, which enhances the model's understanding of visual content and gives it stronger generalization in more complex visual scenes.
3. The invention adopts a multi-angle semantic module in the semantic channel to generate semantic features, comprising complete fusion, average-pooling fusion, attention fusion, and maximum-attention fusion, which improves the semantic quality of the answers and the accuracy of the visual question-answering model.
Drawings
FIG. 1 is a diagram of a network model structure of the method of the present invention.
Fig. 2 is a schematic diagram of a relationship inference module in the method of the present invention.
FIG. 3 is a schematic diagram of a multi-angle semantic module in the method of the present invention.
Detailed Description
The invention will be further described with reference to the embodiment shown in the drawings.
The invention discloses a visual question-answering method based on multi-angle semantic understanding and adaptive dual channels, comprising the following steps:
Step 1: the input image is preprocessed, and visual features and geometric features of a salient region in the input image are extracted by using an object detection module. Mesh features are extracted by using pre-training ResNet-101, object regions are explored in cooperation with a fast-RCNN model, 2048-dimensional target region features are extracted, and the K most relevant detection frames (generally K=36) are selected as important visual regions. For each selected region i, V i is a d-dimensional visual object vector, the input image is ultimately represented as v= { V 1,v2,…,vK}T, In addition, the geometric features of the input image are also recorded, denoted as b= { B 1,b2,…,bK}T, where/> (X i,yi),wi,hi represents the center coordinates, width and height of the selected region i, respectively. W, h represents the width and height of the input image, respectively.
Step 2: for the embedding of question text, the method of using spaces and punctuation marks divides sentences into words (numeric or numeric-based words are also regarded as one word); performing vectorization representation on the word by adopting a pre-trained word vector model; finally, word vector representation is used for obtaining the state of the last time step through a long-short memory network, so as to obtain the problem characteristics;
The implementation method is as follows: each input question Q is pruned to a maximum of 14 words, simply discarding additional words beyond 14 words, while questions below 14 words are filled with 0 vectors. The problem of containing 14 words is then transformed into Glove vectors, resulting in word embedding sequences of size 14 x 300, which in turn are passed through a long-short-term memory network (LSTM) with hidden layer of d q dimensions. Finally use The final hidden state of (a) is a problem embedded representation of the input problem Q.
Step 3: for embedding of image captions and close captions text, the sentences are also divided into words by using spaces and punctuation marks, and the sentence length is also set to be 14; then, the invention adopts the first 6 close captions (according to the average value of the caption distribution) as text input, and the obtained caption features are cascaded and converted into the text paragraph form; finally, the text paragraphs are encoded by using a long-short-time memory network, and the output of the last layer is the encoded word vector sequence.
Step 4: apply an attention mechanism to the visual and question features obtained in steps 1 and 2 to obtain question-related attention features; feed the visual, geometric, and question features obtained in steps 1 and 2 to the relation reasoning module, which outputs relation features; finally, fuse the attention features and the relation features to generate the visual feature representation.
The attention mechanism specifically refers to: introducing a top-down attention mechanism and using soft attention as the attention module, so that question-related visual objects are introduced into the network structure and attention features are output. All visual regions and the corresponding attention feature V_at are expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
The relation reasoning module consists of three parts: feature fusion, binary relation reasoning, and multivariate relation reasoning (the use of a relation reasoning module is an innovation of the invention).
The feature fusion comprises the following steps. First, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual-region features V_co = concat[V, B]. Second, the visual-region features V_co and the question feature are mapped into a low-dimensional subspace, where W_v and W_q are learnable parameters, b_v and b_q are biases, and d_s is the dimension of the subspace. To combine visual regions in pairs, the mapped visual-region features are expanded and added to their transpose, yielding the pairwise combination of visual-region features V_fu.
The binary relation reasoning comprises the following steps: three consecutive 1 × 1 convolution layers are used, each followed by a ReLU activation layer, the first having d_s channels. The visual-region combination feature V_fu is input into the binary relation reasoning module; denoting the output of the last layer by M_p, M_p is added to its transpose to obtain a symmetric matrix, and the binary relation R_p is finally generated through softmax: R_p = softmax(M_p + M_p^T).
The steps of the multivariate relation reasoning are as follows: three consecutive 3×3 dilated (hole) convolution layers are employed, each followed by a ReLU activation layer. The dilation rates of the three layers are 1, 2 and 4, respectively. The stride of all convolutions is 1, and zero padding is used so that the output of each convolution has the same size as its input. The pairwise-combined visual area features V_fu are input into the multivariate relation reasoning module; as in the binary relation reasoning, the output of the last convolution layer and ReLU activation layer is added to its own transpose to obtain a symmetric matrix, and softmax finally generates the multivariate relation R_g (i.e. R_g = softmax(S + S^T)) according to the following formula:
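A dilated convolution samples the 3×3 kernel positions with a gap equal to the dilation rate, so padding by the dilation rate preserves the input size. A hedged single-channel numpy sketch of this branch, with the dilation rates 1, 2, 4 from the text and everything else (kernel values, single channel, map size) assumed:

```python
import numpy as np

# Sketch of the multivariate-relation branch: three 3x3 dilated ("hole")
# convolutions with dilation rates 1, 2, 4, stride 1, zero padding for
# same-size output, each followed by ReLU; the final map is symmetrised
# and row-softmaxed into R_g.
def dilated_conv2d(x, kernel, dilation):
    """Same-size 3x3 dilated convolution, stride 1, zero padding."""
    pad = dilation                  # padding by the dilation rate keeps size
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros_like(x)
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * xp[di * dilation:di * dilation + H,
                                       dj * dilation:dj * dilation + W]
    return out

rng = np.random.default_rng(2)
h = rng.normal(size=(6, 6))         # stand-in single-channel relation map
for dilation in (1, 2, 4):          # the three hole-convolution layers
    kernel = rng.normal(size=(3, 3)) * 0.1
    h = np.maximum(dilated_conv2d(h, kernel, dilation), 0.0)  # ReLU
S = h + h.T                         # symmetric relation scores
e = np.exp(S - S.max(axis=1, keepdims=True))
R_g = e / e.sum(axis=1, keepdims=True)   # softmax over regions
```

Because the output size matches the input at every layer, the three dilation rates enlarge the receptive field (to in-group, multi-region relations) without losing resolution, which is the point of using hole convolutions here.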
The specific implementation steps of step 4 are as follows:
First, starting from the simplest bilinear multi-modal fusion, the weight matrix W_i is replaced by two smaller matrices H_i and G_i such that W_i = H_i G_i^T:
where 1 ∈ R^d is an all-ones vector and ∘ denotes element-wise multiplication;
Second, the same mapping matrices are shared across all image areas:
where P ∈ R^d is a learnable parameter; to obtain the attention mapping matrix, the attention weight ω_i of image region i is computed as follows:
Thus the attention feature V_at over all visual areas is expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
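The low-rank factorisation W_i = H_i G_i^T turns the bilinear form v^T W_i q into an element-wise product of two low-dimensional projections. A hedged sketch of the resulting attention computation — the tanh nonlinearity, shared matrices H and G, and all shapes are assumptions; the text only fixes the factorised bilinear structure and the final weighted sum:

```python
import numpy as np

# Sketch of low-rank bilinear attention: project the region features V
# and the question feature q with the shared matrices H and G, combine
# them by element-wise product, and map to one score per region with P.
def bilinear_attention_weights(V, q, H, G, P):
    """Return softmax attention weights over K image regions."""
    joint = np.tanh(V @ H) * np.tanh(q @ G)   # (K, d) element-wise product
    scores = joint @ P                        # (K,) one score per region
    e = np.exp(scores - scores.max())
    return e / e.sum()

K, dv, dq, d = 5, 12, 10, 8
rng = np.random.default_rng(3)
V = rng.normal(size=(K, dv))                  # visual region features
q = rng.normal(size=dq)                       # question feature
H = rng.normal(size=(dv, d)) * 0.1            # low-rank factors (assumed)
G = rng.normal(size=(dq, d)) * 0.1
P = rng.normal(size=d)                        # learnable score vector
omega = bilinear_attention_weights(V, q, H, G, P)
v_at = omega @ V                              # V_at = A^T · V
```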
The specific implementation steps of step 5 are as follows:
Step 5.1: the problem features are associated with the subtitle features. First, the relevance between each subtitle t_i and the problem q_j is computed by traversal using a cosine similarity method, and the text features most relevant to the problem q_j are selected. Second, the weight coefficient R_i is combined with the subtitle feature t_i so that semantic information more relevant to the problem receives more attention, producing the weighted subtitle features. Each word of the subtitle is then encoded using a bidirectional LSTM (BiLSTM), and each word of the problem is likewise encoded with a BiLSTM. Finally, four methods — complete fusion, average pooling fusion, attention fusion and maximum attention fusion — are adopted to improve the generalization capability of the model in understanding semantic information.
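The cosine re-weighting of step 5.1 can be sketched as follows. Variable names (`T`, `q`, `R`) and feature shapes are assumptions; the operation itself — cosine similarity as weight coefficient R_i, then t_i scaled by R_i — is taken from the text.

```python
import numpy as np

# Sketch of step 5.1: each caption feature t_i is scored against the
# question feature q by cosine similarity, and the coefficient R_i
# re-weights the caption so question-relevant semantics get more attention.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def weight_captions(T, q):
    """Weight each caption feature by its cosine similarity to the question."""
    R = np.array([cosine(t, q) for t in T])   # one coefficient per caption
    return R[:, None] * T, R                  # weighted features, weights

rng = np.random.default_rng(4)
T = rng.normal(size=(3, 6))     # three caption features (shapes assumed)
q = rng.normal(size=6)          # question feature
T_weighted, R = weight_captions(T, q)
```

The caption with the largest R_i is the "most relevant text feature" the text refers to.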
Step 5.2: each word of the subtitle is encoded using a bidirectional LSTM (BiLSTM), and each word of the problem is likewise encoded with a BiLSTM:
wherein the four hidden states denote, respectively, the forward and reverse LSTM states of the subtitle at the i-th time step, and the forward and reverse LSTM states of the problem at the j-th time step.
Step 5.3: the invention adopts four fusion strategies — complete fusion, average pooling fusion, attention fusion and maximum attention fusion — to capture high-level semantic information (this is another innovation point of the invention).
The complete fusion strategy passes each forward and reverse word vector of the subtitle paragraph, together with the forward and reverse final states of the whole problem, into the fusion function F, with the following specific formula:
wherein the resulting l-dimensional vectors represent the forward and reverse complete fusion features of the i-th subtitle word vector, respectively.
The average pooling fusion strategy passes the forward (or reverse) word vector features of the subtitle paragraph and the forward (or reverse) problem features at each time step into the fusion function F, and then performs an average pooling operation, with the following specific formula:
wherein the resulting l-dimensional vectors represent the forward and reverse average pooling fusion features of the i-th subtitle word vector, respectively.
The attention fusion strategy first computes similarity coefficients between the context embeddings of the subtitle and those of the problem through a cosine similarity function; the coefficients are then treated as weights, multiplied with each forward (or reverse) word vector embedding of the problem, and averaged, with the following specific formula:
wherein the coefficients denote the forward and reverse similarity coefficients, and the resulting vectors are the forward and reverse attention vectors of the i-th subtitle word vector, representing the relevance of the whole problem to that word.
Finally, the attention vectors and the subtitle context embeddings are passed into the fusion function F to obtain the forward and reverse attention fusion features of the i-th subtitle word vector, as follows:
The maximum attention fusion strategy directly takes the problem embedding with the largest similarity coefficient as the attention vector, and finally passes the attention vector and the subtitle embedding into the fusion function F, with the following specific formula:
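The attention fusion strategy can be sketched for one caption word. The fusion function F is not specified in the text, so a concatenation stand-in is assumed here, and all names and shapes are illustrative; only the cosine-weighted average of question word embeddings follows the description.

```python
import numpy as np

# Sketch of attention fusion for one caption word: cosine similarities
# between the caption word's context embedding and every question word
# embedding become weights; the weighted mean of the question embeddings
# is the attention vector, which F (assumed: concat) fuses with the word.
def attention_fusion(h_cap, H_q):
    sims = np.array([
        h_cap @ h / (np.linalg.norm(h_cap) * np.linalg.norm(h) + 1e-9)
        for h in H_q
    ])                                              # similarity coefficients
    attn_vec = (sims[:, None] * H_q).mean(axis=0)   # weighted average
    return np.concatenate([h_cap, attn_vec])        # F := concat (assumed)

rng = np.random.default_rng(5)
h_cap = rng.normal(size=4)      # forward context of one caption word
H_q = rng.normal(size=(7, 4))   # forward question word embeddings
fused = attention_fusion(h_cap, H_q)
```

Maximum attention fusion differs only in replacing the weighted mean with `H_q[np.argmax(sims)]`.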
The eight feature vectors generated by the four fusion methods are concatenated to obtain the comprehensive fusion feature of the i-th subtitle. The comprehensive fusion feature is then input into a bidirectional LSTM (BiLSTM), and the final hidden states in both directions are obtained as follows:
Next, the final hidden states at the head and the tail are concatenated to generate the multi-angle semantic features. Finally, to facilitate multi-modal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation, as follows:
where W_s is a learnable weight matrix and b_s is a bias.
Step 6: the visual features and multi-angle semantic features generated in steps 4 and 5 are sent into a visual-semantic selection gate, which controls the contributions of the visual channel and the semantic channel to the predicted answer through feature fusion. In answer prediction, a multi-class classifier selects the answer with the highest probability as the final answer.
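One common way to realise such a gate is a sigmoid unit that forms a convex combination of the two channels. This is a hedged sketch only — the text states merely that feature fusion controls each channel's contribution, so the sigmoid form, the names `Wg`/`bg`, and all shapes are assumptions.

```python
import numpy as np

# Sketch of a visual-semantic selection gate: a sigmoid gate g computed
# from the concatenated channel features scales the visual representation
# by g and the semantic representation by (1 - g), letting the network
# adaptively decide which channel drives the answer.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def selection_gate(v_feat, s_feat, Wg, bg):
    g = sigmoid(np.concatenate([v_feat, s_feat]) @ Wg + bg)  # gate in (0,1)
    return g * v_feat + (1.0 - g) * s_feat                   # fused feature

rng = np.random.default_rng(6)
d = 8
v_feat = rng.normal(size=d)          # visual channel representation
s_feat = rng.normal(size=d)          # semantic channel representation
Wg = rng.normal(size=(2 * d, d)) * 0.1
bg = np.zeros(d)
fused = selection_gate(v_feat, s_feat, Wg, bg)
```

Because g lies strictly between 0 and 1, each fused coordinate stays between the corresponding visual and semantic values, so neither channel is ever switched off completely.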
In summary, on the VQA 1.0 and VQA 2.0 datasets, the invention uses an R-CNN-LSTM framework combining an attention mechanism and a relation reasoning method in the visual channel: Faster R-CNN encodes the visual feature vectors and geometric feature vectors of the image, which are input into the visual channel to generate the visual modality representation. The semantic channel uses an LSTM network to encode the concatenation of the global caption and the local captions, and outputs the semantic modality representation through the multi-angle semantic module. Finally, the obtained visual and semantic modality representations are input into the adaptive selection gate, which decides which modality cues to use for predicting the answer.
The innovation points are as follows: first, a relation reasoning module comprising binary relation reasoning and multivariate relation reasoning is adopted in the visual channel; it enhances the model's understanding of visual content and gives the model stronger generalization ability in more complex visual scenes. Second, a multi-angle semantic module comprising the complete fusion, average pooling fusion, attention fusion and maximum attention fusion methods is adopted in the semantic channel to generate semantic features; it improves the semantic quality of answers while improving the accuracy of the visual question-answering model.
Simulation experiments and analysis of experimental results:
1. Data set
The model was evaluated on two public visual question-answering datasets, VQA 1.0 and VQA 2.0. VQA 1.0 is built on the MSCOCO image dataset [38]; its training set contains 248,349 questions and 82,783 pictures, its validation set contains 121,512 questions and 40,504 pictures, and its test set contains 244,302 questions and 81,434 pictures. VQA 2.0 is an iterative version of VQA 1.0 that adds more question samples to make the language bias more balanced. The training set of the VQA 2.0 dataset contains 443,757 questions and 82,783 pictures, the validation set contains 214,354 questions and 40,504 pictures, and the test set contains 447,793 questions and 81,434 pictures. There are three question types: yes/no, number, and other, with the other type accounting for roughly half of all samples. The proposed model is trained on the training and validation sets, and, to ensure fair comparison with other work, test results are reported on the test-development set (test-dev) and the test-standard set (test-standard).
2. Experimental environment
The proposed model is implemented with the PyTorch library, and the test experiments were completed on a GPU server configured with 256 GB of RAM and four Nvidia 1080Ti GPUs (64 GB of memory in total). The model is trained with the Adam optimizer for at most 40 epochs with a batch size of 256. The learning rate is set to 1e-3 in the first training epoch, 2e-3 in the second, and 3e-3 in the third, where it is held until the tenth epoch; thereafter the learning rate decays once every two epochs with a decay rate of 0.5. To prevent gradient explosion, gradient clipping is applied, scaling the gradient of each epoch to one quarter of its original value. To prevent overfitting, a dropout layer with a dropout rate of 0.5 follows each fully connected layer.
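The learning-rate schedule just described can be written as a small function. The warm-up values and the hold-until-epoch-10 plateau are read directly from the text; the exact epoch at which the first halving takes effect is ambiguous in the description, so the boundary chosen below (first decay applied at epoch 12) is an assumption.

```python
# Sketch of the training schedule: 1e-3, 2e-3, 3e-3 over the first three
# epochs, held at 3e-3 until epoch 10, then halved every two epochs
# (decay rate 0.5).  Epochs are 1-indexed; the decay boundary is assumed.
def learning_rate(epoch):
    if epoch <= 3:
        return epoch * 1e-3          # warm-up: 1e-3, 2e-3, 3e-3
    if epoch <= 10:
        return 3e-3                  # plateau until the tenth epoch
    return 3e-3 * 0.5 ** ((epoch - 10) // 2)   # halve every two epochs

schedule = [learning_rate(e) for e in range(1, 15)]
```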
3. Experimental results and analysis
TABLE 1 VQA 1.0 Performance of models in test-development set and test-Standard set
As shown in Table 1, which compares various advanced models with the proposed model (results obtained after training on the training and validation sets), the proposed model clearly outperforms the other models on most metrics, reaching overall accuracies of 69.44% and 69.37% on the test-development and test-standard sets, respectively. On the test-development set, overall accuracy improves by 5.64% over the MAN model, which uses a memory-augmented neural network, and by 0.73% over the best-performing VSDC model. The VSDC model also adopts the idea of semantics-guided prediction and uses a semantic attention mechanism to acquire question-related semantic information. On top of the semantic attention mechanism, the invention adds three further fusion methods to improve the model's multi-angle semantic understanding, and the experimental results show that the multi-angle semantic module in the semantic channel is important for improving prediction accuracy. The proposed model shows the same behavior on the test-standard set.
TABLE 2 VQA 2.0 Performance of models in test-development set and test-Standard set
As shown in Table 2, the performance of the model was further verified on the VQA 2.0 dataset, including the test-development and test-standard sets. Compared with advanced methods, the proposed model performs well on overall accuracy and related metrics. Compared with the MuRel [49] model, overall accuracy improves by 1.22% on the test-development set and 0.89% on the test-standard set. MuRel is a prominent model among current multi-modal relation modeling methods, using a network structure with residual feature learning for end-to-end reasoning; the proposed model outperforms it because the semantic channel guides answer prediction and can exploit a large amount of semantic information. In addition, compared with the VCTREE model, which combines reinforcement learning and supervised learning and is among the best-performing visual question-answering methods at present, the invention shows clear advantages on overall accuracy and related metrics. In summary, comparison with advanced methods shows that the proposed model better mines semantic information on the basis of understanding image content and improves the accuracy of answer prediction.
Claims (7)
1. The visual question-answering method based on multi-angle semantic understanding and self-adaption double channels is characterized by comprising the following steps of:
step 1; preprocessing an input image, and extracting visual features and geometric features of a salient region in the input image by using an object detection module;
Step 2; for the embedding of the problem text, sentences are divided into words using spaces and punctuation marks; the words are vectorized with a pre-trained word vector model; finally, the word vector representations are passed through a long short-term memory network, and the state of the last time step is taken as the problem features;
Step 3; for the embedding of the image caption and close caption text, sentences are likewise divided into words using spaces and punctuation marks; the obtained caption features are then concatenated and converted into the form of a text paragraph; finally, a long short-term memory network encodes the text paragraph, and the output of the last layer is the encoded word vector sequence;
Step 4; applying an attention mechanism to the visual features and the problem features obtained in steps 1 and 2 to obtain attention features related to the problem; feeding the visual features, geometric features and problem features obtained in steps 1 and 2 into a relation reasoning module, which outputs relation features; finally, fusing the attention features and the relation features to generate the visual feature representation;
Step 5; inputting the word vector sequences and the problem features obtained in the step 2 and the step 3 into a multi-angle semantic module to generate multi-angle semantic features;
Step 6; the visual features and the multi-angle semantic features generated in the step 4 and the step 5 are sent to a visual semantic selection gate, and the contribution of the visual channel and the semantic channel to the predicted answer is controlled in a feature fusion mode; the answer with highest probability is selected through a multi-classifier to be used as a final answer in the prediction of the answer;
The attention mechanism in step 4) specifically refers to: introducing a top-down attention mechanism and using a soft attention method as the attention module, so that visual objects related to the problem are introduced into the network structure and attention features are output; wherein the attention feature V_at over all visual areas is expressed as a weighted sum:
V_at = A^T · V
wherein A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix;
The relation reasoning module in step 4) specifically refers to: encoding the relations between image areas through two convolution streams, generating two different types of relation features, namely binary relation features and multivariate relation features; the relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning; the feature fusion module fuses the visual features, geometric features and problem features through dimension raising and reduction to generate pairwise combinations of visual area features; the binary relation reasoning module mines pairwise visual relations between visual areas and generates binary relation features through three consecutive 1×1 convolution layers; the multivariate relation reasoning module mines intra-group visual relations between visual areas and generates multivariate relation features through three consecutive 3×3 dilated convolution layers; finally, the binary relation features and the multivariate relation features are combined to obtain the relation features;
The feature fusion comprises the following steps: first, the object features and geometric features of the K visual areas of the image are concatenated to generate the visual area features V_co = concat[V, B]; second, the visual area features V_co and the problem features are mapped into a low-dimensional subspace:
wherein W_v and W_q are learnable parameters, b_v and b_q are biases, and d_s is the dimension of the subspace;
The binary relation reasoning comprises the steps of employing three consecutive 1×1 convolution layers, each followed by a ReLU activation layer, with the first layer having d_s channels and the last layer reducing to a single channel; the pairwise-combined visual area features V_fu are input into the binary relation reasoning module; the output of the last layer is added to its own transpose to obtain a symmetric matrix, and softmax finally generates the binary relation R_p (i.e. R_p = softmax(S + S^T), with S the last-layer output); the specific formula is as follows:
The steps of the multivariate relation reasoning are as follows: three consecutive 3×3 dilated (hole) convolution layers are employed, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4, respectively; the stride of all convolutions is 1, and zero padding is used so that the output of each convolution has the same size as its input; the pairwise-combined visual area features V_fu are input into the multivariate relation reasoning module; as in the binary relation reasoning, the output of the last convolution layer and ReLU activation layer is added to its own transpose to obtain a symmetric matrix, and softmax finally generates the multivariate relation R_g (i.e. R_g = softmax(S + S^T)) according to the following formula:
The multi-angle semantic module in step 5) is used to associate the problem features with the subtitle features; the specific method is as follows: first, the relevance between each subtitle t_i and the problem q_j is computed by traversal using a cosine similarity method, and the text features most relevant to the problem q_j are selected; second, the weight coefficient R_i is combined with the subtitle feature t_i so that semantic information more relevant to the problem receives more attention, producing the weighted subtitle features; then a bidirectional LSTM encodes each word of the subtitle, and a BiLSTM likewise encodes each word of the problem; finally, the four methods of complete fusion, average pooling fusion, attention fusion and maximum attention fusion are adopted to improve the generalization capability of the model in understanding semantic information.
2. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 1, characterized in that: in step 1), using the object detection module specifically means: obtaining object detection boxes with a Faster R-CNN model and selecting the K most relevant boxes as the important visual areas; for each selected region i, v_i is a d-dimensional visual object vector, and the input image is finally represented as V = {v_1, v_2, …, v_K}^T; in addition, the geometric features of the input image are recorded as B = {b_1, b_2, …, b_K}^T, where (x_i, y_i), w_i, h_i denote the center coordinates, width and height of the selected region i, respectively, and w, h denote the width and height of the input image.
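The geometric feature b_i in the claim above normalises each region's centre, width and height by the image size. A small sketch under that reading — the exact 4-d layout of b_i is an assumption, since the claim's formula image did not survive extraction:

```python
# Sketch of the geometric feature of claim 2: region centre (x, y), width
# wi and height hi are normalised by the image width w and height h,
# giving a scale-invariant 4-d vector b_i (layout assumed).
def geometric_feature(x, y, wi, hi, w, h):
    return [x / w, y / h, wi / w, hi / h]

b = geometric_feature(x=320, y=240, wi=64, hi=48, w=640, h=480)
```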
3. The visual question-answering method based on multi-angle semantic understanding and self-adaption double channels according to claim 2, wherein the method is characterized by comprising the following steps: the step 2) is specifically implemented according to the following steps:
First, each input problem Q is trimmed to at most 14 words: extra words beyond 14 are simply discarded, and problems with fewer than 14 words are padded with 0 vectors; then, the 14 words of the problem are converted into GloVe word vectors, giving a word embedding sequence of size 14 × 300, which is passed through a long short-term memory network with a d_q-dimensional hidden layer; finally, the final hidden state is used as the question embedding representation of the input problem Q;
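The trim-and-pad step can be sketched directly. Here the string token `"<pad>"` stands in for the 0 vector mentioned in the claim; that placeholder and the function name are illustrative, while the length limit of 14 comes from the text.

```python
# Sketch of the question pre-processing in step 2: trim to at most 14
# words, pad shorter questions; "<pad>" stands in for the 0 vector.
def trim_and_pad(words, max_len=14, pad="<pad>"):
    words = words[:max_len]                        # discard extra words
    return words + [pad] * (max_len - len(words))  # pad short questions

q = trim_and_pad("what color is the cat".split())
```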
The text embedding in step 3) is implemented in the same way as the text embedding in step 2), apart from the step of cascading the image captions with the close captions.
4. The visual question-answering method based on multi-angle semantic understanding and self-adapting dual channels according to claim 3, wherein: the specific implementation steps of the step 4 are as follows:
first, according to the multi-modal fusion:
where 1 ∈ R^d is an all-ones vector and ∘ denotes element-wise multiplication;
second, the same mapping matrices are shared across all image areas:
wherein P ∈ R^d is a learnable parameter; to obtain the attention mapping matrix, the attention weight ω_i of image region i is computed as follows:
Thus the attention feature V_at over all visual areas is expressed as a weighted sum:
V_at = A^T · V
where A = [ω_1, ω_2, …, ω_K]^T is the attention mapping matrix.
5. The visual question-answering method based on multi-angle semantic understanding and self-adaption double channels according to claim 4, wherein the method is characterized by comprising the following steps of: the specific implementation steps of the step 5 are as follows:
Step 5.1: the problem features are associated with the subtitle features; first, the relevance between each subtitle t_i and the problem q_j is computed by traversal using a cosine similarity method, and the text features most relevant to the problem q_j are selected; second, the weight coefficient R_i is combined with the subtitle feature t_i so that semantic information more relevant to the problem receives more attention, producing the weighted subtitle features;
Step 5.2: each word of the subtitle is encoded using a bidirectional LSTM (BiLSTM), and each word of the problem is likewise encoded with a BiLSTM:
wherein the four hidden states denote, respectively, the forward and reverse LSTM states of the subtitle at the i-th time step, and the forward and reverse LSTM states of the problem at the j-th time step;
step 5.3: four fusion strategies of complete fusion, average pooling fusion, attention fusion and maximum attention fusion are adopted respectively to capture the advanced semantic information.
6. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 5, characterized in that: the complete fusion passes each forward and reverse word vector of the subtitle paragraph, together with the forward and reverse final states of the whole problem, into the fusion function F, with the following specific formula:
wherein the resulting l-dimensional vectors represent the forward and reverse complete fusion features of the i-th subtitle word vector, respectively;
The average pooling fusion passes the forward (or reverse) word vector features of the subtitle paragraph and the forward (or reverse) problem features at each time step into the fusion function F, and then performs the average pooling operation, with the following specific formula:
wherein the resulting l-dimensional vectors represent the forward and reverse average pooling fusion features of the i-th subtitle word vector, respectively;
The attention fusion first computes similarity coefficients between the context embeddings of the subtitle and those of the problem through a cosine similarity function; the coefficients are then treated as weights, multiplied with each forward (or reverse) word vector embedding of the problem, and averaged, with the following specific formula:
wherein the coefficients denote the forward and reverse similarity coefficients, and the resulting vectors are the forward and reverse attention vectors of the i-th subtitle word vector, representing the relevance of the whole problem to that word;
finally, the attention vectors and the subtitle context embeddings are passed into the fusion function F to obtain the forward and reverse attention fusion features of the i-th subtitle word vector, as follows:
The maximum attention fusion directly takes the problem embedding with the largest similarity coefficient as the attention vector, and finally passes the attention vector and the subtitle embedding into the fusion function F, with the following specific formula:
7. The visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels according to claim 6, characterized in that: the eight feature vectors generated by the four fusion methods in step 5) are concatenated to obtain the comprehensive fusion feature of the i-th subtitle; the comprehensive fusion feature is input into a bidirectional LSTM, and the final hidden states in both directions are obtained according to the following formula:
next, the final hidden states at the head and the tail are concatenated to generate the multi-angle semantic features; finally, to facilitate multi-modal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation, as follows:
where W_s is a learnable weight matrix and b_s is a bias.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210223976.XA CN114661874B (en) | 2022-03-07 | 2022-03-07 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114661874A CN114661874A (en) | 2022-06-24 |
CN114661874B true CN114661874B (en) | 2024-04-30 |
Family
ID=82028726
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210223976.XA Active CN114661874B (en) | 2022-03-07 | 2022-03-07 | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114661874B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115730059A (en) * | 2022-12-08 | 2023-03-03 | 安徽建筑大学 | Visual question answering method, device, equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110163299A (en) * | 2019-05-31 | 2019-08-23 | 合肥工业大学 | A kind of vision answering method based on bottom-up attention mechanism and memory network |
CN110222770A (en) * | 2019-06-10 | 2019-09-10 | 成都澳海川科技有限公司 | A kind of vision answering method based on syntagmatic attention network |
CN110647612A (en) * | 2019-09-18 | 2020-01-03 | 合肥工业大学 | Visual conversation generation method based on double-visual attention network |
CN110717431A (en) * | 2019-09-27 | 2020-01-21 | 华侨大学 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
KR20210056071A (en) * | 2019-11-08 | 2021-05-18 | 경기대학교 산학협력단 | System for visual dialog using deep visual understanding |
CN113886626A (en) * | 2021-09-14 | 2022-01-04 | 西安理工大学 | Visual question-answering method of dynamic memory network model based on multiple attention mechanism |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9965705B2 (en) * | 2015-11-03 | 2018-05-08 | Baidu Usa Llc | Systems and methods for attention-based configurable convolutional neural networks (ABC-CNN) for visual question answering |
CN109902166A (en) * | 2019-03-12 | 2019-06-18 | 北京百度网讯科技有限公司 | Vision Question-Answering Model, electronic equipment and storage medium |
- 2022-03-07: CN application CN202210223976.XA filed; patent CN114661874B active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110163299A (en) * | 2019-05-31 | 2019-08-23 | 合肥工业大学 | A kind of vision answering method based on bottom-up attention mechanism and memory network |
CN110222770A (en) * | 2019-06-10 | 2019-09-10 | 成都澳海川科技有限公司 | A kind of vision answering method based on syntagmatic attention network |
CN110647612A (en) * | 2019-09-18 | 2020-01-03 | 合肥工业大学 | Visual conversation generation method based on double-visual attention network |
CN110717431A (en) * | 2019-09-27 | 2020-01-21 | 华侨大学 | Fine-grained visual question and answer method combined with multi-view attention mechanism |
KR20210056071A (en) * | 2019-11-08 | 2021-05-18 | 경기대학교 산학협력단 | System for visual dialog using deep visual understanding |
CN113886626A (en) * | 2021-09-14 | 2022-01-04 | 西安理工大学 | Visual question-answering method of dynamic memory network model based on multiple attention mechanism |
Non-Patent Citations (3)
Title |
---|
Visual question answering algorithm based on Spatial-DCTHash dynamic parameter network; Meng Xiangshen, Jiang Aiwen, Liu Changhong, Ye Jihua, Wang Mingwen; Scientia Sinica: Informationis; 2017-08-20 (08); full text *
Research on visual question answering algorithms based on visual-semantic dual channels; Wang Xin; China Master's Theses Full-text Database; 2023-02-15; full text *
A visual question answering model combining a bottom-up attention mechanism and memory networks; Yan Ruyu, Liu Xueliang; Journal of Image and Graphics; 2020-05-16 (05); full text *
Also Published As
Publication number | Publication date |
---|---|
CN114661874A (en) | 2022-06-24 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhou et al. | A comprehensive survey on pretrained foundation models: A history from bert to chatgpt | |
CN110163299B (en) | Visual question-answering method based on bottom-up attention mechanism and memory network | |
CN112487807B (en) | Text relation extraction method based on expansion gate convolutional neural network | |
CN111652357B (en) | Method and system for solving video question-answer problem by using specific target network based on graph | |
CN109840322B (en) | Complete shape filling type reading understanding analysis model and method based on reinforcement learning | |
CN110390397A (en) | A kind of text contains recognition methods and device | |
CN113297370B (en) | End-to-end multi-modal question-answering method and system based on multi-interaction attention | |
CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
CN113569001A (en) | Text processing method and device, computer equipment and computer readable storage medium | |
CN115438674B (en) | Entity data processing method, entity linking method, entity data processing device, entity linking device and computer equipment | |
Wei et al. | Enhance understanding and reasoning ability for image captioning | |
CN111145914B (en) | Method and device for determining text entity of lung cancer clinical disease seed bank | |
CN116610778A (en) | Bidirectional image-text matching method based on cross-modal global and local attention mechanism | |
CN115964459B (en) | Multi-hop reasoning question-answering method and system based on food safety cognition spectrum | |
CN117033602A (en) | Method for constructing multi-mode user mental perception question-answering model | |
CN116975350A (en) | Image-text retrieval method, device, equipment and storage medium | |
CN114661874B (en) | Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels | |
CN113609326B (en) | Image description generation method based on relationship between external knowledge and target | |
CN115238691A (en) | Knowledge fusion based embedded multi-intention recognition and slot filling model | |
CN114003770A (en) | Cross-modal video retrieval method inspired by reading strategy | |
CN116661852B (en) | Code searching method based on program dependency graph | |
CN114511813B (en) | Video semantic description method and device | |
CN113779244B (en) | Document emotion classification method and device, storage medium and electronic equipment | |
CN114266905A (en) | Image description generation model method and device based on Transformer structure and computer equipment | |
CN117648429B (en) | Question-answering method and system based on multi-mode self-adaptive search type enhanced large model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||