CN114661874A - Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels - Google Patents

Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels

Info

Publication number: CN114661874A
Application number: CN202210223976.XA
Authority: CN (China)
Prior art keywords: visual, fusion, features, attention, caption
Legal status: Granted; Active
Other languages: Chinese (zh)
Other versions: CN114661874B (en)
Inventors: 王鑫 (Wang Xin), 陈巧红 (Chen Qiaohong)
Current and original assignee: Zhejiang Sci-Tech University (ZSTU)
Application filed by Zhejiang Sci-Tech University (ZSTU); priority to CN202210223976.XA; publication of CN114661874A; application granted; publication of CN114661874B

Classifications

    • G06F16/3329: Information retrieval; querying of unstructured textual data; natural language query formulation or dialogue systems
    • G06F18/2415: Pattern recognition; classification techniques based on parametric or probabilistic models
    • G06F40/35: Handling natural language data; semantic analysis; discourse or dialogue representation
    • G06N3/044: Neural networks; architecture; recurrent networks, e.g. Hopfield networks
    • G06N3/045: Neural networks; architecture; combinations of networks
    • G06N3/08: Neural networks; learning methods
    • G06Q10/06393: Administration; performance analysis; score-carding, benchmarking or key performance indicator [KPI] analysis

Abstract

The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing. The technical scheme is a visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels, comprising the following steps. Step 1: preprocess the input image and extract visual features and geometric features of the salient regions with an object detection module. Step 2: for the embedding of the question text, split the sentence into words using spaces and punctuation (a number, or a word built on a number, is also treated as a single word); represent the words as vectors with a pre-trained word-vector model; finally, pass the word vectors through a long short-term memory (LSTM) network and take the state at the last time step as the question feature. The method makes the trained model more robust, generalizes well to more complex visual scenes, improves the semantic quality of the answers, and improves the accuracy of the visual question-answering model.

Description

Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels
Technical Field
The invention belongs to the technical field of cross-modal tasks combining computer vision and natural language processing, and particularly relates to a visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels.
Background
Visual question answering is a task that requires understanding visual content, semantic information and cross-modal relationships at the same time. A great deal of prior work has developed backbone models within the single domains of machine vision or natural language processing, with far-reaching impact. By combining the two fields, visual question answering, one branch of the cross-modal field, has great potential in wide-ranging applications such as visual navigation and remote monitoring.
At present a variety of image algorithms have been applied to visual question answering with excellent performance. The mainstream methods fall roughly into two categories: multimodal-fusion algorithms and attention-based algorithms. Multimodal-fusion algorithms are built on a CNN-RNN structure and fuse visual and textual features into a unified representation for predicting answers. Attention-mechanism algorithms address visual-language interaction by singling out the information in the image that is relevant to the question. However, neither multimodal fusion nor the attention mechanism alone combines text information and image information effectively: existing visual question-answering models neither attend to the object-relation information of the picture nor acquire high-level semantic information, while the task itself requires answering different types of questions and extracting effective semantic information from the picture. A model should therefore attend more closely to the object-relation information of the picture, be able to match the corresponding answer from the caption according to the question, pay more attention to the high-level semantics of the picture, and remain robust when matching answers from captions.
Disclosure of Invention
The invention aims to overcome the defects of the background art and provides a visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels. The method makes the trained model more robust, generalizes well to more complex visual scenes, improves the semantic quality of the answers, and improves the accuracy of the visual question-answering model.
The technical scheme adopted by the invention is a visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels, comprising the following steps:
Step 1: preprocess the input image and extract visual features and geometric features of the salient regions in the input image with an object detection module;
Step 2: for the embedding of the question text, split the sentence into words using spaces and punctuation (a number, or a word built on a number, is also treated as a single word); represent the words as vectors with a pre-trained word-vector model; finally, pass the word vectors through a long short-term memory (LSTM) network and take the state at the last time step as the question feature;
Step 3: for the embedding of the image caption and the dense caption text, split the sentences into words using spaces and punctuation; then concatenate the obtained caption features and convert them into a text paragraph; finally, encode the text paragraph with an LSTM, the output of the last layer being the encoded word-vector sequence;
Step 4: apply an attention mechanism to the visual features and question features obtained in steps 1 and 2 to obtain attention features related to the question; feed the visual features, geometric features and question features obtained in steps 1 and 2 through a relation reasoning module to output relation features; finally, fuse the attention features and the relation features to generate the visual feature representation;
Step 5: input the word-vector sequence and question features obtained in steps 2 and 3 into a multi-angle semantic module to generate multi-angle semantic features;
Step 6: feed the visual features and multi-angle semantic features generated in steps 4 and 5 into a visual-semantic selection gate, which controls the contributions of the visual channel and the semantic channel to the predicted answer through feature fusion; the answer with the highest probability from the multi-class classifier is selected as the final answer.
The invention is also characterized in that:
In step 1, using the object detection module specifically includes: a Faster R-CNN model is used to obtain object detection boxes, and the K most relevant boxes (typically K = 36) are selected as important visual regions. For each selected region $i$, $v_i \in \mathbb{R}^d$ is a $d$-dimensional visual object vector, so the input image is finally represented as
$$V = \{v_1, v_2, \dots, v_K\}^T.$$
In addition, the geometric features of the input image are recorded as $B = \{b_1, b_2, \dots, b_K\}^T$, where $b_i$ is computed from $(x_i, y_i)$, $w_i$, $h_i$, the center coordinates, width and height of the selected region $i$, normalized by the width $w$ and height $h$ of the input image (formula image).
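As a concrete illustration of step 1, the following PyTorch sketch assembles the visual features V and the geometric features B, assuming the Faster R-CNN region features and boxes are already available. The dimensions (K = 36, d = 2048) follow the description, while the exact normalization of $b_i$ is an assumption, since the original formula is only given as an image.

```python
import torch

def build_region_features(region_feats, boxes, image_w, image_h):
    """Assemble V and B as described in step 1.

    region_feats: (K, d) visual object vectors from Faster R-CNN.
    boxes: (K, 4) boxes as (x_center, y_center, box_w, box_h) in pixels.
    The normalization of the geometric feature b_i is an assumption:
    center and size are divided by the image width/height.
    """
    V = region_feats                                    # V = {v_1, ..., v_K}
    x, y, w, h = boxes.unbind(dim=1)
    B = torch.stack([x / image_w, y / image_h,
                     w / image_w, h / image_h], dim=1)  # assumed 4-d geometric feature
    return V, B

# Toy usage with the dimensions mentioned in the text (K = 36, d = 2048).
V, B = build_region_features(torch.randn(36, 2048),
                             torch.rand(36, 4) * 300, image_w=640, image_h=480)
print(V.shape, B.shape)  # torch.Size([36, 2048]) torch.Size([36, 4])
```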
Step 2 is implemented as follows: first, each input question Q is truncated to at most 14 words; extra words beyond 14 are simply discarded, while questions with fewer than 14 words are padded with zero vectors. The 14-word question is then converted into GloVe word vectors, giving a word-embedding sequence of size 14 × 300, which is passed sequentially through a long short-term memory (LSTM) network with hidden size $d_q$. Finally, the final hidden state $q \in \mathbb{R}^{d_q}$ of the LSTM is used as the question embedding of the input question Q.
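The question-embedding pipeline of step 2 can be sketched as follows; the GloVe table is replaced by a randomly initialized nn.Embedding as a stand-in, and the vocabulary size and hidden size $d_q$ are placeholder values.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """Sketch of step 2: 14-word question -> GloVe-like 300-d embeddings -> LSTM,
    keeping the final hidden state as the question feature q.
    Vocabulary size and d_q are placeholder values."""
    def __init__(self, vocab_size=20000, d_q=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300, padding_idx=0)  # stands in for GloVe
        self.lstm = nn.LSTM(300, d_q, batch_first=True)

    def forward(self, token_ids):              # token_ids: (batch, 14), 0 = padding
        emb = self.embed(token_ids)            # (batch, 14, 300)
        _, (h_n, _) = self.lstm(emb)           # h_n: (1, batch, d_q)
        return h_n.squeeze(0)                  # question feature q, (batch, d_q)

def pad_or_truncate(tokens, max_len=14):
    """Truncate to 14 tokens; shorter questions are padded with id 0 (zero vector)."""
    tokens = tokens[:max_len]
    return tokens + [0] * (max_len - len(tokens))

q = QuestionEncoder()(torch.tensor([pad_or_truncate([5, 8, 2, 9])]))
print(q.shape)  # torch.Size([1, 1024])
```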
The text embedding in step 3 is implemented in the same way as the text embedding in step 2, apart from the concatenation of the image caption and the dense captions.
The attention mechanism in step 4 specifically refers to: a top-down attention mechanism is introduced, and a soft-attention method is inserted into the network structure as the attention module to highlight the visual objects related to the question and output the attention feature. The attention feature $V^{at} \in \mathbb{R}^d$ is the weighted sum of all visual regions:
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix.
The relation reasoning module in step 4 specifically encodes the relations between image regions with two convolution streams, generating two different types of relation features: binary relation features and multivariate relation features. The relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning. The feature fusion module fuses the visual features, geometric features and question features by raising and reducing dimensions to generate pairwise combinations of the visual region features; the binary relation reasoning module mines the pairwise visual relations between visual regions with three consecutive 1 × 1 convolutional layers to generate binary relation features; the multivariate relation reasoning module mines the within-group visual relations between visual regions with three consecutive 3 × 3 dilated convolutional layers to generate multivariate relation features. Finally, the binary relation features and the multivariate relation features are combined to obtain the relation features.
The feature fusion step is as follows: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features $V_{co} = \mathrm{concat}[V, B]$. Second, the visual region features $V_{co}$ and the question features are mapped into a low-dimensional subspace of dimension $d_s$ (formula image), where $W_v$ and $W_q$ are learnable parameters and $b_v$ and $b_q$ are biases. To combine the visual regions in pairs, the mapped visual region features are expanded and added to their transpose, yielding the pairwise combination $V_{fu}$ of the visual region features.
The binary relation reasoning steps are as follows: three consecutive 1 × 1 convolutional layers are used, each followed by a ReLU activation layer; their channel numbers are $d_s$ and two further values given by formula images. When the pairwise combination of visual region features $V_{fu}$ is fed into the binary relation reasoning module, the output of the last layer, denoted $M_p$, is added to its transpose to obtain a symmetric matrix, and softmax finally generates the binary relation $R_p$:
$$R_p = \mathrm{softmax}\!\left(M_p + M_p^T\right).$$
The multivariate relation reasoning steps are as follows: three consecutive 3 × 3 dilated convolutional layers are used, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4 respectively. All convolution strides are 1, and zero padding is used so that the output of each convolution has the same size as its input. When the pairwise combination $V_{fu}$ is fed into the multivariate relation reasoning module, the output of the last convolutional layer and ReLU activation layer, denoted $M_g$, is added to its transpose, as in the binary relation reasoning, to obtain a symmetric matrix, and softmax finally generates the multivariate relation $R_g$:
$$R_g = \mathrm{softmax}\!\left(M_g + M_g^T\right).$$
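A minimal sketch of the two relation-reasoning streams is given below. The intermediate channel widths and the single-channel output of each stream are assumptions (the original channel numbers are only given as formula images); the dilation rates 1, 2, 4, the stride-1 zero-padded convolutions and the symmetrize-then-softmax step follow the description.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationReasoning(nn.Module):
    """Sketch of the relation reasoning module. Channel widths and the final
    1-channel output of each stream are assumptions; the symmetrization plus
    softmax follows the description (R = softmax(M + M^T))."""
    def __init__(self, d_s=512):
        super().__init__()
        # binary stream: three 1x1 convolutions, each followed by ReLU
        self.binary = nn.Sequential(
            nn.Conv2d(d_s, d_s, 1), nn.ReLU(),
            nn.Conv2d(d_s, d_s // 2, 1), nn.ReLU(),
            nn.Conv2d(d_s // 2, 1, 1), nn.ReLU())
        # multivariate stream: three 3x3 dilated convolutions (dilation 1, 2, 4),
        # stride 1, zero padding so the spatial size K x K is preserved
        self.multi = nn.Sequential(
            nn.Conv2d(d_s, d_s, 3, padding=1, dilation=1), nn.ReLU(),
            nn.Conv2d(d_s, d_s // 2, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(d_s // 2, 1, 3, padding=4, dilation=4), nn.ReLU())

    @staticmethod
    def _symmetric_softmax(m):
        # m: (batch, 1, K, K) -> symmetrize and normalize each row with softmax
        m = m.squeeze(1)
        m = m + m.transpose(1, 2)
        return F.softmax(m, dim=-1)

    def forward(self, v_fu):                                # v_fu: (batch, d_s, K, K)
        r_p = self._symmetric_softmax(self.binary(v_fu))    # binary relation R_p
        r_g = self._symmetric_softmax(self.multi(v_fu))     # multivariate relation R_g
        return r_p, r_g

r_p, r_g = RelationReasoning()(torch.randn(2, 512, 36, 36))
print(r_p.shape, r_g.shape)  # torch.Size([2, 36, 36]) torch.Size([2, 36, 36])
```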
Step 4 is implemented as follows:
First, multimodal fusion is performed according to
$$f_i = v_i^T W_i q = \mathbf{1}^T\!\left(H_i^T v_i \circ G_i^T q\right),$$
where $\mathbf{1} \in \mathbb{R}^d$ is a vector whose elements are all 1 and $\circ$ denotes element-wise multiplication.
Second, the same mapping matrices $H$ and $G$ are used for all image regions:
$$f_i = P^T\!\left(H^T v_i \circ G^T q\right),$$
where $P \in \mathbb{R}^d$ is a learning parameter. To obtain the attention mapping matrix, the attention weight $\omega_i$ of image region $i$ is computed as
$$\omega_i = \frac{\exp(f_i)}{\sum_{k=1}^{K} \exp(f_k)}.$$
Thus the attention feature $V^{at}$, the weighted sum of all visual regions, is expressed as
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix.
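The attention computation of step 4 can be sketched as the following low-rank bilinear module; the rank r and the feature dimensions are placeholder values, and the softmax over regions follows the reconstruction above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankBilinearAttention(nn.Module):
    """Sketch of the question-guided soft attention of step 4, using the
    low-rank bilinear form P^T(H^T v_i o G^T q) with H, G shared over all
    regions; the rank r is a placeholder."""
    def __init__(self, d_v=2048, d_q=1024, r=512):
        super().__init__()
        self.H = nn.Linear(d_v, r, bias=False)   # H^T v_i
        self.G = nn.Linear(d_q, r, bias=False)   # G^T q
        self.P = nn.Linear(r, 1, bias=False)     # P^T (.)

    def forward(self, V, q):
        # V: (batch, K, d_v) region features, q: (batch, d_q) question feature
        fused = self.H(V) * self.G(q).unsqueeze(1)      # element-wise product, (batch, K, r)
        f = self.P(fused).squeeze(-1)                   # scores f_i, (batch, K)
        A = F.softmax(f, dim=-1)                        # attention weights omega_i
        V_at = torch.bmm(A.unsqueeze(1), V).squeeze(1)  # V^at = A^T V, (batch, d_v)
        return V_at, A

V_at, A = LowRankBilinearAttention()(torch.randn(2, 36, 2048), torch.randn(2, 1024))
print(V_at.shape, A.shape)  # torch.Size([2, 2048]) torch.Size([2, 36])
```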
The multi-angle semantic module in step 5 associates the question features with the caption features. The specific method is as follows: first, the cosine similarity between each caption word $t_i$ and each question word $q_j$ is computed by traversal, and the text features most relevant to question $q_j$ are selected. Second, the weight coefficient $R_i$ is combined with the caption feature $t_i$, so that semantic information more relevant to the question receives more attention, i.e.
$$\tilde{t}_i = R_i \cdot t_i,$$
where $\tilde{t}_i$ denotes the weighted caption feature. Then each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM. Finally, four methods, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted to improve the model's generalization ability in understanding semantic information.
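A small sketch of the question-caption association is shown below; taking the maximum cosine similarity over the question words as the weight $R_i$ is an assumption, since the excerpt only states that the features most relevant to the question are selected.

```python
import torch
import torch.nn.functional as F

def weight_captions_by_question(caption_emb, question_emb):
    """Sketch of the question-caption association: for each caption word t_i,
    take the maximum cosine similarity to any question word q_j as the weight
    R_i and scale t_i by it (the use of the maximum over j is an assumption).

    caption_emb:  (T, e) caption word embeddings
    question_emb: (J, e) question word embeddings
    """
    sim = F.cosine_similarity(caption_emb.unsqueeze(1),        # (T, J)
                              question_emb.unsqueeze(0), dim=-1)
    R = sim.max(dim=1).values                                  # R_i, (T,)
    return R.unsqueeze(1) * caption_emb                        # weighted caption features

weighted = weight_captions_by_question(torch.randn(84, 300), torch.randn(14, 300))
print(weighted.shape)  # torch.Size([84, 300])
```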
Step 5 is implemented as follows:
Step 5.1: associate the question features with the caption features. First, the cosine similarity between caption word $t_i$ and question word $q_j$ is computed by traversal, and the text features most relevant to question $q_j$ are selected. Second, the weight coefficient $R_i$ is combined with the caption feature $t_i$, so that semantic information more relevant to the question receives more attention, i.e. $\tilde{t}_i = R_i \cdot t_i$, where $\tilde{t}_i$ denotes the weighted caption feature. Then each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM. Finally, four methods, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted to improve the model's generalization ability in understanding semantic information;
Step 5.2: each word of the caption is encoded with a BiLSTM, and each word of the question is encoded with a BiLSTM at the same time:
$$\overrightarrow{h}^{t}_{i} = \overrightarrow{\mathrm{LSTM}}\!\left(\tilde{t}_i,\ \overrightarrow{h}^{t}_{i-1}\right), \qquad \overleftarrow{h}^{t}_{i} = \overleftarrow{\mathrm{LSTM}}\!\left(\tilde{t}_i,\ \overleftarrow{h}^{t}_{i+1}\right),$$
$$\overrightarrow{h}^{q}_{j} = \overrightarrow{\mathrm{LSTM}}\!\left(q_j,\ \overrightarrow{h}^{q}_{j-1}\right), \qquad \overleftarrow{h}^{q}_{j} = \overleftarrow{\mathrm{LSTM}}\!\left(q_j,\ \overleftarrow{h}^{q}_{j+1}\right),$$
where $\overrightarrow{h}^{t}_{i}$ and $\overleftarrow{h}^{t}_{i}$ denote the hidden states of the forward and backward LSTM of the caption at the $i$-th time step, and $\overrightarrow{h}^{q}_{j}$ and $\overleftarrow{h}^{q}_{j}$ denote the hidden states of the forward and backward LSTM of the question at the $j$-th time step;
Step 5.3: four fusion strategies, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted respectively to capture high-level semantic information.
Complete fusion passes each forward and backward word vector of the caption paragraph, together with the forward and backward final states of the whole question (J being the number of question time steps), into the fusion function F:
$$\overrightarrow{m}^{full}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{J}\right), \qquad \overleftarrow{m}^{full}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{1}\right),$$
where $\overrightarrow{m}^{full}_{i}$ and $\overleftarrow{m}^{full}_{i}$ are $l$-dimensional vectors denoting the forward and backward complete-fusion features of the $i$-th caption word vector.
Average-pooling fusion passes the forward (or backward) word-vector features of the caption paragraph and the forward (or backward) question features at every time step into the function F and then performs average pooling:
$$\overrightarrow{m}^{avg}_{i} = \frac{1}{J}\sum_{j=1}^{J} F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{j}\right), \qquad \overleftarrow{m}^{avg}_{i} = \frac{1}{J}\sum_{j=1}^{J} F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{j}\right),$$
where $\overrightarrow{m}^{avg}_{i}$ and $\overleftarrow{m}^{avg}_{i}$ are $l$-dimensional vectors denoting the forward and backward average-pooling fusion features of the $i$-th caption word vector.
Attention fusion first computes a similarity coefficient between the caption context embedding and the question context embedding with the cosine similarity function, then uses the coefficient as a weight, multiplies it with each forward (or backward) question word embedding, and takes the average:
$$\overrightarrow{\alpha}_{i,j} = \cos\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{j}\right), \qquad \overleftarrow{\alpha}_{i,j} = \cos\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{j}\right),$$
$$\overrightarrow{h}^{mean}_{i} = \frac{1}{J}\sum_{j=1}^{J} \overrightarrow{\alpha}_{i,j}\,\overrightarrow{h}^{q}_{j}, \qquad \overleftarrow{h}^{mean}_{i} = \frac{1}{J}\sum_{j=1}^{J} \overleftarrow{\alpha}_{i,j}\,\overleftarrow{h}^{q}_{j},$$
where $\overrightarrow{\alpha}_{i,j}$ and $\overleftarrow{\alpha}_{i,j}$ are the forward and backward similarity coefficients, and $\overrightarrow{h}^{mean}_{i}$ and $\overleftarrow{h}^{mean}_{i}$ are the attention vectors corresponding to the $i$-th caption word vector in the forward and backward directions, representing the relevance of the whole question to that word.
Finally, the attention vector and the caption context embedding are passed into the function F to obtain the forward and backward attention-fusion features of the $i$-th caption word vector:
$$\overrightarrow{m}^{att}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{mean}_{i}\right), \qquad \overleftarrow{m}^{att}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{mean}_{i}\right).$$
Maximum-attention fusion directly takes the question embedding with the largest similarity coefficient as the attention vector and finally passes the attention vector and the caption embedding into the function F:
$$\overrightarrow{h}^{max}_{i} = \overrightarrow{h}^{q}_{j^{*}},\ \ j^{*} = \arg\max_{j} \overrightarrow{\alpha}_{i,j}, \qquad \overleftarrow{h}^{max}_{i} = \overleftarrow{h}^{q}_{j^{*}},\ \ j^{*} = \arg\max_{j} \overleftarrow{\alpha}_{i,j},$$
$$\overrightarrow{m}^{max}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{max}_{i}\right), \qquad \overleftarrow{m}^{max}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{max}_{i}\right).$$
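The four matching strategies can be sketched (forward direction only) as follows. The form of the fusion function F is not specified in this excerpt, so a simple projected element-wise product is used as a stand-in, and the attentive average is a plain mean of the similarity-weighted question states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FourWayFusion(nn.Module):
    """Sketch of the four fusion strategies (forward direction only).
    The fusion function F is a stand-in: it projects the element-wise
    product of its two inputs to an l-dimensional vector."""
    def __init__(self, hidden=256, l=64):
        super().__init__()
        self.F = nn.Linear(hidden, l)                 # placeholder for the F function

    def fuse(self, a, b):
        return torch.tanh(self.F(a * b))              # F(a, b), shape (T, l)

    def forward(self, h_t, h_q):
        # h_t: (T, hidden) caption states, h_q: (J, hidden) question states
        full = self.fuse(h_t, h_q[-1].expand_as(h_t))                     # complete fusion
        avg = torch.stack([self.fuse(h_t, h_q[j].expand_as(h_t))
                           for j in range(h_q.size(0))]).mean(0)          # average-pooling fusion
        alpha = F.cosine_similarity(h_t.unsqueeze(1), h_q.unsqueeze(0), dim=-1)  # (T, J)
        att_vec = (alpha.unsqueeze(-1) * h_q.unsqueeze(0)).mean(dim=1)    # mean of weighted q states
        att = self.fuse(h_t, att_vec)                                     # attention fusion
        max_vec = h_q[alpha.argmax(dim=1)]                                # most similar question state
        mx = self.fuse(h_t, max_vec)                                      # maximum-attention fusion
        return torch.cat([full, avg, att, mx], dim=-1)                    # (T, 4*l)

out = FourWayFusion()(torch.randn(84, 256), torch.randn(14, 256))
print(out.shape)  # torch.Size([84, 256])
```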
With the four fusion strategies in step 5, the comprehensive fusion feature of the $i$-th caption word is obtained by concatenating the eight generated feature vectors and is denoted
$$m_i = \left[\overrightarrow{m}^{full}_{i};\ \overleftarrow{m}^{full}_{i};\ \overrightarrow{m}^{avg}_{i};\ \overleftarrow{m}^{avg}_{i};\ \overrightarrow{m}^{att}_{i};\ \overleftarrow{m}^{att}_{i};\ \overrightarrow{m}^{max}_{i};\ \overleftarrow{m}^{max}_{i}\right].$$
The comprehensive fusion features are input into a bidirectional LSTM (BiLSTM) and the final hidden states of the two directions are obtained:
$$\overrightarrow{h}^{m} = \overrightarrow{\mathrm{LSTM}}\!\left(m_{1:I}\right), \qquad \overleftarrow{h}^{m} = \overleftarrow{\mathrm{LSTM}}\!\left(m_{I:1}\right).$$
Second, the final hidden states of the two ends are concatenated to generate the multi-angle semantic feature
$$S = \left[\overrightarrow{h}^{m};\ \overleftarrow{h}^{m}\right].$$
Finally, to facilitate multi-modal feature fusion, the multi-angle semantic feature is mapped to the same dimension as the visual representation:
$$S' = W_s S + b_s,$$
where $W_s$ is a learnable weight matrix and $b_s$ is a bias.
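A short sketch of this final aggregation follows; all dimensions are placeholders.

```python
import torch
import torch.nn as nn

class MultiAngleAggregator(nn.Module):
    """Sketch of the final aggregation: the concatenated fusion features m_i are
    fed into a BiLSTM, the two final hidden states are concatenated into the
    multi-angle semantic feature S, and S is projected to the visual dimension."""
    def __init__(self, m_dim=512, hidden=512, visual_dim=2048):
        super().__init__()
        self.bilstm = nn.LSTM(m_dim, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, visual_dim)   # S' = W_s S + b_s

    def forward(self, m):                 # m: (batch, T, m_dim) comprehensive fusion features
        _, (h_n, _) = self.bilstm(m)      # h_n: (2, batch, hidden), final states of both directions
        S = torch.cat([h_n[0], h_n[1]], dim=-1)          # multi-angle semantic feature
        return self.proj(S)               # mapped to the same dimension as the visual feature

S_prime = MultiAngleAggregator()(torch.randn(2, 84, 512))
print(S_prime.shape)  # torch.Size([2, 2048])
```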
The invention has the beneficial effects that:
1. The invention builds on a multi-angle semantic understanding and self-adaptive dual-channel model that captures visual cues and semantic cues of the image simultaneously, and adds a gate at the late-fusion stage that adaptively selects visual information and semantic information to answer the question, making the trained model more robust.
2. A visual relation reasoning module, comprising a binary relation reasoning module and a multivariate relation reasoning module, is adopted in the visual channel; it strengthens the model's understanding of visual content and generalizes well to more complex visual scenes.
3. A multi-angle semantic module, comprising complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, is adopted in the semantic channel to generate semantic features, improving the semantic quality of the answers and the accuracy of the visual question-answering model.
Drawings
Fig. 1 is a diagram of a network model architecture for the method of the present invention.
FIG. 2 is a schematic diagram of a relationship inference module in the method of the present invention.
FIG. 3 is a diagram illustrating a multi-angle semantic module in the method of the present invention.
Detailed Description
The invention will be further described with reference to the embodiment shown in the drawings.
The invention relates to a visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels, which comprises the following steps of:
step 1: the input image is preprocessed, and visual features and geometric features of a salient region in the input image are extracted by using an object detection module. Extracting grid features by using a pre-trained ResNet-101, searching an object region by matching with a Faster-RCNN model, extracting 2048-dimensional target region features, and selecting the most relevant K detection frames (generally K is 36) as important visual regions. For each selected region i, viIs a d-dimensional visual object vector, the input image is finally expressed as V ═ V1,v2,…,vK}T
Figure BDA0003534898230000101
In addition, the geometric features of the input image are also recorded, and are recorded as B ═ B1,b2,…,bK}TWherein
Figure BDA0003534898230000102
Figure BDA0003534898230000103
(xi,yi),wi,hiRespectively representing the center coordinate, width and height of the selected area i. w, h represent the width and height of the input image, respectively.
Step 2: for the embedding of the question text, the sentence is split into words using spaces and punctuation (a number, or a word built on a number, is also treated as a single word); the words are then represented as vectors with a pre-trained word-vector model; finally, the word vectors are passed through a long short-term memory (LSTM) network and the state at the last time step is taken as the question feature.
The implementation is as follows: each input question Q is truncated to at most 14 words, extra words beyond 14 are simply discarded, and questions with fewer than 14 words are padded with zero vectors. The 14-word question is then converted into GloVe word vectors, giving a word-embedding sequence of size 14 × 300, which is passed sequentially through an LSTM with hidden size $d_q$. Finally, the final hidden state $q \in \mathbb{R}^{d_q}$ is used as the question embedding of the input question Q.
Step 3: for the embedding of the image caption and the dense caption text, the sentences are split into words using spaces and punctuation, and the sentence length is likewise set to 14. The first 6 dense captions (according to the average of the caption distribution) are taken as text input, and the obtained caption features are concatenated and converted into a text paragraph. Finally, the text paragraph is encoded with an LSTM, and the output of the last layer is the encoded word-vector sequence.
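The caption-paragraph encoding of step 3 can be sketched as follows, assuming the image caption and the six dense captions have already been tokenized to 14 ids each; the vocabulary size and hidden size are placeholders.

```python
import torch
import torch.nn as nn

class CaptionParagraphEncoder(nn.Module):
    """Sketch of step 3: the image caption and the first 6 dense captions
    (each truncated/padded to 14 tokens) are concatenated into one text
    paragraph and encoded with an LSTM; the whole output sequence is kept as
    the encoded word-vector sequence."""
    def __init__(self, vocab_size=20000, hidden=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300, padding_idx=0)
        self.lstm = nn.LSTM(300, hidden, batch_first=True)

    def forward(self, caption_ids, dense_ids):
        # caption_ids: (batch, 14), dense_ids: (batch, 6, 14)
        paragraph = torch.cat([caption_ids, dense_ids.flatten(1)], dim=1)  # (batch, 7*14)
        out, _ = self.lstm(self.embed(paragraph))
        return out                                       # encoded word-vector sequence

seq = CaptionParagraphEncoder()(torch.randint(1, 20000, (2, 14)),
                                torch.randint(1, 20000, (2, 6, 14)))
print(seq.shape)  # torch.Size([2, 98, 1024])
```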
Step 4: an attention mechanism is applied to the visual features and question features obtained in steps 1 and 2 to obtain attention features related to the question; the visual features, geometric features and question features obtained in steps 1 and 2 are fed through a relation reasoning module to output relation features; finally, the attention features and the relation features are fused to generate the visual feature representation.
The attention mechanism is specifically as follows: a top-down attention mechanism is introduced, and a soft-attention method is inserted into the network structure as the attention module to highlight the visual objects related to the question and output the attention feature. The attention feature $V^{at} \in \mathbb{R}^d$ is the weighted sum of all visual regions:
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix.
The relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning (the relation reasoning module is an innovative point of the present invention).
The feature fusion steps are as follows: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features $V_{co} = \mathrm{concat}[V, B]$. Second, the visual region features $V_{co}$ and the question features are mapped into a low-dimensional subspace of dimension $d_s$ (formula image), where $W_v$ and $W_q$ are learnable parameters and $b_v$ and $b_q$ are biases. To combine the visual regions in pairs, the mapped visual region features are expanded and added to their transpose, yielding the pairwise combination $V_{fu}$ of the visual region features.
The binary relation reasoning steps are as follows: three consecutive 1 × 1 convolutional layers are used, each followed by a ReLU activation layer; their channel numbers are $d_s$ and two further values given by formula images. When the pairwise combination of visual region features $V_{fu}$ is fed into the binary relation reasoning module, the output of the last layer, denoted $M_p$, is added to its transpose to obtain a symmetric matrix, and softmax finally generates the binary relation $R_p$:
$$R_p = \mathrm{softmax}\!\left(M_p + M_p^T\right).$$
The multivariate relation reasoning steps are as follows: three consecutive 3 × 3 dilated convolutional layers are used, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4 respectively. All convolution strides are 1, and zero padding is used so that the output of each convolution has the same size as its input. When the pairwise combination $V_{fu}$ is fed into the multivariate relation reasoning module, the output of the last convolutional layer and ReLU activation layer, denoted $M_g$, is added to its transpose, as in the binary relation reasoning, to obtain a symmetric matrix, and softmax finally generates the multivariate relation $R_g$:
$$R_g = \mathrm{softmax}\!\left(M_g + M_g^T\right).$$
Step 4 is implemented as follows:
First, starting from the simplest bilinear multimodal fusion
$$f_i = v_i^T W_i q,$$
$W_i$ is replaced by two smaller matrices $H_i G_i^T$, so that
$$f_i = v_i^T H_i G_i^T q = \mathbf{1}^T\!\left(H_i^T v_i \circ G_i^T q\right),$$
where $\mathbf{1} \in \mathbb{R}^d$ is a vector whose elements are all 1 and $\circ$ denotes element-wise multiplication.
Second, the same mapping matrices $H$ and $G$ are used for all image regions:
$$f_i = P^T\!\left(H^T v_i \circ G^T q\right),$$
where $P \in \mathbb{R}^d$ is a learning parameter. To obtain the attention mapping matrix, the attention weight $\omega_i$ of image region $i$ is computed as
$$\omega_i = \frac{\exp(f_i)}{\sum_{k=1}^{K} \exp(f_k)}.$$
Thus the attention feature $V^{at}$, the weighted sum of all visual regions, is expressed as
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix.
Step 5 is implemented as follows:
Step 5.1: the question features are associated with the caption features. First, the cosine similarity between caption word $t_i$ and question word $q_j$ is computed by traversal, and the text features most relevant to question $q_j$ are selected. Second, the weight coefficient $R_i$ is combined with the caption feature $t_i$, so that semantic information more relevant to the question receives more attention, i.e. $\tilde{t}_i = R_i \cdot t_i$, where $\tilde{t}_i$ denotes the weighted caption feature. Then each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM. Finally, four methods, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted to improve the model's generalization ability in understanding semantic information.
Step 5.2: each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is encoded with a BiLSTM at the same time:
$$\overrightarrow{h}^{t}_{i} = \overrightarrow{\mathrm{LSTM}}\!\left(\tilde{t}_i,\ \overrightarrow{h}^{t}_{i-1}\right), \qquad \overleftarrow{h}^{t}_{i} = \overleftarrow{\mathrm{LSTM}}\!\left(\tilde{t}_i,\ \overleftarrow{h}^{t}_{i+1}\right),$$
$$\overrightarrow{h}^{q}_{j} = \overrightarrow{\mathrm{LSTM}}\!\left(q_j,\ \overrightarrow{h}^{q}_{j-1}\right), \qquad \overleftarrow{h}^{q}_{j} = \overleftarrow{\mathrm{LSTM}}\!\left(q_j,\ \overleftarrow{h}^{q}_{j+1}\right),$$
where $\overrightarrow{h}^{t}_{i}$ and $\overleftarrow{h}^{t}_{i}$ denote the hidden states of the forward and backward LSTM of the caption at the $i$-th time step, and $\overrightarrow{h}^{q}_{j}$ and $\overleftarrow{h}^{q}_{j}$ denote the hidden states of the forward and backward LSTM of the question at the $j$-th time step.
Step 5.3: four fusion strategies, namely complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, are adopted respectively to capture high-level semantic information (this is another innovative point of the present invention).
The complete fusion strategy passes each forward and backward word vector of the caption paragraph, together with the forward and backward final states of the whole question (J being the number of question time steps), into the fusion function F:
$$\overrightarrow{m}^{full}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{J}\right), \qquad \overleftarrow{m}^{full}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{1}\right),$$
where $\overrightarrow{m}^{full}_{i}$ and $\overleftarrow{m}^{full}_{i}$ are $l$-dimensional vectors denoting the forward and backward complete-fusion features of the $i$-th caption word vector.
The average-pooling fusion strategy passes the forward (or backward) word-vector features of the caption paragraph and the forward (or backward) question features at every time step into the function F and then performs average pooling:
$$\overrightarrow{m}^{avg}_{i} = \frac{1}{J}\sum_{j=1}^{J} F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{j}\right), \qquad \overleftarrow{m}^{avg}_{i} = \frac{1}{J}\sum_{j=1}^{J} F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{j}\right),$$
where $\overrightarrow{m}^{avg}_{i}$ and $\overleftarrow{m}^{avg}_{i}$ are $l$-dimensional vectors denoting the forward and backward average-pooling fusion features of the $i$-th caption word vector.
The attention fusion strategy first computes a similarity coefficient between the caption context embedding and the question context embedding with the cosine similarity function, then uses the coefficient as a weight, multiplies it with each forward (or backward) question word embedding, and takes the average:
$$\overrightarrow{\alpha}_{i,j} = \cos\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{q}_{j}\right), \qquad \overleftarrow{\alpha}_{i,j} = \cos\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{q}_{j}\right),$$
$$\overrightarrow{h}^{mean}_{i} = \frac{1}{J}\sum_{j=1}^{J} \overrightarrow{\alpha}_{i,j}\,\overrightarrow{h}^{q}_{j}, \qquad \overleftarrow{h}^{mean}_{i} = \frac{1}{J}\sum_{j=1}^{J} \overleftarrow{\alpha}_{i,j}\,\overleftarrow{h}^{q}_{j},$$
where $\overrightarrow{\alpha}_{i,j}$ and $\overleftarrow{\alpha}_{i,j}$ are the forward and backward similarity coefficients, and $\overrightarrow{h}^{mean}_{i}$ and $\overleftarrow{h}^{mean}_{i}$ are the attention vectors corresponding to the $i$-th caption word vector in the forward and backward directions, representing the relevance of the whole question to that word.
Finally, the attention vector and the caption context embedding are passed into the function F to obtain the forward and backward attention-fusion features of the $i$-th caption word vector:
$$\overrightarrow{m}^{att}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{mean}_{i}\right), \qquad \overleftarrow{m}^{att}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{mean}_{i}\right).$$
The maximum-attention fusion strategy directly takes the question embedding with the largest similarity coefficient as the attention vector and finally passes the attention vector and the caption embedding into the function F:
$$\overrightarrow{h}^{max}_{i} = \overrightarrow{h}^{q}_{j^{*}},\ \ j^{*} = \arg\max_{j} \overrightarrow{\alpha}_{i,j}, \qquad \overleftarrow{h}^{max}_{i} = \overleftarrow{h}^{q}_{j^{*}},\ \ j^{*} = \arg\max_{j} \overleftarrow{\alpha}_{i,j},$$
$$\overrightarrow{m}^{max}_{i} = F\!\left(\overrightarrow{h}^{t}_{i},\ \overrightarrow{h}^{max}_{i}\right), \qquad \overleftarrow{m}^{max}_{i} = F\!\left(\overleftarrow{h}^{t}_{i},\ \overleftarrow{h}^{max}_{i}\right).$$
These four fusion methods concatenate the eight generated feature vectors to obtain the comprehensive fusion feature of the $i$-th caption word, denoted
$$m_i = \left[\overrightarrow{m}^{full}_{i};\ \overleftarrow{m}^{full}_{i};\ \overrightarrow{m}^{avg}_{i};\ \overleftarrow{m}^{avg}_{i};\ \overrightarrow{m}^{att}_{i};\ \overleftarrow{m}^{att}_{i};\ \overrightarrow{m}^{max}_{i};\ \overleftarrow{m}^{max}_{i}\right].$$
The comprehensive fusion features are input into a bidirectional LSTM (BiLSTM) and the final hidden states of the two directions are obtained:
$$\overrightarrow{h}^{m} = \overrightarrow{\mathrm{LSTM}}\!\left(m_{1:I}\right), \qquad \overleftarrow{h}^{m} = \overleftarrow{\mathrm{LSTM}}\!\left(m_{I:1}\right).$$
Second, the final hidden states of the two ends are concatenated to generate the multi-angle semantic feature
$$S = \left[\overrightarrow{h}^{m};\ \overleftarrow{h}^{m}\right].$$
Finally, to facilitate multi-modal feature fusion, the multi-angle semantic feature is mapped to the same dimension as the visual representation:
$$S' = W_s S + b_s,$$
where $W_s$ is a learnable weight matrix and $b_s$ is a bias.
Step 6: the visual features and multi-angle semantic features generated in steps 4 and 5 are fed into a visual-semantic selection gate, which controls the contributions of the visual channel and the semantic channel to the predicted answer through feature fusion. The answer with the highest probability from the multi-class classifier is selected as the final answer.
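The excerpt does not spell out the internal form of the visual-semantic selection gate, so the sketch below assumes a common choice: a sigmoid gate computed from both channels that weights their contributions before a multi-class answer classifier (the answer-vocabulary size is a placeholder).

```python
import torch
import torch.nn as nn

class VisualSemanticGate(nn.Module):
    """Sketch of the adaptive selection gate of step 6. The concrete gating
    form is an assumption: a sigmoid gate over the concatenated channels that
    weights the visual and semantic contributions before classification."""
    def __init__(self, dim=2048, n_answers=3129):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.classifier = nn.Linear(dim, n_answers)

    def forward(self, visual_feat, semantic_feat):
        g = self.gate(torch.cat([visual_feat, semantic_feat], dim=-1))
        fused = g * visual_feat + (1.0 - g) * semantic_feat   # adaptive dual-channel fusion
        return self.classifier(fused)                         # answer scores; argmax gives the answer

logits = VisualSemanticGate()(torch.randn(2, 2048), torch.randn(2, 2048))
print(logits.argmax(dim=-1).shape)  # torch.Size([2])
```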
In summary, on the VQA 1.0 and VQA 2.0 data sets, the invention combines an R-CNN-LSTM framework with an attention mechanism and a relation reasoning method in the visual channel: the visual feature vectors and geometric feature vectors of the image are encoded with Faster R-CNN and fed into the visual channel to generate the visual modality representation. In the semantic channel, the concatenated global caption and local captions are encoded with an LSTM network and passed through the multi-angle semantic module to output the semantic modality representation. Finally, the obtained visual and semantic modality representations are fed into an adaptive selection gate, which decides which modality cue is used to predict the answer.
The innovations are as follows. First, a relation reasoning module, comprising binary relation reasoning and multivariate relation reasoning, is adopted in the visual channel; it strengthens the model's understanding of visual content and generalizes well to more complex visual scenes. Second, a multi-angle semantic module, comprising complete fusion, average-pooling fusion, attention fusion and maximum-attention fusion, is adopted in the semantic channel to generate semantic features, improving the semantic quality of the answers while improving the accuracy of the visual question-answering model.
Simulation experiments and characterization of the experimental results:
1. Data sets
The model was evaluated on two public visual question-answering data sets, VQA 1.0 and VQA 2.0. VQA 1.0 was created on the basis of the MSCOCO image data set [38]; its training set contains 248349 questions and 82783 pictures, its validation set contains 121512 questions and 40504 pictures, and its test set contains 244302 questions and 81434 pictures. VQA 2.0 is an iterated version of VQA 1.0 that adds more question samples so that the language bias is more balanced than in VQA 1.0. The training set of VQA 2.0 contains 443757 questions and 82783 pictures, the validation set contains 214354 questions and 40504 pictures, and the test set contains 447793 questions and 81434 pictures. There are three question types: yes/no, number and other, with the other type accounting for roughly half of all samples. The proposed model is trained on the training and validation sets, and, for a fair comparison with other work, results are reported on the test-development (test-dev) and test-standard sets.
2. Experimental environment
The proposed model is implemented with the PyTorch library, and the experiments are run on a GPU server configured with 256 GB of RAM and 4 Nvidia 1080Ti GPUs (64 GB of video memory in total). The model is trained with the Adam optimizer for at most 40 epochs with a batch size of 256. The learning rate is set to 1e-3 in the first epoch and 2e-3 in the second epoch, then raised to 3e-3 in the third epoch and held until the tenth epoch, after which it is decayed every two epochs with a decay rate of 0.5. To prevent gradient explosion, a gradient clipping scheme is also adopted that scales the gradient values in each epoch to one quarter of their original value. To prevent overfitting, a dropout layer with a dropout rate of 0.5 is applied after each fully connected layer.
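One plausible reading of this training configuration is sketched below; the epoch boundaries of the warm-up schedule and the interpretation of "gradient pruning" as scaling the gradients to one quarter are assumptions.

```python
import torch

def learning_rate(epoch):
    """Warm-up / decay schedule as described: 1e-3, 2e-3, then 3e-3 up to
    epoch 10, after which the rate is halved every two epochs."""
    if epoch <= 1:
        return 1e-3
    if epoch == 2:
        return 2e-3
    if epoch <= 10:
        return 3e-3
    return 3e-3 * (0.5 ** ((epoch - 10 + 1) // 2))

model = torch.nn.Linear(10, 10)                       # stand-in for the full VQA model
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate(1))

for epoch in range(1, 41):                            # 40 epochs, batch size 256 in the text
    for group in optimizer.param_groups:
        group["lr"] = learning_rate(epoch)
    # ... forward / backward pass over the training batches goes here ...
    for p in model.parameters():                      # "gradient pruning": scale gradients to 1/4
        if p.grad is not None:
            p.grad.mul_(0.25)
    optimizer.step()
    optimizer.zero_grad()
```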
3. Results and analysis of the experiments
Table 1 (image): performance of the models on the VQA 1.0 test-development and test-standard sets
Table 1 compares various advanced models with the proposed model; the results are obtained after the models are trained on the training and validation sets. The proposed model is clearly superior to the other models on most metrics, reaching an overall accuracy of 69.44% on the test-development set and 69.37% on the test-standard set. On the test-development set, the overall accuracy is 5.64% higher than that of the MAN model, which uses a memory-augmented neural network, and 0.73% higher than that of the best-performing VSDC model. The VSDC model also follows the idea of semantically guided prediction and applies an attention mechanism on the semantic side to acquire question-related semantic information. Beyond the semantic attention mechanism, the present invention adds three further fusion methods to improve the model's multi-angle semantic understanding, and the experimental results show that the multi-angle semantic module in the semantic channel is significant for improving prediction accuracy. The proposed model also performs well on the test-standard set.
Table 2 (image): performance of the models on the VQA 2.0 test-development and test-standard sets
As shown in Table 2, the performance of the model is further verified on the VQA 2.0 data set, including the test-development and test-standard sets. Compared with the advanced methods, the proposed model performs well on metrics such as overall accuracy. Compared with the MuRel [49] model, the overall accuracy of the invention is improved by 1.22% on the test-development set and 0.89% on the test-standard set. The MuRel model, a prominent representative of existing multimodal relation modeling methods, is an end-to-end reasoning network built on residual features. The proposed method outperforms it because the semantic channel guides answer prediction, allowing the model to exploit a large amount of semantic information to improve prediction accuracy. In addition, compared with the VCTREE model, which combines reinforcement learning and supervised learning and is currently one of the better-performing visual question-answering methods, the proposed model has a clear advantage on metrics such as overall accuracy.
The comparison with these advanced methods shows that the proposed model mines semantic information better on the basis of understanding the image content and improves the accuracy of answer prediction.

Claims (10)

1. A visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels, characterized by comprising the following steps:
Step 1: preprocessing an input image, and extracting visual features and geometric features of the salient regions in the input image with an object detection module;
Step 2: for the embedding of the question text, splitting the sentence into words using spaces and punctuation; representing the words as vectors with a pre-trained word-vector model; finally, passing the word vectors through a long short-term memory (LSTM) network and taking the state at the last time step as the question feature;
Step 3: for the embedding of the image caption and the dense caption text, splitting the sentences into words using spaces and punctuation; then concatenating the obtained caption features and converting them into a text paragraph; finally, encoding the text paragraph with an LSTM, the output of the last layer being the encoded word-vector sequence;
Step 4: applying an attention mechanism to the visual features and question features obtained in steps 1 and 2 to obtain attention features related to the question; feeding the visual features, geometric features and question features obtained in steps 1 and 2 through a relation reasoning module to output relation features; finally, fusing the attention features and the relation features to generate the visual feature representation;
Step 5: inputting the word-vector sequence and question features obtained in steps 2 and 3 into a multi-angle semantic module to generate multi-angle semantic features;
Step 6: feeding the visual features and multi-angle semantic features generated in steps 4 and 5 into a visual-semantic selection gate, which controls the contributions of the visual channel and the semantic channel to the predicted answer through feature fusion; the answer with the highest probability from the multi-class classifier is selected as the final answer.
2. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 1, characterized in that in step 1, using the object detection module specifically includes: a Faster R-CNN model is used to obtain object detection boxes, and the K most relevant boxes are selected as important visual regions; for each selected region $i$, $v_i \in \mathbb{R}^d$ is a $d$-dimensional visual object vector, so the input image is finally represented as
$$V = \{v_1, v_2, \dots, v_K\}^T;$$
in addition, the geometric features of the input image are recorded as $B = \{b_1, b_2, \dots, b_K\}^T$, where $b_i$ is computed from $(x_i, y_i)$, $w_i$, $h_i$, the center coordinates, width and height of the selected region $i$, normalized by the width $w$ and height $h$ of the input image (formula image).
3. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 2, characterized in that step 2 is implemented as follows:
first, each input question Q is truncated to at most 14 words; extra words beyond 14 are simply discarded, while questions with fewer than 14 words are padded with zero vectors; the 14-word question is then converted into GloVe word vectors, giving a word-embedding sequence of size 14 × 300, which is passed sequentially through a long short-term memory network with hidden size $d_q$; finally, the final hidden state $q \in \mathbb{R}^{d_q}$ is used as the question embedding of the input question Q;
the text embedding in step 3 is implemented in the same way as the text embedding in step 2, apart from the concatenation of the image caption and the dense captions.
4. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 3, characterized in that the attention mechanism in step 4 specifically refers to: a top-down attention mechanism is introduced, and a soft-attention method is inserted into the network structure as the attention module to highlight the visual objects related to the question and output the attention feature; the attention feature $V^{at} \in \mathbb{R}^d$ is the weighted sum of all visual regions:
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix;
the relation reasoning module in step 4 specifically encodes the relations between image regions with two convolution streams, generating two different types of relation features: binary relation features and multivariate relation features; the relation reasoning module consists of three parts: feature fusion, binary relation reasoning and multivariate relation reasoning; the feature fusion module fuses the visual features, geometric features and question features by raising and reducing dimensions to generate pairwise combinations of the visual region features; the binary relation reasoning module mines the pairwise visual relations between visual regions with three consecutive 1 × 1 convolutional layers to generate binary relation features; the multivariate relation reasoning module mines the within-group visual relations between visual regions with three consecutive 3 × 3 dilated convolutional layers to generate multivariate relation features; finally, the binary relation features and the multivariate relation features are combined to obtain the relation features.
5. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 4, characterized in that the feature fusion steps are as follows: first, the object features and geometric features of the K visual regions of the image are concatenated to generate the visual region features $V_{co} = \mathrm{concat}[V, B]$; second, the visual region features $V_{co}$ and the question features are mapped into a low-dimensional subspace of dimension $d_s$ (formula image), where $W_v$ and $W_q$ are learnable parameters and $b_v$ and $b_q$ are biases; to combine the visual regions in pairs, the mapped visual region features are expanded and added to their transpose, yielding the pairwise combination $V_{fu}$ of the visual region features;
the binary relation reasoning uses three consecutive 1 × 1 convolutional layers, each followed by a ReLU activation layer; their channel numbers are $d_s$ and two further values given by formula images; when the pairwise combination of visual region features $V_{fu}$ is fed into the binary relation reasoning module, the output of the last layer, denoted $M_p$, is added to its transpose to obtain a symmetric matrix, and softmax finally generates the binary relation $R_p$:
$$R_p = \mathrm{softmax}\!\left(M_p + M_p^T\right);$$
the multivariate relation reasoning steps are as follows: three consecutive 3 × 3 dilated convolutional layers are used, each followed by a ReLU activation layer; the dilation rates of the three layers are 1, 2 and 4 respectively; all convolution strides are 1, and zero padding is used so that the output of each convolution has the same size as its input; when the pairwise combination $V_{fu}$ is fed into the multivariate relation reasoning module, the output of the last convolutional layer and ReLU activation layer, denoted $M_g$, is added to its transpose, as in the binary relation reasoning, to obtain a symmetric matrix, and softmax finally generates the multivariate relation $R_g$:
$$R_g = \mathrm{softmax}\!\left(M_g + M_g^T\right).$$
6. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 5, characterized in that step 4 is implemented as follows:
first, multimodal fusion is performed according to
$$f_i = v_i^T W_i q = \mathbf{1}^T\!\left(H_i^T v_i \circ G_i^T q\right),$$
where $\mathbf{1} \in \mathbb{R}^d$ is a vector whose elements are all 1 and $\circ$ denotes element-wise multiplication;
second, the same mapping matrices $H$ and $G$ are used for all image regions:
$$f_i = P^T\!\left(H^T v_i \circ G^T q\right),$$
where $P \in \mathbb{R}^d$ is a learning parameter; to obtain the attention mapping matrix, the attention weight $\omega_i$ of image region $i$ is computed as
$$\omega_i = \frac{\exp(f_i)}{\sum_{k=1}^{K} \exp(f_k)};$$
thus the attention feature $V^{at}$, the weighted sum of all visual regions, is expressed as
$$V^{at} = A^T \cdot V,$$
where $A = [\omega_1, \omega_2, \dots, \omega_K]^T$ is the attention mapping matrix.
7. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 6, characterized in that the multi-angle semantic module in step 5) associates the question features with the caption features; the specific method is as follows: first, the similarity between caption t_i and question q_j is computed by traversal using the cosine similarity method, and the text features most relevant to question q_j are selected; second, the weight coefficient R_i is combined with the caption feature t_i so that semantic information more relevant to the question receives more attention, i.e.
[claim formula image omitted in source]
where the result denotes the weighted caption features; then each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM; finally, four methods, namely complete fusion, average pooling fusion, attention fusion and maximum attention fusion, are adopted to improve the model's generalization ability in understanding semantic information.
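A sketch of the caption-question association, assuming the weight coefficient R_i is taken as the maximum cosine similarity of caption feature t_i over the question features; the claim only states that the most relevant text features are selected, so the max is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def weight_captions_by_question(caption_feats, question_feats):
    """For every caption feature t_i, traverse the question features q_j,
    take the cosine similarity as a relevance weight R_i, and scale t_i by it."""
    # caption_feats: (n_cap, d), question_feats: (n_q, d)
    sim = F.cosine_similarity(
        caption_feats.unsqueeze(1),      # (n_cap, 1, d)
        question_feats.unsqueeze(0),     # (1, n_q, d)
        dim=-1,
    )                                    # (n_cap, n_q) pairwise cosine similarities
    R = sim.max(dim=1).values            # weight coefficient R_i per caption
    weighted = R.unsqueeze(-1) * caption_feats   # weighted caption features
    return weighted, R
```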
8. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 7, characterized in that the specific implementation steps of step 5 are as follows:
Step 5.1: the question features are associated with the caption features: the similarity between caption t_i and question q_j is computed by traversal using the cosine similarity method, and the text features most relevant to question q_j are selected; then the weight coefficient R_i is combined with the caption feature t_i so that semantic information more relevant to the question receives more attention, i.e.
[claim formula image omitted in source]
where the result denotes the weighted caption features; each word of the caption is then encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM; finally, four methods, namely complete fusion, average pooling fusion, attention fusion and maximum attention fusion, are adopted to improve the model's generalization ability in understanding semantic information;
Step 5.2: each word of the caption is encoded with a bidirectional LSTM (BiLSTM), and each word of the question is likewise encoded with a BiLSTM:
[claim formula images omitted in source]
where the omitted symbols denote the hidden states of the forward and backward LSTMs of the caption at the i-th time step, and the hidden states of the forward and backward LSTMs of the question at the j-th time step, respectively;
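A sketch of the step 5.2 encoders, assuming standard bidirectional LSTMs over pre-embedded caption and question words; the embedding and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class CaptionQuestionEncoder(nn.Module):
    """Encode every caption word and every question word with bidirectional
    LSTMs, returning the per-time-step hidden states of the forward and
    backward directions."""

    def __init__(self, d_emb, d_hidden):
        super().__init__()
        self.cap_lstm = nn.LSTM(d_emb, d_hidden, bidirectional=True, batch_first=True)
        self.q_lstm = nn.LSTM(d_emb, d_hidden, bidirectional=True, batch_first=True)

    def forward(self, caption_emb, question_emb):
        # caption_emb: (batch, n_cap, d_emb), question_emb: (batch, n_q, d_emb)
        cap_h, _ = self.cap_lstm(caption_emb)    # (batch, n_cap, 2 * d_hidden)
        q_h, _ = self.q_lstm(question_emb)       # (batch, n_q, 2 * d_hidden)
        # Split into forward and backward hidden states at each time step.
        cap_fwd, cap_bwd = cap_h.chunk(2, dim=-1)
        q_fwd, q_bwd = q_h.chunk(2, dim=-1)
        return (cap_fwd, cap_bwd), (q_fwd, q_bwd)
```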
Step 5.3: four fusion strategies, namely complete fusion, average pooling fusion, attention fusion and maximum attention fusion, are adopted to capture high-level semantic information.
9. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 8, characterized in that the complete fusion passes each forward and backward word vector of the caption passage, together with the forward and backward final states of the whole question, into the function F for fusion. The specific formulas are:
[claim formula images omitted in source]
where the results are l-dimensional vectors denoting the forward and backward complete fusion features of the i-th caption word vector;
The average pooling fusion passes the forward (or backward) word vector features of the caption passage and the forward (or backward) question features at each time step into the function F for fusion, and then performs an average pooling operation. The specific formulas are:
[claim formula images omitted in source]
where the results are l-dimensional vectors denoting the forward and backward average pooling fusion features of the i-th caption word vector;
The attention fusion first computes similarity coefficients between the caption context embeddings and the question context embeddings through a cosine similarity function; the similarity coefficients are then used as weights to multiply each forward (or backward) word vector embedding of the question, and the average is taken. The specific formulas are:
[claim formula images omitted in source]
where the omitted symbols denote the forward and backward similarity coefficients and the forward and backward attention vectors of the i-th caption word vector, the latter representing the correlation between the whole question and that word; finally, the attention vector and the caption context embedding are passed into the function F for fusion to obtain the forward and backward attention fusion features of the i-th caption word vector, as follows:
[claim formula images omitted in source]
The maximum attention fusion takes the question embedding with the maximum similarity coefficient as the attention vector, and finally passes the attention vector and the caption embedding into the function F for fusion. The specific formulas are:
[claim formula images omitted in source]
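A combined sketch of the four fusion strategies for one direction, assuming the fusion function F (defined earlier in the claims but not reproduced here) can be approximated by a tanh-activated linear layer over concatenated inputs; the hidden size d_h and fusion dimension l are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiPerspectiveFusion(nn.Module):
    """Complete, average pooling, attention and maximum attention fusion of
    caption hidden states with question hidden states (one LSTM direction)."""

    def __init__(self, d_h, l):
        super().__init__()
        self.fuse = nn.Linear(2 * d_h, l)   # stand-in for the claim's F function

    def F_fn(self, a, b):
        return torch.tanh(self.fuse(torch.cat([a, b], dim=-1)))

    def forward(self, cap_h, q_h):
        # cap_h: (n_cap, d_h) caption hidden states, q_h: (n_q, d_h) question hidden states
        n_cap = cap_h.size(0)

        # 1) Complete fusion: every caption state with the final question state.
        complete = self.F_fn(cap_h, q_h[-1].expand(n_cap, -1))

        # 2) Average pooling fusion: fuse with every question state, then average.
        pairwise = self.F_fn(
            cap_h.unsqueeze(1).expand(-1, q_h.size(0), -1),
            q_h.unsqueeze(0).expand(n_cap, -1, -1),
        )                                           # (n_cap, n_q, l)
        avg_pool = pairwise.mean(dim=1)

        # 3) Attention fusion: cosine similarities as weights over question states.
        sim = F.cosine_similarity(cap_h.unsqueeze(1), q_h.unsqueeze(0), dim=-1)  # (n_cap, n_q)
        attn_vec = (sim.unsqueeze(-1) * q_h.unsqueeze(0)).mean(dim=1)            # (n_cap, d_h)
        attention = self.F_fn(cap_h, attn_vec)

        # 4) Maximum attention fusion: the question state with the largest similarity.
        max_idx = sim.argmax(dim=1)                 # (n_cap,)
        max_attention = self.F_fn(cap_h, q_h[max_idx])

        return complete, avg_pool, attention, max_attention
```

Running the same module on the backward hidden states gives the other direction, yielding the 8 feature vectors that claim 10 concatenates.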
10. The visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels according to claim 9, characterized in that the four fusion methods of step 5) concatenate the 8 generated feature vectors to obtain the comprehensive fusion feature of the i-th caption, denoted
[claim formula image omitted in source]
The comprehensive fusion feature is input into a bidirectional LSTM (BiLSTM) to obtain the final hidden states in both directions, with the following formulas:
[claim formula images omitted in source]
Second, the two final hidden states are concatenated head to tail to generate the multi-angle semantic features:
[claim formula image omitted in source]
Finally, to facilitate multi-modal feature fusion, the multi-angle semantic features are mapped to the same dimension as the visual representation, with the following formula:
[claim formula image omitted in source]
where the omitted symbol is a learnable weight matrix and b_s is a bias.
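A sketch of this aggregation step, assuming the 8 fusion vectors per caption word are concatenated into the BiLSTM input and the two final hidden states are concatenated before the linear mapping with W_s and b_s; the dimension names l, d_lstm and d_visual are illustrative.

```python
import torch
import torch.nn as nn

class MultiAngleSemanticHead(nn.Module):
    """Run a BiLSTM over the concatenated fusion vectors, concatenate the two
    final hidden states, and map the result to the visual feature dimension."""

    def __init__(self, l, d_lstm, d_visual):
        super().__init__()
        self.bilstm = nn.LSTM(8 * l, d_lstm, bidirectional=True, batch_first=True)
        self.W_s = nn.Linear(2 * d_lstm, d_visual)   # weight matrix W_s and bias b_s

    def forward(self, fusion_feats):
        # fusion_feats: (batch, n_cap, 8 * l), the concatenated fusion vectors per caption word
        _, (h_n, _) = self.bilstm(fusion_feats)          # h_n: (2, batch, d_lstm)
        semantic = torch.cat([h_n[0], h_n[1]], dim=-1)   # concatenate forward/backward final states
        return self.W_s(semantic)                        # multi-angle semantic features, (batch, d_visual)
```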
CN202210223976.XA 2022-03-07 2022-03-07 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels Active CN114661874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210223976.XA CN114661874B (en) 2022-03-07 2022-03-07 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Publications (2)

Publication Number Publication Date
CN114661874A true CN114661874A (en) 2022-06-24
CN114661874B CN114661874B (en) 2024-04-30

Family

ID=82028726

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210223976.XA Active CN114661874B (en) 2022-03-07 2022-03-07 Visual question-answering method based on multi-angle semantic understanding and self-adaptive double channels

Country Status (1)

Country Link
CN (1) CN114661874B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124432A1 (en) * 2015-11-03 2017-05-04 Baidu Usa Llc Systems and methods for attention-based configurable convolutional neural networks (abc-cnn) for visual question answering
US20200293921A1 (en) * 2019-03-12 2020-09-17 Beijing Baidu Netcom Science And Technology Co., Ltd. Visual question answering model, electronic device and storage medium
CN110163299A (en) * 2019-05-31 2019-08-23 合肥工业大学 A kind of vision answering method based on bottom-up attention mechanism and memory network
CN110222770A (en) * 2019-06-10 2019-09-10 成都澳海川科技有限公司 A kind of vision answering method based on syntagmatic attention network
CN110647612A (en) * 2019-09-18 2020-01-03 合肥工业大学 Visual conversation generation method based on double-visual attention network
CN110717431A (en) * 2019-09-27 2020-01-21 华侨大学 Fine-grained visual question and answer method combined with multi-view attention mechanism
KR20210056071A (en) * 2019-11-08 2021-05-18 경기대학교 산학협력단 System for visual dialog using deep visual understanding
CN113886626A (en) * 2021-09-14 2022-01-04 西安理工大学 Visual question-answering method of dynamic memory network model based on multiple attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
孟祥申; 江爱文; 刘长红; 叶继华; 王明文: "Visual question answering algorithm based on Spatial-DCTHash dynamic parameter network", Scientia Sinica Informationis, no. 08, 20 August 2017 (2017-08-20) *
王鑫: "Research on visual question answering algorithms based on a visual-semantic dual channel", China Master's Theses Full-text Database, 15 February 2023 (2023-02-15) *
闫茹玉; 刘学亮: "Visual question answering model combining a bottom-up attention mechanism and memory networks", Journal of Image and Graphics, no. 05, 16 May 2020 (2020-05-16) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115730059A (en) * 2022-12-08 2023-03-03 安徽建筑大学 Visual question answering method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN114661874B (en) 2024-04-30

Similar Documents

Publication Publication Date Title
CN110490946B (en) Text image generation method based on cross-modal similarity and antagonism network generation
CN108959396B (en) Machine reading model training method and device and question and answer method and device
CN108804530B (en) Subtitling areas of an image
CN110991290B (en) Video description method based on semantic guidance and memory mechanism
CN113297370B (en) End-to-end multi-modal question-answering method and system based on multi-interaction attention
CN109740158B (en) Text semantic parsing method and device
CN113569001A (en) Text processing method and device, computer equipment and computer readable storage medium
CN114443899A (en) Video classification method, device, equipment and medium
CN112214996A (en) Text abstract generation method and system for scientific and technological information text
CN113626589A (en) Multi-label text classification method based on mixed attention mechanism
CN116975350A (en) Image-text retrieval method, device, equipment and storage medium
CN113392265A (en) Multimedia processing method, device and equipment
CN114943921A (en) Video text description method fusing multi-granularity video semantic information
CN114661874A (en) Visual question-answering method based on multi-angle semantic understanding and self-adaptive dual channels
CN113609326A (en) Image description generation method based on external knowledge and target relation
CN117058276A (en) Image generation method, device, equipment and storage medium
CN115223086B (en) Cross-modal action positioning method and system based on interactive attention guidance and correction
CN114511813B (en) Video semantic description method and device
CN116028888A (en) Automatic problem solving method for plane geometry mathematics problem
CN115169472A (en) Music matching method and device for multimedia data and computer equipment
CN115203388A (en) Machine reading understanding method and device, computer equipment and storage medium
CN117668213B (en) Chaotic engineering abstract generation method based on cascade extraction and graph comparison model
CN117648429B (en) Question-answering method and system based on multi-mode self-adaptive search type enhanced large model
CN115858791B (en) Short text classification method, device, electronic equipment and storage medium
US20240177507A1 (en) Apparatus and method for generating text from image and method of training model for generating text from image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant