CN112926655A - Image content understanding and visual question and answer VQA method, storage medium and terminal - Google Patents

Image content understanding and visual question and answer VQA method, storage medium and terminal

Info

Publication number
CN112926655A
Authority
CN
China
Prior art keywords
image
fusion
loss
attention module
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110211935.4A
Other languages
Chinese (zh)
Other versions
CN112926655B (en)
Inventor
匡平
张婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110211935.4A priority Critical patent/CN112926655B/en
Publication of CN112926655A publication Critical patent/CN112926655A/en
Application granted granted Critical
Publication of CN112926655B publication Critical patent/CN112926655B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for image content understanding and visual question answering (VQA), a storage medium and a terminal. The method comprises the following steps: inputting the image and the question to be answered into a trained prediction module for answering, wherein the prediction module comprises a fusion attention module, a bilinear model and a classifier which are connected in sequence, and the classifier outputs the answer. The invention completes the tasks of image content understanding and visual question answering (VQA) according to the following idea: represent the image and the question as features, represent the image and the declarative sentence as features, fuse the feature matrices, learn the image features according to the question features, learn the image features according to the correct declarative sentence, guide the model with the correct declarative sentence, and obtain the result. The method therefore fuses the dense interaction between the image and the question keywords, and can learn the dense interaction between image and text so as to infer the relationship between the image and the question keywords.

Description

Image content understanding and visual question and answer VQA method, storage medium and terminal
Technical Field
The invention relates to the technical field of computers, and in particular to a method for image content understanding and visual question answering (VQA), a storage medium and a terminal.
Background
In recent years, image content understanding and Visual Question and Answer (VQA) have attracted increasing interest. Multi-modal fusion of global features is the most straightforward VQA solution. The general processing idea is to express images and questions as global features and then perform probability prediction of answers by using a multi-modal fusion model.
In addition to understanding the visual content of the image, VQA also requires a full understanding of the semantics of the natural-language question. It is therefore necessary to learn both textual attention over the question and visual attention over the image. At present the question is mainly represented with an LSTM, and multi-modal fusion is mainly implemented with residual networks. The drawback of such fusion is that the global feature representation of an image may lose key information about the local image regions the question refers to; most solutions address this with an attention mechanism. The co-attention networks adopted at present learn the attention distribution of each modality separately and then perform fusion.
Since the current network architectures for the VQA problem learn the attention distribution of each modality separately and then perform fusion, they have several drawbacks: (1) the network only learns coarse interactions among the modalities and neglects the dense interaction between image and text, so the current co-attention is not sufficient to infer the relationship between the image and the question keywords; (2) the accuracy on the visual question answering (VQA) task is limited.
Disclosure of Invention
It is an object of the present invention to overcome the disadvantages of the prior art and to provide a method, a storage medium and a terminal for image content understanding and visual question answering VQA.
The purpose of the invention is realized by the following technical scheme:
in a first aspect of the present invention, a method for image content understanding and visual question and answer VQA is provided, comprising the steps of:
inputting the images and the questions to be answered into a trained prediction module for answering; the prediction module comprises a fusion attention module, a bilinear model and a classifier which are connected in sequence, and the classifier outputs answers; the training of the prediction module comprises the sub-steps of:
respectively extracting features from the image and the question, inputting the extracted features into a fusion attention module, and splicing the obtained image fusion feature I(f) and question fusion feature Q(f) to obtain a first splicing result;
respectively extracting features from the image and from the correctness statement of the question, inputting the extracted features into a fusion attention module, and splicing the obtained image fusion feature I(t) and statement fusion feature S(t) to obtain a second splicing result;
performing loss calculation on the first splicing result and the second splicing result to obtain a result loss (f);
inputting the image fusion feature I(f) and the question fusion feature Q(f) into a bilinear model for encoding to obtain the fused feature Z, and obtaining a classification result through a classifier;
converting the correct answer to the question and the classification result into a first vector a (t) and a second vector a (f), respectively;
performing loss calculation on the first vector A (t) and the second vector A (f) to obtain a result loss (c);
performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss;
optimizing the fusion attention module, the bilinear model and the classifier by using the final result Loss; the fusion attention module is optimized using Loss(f).
Further, the image feature extraction specifically includes:
performing feature representation on the input image in a bottom-up manner by using a Faster R-CNN trained on Visual Genome data; for each detected target, performing average pooling on its convolutional-layer features to obtain a feature, denoted Xi; the features in the image are finally represented as an image feature matrix X.
Further, extracting features from the questions specifically comprises:
segmenting the input question into words, converting each word into a vector by using a word embedding method, inputting the vectors into a single-layer recurrent neural network, and finally outputting the question feature matrix Y(f);
extracting features from the correctness statement of the question specifically comprises:
segmenting the input declarative statement into words, converting each word into a vector by using a word embedding method, inputting the vectors into a single-layer recurrent neural network, and finally outputting the statement feature matrix Y(t).
Further, the fused attention module comprises a first self-attention module, a second self-attention module, and a scoring attention module;
the first self-attention module receives image features, the second self-attention module receives question features or statement features, the results of the first self-attention module and the results of the second self-attention module are output to the scoring attention module, and image fusion features I (f) and question fusion features Q (f) or image fusion features I (t) and statement fusion features S (t) are output.
Further, the first and second self-attention modules each include:
inputting image characteristics, problem characteristics or statement characteristics, converting the input image characteristics, the problem characteristics or the statement characteristics into matrixes through embedding transformation, and performing dot multiplication on the matrixes and the three matrixes Wq, Wk and Wv to obtain three weight matrixes Qi, Ki and Vi; wherein Wq, Wk, Wv are three trainable weight matrices using uniform distribution;
performing point multiplication on the matrix Qi and the matrix Ki to obtain a Score Score (i), and performing point multiplication on the matrix Qi and the matrices K (i +1), K (i +2), … and K (i + n) respectively to obtain scores Score (i +1), Score (i +2), … and Score (i + n);
SoftMax is carried out on [ Score (i), Score (i +1), … and Score (i + n) ] to obtain proportions [ Ratio (i), Ratio (i +1), … and Ratio (i + n) ];
multiplying the score proportion [ Ratio (i), Ratio (i +1), …, Ratio (i + n) ] with [ Vi, V (i +1), …, V (i + n) ] to obtain weighted values, and adding the weighted values to obtain Ti, namely an n × n attention mechanism diagram, wherein each word corresponds to a weight E (i, j), and the weighted diagram is output by the first self-attention module and the second self-attention module.
Further, the scoring attention module comprises:
the feature of the image feature output by the first self-attention module is ISM, the feature of the question feature or statement feature output by the second self-attention module is QM, both ISM and QM are n x n matrixes, and each position stores the ratio of a certain row to a certain column;
dot multiplication is carried out on i rows and j columns of the ISM and QM to obtain a matrix V:
V(ISM,QM)=(ISM(i,j)*QM(i,j))(i≤n,j≤n)
normalizing the matrix V to obtain V(i,j)′;
multiplying V(i,j)′ by ISM, multiplying the resulting matrix by the weight a to obtain a final matrix S, and normalizing S to obtain the weight matrix M:
Mi=softmax(a*V(i,j)′*ISM(i,j))(i≤n,j≤n)
the image feature IM is finally obtained from the weight matrix M, where m represents the number of network cycles in a particular training session.
Further, the result Loss(f) is obtained by performing a loss calculation on the first splicing result and the second splicing result, wherein F(f) is the first splicing result, F(t) is the second splicing result, and N is the number of samples;
the obtaining of the result loss (c) by performing the loss calculation on the first vector a (t) and the second vector a (f) includes:
Loss(c)=cross-entropy(A(t),A(f))
wherein cross-entropy represents cross-entropy;
performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss, wherein the mathematical operation comprises the following steps:
Loss=Loss(c)+λLoss(f)
in the formula, λ is an adjustment parameter.
Further, the inputting the image fusion feature i (f) and the problem fusion feature q (f) into a bilinear model to obtain a fused feature Z, and obtaining a classification result through a classifier, includes:
inputting the image fusion characteristics I (f) and the problem fusion characteristics Q (f) into a pre-trained encoder to obtain a fusion vector Z:
Z=LayerNorm(Wx*IM+Wy*QY)
in the formula, LayerNorm represents normalization over the hidden-state dimension, Wx represents the linear projection matrix of the image features, IM represents the image features, namely I(f), Wy represents the linear projection matrix of the question features, and QY represents the question fusion features, namely Q(f);
mapping the obtained fusion vector Z, namely the feature Z, into a vector S, and calculating the classification result over the N candidate answers by using binary cross-entropy:
S=sigmoid(Z)
in the formula, sigmoid represents the activation function of the neural network.
In a second aspect of the present invention, a storage medium is provided having stored thereon computer instructions which, when executed, perform the steps of one of the image content understanding and visual question and answer VQA methods.
In a third aspect of the present invention, a terminal is provided, comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, the processor executing the computer instructions to perform the steps of the method for image content understanding and visual question and answer VQA.
The invention has the beneficial effects that:
(1) In an exemplary embodiment of the present invention, a feature matrix is extracted from the image, the input question (sentence) is segmented into words (in an exemplary embodiment, the maximum length is 10 words), and the question features are extracted through a neural network; then, by learning the image features according to the question features, and by using the correct declarative sentence to guide the model toward the correct answer, the tasks of image content understanding and visual question answering (VQA) are completed according to the following idea: represent the image and the question as features, represent the image and the declarative sentence as features, fuse the feature matrices, learn the image features according to the question features, learn the image features according to the correct declarative sentence, guide the model with the correct declarative sentence, and obtain the result. The method therefore fuses the dense interaction between the image and the question keywords, and can learn the dense interaction between image and text so as to infer the relationship between the image and the question keywords.
(2) In another exemplary embodiment of the present invention, specific implementation of each step is disclosed, which can improve the accuracy of VQA to 77.34%.
Drawings
FIG. 1 is a flow chart of a method in an exemplary embodiment of the invention;
FIG. 2 is a diagram of a model corresponding to a method in an exemplary embodiment of the invention;
fig. 3 is a schematic structural diagram of a fused attention module in an exemplary embodiment of the invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "at … …" or "when … …" or "in response to a determination", depending on the context.
In addition, the technical features involved in the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
In a first aspect of the present invention, a method for image content understanding and visual question answering VQA is provided, as shown in fig. 1 and 2, comprising the steps of:
inputting the images and the questions to be answered into a trained prediction module for answering; the prediction module comprises a fusion attention module, a bilinear model and a classifier which are connected in sequence, and the classifier outputs answers; the training of the prediction module comprises the sub-steps of:
respectively extracting features (an image feature matrix X and a problem feature matrix Y (f)) from an image and a problem, inputting the features into a fusion attention module, and splicing the obtained image fusion feature I (f) and the problem fusion feature Q (f) to obtain a first splicing result;
respectively extracting features (an image feature matrix X and a statement feature matrix Y (t)) from the image and the correctness statement of the problem, inputting the features into a fusion attention module, and splicing the obtained image fusion feature I (t) and the statement fusion feature S (t) to obtain a second splicing result;
performing loss calculation on the first splicing result and the second splicing result to obtain a result loss (f);
inputting the image fusion feature I(f) and the question fusion feature Q(f) into a bilinear model for encoding to obtain the fused feature Z, and obtaining a classification result through a classifier;
converting the correct answer to the question and the classification result into a first vector a (t) and a second vector a (f), respectively; (the conversion mode can adopt one-hot coding)
Performing loss calculation on the first vector A (t) and the second vector A (f) to obtain a result loss (c);
performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss;
optimizing the fusion attention module, the bilinear model and the classifier by using the final result Loss; the fusion attention module is optimized using Loss(f).
Specifically, in this exemplary embodiment, a feature matrix is first extracted from the image, the input question (sentence) is segmented into words (in an exemplary embodiment, the maximum length is 10 words), and the question features are extracted through a neural network; then, by learning the image features according to the question features, and by using the correct declarative sentence to guide the model toward the correct answer, the tasks of image content understanding and visual question answering (VQA) are completed according to the following idea: represent the image and the question as features, represent the image and the declarative sentence as features, fuse the feature matrices, learn the image features according to the question features, learn the image features according to the correct declarative sentence, guide the model with the correct declarative sentence, and obtain the result. The method therefore fuses the dense interaction between the image and the question keywords, and can learn the dense interaction between image and text so as to infer the relationship between the image and the question keywords.
In addition, as shown in fig. 2, in the training process there is, besides the prediction module, also a labeling module, and a fusion attention module is likewise located in the labeling module.
The image feature matrix X and the question feature matrix Y(f) are input into the fusion attention module, so that the network learns the target question feature Y(f); a "scoring" mechanism is added so that the network learns the image feature X by using the question feature Y(f), and finally the image fusion feature I(f) and the question fusion feature Q(f) are output. Likewise, the image feature matrix X and the statement feature matrix Y(t) are input into the fusion attention module, so that the network learns the feature Y(t) of the correctness statement for the image; the network learns the image feature X by using the statement feature Y(t), and finally the image fusion feature I(t) and the statement fusion feature S(t) are output.
A loss calculation is performed on the first splicing result and the second splicing result to obtain the result Loss(f); the purpose of this step is to make the splicing features of the prediction module approximate the splicing features in the labeling module more and more closely, so that accuracy is improved.
For the loss calculated from A(t) and A(f), as shown in fig. 2, the goal of this step is to make the training result of the prediction module closer and closer to the correct answer of the question, so as to improve accuracy.
Finally, the two losses are combined by a mathematical operation, that is, Loss = Loss(c) + λLoss(f); λ is preset to 0.01 in this embodiment so that the two losses are balanced, which prevents special situations such as one loss becoming too large or the other too small.
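For illustration, a minimal PyTorch-style sketch of one training step with the two branches described above is given below. The callables fusion_attention, bilinear_model and classifier, the mean-squared-error form of Loss(f), and the tensor shapes are assumptions made for this sketch, not the claimed implementation.

```python
import torch
import torch.nn.functional as F

def training_step(X, Y_f, Y_t, A_t, fusion_attention, bilinear_model, classifier, lam=0.01):
    # Prediction branch: image features X together with question features Y(f)
    I_f, Q_f = fusion_attention(X, Y_f)
    F_f = torch.cat([I_f, Q_f], dim=-1)           # first splicing result F(f)

    # Labeling branch: image features X together with correct-statement features Y(t)
    I_t, S_t = fusion_attention(X, Y_t)
    F_t = torch.cat([I_t, S_t], dim=-1)           # second splicing result F(t)

    # Loss(f): pull the prediction-branch splicing toward the labeling-branch splicing
    # (a mean-squared-error form is assumed here)
    loss_f = F.mse_loss(F_f, F_t)

    # Bilinear fusion of I(f) and Q(f), then classification
    Z = bilinear_model(I_f, Q_f)
    A_f = classifier(Z)                            # predicted answer scores A(f)

    # Loss(c): cross-entropy between prediction A(f) and the one-hot correct answer A(t)
    loss_c = F.binary_cross_entropy_with_logits(A_f, A_t)

    # Final loss: Loss = Loss(c) + λ·Loss(f), with λ preset to 0.01
    return loss_c + lam * loss_f
```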
Preferably, in an exemplary embodiment, the extracting features from the image specifically includes:
performing feature representation on the input image in a bottom-up manner by using a Faster R-CNN (with a ResNet-101 backbone) trained on Visual Genome data; for each detected target, performing average pooling on its convolutional-layer features to obtain a feature, denoted Xi; the features in the image are finally represented as an image feature matrix X.
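As a sketch of this step (assuming the per-target region feature maps have already been produced by the bottom-up detector; the function and variable names are illustrative):

```python
import torch

def build_image_feature_matrix(region_feature_maps):
    """region_feature_maps: list of per-target conv feature maps of shape (C, H, W),
    as produced by a bottom-up Faster R-CNN detector."""
    features = []
    for fmap in region_feature_maps:
        # Average pooling over the spatial dimensions gives one feature X_i per target
        features.append(fmap.mean(dim=(1, 2)))
    # Stack all X_i into the image feature matrix X of shape (num_targets, C)
    return torch.stack(features, dim=0)
```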
Preferably, in an exemplary embodiment, the extracting the problem features specifically includes:
segmenting the input question into words (at most 14 words in an exemplary embodiment), converting each word into a vector by using a word embedding method (pre-trained on a large-scale corpus), inputting the vectors into a single-layer recurrent neural network, and finally outputting the question feature matrix Y(f);
extracting features from the correctness statement of the question specifically comprises:
segmenting the input declarative statement into words (at most 14 words in an exemplary embodiment), converting each word into a vector by using a word embedding method (pre-trained on a large-scale corpus), inputting the vectors into a single-layer recurrent neural network, and finally outputting the statement feature matrix Y(t).
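A minimal sketch of this text encoder is shown below; the choice of an LSTM for the single-layer recurrent network, the embedding dimensions, and the class name are assumptions for illustration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encodes a segmented question or declarative statement into a feature matrix Y."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # word embedding
        self.rnn = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                           batch_first=True)                          # single-layer RNN

    def forward(self, token_ids):
        # token_ids: (batch, num_words) indices of the segmented sentence (at most 14 words)
        word_vectors = self.embedding(token_ids)   # each word becomes a vector
        Y, _ = self.rnn(word_vectors)              # feature matrix Y(f) or Y(t)
        return Y
```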
More preferably, in an exemplary embodiment, as shown in FIG. 3, the fused attention module includes a first self-attention module, a second self-attention module, and a scored attention module;
the first self-attention module receives image features, the second self-attention module receives question features or statement features, the results of the first self-attention module and the results of the second self-attention module are output to the scoring attention module, and image fusion features I (f) and question fusion features Q (f) or image fusion features I (t) and statement fusion features S (t) are output.
Fig. 3 shows the fused attention module. To finally output the image fusion feature I(f) and the question fusion feature Q(f), or the image fusion feature I(t) and the statement fusion feature S(t), the fusion module needs to be repeated n times.
More preferably, in an exemplary embodiment, the first self-attention module and the second self-attention module each include:
inputting image characteristics, problem characteristics or statement characteristics, converting the input image characteristics, the problem characteristics or the statement characteristics into matrixes through embedding transformation, and performing dot multiplication on the matrixes and the three matrixes Wq, Wk and Wv to obtain three weight matrixes Qi, Ki and Vi; wherein Wq, Wk, Wv are three trainable weight matrices using uniform distribution;
performing point multiplication on the matrix Qi and the matrix Ki to obtain a Score Score (i), and performing point multiplication on the matrix Qi and the matrices K (i +1), K (i +2), … and K (i + n) respectively to obtain scores Score (i +1), Score (i +2), … and Score (i + n);
SoftMax is carried out on [ Score (i), Score (i +1), … and Score (i + n) ] to obtain proportions [ Ratio (i), Ratio (i +1), … and Ratio (i + n) ];
multiplying the score proportion [ Ratio (i), Ratio (i +1), …, Ratio (i + n) ] with [ Vi, V (i +1), …, V (i + n) ] to obtain weighted values, and adding the weighted values to obtain Ti, namely an n × n attention mechanism diagram, wherein each word corresponds to a weight E (i, j), and the weighted diagram is output by the first self-attention module and the second self-attention module.
It should be noted that Wq, Wk, and Wv are preset as three trainable weight matrices initialized by using uniform distribution at first, and then through training, the three matrices are optimized by using back propagation to converge.
In addition, each sentence contains a plurality of words, and each word corresponds to one Q, one K and one V. Assuming that the sentence is I am a man, the meaning here is that the matrix Q of "I" and the matrix K of "I", the matrix K of "am", the matrix K of "a", and the matrix K of "man" are multiplied respectively to obtain score.
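The following is a minimal sketch of such a self-attention module following the steps above (dot-product scores, SoftMax ratios, and a ratio-weighted sum of V); the uniform initialization range and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Wq, Wk, Wv: trainable weight matrices initialized from a uniform distribution
        self.Wq = nn.Linear(dim, dim, bias=False)
        self.Wk = nn.Linear(dim, dim, bias=False)
        self.Wv = nn.Linear(dim, dim, bias=False)
        for layer in (self.Wq, self.Wk, self.Wv):
            nn.init.uniform_(layer.weight, -0.1, 0.1)

    def forward(self, features):
        # features: (n, dim) image, question or statement features
        Q, K, V = self.Wq(features), self.Wk(features), self.Wv(features)
        scores = Q @ K.t()                        # Score(i, j) = Q_i · K_j
        ratios = torch.softmax(scores, dim=-1)    # SoftMax over the scores
        T = ratios @ V                            # ratio-weighted sum of the V vectors
        return T, ratios                          # ratios is the n x n attention map
```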
More preferably, in an exemplary embodiment, the scoring attention module comprises:
the feature of the image feature output by the first self-attention module is ISM, the feature of the question feature or statement feature output by the second self-attention module is QM, both ISM and QM are n x n matrixes, and each position stores the ratio of a certain row to a certain column;
dot multiplication is carried out on i rows and j columns of the ISM and QM to obtain a matrix V:
V(ISM,QM)=(ISM(i,j)*QM(i,j))(i≤n,j≤n)
normalizing the matrix V to obtain V(i,j)′;
multiplying V(i,j)′ by ISM, multiplying the resulting matrix by the weight a to obtain a final matrix S, and normalizing S to obtain the weight matrix M:
Mi=softmax(a*V(i,j)′*ISM(i,j))(i≤n,j≤n)
the image feature IM is finally obtained from the weight matrix M, where m represents the number of network cycles in a particular training session.
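A minimal sketch of the scoring attention step is given below. Because the normalization of V and the final combination into IM are only reproduced as formula images in the publication, a sum-normalization of V and an element-wise weighting of ISM by M are assumed here; the weight a is kept as a scalar parameter.

```python
import torch

def scoring_attention(ISM, QM, a=1.0, eps=1e-8):
    """ISM, QM: n x n attention maps output by the two self-attention modules."""
    V = ISM * QM                                        # V(i, j) = ISM(i, j) * QM(i, j)
    V_norm = V / (V.sum() + eps)                        # assumed normalization giving V'
    S = a * V_norm * ISM                                # final matrix S weighted by a
    M = torch.softmax(S.flatten(), dim=0).view_as(S)    # weight matrix M via softmax
    IM = M * ISM                                        # assumed weighted image feature IM
    return IM
```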
Preferably, in an exemplary embodiment, the result Loss(f) is obtained by calculating the loss between the first splicing result and the second splicing result, wherein F(f) is the first splicing result, F(t) is the second splicing result, and N is the number of samples;
the obtaining of the result loss (c) by performing the loss calculation on the first vector a (t) and the second vector a (f) includes:
Loss(c)=cross-entropy(A(t),A(f))
in the formula, cross-entropy characterizes the difficulty of expressing a probability distribution p by means of a probability distribution q, where p is the correct answer and q is the predicted value; that is, the smaller the cross-entropy value, the closer the two probability distributions are.
Performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss, wherein the mathematical operation comprises the following steps:
Loss=Loss(c)+λLoss(f)
where λ is an adjustment parameter; in an exemplary embodiment, λ is preset to 0.01 in order to balance the two losses and prevent special situations such as one loss being too large or the other too small.
Preferably, in an exemplary embodiment, the inputting the image fusion features i (f) and the problem fusion features q (f) into a bilinear model to encode to obtain fused features Z, and obtaining a classification result through a classifier, includes:
inputting the image fusion characteristics I (f) and the problem fusion characteristics Q (f) into a pre-trained encoder to obtain a fusion vector Z:
Z=LayerNorm(Wx*IM+Wy*QY)
in the formula, LayerNorm represents normalization over the hidden-state dimension, Wx represents the linear projection matrix of the image features, IM represents the image features, namely I(f), Wy represents the linear projection matrix of the question features, and QY represents the question fusion features, namely Q(f);
mapping the obtained fusion vector Z, namely the feature Z, into a vector S, and calculating the classification result over the N candidate answers by using binary cross-entropy:
S=sigmoid(Z)
in the formula, the sigmoid function is the activation function of the neural network and maps variables to values between 0 and 1.
It should be noted that IM here denotes the image feature, which corresponds to the image fusion feature I(f), and QY corresponds to the question fusion feature Q(f). The question feature matrix Y(f) is not itself the quantity computed by the fusion attention module; it only assists the image feature matrix X, while the quantity computed by the fusion attention module from Y(f) is QY, i.e. the question fusion feature Q(f). The statement feature matrix Y(t) is similar and is not described in detail here.
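A minimal sketch of this fusion-and-classification step is given below. It assumes IM and QY have already been pooled into vectors, assumes the fusion form Z = LayerNorm(Wx·IM + Wy·QY) implied by the symbol definitions above, and uses an illustrative linear layer to map Z to the N candidate answers. During training, the sigmoid scores S would be compared with the one-hot correct answer using binary cross-entropy, as described above.

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Fuses IM (= I(f)) and QY (= Q(f)) into Z and scores the N candidate answers."""
    def __init__(self, img_dim, q_dim, fuse_dim, num_answers):
        super().__init__()
        self.Wx = nn.Linear(img_dim, fuse_dim, bias=False)   # projection of image features
        self.Wy = nn.Linear(q_dim, fuse_dim, bias=False)     # projection of question features
        self.norm = nn.LayerNorm(fuse_dim)                   # LayerNorm over the hidden dimension
        self.out = nn.Linear(fuse_dim, num_answers)          # map Z to the N answers

    def forward(self, IM, QY):
        Z = self.norm(self.Wx(IM) + self.Wy(QY))   # Z = LayerNorm(Wx·IM + Wy·QY)
        S = torch.sigmoid(self.out(Z))             # sigmoid scores S over the N answers
        return S
```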
By adopting the method of the exemplary embodiment, the accuracy of VQA can be improved to 77.34%.
In a further exemplary embodiment of the present invention, based on any one of the above exemplary embodiments, a storage medium is provided having stored thereon computer instructions which, when executed, perform the steps of one of the image content understanding and visual question-and-answer VQA methods.
In another exemplary embodiment of the present invention based on any one of the above exemplary embodiments, there is provided a terminal comprising a memory and a processor, the memory having stored thereon computer instructions executable on the processor, the processor executing the computer instructions to perform the steps of one of the image content understanding and visual question and answer VQA methods.
Based on such understanding, the technical solutions of the present embodiments may be essentially implemented or make a contribution to the prior art, or may be implemented in the form of a software product stored in a storage medium and including several instructions for causing an apparatus to execute all or part of the steps of the methods according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It is to be understood that the above-described embodiments are illustrative only and not restrictive of the broad invention, and that various other modifications and changes in light thereof will be suggested to persons skilled in the art based upon the above teachings. And are neither required nor exhaustive of all embodiments. And obvious variations or modifications of the invention may be made without departing from the spirit or scope of the invention.

Claims (10)

1. A method of image content understanding and visual question answering VQA, characterized in that the method comprises the following steps:
inputting the images and the questions to be answered into a trained prediction module for answering; the prediction module comprises a fusion attention module, a bilinear model and a classifier which are connected in sequence, and the classifier outputs answers; the training of the prediction module comprises the sub-steps of:
respectively extracting features from the image and the question, inputting the extracted features into a fusion attention module, and splicing the obtained image fusion feature I(f) and question fusion feature Q(f) to obtain a first splicing result;
respectively extracting features from the image and from the correctness statement of the question, inputting the extracted features into a fusion attention module, and splicing the obtained image fusion feature I(t) and statement fusion feature S(t) to obtain a second splicing result;
performing loss calculation on the first splicing result and the second splicing result to obtain a result loss (f);
inputting the image fusion feature I(f) and the question fusion feature Q(f) into a bilinear model for encoding to obtain the fused feature Z, and obtaining a classification result through a classifier;
converting the correct answer to the question and the classification result into a first vector a (t) and a second vector a (f), respectively;
performing loss calculation on the first vector A (t) and the second vector A (f) to obtain a result loss (c);
performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss;
optimizing the fusion attention module, the bilinear model and the classifier by using the final result Loss; the fusion attention module is optimized using Loss(f).
2. The method of claim 1, wherein the image content understanding and visual question answering VQA is characterized in that: extracting features from the image, specifically comprising:
performing feature representation on the input image in a bottom-up manner by using a Faster R-CNN trained on Visual Genome data; for each detected target, performing average pooling on its convolutional-layer features to obtain a feature, denoted Xi; the features in the image are finally represented as an image feature matrix X.
3. The method of claim 1, wherein the image content understanding and visual question answering VQA is characterized in that: extracting features from the question specifically comprises the following steps:
segmenting the input question into words, converting each word into a vector by using a word embedding method, inputting the vectors into a single-layer recurrent neural network, and finally outputting the question feature matrix Y(f);
extracting features from the correctness statement of the question specifically comprises:
segmenting the input declarative statement into words, converting each word into a vector by using a word embedding method, inputting the vectors into a single-layer recurrent neural network, and finally outputting the statement feature matrix Y(t).
4. The method of claim 1, wherein the image content understanding and visual question answering VQA is characterized in that: the fused attention module comprises a first self-attention module, a second self-attention module and a scoring attention module;
the first self-attention module receives image features, the second self-attention module receives question features or statement features, the results of the first self-attention module and the results of the second self-attention module are output to the scoring attention module, and image fusion features I (f) and question fusion features Q (f) or image fusion features I (t) and statement fusion features S (t) are output.
5. The method of claim 4, wherein the image content understanding and visual question answering VQA is characterized in that: the first and second self-attention modules each include:
inputting image characteristics, problem characteristics or statement characteristics, converting the input image characteristics, the problem characteristics or the statement characteristics into matrixes through embedding transformation, and performing dot multiplication on the matrixes and the three matrixes Wq, Wk and Wv to obtain three weight matrixes Qi, Ki and Vi; wherein Wq, Wk, Wv are three trainable weight matrices using uniform distribution;
performing point multiplication on the matrix Qi and the matrix Ki to obtain a Score Score(i), and performing point multiplication on the matrix Qi and the matrices K(i+1), K(i+2), …, K(i+n) respectively to obtain scores Score(i+1), Score(i+2), …, Score(i+n);
SoftMax is carried out on [ Score (i), Score (i +1), … and Score (i + n) ] to obtain proportions [ Ratio (i), Ratio (i +1), … and Ratio (i + n) ];
multiplying the score proportion [ Ratio (i), Ratio (i +1), …, Ratio (i + n) ] with [ Vi, V (i +1), …, V (i + n) ] to obtain weighted values, and adding the weighted values to obtain Ti, namely an n × n attention mechanism diagram, wherein each word corresponds to a weight E (i, j), and the weighted diagram is output by the first self-attention module and the second self-attention module.
6. The method of claim 5, wherein the image content understanding and visual question answering VQA is characterized in that: the scoring attention module comprises:
the feature of the image feature output by the first self-attention module is ISM, the feature of the question feature or statement feature output by the second self-attention module is QM, both ISM and QM are n x n matrixes, and each position stores the ratio of a certain row to a certain column;
dot multiplication is carried out on i rows and j columns of the ISM and QM to obtain a matrix V:
V(ISM,QM)=(ISM(i,j)*QM(i,j))(i≤n,j≤n)
normalizing the matrix V to obtain V(i,j)′;
multiplying V(i,j)′ by ISM, multiplying the resulting matrix by the weight a to obtain a final matrix S, and normalizing S to obtain the weight matrix M:
Mi=softmax(a*V(i,j)′*ISM(i,j))(i≤n,j≤n)
the image feature IM is finally obtained from the weight matrix M, where m represents the number of network cycles in specific training.
7. The method of claim 1, wherein the image content understanding and visual question answering VQA is characterized in that: the result Loss(f) is obtained by performing a loss calculation on the first splicing result and the second splicing result, wherein F(f) is the first splicing result, F(t) is the second splicing result, and N is the number of samples;
the obtaining of the result loss (c) by performing the loss calculation on the first vector a (t) and the second vector a (f) includes:
Loss(c)=cross-entropy(A(t),A(f))
wherein cross-entropy represents cross-entropy;
performing mathematical operation on the result Loss (f) and the result Loss (c) to obtain a final result Loss, wherein the mathematical operation comprises the following steps:
Loss=Loss(c)+λLoss(f)
in the formula, λ is an adjustment parameter.
8. The method of claim 6, wherein the image content understanding and visual question answering VQA is characterized in that: inputting the image fusion characteristics I (f) and the problem fusion characteristics Q (f) into a bilinear model to be coded to obtain fused characteristics Z, and obtaining a classification result through a classifier, wherein the method comprises the following steps:
inputting the image fusion characteristics I (f) and the problem fusion characteristics Q (f) into a pre-trained encoder to obtain a fusion vector Z:
Z=LayerNorm(Wx*IM+Wy*QY)
wherein LayerNorm denotes normalization over the hidden-state dimension, Wx denotes the linear projection matrix of the image features, IM denotes the image features, namely I(f), Wy denotes the linear projection matrix of the question features, and QY denotes the question fusion features, namely Q(f);
mapping the obtained fusion vector Z, namely the feature Z, into a vector S, and calculating the classification result over the N candidate answers by using binary cross-entropy:
S=sigmoid(Z)
wherein sigmoid denotes the activation function of the neural network.
9. A storage medium having stored thereon computer instructions, characterized in that: the computer instructions when executed perform the steps of a method for image content understanding and visual question answering VQA according to any one of claims 1 to 8.
10. A terminal comprising a memory and a processor, said memory having stored thereon computer instructions executable on said processor, wherein said processor when executing said computer instructions performs the steps of a method for image content understanding and visual question and answer VQA according to any one of claims 1 to 8.
CN202110211935.4A 2021-02-25 2021-02-25 Image content understanding and visual question and answer VQA method, storage medium and terminal Active CN112926655B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110211935.4A CN112926655B (en) 2021-02-25 2021-02-25 Image content understanding and visual question and answer VQA method, storage medium and terminal

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110211935.4A CN112926655B (en) 2021-02-25 2021-02-25 Image content understanding and visual question and answer VQA method, storage medium and terminal

Publications (2)

Publication Number Publication Date
CN112926655A true CN112926655A (en) 2021-06-08
CN112926655B CN112926655B (en) 2022-05-17

Family

ID=76171844

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110211935.4A Active CN112926655B (en) 2021-02-25 2021-02-25 Image content understanding and visual question and answer VQA method, storage medium and terminal

Country Status (1)

Country Link
CN (1) CN112926655B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107480206A (en) * 2017-07-25 2017-12-15 杭州电子科技大学 A kind of picture material answering method based on multi-modal low-rank bilinearity pond
CN107679582A (en) * 2017-10-20 2018-02-09 深圳市唯特视科技有限公司 A kind of method that visual question and answer are carried out based on multi-modal decomposition model
US20190042900A1 (en) * 2017-12-28 2019-02-07 Ned M. Smith Automated semantic inference of visual features and scenes
US20190370587A1 (en) * 2018-05-29 2019-12-05 Sri International Attention-based explanations for artificial intelligence behavior
CN109086892A (en) * 2018-06-15 2018-12-25 中山大学 It is a kind of based on the visual problem inference pattern and system that typically rely on tree
CN108920587A (en) * 2018-06-26 2018-11-30 清华大学 Merge the open field vision answering method and device of external knowledge
CN110377710A (en) * 2019-06-17 2019-10-25 杭州电子科技大学 A kind of vision question and answer fusion Enhancement Method based on multi-modal fusion
CN110851760A (en) * 2019-11-12 2020-02-28 电子科技大学 Human-computer interaction system for integrating visual question answering in web3D environment
CN111553371A (en) * 2020-04-17 2020-08-18 中国矿业大学 Image semantic description method and system based on multi-feature extraction
CN111639692A (en) * 2020-05-25 2020-09-08 南京邮电大学 Shadow detection method based on attention mechanism
CN111858849A (en) * 2020-06-10 2020-10-30 南京邮电大学 VQA method based on intensive attention module

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
ZHOU YU et al.: "Multi-modal Factorized Bilinear Pooling with Co-attention Learning for Visual Question Answering", 2017 IEEE International Conference on Computer Vision (ICCV)
ZHANG TING: "Research on Image Content Understanding and Visual Reasoning Algorithms Based on the Attention Mechanism", China Masters' Theses Full-text Database, Information Science and Technology
ZENG PENGPENG: "A Visual Question Answering System Based on Deep Convolutional Networks and a Region Attention Mechanism", China Masters' Theses Full-text Database, Information Science and Technology
LI LEI: "Research on Visual Question Answering Combining Co-attention and Correlation Deep Networks", China Masters' Theses Full-text Database, Information Science and Technology

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113360621A (en) * 2021-06-22 2021-09-07 辽宁工程技术大学 Scene text visual question-answering method based on modal inference graph neural network
CN113722458A (en) * 2021-08-27 2021-11-30 海信电子科技(武汉)有限公司 Visual question answering processing method, device, computer readable medium and program product

Also Published As

Publication number Publication date
CN112926655B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
CN110188358B (en) Training method and device for natural language processing model
CN113656570B (en) Visual question-answering method and device based on deep learning model, medium and equipment
WO2020143130A1 (en) Autonomous evolution intelligent dialogue method, system and device based on physical environment game
WO2020088330A1 (en) Latent space and text-based generative adversarial networks (latext-gans) for text generation
CN109670168B (en) Short answer automatic scoring method, system and storage medium based on feature learning
CN110990543A (en) Intelligent conversation generation method and device, computer equipment and computer storage medium
CN112417894B (en) Conversation intention identification method and system based on multi-task learning
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN112926655B (en) Image content understanding and visual question and answer VQA method, storage medium and terminal
KR102361616B1 (en) Method and apparatus for recognizing named entity considering context
CN111401084A (en) Method and device for machine translation and computer readable storage medium
Yu et al. Training an adaptive dialogue policy for interactive learning of visually grounded word meanings
CN111460176A (en) Multi-document machine reading understanding method based on Hash learning
CN111160000B (en) Composition automatic scoring method, device terminal equipment and storage medium
CN111027292B (en) Method and system for generating limited sampling text sequence
CN111597341A (en) Document level relation extraction method, device, equipment and storage medium
CN111046178A (en) Text sequence generation method and system
CN112131367A (en) Self-auditing man-machine conversation method, system and readable storage medium
CN111597815A (en) Multi-embedded named entity identification method, device, equipment and storage medium
CN112528168B (en) Social network text emotion analysis method based on deformable self-attention mechanism
CN110795535A (en) Reading understanding method for depth separable convolution residual block
CN114332565A (en) Method for generating image by generating confrontation network text based on distribution estimation condition
US20220138425A1 (en) Acronym definition network
CN111666375B (en) Text similarity matching method, electronic device and computer readable medium
CN111832699A (en) Computationally efficient expressive output layer for neural networks

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant