CN116662591A - Robust visual question-answering model training method based on contrast learning

Robust visual question-answering model training method based on contrast learning

Info

Publication number
CN116662591A
CN116662591A
Authority
CN
China
Prior art keywords
attention
question
image
model
representation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310646697.9A
Other languages
Chinese (zh)
Inventor
鉴萍
张钧贺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology BIT filed Critical Beijing Institute of Technology BIT
Priority to CN202310646697.9A priority Critical patent/CN116662591A/en
Publication of CN116662591A publication Critical patent/CN116662591A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53Querying
    • G06F16/532Query formulation, e.g. graphical querying
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/0895Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

A robust visual question-answering model training method based on contrastive learning, belonging to the technical field at the intersection of natural language processing and computer vision. For image enhancement, an augmentation method based on visual-context perturbation is used: visual context that is only weakly related to the question is screened out through the attention distribution over objects in the image, and perturbation is added to this visual context to construct a new image representation, so that the model learns an image representation that does not depend on the visual context. For general (yes/no) questions, text enhancement is performed by deleting the interrogative auxiliary verb; for other question types, text enhancement is performed with a paraphrasing strategy. Positive samples are constructed through these data-enhancement methods, and contrastive learning is then used for optimization, so that an unbiased multi-modal representation of the input information is learned. The invention is applicable to fields such as artificial intelligence and natural language processing, enhances the robustness of the model, and improves its question-answering accuracy on different scenes or differently distributed data.

Description

Robust visual question-answering model training method based on contrast learning
Technical Field
The invention relates to a visual question-answering model training method, in particular to a robust visual question-answering model training method based on contrastive learning, and belongs to the technical field at the intersection of natural language processing and computer vision.
Background
In a visual question-answering (VQA) task, given a picture and a natural-language question about that picture, the computer is expected to predict the correct answer. The computer must understand not only the natural-language question but also the semantics of the image, and reason over both sources of information to predict the answer. The prior art has achieved good results on visual question-answering tasks.
However, with the rapid development of the field, researchers have found that visual question-answering models tend to rely on single-modality bias in the input information and answer questions through shortcuts, which makes them difficult to generalize to other datasets or real scenes and leaves them lacking robustness to changes in data distribution.
Among the many existing attempts to address bias in visual question-answering tasks, a common limitation is that these methods tend to focus only on the language bias between question-category information and the answer, while ignoring the potential bias between the image information and the answer, so the problem remains unresolved.
Meanwhile, with the development and wide application of visual question answering, researchers are increasingly aware that language bias is not the only bias, and that bias needs to be removed from more aspects in order to enhance the robustness of the model and reduce its dependence on bias.
Disclosure of Invention
To address the tendency of current visual question-answering models to depend on single-modality bias in the input information, the main purpose of the invention is to provide a robust visual question-answering model training method based on contrastive learning: positive samples are constructed with data-enhancement methods on both the text and image modalities, and contrastive learning is then used for optimization, so that an unbiased multi-modal representation of the input information is learned for subsequent prediction. This enhances the robustness of the visual question-answering model, reduces its dependence on bias, and improves its question-answering accuracy on different scenes or differently distributed data.
The aim of the invention is achieved by the following technical solution.
In the robust visual question-answering model training method based on contrastive learning, for image enhancement an augmentation method based on visual-context perturbation is used: visual context that is only weakly correlated with the question is screened out through the attention distribution over objects in the image, and perturbation is added to this visual context to construct a new image representation, so that the model learns an image representation that does not depend on the visual context, finds it harder to rely on bias information present in the image, and becomes more robust to image changes. For general (yes/no) questions, text enhancement is performed by deleting the interrogative auxiliary verb, which cuts off the potential co-occurrence pattern between the auxiliary verb and the answer, reduces the model's dependence on question bias, and improves question-answering accuracy on different scenes or differently distributed data.
The invention discloses a robust visual question-answering model training method based on contrastive learning, which comprises the following steps:
Step 1: paraphrase the input question of the visual question-answering sample to obtain a rewritten, enhanced question.
Step 1.1: fine-tune a T5 (Text-to-Text Transfer Transformer) model on multiple paraphrase datasets.
Step 1.2: for a general (yes/no) question, delete the question-category prefix given in the dataset annotation from the original question, thereby removing the corresponding interrogative auxiliary verb and constructing a new enhanced question; for other questions, feed the question into the fine-tuned T5 model and output the corresponding paraphrased question.
Step 2: input the question and image of the visual question-answering sample to obtain question features and image features.
Step 2.1: use the GloVe word-vector model to extract text representations of the question and the paraphrased question.
First, the input question is tokenized and the natural-language text is converted into an integer form the computer can process; a maximum word count is set and the question is truncated accordingly. Each word in the question is then converted into a text representation vector Word_Embed, whose first dimension is n, the number of words.
Step 2.2: input the text representation vector obtained in step 2.1 into an LSTM text encoder to extract question features.
The Word_Embed vector is fed into a single-layer LSTM network to obtain the question feature Y:
Y=LSTM(Word_Embed) (1)
Similarly, the same processing is applied to the paraphrased question to obtain the enhanced question feature Y_pos.
Step 2.3: extract object-based image features using the Faster R-CNN model.
First, object detection is performed on the input image, and object-based image features are extracted for each image using a Faster R-CNN model built on ResNet-101:
X=Faster R-CNN(Input_Image) (2)
where Input_Image denotes the input image and X denotes the image feature, whose first dimension m is the number of detected objects.
Step 3: input the question features and image features obtained in step 2 into a deep co-attention learning module to obtain attended features after interaction between the two modalities.
Step 3.1: pass the question features obtained in step 2 through L cascaded self-attention units to extract attended question features.
Step 3.2: pass the image features obtained in step 2 through L cascaded self-attention and guided-attention units to extract attended image features, where the guided attention is conditioned on the attended question features obtained in step 3.1.
Step 4: obtain the attended features of the enhanced image using an image-enhancement method based on visual-context perturbation.
Step 4.1: take the attention weight matrix from the l-th layer self-attention operation in step 3.2 and compute the mean of each column vector of the matrix as the saliency score of the corresponding object.
Objects with higher attention weights are more strongly correlated with the question and are essential for the model to answer it correctly; they are regarded as key objects of the question. Objects with lower attention weights have weak or no relevance to the question and do not help the model answer it; they are treated as the visual context of the question.
In the deep self-attention operations performed within the image modality, the attention distribution gradually stabilizes as the self-attention computation is repeated, so this distribution is used as the basis for screening the visual context.
The attention weight matrix obtained from the l-th layer self-attention computation is A_l ∈ R^(m×m), where each column a_i ∈ R^m is a column vector containing the attention weights between the current object and the m objects. To select the salient objects or regions in the image, each column vector a_i of the attention weight matrix is averaged to obtain the mean attention weight of each object, which serves as its saliency score; the calculation is shown in formula (3):
score_i = (1/m) · Σ_{j=1}^{m} a_i[j]   (3)
where the saliency score of the i-th object is the mean of its corresponding column vector, and a_i[j] denotes the j-th element of the column vector a_i.
Step 4.2: sort the saliency scores of the objects from step 4.1 and mask the r objects with the lowest scores to obtain the attended features of the enhanced image.
For the r objects with the smallest saliency scores, the attention scores corresponding to these objects are set to a minimum value so that their attention weights become 0 after the Softmax computation and their features are masked. In this way data enhancement is performed at the image-representation level, yielding an enhanced image representation.
Step 5: perform multi-modal fusion separately on the question and image representations from step 3, and on the enhanced question representation from step 3 together with the enhanced image representation from step 4, to obtain multi-modal representations of the original sample and of the positive sample respectively.
Step 5.1: feed the attended features of the two modalities from step 3 each through an attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the original sample.
Step 5.2: feed the attended features of the enhanced question from step 3 and the attended features of the enhanced image from step 4 through the attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the positive sample.
Step 6: optimize the multi-modal representation with a contrastive-learning loss function and optimize the predictive ability of the robust visual question-answering model with a cross-entropy loss function, obtaining a trained robust visual question-answering model with which highly robust visual question answering is realized.
Step 6.1: the multi-modal representation of the original sample is optimized using the InfoNCE loss function to be close to the multi-modal representation of the positive sample and far from the multi-modal representations of the other samples in the same batch.
Step 6.2: feed the multi-modal representation from step 5.1 into a single-layer fully connected classifier to predict the answer, while training with a binary cross-entropy loss function.
The method further comprises step 7: compared with a model trained without data enhancement and contrastive learning, the robust visual question-answering model obtained in step 6 is more robust to changes in data distribution, depends less on question-type bias, achieves higher question-answering accuracy on different scenes or differently distributed data, and improves human-computer interaction performance.
Advantageous effects
1. The robust visual question-answering training method based on contrastive learning disclosed by the invention performs data enhancement on multiple modalities. Using the visual-context-based image enhancement method on the image modality effectively suppresses the bias present in image information, so the method is applicable to various scenes with bias in either text or images, allows the model to perform well in more scenes, improves the question-answering accuracy of the robust visual question-answering model on different scenes or differently distributed data, and improves human-computer interaction performance.
2. The method builds on contrastive learning and exploits its flexibility, so that it can be combined with any visual question-answering model independently of the model structure, improving robustness and performance on visual question-answering tasks on top of a variety of model architectures.
3. To address the tendency of models to rely on bias for general (yes/no) questions, the method adopts an enhancement strategy that deletes the interrogative auxiliary verb, effectively suppressing the model's dependence on bias for such questions.
Drawings
FIG. 1 is a flow chart of the robust visual question-answering model training method based on contrastive learning disclosed by the invention;
FIG. 2 is a schematic diagram of the contrastive-learning-based visual question-answering model according to an embodiment of the invention.
Detailed Description
The present invention is described in detail below with reference to the accompanying drawings and an embodiment. The technical problems solved and the beneficial effects of the technical solution are also described; the described embodiment is intended only to facilitate understanding of the invention and has no limiting effect.
Suppose a visual question-answering model is to be trained as a daily-life assistant for visually impaired users, which requires high robustness and accuracy. As shown in FIG. 1, the robust visual question-answering model training method based on contrastive learning comprises the following steps:
step 1: and (3) rewriting the input questions of the visual questions and answers to obtain rewritten enhanced questions, so that the model can adapt to different language expression modes.
Step 1.1: trimming the T5 (Transfer Text-to-Text transducer) model over a plurality of rewritten datasets;
specifically, the T5 model was trimmed on the Quora Question Pairs, paraphrase Adversaries from Word Scrambling and Microsoft Research Paraphrase Corpus open source datasets.
Step 1.2: aiming at general questioning sentences, adopting a strategy for deleting the questioning auxiliary verbs to construct an enhancement problem; for other problems, a rewrite problem is generated for each problem in the dataset using the trimmed T5 model.
For a general question sentence, deleting the category of the question in the data set label on the basis of the original question, so as to delete the corresponding query auxiliary verb of the question to construct a new enhanced question; for other problems, the problem is input to the T5 model after trimming, and the rewrite problem corresponding to the problem is output.
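For ease of understanding, a minimal Python sketch of this question-enhancement step follows; the auxiliary-verb list, the "paraphrase:" prompt and the use of the public t5-base checkpoint in place of the fine-tuned model are illustrative assumptions, not part of the claimed method.

```python
# Sketch of Step 1 (question enhancement). The auxiliary-verb list, the prompt and
# the t5-base checkpoint are assumptions standing in for the fine-tuned T5 model.
from transformers import T5ForConditionalGeneration, T5Tokenizer

AUX_VERBS = {"is", "are", "was", "were", "do", "does", "did", "can", "could", "will"}

tokenizer = T5Tokenizer.from_pretrained("t5-base")
paraphraser = T5ForConditionalGeneration.from_pretrained("t5-base")

def enhance_question(question: str, is_yes_no: bool) -> str:
    """Build the enhanced (positive-sample) question."""
    if is_yes_no:
        # General (yes/no) question: drop the leading interrogative auxiliary verb.
        words = question.split()
        if words and words[0].lower() in AUX_VERBS:
            words = words[1:]
        return " ".join(words)
    # Other question types: generate a paraphrase with the (fine-tuned) T5 model.
    inputs = tokenizer("paraphrase: " + question, return_tensors="pt")
    outputs = paraphraser.generate(**inputs, max_length=32, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```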
Step 2: input the question and image of the visual question-answering sample to obtain question features and image features.
Step 2.1: use the GloVe word-vector model to extract text representations of the question and the paraphrased question.
First, the input question is tokenized and the natural-language text is converted into integer indices the computer can process; the maximum number of words is set to 14 and the question is truncated accordingly. Each word in the question is then converted into an n×300-dimensional text representation vector Word_Embed, where n ∈ [1, 14] is the number of words in the question.
Step 2.2: input the text representation obtained in step 2.1 into an LSTM text encoder to extract question features.
The Word_Embed vector is fed into a single-layer LSTM network, and the question feature Y is obtained by formula (1).
Similarly, the same processing is applied to the paraphrased question to obtain the enhanced question feature Y_pos.
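A minimal PyTorch sketch of the question encoder in steps 2.1-2.2 follows; the LSTM hidden size of 512 is an assumption not specified in the text.

```python
import torch
import torch.nn as nn

class QuestionEncoder(nn.Module):
    """GloVe embedding (300-d) followed by a single-layer LSTM, as in steps 2.1-2.2."""

    def __init__(self, glove_weights: torch.Tensor, hidden_dim: int = 512):
        super().__init__()
        # glove_weights: (vocab_size, 300) matrix of pre-trained GloVe vectors.
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)
        self.lstm = nn.LSTM(input_size=300, hidden_size=hidden_dim, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, n) integer indices, n <= 14 after truncation/padding.
        word_embed = self.embed(token_ids)   # (batch, n, 300) = Word_Embed
        y, _ = self.lstm(word_embed)         # Y = LSTM(Word_Embed), formula (1)
        return y                             # (batch, n, hidden_dim)
```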
Step 2.3: extract object-based image features using the Faster R-CNN model.
First, object detection is performed on the input image, and object-based image features are extracted for each image using a Faster R-CNN model built on ResNet-101, as shown in formula (2).
where Input_Image denotes the input image and X denotes the m×2048-dimensional image features, with m the number of detected objects, generally set to 36.
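In practice the region features of formula (2) are often pre-extracted offline; the sketch below assumes such pre-computed ResNet-101 Faster R-CNN features stored one NumPy file per image, which is an implementation assumption rather than part of the method.

```python
import numpy as np
import torch

def load_image_features(feature_path: str, m: int = 36) -> torch.Tensor:
    """Load pre-extracted Faster R-CNN (ResNet-101) region features for one image.

    Assumes the m x 2048 object features of formula (2) were computed offline and
    saved as a NumPy array of shape (m, 2048); the file format is an assumption.
    """
    x = torch.from_numpy(np.load(feature_path)).float()  # X, shape (m, 2048)
    assert x.shape == (m, 2048)
    return x
```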
Step 3: input the question features and image features obtained in step 2 into a deep co-attention learning module to obtain attended features after interaction between the two modalities.
Step 3.1: pass the question features obtained in step 2 through L cascaded self-attention units to extract attended question features.
On the question (text) side, the question feature Y is input to the L-layer cascade of multi-head self-attention units. With h attention heads, the l-th layer self-attention unit maps the question feature into a query Q, a key K and a value V as follows:
Q = W_Q · Y_l   (4)
K = W_K · Y_l   (5)
V = W_V · Y_l   (6)
where W_Q, W_K and W_V are the mapping matrix parameters of the query Q, key K and value V respectively, and Y_l is the question feature input to the l-th layer.
Y_{l+1} = Attention(Q, K, V) = W_attention · V   (7)
W_attention = Softmax(Q · K^T / √d)   (8)
where Q · K^T can be seen as a dot-product similarity between the two sets of vectors, the Softmax function yields the normalized attention weight matrix W_attention, and d is the key dimension; Y_{l+1} is the question feature output by the l-th layer, and the question feature output by the last layer is taken as the attended question feature Y'.
The same processing is applied to the paraphrased question to obtain the attended enhanced question feature Y'_pos.
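A compact PyTorch sketch of one such self-attention layer follows; it uses the built-in nn.MultiheadAttention in place of explicit W_Q, W_K and W_V matrices, and the 512-dimensional features, 8 heads and residual LayerNorm are assumptions.

```python
import torch.nn as nn

class SelfAttentionUnit(nn.Module):
    """One layer of the cascaded multi-head self-attention stack, formulas (4)-(8).

    Minimal sketch: the feed-forward sub-layer of a full co-attention block is omitted.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, y):
        # Q, K and V are all projections of the same input feature Y_l.
        out, attn_weights = self.attn(y, y, y)   # weights = Softmax(Q·K^T / sqrt(d))
        return self.norm(y + out), attn_weights  # Y_{l+1} and W_attention
```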
Step 3.2: pass the image features obtained in step 2 through L cascaded self-attention and guided-attention units to extract attended image features, where the guided attention is conditioned on the attended question features obtained in step 3.1.
On the image side, the image feature X is input to the L-layer cascade of multi-head self-attention and multi-head guided-attention units. With h attention heads, the l-th layer self-attention unit is computed in the same way as on the text side and is followed by a guided-attention unit, computed as follows:
Q = W_Q · X_l   (9)
K = W_K · Y'   (10)
V = W_V · Y'   (11)
where W_Q, W_K and W_V are the mapping matrix parameters of the query Q, key K and value V respectively; X_l is the image feature input to the l-th layer, and Y' is the attended question feature.
The computation then proceeds according to formulas (7) and (8) in step 3.1, and the image feature output by the last layer is taken as the attended image feature X'.
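A corresponding sketch of the guided-attention sub-layer follows, reusing the assumptions of the self-attention sketch above (512-dimensional features, 8 heads).

```python
import torch.nn as nn

class GuidedAttentionUnit(nn.Module):
    """Guided-attention sub-layer of the image branch, formulas (9)-(11): the image
    feature supplies the query, the attended question feature Y' supplies key/value."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, y_prime):
        out, _ = self.attn(x, y_prime, y_prime)  # Q from X_l, K and V from Y'
        return self.norm(x + out)                # X_{l+1}
```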
Step 4: obtain the attended features of the enhanced image using an image-enhancement method based on visual-context perturbation, as shown in FIG. 2.
Step 4.1: take the attention weight matrix from the l-th layer self-attention operation in step 3.2 and compute the mean of each column vector as the saliency score of the corresponding object.
Objects with higher attention weights are more strongly correlated with the question and are essential for the model to answer it correctly; they are regarded as key objects of the question. Objects with lower attention weights have weak or no relevance to the question and do not help the model answer it; they are treated as the visual context of the question.
In the deep self-attention operations performed within the image modality, the attention distribution gradually stabilizes as the self-attention computation is repeated, which makes it suitable as the basis for screening the visual context.
The attention weight matrix obtained from the l-th layer self-attention computation is A_l ∈ R^(m×m), where each column a_i ∈ R^m is a column vector containing the attention weights between the current object and the m objects. To select the salient objects or regions in the image, each column vector a_i of the attention weight matrix is averaged to obtain the mean attention weight of each object, which serves as its saliency score; the calculation is shown in formula (3):
score_i = (1/m) · Σ_{j=1}^{m} a_i[j]   (3)
where the saliency score of the i-th object is the mean of its corresponding column vector, and a_i[j] denotes the j-th element of the column vector a_i.
Step 4.2: sort the saliency scores of the objects from step 4.1 and mask the r objects with the lowest scores to obtain the attended features of the enhanced image.
For the r objects with the smallest saliency scores, the attention scores corresponding to these objects are set to a minimum value so that their attention weights become 0 after the Softmax computation and their features are masked. This performs data enhancement at the image-representation level and yields the enhanced image representation X'_pos.
Step 5: perform multi-modal fusion separately on the question and image representations from step 3, and on the enhanced question representation from step 3 together with the enhanced image representation from step 4, to obtain multi-modal representations of the original sample and of the positive sample respectively.
Step 5.1: feed the attended features of the two modalities from step 3 each through an attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the original sample.
Specifically, after the deep co-attention learning stage, the output question feature Y' and image feature X' already contain rich attention-weight information over the question words and image regions. A multilayer perceptron (MLP) with two fully connected layers is therefore employed as the attention-reduction model, where the first fully connected layer uses ReLU as the activation function and adds Dropout.
A Softmax function is applied to the reduced attention features to compute new attention weights α, from which a new attended feature X̂ is computed:
α = Softmax(MLP(X'))
X̂ = Σ_{i=1}^{m} α_i · X'_i
where X' is the input feature and α = [α_1, α_2, ..., α_m] are the learned attention weights.
For the output attended features X̂ and Ŷ, the following linear multi-modal fusion function is used:
z = LayerNorm(W_x^T · X̂ + W_y^T · Ŷ)
where W_x and W_y are two linear mapping matrices, d_z denotes the dimension of the fused feature, and layer normalization is applied after fusion to stabilize training, yielding the multi-modal representation z.
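A PyTorch sketch of the attention-reduction and fusion heads of step 5 follows; the hidden size, Dropout rate and fused dimension d_z = 1024 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionReduce(nn.Module):
    """Two-layer MLP attention-reduction head (step 5.1): ReLU + Dropout on the first
    layer, one scalar score per position, Softmax over positions, weighted sum."""

    def __init__(self, dim: int = 512, hidden: int = 512, p: float = 0.1):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Dropout(p),
                                 nn.Linear(hidden, 1))

    def forward(self, feats):                      # feats: (batch, m, dim), e.g. X'
        alpha = F.softmax(self.mlp(feats), dim=1)  # attention weights alpha, (batch, m, 1)
        return (alpha * feats).sum(dim=1)          # attended feature, (batch, dim)

class LinearFusion(nn.Module):
    """Linear multi-modal fusion: z = LayerNorm(W_x^T X_hat + W_y^T Y_hat)."""

    def __init__(self, dim: int = 512, d_z: int = 1024):
        super().__init__()
        self.wx = nn.Linear(dim, d_z)
        self.wy = nn.Linear(dim, d_z)
        self.norm = nn.LayerNorm(d_z)

    def forward(self, x_hat, y_hat):
        return self.norm(self.wx(x_hat) + self.wy(y_hat))  # multi-modal representation z
```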
Step 5.2: feed the attended features of the enhanced question from step 3 and the attended features of the enhanced image from step 4 through the attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the positive sample.
Specifically, following the same procedure as step 5.1, the enhanced question feature Y'_pos and the enhanced image feature X'_pos are fused to obtain the multi-modal representation z_pos, which serves as the positive sample.
Step 6: optimize the multi-modal representation with a contrastive-learning loss function, enhancing its robustness to language and image changes so that the model generalizes better in real scenes and better helps visually impaired users; optimize the predictive ability of the model with a cross-entropy loss function.
Step 6.1: optimizing the multi-modal representation of the original sample using the InfoNCE loss function to bring it closer to the multi-modal representation of the positive sample and further away from the multi-modal representations of other samples in the same batch;
the current sample and its corresponding positive sample are represented as M in multiple modes i ,Respectively correspond to the z, z obtained in the previous step pos As shown in fig. 2, the optimization is performed using the InfoNCE loss function so that the multi-modal representation of the original sample is as close as possible to the multi-modal representation of the positive sample, and as far as possible from the multi-modal representations of the other negative samples of the same batch, the calculation process is as follows:
where N is the number of samples, τ is the temperature coefficient, I [j≠i] E {0,1} is an indicator function, representing that 1 is taken when j+.i, or 0 is taken otherwise,represents M i ,/>The similarity between the two vectors is calculated as follows:
and introducing language and image changes to construct a positive sample, and optimizing the multi-modal representation learned by the visual question-answer model in a contrast learning mode to enhance the robustness of the model.
Step 6.2: feed the multi-modal representation from step 5.1 into a single-layer fully connected classifier to predict the answer, while training with a binary cross-entropy loss function.
Specifically, as shown in FIG. 2, the multi-modal representation z is mapped by a fully connected layer FCN and a Sigmoid function to a vector s ∈ R^G over the answer vocabulary, where G is the number of candidate answers.
s=Sigmoid(FCN(z)) (16)
Finally, the answer classifier is trained using binary cross-entropy as the loss function:
L_bce = -Σ_{i=1}^{N} Σ_{k=1}^{G} [ target_{i,k} · log(s_{i,k}) + (1 - target_{i,k}) · log(1 - s_{i,k}) ]   (17)
where target_{i,k} denotes the target score of the k-th candidate answer for the i-th sample.
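A sketch of the classifier head and the joint training objective follows; the answer-vocabulary size of 3129 and the equal weighting of the two losses are assumptions, since the text only states that both losses are used during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnswerClassifier(nn.Module):
    """Single fully connected layer over the fused representation z, followed by a
    Sigmoid, giving one score per candidate answer (formula (16))."""

    def __init__(self, d_z: int = 1024, num_answers: int = 3129):  # 3129 is an assumed G
        super().__init__()
        self.fcn = nn.Linear(d_z, num_answers)

    def forward(self, z):
        return torch.sigmoid(self.fcn(z))  # s = Sigmoid(FCN(z))

def total_loss(scores, targets, z, z_pos, lam: float = 1.0):
    """Joint objective: binary cross-entropy (formula (17)) plus the contrastive term
    (info_nce_loss from the sketch above); the weighted sum and lam are assumptions."""
    bce = F.binary_cross_entropy(scores, targets)  # targets: soft answer scores in [0, 1]
    return bce + lam * info_nce_loss(z, z_pos)
```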
Finally, the robust visual question-answering model trained by the invention can serve better as an assistant that helps visually impaired users perceive the world. When a user asks "Is the traffic light red now?", the model does not blindly answer "Yes" merely because of the language bias that most answers to such questions in the training set are "Yes"; instead it actually observes the color of the traffic light in the image and gives the correct answer, which increases the feasibility and safety of deploying the technology and further expands its application value.
The foregoing detailed description has set forth the objects, technical solution and advantages of the invention in further detail. It should be understood that the foregoing is only illustrative of the invention and is not intended to limit its scope, which is defined by the appended claims.

Claims (8)

1. A robust visual question-answering model training method based on contrastive learning, characterized by comprising the following steps:
step 1: paraphrase the input question of the visual question-answering sample to obtain a rewritten, enhanced question;
step 2: input the question and image of the visual question-answering sample to obtain question features and image features;
step 3: input the question features and image features obtained in step 2 into a deep co-attention learning module to obtain attended features after interaction between the two modalities;
step 4: obtain the attended features of the enhanced image using an image-enhancement method based on visual-context perturbation;
step 5: perform multi-modal fusion separately on the question and image representations from step 3 and on the enhanced question and enhanced image representations from steps 3 and 4, to obtain multi-modal representations of the original sample and of the positive sample respectively;
step 6: optimize the multi-modal representation with a contrastive-learning loss function and optimize the predictive ability of the robust visual question-answering model with a cross-entropy loss function, obtaining a trained robust visual question-answering model with which highly robust visual question answering is realized.
2. The robust visual question-answering model training method based on contrastive learning as claimed in claim 1, further comprising step 7: compared with a model trained without data enhancement and contrastive learning, the robust visual question-answering model obtained in step 6 is more robust to changes in data distribution, depends less on question-type bias, achieves higher question-answering accuracy on different scenes or differently distributed data, and improves human-computer interaction performance.
3. The robust visual question-answering model training method based on contrastive learning as claimed in claim 1, wherein step 1 is implemented as follows:
step 1: paraphrase the input question of the visual question-answering sample to obtain a rewritten, enhanced question;
step 1.1: fine-tune a T5 (Text-to-Text Transfer Transformer) model on a plurality of paraphrase datasets;
step 1.2: for a general (yes/no) question, delete the question-category prefix given in the dataset annotation from the original question, thereby removing the corresponding interrogative auxiliary verb and constructing a new enhanced question; for other questions, feed the question into the fine-tuned T5 model and output the corresponding paraphrased question.
4. The robust visual question-answering model training method based on contrastive learning as claimed in claim 3, wherein step 2 is implemented as follows:
step 2.1: use the GloVe word-vector model to extract text representations of the question and the paraphrased question;
first, the input question is tokenized and the natural-language text is converted into an integer form the computer can process; a maximum word count is set and the question is truncated accordingly; each word in the question is then converted into a text representation vector Word_Embed, whose first dimension is n, the number of words;
step 2.2: input the text representation vector obtained in step 2.1 into an LSTM text encoder to extract question features;
the Word_Embed vector is fed into a single-layer LSTM network to obtain the question feature Y:
Y = LSTM(Word_Embed)   (1)
similarly, the same processing is applied to the paraphrased question to obtain the enhanced question feature Y_pos;
step 2.3: extract object-based image features using the Faster R-CNN model;
first, object detection is performed on the input image, and object-based image features are extracted for each image using a Faster R-CNN model built on ResNet-101:
X = Faster R-CNN(Input_Image)   (2)
where Input_Image denotes the input image and X denotes the image feature, whose first dimension m is the number of detected objects.
5. The robust visual question-answering model training method based on contrastive learning as claimed in claim 4, wherein step 3 is implemented as follows:
step 3.1: pass the question features obtained in step 2 through L cascaded self-attention units to extract attended question features;
step 3.2: pass the image features obtained in step 2 through L cascaded self-attention and guided-attention units to extract attended image features, where the guided attention is conditioned on the attended question features obtained in step 3.1.
6. The robust visual question-answering model training method based on contrastive learning as claimed in claim 5, wherein step 4 is implemented as follows:
step 4.1: take the attention weight matrix from the l-th layer self-attention operation in step 3.2 and compute the mean of each column vector as the saliency score of the corresponding object;
objects with higher attention weights are more strongly correlated with the question, are essential for the model to answer it correctly, and are regarded as key objects of the question; objects with lower attention weights have weak or no relevance to the question, do not help the model answer it, and are treated as the visual context of the question;
in the deep self-attention operations performed within the image modality, the attention distribution gradually stabilizes as the self-attention computation is repeated, and is used as the basis for screening the visual context;
the attention weight matrix obtained from the l-th layer self-attention computation is A_l ∈ R^(m×m), where each column a_i ∈ R^m is a column vector containing the attention weights between the current object and the m objects; to select the salient objects or regions in the image, each column vector a_i of the attention weight matrix is averaged to obtain the mean attention weight of each object as its saliency score, as shown in formula (3);
score_i = (1/m) · Σ_{j=1}^{m} a_i[j]   (3)
where the saliency score of the i-th object is the mean of its corresponding column vector, and a_i[j] denotes the j-th element of the column vector a_i;
step 4.2: sort the saliency scores of the objects from step 4.1 and mask the r objects with the lowest scores to obtain the attended features of the enhanced image;
for the r objects with the smallest saliency scores, the attention scores corresponding to these objects are set to a minimum value so that their attention weights become 0 after the Softmax computation and their features are masked; data enhancement is thus performed at the image-representation level, resulting in an enhanced image representation.
7. The robust visual question-answering model training method based on contrastive learning as claimed in claim 6, wherein step 5 is implemented as follows:
step 5.1: feed the attended features of the two modalities from step 3 each through an attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the original sample;
step 5.2: feed the attended features of the enhanced question from step 3 and the attended features of the enhanced image from step 4 through the attention-reduction network and a fully connected layer, and add the results to obtain the multi-modal representation of the positive sample.
8. The robust visual question-answering model training method based on contrastive learning as claimed in claim 7, wherein step 6 is implemented as follows:
step 6.1: optimize the multi-modal representation of the original sample using the InfoNCE loss function so that it is close to the multi-modal representation of the positive sample and far from the multi-modal representations of other samples in the same batch;
step 6.2: feed the multi-modal representation from step 5.1 into a single-layer fully connected classifier to predict the answer, and train with a binary cross-entropy loss function to obtain the trained robust visual question-answering model.
CN202310646697.9A 2023-06-02 2023-06-02 Robust visual question-answering model training method based on contrast learning Pending CN116662591A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310646697.9A CN116662591A (en) 2023-06-02 2023-06-02 Robust visual question-answering model training method based on contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310646697.9A CN116662591A (en) 2023-06-02 2023-06-02 Robust visual question-answering model training method based on contrast learning

Publications (1)

Publication Number Publication Date
CN116662591A true CN116662591A (en) 2023-08-29

Family

ID=87725724

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310646697.9A Pending CN116662591A (en) 2023-06-02 2023-06-02 Robust visual question-answering model training method based on contrast learning

Country Status (1)

Country Link
CN (1) CN116662591A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235670A (en) * 2023-11-10 2023-12-15 南京信息工程大学 Medical image problem vision solving method based on fine granularity cross attention



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination