CN116306681A - Method and system for constructing an interpretable visual question-answering model based on fact scenarios - Google Patents

Method and system for constructing an interpretable visual question-answering model based on fact scenarios

Info

Publication number
CN116306681A
CN116306681A (Application CN202211623149.6A)
Authority
CN
China
Prior art keywords
image
text
model
visual question
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211623149.6A
Other languages
Chinese (zh)
Inventor
蔡林沁
方豪度
许诺影
钱坤阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211623149.6A priority Critical patent/CN116306681A/en
Publication of CN116306681A publication Critical patent/CN116306681A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method and system for constructing an interpretable visual question-answering model based on fact scenarios: a first data set and a second data set are acquired; the visual question-answering model is pre-trained to obtain an image feature extraction network and a text feature extraction network; the image feature extraction network is processed by a weight back-propagation method to obtain an image counterfactual sample; the text feature extraction network is processed by an open-source machine learning library to obtain a text counterfactual sample; antagonistic semi-fact samples are introduced to iteratively update the visual question-answering model, yielding a visual question-answer prediction model; feature data are extracted, and the visual question-answer prediction model is verified with the feature data to obtain an interpretable visual question-answering model. The beneficial effect of the invention is that it addresses the weak interpretability of models in current visual question-answering research, so that the model retains key causal information to enhance its reasoning ability and captures image features and text features at a finer granularity.

Description

Method and system for constructing an interpretable visual question-answering model based on fact scenarios
Technical Field
The invention relates to the technical field of visual question answering, and in particular to a method and system for constructing an interpretable visual question-answering model based on fact scenarios.
Background
In recent years, computer vision and natural language processing have developed rapidly in the image and text domains respectively. The research area that combines the two is visual question answering: the purpose of a visual question answering (Visual Question Answering, VQA) task is to predict, from a given image, the answer to a question associated with that image. Inspired by VQA, research on medical VQA has attracted attention in recent years; in medical VQA, both the images and the questions come from medical images and the clinical medical questions related to them. Recent studies have shown that interpretability determines the reliability of the predicted answer, and medical VQA requires more interpretability than general-domain VQA, because incorrect answer predictions can have catastrophic consequences.
However, although there has been related research on interpretability techniques for convolutional neural networks (CNNs) and recurrent neural networks (RNNs), there are few interpretability studies in the field of visual question answering, particularly medical image question answering. For example, when asking about a medical image "What is abnormal in the image?" or "How many abnormal regions are there in the image?", a reliable interpretability method should validate the predicted answer, and this validation should be based on the medical VQA system as a whole, rather than merely a visual presentation of the image and text through an attention mechanism. Such interpretability methods have not been fully explored; it is therefore necessary to study interpretability techniques in the field of medical image question answering.
Causal reasoning can be used for model interpretability. Most current deep learning models are trained in a data-driven manner based on statistical models; this black-box approach can directly learn implicit correlations from the data, but it cannot explain the results output by the trained model.
In view of this, the present application is specifically proposed.
Disclosure of Invention
The invention aims to solve the technical problem that, in the prior art, a deep learning model cannot explain the results it outputs after learning, and provides a method and system for constructing an interpretable visual question-answering model based on fact scenarios, so that the results output by the deep learning model are interpretable.
The invention is realized by the following technical scheme:
the method for constructing an interpretable visual question-answering model based on fact scenarios comprises the following steps:
acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set;
constructing a visual question-answer model, and pre-training the visual question-answer model through the first data set to obtain an image feature extraction network and a text feature extraction network;
processing the image feature extraction network by a weight back-propagation method to obtain an image counterfactual sample;
processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
and extracting feature data in the second data set, and verifying the visual question-answer prediction model through the feature data to obtain an interpretable visual question-answer model.
In the traditional field of visual question answering, deep learning models are mostly trained in a data-driven manner based on statistical models; this black-box approach can directly learn the implicit correlations of the data, but it cannot explain the results output by the trained model. The invention provides a method for constructing an interpretable visual question-answering model based on fact scenarios, in which the relevant networks are processed by weight back propagation and by an open-source machine learning library respectively, and the visual question-answer model is continuously updated and iteratively optimized with the resulting counterfactual samples. This addresses the weak interpretability of models in current visual question-answering research, so that the model retains key causal information to enhance its reasoning ability and captures image features and text features at a finer granularity.
Preferably, the sub-steps of obtaining the image feature extraction network and the text feature extraction network include:
in the visual question-answering model, extracting image features from the first data set through a ResNet50 network to obtain the image features;
embedding the words of the question text through a GloVe model, and feeding the embeddings into a 1024-dimensional LSTM network to obtain the text features;
and processing the image features and the text features through a bilinear attention network to obtain the image feature extraction network and the text feature extraction network.
Preferably, the sub-steps of obtaining the image counterfactual sample include:
processing the image feature extraction network with the weight back-propagation (WBP) method to obtain a causal saliency map;
and, in combination with the L1 norm, approximately replacing the pixel values in the causal saliency map with 0 to obtain the image counterfactual sample.
Preferably, the sub-steps of obtaining the text counterfactual sample include:
processing the text feature extraction network with the open-source machine learning library SHAP to obtain an importance score for each word of the question text associated with the image;
and, in combination with the L1 norm, replacing the word with the highest score with [MASK] to obtain the text counterfactual sample.
Preferably, the pre-training specifically comprises: optimizing a symmetric loss function using cosine similarity in the gradient calculation stage of the visual question-answering model.
Preferably, the sub-step of obtaining the visual question-answer prediction model comprises:
taking derivatives of the original-sample loss function, the positive-sample loss function, the counterfactual-sample loss function and the L1 norm with respect to the network layer parameters, propagating along the negative gradient direction to minimize the loss value, and continuously updating the relevant parameters iteratively to obtain the visual question-answer prediction model.
Preferably, in the image-text pair data set, an image-text pair is a piece of data consisting of an image together with a related question and its answer, and the image-text pair data set is a set consisting of multiple such image-text pairs.
The invention also provides a system for constructing an interpretable visual question-answering model based on fact scenarios, which comprises a data acquisition module, a pre-training module, a first processing module, a second processing module, an iterative updating module and a verification module;
the data acquisition module is used for acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set;
the pre-training module is used for constructing a visual question-answer model, and pre-training the visual question-answer model through the first data set to obtain an image feature extraction network and a text feature extraction network;
the first processing module is used for processing the image feature extraction network by a weight back-propagation method to obtain an image counterfactual sample;
the second processing module is used for processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
the iterative updating module is used for introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
and the verification module is used for extracting the characteristic data in the second data set, and verifying the visual question-answer prediction model through the characteristic data to obtain an interpretable visual question-answer model.
Preferably, the pre-training module comprises an image feature extraction module, a text feature extraction module and a network processing module,
the image feature extraction module is used for extracting image features in the first data set through a ResNet50 network in the visual question-answering model to obtain image features;
the text feature extraction module is used for embedding the words of the question text through a GloVe model and feeding the embeddings into a 1024-dimensional LSTM network to obtain text features;
and the network processing module is used for processing the image characteristics and the text characteristics through a bilinear attention network to obtain an image characteristic extraction network and a text characteristic extraction network.
The invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described above.
Compared with the prior art, the invention has the following advantages and beneficial effects:
according to the method and system for constructing an interpretable visual question-answering model based on fact scenarios, the relevant networks are processed by weight back propagation and by the open-source machine learning library respectively, and the visual question-answer model is continuously updated and iteratively optimized with the resulting counterfactual samples, which addresses the weak interpretability of models in current visual question-answering research, so that the model retains key causal information to enhance its reasoning ability and captures image features and text features at a finer granularity.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the drawings that are needed in the examples will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and that other related drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow diagram of a construction method;
FIG. 2 is a diagram of a visual question-answering model framework;
FIG. 3 is a schematic diagram of the causal-inference intervention strategy based on fact scenarios;
FIG. 4 is a diagram of the interpretable reasoning effect on the benchmark datasets.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that: no such specific details are necessary to practice the invention. In other instances, well-known structures, circuits, materials, or methods have not been described in detail in order not to obscure the invention.
Throughout the specification, references to "one embodiment," "an embodiment," "one example," or "an example" mean: a particular feature, structure, or characteristic described in connection with the embodiment or example is included within at least one embodiment of the invention. Thus, the appearances of the phrases "in one embodiment," "in an example," or "in an example" in various places throughout this specification are not necessarily all referring to the same embodiment or example. Furthermore, the particular features, structures, or characteristics may be combined in any suitable combination and/or sub-combination in one or more embodiments or examples. Moreover, those of ordinary skill in the art will appreciate that the illustrations provided herein are for illustrative purposes and that the illustrations are not necessarily drawn to scale. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In the description of the present invention, the terms "front", "rear", "left", "right", "upper", "lower", "vertical", "horizontal", "high", "low", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, merely to facilitate description of the present invention and simplify description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the scope of the present invention.
Example 1
In the traditional field of visual question answering, deep learning models are mostly trained in a data-driven manner based on statistical models; this black-box approach can directly learn the implicit correlations of the data, but it cannot explain the results output by the trained model.
This embodiment discloses a method for constructing an interpretable visual question-answering model based on fact scenarios. The relevant networks are processed by weight back propagation and by an open-source machine learning library respectively, and the visual question-answer model is continuously updated and iteratively optimized with the resulting counterfactual samples, which addresses the weak interpretability of models in current visual question-answering research, so that the model retains key causal information to enhance its reasoning ability and captures image features and text features at a finer granularity. A schematic flow diagram of the construction method is shown in FIG. 1, and the method comprises the following steps:
s1: acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set; in the image-text data set, the image-text data is data formed by a piece of image and a corresponding related question and answer thereof, and the image-text data set is a set formed by a plurality of image-text data; the embodiment takes the acquired medical image visual question-answering model as a column.
S2: constructing a visual question-answer model, and pre-training the visual question-answer model with the first data set to obtain an image feature extraction network and a text feature extraction network;
The sub-steps of obtaining the image feature extraction network and the text feature extraction network are as follows: in the visual question-answer model, image features are extracted from the first data set through a ResNet50 network to obtain the image features; the words of the question text are embedded through a GloVe model and the embeddings are fed into a 1024-dimensional LSTM network to obtain the text features; the image features and the text features are processed through a bilinear attention network to obtain the image feature extraction network and the text feature extraction network. The pre-training specifically comprises optimizing a symmetric loss function using cosine similarity in the gradient calculation stage of the visual question-answer model.
After the medical image is fed into the model, it first passes through a ResNet50 network to extract image features. The question text is word-embedded by the GloVe model, with each word vector being 300-dimensional, and is then fed into a 1024-dimensional LSTM network to generate the question text features. The forward-propagation equations of an LSTM unit with a forget gate are:

f_t = σ(W_{fx} x_t + W_{fh} h_{t-1} + b_f)
i_t = σ(W_{ix} x_t + W_{ih} h_{t-1} + b_i)
o_t = σ(W_{ox} x_t + W_{oh} h_{t-1} + b_o)
c̃_t = tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where f_t, i_t and o_t are the forget gate, input gate and output gate controlling the respective states, W and b are the weights and biases of the three gates, and c_t is the cell state of the LSTM.
For feature fusion, the invention uses a bilinear attention network to fuse the visual information and the language information. The fused joint feature follows the low-rank bilinear form

f = p^T ( σ(U^T v) ⊙ σ(V^T q) )

where v and q are the image and question features, U and V are linear embedding matrices, and p is a learnable mapping vector. The overall framework of the visual question-answer model is shown in FIG. 2.
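For concreteness, a simplified sketch of this feature-extraction and fusion pipeline is shown below; the module wiring, the fused dimension and the single-glimpse low-rank fusion are assumptions for illustration, not the exact network of FIG. 2.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VQAFeatureExtractor(nn.Module):
    """Illustrative sketch of the described pipeline: ResNet50 image features,
    GloVe + 1024-d LSTM question features, low-rank bilinear fusion."""

    def __init__(self, glove_weights, hidden=1024, fused=1024):
        super().__init__()
        resnet = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # globally pooled 2048-d feature
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)  # 300-d GloVe vectors
        self.lstm = nn.LSTM(300, hidden, batch_first=True)
        # Low-rank bilinear fusion: f = p^T (sigma(U^T v) * sigma(V^T q))
        self.U = nn.Linear(2048, fused)
        self.V = nn.Linear(hidden, fused)
        self.p = nn.Linear(fused, fused)

    def forward(self, image, question_tokens):
        v = self.cnn(image).flatten(1)            # (B, 2048) image feature
        q_emb = self.embed(question_tokens)       # (B, T, 300) word embeddings
        _, (h, _) = self.lstm(q_emb)
        q = h[-1]                                 # (B, 1024) question feature
        fused = torch.sigmoid(self.U(v)) * torch.sigmoid(self.V(q))
        return self.p(fused)                      # joint representation
```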
S3: processing the image feature extraction network by the weight back-propagation method to obtain an image counterfactual sample;
the sub-steps of obtaining the image counterfactual sample are as follows: processing the image feature extraction network with the weight back-propagation (WBP) method to obtain a causal saliency map; and, in combination with the L1 norm, approximately replacing the pixel values in the causal saliency map with 0 to obtain the image counterfactual sample.
The specific process is as follows: a causal intervention strategy is designed to generate counterfactual examples during model training so as to strengthen causal correlations. Given an image input x whose answer label is y = m, where m = 1, ..., M and M is the number of classes, the causal saliency map is s_m(x); the causal intervention removes the causal information about x contained in s_m(x) (salient pixel values are replaced with zero), and the result is used as the counterfactual causal sample of the image modality. Given a question text input t whose answer label is y = m, the causal saliency map is s_m(t); the causal intervention removes the causal information about t contained in s_m(t) (salient words are replaced with [MASK]), and the result is then relabeled as the counterfactual causal sample of the text modality.
To generate a saliency map in the original picture pixel space and thereby provide information about the features that drive the decision, weight back propagation (WBP) is described below. WBP is an efficient scheme for computing saliency maps that is applicable to arbitrary neural architectures, in which the contribution of every pixel to the final class-specific prediction can be evaluated.

Consider a vector input and linear mappings. Let x^l be the internal representation of the data at layer l, so that l = 0 is the input layer, i.e. x^0 = x, and l = L is the penultimate logit layer before the softmax transformation. To assign a relative importance to each hidden unit of the l-th layer, all transformations after layer l are symbolically decomposed into an operator denoted W̃^l, called the saliency matrix, satisfying:

x^L = W̃^l x^l

where x^L is an M-dimensional vector corresponding to the M different classes in y. Although written in matrix form (with a slight abuse of notation), the operator W̃^l effectively depends on the input x, so all nonlinearities are effectively absorbed into it. For an object associated with a given label y = m, its causal features are contained in the interaction between the m-th row of W̃^0 and the input x, i.e.:

s_m(x)_k = [W̃^0]_{m,k} · x_k

where s_m(x)_k denotes the k-th element of the saliency map s_m(x) and [W̃^0]_{m,k} is a single element of W̃^0. A key observation for computing W̃^l is that it can be done recursively. Specifically, let g_l(x^l) be the layer-l transformation (for example an affine transformation, convolution, activation function or normalization); then:

x^L = W̃^{l+1} x^{l+1} = W̃^{l+1} g_l(x^l) = W̃^l x^l

This means the computation can be carried out recursively as:

W̃^l = G(W̃^{l+1}, g_l)

where G(·) is the update rule. The update rules for the transformations common in deep networks are listed in Table 1.
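As a toy illustration of the recursive saliency-matrix computation, the sketch below handles only a bias-free fully connected ReLU network; convolutions, normalization layers and biases require the per-layer update rules of Table 1, and the function names are assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

def wbp_saliency(linear_layers, x, target_class):
    """Weight back propagation (WBP) sketch for a bias-free fully connected ReLU net.

    linear_layers: list of nn.Linear modules; the last one produces the class logits.
    x: 1-D input vector.  Returns the causal saliency map s_m(x) for target_class.
    The saliency matrix is accumulated recursively from the output layer back to the
    input, absorbing each ReLU as a data-dependent 0/1 gating of the layer weights.
    """
    gates, h = [], x
    for layer in linear_layers[:-1]:
        z = layer(h)
        gates.append((z > 0).float())   # remember which units the ReLU keeps
        h = torch.relu(z)

    # logits = W_last · D_k · W_k · ... · D_0 · W_0 · x   (bias-free case)
    w_tilde = linear_layers[-1].weight
    for layer, gate in zip(reversed(linear_layers[:-1]), reversed(gates)):
        w_tilde = w_tilde @ (gate.unsqueeze(-1) * layer.weight)

    # s_m(x)_k = [W_tilde^0]_{m,k} * x_k
    return w_tilde[target_class] * x

# Example usage (illustrative): a 3-layer toy classifier on a flattened 8x8 patch
layers = [nn.Linear(64, 32, bias=False),
          nn.Linear(32, 16, bias=False),
          nn.Linear(16, 5, bias=False)]
s = wbp_saliency(layers, torch.randn(64), target_class=2)
```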
In this embodiment, the image occlusion operation is constrained so that the replaced local region (i.e. the causal part of the image that affects the model output) is as small as possible. Any causal saliency map that satisfies the causal relation counts as an effective occlusion regardless of the size of the occluded area, and occluding the whole image masks the causal relation just as well as occluding only the lesion region, so the solution degenerates in a way that is detrimental to the interpretability of the model. To avoid this, an L1 norm is used to encourage the causal portion of each image to occupy only a small fraction of the whole image.
S4: processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
The sub-steps of obtaining the text counterfactual sample are as follows: processing the text feature extraction network with the open-source machine learning library SHAP to obtain an importance score for each word of the question text associated with the image; and, in combination with the L1 norm, replacing the word with the highest score with [MASK] to obtain the text counterfactual sample.
A saliency map over the original question text is generated with SHAP (SHapley Additive exPlanations), a general model-interpretability framework inspired by cooperative game theory. Classical methods include Shapley regression values and Shapley sampling values. Shapley regression values retrain the model on feature subsets when computing feature contributions: for a feature i, all feature subsets containing i and excluding i are generated, the model is retrained on each and its predictions are computed, and the average contribution of feature i is obtained as:

φ_i = Σ_{S ⊆ F\{i}} [ |S|! (|F| − |S| − 1)! / |F|! ] ( f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S) )

Shapley sampling values avoid retraining a new model for every subset by approximating the above equation through sampling, while Quantitative Input Influence is a broader framework for algorithmic interpretation in which the feature-contribution part is likewise approximated by sampling the Shapley values. The explanation model is specified as:

g(z′) = φ_0 + Σ_{j=1}^{M} φ_j z′_j

where g is the explanation model, z′ ∈ {0,1}^M is a coalition vector and M is the maximum length of the vector. φ_j is the contribution of feature j (its Shapley value); the coalition vector indicates which feature combinations are present in the selected data point, with 0 meaning the feature is absent and 1 meaning it is present.
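The patent uses the SHAP library for this step; as an illustration of the underlying Shapley sampling idea, the hand-rolled sketch below scores question tokens by Monte-Carlo permutation sampling and masks the highest-scoring word. The function names, the number of samples and the use of a [MASK] placeholder for absent tokens are assumptions, not the library's API.

```python
import random

def shapley_sampling_scores(predict_prob, tokens, n_samples=200, mask_token="[MASK]"):
    """Monte-Carlo approximation of Shapley sampling values over question tokens.

    predict_prob(token_list) should return the model's probability for the target
    answer class when the question is the given token list; absent tokens are
    represented by mask_token.
    """
    n = len(tokens)
    scores = [0.0] * n
    for _ in range(n_samples):
        order = list(range(n))
        random.shuffle(order)
        present = [mask_token] * n
        prev = predict_prob(present)
        for idx in order:
            present[idx] = tokens[idx]                 # reveal one more token
            curr = predict_prob(present)
            scores[idx] += (curr - prev) / n_samples   # marginal contribution of this token
            prev = curr
    return scores

def text_counterfactual(tokens, scores, mask_token="[MASK]"):
    """Replace the word with the highest importance score by [MASK]."""
    top = max(range(len(tokens)), key=lambda i: scores[i])
    return [mask_token if i == top else t for i, t in enumerate(tokens)]
```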
In this embodiment, Step 2 and Step 3 further include: removing the causal information from the picture input x_i and the question text input t_i to obtain the counterfactual samples x̃_i and t̃_i, using the following occlusion:

x̃_i = x_i − T(s_m(x_i)) ⊙ x_i

where T(·) is a masking (occlusion) function that soft-thresholds the saliency map into an occlusion mask, and ω, σ > 0 are threshold and scaling parameters — put simply, the parameters controlling the extent of the occlusion and the pixel values. The following objective function is defined:

L_Neg = ℓ( x̃_i, t̃_i, y_i ; f_θ )

where f_θ is the prediction model, L_Neg is the counterfactual-sample loss function to be optimized, t̃_i is the text counterfactual sample (the highest-scoring element of the text vector is replaced with 0), and the class label is flipped by negating the loss, i.e. ℓ̃(x, t, y; f_θ) = −ℓ(x, t, y; f_θ).

It should also be noted that this objective function can admit a degenerate solution: any causal saliency map that satisfies the causal relation counts as an effective occlusion regardless of the size of the occluded area, and occluding the whole image masks the causal relation just as well as occluding only the lesion region, which is detrimental to the interpretability of the model. To avoid this, an L1 norm is used to encourage the causal portion of each image to occupy only a small fraction of the whole image: L_reg = ||s(·)||_1.
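A sketch of the counterfactual branch under these definitions is given below; the sigmoid soft-threshold used for T(·), the default values of ω and σ, and the l1_weight are illustrative assumptions, since the patent's exact functional form of the mask is not reproduced here.

```python
import torch

def soft_occlusion_mask(saliency, omega=0.5, sigma=0.1):
    """Illustrative masking function T(.): turns a saliency map into a soft occlusion
    mask.  omega and sigma stand in for the threshold and scaling parameters."""
    s = saliency.abs()
    s = s / (s.max() + 1e-8)                      # normalize saliency to [0, 1]
    return torch.sigmoid((s - omega) / sigma)     # ~1 on causal pixels, ~0 elsewhere

def counterfactual_terms(model, x, t_cf, saliency, y, criterion, l1_weight=1e-4):
    """Sketch of the counterfactual branch: occlude causal pixels, classify, and
    compute (i) the counterfactual classification loss L_Neg, which enters the total
    objective with a minus sign (the effective label flip), and (ii) the L1
    regularizer that keeps the causal region small."""
    x_cf = x - soft_occlusion_mask(saliency) * x  # image counterfactual: x - T(s_m(x)) ⊙ x
    logits = model(x_cf, t_cf)                    # t_cf: text counterfactual (top word -> [MASK])
    l_neg = criterion(logits, y)
    l_reg = l1_weight * saliency.abs().sum()      # L_reg = ||s(.)||_1
    return l_neg, l_reg
```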
S5: introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
The sub-step of obtaining the visual question-answer prediction model is as follows: taking derivatives of the original-sample loss function, the positive-sample loss function, the counterfactual-sample loss function and the L1 norm with respect to the network layer parameters, propagating along the negative gradient direction to minimize the loss value, and continuously updating the relevant parameters iteratively to obtain the visual question-answer prediction model.
An antagonistic control group is introduced to avoid disturbances caused by the intervention strategy itself, i.e. to prevent the model from learning the intervention operation (image occlusion) instead of learning to capture causal correlations. For example, once the model detects that an input has been occluded, it may learn to change its prediction regardless of whether the image actually lacks causal features, which would affect the discrimination result. Therefore, the non-causal parts of the image and the question are randomly occluded to obtain the semi-fact samples x'_i and t'_i:

x'_i = x_i − T(s_m(x_j)) ⊙ x_i,  i ≠ j
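Building on the soft_occlusion_mask sketch above, the antagonistic control group could be generated roughly as follows; pairing sample i with the saliency map of another sample j and masking a random non-top word are illustrative choices, not the patent's exact procedure.

```python
import random

def semi_fact_samples(x_i, saliency_j, tokens_i, scores_i, mask_token="[MASK]"):
    """Antagonistic control group (sketch): occlude NON-causal content so that the
    correct answer should still be predictable.
      image: apply the occlusion mask of a different sample j (i != j),
             x'_i = x_i - T(s_m(x_j)) ⊙ x_i
      text:  mask a random word that is NOT the highest-scoring (causal) word."""
    x_semi = x_i - soft_occlusion_mask(saliency_j) * x_i
    top = max(range(len(tokens_i)), key=lambda k: scores_i[k])
    choices = [k for k in range(len(tokens_i)) if k != top]
    pick = random.choice(choices) if choices else top
    t_semi = [mask_token if k == pick else tok for k, tok in enumerate(tokens_i)]
    return x_semi, t_semi
```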
As shown in FIG. 3, according to the causal-relationship assumption, the counterfactual sample obtained after the causal intervention should predict a wrong answer, while the original factual input and the semi-fact sample should predict the correct answer. The network layer parameters are differentiated through the different loss functions and updated along the negative gradient direction to minimize the loss value; as training deepens, the causal saliency map obtained by the weight back-propagation technique gradually becomes more accurate, which constitutes the explanation produced during model training.
In this embodiment, there are specifically four loss functions: the classification loss of the original sample, the classification loss of the positive (semi-fact) sample, the classification loss of the counterfactual sample, and the L1 norm function described above. It should be noted that the loss function of the counterfactual sample enters with a negative sign, because the classification of the counterfactual sample is the result of a causal intervention and should not fall into the correct class. The objective function of the semi-fact sample is:

L_Pos = ℓ( x'_i, t'_i, y_i ; f_θ )

The objective function of the original sample is:

L_Cls = ℓ( x_i, t_i, y_i ; f_θ )

The overall objective function is:

L = L_Cls − L_Neg + L_reg + L_Pos

Optimizing the model parameters with this overall objective function helps the model capture the causal relationships in the samples and gives the model stronger interpretability.
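Putting the four terms together, a sketch of the overall objective is shown below; the batch layout and the l1_weight are assumptions for illustration.

```python
def total_objective(model, criterion, batch, l1_weight=1e-4):
    """Sketch of the overall objective L = L_Cls - L_Neg + L_reg + L_Pos.
    `batch` is assumed to hold the factual, counterfactual and semi-fact versions
    of each image/question pair together with the saliency map and answer label."""
    x, t, x_cf, t_cf, x_sf, t_sf, saliency, y = batch

    l_cls = criterion(model(x, t), y)          # factual sample must predict the correct answer
    l_neg = criterion(model(x_cf, t_cf), y)    # counterfactual sample: loss is subtracted (label flip)
    l_pos = criterion(model(x_sf, t_sf), y)    # semi-fact sample must still predict the correct answer
    l_reg = l1_weight * saliency.abs().sum()   # keep the causal region small

    return l_cls - l_neg + l_reg + l_pos
```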
S6: and extracting feature data in the second data set, and verifying the visual question-answer prediction model through the feature data to obtain an interpretable visual question-answer model.
In this embodiment, as shown in FIG. 4, the method also improves the interpretability of the model. FIG. 4 illustrates the testing procedure with a visualization technique to reveal the interpretability of the proposed method: the answer distributions for two specific question patterns are first compared, and feature maps are then used to show the most important regions of the test inputs.
In the first row of FIG. 4, the method shows the ability to capture causal relationships for the question pattern "Is there an abnormality?". This is a closed question with "yes" or "no" candidate answers, and the answer to most closed questions in the training set is "no". For a test input from VQA-RAD with an abnormality in shoulder bone density (red rectangle), the baseline method almost always answers "no" because of the imbalanced distribution, whereas the method presented in this patent outputs "yes" with a probability of about 80%. The method of the invention appears to infer the abnormal bone density in the shoulder by accurately locating the correct region, while the baseline model gives a wrong answer because it does not attend to the abnormal region in the image; this unsatisfactory performance is likely due to language bias.
Further, in the second row of FIG. 4, a similar situation occurs for the question "What is the abnormality in this CT scan?". More than 50% of the answers in the training set are "cystic teratoma", and only 10% are "colon cancer". For a test input from SLAKE with a tumour abnormality in the colon region, the method of the invention accurately identifies the lesion by capturing the causal part of the input, whereas the lesion located by the baseline model is wrong. In answer prediction, the baseline model appears to output only "cystic teratoma", following the answer distribution of the training set, while the proposed method infers the correct answer "colon cancer" from the correct lesion even though "colon cancer" is under-represented in the training set. These two examples demonstrate that the method of the invention is effective on various Med-VQA datasets, especially language-biased Med-VQA datasets.
A fact scenario refers to causal reasoning about events that occur around us, often producing thought processes of the form "if a certain condition had been changed, the result would not have occurred (if ... then ...)" or "even if a certain condition had been changed, the result would still have occurred (even if ... still ...)". The mental activity of negating an event that has already occurred and constructing an alternative possibility is called counterfactual thinking. The term counterfactual can be interpreted abstractly as an event occurring under different conditions; correspondingly there are the semi-fact and the fact, which are illustrated figuratively by the example of a bank loan in Table 4.1.
The method for constructing an interpretable visual question-answering model based on fact scenarios provided by this embodiment addresses the weak interpretability of models in current visual question-answering research, so that the model retains key causal information to enhance its reasoning ability and captures image features and text features at a finer granularity. In this embodiment, the model is tested on the benchmark datasets VQA-RAD and SLAKE; the model of the invention achieves competitive results, performing particularly well on open-ended questions, and its interpretability also carries over to visual question-answering models in other fields.
Example two
This embodiment discloses a system for constructing an interpretable visual question-answering model based on fact scenarios, which is intended to implement the construction method of Embodiment 1 and comprises a data acquisition module, a pre-training module, a first processing module, a second processing module, an iterative updating module and a verification module;
the data acquisition module is used for acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set;
the pre-training module is used for constructing a visual question-answer model, and pre-training the visual question-answer model through the first data set to obtain an image feature extraction network and a text feature extraction network;
the first processing module is used for processing the image feature extraction network by a weight back-propagation method to obtain an image counterfactual sample;
the second processing module is used for processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
the iterative updating module is used for introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
and the verification module is used for extracting the characteristic data in the second data set, and verifying the visual question-answer prediction model through the characteristic data to obtain an interpretable visual question-answer model.
The pre-training module comprises an image feature extraction module, a text feature extraction module and a network processing module,
the image feature extraction module is used for extracting image features in the first data set through a ResNet50 network in the visual question-answering model to obtain image features;
the text feature extraction module is used for embedding the words of the question text through a GloVe model and feeding the embeddings into a 1024-dimensional LSTM network to obtain text features;
and the network processing module is used for processing the image characteristics and the text characteristics through a bilinear attention network to obtain an image characteristic extraction network and a text characteristic extraction network.
Example III
This embodiment discloses a computer storage medium on which a computer program is stored; when the computer program is executed by a processor, the method of Embodiment 1 is implemented.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing specific embodiments further describe the objects, technical solutions and advantageous effects of the present invention in detail. It should be understood that the above is merely a description of specific embodiments of the present invention and is not intended to limit the scope of protection of the present invention; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A method for constructing an interpretable visual question-answering model based on fact scenarios, characterized by comprising the following steps:
acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set;
constructing a visual question-answer model, and pre-training the visual question-answer model through the first data set to obtain an image feature extraction network and a text feature extraction network;
processing the image feature extraction network by a weight back-propagation method to obtain an image counterfactual sample;
processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
and extracting feature data in the second data set, and verifying the visual question-answer prediction model through the feature data to obtain an interpretable visual question-answer model.
2. The method for constructing an interpretable visual question-answering model based on fact scenarios according to claim 1, wherein the sub-steps of obtaining the image feature extraction network and the text feature extraction network include:
in the visual question-answering model, extracting image features in the first data set through a ResNet50 network to obtain image features;
embedding the words of the question text through a GloVe model, and feeding the embeddings into a 1024-dimensional LSTM network to obtain text features;
and processing the image features and the text features through a bilinear attention network to obtain an image feature extraction network and a text feature extraction network.
3. The method for constructing an interpretable visual question-answering model based on fact scenarios according to claim 1, wherein the sub-steps of obtaining the image counterfactual sample include:
processing the image feature extraction network with the weight back-propagation (WBP) method to obtain a causal saliency map;
and, in combination with the L1 norm, approximately replacing the pixel values in the causal saliency map with 0 to obtain the image counterfactual sample.
4. The method for constructing an interpretable visual question-answering model based on fact scenarios according to claim 3, wherein the sub-steps of obtaining the text counterfactual sample include:
processing the text feature extraction network with the open-source machine learning library SHAP to obtain an importance score for each word of the question text associated with the image;
and, in combination with the L1 norm, replacing the word with the highest score with [MASK] to obtain the text counterfactual sample.
5. The method for constructing an interpretable visual question-answering model based on fact scenarios according to claim 1, wherein the pre-training specifically comprises: optimizing a symmetric loss function using cosine similarity in the gradient calculation stage of the visual question-answering model.
6. The method for constructing an interpretable visual question-answering model based on fact scenarios according to claim 1, wherein the sub-step of obtaining the visual question-answer prediction model comprises:
taking derivatives of the original-sample loss function, the positive-sample loss function, the counterfactual-sample loss function and the L1 norm with respect to the network layer parameters, propagating along the negative gradient direction to minimize the loss value, and continuously updating the relevant parameters iteratively to obtain the visual question-answer prediction model.
7. The method for constructing an interpretable visual question-answering model based on fact scenarios according to any one of claims 1 to 6, wherein in the image-text pair data set, an image-text pair is a piece of data consisting of an image together with a related question and its answer corresponding to the image, and the image-text pair data set is a set consisting of multiple such image-text pairs.
8. A system for constructing an interpretable visual question-answering model based on fact scenarios, characterized by comprising a data acquisition module, a pre-training module, a first processing module, a second processing module, an iterative updating module and a verification module;
the data acquisition module is used for acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set;
the pre-training module is used for constructing a visual question-answer model, and pre-training the visual question-answer model through the first data set to obtain an image feature extraction network and a text feature extraction network;
the first processing module is used for processing the image feature extraction network by a weight back-propagation method to obtain an image counterfactual sample;
the second processing module is used for processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
the iterative updating module is used for introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
and the verification module is used for extracting the characteristic data in the second data set, and verifying the visual question-answer prediction model through the characteristic data to obtain an interpretable visual question-answer model.
9. The system for constructing an interpretable visual question-answering model based on fact scenarios according to claim 8, wherein the pre-training module comprises an image feature extraction module, a text feature extraction module and a network processing module,
the image feature extraction module is used for extracting image features in the first data set through a ResNet50 network in the visual question-answering model to obtain image features;
the text feature extraction module is used for embedding the words of the question text through a GloVe model and feeding the embeddings into a 1024-dimensional LSTM network to obtain text features;
and the network processing module is used for processing the image characteristics and the text characteristics through a bilinear attention network to obtain an image characteristic extraction network and a text characteristic extraction network.
10. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 8.
CN202211623149.6A 2022-12-16 2022-12-16 Method and system for constructing an interpretable visual question-answering model based on fact scenarios Pending CN116306681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211623149.6A CN116306681A (en) 2022-12-16 2022-12-16 Method and system for constructing an interpretable visual question-answering model based on fact scenarios

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211623149.6A CN116306681A (en) 2022-12-16 2022-12-16 Method and system for constructing an interpretable visual question-answering model based on fact scenarios

Publications (1)

Publication Number Publication Date
CN116306681A true CN116306681A (en) 2023-06-23

Family

ID=86798437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211623149.6A Pending CN116306681A (en) 2022-12-16 2022-12-16 Method and system for constructing an interpretable visual question-answering model based on fact scenarios

Country Status (1)

Country Link
CN (1) CN116306681A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117077761A (en) * 2023-07-31 2023-11-17 浙江大学 Visual interpretation method and system for deep neural network model based on inverse fact comparison
CN117077761B (en) * 2023-07-31 2024-03-26 浙江大学 Visual interpretation method and system for deep neural network model based on inverse fact comparison
CN118227770A (en) * 2024-05-24 2024-06-21 阿里巴巴(中国)有限公司 Task processing method, legal question answering method and task processing model training method
CN118227770B (en) * 2024-05-24 2024-08-06 阿里巴巴(中国)有限公司 Task processing method, legal question answering method and task processing model training method

Similar Documents

Publication Publication Date Title
Liang et al. Explaining the black-box model: A survey of local interpretation methods for deep neural networks
Chen et al. Order-free rnn with visual attention for multi-label classification
Patro et al. U-cam: Visual explanation using uncertainty based class activation maps
Logeswaran et al. Sentence ordering and coherence modeling using recurrent neural networks
Cadene et al. Murel: Multimodal relational reasoning for visual question answering
Zhang et al. Mdnet: A semantically and visually interpretable medical image diagnosis network
Zhang et al. Learning noise-aware encoder-decoder from noisy labels by alternating back-propagation for saliency detection
JP2021166046A (en) Method for training convolutional neural network for image recognition using image conditional mask language modeling
CN116306681A (en) Method and system for constructing interpretive visual question-answer model based on fact scene
Zhang et al. Triplet attention and dual-pool contrastive learning for clinic-driven multi-label medical image classification
CN111275118A (en) Chest film multi-label classification method based on self-correction type label generation network
Vahadane et al. Dual encoder attention u-net for nuclei segmentation
CN109242029A (en) Identify disaggregated model training method and system
Ruwa et al. Mood-aware visual question answering
CN112420125A (en) Molecular attribute prediction method and device, intelligent equipment and terminal
Utkin et al. A new adaptive weighted deep forest and its modifications
Wu et al. FER-CHC: Facial expression recognition with cross-hierarchy contrast
Patro et al. Uncertainty class activation map (U-CAM) using gradient certainty method
Singh et al. Efficient evolving deep ensemble medical image captioning network
Ponti et al. Training deep networks from zero to hero: avoiding pitfalls and going beyond
Li et al. Unifying model explainability and robustness for joint text classification and rationale extraction
Rafi et al. A Deep Learning-Based Bengali Visual Question Answering System
Lonij et al. Open-world visual recognition using knowledge graphs
Wu et al. AGNet: Automatic generation network for skin imaging reports
CN116541507A (en) Visual question-answering method and system based on dynamic semantic graph neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination