CN116306681A - Method and system for constructing an interpretable visual question-answering model based on fact scenarios - Google Patents

Method and system for constructing an interpretable visual question-answering model based on fact scenarios

Info

Publication number
CN116306681A
CN116306681A (Application CN202211623149.6A)
Authority
CN
China
Prior art keywords
image
text
model
visual question
answer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211623149.6A
Other languages
Chinese (zh)
Inventor
蔡林沁
方豪度
许诺影
钱坤阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202211623149.6A priority Critical patent/CN116306681A/en
Publication of CN116306681A publication Critical patent/CN116306681A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a method and system for constructing an interpretable visual question-answering model based on fact scenarios: a first data set and a second data set are acquired; the visual question-answering model is pre-trained to obtain an image feature extraction network and a text feature extraction network; the image feature extraction network is processed by a weight back-propagation method to obtain an image counterfactual sample; the text feature extraction network is processed by an open-source machine learning library to obtain a text counterfactual sample; antagonistic semi-fact samples are introduced to iteratively update the visual question-answering model, yielding a visual question-answer prediction model; feature data are extracted, and the visual question-answer prediction model is verified with the feature data to obtain an interpretable visual question-answering model. The beneficial effect of the invention is that it addresses the weak interpretability of models in current visual question-answering research, so that the model retains key causal information to enhance its reasoning ability and captures image features and text features at a finer granularity.

Description

Method and system for constructing an interpretable visual question-answering model based on fact scenarios
Technical Field
The invention relates to the technical field of visual question answering, and in particular to a method and system for constructing an interpretable visual question-answering model based on fact scenarios.
Background
In recent years, computer vision and natural language processing have developed rapidly in the image and text domains respectively. The research area that combines the two is visual question answering: the purpose of a visual question answering (Visual Question Answering, VQA) task is to predict, from a given image, the answer to a question associated with that image. Inspired by VQA, research on medical VQA has attracted attention in recent years; in medical VQA, both the images and the questions come from medical images and the clinical medical questions related to them. Recent studies have shown that interpretability determines the reliability of the predicted answer, and medical VQA requires more interpretability than general-domain VQA, because incorrect answer predictions can have catastrophic consequences.
However, although there has been related research on interpretability techniques for convolutional neural networks (CNNs) and recurrent neural networks (RNNs), there are few interpretability studies in the field of visual question answering, particularly medical image question answering. For example, when asking about a medical image "What is abnormal in the image?" or "How many abnormal regions are there in the image?", a reliable interpretability method should validate the predicted answer, and this validation should be based on the medical VQA system as a whole, rather than merely a visual presentation of the image and text through an attention mechanism. Such interpretability methods have not been fully explored; it is therefore necessary to study interpretability techniques in the field of medical image question answering.
Causal reasoning can be used for model interpretability. Most current deep learning models are trained in a data-driven manner based on statistical models; this black-box approach can directly learn implicit correlations from the data, but it cannot explain the results output by the trained model.
In view of this, the present application is specifically proposed.
Disclosure of Invention
The invention aims to solve the technical problem that, in the prior art, a deep learning model cannot explain the results it outputs after learning, and provides a method and system for constructing an interpretable visual question-answering model based on fact scenarios, so that the results output by the deep learning model are interpretable.
The invention is realized by the following technical scheme:
the method for constructing an interpretable visual question-answering model based on fact scenarios comprises the following steps:
acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set;
constructing a visual question-answer model, and pre-training the visual question-answer model through the first data set to obtain an image feature extraction network and a text feature extraction network;
processing the image feature extraction network by a weight back-propagation method to obtain an image counterfactual sample;
processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
and extracting feature data in the second data set, and verifying the visual question-answer prediction model through the feature data to obtain an interpretable visual question-answer model.
In the traditional field of visual question answering, deep learning models are mostly trained in a data-driven manner based on statistical models; this black-box approach can directly learn the implicit correlations of the data, but it cannot explain the results output by the trained model. The invention provides a method for constructing an interpretable visual question-answering model based on fact scenarios, in which the relevant networks are processed by weight back propagation and by an open-source machine learning library respectively, and the visual question-answer model is continuously updated and iteratively optimized with the resulting counterfactual samples. This addresses the weak interpretability of models in current visual question-answering research, so that the model retains key causal information to enhance its reasoning ability and captures image features and text features at a finer granularity.
Preferably, the sub-steps of obtaining the image feature extraction network and the text feature extraction network include:
in the visual question-answering model, extracting image features from the first data set through a ResNet50 network to obtain the image features;
embedding the words of the question text through a GloVe model, and feeding the embeddings into a 1024-dimensional LSTM network to obtain the text features;
and processing the image features and the text features through a bilinear attention network to obtain the image feature extraction network and the text feature extraction network.
Preferably, the sub-steps of obtaining the image counterfactual sample include:
processing the image feature extraction network with the weight back-propagation (WBP) method to obtain a causal saliency map;
and, in combination with the L1 norm, approximately replacing the pixel values in the causal saliency map with 0 to obtain the image counterfactual sample.
Preferably, the sub-steps of obtaining the text counterfactual sample include:
processing the text feature extraction network with the open-source machine learning library SHAP to obtain an importance score for each word of the question text associated with the image;
and, in combination with the L1 norm, replacing the word with the highest score with [MASK] to obtain the text counterfactual sample.
Preferably, the pre-training specifically comprises: optimizing a symmetric loss function using cosine similarity in the gradient calculation stage of the visual question-answering model.
Preferably, the sub-step of obtaining the visual question-answer prediction model comprises:
taking derivatives of the original-sample loss function, the positive-sample loss function, the counterfactual-sample loss function and the L1 norm with respect to the network layer parameters, propagating along the negative gradient direction to minimize the loss value, and continuously updating the relevant parameters iteratively to obtain the visual question-answer prediction model.
Preferably, in the image-text pair data set, an image-text pair is a piece of data consisting of an image together with a related question and its answer, and the image-text pair data set is a set consisting of multiple such image-text pairs.
The invention also provides a system for constructing an interpretable visual question-answering model based on fact scenarios, which comprises a data acquisition module, a pre-training module, a first processing module, a second processing module, an iterative updating module and a verification module;
the data acquisition module is used for acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set;
the pre-training module is used for constructing a visual question-answer model, and pre-training the visual question-answer model through the first data set to obtain an image feature extraction network and a text feature extraction network;
the first processing module is used for processing the image feature extraction network by a weight back-propagation method to obtain an image counterfactual sample;
the second processing module is used for processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
the iterative updating module is used for introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
and the verification module is used for extracting the characteristic data in the second data set, and verifying the visual question-answer prediction model through the characteristic data to obtain an interpretable visual question-answer model.
Preferably, the pre-training module comprises an image feature extraction module, a text feature extraction module and a network processing module,
the image feature extraction module is used for extracting image features in the first data set through a ResNet50 network in the visual question-answering model to obtain image features;
the text feature extraction module is used for embedding the words of the question text through a GloVe model and feeding the embeddings into a 1024-dimensional LSTM network to obtain text features;
and the network processing module is used for processing the image characteristics and the text characteristics through a bilinear attention network to obtain an image characteristic extraction network and a text characteristic extraction network.
The invention also provides a computer storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described above.
Compared with the prior art, the invention has the following advantages and beneficial effects:
according to the method and system for constructing an interpretable visual question-answering model based on fact scenarios, the relevant networks are processed by weight back propagation and by the open-source machine learning library respectively, and the visual question-answer model is continuously updated and iteratively optimized with the resulting counterfactual samples, which addresses the weak interpretability of models in current visual question-answering research, so that the model retains key causal information to enhance its reasoning ability and captures image features and text features at a finer granularity.
Drawings
In order to more clearly illustrate the technical solutions of the exemplary embodiments of the present invention, the drawings that are needed in the examples will be briefly described below, it being understood that the following drawings only illustrate some examples of the present invention and therefore should not be considered as limiting the scope, and that other related drawings may be obtained from these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow diagram of a construction method;
FIG. 2 is a diagram of a visual question-answering model framework;
FIG. 3 is a schematic diagram of the causal-inference intervention strategy based on fact scenarios;
FIG. 4 is a diagram of the interpretable reasoning effect on the benchmark datasets.
Detailed Description
For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. However, it will be apparent to one of ordinary skill in the art that: no such specific details are necessary to practice the invention. In other instances, well-known structures, circuits, materials, or methods have not been described in detail in order not to obscure the invention.
Throughout the specification, references to "one embodiment," "an embodiment," "one example," or "an example" mean: a particular feature, structure, or characteristic described in connection with the embodiment or example is included within at least one embodiment of the invention. Thus, the appearances of the phrases "in one embodiment," "in an example," or "in an example" in various places throughout this specification are not necessarily all referring to the same embodiment or example. Furthermore, the particular features, structures, or characteristics may be combined in any suitable combination and/or sub-combination in one or more embodiments or examples. Moreover, those of ordinary skill in the art will appreciate that the illustrations provided herein are for illustrative purposes and that the illustrations are not necessarily drawn to scale. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
In the description of the present invention, the terms "front", "rear", "left", "right", "upper", "lower", "vertical", "horizontal", "high", "low", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, merely to facilitate description of the present invention and simplify description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and therefore should not be construed as limiting the scope of the present invention.
Example 1
In the traditional field of visual question answering, deep learning models are mostly trained in a data-driven manner based on statistical models; this black-box approach can directly learn the implicit correlations of the data, but it cannot explain the results output by the trained model.
This embodiment discloses a method for constructing an interpretable visual question-answering model based on fact scenarios. The relevant networks are processed by weight back propagation and by an open-source machine learning library respectively, and the visual question-answer model is continuously updated and iteratively optimized with the resulting counterfactual samples, which addresses the weak interpretability of models in current visual question-answering research, so that the model retains key causal information to enhance its reasoning ability and captures image features and text features at a finer granularity. A schematic flow diagram of the construction method is shown in FIG. 1, and the method comprises the following steps:
s1: acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set; in the image-text data set, the image-text data is data formed by a piece of image and a corresponding related question and answer thereof, and the image-text data set is a set formed by a plurality of image-text data; the embodiment takes the acquired medical image visual question-answering model as a column.
S2: constructing a visual question-answer model, and pre-training the visual question-answer model with the first data set to obtain an image feature extraction network and a text feature extraction network;
The sub-steps of obtaining the image feature extraction network and the text feature extraction network are as follows: in the visual question-answer model, image features are extracted from the first data set through a ResNet50 network to obtain the image features; the words of the question text are embedded through a GloVe model and the embeddings are fed into a 1024-dimensional LSTM network to obtain the text features; the image features and the text features are processed through a bilinear attention network to obtain the image feature extraction network and the text feature extraction network. The pre-training specifically comprises optimizing a symmetric loss function using cosine similarity in the gradient calculation stage of the visual question-answer model.
After the medical image is fed into the model, it first passes through a ResNet50 network to extract image features. The question text is word-embedded by the GloVe model, with each word vector being 300-dimensional, and is then fed into a 1024-dimensional LSTM network to generate the question text features. The forward-propagation equations of an LSTM unit with a forget gate are:

f_t = σ(W_{fx} x_t + W_{fh} h_{t-1} + b_f)
i_t = σ(W_{ix} x_t + W_{ih} h_{t-1} + b_i)
o_t = σ(W_{ox} x_t + W_{oh} h_{t-1} + b_o)
c̃_t = tanh(W_{cx} x_t + W_{ch} h_{t-1} + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
h_t = o_t ⊙ tanh(c_t)

where f_t, i_t and o_t are the forget gate, input gate and output gate controlling the respective states, W and b are the weights and biases of the three gates, and c_t is the cell state of the LSTM.
For feature fusion, the invention uses a bilinear attention network to fuse the visual information and the language information. The fused joint feature follows the low-rank bilinear form

f = p^T ( σ(U^T v) ⊙ σ(V^T q) )

where v and q are the image and question features, U and V are linear embedding matrices, and p is a learnable mapping vector. The overall framework of the visual question-answer model is shown in FIG. 2.
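For concreteness, a simplified sketch of this feature-extraction and fusion pipeline is shown below; the module wiring, the fused dimension and the single-glimpse low-rank fusion are assumptions for illustration, not the exact network of FIG. 2.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class VQAFeatureExtractor(nn.Module):
    """Illustrative sketch of the described pipeline: ResNet50 image features,
    GloVe + 1024-d LSTM question features, low-rank bilinear fusion."""

    def __init__(self, glove_weights, hidden=1024, fused=1024):
        super().__init__()
        resnet = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # globally pooled 2048-d feature
        self.embed = nn.Embedding.from_pretrained(glove_weights, freeze=False)  # 300-d GloVe vectors
        self.lstm = nn.LSTM(300, hidden, batch_first=True)
        # Low-rank bilinear fusion: f = p^T (sigma(U^T v) * sigma(V^T q))
        self.U = nn.Linear(2048, fused)
        self.V = nn.Linear(hidden, fused)
        self.p = nn.Linear(fused, fused)

    def forward(self, image, question_tokens):
        v = self.cnn(image).flatten(1)            # (B, 2048) image feature
        q_emb = self.embed(question_tokens)       # (B, T, 300) word embeddings
        _, (h, _) = self.lstm(q_emb)
        q = h[-1]                                 # (B, 1024) question feature
        fused = torch.sigmoid(self.U(v)) * torch.sigmoid(self.V(q))
        return self.p(fused)                      # joint representation
```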
S3: processing the image feature extraction network by the weight back-propagation method to obtain an image counterfactual sample;
the sub-steps of obtaining the image counterfactual sample are as follows: processing the image feature extraction network with the weight back-propagation (WBP) method to obtain a causal saliency map; and, in combination with the L1 norm, approximately replacing the pixel values in the causal saliency map with 0 to obtain the image counterfactual sample.
The specific process is as follows: a causal intervention strategy is designed to generate counterfactual examples during model training so as to strengthen causal correlations. Given an image input x whose answer label is y = m, where m = 1, ..., M and M is the number of classes, the causal saliency map is s_m(x); the causal intervention removes the causal information about x contained in s_m(x) (salient pixel values are replaced with zero), and the result is used as the counterfactual causal sample of the image modality. Given a question text input t whose answer label is y = m, the causal saliency map is s_m(t); the causal intervention removes the causal information about t contained in s_m(t) (salient words are replaced with [MASK]), and the result is then relabeled as the counterfactual causal sample of the text modality.
To generate a saliency map in the original picture pixel space and thereby provide information about the features that drive the decision, weight back propagation (WBP) is described below. WBP is an efficient scheme for computing saliency maps that is applicable to arbitrary neural architectures, in which the contribution of every pixel to the final class-specific prediction can be evaluated.

Consider a vector input and linear mappings. Let x^l be the internal representation of the data at layer l, so that l = 0 is the input layer, i.e. x^0 = x, and l = L is the penultimate logit layer before the softmax transformation. To assign a relative importance to each hidden unit of the l-th layer, all transformations after layer l are symbolically decomposed into an operator denoted W̃^l, called the saliency matrix, satisfying:

x^L = W̃^l x^l

where x^L is an M-dimensional vector corresponding to the M different classes in y. Although written in matrix form (with a slight abuse of notation), the operator W̃^l effectively depends on the input x, so all nonlinearities are effectively absorbed into it. For an object associated with a given label y = m, its causal features are contained in the interaction between the m-th row of W̃^0 and the input x, i.e.:

s_m(x)_k = [W̃^0]_{m,k} · x_k

where s_m(x)_k denotes the k-th element of the saliency map s_m(x) and [W̃^0]_{m,k} is a single element of W̃^0. A key observation for computing W̃^l is that it can be done recursively. Specifically, let g_l(x^l) be the layer-l transformation (for example an affine transformation, convolution, activation function or normalization); then:

x^L = W̃^{l+1} x^{l+1} = W̃^{l+1} g_l(x^l) = W̃^l x^l

This means the computation can be carried out recursively as:

W̃^l = G(W̃^{l+1}, g_l)

where G(·) is the update rule. The update rules for the transformations common in deep networks are listed in Table 1.
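As a toy illustration of the recursive saliency-matrix computation, the sketch below handles only a bias-free fully connected ReLU network; convolutions, normalization layers and biases require the per-layer update rules of Table 1, and the function names are assumptions, not the patent's implementation.

```python
import torch
import torch.nn as nn

def wbp_saliency(linear_layers, x, target_class):
    """Weight back propagation (WBP) sketch for a bias-free fully connected ReLU net.

    linear_layers: list of nn.Linear modules; the last one produces the class logits.
    x: 1-D input vector.  Returns the causal saliency map s_m(x) for target_class.
    The saliency matrix is accumulated recursively from the output layer back to the
    input, absorbing each ReLU as a data-dependent 0/1 gating of the layer weights.
    """
    gates, h = [], x
    for layer in linear_layers[:-1]:
        z = layer(h)
        gates.append((z > 0).float())   # remember which units the ReLU keeps
        h = torch.relu(z)

    # logits = W_last · D_k · W_k · ... · D_0 · W_0 · x   (bias-free case)
    w_tilde = linear_layers[-1].weight
    for layer, gate in zip(reversed(linear_layers[:-1]), reversed(gates)):
        w_tilde = w_tilde @ (gate.unsqueeze(-1) * layer.weight)

    # s_m(x)_k = [W_tilde^0]_{m,k} * x_k
    return w_tilde[target_class] * x

# Example usage (illustrative): a 3-layer toy classifier on a flattened 8x8 patch
layers = [nn.Linear(64, 32, bias=False),
          nn.Linear(32, 16, bias=False),
          nn.Linear(16, 5, bias=False)]
s = wbp_saliency(layers, torch.randn(64), target_class=2)
```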
In this embodiment, the image occlusion operation is constrained so that the replaced local region (i.e. the causal part of the image that affects the model output) is as small as possible. Any causal saliency map that satisfies the causal relation counts as an effective occlusion regardless of the size of the occluded area, and occluding the whole image masks the causal relation just as well as occluding only the lesion region, so the solution degenerates in a way that is detrimental to the interpretability of the model. To avoid this, an L1 norm is used to encourage the causal portion of each image to occupy only a small fraction of the whole image.
S4: processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
The sub-steps of obtaining the text counterfactual sample are as follows: processing the text feature extraction network with the open-source machine learning library SHAP to obtain an importance score for each word of the question text associated with the image; and, in combination with the L1 norm, replacing the word with the highest score with [MASK] to obtain the text counterfactual sample.
A saliency map over the original question text is generated with SHAP (SHapley Additive exPlanations), a general model-interpretability framework inspired by cooperative game theory. Classical methods include Shapley regression values and Shapley sampling values. Shapley regression values retrain the model on feature subsets when computing feature contributions: for a feature i, all feature subsets containing i and excluding i are generated, the model is retrained on each and its predictions are computed, and the average contribution of feature i is obtained as:

φ_i = Σ_{S ⊆ F\{i}} [ |S|! (|F| − |S| − 1)! / |F|! ] ( f_{S∪{i}}(x_{S∪{i}}) − f_S(x_S) )

Shapley sampling values avoid retraining a new model for every subset by approximating the above equation through sampling, while Quantitative Input Influence is a broader framework for algorithmic interpretation in which the feature-contribution part is likewise approximated by sampling the Shapley values. The explanation model is specified as:

g(z′) = φ_0 + Σ_{j=1}^{M} φ_j z′_j

where g is the explanation model, z′ ∈ {0,1}^M is a coalition vector and M is the maximum length of the vector. φ_j is the contribution of feature j (its Shapley value); the coalition vector indicates which feature combinations are present in the selected data point, with 0 meaning the feature is absent and 1 meaning it is present.
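The patent uses the SHAP library for this step; as an illustration of the underlying Shapley sampling idea, the hand-rolled sketch below scores question tokens by Monte-Carlo permutation sampling and masks the highest-scoring word. The function names, the number of samples and the use of a [MASK] placeholder for absent tokens are assumptions, not the library's API.

```python
import random

def shapley_sampling_scores(predict_prob, tokens, n_samples=200, mask_token="[MASK]"):
    """Monte-Carlo approximation of Shapley sampling values over question tokens.

    predict_prob(token_list) should return the model's probability for the target
    answer class when the question is the given token list; absent tokens are
    represented by mask_token.
    """
    n = len(tokens)
    scores = [0.0] * n
    for _ in range(n_samples):
        order = list(range(n))
        random.shuffle(order)
        present = [mask_token] * n
        prev = predict_prob(present)
        for idx in order:
            present[idx] = tokens[idx]                 # reveal one more token
            curr = predict_prob(present)
            scores[idx] += (curr - prev) / n_samples   # marginal contribution of this token
            prev = curr
    return scores

def text_counterfactual(tokens, scores, mask_token="[MASK]"):
    """Replace the word with the highest importance score by [MASK]."""
    top = max(range(len(tokens)), key=lambda i: scores[i])
    return [mask_token if i == top else t for i, t in enumerate(tokens)]
```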
In this embodiment, Step 2 and Step 3 further include: removing the causal information from the picture input x_i and the question text input t_i to obtain the counterfactual samples x̃_i and t̃_i, using the following occlusion:

x̃_i = x_i − T(s_m(x_i)) ⊙ x_i

where T(·) is a masking (occlusion) function that soft-thresholds the saliency map into an occlusion mask, and ω, σ > 0 are threshold and scaling parameters — put simply, the parameters controlling the extent of the occlusion and the pixel values. The following objective function is defined:

L_Neg = ℓ( x̃_i, t̃_i, y_i ; f_θ )

where f_θ is the prediction model, L_Neg is the counterfactual-sample loss function to be optimized, t̃_i is the text counterfactual sample (the highest-scoring element of the text vector is replaced with 0), and the class label is flipped by negating the loss, i.e. ℓ̃(x, t, y; f_θ) = −ℓ(x, t, y; f_θ).

It should also be noted that this objective function can admit a degenerate solution: any causal saliency map that satisfies the causal relation counts as an effective occlusion regardless of the size of the occluded area, and occluding the whole image masks the causal relation just as well as occluding only the lesion region, which is detrimental to the interpretability of the model. To avoid this, an L1 norm is used to encourage the causal portion of each image to occupy only a small fraction of the whole image: L_reg = ||s(·)||_1.
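A sketch of the counterfactual branch under these definitions is given below; the sigmoid soft-threshold used for T(·), the default values of ω and σ, and the l1_weight are illustrative assumptions, since the patent's exact functional form of the mask is not reproduced here.

```python
import torch

def soft_occlusion_mask(saliency, omega=0.5, sigma=0.1):
    """Illustrative masking function T(.): turns a saliency map into a soft occlusion
    mask.  omega and sigma stand in for the threshold and scaling parameters."""
    s = saliency.abs()
    s = s / (s.max() + 1e-8)                      # normalize saliency to [0, 1]
    return torch.sigmoid((s - omega) / sigma)     # ~1 on causal pixels, ~0 elsewhere

def counterfactual_terms(model, x, t_cf, saliency, y, criterion, l1_weight=1e-4):
    """Sketch of the counterfactual branch: occlude causal pixels, classify, and
    compute (i) the counterfactual classification loss L_Neg, which enters the total
    objective with a minus sign (the effective label flip), and (ii) the L1
    regularizer that keeps the causal region small."""
    x_cf = x - soft_occlusion_mask(saliency) * x  # image counterfactual: x - T(s_m(x)) ⊙ x
    logits = model(x_cf, t_cf)                    # t_cf: text counterfactual (top word -> [MASK])
    l_neg = criterion(logits, y)
    l_reg = l1_weight * saliency.abs().sum()      # L_reg = ||s(.)||_1
    return l_neg, l_reg
```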
S5: introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
The sub-step of obtaining the visual question-answer prediction model is as follows: taking derivatives of the original-sample loss function, the positive-sample loss function, the counterfactual-sample loss function and the L1 norm with respect to the network layer parameters, propagating along the negative gradient direction to minimize the loss value, and continuously updating the relevant parameters iteratively to obtain the visual question-answer prediction model.
An antagonistic control group is introduced to avoid disturbances caused by the intervention strategy itself, i.e. to prevent the model from learning the intervention operation (image occlusion) instead of learning to capture causal correlations. For example, once the model detects that an input has been occluded, it may learn to change its prediction regardless of whether the image actually lacks causal features, which would affect the discrimination result. Therefore, the non-causal parts of the image and the question are randomly occluded to obtain the semi-fact samples x'_i and t'_i:

x'_i = x_i − T(s_m(x_j)) ⊙ x_i,  i ≠ j
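Building on the soft_occlusion_mask sketch above, the antagonistic control group could be generated roughly as follows; pairing sample i with the saliency map of another sample j and masking a random non-top word are illustrative choices, not the patent's exact procedure.

```python
import random

def semi_fact_samples(x_i, saliency_j, tokens_i, scores_i, mask_token="[MASK]"):
    """Antagonistic control group (sketch): occlude NON-causal content so that the
    correct answer should still be predictable.
      image: apply the occlusion mask of a different sample j (i != j),
             x'_i = x_i - T(s_m(x_j)) ⊙ x_i
      text:  mask a random word that is NOT the highest-scoring (causal) word."""
    x_semi = x_i - soft_occlusion_mask(saliency_j) * x_i
    top = max(range(len(tokens_i)), key=lambda k: scores_i[k])
    choices = [k for k in range(len(tokens_i)) if k != top]
    pick = random.choice(choices) if choices else top
    t_semi = [mask_token if k == pick else tok for k, tok in enumerate(tokens_i)]
    return x_semi, t_semi
```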
As shown in FIG. 3, according to the causal-relationship assumption, the counterfactual sample obtained after the causal intervention should predict a wrong answer, while the original factual input and the semi-fact sample should predict the correct answer. The network layer parameters are differentiated through the different loss functions and updated along the negative gradient direction to minimize the loss value; as training deepens, the causal saliency map obtained by the weight back-propagation technique gradually becomes more accurate, which constitutes the explanation produced during model training.
In this embodiment, there are specifically four loss functions: the classification loss of the original sample, the classification loss of the positive (semi-fact) sample, the classification loss of the counterfactual sample, and the L1 norm function described above. It should be noted that the loss function of the counterfactual sample enters with a negative sign, because the classification of the counterfactual sample is the result of a causal intervention and should not fall into the correct class. The objective function of the semi-fact sample is:

L_Pos = ℓ( x'_i, t'_i, y_i ; f_θ )

The objective function of the original sample is:

L_Cls = ℓ( x_i, t_i, y_i ; f_θ )

The overall objective function is:

L = L_Cls − L_Neg + L_reg + L_Pos

Optimizing the model parameters with this overall objective function helps the model capture the causal relationships in the samples and gives the model stronger interpretability.
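Putting the four terms together, a sketch of the overall objective is shown below; the batch layout and the l1_weight are assumptions for illustration.

```python
def total_objective(model, criterion, batch, l1_weight=1e-4):
    """Sketch of the overall objective L = L_Cls - L_Neg + L_reg + L_Pos.
    `batch` is assumed to hold the factual, counterfactual and semi-fact versions
    of each image/question pair together with the saliency map and answer label."""
    x, t, x_cf, t_cf, x_sf, t_sf, saliency, y = batch

    l_cls = criterion(model(x, t), y)          # factual sample must predict the correct answer
    l_neg = criterion(model(x_cf, t_cf), y)    # counterfactual sample: loss is subtracted (label flip)
    l_pos = criterion(model(x_sf, t_sf), y)    # semi-fact sample must still predict the correct answer
    l_reg = l1_weight * saliency.abs().sum()   # keep the causal region small

    return l_cls - l_neg + l_reg + l_pos
```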
S6: and extracting feature data in the second data set, and verifying the visual question-answer prediction model through the feature data to obtain an interpretable visual question-answer model.
In this embodiment, as shown in FIG. 4, the method also improves the interpretability of the model. FIG. 4 illustrates the testing procedure with a visualization technique to reveal the interpretability of the proposed method: the answer distributions for two specific question patterns are first compared, and feature maps are then used to show the most important regions of the test inputs.
In the first row of FIG. 4, the method shows the ability to capture causal relationships for the question pattern "Is there an abnormality?". This is a closed question with "yes" or "no" candidate answers, and the answer to most closed questions in the training set is "no". For a test input from VQA-RAD with an abnormality in shoulder bone density (red rectangle), the baseline method almost always answers "no" because of the imbalanced distribution, whereas the method presented in this patent outputs "yes" with a probability of about 80%. The method of the invention appears to infer the abnormal bone density in the shoulder by accurately locating the correct region, while the baseline model gives a wrong answer because it does not attend to the abnormal region in the image; this unsatisfactory performance is likely due to language bias.
Further, in the second row of FIG. 4, a similar situation occurs for the question "What is the abnormality in this CT scan?". More than 50% of the answers in the training set are "cystic teratoma", and only 10% are "colon cancer". For a test input from SLAKE with a tumour abnormality in the colon region, the method of the invention accurately identifies the lesion by capturing the causal part of the input, whereas the lesion located by the baseline model is wrong. In answer prediction, the baseline model appears to output only "cystic teratoma", following the answer distribution of the training set, while the proposed method infers the correct answer "colon cancer" from the correct lesion even though "colon cancer" is under-represented in the training set. These two examples demonstrate that the method of the invention is effective on various Med-VQA datasets, especially language-biased Med-VQA datasets.
A fact scenario refers to causal reasoning about events that occur around us, often producing thought processes of the form "if a certain condition had been changed, the result would not have occurred (if ... then ...)" or "even if a certain condition had been changed, the result would still have occurred (even if ... still ...)". The mental activity of negating an event that has already occurred and constructing an alternative possibility is called counterfactual thinking. The term counterfactual can be interpreted abstractly as an event occurring under different conditions; correspondingly there are the semi-fact and the fact, which are illustrated figuratively by the example of a bank loan in Table 4.1.
The method for constructing an interpretable visual question-answering model based on fact scenarios provided by this embodiment addresses the weak interpretability of models in current visual question-answering research, so that the model retains key causal information to enhance its reasoning ability and captures image features and text features at a finer granularity. In this embodiment, the model is tested on the benchmark datasets VQA-RAD and SLAKE; the model of the invention achieves competitive results, performing particularly well on open-ended questions, and its interpretability also carries over to visual question-answering models in other fields.
Example two
This embodiment discloses a system for constructing an interpretable visual question-answering model based on fact scenarios, which is intended to implement the construction method of Embodiment 1 and comprises a data acquisition module, a pre-training module, a first processing module, a second processing module, an iterative updating module and a verification module;
the data acquisition module is used for acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set;
the pre-training module is used for constructing a visual question-answer model, and pre-training the visual question-answer model through the first data set to obtain an image feature extraction network and a text feature extraction network;
the first processing module is used for processing the image feature extraction network by a weight back-propagation method to obtain an image counterfactual sample;
the second processing module is used for processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
the iterative updating module is used for introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
and the verification module is used for extracting the characteristic data in the second data set, and verifying the visual question-answer prediction model through the characteristic data to obtain an interpretable visual question-answer model.
The pre-training module comprises an image feature extraction module, a text feature extraction module and a network processing module,
the image feature extraction module is used for extracting image features in the first data set through a ResNet50 network in the visual question-answering model to obtain image features;
the text feature extraction module is used for embedding the words of the question text through a GloVe model and feeding the embeddings into a 1024-dimensional LSTM network to obtain text features;
and the network processing module is used for processing the image characteristics and the text characteristics through a bilinear attention network to obtain an image characteristic extraction network and a text characteristic extraction network.
Example III
This embodiment discloses a computer storage medium on which a computer program is stored; when the computer program is executed by a processor, the method of Embodiment 1 is implemented.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The foregoing specific embodiments further describe the objects, technical solutions and advantageous effects of the present invention in detail. It should be understood that the above is merely a description of specific embodiments of the present invention and is not intended to limit the scope of protection of the present invention; any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims (10)

1. A method for constructing an interpretable visual question-answering model based on fact scenarios, characterized by comprising the following steps:
acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set;
constructing a visual question-answer model, and pre-training the visual question-answer model through the first data set to obtain an image feature extraction network and a text feature extraction network;
processing the image feature extraction network by a weight back-propagation method to obtain an image counterfactual sample;
processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
and extracting feature data in the second data set, and verifying the visual question-answer prediction model through the feature data to obtain an interpretable visual question-answer model.
2. The method for constructing an interpretable visual question-answering model based on fact scenarios according to claim 1, wherein the sub-steps of obtaining the image feature extraction network and the text feature extraction network include:
in the visual question-answering model, extracting image features in the first data set through a ResNet50 network to obtain image features;
embedding the words of the question text through a GloVe model, and feeding the embeddings into a 1024-dimensional LSTM network to obtain text features;
and processing the image features and the text features through a bilinear attention network to obtain an image feature extraction network and a text feature extraction network.
3. The method for constructing an interpretable visual question-answering model based on fact scenarios according to claim 1, wherein the sub-steps of obtaining the image counterfactual sample include:
processing the image feature extraction network with the weight back-propagation (WBP) method to obtain a causal saliency map;
and, in combination with the L1 norm, approximately replacing the pixel values in the causal saliency map with 0 to obtain the image counterfactual sample.
4. The method for constructing an interpretable visual question-answering model based on fact scenarios according to claim 3, wherein the sub-steps of obtaining the text counterfactual sample include:
processing the text feature extraction network with the open-source machine learning library SHAP to obtain an importance score for each word of the question text associated with the image;
and, in combination with the L1 norm, replacing the word with the highest score with [MASK] to obtain the text counterfactual sample.
5. The method for constructing an interpretable visual question-answering model based on fact scenarios according to claim 1, wherein the pre-training specifically comprises: optimizing a symmetric loss function using cosine similarity in the gradient calculation stage of the visual question-answering model.
6. The method for constructing an interpretable visual question-answering model based on fact scenarios according to claim 1, wherein the sub-step of obtaining the visual question-answer prediction model comprises:
taking derivatives of the original-sample loss function, the positive-sample loss function, the counterfactual-sample loss function and the L1 norm with respect to the network layer parameters, propagating along the negative gradient direction to minimize the loss value, and continuously updating the relevant parameters iteratively to obtain the visual question-answer prediction model.
7. The method for constructing an interpretable visual question-answering model based on fact scenarios according to any one of claims 1 to 6, wherein in the image-text pair data set, an image-text pair is a piece of data consisting of an image together with a related question and its answer corresponding to the image, and the image-text pair data set is a set consisting of multiple such image-text pairs.
8. A system for constructing an interpretable visual question-answering model based on fact scenarios, characterized by comprising a data acquisition module, a pre-training module, a first processing module, a second processing module, an iterative updating module and a verification module;
the data acquisition module is used for acquiring a first data set and a second data set, wherein the first data set is an image-text pair data set, and the second data set is a visual question-answer data set;
the pre-training module is used for constructing a visual question-answer model, and pre-training the visual question-answer model through the first data set to obtain an image feature extraction network and a text feature extraction network;
the first processing module is used for processing the image feature extraction network by a weight back-propagation method to obtain an image counterfactual sample;
the second processing module is used for processing the text feature extraction network with an open-source machine learning library to obtain a text counterfactual sample;
the iterative updating module is used for introducing antagonistic semi-fact samples of the image and the text, and iteratively updating the visual question-answer model in combination with the image counterfactual sample and the text counterfactual sample to obtain a visual question-answer prediction model;
and the verification module is used for extracting the characteristic data in the second data set, and verifying the visual question-answer prediction model through the characteristic data to obtain an interpretable visual question-answer model.
9. The system for constructing an interpretable visual question-answering model based on fact scenarios according to claim 8, wherein the pre-training module comprises an image feature extraction module, a text feature extraction module and a network processing module,
the image feature extraction module is used for extracting image features in the first data set through a ResNet50 network in the visual question-answering model to obtain image features;
the text feature extraction module is used for embedding the words of the question text through a GloVe model and feeding the embeddings into a 1024-dimensional LSTM network to obtain text features;
and the network processing module is used for processing the image characteristics and the text characteristics through a bilinear attention network to obtain an image characteristic extraction network and a text characteristic extraction network.
10. A computer storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of claims 1 to 8.
CN202211623149.6A 2022-12-16 2022-12-16 Method and system for constructing an interpretable visual question-answering model based on fact scenarios Pending CN116306681A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211623149.6A CN116306681A (en) 2022-12-16 2022-12-16 Method and system for constructing an interpretable visual question-answering model based on fact scenarios

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211623149.6A CN116306681A (en) 2022-12-16 2022-12-16 Method and system for constructing an interpretable visual question-answering model based on fact scenarios

Publications (1)

Publication Number Publication Date
CN116306681A true CN116306681A (en) 2023-06-23

Family

ID=86798437

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211623149.6A Pending CN116306681A (en) 2022-12-16 2022-12-16 Method and system for constructing an interpretable visual question-answering model based on fact scenarios

Country Status (1)

Country Link
CN (1) CN116306681A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117077761A (en) * 2023-07-31 2023-11-17 浙江大学 Visual interpretation method and system for deep neural network model based on inverse fact comparison
CN117077761B (en) * 2023-07-31 2024-03-26 浙江大学 Visual interpretation method and system for deep neural network model based on inverse fact comparison
CN118227770A (en) * 2024-05-24 2024-06-21 阿里巴巴(中国)有限公司 Task processing method, legal question answering method and task processing model training method
CN118227770B (en) * 2024-05-24 2024-08-06 阿里巴巴(中国)有限公司 Task processing method, legal question answering method and task processing model training method

Similar Documents

Publication Publication Date Title
Liang et al. Explaining the black-box model: A survey of local interpretation methods for deep neural networks
Chen et al. Order-free rnn with visual attention for multi-label classification
Patro et al. U-cam: Visual explanation using uncertainty based class activation maps
Logeswaran et al. Sentence ordering and coherence modeling using recurrent neural networks
Cadene et al. Murel: Multimodal relational reasoning for visual question answering
Zhang et al. Mdnet: A semantically and visually interpretable medical image diagnosis network
Zhang et al. Learning noise-aware encoder-decoder from noisy labels by alternating back-propagation for saliency detection
JP2021166046A (en) Method for training convolutional neural network for image recognition using image conditional mask language modeling
CN116306681A (en) Method and system for constructing interpretive visual question-answer model based on fact scene
Zhang et al. Triplet attention and dual-pool contrastive learning for clinic-driven multi-label medical image classification
CN111275118A (en) Chest film multi-label classification method based on self-correction type label generation network
Vahadane et al. Dual encoder attention u-net for nuclei segmentation
CN109242029A (en) Identify disaggregated model training method and system
Ruwa et al. Mood-aware visual question answering
CN112420125A (en) Molecular attribute prediction method and device, intelligent equipment and terminal
Utkin et al. A new adaptive weighted deep forest and its modifications
Wu et al. FER-CHC: Facial expression recognition with cross-hierarchy contrast
Patro et al. Uncertainty class activation map (U-CAM) using gradient certainty method
Singh et al. Efficient evolving deep ensemble medical image captioning network
Ponti et al. Training deep networks from zero to hero: avoiding pitfalls and going beyond
Li et al. Unifying model explainability and robustness for joint text classification and rationale extraction
Rafi et al. A Deep Learning-Based Bengali Visual Question Answering System
Lonij et al. Open-world visual recognition using knowledge graphs
Wu et al. AGNet: Automatic generation network for skin imaging reports
CN116541507A (en) Visual question-answering method and system based on dynamic semantic graph neural network

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination