CN115239944A - Image title automatic generation method based on causal reasoning - Google Patents

Image title automatic generation method based on causal reasoning

Info

Publication number
CN115239944A
CN115239944A CN202210661517.XA
Authority
CN
China
Prior art keywords
visual
causal
image
features
confusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210661517.XA
Other languages
Chinese (zh)
Inventor
刘兵
王栋
刘明明
陆劲夫
李穗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT
Priority to CN202210661517.XA
Publication of CN115239944A
Legal status: Pending

Classifications

    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806 Fusion, i.e. combining data from various sources at the feature extraction level
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/635 Overlay text, e.g. embedded captions in a TV program
    • G06V 2201/07 Target detection
    • G06N 3/08 Learning methods (computing arrangements based on neural networks)
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06N 5/041 Abduction (inference or reasoning models)


Abstract

The invention discloses a causal-graph-based causal reasoning image caption generation method suitable for use in image captioning. A detailed causal-graph-based causal inference image captioning (CIIC) framework is constructed, comprising an unaligned-feature Transformer encoder and an interventional Transformer decoder, wherein the unaligned-feature Transformer encoder comprises a Faster R-CNN, an interventional object detector IOD and a standard Transformer encoder which are sequentially connected; the interventional Transformer decoder is formed by inserting a causal intervention CI module after the feed-forward neural network layer of a standard Transformer decoder; the interventional object detector IOD and the interventional Transformer decoder ITD jointly control the visual confounder and the textual confounder while the input image is encoded and then decoded. Confounding can be eliminated through back-door adjustment, the problem of entangled visual features in the encoded image in conventional image captioning is effectively solved, and the generated image descriptions are more robust.

Description

Image title automatic generation method based on causal reasoning
Technical Field
The invention relates to an automatic generation method for image captions, and in particular to a causal-graph-based causal reasoning image caption generation method for use in image captioning.
Background
Existing image captioning methods generally follow an encoder-decoder architecture, where image features extracted by a CNN are fed into an RNN (usually LSTM-based) to generate a corresponding sentence. Because RNN-based models are limited by their sequential structure, convolutional language models have been explored as replacements for traditional RNNs. More recently, departing from convolution operations, Transformer-based captioning models have achieved considerable results based on the multi-head attention paradigm.
However, most Transformer-based image captioning models may still learn the dataset bias caused by hidden confounders, and how to deal with the visual and linguistic confounding in image captioning introduced by this bias remains unexplored. In terms of visual representation, most models employ pre-trained detectors, which ignore the problem of entangled visual features in the image. In terms of model structure, most current Transformer-based image captioning models overlook two elusive confounders: the visual confounder and the linguistic confounder, which typically lead to bias during training, spurious correlations during testing and reduced generalization ability. Therefore, a new method is needed to remove the spurious correlations and dataset bias caused by the confounders that these captioning models may learn.
Recently, there has been a trend of introducing causal reasoning into different deep learning models. These efforts make it possible to endow deep neural networks with the ability to learn causal effects. Causal effects significantly improve the performance of many Computer Vision (CV) and Natural Language Processing (NLP) models, including image classification, semantic segmentation, visual feature representation, visual dialog, image captioning, and dialog generation. Existing research analyzes the spurious correlation between visual features and captions from a causal perspective and proposes a deconfounded image captioning (DIC) framework to deal with the confounder, but it still has limitations. In its causal graph, the entire dataset is treated as the confounder, which is difficult to stratify and must be eliminated through complex front-door adjustment based on an additional mediator. Moreover, DIC focuses on removing the confounder in the decoder and neglects the confounder of the visual features in the encoder, resulting in severe performance degradation.
Disclosure of Invention
The technical problem is as follows: aiming at the defects of the prior art, an image title automatic generation method based on causal reasoning is provided, which can handle the visual and linguistic confounders in the sentence generation process simultaneously, presents a more detailed causal graph, and significantly improves the performance of Transformer-based image captioning models.
The technical scheme is as follows: in order to achieve the above technical object, the image title automatic generation method based on causal reasoning and the Transformer according to the present invention is characterized in that a detailed causal inference image captioning CIIC framework is constructed on the basis of a causal graph, wherein the causal graph comprises a visual confounder and a textual confounder;
the causal inference image captioning CIIC framework comprises a sequentially connected unaligned-feature Transformer encoder UFT and an interventional Transformer decoder ITD, wherein the unaligned-feature Transformer encoder comprises a sequentially connected Faster R-CNN, an interventional object detector IOD and a standard Transformer encoder; the interventional Transformer decoder is formed by inserting a causal intervention CI module after the feed-forward neural network layer of a standard Transformer decoder; the interventional object detector IOD and the interventional Transformer decoder ITD jointly control the visual confounder and the textual confounder while the input image is encoded and then decoded;
the method comprises the steps that a non-alignment feature Transformer encoder UFT sends confusion-removing visual features extracted by an IOD and bottom-up features extracted from the same image into two linear layers for mapping to generate Q, K and V vectors, integration is carried out through self attention and cross attention, addForm operation and feedforward propagation operation in a traditional Transformer are carried out, obtained outputs are transmitted to a next coding block, and the obtained outputs are L blocks in total, namely L codes are overlapped; the ITD input of the interference-predictive Transformer decoder is a currently generated sentence part, the ITD is subjected to cross attention with the final output of an encoding end through position embedding and a mask layer, addMorm operation and feed-forward propagation operation are carried out, visual and language confusion in the decoding process is eliminated through a causal intervention CI module, then the AddMorm operation is carried out, and L-time decoding is repeated to obtain final prediction output; causal intervention CI module is to fuse visual and linguistic features h 2 With a visual confusion factor D 1 And a language confusion factor D 2 A combination of the expectations of;
the interventional object detector IOD separates the region-based visual features by eliminating visual confusion factors: separating the characteristics of the region of interest through an intervention type object sensor, and then combining the characteristics with the bottom-up characteristic of a FASTER regional convolutional neural network FASTER R-CNN to be used as the input of a transform encoder; the intervention target detector IOD integrates causal reasoning into the image features extracted by FASTER R-CNN to cope with the visual confusion extracted by the traditional pre-training model, so as to obtain a non-entangled representation based on regions; inputting a result generated in a decoding stage into an intervention type Transformer decoder ITD, introducing causal intervention into a Transformer decoder used for generating a traditional image title, and reducing visual and language confusion in a decoding process;
the visual and linguistic concepts are simultaneously established through the encoder and the decoder, unobserved aliasing factors between the intervention type target detector IOD and the intervention type Transformer decoder ITD are relieved, visual and linguistic aliasing is eliminated, pseudo-correlation occurring in visual feature representation and title generation is effectively eliminated, and finally a more real image title is generated.
The method comprises the following specific steps:
extracting image features of the image for which a title is to be generated through Faster R-CNN, and eliminating the visually confounded regional features among them by using the interventional object detector IOD;
specifically, since the Faster R-CNN object detector uses the likelihood estimate P(Y|X) as the training objective of the classifier, spurious correlations caused by the confounder Z arise:

P(Y|X) = Σ_z P(Y|X, Z=z) P(Z=z|X)

where X is the region-based visual feature of the input image, Z is the visual confounder of the image, and Y is the category label;
therefore, the causal intervention P(Y|do(X)) is used as the new classification objective for object detection, where the do operator do(·) cuts the link Z → X; since estimating P(Y|do(X)) by sampling during actual training would make the training time prohibitive, the normalized weighted geometric mean (NWGM) approximation is applied, and the class probability output by the interventional object detector is:

P(y_i^c | do(x)) ≈ p_i^c( concat(x, Σ_{j=1}^{n} P(z_j) z_j) )

where concat denotes matrix concatenation, y_i^c is the i-th class label, and p_i^c(·) is the probability output of the pre-trained classifier for the i-th class given x; x denotes a region feature of the specific input image and y_i^c its corresponding class label, X and Y denote the random variables of x and y_i^c, and x and y_i^c denote specific sample values;

the confounder is approximated as a fixed confounder dictionary Z = [z_1, z_2, ..., z_n], where n denotes the number of classes in the dataset and z_i denotes the average RoI feature of the i-th class, each RoI feature being pre-extracted by Faster R-CNN; the IOD feature extractor works as follows: the Faster regional convolutional neural network Faster R-CNN extracts regions of interest RoI on the feature map, the feature of each region of interest RoI is used to predict a bounding box y_B and a class probability output label y_C disturbed by surrounding visual confounders, and according to the class probability output label y_C and the confounder dictionary Z, the do operator is performed to predict the final class label y_I, eliminating the interference of surrounding visual confounders;
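As a rough illustration of the interventional classification head described above, the following sketch (assuming PyTorch) concatenates each RoI feature with the expectation of a fixed confounder dictionary of per-class average RoI features before computing class probabilities, which is one way the NWGM-approximated P(Y|do(X)) could be realized; the uniform prior P(z_j) = 1/n and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class InterventionalClassifier(nn.Module):
    def __init__(self, confounder_dict, num_classes, feat_dim=1024):
        super().__init__()
        # confounder_dict: (n, feat_dim) tensor of per-class average RoI features.
        self.register_buffer("Z", confounder_dict)
        # Assumed uniform prior P(z_j) = 1/n over the confounder dictionary.
        self.prior = torch.full((confounder_dict.size(0),), 1.0 / confounder_dict.size(0))
        self.fc = nn.Linear(feat_dim * 2, num_classes)

    def forward(self, x):
        # x: (num_rois, feat_dim) RoI features from Faster R-CNN.
        ez = (self.prior.to(x.device).unsqueeze(1) * self.Z).sum(dim=0)  # E_Z[Z], shape (feat_dim,)
        ez = ez.expand_as(x)
        logits = self.fc(torch.cat([x, ez], dim=-1))   # concat(x, E_Z[Z]) -> class logits
        return logits.softmax(dim=-1)                  # approximate P(y^c | do(x))

head = InterventionalClassifier(torch.randn(80, 1024), num_classes=80)
probs = head(torch.randn(100, 1024))   # (100, 80) deconfounded class probabilities
```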
extracting deconfounded object features from all RoI candidate regions by using the interventional object detector IOD as the IOD features, and sending the IOD features and the bottom-up features extracted from the same image, which have discriminative ability over different object attributes, into two linear layers for mapping to generate Q, K and V vectors, where Q denotes the query vector, K denotes the vector measuring the correlation between the query and other information, and V denotes the vector of the queried information; integrating them through self attention and cross attention to promote the visual representation of the CIIC model; because the bottom-up features and the IOD features are not aligned, a multi-view Transformer encoder, i.e. the unaligned-feature Transformer encoder UFT, is introduced to reconcile them, and the bottom-up features and the IOD features are input to the UFT encoder for the alignment and fusion operations:
let the bottom-up features and the IOD features extracted from the image be X_F ∈ R^{m×d1} and X_I ∈ R^{n×d2}, where m ≠ n and d1 ≠ d2; two linear layers constructed in the Transformer network convert X_F and X_I into a common d-dimensional space, denoted respectively by X̂_F ∈ R^{m×d} and X̂_I ∈ R^{n×d}; X̂_I is selected as the principal visual feature, and its cross attention over X̂_F is learned with the following formula:

A_cross = Multihead(X̂_I, X̂_F, X̂_F)

where Multihead(·) denotes the multi-head attention function of a standard Transformer and X̂_F provides the corresponding key and value features; likewise, a multi-head attention model on X̂_I is established:

A_self = Multihead(X̂_I, X̂_I, X̂_I)

in which the multi-head attention is computed from X̂_I alone, i.e. Q, K and V all come from X̂_I; therefore the attended features A_cross and A_self have the same shape, and the residual standard layer AddNorm is then used to fuse and encapsulate them with X̂_I, the fused feature information F being:

F = LayerNorm(X̂_I + A_cross + A_self)

where LayerNorm denotes layer normalization; finally, the fused feature information F is sent to the FFN module, i.e. the feed-forward neural network in the Transformer, to generate the UFT encoding result;
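A minimal sketch of the fusion step described above, assuming PyTorch: the projected IOD features attend to the projected bottom-up features (cross attention) and to themselves (self attention), the three terms are fused with AddNorm, and an FFN produces the block output; the residual AddNorm around the FFN and the layer sizes are assumptions in the style of a standard Transformer block.

```python
import torch
import torch.nn as nn

class UFTBlock(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x_iod, x_bu):
        # x_iod: (B, n, d) projected IOD features, x_bu: (B, m, d) projected bottom-up features
        a_cross, _ = self.cross_attn(x_iod, x_bu, x_bu)   # Multihead(X_I, X_F, X_F)
        a_self, _ = self.self_attn(x_iod, x_iod, x_iod)   # Multihead(X_I, X_I, X_I)
        f = self.norm1(x_iod + a_cross + a_self)          # AddNorm fusion of the three terms
        return self.norm2(f + self.ffn(f))                # FFN with assumed residual AddNorm

block = UFTBlock()
out = block(torch.randn(2, 100, 512), torch.randn(2, 100, 512))  # (2, 100, 512)
```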
in order to alleviate the spurious correlation between the attended visual features and the words of corresponding meaning, a standard Transformer decoder structure is constructed, and a causal intervention module CI is integrated into each Transformer decoder layer; the region-based disentangled representation obtained by the encoder and the text are taken as the input of the decoder, and visual and linguistic confounding in the decoding process is eliminated through the causal intervention module CI, so as to generate the final image caption description.
Further, a structural causal model (SCM) is used to construct the causal relationships among the attended visual feature V, the visual context D_1, the linguistic context D_2, the attended word feature h_1 of the partially generated sentence, the fused feature h_2 and the predicted word W: the true causal effect is V → W; the visual context D_1 and the linguistic context D_2 influence the visual feature V and the predicted word W respectively; the linguistic context D_2 influences the visual feature V through the attended word feature h_1; the word feature h_1 and the visual feature V jointly influence the fused feature h_2, which finally influences the predicted word W. Specifically, the causal effect V → W indicates that the attended visual features lead to the generation of the corresponding words; the causal effect of D_1 on V, denoted D_1 → V, arises because some frequently occurring visual context severely affects the attended visual features when a captioning model is trained, and the causal effect D_1 → W means that the visual context directly affects the frequency of occurrence of some related words in the generated description; D_2 → h_1 → V indicates that the attended word features, affected by the linguistic context, guide the attended visual features through multi-head cross attention; h_1 → h_2, V → h_2 and h_2 → W indicate that the decoder fuses the visual feature and the language feature and uses the fused feature h_2 to infer the next predicted word W. When the observational probability P(W|V, h_1) without causal intervention is used as the training objective, due to the confounders D_1 and D_2 the generative description model may learn some spurious correlations between the visual feature V and the predicted word W; to explain causal intervention in caption generation, P(W|V, h_1) is expressed as:
P(W|V, h_1) = Σ_{d_1} Σ_{d_2} P(W|V, h_1, d_1, d_2) P(d_1|V) P(d_2|h_1)
where the confounders D_1 and D_2 usually introduce observational bias through P(d_1|V) and P(d_2|h_1); the causal intervention P(W|do(V), do(h_1)) is used instead of the conventional image captioning training objective P(W|V, h_1), thereby eliminating the causal effect of D_1 on the visual feature V and of D_2 on the attended word feature h_1, so that the two back-door paths V ← D_1 → W and h_1 ← D_2 → W are blocked and the spurious associations are eliminated; assuming that the confounders D_1 and D_2 can each be stratified, P(W|do(V), do(h_1)) can be written, according to back-door adjustment, as:
P(W|do(V), do(h_1)) = Σ_{d_1} Σ_{d_2} P(W|V, h_1, d_1, d_2) P(d_1) P(d_2)
thus, according to the above adjustment, the caption generation model (whose predicted output probability is P(W|do(V), do(h_1))) is forced to learn the true causal effect V → W instead of the spurious associations caused by the visual confounder D_1 and the linguistic confounder D_2; since D_1 and D_2 are both unobserved and lie beyond the objective of caption generation, an approximate visual confounder dictionary D_1 and linguistic confounder dictionary D_2 need to be constructed; a visual matrix V_r ∈ R^{c×d_v} is constructed by setting each entry to the average RoI feature of the objects of one class in the image classification dataset, where c is the number of classes in the training dataset and d_v denotes the dimension of each RoI feature; at the same time, the d_e-dimensional word embeddings W_e ∈ R^{N×d_e} of a predefined vocabulary are used to construct the semantic space, where N is the length of the vocabulary and d_e is the word feature dimension; the captioning model is then trained to learn two linear projections P_v and P_w, which convert the visual matrix V_r and the word embeddings W_e into D_1 and D_2 respectively by the formulas D_1 = V_r P_v and D_2 = W_e P_w, and the NWGM approximation is used to compute:
P(W|do(V), do(h_1)) ≈ Softmax{ g(h_2, E_{D_1}[D_1], E_{D_2}[D_2]) }
where g(·) denotes a fully connected layer, and E_{D_1}[D_1] and E_{D_2}[D_2] are the expectations of D_1 and D_2 conditioned on the fused feature h_2, which increases the representation capability of the interventional Transformer decoder; P(W|do(V)) denotes the probability of predicting the generated word after eliminating the visual confounding features, and P(W|do(V), do(h_1)) denotes the predicted output probability after eliminating both the linguistic context confounding and the visual confounding features.
Further, the unaligned-feature Transformer encoder comprises Faster R-CNN, the interventional object detector IOD and a standard Transformer encoder comprising a multi-head attention layer, a residual standard layer and a feed-forward neural network layer;
the interventional Transformer decoder is formed by inserting a causal intervention CI module after the feed-forward neural network layer of a standard Transformer decoder, wherein the standard Transformer decoder comprises a masked attention layer, a multi-head attention layer, a residual standard layer and a feed-forward neural network layer;
the part of the unaligned-feature Transformer encoder consisting of the multi-head attention layer, the residual standard layer and the feed-forward neural network layer is stacked L times; the part of the interventional Transformer decoder consisting of the masked attention layer, the multi-head attention layer, the residual standard layer, the feed-forward neural network layer and the causal intervention CI module is stacked L times;
the Transformer decoder and the Transformer encoder both comprise a multi-head attention layer, a residual standard layer and a feed-forward neural network layer; through the visual dictionary D_1 and the language dictionary D_2, the causal intervention module CI of the interventional Transformer decoder combines the fused feature h_2 with the visual confounder D_1 and the linguistic confounder D_2 to predict the next word at each time step, a begin marker being entered as text at the start of prediction and the last generated word being entered as text at each subsequent time step; that is, through back-door adjustment, cutting the confounding links effectively eliminates the unobserved confounders, which constitutes the causal intervention;
firstly, bottom-up features are extracted from the input image through Faster R-CNN and deconfounded object features are extracted from the RoI candidate regions by the interventional object detector IOD; the UFT encoder takes the bottom-up features and the IOD features as input and performs the alignment and fusion operations; the interventional Transformer decoder takes the integrated visual features as input and combines the input word information at each time step, and the output of the last decoder layer is projected to an N-dimensional space by a linear embedding layer, where N is the vocabulary size; finally, the probabilities of the words in the vocabulary are predicted using a softmax operation to generate the final predicted word; during training the word at each time step comes from the ground-truth annotated sentence, while at inference the output word of the previous time step is used as input.
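The autoregressive word prediction described above can be illustrated with the following sketch (assuming PyTorch): generation starts from a begin-of-sentence marker, at every time step the previously generated words are fed back in, and the decoder output projected to the vocabulary is passed through a softmax to pick the next word; the model interface, token ids and the trivial dummy model used to make the example runnable are all assumptions.

```python
import torch

@torch.no_grad()
def greedy_decode(model, visual_feats, bos_id=1, eos_id=2, max_len=20):
    # Start every sequence with the begin-of-sentence marker.
    words = torch.full((visual_feats.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model(visual_feats, words)            # (B, t, vocab_size)
        next_word = logits[:, -1].softmax(-1).argmax(-1, keepdim=True)
        words = torch.cat([words, next_word], dim=1)   # feed the last generated word back in
        if (next_word == eos_id).all():
            break
    return words

class DummyCaptioner(torch.nn.Module):
    # Stand-in with the assumed (visual_feats, words) -> logits interface.
    def __init__(self, vocab=100, d=16):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, d)
        self.out = torch.nn.Linear(d, vocab)
    def forward(self, visual_feats, words):
        return self.out(self.emb(words) + visual_feats.mean(1, keepdim=True))

caption_ids = greedy_decode(DummyCaptioner(), torch.randn(2, 100, 16))
```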
Further, the causal inference image captioning CIIC framework is pre-trained:
firstly, pre-training is carried out with word-level cross entropy; the training set comprises images and their corresponding description sentences, and the loss function is:
L_XE(θ) = - Σ_{t=1}^{T} log p_θ( w*_t | w*_{1:t-1} )
where θ denotes all parameters of the causal inference image captioning CIIC framework model, including weights and biases, and w*_{1:T} is the target ground-truth sequence; the non-differentiable metric of the model is then optimized through reinforcement learning (RL), adopting a variant of self-critical sequence training (SCST) on beam-search sampled sequences and minimizing the negative expected score:
L_RL(θ) = - E_{w_{1:T} ~ p_θ} [ r(w_{1:T}) ]
where the reward r(·) is the CIDEr-D score;
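A minimal sketch, assuming PyTorch, of the two training objectives above: word-level cross entropy against the ground-truth sequence w*_{1:T}, and an SCST-style policy-gradient loss that minimizes the negative expected reward of sampled sequences relative to a baseline; computation of the CIDEr-D reward itself is assumed to be provided externally.

```python
import torch
import torch.nn.functional as F

def xe_loss(logits, target_ids):
    # logits: (B, T, vocab), target_ids: (B, T) ground-truth word ids w*_{1:T}
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

def scst_loss(sample_log_probs, sample_reward, baseline_reward):
    # sample_log_probs: (B, T) log-probabilities of the sampled sequence
    # sample_reward / baseline_reward: (B,) CIDEr-D of sampled vs. baseline captions
    advantage = (sample_reward - baseline_reward).unsqueeze(1)   # self-critical baseline
    return -(advantage * sample_log_probs).mean()                # approximates -E[r(w_{1:T})]

loss = xe_loss(torch.randn(2, 12, 100), torch.randint(0, 100, (2, 12)))
rl = scst_loss(torch.randn(2, 12), torch.rand(2), torch.rand(2))
```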
testing the trained causal inference image captioning CIIC framework: sentences are generated word by word in sequence using beam search; the image to be recognized is input into the trained model and, after a series of processing steps, passed to the decoder; in the first decoding step the top k candidates are considered; k second words are generated for the k first words, and the top k [first word, second word] combinations are selected according to their scores; for the k second words, k third words are selected, and the top k [first word, second word, third word] combinations are retained; each decoding step is repeated in this way; after the k sequences are finished, the sequence with the best overall score, i.e. the sequence with the highest probability in the final beam, is selected.
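A minimal sketch of the beam-search procedure described above, assuming PyTorch: at each step every partial sequence in the beam is expanded with its top-k next words, the top-k combinations are kept by cumulative log probability, and the best-scoring finished sequence in the final beam is returned; the step_logits callable, token ids and the toy scorer are assumptions.

```python
import torch

@torch.no_grad()
def beam_search(step_logits, bos_id=1, eos_id=2, beam_size=3, max_len=20):
    beams = [(torch.tensor([bos_id]), 0.0)]             # (sequence, cumulative log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1].item() == eos_id:
                finished.append((seq, score))
                continue
            log_probs = step_logits(seq.unsqueeze(0)).log_softmax(-1)[0, -1]
            top_lp, top_ids = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp, top_ids):         # expand each prefix with k next words
                candidates.append((torch.cat([seq, idx.view(1)]), score + lp.item()))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]          # best-scoring sequence in the final beam

# Example with a toy scorer producing random next-word logits over a 50-word vocabulary:
best = beam_search(lambda words: torch.randn(1, words.size(1), 50))
```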
The beneficial effects are as follows:
1) From the causal perspective, the method adopts a new Transformer-based image captioning architecture CIIC, seamlessly combines causal intervention into object detection and caption generation, and jointly alleviates the confounding effects. On the one hand, the proposed IOD effectively disentangles visual features and facilitates deconfounded image captioning. On the other hand, the proposed ITD employs causal intervention to handle visual and linguistic confounders in the sentence generation process simultaneously;
2) The method decomposes the confounder into visual and textual confounders and presents a more detailed causal graph.
3) The method significantly improves the performance of Transformer-based image captioning models and achieves the best current image captioning performance under the single-model setting on the MS-COCO dataset.
Drawings
Fig. 1 is a diagram of an image caption framework used in the automatic image caption generation method based on causal reasoning according to the present invention.
FIG. 2 is a schematic diagram of the structure of an interventional object detector used in the method of the invention.
FIG. 3 is a diagram illustrating causal intervention in the image description of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings of the specification:
as shown in fig. 1, the automatic image title generation method based on causal reasoning according to the present invention first divides the confusion factor, and divides the existing causal graph into two categories: visual confusion and text confusion. A causal inference method image caption (CIIC) framework structure based on a causal graph comprises the following steps: both types of aliasing are faced jointly by an Interventional Object Detector (IOD) and an Interventional Transformer Decoder (ITD). The IOD integrates causal reasoning into FASTER R-CNN to cope with visual confusion, with the aim of obtaining a region-based non-entangled representation. ITD eliminates visual and linguistic confusion by doing it at the transform decoder stage. Region-of-interest features are first separated by an interventional object perceptron (IOD) and then combined with the bottom-up property of FASTER R-CNN as input to the transform encoder. In CIIC, we propose a causal intervention module to deal with visual and linguistic confusion in word prediction. Our CIIC can effectively eliminate the pseudo-correlation that occurs in visual feature representation and title generation to obtain a more realistic image title.
In the interventional object detector (IOD), Faster R-CNN is used as the visual backbone to extract regions of interest (RoI) on the feature map. Each RoI feature is used to predict a class probability output y_C and a bounding box y_B respectively. According to the class probability output y_C and the confounder dictionary Z, we perform the do operator to predict the final class label y_I.
The region-of-interest features are disentangled by the interventional object detector and then combined with the bottom-up features of Faster R-CNN as input to the Transformer encoder. In CIIC, a causal intervention module (Causal Intervention) is proposed to deal with visual and linguistic confounding in word prediction. The symbol "L ×" in fig. 1 indicates that the encoding block (comprising the multi-head attention layer, the residual standard layer and the feed-forward neural network layer) and the decoding block (comprising the masked attention layer, the multi-head attention layer, the residual standard layer, the feed-forward neural network layer and the causal intervention CI module) in the dashed boxes are stacked L times. CIIC can effectively eliminate the spurious correlations occurring in visual feature representation and caption generation to obtain a more faithful image caption.
Specifically:
the image features of the title to be generated are respectively extracted through FASTER R-CNN, and the regional features which are proposed through an Intervention Object Detector (IOD) and eliminate visual confusion are provided.
The interventional object detector adopts the following approach:
conventional object detectors, such as Faster R-CNN, basically use the likelihood estimate P(Y|X) as the training objective of the classifier, resulting in spurious correlations caused by the confounder Z:

P(Y|X) = Σ_z P(Y|X, Z=z) P(Z=z|X)

where X is the region-based visual feature of the input image, Z is the visual confounder of the image, and Y is the category label.
We propose to use the causal intervention P(Y|do(X)) as the new classification objective for object detection, where the do operator do(·) cuts the link Z → X. Since estimating P(Y|do(X)) by sampling during actual training would be time-consuming and labor-intensive, making the training time prohibitive, the normalized weighted geometric mean (NWGM) approximation is applied, and the class probability output by the interventional object detector is:

P(y_i^c | do(x)) ≈ p_i^c( concat(x, Σ_{j=1}^{n} P(z_j) z_j) )

where concat denotes matrix concatenation, y_i^c is the i-th class label, and p_i^c(·) is the probability output of the pre-trained classifier for the i-th class given x; x denotes a region feature of the specific input image and y_i^c its corresponding class label; X and Y denote the random variables of x and y_i^c, and x and y_i^c denote specific sample values.

The confounder is approximated as a fixed confounder dictionary Z = [z_1, z_2, ..., z_n], where n denotes the number of classes in the dataset and z_i denotes the average RoI feature of the i-th class, each RoI feature being pre-extracted by Faster R-CNN. The structure of the IOD feature extractor is shown in FIG. 2: Faster R-CNN (the Faster regional convolutional neural network) extracts regions of interest RoI on the feature map, such as "the upper body of a child in blue clothes", and each RoI feature is used to predict the class probability output label y_C (disturbed by surrounding visual confounders) and the bounding box y_B; according to the class probability output label y_C and the confounder dictionary Z, the do operator is performed to predict the final class label y_I (eliminating the interference of surrounding visual confounders);
and thirdly, deconfounded object features, i.e. IOD features, are extracted from all RoI candidate regions by the IOD extractor; considering that the bottom-up features extracted from the same image have discriminative ability over different object attributes, the IOD features and the bottom-up features are sent into two linear layers and mapped to generate Q, K and V vectors, where Q denotes the query vector, K denotes the vector measuring the correlation between the query and other information, and V denotes the vector of the queried information. They are integrated through self attention and cross attention to promote the visual representation of the CIIC model; since the bottom-up features and the IOD features are misaligned, a multi-view Transformer encoder, i.e. the unaligned-feature Transformer encoder UFT, is introduced to reconcile them; the UFT encoder takes the misaligned visual features (the bottom-up features and the IOD features) as input and performs the alignment and fusion operations:
Let X_F ∈ R^{m×d1} and X_I ∈ R^{n×d2} denote the bottom-up features and the IOD features extracted from an image respectively, where m ≠ n and d1 ≠ d2. Two linear layers (constructed in the Transformer network) convert X_F and X_I into a common d-dimensional space, denoted respectively by X̂_F ∈ R^{m×d} and X̂_I ∈ R^{n×d}. X̂_I is selected as the principal visual feature, and its cross attention over X̂_F is learned with the following formula:

A_cross = MultiHead(X̂_I, X̂_F, X̂_F)

where MultiHead(·) denotes the multi-head attention function of a standard Transformer and X̂_F provides the corresponding key and value features. Likewise, a multi-head attention model on X̂_I is established:

A_self = MultiHead(X̂_I, X̂_I, X̂_I)

Note that A_cross and A_self have the same shape (in the above formula the multi-head attention is computed from X̂_I alone, i.e. Q, K and V all come from X̂_I). Then AddNorm (the residual standard layer) is used for encapsulation, and the fused feature information F is:

F = LayerNorm(X̂_I + A_cross + A_self)

where LayerNorm denotes layer normalization; finally the fused feature information F is sent to the FFN module (the feed-forward neural network in the Transformer) to generate the encoding result of the UFT;
and fourthly, in order to alleviate the spurious correlation between the attended visual features and the corresponding words, a Transformer-based decoder structure is constructed, and a causal intervention module is integrated into each Transformer decoder layer to cope with visual and linguistic confounding in image description; as shown in FIG. 1, the causal intervention module is introduced into the conventional Transformer decoder, the region-based disentangled representation obtained by the encoder and the text are taken as the input of the decoder, and visual and linguistic confounding in the decoding process is eliminated through the causal intervention module to generate the final image description.
Causal intervention in image description: as shown in fig. 3, the true causal effect V → W is captured by cutting the two links D_2 → h_1 and D_1 → V respectively, thereby blocking the back-door paths V ← h_1 ← D_2 → W and V ← D_1 → W.
The structural causal model (SCM) is used to construct the causal relationships among the attended visual feature V, the visual context D_1, the linguistic context D_2, the attended word feature h_1 of the partially generated sentence, the fused feature h_2 and the predicted word W:
specifically, the causal effect V → W indicates that the attended visual features lead to the generation of the corresponding words; the causal effect of D_1 on V, denoted D_1 → V, arises because some frequently occurring visual context severely affects the attended visual features when a captioning model is trained, and the causal effect D_1 → W means that the visual context directly affects the frequency of occurrence of some related words in the generated description; D_2 → h_1 → V indicates that the attended word features, affected by the linguistic context, guide the attended visual features through multi-head cross attention; h_1 → h_2, V → h_2 and h_2 → W indicate that the decoder fuses the visual feature and the language feature and uses the fused feature h_2 to infer the next predicted word W. When the observational probability P(W|V, h_1) (the probability observed without causal intervention) is used as the training objective, due to the confounders D_1 and D_2 the generative description model may learn some spurious correlations between the visual feature V and the predicted word W; to explain causal intervention in caption generation, P(W|V, h_1) is expressed as:
P(W|V, h_1) = Σ_{d_1} Σ_{d_2} P(W|V, h_1, d_1, d_2) P(d_1|V) P(d_2|h_1)
where the confounders usually introduce observational bias through P(d_1|V) and P(d_2|h_1); the causal intervention P(W|do(V), do(h_1)) is used instead of the conventional image captioning training objective P(W|V, h_1), thereby eliminating the causal effect of D_1 on the visual feature V and of D_2 on the attended word feature h_1, so that the two back-door paths V ← D_1 → W and h_1 ← D_2 → W are blocked and the spurious associations are eliminated; assuming that the confounders D_1 and D_2 can each be stratified, P(W|do(V), do(h_1)) can be written, according to back-door adjustment, as:
P(W|do(V), do(h_1)) = Σ_{d_1} Σ_{d_2} P(W|V, h_1, d_1, d_2) P(d_1) P(d_2)
thus, according to the adjustment in the formula, the caption generation model (whose predicted output probability is P(W|do(V), do(h_1))) is forced to learn the true causal effect V → W instead of the spurious associations caused by the visual confounder D_1 and the linguistic confounder D_2; since D_1 and D_2 are both unobserved and lie beyond the objective of caption generation, an approximate visual confounder dictionary D_1 and linguistic confounder dictionary D_2 need to be constructed (the visual features and word embeddings are obtained by linear projection; these dictionaries are coarse approximations of the true confounders). A visual matrix V_r ∈ R^{c×d_v} is constructed by setting each entry to the average RoI feature of the objects of one class (a class of the image classification dataset), where c is the number of classes in the training dataset (a common standard dataset) and d_v denotes the dimension of each RoI feature; at the same time, the d_e-dimensional word embeddings W_e ∈ R^{N×d_e} of a predefined vocabulary are used to construct the semantic space, where N is the length of the vocabulary and d_e is the word feature dimension; the captioning model is then trained to learn two linear projections P_v and P_w, which convert the visual matrix V_r and the word embeddings W_e into D_1 and D_2 respectively by the formulas D_1 = V_r P_v and D_2 = W_e P_w, and the NWGM approximation is used to compute:
P(W|do(V), do(h_1)) ≈ Softmax{ g(h_2, E_{D_1}[D_1], E_{D_2}[D_2]) }
where g(·) denotes a fully connected layer, and E_{D_1}[D_1] and E_{D_2}[D_2] are the expectations of D_1 and D_2 conditioned on the fused feature h_2, which increases the representation capability of the ITD;
transformer decoder architecture: the transform decoder architecture is shown in fig. 1, where the non-aligned feature transform (UFT) encoder consists of FASTER R-CNN (FASTER regional convolutional neural network), an intervening target detector IOD, and a standard transform encoder (including multi-headed attention layer, residual standard layer, and feedforward neural network layer), and the intervening transform decoder consists of a causal intervening CI module inserted after the feedforward neural network layer module of the standard transform decoder (corresponding to the light red layer above the middle dashed box decoder in fig. 1), where the symbol "L ×" represents that the coding block (including multi-headed attention layer, residual standard layer, and feedforward neural network layer) and the decoding block (including masked attention layer, multi-headed attention layer, residual standard layer, and feedforward neural network layer, and causal intervening CI module) in the dashed box are stacked L times. A general Transformer decoder is similar to a Transformer encoder and comprises a multi-head attention layer, a residual standard layer and a feedforward neural network layer, except that a plurality of covering self-attention layers are stacked in sequence by L identical decoder layers 1 And language dictionary D 2 CI Module will fuse features h 2 With a visual confusion factor D 1 And a language confusion factor D 2 Combined to predict the next word at each time step (a beginning marker is entered as text at the beginning of the prediction, followed by the last generated word as text entry at each time step). I.e. it is in fact a causal intervention by backgate adjustment (elimination of confounders that would not have been observed by cutting the chain of confounders). Firstly, extracting bottom-up features from an input image through FASTER R-CNN, extracting confusion object features from a RoI candidate region through an intervention target detector IOD, performing alignment and fusion operations by using unaligned visual features (the bottom-up features and the IOD features) as input by a UFT (unified view transform) encoder, and performing integrated visual features on each object by using an Intervention Transform Decoder (ITD)The time step is combined with the input word information and the output of the last decoder layer is then projected by the linear embedding layer into an N-dimensional space, where N is the vocabulary size. Finally, the probability of predicting words in the vocabulary using softmax operation yields the final predicted words (each time step word from the real labeled sentence at training, the final prediction being the output word at the last time step at the input).
Pre-training of the CIIC model: the model is pre-trained with word-level cross entropy (the training set comprises images and their corresponding description sentences), and the loss function is:
L_XE(θ) = - Σ_{t=1}^{T} log p_θ( w*_t | w*_{1:t-1} )
where θ denotes all parameters (including weights and biases) of the CIIC model and w*_{1:T} is the target ground-truth sequence; the non-differentiable metric of the model is then optimized through reinforcement learning (RL), adopting a variant of self-critical sequence training (SCST) on beam-search sampled sequences and minimizing the negative expected score:
L_RL(θ) = - E_{w_{1:T} ~ p_θ} [ r(w_{1:T}) ]
where the reward r(·) is the CIDEr-D score;
in the testing phase, sentences are generated word by word in sequence using beam search, the trained model is input into the image to be recognized, and then the image is input into the decoder through a series of processes, and in the first decoding step, the top k candidates are considered. K second words are generated for the k first words. The first k [ first word, second word ] combinations are selected, taking into account the score obtained. For the k second words, k third words are selected, and the top k first, second, third word combinations are selected. Each decoding step is repeated. And after finishing the k sequences, selecting the sequence with the best comprehensive score to obtain the sequence with the highest probability in the last bundle.
To represent image features, the proposed IOD is first trained on the MSCOCO dataset to extract 1024-dimensional IOD features of the top 100 objects with the highest confidence. The 2048-dimensional bottom-up features of the detected objects are then extracted using a pre-trained Up-Down model. These two kinds of features are finally projected linearly to the model input dimension d = 512 and fed into the UFT encoder. Words were represented in the experiments using one-hot vectors and pre-trained GloVe word embeddings respectively; both are linearly projected onto the 512-dimensional input vectors of the ITD. To represent the word positions in the sentence, the input vectors and their sinusoidal position encodings are added before the first decoding layer. Words outside the vocabulary are represented as all-zero vectors. An Adam optimizer is used in the training phase, with a batch size of 10 and a beam size of 5. A step-decay scheme with a warm-up of 20000 iterations is used to vary the learning rate. All models are first trained for 30 epochs with cross-entropy loss, and then further optimized for another 30 epochs with the CIDEr reward at a learning rate of 5 × 10^-6. In the inference phase, we employ a beam-search strategy with beam size 3.
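A minimal sketch, assuming PyTorch, of an optimizer and learning-rate schedule consistent with the settings listed above (Adam, warm-up over 20000 iterations followed by step decay, later fine-tuning with the CIDEr reward at a learning rate of 5 × 10^-6); the base learning rate and decay factors are illustrative assumptions.

```python
import torch

def make_optimizer_and_scheduler(model, base_lr=1e-4, warmup=20000):
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup:
            return (step + 1) / warmup                 # linear warm-up over 20000 iterations
        return 0.5 ** ((step - warmup) // 50000)       # assumed step decay afterwards

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

optimizer, scheduler = make_optimizer_and_scheduler(torch.nn.Linear(512, 512))
for step in range(5):                                  # training loop placeholder
    optimizer.step()
    scheduler.step()
```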
In conclusion, from the causal perspective, causal intervention is seamlessly integrated into object detection and caption generation by a new Transformer-based image captioning architecture CIIC, and the confounding effects are jointly alleviated. On the one hand, the proposed IOD effectively disentangles visual features and facilitates deconfounded image captioning. On the other hand, the proposed ITD adopts causal intervention to handle visual and linguistic confounders in the sentence generation process simultaneously. Experimental results show that the method significantly improves the performance of Transformer-based image captioning models and achieves a new state of the art on the MS-COCO dataset under the single-model setting.

Claims (5)

1. An image title automatic generation method based on causal reasoning and the Transformer, characterized in that: a detailed causal inference image captioning CIIC framework is constructed on the basis of a causal graph, wherein the causal graph comprises a visual confounder and a textual confounder;
the causal inference image captioning CIIC framework comprises a sequentially connected unaligned-feature Transformer encoder UFT and an interventional Transformer decoder ITD, wherein the unaligned-feature Transformer encoder comprises a sequentially connected Faster R-CNN, an interventional object detector IOD and a standard Transformer encoder; the interventional Transformer decoder is formed by inserting a causal intervention CI module after the feed-forward neural network layer of a standard Transformer decoder; the interventional object detector IOD and the interventional Transformer decoder ITD jointly control the visual confounder and the textual confounder while the input image is encoded and then decoded;
the method comprises the steps that a non-alignment feature Transformer encoder UFT sends confusion-removing visual features extracted by an IOD and bottom-up features extracted from the same image into two linear layers for mapping to generate Q, K and V vectors, integration is carried out through self attention and cross attention, addForm operation and feedforward propagation operation in a traditional Transformer are carried out, obtained outputs are transmitted to a next coding block, and the obtained outputs are L blocks in total, namely L codes are overlapped; the ITD input of the interference-predictive Transformer decoder is a currently generated sentence part, the ITD is subjected to cross attention with the final output of an encoding end through position embedding and a mask layer, addMorm operation and feed-forward propagation operation are carried out, visual and language confusion in the decoding process is eliminated through a causal intervention CI module, then the AddMorm operation is carried out, and L-time decoding is repeated to obtain final prediction output; causal intervention CI module is to fuse visual and linguistic features h 2 With a visual confusion factor D 1 And a language confusion factor D 2 A combination of the expectations of;
the interventional object detector IOD separates the region-based visual features by eliminating visual confusion factors: separating the characteristics of the region of interest through an intervention type object sensor, and then combining the characteristics with the bottom-up characteristic of a FASTER regional convolutional neural network FASTER R-CNN to be used as the input of a transform encoder; the intervention target detector IOD integrates causal reasoning into image features extracted by FASTER R-CNN to cope with visual confusion extracted by a traditional pre-training model, so as to obtain non-entangled expression based on regions; inputting the result generated in the decoding stage into an intervention type Transformer decoder ITD, introducing causal intervention into a Transformer decoder used for generating a traditional image title, and reducing visual and language confusion in the decoding process;
by simultaneously establishing visual and linguistic concepts through the encoder and the decoder, unobserved aliasing factors between the IOD and the ITD are relieved, visual and linguistic aliasing is eliminated, pseudo-correlation occurring in visual feature representation and title generation is effectively eliminated, and finally a truer image title is generated.
2. The image title automatic generation method based on causal reasoning and the Transformer according to claim 1, characterized in that the method comprises the following specific steps:
extracting image features of the image for which a title is to be generated through Faster R-CNN, and eliminating the visually confounded regional features among them by using the interventional object detector IOD;
specifically, since the Faster R-CNN object detector uses the likelihood estimate P(Y|X) as the training objective of the classifier, spurious correlations caused by the confounder Z arise:

P(Y|X) = Σ_z P(Y|X, Z=z) P(Z=z|X)

where X is the region-based visual feature of the input image, Z is the visual confounder of the image, and Y is the category label;
therefore, the causal intervention P(Y|do(X)) is used as the new classification objective for object detection, where the do operator do(·) cuts the link Z → X; since estimating P(Y|do(X)) by sampling during actual training would make the training time prohibitive, the normalized weighted geometric mean (NWGM) approximation is applied, and the class probability output by the interventional object detector is:

P(y_i^c | do(x)) ≈ p_i^c( concat(x, Σ_{j=1}^{n} P(z_j) z_j) )

where concat denotes matrix concatenation, y_i^c is the i-th class label, and p_i^c(·) is the probability output of the pre-trained classifier for the i-th class given x; x denotes a region feature of the specific input image and y_i^c its corresponding class label, X and Y denote the random variables of x and y_i^c, and x and y_i^c denote specific sample values;

the confounder is approximated as a fixed confounder dictionary Z = [z_1, z_2, ..., z_n], where n denotes the number of classes in the dataset and z_i denotes the average RoI feature of the i-th class, each RoI feature being pre-extracted by Faster R-CNN; the IOD feature extractor works as follows: the Faster regional convolutional neural network Faster R-CNN extracts regions of interest RoI on the feature map, the feature of each region of interest RoI is used to predict a bounding box y_B and a class probability output label y_C disturbed by surrounding visual confounders, and according to the class probability output label y_C and the confounder dictionary Z, the do operator is performed to predict the final class label y_I, eliminating the interference of surrounding visual confounders;
extracting deconfounded object features from all RoI candidate regions by using the interventional object detector IOD as the IOD features, and sending the IOD features and the bottom-up features extracted from the same image, which have discriminative ability over different object attributes, into two linear layers for mapping to generate Q, K and V vectors, where Q denotes the query vector, K denotes the vector measuring the correlation between the query and other information, and V denotes the vector of the queried information; integrating them through self attention and cross attention to promote the visual representation of the CIIC model; because the bottom-up features and the IOD features are not aligned, a multi-view Transformer encoder, i.e. the unaligned-feature Transformer encoder UFT, is introduced to reconcile them, and the bottom-up features and the IOD features are input to the UFT encoder for the alignment and fusion operations:
let the bottom-up features and the IOD features extracted from the image be X^F ∈ R^(m×d1) and X^I ∈ R^(n×d2), where m ≠ n and d1 ≠ d2; two linear layers constructed in the Transformer network convert X^F and X^I into a common d-dimensional space, the results being denoted X̂^F ∈ R^(m×d) and X̂^I ∈ R^(n×d) respectively; X̂^F is selected as the main visual feature, and the cross-attention over X̂^I guided by the main visual feature is learned with the following formula:
A^I = Multihead(X̂^F, X̂^I, X̂^I)
where Multihead(·) denotes the multi-head attention function of a standard Transformer and A^I is the attended feature corresponding to X̂^F; likewise, a multi-head self-attention model is established on X̂^F:
A^F = Multihead(X̂^F, X̂^F, X̂^F)
in which the multi-head attention is sought within X̂^F, i.e. Q, K and V all come from X̂^F, so the attended features A^F and A^I have the same shape; a residual Add & Norm layer AddNorm is then used to fuse and encapsulate A^F and A^I, and the fused feature information F is:
F = LayerNorm(X̂^F + A^F + A^I)
wherein LayerNorm denotes layer normalization; finally, the fused feature information F is sent to the FFN module, i.e. the feed-forward neural network in the Transformer, to generate the UFT encoding result;
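As an illustration of the alignment and fusion just described, the following is a minimal PyTorch sketch of a UFT-style encoder layer; the class name UFTEncoderLayer, the default dimensions and the use of torch.nn.MultiheadAttention are assumptions made for the sketch rather than the patented implementation.

```python
import torch
import torch.nn as nn

class UFTEncoderLayer(nn.Module):
    """Aligns bottom-up and IOD features, fuses self- and cross-attention, applies the FFN."""
    def __init__(self, d_bu, d_iod, d_common=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.proj_bu = nn.Linear(d_bu, d_common)    # map bottom-up features X^F to the common space
        self.proj_iod = nn.Linear(d_iod, d_common)  # map IOD features X^I to the common space
        self.self_attn = nn.MultiheadAttention(d_common, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_common, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_common)
        self.ffn = nn.Sequential(nn.Linear(d_common, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_common))
        self.norm2 = nn.LayerNorm(d_common)

    def forward(self, x_bu, x_iod):
        xf = self.proj_bu(x_bu)                  # (B, m, d): main visual feature
        xi = self.proj_iod(x_iod)                # (B, n, d)
        a_f, _ = self.self_attn(xf, xf, xf)      # Q, K, V all from the main visual feature
        a_i, _ = self.cross_attn(xf, xi, xi)     # main feature attends over the IOD features
        f = self.norm1(xf + a_f + a_i)           # residual Add & Norm fusion
        return self.norm2(f + self.ffn(f))       # feed-forward module -> UFT encoding result
```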
in order to alleviate the spurious correlations between the attended visual features and the words with the corresponding meanings, a standard Transformer decoder structure is constructed, and a causal intervention module CI is integrated into every Transformer decoder layer on the basis of the standard Transformer decoder structure; the region-based disentangled representation obtained in the encoder and the text are taken as the input of the decoder, and the visual and linguistic confounding in the decoding process is eliminated by the causal intervention module CI, so that the final image title language description is generated.
3. The causal inference and Transformer-based image title automatic generation method of claim 2, wherein: a structural causal model SCM is used to construct the causal relationships among the attended visual feature V, the visual context D_1, the language context D_2, the attended word feature h_1 of the partially generated sentence, the fused feature h_2 and the predicted word W: the true causal effect is V → W; the visual context D_1 and the language context D_2 affect the visual feature V and the predicted word W respectively; the language context D_2 affects the visual feature V through the attended word feature h_1; the word feature h_1 and the visual feature V jointly determine the fused feature h_2, which finally determines the predicted word W; specifically, the causal effect V → W indicates that the attended visual features lead to the generation of the corresponding words; the causal effect of D_1 on V, denoted D_1 → V, exists because some frequently occurring visual contexts can severely affect the attended visual features when a caption model is trained, and the causal effect D_1 → W means that the visual context directly affects the frequency of occurrence of some related words in the generated description; D_2 → h_1 → V means that the attended word feature, affected by the language context, guides the attended visual feature through multi-head cross-attention; h_1 → h_2, V → h_2 and h_2 → W indicate that the decoder fuses the visual feature and the language feature and uses the fused feature h_2 to infer the next predicted word W; when the observational probability P(W | V, h_1) without causal intervention is used as the training objective, the confounding factors D_1 and D_2 may cause the caption generation model to learn spurious correlations between the visual feature V and the predicted word W; to explain the causal intervention in the image caption generation, P(W | V, h_1) is expressed as:
P(W | V, h_1) = Σ_{d_1} Σ_{d_2} P(W | V, h_1, d_1, d_2) P(d_1 | V) P(d_2 | h_1)
wherein the confounding factors D_1 and D_2 usually introduce observation bias through P(d_1 | V) and P(d_2 | h_1); the causal intervention P(W | do(V), do(h_1)) is used instead of the conventional image description training objective P(W | V, h_1), thereby eliminating the causal effect of D_1 on the visual feature V and of D_2 on the attended word feature h_1, so that the two back-door paths V ← D_1 → W and h_1 ← D_2 → W are blocked and the spurious associations are eliminated; assuming that the confounding factors D_1 and D_2 can each be stratified, P(W | do(V), do(h_1)) can be adjusted by the back-door criterion as:
P(W | do(V), do(h_1)) = Σ_{d_1} Σ_{d_2} P(W | V, h_1, d_1, d_2) P(d_1) P(d_2)
thus, according to the adjustment in the above formula, the image caption generation model (P(W | do(V), do(h_1)) being the predicted output probability of the model) is forced to learn the true causal effect V → W rather than the spurious associations induced by the visual confounding factor D_1 and the language confounding factor D_2; since D_1 and D_2 are both unobserved and lie beyond the objective of the image caption generation, an approximate visual confounder dictionary D_1 and a language confounder dictionary D_2 need to be constructed: a visual matrix V_r ∈ R^(c×d_v) is constructed by setting each entry to the average RoI feature of the objects of the corresponding class in the image classification dataset, where c is the number of classes in the training dataset and d_v denotes the dimension of each RoI feature; at the same time, the d_e-dimensional word embeddings W_e ∈ R^(N×d_e) of a predefined vocabulary are used to construct a semantic space, where N is the length of the vocabulary and d_e is the word feature dimension; the description model is then trained to learn two linear projections P_v and P_w, which convert the visual matrix V_r and the word embeddings W_e into D_1 and D_2 respectively by the formulas D_1 = V_r P_v and D_2 = W_e P_w; the prediction is calculated with the NWGM approximation:
P(W | do(V), do(h_1)) ≈ Softmax{ g(h_2, E_{D_1}[D_1], E_{D_2}[D_2]) },
wherein g (-) represents the fully connected layer,
Figure RE-FDA0003857911710000046
and
Figure RE-FDA0003857911710000047
by setting D 1 And D 2 By fusing the features h 2 Conditional on increasing the representation capability of the interferometric Transformer decoder; do (h) 1 ) P (W | do (V)) represents the probability P (W | do (V), do (h)) of predicting the generated word after eliminating the visual confusion feature 1 ) ) represents the predicted output probability after eliminating the language context confusion and visual confusion features.
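For illustration, a minimal PyTorch sketch of a causal intervention module CI in the spirit of the formula above follows; the class name CIModule, the attention-style estimation of the conditional expectations and the dimensions are assumptions made for the sketch, not the exact patented design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CIModule(nn.Module):
    """Computes approximately Softmax{ g(h2, E_D1[D1], E_D2[D2]) } as the next-word distribution."""
    def __init__(self, d_model, d1_dim, d2_dim, vocab_size):
        super().__init__()
        self.q1 = nn.Linear(d_model, d1_dim)   # query the visual confounder dictionary with h2
        self.q2 = nn.Linear(d_model, d2_dim)   # query the language confounder dictionary with h2
        self.g = nn.Linear(d_model + d1_dim + d2_dim, vocab_size)  # fully connected layer g(.)

    def forward(self, h2, D1, D2):
        # h2: (B, d_model); D1: (c, d1_dim) visual dictionary; D2: (N, d2_dim) language dictionary
        w1 = F.softmax(self.q1(h2) @ D1.t(), dim=-1)   # P(d1 | h2), shape (B, c)
        w2 = F.softmax(self.q2(h2) @ D2.t(), dim=-1)   # P(d2 | h2), shape (B, N)
        e_d1 = w1 @ D1                                  # E_D1[D1] conditioned on h2
        e_d2 = w2 @ D2                                  # E_D2[D2] conditioned on h2
        logits = self.g(torch.cat([h2, e_d1, e_d2], dim=-1))
        return F.softmax(logits, dim=-1)                # approx. P(W | do(V), do(h1))
```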
4. The causal inference and Transformer-based image title automatic generation method of claim 2, wherein:
the unaligned feature Transformer encoder UFT comprises a FASTER R-CNN, an interventional object detector IOD and a standard Transformer encoder, the standard Transformer encoder comprising a multi-head attention layer, a residual Add & Norm layer and a feed-forward neural network layer;
the interventional Transformer decoder is formed by inserting a causal intervention CI module after the feed-forward neural network layer module of a standard Transformer decoder, the standard Transformer decoder comprising a masked attention layer, a multi-head attention layer, a residual Add & Norm layer and a feed-forward neural network layer;
the part of the unaligned feature Transformer encoder consisting of the multi-head attention layer, the residual Add & Norm layer and the feed-forward neural network layer is stacked L times; the part of the interventional Transformer decoder consisting of the masked attention layer, the multi-head attention layer, the residual Add & Norm layer, the feed-forward neural network layer and the causal intervention CI module is stacked L times;
both the Transformer decoder and the Transformer encoder comprise a multi-head attention layer, a residual Add & Norm layer and a feed-forward neural network layer; through the visual dictionary D_1 and the language dictionary D_2, the causal intervention module CI of the interventional Transformer decoder combines the fused feature h_2 with the visual confounding factor D_1 and the language confounding factor D_2 to predict the next word at each time step; a begin marker is input as text at the start of prediction, and at every subsequent time step the last generated word is input as text; that is, through the back-door adjustment the links from the confounding factors are cut off, and the unobserved confounding factors are effectively eliminated to realize the causal intervention;
first, bottom-up features are extracted from the input image through FASTER R-CNN, and de-confounded object features are extracted from the RoI candidate regions through the interventional object detector IOD; the UFT encoder takes the bottom-up features and the IOD features as input and performs the alignment and fusion operations; the interventional Transformer decoder takes the integrated visual features as input and combines them with the input word information at each time step; the output of the last decoder layer is projected to an N-dimensional space by a linear embedding layer, where N is the size of the vocabulary; finally, the probabilities of the words in the vocabulary are predicted with a softmax operation to generate the final predicted word; during training the word at each time step comes from the ground-truth annotated sentence, whereas at inference the output word of the previous time step is taken as input.
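A minimal sketch of the step-wise generation loop described above is given below; model.encode, model.decode_step and the BOS/EOS token ids are assumed names standing in for the CIIC components, and greedy selection is used here only to keep the sketch short (beam search is described in claim 5).

```python
import torch

@torch.no_grad()
def generate_caption(model, bu_feats, iod_feats, max_len=20, bos_id=1, eos_id=2):
    vis = model.encode(bu_feats, iod_feats)        # UFT-encoded visual features
    words = [bos_id]                               # begin marker as the first text input
    for _ in range(max_len):
        tokens = torch.tensor(words).unsqueeze(0)  # (1, t) words generated so far
        probs = model.decode_step(vis, tokens)     # (1, t, N) next-word distributions
        next_word = int(probs[0, -1].argmax())     # greedy choice at this time step
        if next_word == eos_id:
            break
        words.append(next_word)                    # last generated word fed back as input
    return words[1:]
```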
5. The causal inference and Transformer-based image title automatic generation method of claim 4, wherein the causal reasoning image caption CIIC framework is pre-trained as follows:
firstly, pre-training is carried out with a word-level cross-entropy loss; the training set comprises images and their corresponding description sentences, and the loss function is:
L_XE(θ) = - Σ_{t=1}^{T} log p_θ( w*_t | w*_{1:t-1} )
wherein θ denotes all parameters of the causal reasoning image caption CIIC framework model, including the weights and biases, and w*_{1:T} is the target ground-truth sequence; the non-differentiable evaluation metrics of the model are then optimized through reinforcement learning RL, adopting a variant of self-critical sequence training SCST on beam-search-sampled sequences and minimizing the negative expected score:
L_RL(θ) = - E_{w_{1:T} ~ p_θ} [ r(w_{1:T}) ]
wherein the reward r(·) is the CIDEr-D score;
testing the trained causal reasoning image caption CIIC framework: beam search is used to generate the sentence word by word; the image to be recognized is input into the trained model and, after a series of processing steps, into the decoder; at the first decoding step the top k candidates are considered; k second words are generated for the k first words, and the top k [first word, second word] combinations are selected according to the obtained scores; k third words are then selected for the k second words and the top k [first word, second word, third word] combinations are kept; each decoding step is repeated in this way, and after the k sequences are finished, the sequence with the best composite score, i.e. the sequence with the highest probability in the final beam, is selected.
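A simplified beam-search sketch matching the decoding procedure above follows; model.decode_step, vis and the BOS/EOS ids are assumed names, and refinements such as length normalization are omitted.

```python
import torch

@torch.no_grad()
def beam_search(model, vis, k=3, max_len=20, bos_id=1, eos_id=2):
    beams = [([bos_id], 0.0)]                       # (word ids, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            tokens = torch.tensor(words).unsqueeze(0)
            logprobs = model.decode_step(vis, tokens)[0, -1].log()   # next-word log-probs
            top_lp, top_ids = logprobs.topk(k)                       # k continuations per beam
            for lp, wid in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((words + [wid], score + lp))
        # keep the top-k [first word, ..., current word] combinations by score
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:k]:
            (finished if words[-1] == eos_id else beams).append((words, score))
        if not beams:                                # all k sequences have ended
            break
    best = max(finished or beams, key=lambda c: c[1])  # highest composite score in the final beam
    return best[0][1:]
```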
CN202210661517.XA 2022-06-13 2022-06-13 Image title automatic generation method based on causal reasoning Pending CN115239944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210661517.XA CN115239944A (en) 2022-06-13 2022-06-13 Image title automatic generation method based on causal reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210661517.XA CN115239944A (en) 2022-06-13 2022-06-13 Image title automatic generation method based on causal reasoning

Publications (1)

Publication Number Publication Date
CN115239944A true CN115239944A (en) 2022-10-25

Family

ID=83669874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210661517.XA Pending CN115239944A (en) 2022-06-13 2022-06-13 Image title automatic generation method based on causal reasoning

Country Status (1)

Country Link
CN (1) CN115239944A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116502958A (en) * 2023-05-10 2023-07-28 浙江大学 Spinning processing quality diagnosis method for aerospace thin-walled part based on causal analysis
CN116502958B (en) * 2023-05-10 2023-11-10 浙江大学 Spinning processing quality diagnosis method for aerospace thin-walled part based on causal analysis
CN116307274A (en) * 2023-05-18 2023-06-23 北京航空航天大学 Urban area energy consumption prediction method considering causal intervention
CN116307274B (en) * 2023-05-18 2023-08-18 北京航空航天大学 Urban area energy consumption prediction method considering causal intervention
CN116612365A (en) * 2023-06-09 2023-08-18 匀熵智能科技(无锡)有限公司 Image subtitle generating method based on target detection and natural language processing
CN116612365B (en) * 2023-06-09 2024-01-23 匀熵智能科技(无锡)有限公司 Image subtitle generating method based on target detection and natural language processing
CN116912851A (en) * 2023-07-25 2023-10-20 京东方科技集团股份有限公司 Image processing method, device, electronic equipment and readable storage medium
CN116738345A (en) * 2023-08-15 2023-09-12 腾讯科技(深圳)有限公司 Classification processing method, related device and medium
CN116738345B (en) * 2023-08-15 2024-03-01 腾讯科技(深圳)有限公司 Classification processing method, related device and medium
CN116862000A (en) * 2023-09-01 2023-10-10 浪潮电子信息产业股份有限公司 Causal thinking chain generation method, device and equipment for generating artificial intelligence
CN116862000B (en) * 2023-09-01 2024-01-23 浪潮电子信息产业股份有限公司 Causal thinking chain generation method, device and equipment for generating artificial intelligence
CN117273151A (en) * 2023-11-21 2023-12-22 杭州海康威视数字技术股份有限公司 Scientific instrument use analysis method, device and system based on large language model
CN117273151B (en) * 2023-11-21 2024-03-15 杭州海康威视数字技术股份有限公司 Scientific instrument use analysis method, device and system based on large language model
CN117648633A (en) * 2024-01-29 2024-03-05 西南石油大学 Sensitive information identification method and system based on enhanced graph attention network
CN117648633B (en) * 2024-01-29 2024-04-19 西南石油大学 Sensitive information identification method and system based on enhanced graph attention network

Similar Documents

Publication Publication Date Title
CN115239944A (en) Image title automatic generation method based on causal reasoning
CN110909673B (en) Pedestrian re-identification method based on natural language description
CN111783705B (en) Character recognition method and system based on attention mechanism
CN112633364A (en) Multi-modal emotion recognition method based on Transformer-ESIM attention mechanism
CN113486669B (en) Semantic recognition method for emergency rescue input voice
US20220076100A1 (en) Multi-Dimensional Deep Neural Network
CN116341562A (en) Similar problem generation method based on Unilm language model
CN113946706A (en) Image description generation method based on reference preposition description
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN113919319A (en) Script event prediction method based on action scene reinforcement
CN110570484B (en) Text-guided image coloring method under image decoupling representation
CN115718815A (en) Cross-modal retrieval method and system
CN113656569A (en) Generating type dialogue method based on context information reasoning
CN116958700A (en) Image classification method based on prompt engineering and contrast learning
CN116108856B (en) Emotion recognition method and system based on long and short loop cognition and latent emotion display interaction
CN116595407A (en) Event argument detection method and system based on label sequence consistency modeling
CN115329776B (en) Semantic analysis method for network security co-processing based on less-sample learning
CN116452688A (en) Image description generation method based on common attention mechanism
CN114972795B (en) Garment image subtitle generating method combining attribute detection and visual perception
CN115828142A (en) Interpretable text classification system based on double-path encoder
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof
CN111813924A (en) Category detection algorithm and system based on extensible dynamic selection and attention mechanism
CN117746441B (en) Visual language understanding method, device, equipment and readable storage medium
CN117173530B (en) Target abnormality detection method and device
CN112765955B (en) Cross-modal instance segmentation method under Chinese finger representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination