CN115239944A - Image title automatic generation method based on causal reasoning - Google Patents

Image title automatic generation method based on causal reasoning

Info

Publication number
CN115239944A
CN115239944A CN202210661517.XA
Authority
CN
China
Prior art keywords
visual
causal
image
features
confusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210661517.XA
Other languages
Chinese (zh)
Inventor
刘兵
王栋
刘明明
陆劲夫
李穗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Mining and Technology CUMT
Original Assignee
China University of Mining and Technology CUMT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Mining and Technology CUMT
Priority to CN202210661517.XA
Publication of CN115239944A
Legal status: Pending

Classifications

    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/764 Image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V 10/806 Fusion, i.e. combining data from various sources at the feature extraction level
    • G06V 10/82 Image or video recognition or understanding using neural networks
    • G06V 20/635 Overlay text, e.g. embedded captions in a TV program
    • G06V 2201/07 Target detection
    • G06N 3/08 Learning methods (computing arrangements based on neural networks)
    • G06N 3/088 Non-supervised learning, e.g. competitive learning
    • G06N 5/041 Abduction (inference or reasoning models)


Abstract

The invention discloses a causal-graph-based causal reasoning image caption generation method suitable for use in image captioning. A detailed causal-graph-based causal inference image captioning (CIIC) framework is constructed, comprising an unaligned-feature Transformer encoder and an interventional Transformer decoder, wherein the unaligned-feature Transformer encoder comprises a Faster R-CNN, an interventional object detector IOD and a standard Transformer encoder which are sequentially connected; the interventional Transformer decoder is formed by inserting a causal intervention CI module after the feed-forward neural network layer of a standard Transformer decoder; the interventional object detector IOD and the interventional Transformer decoder ITD jointly control the visual confounder and the textual confounder while the input image is encoded and then decoded. Confounding can be eliminated through back-door adjustment, the problem of entangled visual features in the encoded image in conventional image captioning is effectively solved, and the generated image descriptions are more robust.

Description

Image title automatic generation method based on causal reasoning
Technical Field
The invention relates to an automatic generation method for image captions, and in particular to a causal-graph-based causal reasoning image caption generation method for use in image captioning.
Background
Existing image captioning methods generally follow an encoder-decoder architecture, where image features extracted by a CNN are fed into an RNN (usually LSTM-based) to generate a corresponding sentence. Because RNN-based models are limited by their sequential structure, convolutional language models have been explored as replacements for traditional RNNs. More recently, departing from convolution operations, Transformer-based captioning models have achieved considerable results based on the multi-head attention paradigm.
However, most Transformer-based image captioning models may still learn the dataset bias caused by hidden confounders, and how to deal with the visual and linguistic confounding in image captioning introduced by this bias remains unexplored. In terms of visual representation, most models employ pre-trained detectors, which ignore the problem of entangled visual features in the image. In terms of model structure, most current Transformer-based image captioning models overlook two elusive confounders: the visual confounder and the linguistic confounder, which typically lead to bias during training, spurious correlations during testing and reduced generalization ability. Therefore, a new method is needed to remove the spurious correlations and dataset bias caused by the confounders that these captioning models may learn.
Recently, there has been a trend of introducing causal reasoning into different deep learning models. These efforts make it possible to endow deep neural networks with the ability to learn causal effects. Causal effects significantly improve the performance of many Computer Vision (CV) and Natural Language Processing (NLP) models, including image classification, semantic segmentation, visual feature representation, visual dialog, image captioning, and dialog generation. Existing research analyzes the spurious correlation between visual features and captions from a causal perspective and proposes a deconfounded image captioning (DIC) framework to deal with the confounder, but it still has limitations. In its causal graph, the entire dataset is treated as the confounder, which is difficult to stratify and must be eliminated through complex front-door adjustment based on an additional mediator. Moreover, DIC focuses on removing the confounder in the decoder and neglects the confounder of the visual features in the encoder, resulting in severe performance degradation.
Disclosure of Invention
The technical problem is as follows: aiming at the defects of the prior art, an image title automatic generation method based on causal reasoning is provided, which can handle the visual and linguistic confounders in the sentence generation process simultaneously, presents a more detailed causal graph, and significantly improves the performance of Transformer-based image captioning models.
The technical scheme is as follows: in order to achieve the above technical object, the image title automatic generation method based on causal reasoning and the Transformer according to the present invention is characterized in that a detailed causal inference image captioning CIIC framework is constructed on the basis of a causal graph, wherein the causal graph comprises a visual confounder and a textual confounder;
the causal inference image captioning CIIC framework comprises a sequentially connected unaligned-feature Transformer encoder UFT and an interventional Transformer decoder ITD, wherein the unaligned-feature Transformer encoder comprises a sequentially connected Faster R-CNN, an interventional object detector IOD and a standard Transformer encoder; the interventional Transformer decoder is formed by inserting a causal intervention CI module after the feed-forward neural network layer of a standard Transformer decoder; the interventional object detector IOD and the interventional Transformer decoder ITD jointly control the visual confounder and the textual confounder while the input image is encoded and then decoded;
the method comprises the steps that a non-alignment feature Transformer encoder UFT sends confusion-removing visual features extracted by an IOD and bottom-up features extracted from the same image into two linear layers for mapping to generate Q, K and V vectors, integration is carried out through self attention and cross attention, addForm operation and feedforward propagation operation in a traditional Transformer are carried out, obtained outputs are transmitted to a next coding block, and the obtained outputs are L blocks in total, namely L codes are overlapped; the ITD input of the interference-predictive Transformer decoder is a currently generated sentence part, the ITD is subjected to cross attention with the final output of an encoding end through position embedding and a mask layer, addMorm operation and feed-forward propagation operation are carried out, visual and language confusion in the decoding process is eliminated through a causal intervention CI module, then the AddMorm operation is carried out, and L-time decoding is repeated to obtain final prediction output; causal intervention CI module is to fuse visual and linguistic features h 2 With a visual confusion factor D 1 And a language confusion factor D 2 A combination of the expectations of;
the interventional object detector IOD separates the region-based visual features by eliminating visual confusion factors: separating the characteristics of the region of interest through an intervention type object sensor, and then combining the characteristics with the bottom-up characteristic of a FASTER regional convolutional neural network FASTER R-CNN to be used as the input of a transform encoder; the intervention target detector IOD integrates causal reasoning into the image features extracted by FASTER R-CNN to cope with the visual confusion extracted by the traditional pre-training model, so as to obtain a non-entangled representation based on regions; inputting a result generated in a decoding stage into an intervention type Transformer decoder ITD, introducing causal intervention into a Transformer decoder used for generating a traditional image title, and reducing visual and language confusion in a decoding process;
the visual and linguistic concepts are simultaneously established through the encoder and the decoder, unobserved aliasing factors between the intervention type target detector IOD and the intervention type Transformer decoder ITD are relieved, visual and linguistic aliasing is eliminated, pseudo-correlation occurring in visual feature representation and title generation is effectively eliminated, and finally a more real image title is generated.
The method comprises the following specific steps:
extracting image features of the image for which a title is to be generated through Faster R-CNN, and eliminating the visually confounded regional features among them by using the interventional object detector IOD;
specifically, since the Faster R-CNN object detector uses the likelihood estimate P(Y|X) as the training objective of the classifier, spurious correlations caused by the confounder Z arise:

P(Y|X) = Σ_z P(Y|X, Z=z) P(Z=z|X)

where X is the region-based visual feature of the input image, Z is the visual confounder of the image, and Y is the category label;
therefore, the causal intervention P(Y|do(X)) is used as the new classification objective for object detection, where the do operator do(·) cuts the link Z → X; since estimating P(Y|do(X)) by sampling during actual training would make the training time prohibitive, the normalized weighted geometric mean (NWGM) approximation is applied, and the class probability output by the interventional object detector is:

P(y_i^c | do(x)) ≈ p_i^c( concat(x, Σ_{j=1}^{n} P(z_j) z_j) )

where concat denotes matrix concatenation, y_i^c is the i-th class label, and p_i^c(·) is the probability output of the pre-trained classifier for the i-th class given x; x denotes a region feature of the specific input image and y_i^c its corresponding class label, X and Y denote the random variables of x and y_i^c, and x and y_i^c denote specific sample values;

the confounder is approximated as a fixed confounder dictionary Z = [z_1, z_2, ..., z_n], where n denotes the number of classes in the dataset and z_i denotes the average RoI feature of the i-th class, each RoI feature being pre-extracted by Faster R-CNN; the IOD feature extractor works as follows: the Faster regional convolutional neural network Faster R-CNN extracts regions of interest RoI on the feature map, the feature of each region of interest RoI is used to predict a bounding box y_B and a class probability output label y_C disturbed by surrounding visual confounders, and according to the class probability output label y_C and the confounder dictionary Z, the do operator is performed to predict the final class label y_I, eliminating the interference of surrounding visual confounders;
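As a rough illustration of the interventional classification head described above, the following sketch (assuming PyTorch) concatenates each RoI feature with the expectation of a fixed confounder dictionary of per-class average RoI features before computing class probabilities, which is one way the NWGM-approximated P(Y|do(X)) could be realized; the uniform prior P(z_j) = 1/n and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class InterventionalClassifier(nn.Module):
    def __init__(self, confounder_dict, num_classes, feat_dim=1024):
        super().__init__()
        # confounder_dict: (n, feat_dim) tensor of per-class average RoI features.
        self.register_buffer("Z", confounder_dict)
        # Assumed uniform prior P(z_j) = 1/n over the confounder dictionary.
        self.prior = torch.full((confounder_dict.size(0),), 1.0 / confounder_dict.size(0))
        self.fc = nn.Linear(feat_dim * 2, num_classes)

    def forward(self, x):
        # x: (num_rois, feat_dim) RoI features from Faster R-CNN.
        ez = (self.prior.to(x.device).unsqueeze(1) * self.Z).sum(dim=0)  # E_Z[Z], shape (feat_dim,)
        ez = ez.expand_as(x)
        logits = self.fc(torch.cat([x, ez], dim=-1))   # concat(x, E_Z[Z]) -> class logits
        return logits.softmax(dim=-1)                  # approximate P(y^c | do(x))

head = InterventionalClassifier(torch.randn(80, 1024), num_classes=80)
probs = head(torch.randn(100, 1024))   # (100, 80) deconfounded class probabilities
```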
extracting deconfounded object features from all RoI candidate regions by using the interventional object detector IOD as the IOD features, and sending the IOD features and the bottom-up features extracted from the same image, which have discriminative ability over different object attributes, into two linear layers for mapping to generate Q, K and V vectors, where Q denotes the query vector, K denotes the vector measuring the correlation between the query and other information, and V denotes the vector of the queried information; integrating them through self attention and cross attention to promote the visual representation of the CIIC model; because the bottom-up features and the IOD features are not aligned, a multi-view Transformer encoder, i.e. the unaligned-feature Transformer encoder UFT, is introduced to reconcile them, and the bottom-up features and the IOD features are input to the UFT encoder for the alignment and fusion operations:
let the bottom-up features and the IOD features extracted from the image be X_F ∈ R^{m×d1} and X_I ∈ R^{n×d2}, where m ≠ n and d1 ≠ d2; two linear layers constructed in the Transformer network convert X_F and X_I into a common d-dimensional space, denoted respectively by X̂_F ∈ R^{m×d} and X̂_I ∈ R^{n×d}; X̂_I is selected as the principal visual feature, and its cross attention over X̂_F is learned with the following formula:

A_cross = Multihead(X̂_I, X̂_F, X̂_F)

where Multihead(·) denotes the multi-head attention function of a standard Transformer and X̂_F provides the corresponding key and value features; likewise, a multi-head attention model on X̂_I is established:

A_self = Multihead(X̂_I, X̂_I, X̂_I)

in which the multi-head attention is computed from X̂_I alone, i.e. Q, K and V all come from X̂_I; therefore the attended features A_cross and A_self have the same shape, and the residual standard layer AddNorm is then used to fuse and encapsulate them with X̂_I, the fused feature information F being:

F = LayerNorm(X̂_I + A_cross + A_self)

where LayerNorm denotes layer normalization; finally, the fused feature information F is sent to the FFN module, i.e. the feed-forward neural network in the Transformer, to generate the UFT encoding result;
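A minimal sketch of the fusion step described above, assuming PyTorch: the projected IOD features attend to the projected bottom-up features (cross attention) and to themselves (self attention), the three terms are fused with AddNorm, and an FFN produces the block output; the residual AddNorm around the FFN and the layer sizes are assumptions in the style of a standard Transformer block.

```python
import torch
import torch.nn as nn

class UFTBlock(nn.Module):
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x_iod, x_bu):
        # x_iod: (B, n, d) projected IOD features, x_bu: (B, m, d) projected bottom-up features
        a_cross, _ = self.cross_attn(x_iod, x_bu, x_bu)   # Multihead(X_I, X_F, X_F)
        a_self, _ = self.self_attn(x_iod, x_iod, x_iod)   # Multihead(X_I, X_I, X_I)
        f = self.norm1(x_iod + a_cross + a_self)          # AddNorm fusion of the three terms
        return self.norm2(f + self.ffn(f))                # FFN with assumed residual AddNorm

block = UFTBlock()
out = block(torch.randn(2, 100, 512), torch.randn(2, 100, 512))  # (2, 100, 512)
```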
in order to alleviate the spurious correlation between the attended visual features and the words of corresponding meaning, a standard Transformer decoder structure is constructed, and a causal intervention module CI is integrated into each Transformer decoder layer; the region-based disentangled representation obtained by the encoder and the text are taken as the input of the decoder, and visual and linguistic confounding in the decoding process is eliminated through the causal intervention module CI, so as to generate the final image caption description.
Further, a structural causal model (SCM) is used to construct the causal relationships among the attended visual feature V, the visual context D_1, the linguistic context D_2, the attended word feature h_1 of the partially generated sentence, the fused feature h_2 and the predicted word W: the true causal effect is V → W; the visual context D_1 and the linguistic context D_2 influence the visual feature V and the predicted word W respectively; the linguistic context D_2 influences the visual feature V through the attended word feature h_1; the word feature h_1 and the visual feature V jointly influence the fused feature h_2, which finally influences the predicted word W. Specifically, the causal effect V → W indicates that the attended visual features lead to the generation of the corresponding words; the causal effect of D_1 on V, denoted D_1 → V, arises because some frequently occurring visual context severely affects the attended visual features when a captioning model is trained, and the causal effect D_1 → W means that the visual context directly affects the frequency of occurrence of some related words in the generated description; D_2 → h_1 → V indicates that the attended word features, affected by the linguistic context, guide the attended visual features through multi-head cross attention; h_1 → h_2, V → h_2 and h_2 → W indicate that the decoder fuses the visual feature and the language feature and uses the fused feature h_2 to infer the next predicted word W. When the observational probability P(W|V, h_1) without causal intervention is used as the training objective, due to the confounders D_1 and D_2 the generative description model may learn some spurious correlations between the visual feature V and the predicted word W; to explain causal intervention in caption generation, P(W|V, h_1) is expressed as:
P(W|V, h_1) = Σ_{d_1} Σ_{d_2} P(W|V, h_1, d_1, d_2) P(d_1|V) P(d_2|h_1)
where the confounders D_1 and D_2 usually introduce observational bias through P(d_1|V) and P(d_2|h_1); the causal intervention P(W|do(V), do(h_1)) is used instead of the conventional image captioning training objective P(W|V, h_1), thereby eliminating the causal effect of D_1 on the visual feature V and of D_2 on the attended word feature h_1, so that the two back-door paths V ← D_1 → W and h_1 ← D_2 → W are blocked and the spurious associations are eliminated; assuming that the confounders D_1 and D_2 can each be stratified, P(W|do(V), do(h_1)) can be written, according to back-door adjustment, as:
P(W|do(V), do(h_1)) = Σ_{d_1} Σ_{d_2} P(W|V, h_1, d_1, d_2) P(d_1) P(d_2)
thus, according to the above adjustment, the caption generation model (whose predicted output probability is P(W|do(V), do(h_1))) is forced to learn the true causal effect V → W instead of the spurious associations caused by the visual confounder D_1 and the linguistic confounder D_2; since D_1 and D_2 are both unobserved and lie beyond the objective of caption generation, an approximate visual confounder dictionary D_1 and linguistic confounder dictionary D_2 need to be constructed; a visual matrix V_r ∈ R^{c×d_v} is constructed by setting each entry to the average RoI feature of the objects of one class in the image classification dataset, where c is the number of classes in the training dataset and d_v denotes the dimension of each RoI feature; at the same time, the d_e-dimensional word embeddings W_e ∈ R^{N×d_e} of a predefined vocabulary are used to construct the semantic space, where N is the length of the vocabulary and d_e is the word feature dimension; the captioning model is then trained to learn two linear projections P_v and P_w, which convert the visual matrix V_r and the word embeddings W_e into D_1 and D_2 respectively by the formulas D_1 = V_r P_v and D_2 = W_e P_w, and the NWGM approximation is used to compute:
P(W|do(V), do(h_1)) ≈ Softmax{ g(h_2, E_{D_1}[D_1], E_{D_2}[D_2]) }
where g(·) denotes a fully connected layer, and E_{D_1}[D_1] and E_{D_2}[D_2] are the expectations of D_1 and D_2 conditioned on the fused feature h_2, which increases the representation capability of the interventional Transformer decoder; P(W|do(V)) denotes the probability of predicting the generated word after eliminating the visual confounding features, and P(W|do(V), do(h_1)) denotes the predicted output probability after eliminating both the linguistic context confounding and the visual confounding features.
Further, the unaligned-feature Transformer encoder comprises Faster R-CNN, the interventional object detector IOD and a standard Transformer encoder comprising a multi-head attention layer, a residual standard layer and a feed-forward neural network layer;
the interventional Transformer decoder is formed by inserting a causal intervention CI module after the feed-forward neural network layer of a standard Transformer decoder, wherein the standard Transformer decoder comprises a masked attention layer, a multi-head attention layer, a residual standard layer and a feed-forward neural network layer;
the part of the unaligned-feature Transformer encoder consisting of the multi-head attention layer, the residual standard layer and the feed-forward neural network layer is stacked L times; the part of the interventional Transformer decoder consisting of the masked attention layer, the multi-head attention layer, the residual standard layer, the feed-forward neural network layer and the causal intervention CI module is stacked L times;
the Transformer decoder and the Transformer encoder both comprise a multi-head attention layer, a residual standard layer and a feed-forward neural network layer; through the visual dictionary D_1 and the language dictionary D_2, the causal intervention module CI of the interventional Transformer decoder combines the fused feature h_2 with the visual confounder D_1 and the linguistic confounder D_2 to predict the next word at each time step, a begin marker being entered as text at the start of prediction and the last generated word being entered as text at each subsequent time step; that is, through back-door adjustment, cutting the confounding links effectively eliminates the unobserved confounders, which constitutes the causal intervention;
firstly, bottom-up features are extracted from the input image through Faster R-CNN and deconfounded object features are extracted from the RoI candidate regions by the interventional object detector IOD; the UFT encoder takes the bottom-up features and the IOD features as input and performs the alignment and fusion operations; the interventional Transformer decoder takes the integrated visual features as input and combines the input word information at each time step, and the output of the last decoder layer is projected to an N-dimensional space by a linear embedding layer, where N is the vocabulary size; finally, the probabilities of the words in the vocabulary are predicted using a softmax operation to generate the final predicted word; during training the word at each time step comes from the ground-truth annotated sentence, while at inference the output word of the previous time step is used as input.
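The autoregressive word prediction described above can be illustrated with the following sketch (assuming PyTorch): generation starts from a begin-of-sentence marker, at every time step the previously generated words are fed back in, and the decoder output projected to the vocabulary is passed through a softmax to pick the next word; the model interface, token ids and the trivial dummy model used to make the example runnable are all assumptions.

```python
import torch

@torch.no_grad()
def greedy_decode(model, visual_feats, bos_id=1, eos_id=2, max_len=20):
    # Start every sequence with the begin-of-sentence marker.
    words = torch.full((visual_feats.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len):
        logits = model(visual_feats, words)            # (B, t, vocab_size)
        next_word = logits[:, -1].softmax(-1).argmax(-1, keepdim=True)
        words = torch.cat([words, next_word], dim=1)   # feed the last generated word back in
        if (next_word == eos_id).all():
            break
    return words

class DummyCaptioner(torch.nn.Module):
    # Stand-in with the assumed (visual_feats, words) -> logits interface.
    def __init__(self, vocab=100, d=16):
        super().__init__()
        self.emb = torch.nn.Embedding(vocab, d)
        self.out = torch.nn.Linear(d, vocab)
    def forward(self, visual_feats, words):
        return self.out(self.emb(words) + visual_feats.mean(1, keepdim=True))

caption_ids = greedy_decode(DummyCaptioner(), torch.randn(2, 100, 16))
```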
Further, the causal inference image captioning CIIC framework is pre-trained:
firstly, pre-training is carried out with word-level cross entropy; the training set comprises images and their corresponding description sentences, and the loss function is:
L_XE(θ) = - Σ_{t=1}^{T} log p_θ( w*_t | w*_{1:t-1} )
where θ denotes all parameters of the causal inference image captioning CIIC framework model, including weights and biases, and w*_{1:T} is the target ground-truth sequence; the non-differentiable metric of the model is then optimized through reinforcement learning (RL), adopting a variant of self-critical sequence training (SCST) on beam-search sampled sequences and minimizing the negative expected score:
L_RL(θ) = - E_{w_{1:T} ~ p_θ} [ r(w_{1:T}) ]
where the reward r(·) is the CIDEr-D score;
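A minimal sketch, assuming PyTorch, of the two training objectives above: word-level cross entropy against the ground-truth sequence w*_{1:T}, and an SCST-style policy-gradient loss that minimizes the negative expected reward of sampled sequences relative to a baseline; computation of the CIDEr-D reward itself is assumed to be provided externally.

```python
import torch
import torch.nn.functional as F

def xe_loss(logits, target_ids):
    # logits: (B, T, vocab), target_ids: (B, T) ground-truth word ids w*_{1:T}
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target_ids.reshape(-1))

def scst_loss(sample_log_probs, sample_reward, baseline_reward):
    # sample_log_probs: (B, T) log-probabilities of the sampled sequence
    # sample_reward / baseline_reward: (B,) CIDEr-D of sampled vs. baseline captions
    advantage = (sample_reward - baseline_reward).unsqueeze(1)   # self-critical baseline
    return -(advantage * sample_log_probs).mean()                # approximates -E[r(w_{1:T})]

loss = xe_loss(torch.randn(2, 12, 100), torch.randint(0, 100, (2, 12)))
rl = scst_loss(torch.randn(2, 12), torch.rand(2), torch.rand(2))
```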
testing the trained causal inference image captioning CIIC framework: sentences are generated word by word in sequence using beam search; the image to be recognized is input into the trained model and, after a series of processing steps, passed to the decoder; in the first decoding step the top k candidates are considered; k second words are generated for the k first words, and the top k [first word, second word] combinations are selected according to their scores; for the k second words, k third words are selected, and the top k [first word, second word, third word] combinations are retained; each decoding step is repeated in this way; after the k sequences are finished, the sequence with the best overall score, i.e. the sequence with the highest probability in the final beam, is selected.
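A minimal sketch of the beam-search procedure described above, assuming PyTorch: at each step every partial sequence in the beam is expanded with its top-k next words, the top-k combinations are kept by cumulative log probability, and the best-scoring finished sequence in the final beam is returned; the step_logits callable, token ids and the toy scorer are assumptions.

```python
import torch

@torch.no_grad()
def beam_search(step_logits, bos_id=1, eos_id=2, beam_size=3, max_len=20):
    beams = [(torch.tensor([bos_id]), 0.0)]             # (sequence, cumulative log prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1].item() == eos_id:
                finished.append((seq, score))
                continue
            log_probs = step_logits(seq.unsqueeze(0)).log_softmax(-1)[0, -1]
            top_lp, top_ids = log_probs.topk(beam_size)
            for lp, idx in zip(top_lp, top_ids):         # expand each prefix with k next words
                candidates.append((torch.cat([seq, idx.view(1)]), score + lp.item()))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]          # best-scoring sequence in the final beam

# Example with a toy scorer producing random next-word logits over a 50-word vocabulary:
best = beam_search(lambda words: torch.randn(1, words.size(1), 50))
```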
The beneficial effects are as follows:
1) From the causal perspective, the method adopts a new Transformer-based image captioning architecture CIIC, seamlessly combines causal intervention into object detection and caption generation, and jointly alleviates the confounding effects. On the one hand, the proposed IOD effectively disentangles visual features and facilitates deconfounded image captioning. On the other hand, the proposed ITD employs causal intervention to handle visual and linguistic confounders in the sentence generation process simultaneously;
2) The method decomposes the confounder into visual and textual confounders and presents a more detailed causal graph.
3) The method significantly improves the performance of Transformer-based image captioning models and achieves the best current image captioning performance under the single-model setting on the MS-COCO dataset.
Drawings
Fig. 1 is a diagram of an image caption framework used in the automatic image caption generation method based on causal reasoning according to the present invention.
FIG. 2 is a schematic diagram of the structure of an interventional object detector used in the method of the invention.
FIG. 3 is a diagram illustrating causal intervention in the image description of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings of the specification:
as shown in fig. 1, the automatic image title generation method based on causal reasoning according to the present invention first divides the confusion factor, and divides the existing causal graph into two categories: visual confusion and text confusion. A causal inference method image caption (CIIC) framework structure based on a causal graph comprises the following steps: both types of aliasing are faced jointly by an Interventional Object Detector (IOD) and an Interventional Transformer Decoder (ITD). The IOD integrates causal reasoning into FASTER R-CNN to cope with visual confusion, with the aim of obtaining a region-based non-entangled representation. ITD eliminates visual and linguistic confusion by doing it at the transform decoder stage. Region-of-interest features are first separated by an interventional object perceptron (IOD) and then combined with the bottom-up property of FASTER R-CNN as input to the transform encoder. In CIIC, we propose a causal intervention module to deal with visual and linguistic confusion in word prediction. Our CIIC can effectively eliminate the pseudo-correlation that occurs in visual feature representation and title generation to obtain a more realistic image title.
In the interventional object detector (IOD), Faster R-CNN is used as the visual backbone to extract regions of interest (RoI) on the feature map. Each RoI feature is used to predict a class probability output y_C and a bounding box y_B respectively. According to the class probability output y_C and the confounder dictionary Z, we perform the do operator to predict the final class label y_I.
The region-of-interest features are disentangled by the interventional object detector and then combined with the bottom-up features of Faster R-CNN as input to the Transformer encoder. In CIIC, a causal intervention module (Causal Intervention) is proposed to deal with visual and linguistic confounding in word prediction. The symbol "L ×" in fig. 1 indicates that the encoding block (comprising the multi-head attention layer, the residual standard layer and the feed-forward neural network layer) and the decoding block (comprising the masked attention layer, the multi-head attention layer, the residual standard layer, the feed-forward neural network layer and the causal intervention CI module) in the dashed boxes are stacked L times. CIIC can effectively eliminate the spurious correlations occurring in visual feature representation and caption generation to obtain a more faithful image caption.
Specifically:
the image features of the title to be generated are respectively extracted through FASTER R-CNN, and the regional features which are proposed through an Intervention Object Detector (IOD) and eliminate visual confusion are provided.
The interventional object detector adopts the following approach:
conventional object detectors, such as Faster R-CNN, basically use the likelihood estimate P(Y|X) as the training objective of the classifier, resulting in spurious correlations caused by the confounder Z:

P(Y|X) = Σ_z P(Y|X, Z=z) P(Z=z|X)

where X is the region-based visual feature of the input image, Z is the visual confounder of the image, and Y is the category label.
We propose to use the causal intervention P(Y|do(X)) as the new classification objective for object detection, where the do operator do(·) cuts the link Z → X. Since estimating P(Y|do(X)) by sampling during actual training would be time-consuming and labor-intensive, making the training time prohibitive, the normalized weighted geometric mean (NWGM) approximation is applied, and the class probability output by the interventional object detector is:

P(y_i^c | do(x)) ≈ p_i^c( concat(x, Σ_{j=1}^{n} P(z_j) z_j) )

where concat denotes matrix concatenation, y_i^c is the i-th class label, and p_i^c(·) is the probability output of the pre-trained classifier for the i-th class given x; x denotes a region feature of the specific input image and y_i^c its corresponding class label; X and Y denote the random variables of x and y_i^c, and x and y_i^c denote specific sample values.

The confounder is approximated as a fixed confounder dictionary Z = [z_1, z_2, ..., z_n], where n denotes the number of classes in the dataset and z_i denotes the average RoI feature of the i-th class, each RoI feature being pre-extracted by Faster R-CNN. The structure of the IOD feature extractor is shown in FIG. 2: Faster R-CNN (the Faster regional convolutional neural network) extracts regions of interest RoI on the feature map, such as "the upper body of a child in blue clothes", and each RoI feature is used to predict the class probability output label y_C (disturbed by surrounding visual confounders) and the bounding box y_B; according to the class probability output label y_C and the confounder dictionary Z, the do operator is performed to predict the final class label y_I (eliminating the interference of surrounding visual confounders);
and thirdly, deconfounded object features, i.e. IOD features, are extracted from all RoI candidate regions by the IOD extractor; considering that the bottom-up features extracted from the same image have discriminative ability over different object attributes, the IOD features and the bottom-up features are sent into two linear layers and mapped to generate Q, K and V vectors, where Q denotes the query vector, K denotes the vector measuring the correlation between the query and other information, and V denotes the vector of the queried information. They are integrated through self attention and cross attention to promote the visual representation of the CIIC model; since the bottom-up features and the IOD features are misaligned, a multi-view Transformer encoder, i.e. the unaligned-feature Transformer encoder UFT, is introduced to reconcile them; the UFT encoder takes the misaligned visual features (the bottom-up features and the IOD features) as input and performs the alignment and fusion operations:
Let X_F ∈ R^{m×d1} and X_I ∈ R^{n×d2} denote the bottom-up features and the IOD features extracted from an image respectively, where m ≠ n and d1 ≠ d2. Two linear layers (constructed in the Transformer network) convert X_F and X_I into a common d-dimensional space, denoted respectively by X̂_F ∈ R^{m×d} and X̂_I ∈ R^{n×d}. X̂_I is selected as the principal visual feature, and its cross attention over X̂_F is learned with the following formula:

A_cross = MultiHead(X̂_I, X̂_F, X̂_F)

where MultiHead(·) denotes the multi-head attention function of a standard Transformer and X̂_F provides the corresponding key and value features. Likewise, a multi-head attention model on X̂_I is established:

A_self = MultiHead(X̂_I, X̂_I, X̂_I)

Note that A_cross and A_self have the same shape (in the above formula the multi-head attention is computed from X̂_I alone, i.e. Q, K and V all come from X̂_I). Then AddNorm (the residual standard layer) is used for encapsulation, and the fused feature information F is:

F = LayerNorm(X̂_I + A_cross + A_self)

where LayerNorm denotes layer normalization; finally the fused feature information F is sent to the FFN module (the feed-forward neural network in the Transformer) to generate the encoding result of the UFT;
and fourthly, in order to alleviate the spurious correlation between the attended visual features and the corresponding words, a Transformer-based decoder structure is constructed, and a causal intervention module is integrated into each Transformer decoder layer to cope with visual and linguistic confounding in image description; as shown in FIG. 1, the causal intervention module is introduced into the conventional Transformer decoder, the region-based disentangled representation obtained by the encoder and the text are taken as the input of the decoder, and visual and linguistic confounding in the decoding process is eliminated through the causal intervention module to generate the final image description.
Causal intervention in image description: as shown in fig. 3, the true causal effect V → W is captured by cutting the two links D_2 → h_1 and D_1 → V respectively, thereby blocking the back-door paths V ← h_1 ← D_2 → W and V ← D_1 → W.
The structural causal model (SCM) is used to construct the causal relationships among the attended visual feature V, the visual context D_1, the linguistic context D_2, the attended word feature h_1 of the partially generated sentence, the fused feature h_2 and the predicted word W:
specifically, the causal effect V → W indicates that the attended visual features lead to the generation of the corresponding words; the causal effect of D_1 on V, denoted D_1 → V, arises because some frequently occurring visual context severely affects the attended visual features when a captioning model is trained, and the causal effect D_1 → W means that the visual context directly affects the frequency of occurrence of some related words in the generated description; D_2 → h_1 → V indicates that the attended word features, affected by the linguistic context, guide the attended visual features through multi-head cross attention; h_1 → h_2, V → h_2 and h_2 → W indicate that the decoder fuses the visual feature and the language feature and uses the fused feature h_2 to infer the next predicted word W. When the observational probability P(W|V, h_1) (the probability observed without causal intervention) is used as the training objective, due to the confounders D_1 and D_2 the generative description model may learn some spurious correlations between the visual feature V and the predicted word W; to explain causal intervention in caption generation, P(W|V, h_1) is expressed as:
P(W|V, h_1) = Σ_{d_1} Σ_{d_2} P(W|V, h_1, d_1, d_2) P(d_1|V) P(d_2|h_1)
where the confounders usually introduce observational bias through P(d_1|V) and P(d_2|h_1); the causal intervention P(W|do(V), do(h_1)) is used instead of the conventional image captioning training objective P(W|V, h_1), thereby eliminating the causal effect of D_1 on the visual feature V and of D_2 on the attended word feature h_1, so that the two back-door paths V ← D_1 → W and h_1 ← D_2 → W are blocked and the spurious associations are eliminated; assuming that the confounders D_1 and D_2 can each be stratified, P(W|do(V), do(h_1)) can be written, according to back-door adjustment, as:
P(W|do(V), do(h_1)) = Σ_{d_1} Σ_{d_2} P(W|V, h_1, d_1, d_2) P(d_1) P(d_2)
thus, according to the adjustment in the formula, the caption generation model (whose predicted output probability is P(W|do(V), do(h_1))) is forced to learn the true causal effect V → W instead of the spurious associations caused by the visual confounder D_1 and the linguistic confounder D_2; since D_1 and D_2 are both unobserved and lie beyond the objective of caption generation, an approximate visual confounder dictionary D_1 and linguistic confounder dictionary D_2 need to be constructed (the visual features and word embeddings are obtained by linear projection; these dictionaries are coarse approximations of the true confounders). A visual matrix V_r ∈ R^{c×d_v} is constructed by setting each entry to the average RoI feature of the objects of one class (a class of the image classification dataset), where c is the number of classes in the training dataset (a common standard dataset) and d_v denotes the dimension of each RoI feature; at the same time, the d_e-dimensional word embeddings W_e ∈ R^{N×d_e} of a predefined vocabulary are used to construct the semantic space, where N is the length of the vocabulary and d_e is the word feature dimension; the captioning model is then trained to learn two linear projections P_v and P_w, which convert the visual matrix V_r and the word embeddings W_e into D_1 and D_2 respectively by the formulas D_1 = V_r P_v and D_2 = W_e P_w, and the NWGM approximation is used to compute:
P(W|do(V), do(h_1)) ≈ Softmax{ g(h_2, E_{D_1}[D_1], E_{D_2}[D_2]) }
where g(·) denotes a fully connected layer, and E_{D_1}[D_1] and E_{D_2}[D_2] are the expectations of D_1 and D_2 conditioned on the fused feature h_2, which increases the representation capability of the ITD;
transformer decoder architecture: the transform decoder architecture is shown in fig. 1, where the non-aligned feature transform (UFT) encoder consists of FASTER R-CNN (FASTER regional convolutional neural network), an intervening target detector IOD, and a standard transform encoder (including multi-headed attention layer, residual standard layer, and feedforward neural network layer), and the intervening transform decoder consists of a causal intervening CI module inserted after the feedforward neural network layer module of the standard transform decoder (corresponding to the light red layer above the middle dashed box decoder in fig. 1), where the symbol "L ×" represents that the coding block (including multi-headed attention layer, residual standard layer, and feedforward neural network layer) and the decoding block (including masked attention layer, multi-headed attention layer, residual standard layer, and feedforward neural network layer, and causal intervening CI module) in the dashed box are stacked L times. A general Transformer decoder is similar to a Transformer encoder and comprises a multi-head attention layer, a residual standard layer and a feedforward neural network layer, except that a plurality of covering self-attention layers are stacked in sequence by L identical decoder layers 1 And language dictionary D 2 CI Module will fuse features h 2 With a visual confusion factor D 1 And a language confusion factor D 2 Combined to predict the next word at each time step (a beginning marker is entered as text at the beginning of the prediction, followed by the last generated word as text entry at each time step). I.e. it is in fact a causal intervention by backgate adjustment (elimination of confounders that would not have been observed by cutting the chain of confounders). Firstly, extracting bottom-up features from an input image through FASTER R-CNN, extracting confusion object features from a RoI candidate region through an intervention target detector IOD, performing alignment and fusion operations by using unaligned visual features (the bottom-up features and the IOD features) as input by a UFT (unified view transform) encoder, and performing integrated visual features on each object by using an Intervention Transform Decoder (ITD)The time step is combined with the input word information and the output of the last decoder layer is then projected by the linear embedding layer into an N-dimensional space, where N is the vocabulary size. Finally, the probability of predicting words in the vocabulary using softmax operation yields the final predicted words (each time step word from the real labeled sentence at training, the final prediction being the output word at the last time step at the input).
Pre-training of the CIIC model: the model is pre-trained with word-level cross entropy (the training set comprises images and their corresponding description sentences), and the loss function is:
L_XE(θ) = - Σ_{t=1}^{T} log p_θ( w*_t | w*_{1:t-1} )
where θ denotes all parameters (including weights and biases) of the CIIC model and w*_{1:T} is the target ground-truth sequence; the non-differentiable metric of the model is then optimized through reinforcement learning (RL), adopting a variant of self-critical sequence training (SCST) on beam-search sampled sequences and minimizing the negative expected score:
L_RL(θ) = - E_{w_{1:T} ~ p_θ} [ r(w_{1:T}) ]
where the reward r(·) is the CIDEr-D score;
in the testing phase, sentences are generated word by word in sequence using beam search, the trained model is input into the image to be recognized, and then the image is input into the decoder through a series of processes, and in the first decoding step, the top k candidates are considered. K second words are generated for the k first words. The first k [ first word, second word ] combinations are selected, taking into account the score obtained. For the k second words, k third words are selected, and the top k first, second, third word combinations are selected. Each decoding step is repeated. And after finishing the k sequences, selecting the sequence with the best comprehensive score to obtain the sequence with the highest probability in the last bundle.
To represent image features, the proposed IOD is first trained on the MSCOCO dataset to extract 1024-dimensional IOD features of the top 100 objects with the highest confidence. The 2048-dimensional bottom-up features of the detected objects are then extracted using a pre-trained Up-Down model. These two kinds of features are finally projected linearly to the model input dimension d = 512 and fed into the UFT encoder. Words were represented in the experiments using one-hot vectors and pre-trained GloVe word embeddings respectively; both are linearly projected onto the 512-dimensional input vectors of the ITD. To represent the word positions in the sentence, the input vectors and their sinusoidal position encodings are added before the first decoding layer. Words outside the vocabulary are represented as all-zero vectors. An Adam optimizer is used in the training phase, with a batch size of 10 and a beam size of 5. A step-decay scheme with a warm-up of 20000 iterations is used to vary the learning rate. All models are first trained for 30 epochs with cross-entropy loss, and then further optimized for another 30 epochs with the CIDEr reward at a learning rate of 5 × 10^-6. In the inference phase, we employ a beam-search strategy with beam size 3.
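A minimal sketch, assuming PyTorch, of an optimizer and learning-rate schedule consistent with the settings listed above (Adam, warm-up over 20000 iterations followed by step decay, later fine-tuning with the CIDEr reward at a learning rate of 5 × 10^-6); the base learning rate and decay factors are illustrative assumptions.

```python
import torch

def make_optimizer_and_scheduler(model, base_lr=1e-4, warmup=20000):
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    def lr_lambda(step):
        if step < warmup:
            return (step + 1) / warmup                 # linear warm-up over 20000 iterations
        return 0.5 ** ((step - warmup) // 50000)       # assumed step decay afterwards

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

optimizer, scheduler = make_optimizer_and_scheduler(torch.nn.Linear(512, 512))
for step in range(5):                                  # training loop placeholder
    optimizer.step()
    scheduler.step()
```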
In conclusion, from the causal perspective, causal intervention is seamlessly integrated into object detection and caption generation by a new Transformer-based image captioning architecture CIIC, and the confounding effects are jointly alleviated. On the one hand, the proposed IOD effectively disentangles visual features and facilitates deconfounded image captioning. On the other hand, the proposed ITD adopts causal intervention to handle visual and linguistic confounders in the sentence generation process simultaneously. Experimental results show that the method significantly improves the performance of Transformer-based image captioning models and achieves a new state of the art on the MS-COCO dataset under the single-model setting.

Claims (5)

1. An image title automatic generation method based on causal reasoning and the Transformer, characterized in that: a detailed causal inference image captioning CIIC framework is constructed on the basis of a causal graph, wherein the causal graph comprises a visual confounder and a textual confounder;
the causal inference image captioning CIIC framework comprises a sequentially connected unaligned-feature Transformer encoder UFT and an interventional Transformer decoder ITD, wherein the unaligned-feature Transformer encoder comprises a sequentially connected Faster R-CNN, an interventional object detector IOD and a standard Transformer encoder; the interventional Transformer decoder is formed by inserting a causal intervention CI module after the feed-forward neural network layer of a standard Transformer decoder; the interventional object detector IOD and the interventional Transformer decoder ITD jointly control the visual confounder and the textual confounder while the input image is encoded and then decoded;
the method comprises the steps that a non-alignment feature Transformer encoder UFT sends confusion-removing visual features extracted by an IOD and bottom-up features extracted from the same image into two linear layers for mapping to generate Q, K and V vectors, integration is carried out through self attention and cross attention, addForm operation and feedforward propagation operation in a traditional Transformer are carried out, obtained outputs are transmitted to a next coding block, and the obtained outputs are L blocks in total, namely L codes are overlapped; the ITD input of the interference-predictive Transformer decoder is a currently generated sentence part, the ITD is subjected to cross attention with the final output of an encoding end through position embedding and a mask layer, addMorm operation and feed-forward propagation operation are carried out, visual and language confusion in the decoding process is eliminated through a causal intervention CI module, then the AddMorm operation is carried out, and L-time decoding is repeated to obtain final prediction output; causal intervention CI module is to fuse visual and linguistic features h 2 With a visual confusion factor D 1 And a language confusion factor D 2 A combination of the expectations of;
the interventional object detector IOD separates the region-based visual features by eliminating visual confusion factors: separating the characteristics of the region of interest through an intervention type object sensor, and then combining the characteristics with the bottom-up characteristic of a FASTER regional convolutional neural network FASTER R-CNN to be used as the input of a transform encoder; the intervention target detector IOD integrates causal reasoning into image features extracted by FASTER R-CNN to cope with visual confusion extracted by a traditional pre-training model, so as to obtain non-entangled expression based on regions; inputting the result generated in the decoding stage into an intervention type Transformer decoder ITD, introducing causal intervention into a Transformer decoder used for generating a traditional image title, and reducing visual and language confusion in the decoding process;
by simultaneously establishing visual and linguistic concepts through the encoder and the decoder, unobserved aliasing factors between the IOD and the ITD are relieved, visual and linguistic aliasing is eliminated, pseudo-correlation occurring in visual feature representation and title generation is effectively eliminated, and finally a truer image title is generated.
2. The image title automatic generation method based on causal reasoning and the Transformer according to claim 1, characterized in that the method comprises the following specific steps:
extracting image features of the image for which a title is to be generated through Faster R-CNN, and eliminating the visually confounded regional features among them by using the interventional object detector IOD;
specifically, since the Faster R-CNN object detector uses the likelihood estimate P(Y|X) as the training objective of the classifier, spurious correlations caused by the confounder Z arise:

P(Y|X) = Σ_z P(Y|X, Z=z) P(Z=z|X)

where X is the region-based visual feature of the input image, Z is the visual confounder of the image, and Y is the category label;
therefore, the causal intervention P(Y|do(X)) is used as the new classification objective for object detection, where the do operator do(·) cuts the link Z → X; since estimating P(Y|do(X)) by sampling during actual training would make the training time prohibitive, the normalized weighted geometric mean (NWGM) approximation is applied, and the class probability output by the interventional object detector is:

P(y_i^c | do(x)) ≈ p_i^c( concat(x, Σ_{j=1}^{n} P(z_j) z_j) )

where concat denotes matrix concatenation, y_i^c is the i-th class label, and p_i^c(·) is the probability output of the pre-trained classifier for the i-th class given x; x denotes a region feature of the specific input image and y_i^c its corresponding class label, X and Y denote the random variables of x and y_i^c, and x and y_i^c denote specific sample values;

the confounder is approximated as a fixed confounder dictionary Z = [z_1, z_2, ..., z_n], where n denotes the number of classes in the dataset and z_i denotes the average RoI feature of the i-th class, each RoI feature being pre-extracted by Faster R-CNN; the IOD feature extractor works as follows: the Faster regional convolutional neural network Faster R-CNN extracts regions of interest RoI on the feature map, the feature of each region of interest RoI is used to predict a bounding box y_B and a class probability output label y_C disturbed by surrounding visual confounders, and according to the class probability output label y_C and the confounder dictionary Z, the do operator is performed to predict the final class label y_I, eliminating the interference of surrounding visual confounders;
extracting deconfounded object features from all RoI candidate regions by using the interventional object detector IOD as the IOD features, and sending the IOD features and the bottom-up features extracted from the same image, which have discriminative ability over different object attributes, into two linear layers for mapping to generate Q, K and V vectors, where Q denotes the query vector, K denotes the vector measuring the correlation between the query and other information, and V denotes the vector of the queried information; integrating them through self attention and cross attention to promote the visual representation of the CIIC model; because the bottom-up features and the IOD features are not aligned, a multi-view Transformer encoder, i.e. the unaligned-feature Transformer encoder UFT, is introduced to reconcile them, and the bottom-up features and the IOD features are input to the UFT encoder for the alignment and fusion operations:
let the bottom-up features and the IOD features extracted from the image be X^F ∈ R^(m×d1) and X^I ∈ R^(n×d2), where m ≠ n and d1 ≠ d2; two linear layers constructed in the Transformer network convert X^F and X^I into a common d-dimensional space, the results being denoted X̂^F ∈ R^(m×d) and X̂^I ∈ R^(n×d) respectively; X̂^F is selected as the main visual feature, and the cross-attention over X̂^I guided by the main visual feature is learned with the following formula:
A^I = Multihead(X̂^F, X̂^I, X̂^I)
where Multihead(·) denotes the multi-head attention function of a standard Transformer and A^I is the attended feature corresponding to X̂^F; likewise, a multi-head self-attention model is established on X̂^F:
A^F = Multihead(X̂^F, X̂^F, X̂^F)
in which the multi-head attention is sought within X̂^F, i.e. Q, K and V all come from X̂^F, so the attended features A^F and A^I have the same shape; a residual Add & Norm layer AddNorm is then used to fuse and encapsulate A^F and A^I, and the fused feature information F is:
F = LayerNorm(X̂^F + A^F + A^I)
wherein LayerNorm denotes layer normalization; finally, the fused feature information F is sent to the FFN module, i.e. the feed-forward neural network in the Transformer, to generate the UFT encoding result;
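As an illustration of the alignment and fusion just described, the following is a minimal PyTorch sketch of a UFT-style encoder layer; the class name UFTEncoderLayer, the default dimensions and the use of torch.nn.MultiheadAttention are assumptions made for the sketch rather than the patented implementation.

```python
import torch
import torch.nn as nn

class UFTEncoderLayer(nn.Module):
    """Aligns bottom-up and IOD features, fuses self- and cross-attention, applies the FFN."""
    def __init__(self, d_bu, d_iod, d_common=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.proj_bu = nn.Linear(d_bu, d_common)    # map bottom-up features X^F to the common space
        self.proj_iod = nn.Linear(d_iod, d_common)  # map IOD features X^I to the common space
        self.self_attn = nn.MultiheadAttention(d_common, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_common, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_common)
        self.ffn = nn.Sequential(nn.Linear(d_common, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_common))
        self.norm2 = nn.LayerNorm(d_common)

    def forward(self, x_bu, x_iod):
        xf = self.proj_bu(x_bu)                  # (B, m, d): main visual feature
        xi = self.proj_iod(x_iod)                # (B, n, d)
        a_f, _ = self.self_attn(xf, xf, xf)      # Q, K, V all from the main visual feature
        a_i, _ = self.cross_attn(xf, xi, xi)     # main feature attends over the IOD features
        f = self.norm1(xf + a_f + a_i)           # residual Add & Norm fusion
        return self.norm2(f + self.ffn(f))       # feed-forward module -> UFT encoding result
```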
in order to alleviate the spurious correlations between the attended visual features and the words with the corresponding meanings, a standard Transformer decoder structure is constructed, and a causal intervention module CI is integrated into every Transformer decoder layer on the basis of the standard Transformer decoder structure; the region-based disentangled representation obtained in the encoder and the text are taken as the input of the decoder, and the visual and linguistic confounding in the decoding process is eliminated by the causal intervention module CI, so that the final image title language description is generated.
3. The causal inference and Transformer-based image title automatic generation method of claim 2, wherein: a structural causal model SCM is used to construct the causal relationships among the attended visual feature V, the visual context D_1, the language context D_2, the attended word feature h_1 of the partially generated sentence, the fused feature h_2 and the predicted word W: the true causal effect is V → W; the visual context D_1 and the language context D_2 affect the visual feature V and the predicted word W respectively; the language context D_2 affects the visual feature V through the attended word feature h_1; the word feature h_1 and the visual feature V jointly determine the fused feature h_2, which finally determines the predicted word W; specifically, the causal effect V → W indicates that the attended visual features lead to the generation of the corresponding words; the causal effect of D_1 on V, denoted D_1 → V, exists because some frequently occurring visual contexts can severely affect the attended visual features when a caption model is trained, and the causal effect D_1 → W means that the visual context directly affects the frequency of occurrence of some related words in the generated description; D_2 → h_1 → V means that the attended word feature, affected by the language context, guides the attended visual feature through multi-head cross-attention; h_1 → h_2, V → h_2 and h_2 → W indicate that the decoder fuses the visual feature and the language feature and uses the fused feature h_2 to infer the next predicted word W; when the observational probability P(W | V, h_1) without causal intervention is used as the training objective, the confounding factors D_1 and D_2 may cause the caption generation model to learn spurious correlations between the visual feature V and the predicted word W; to explain the causal intervention in the image caption generation, P(W | V, h_1) is expressed as:
P(W | V, h_1) = Σ_{d_1} Σ_{d_2} P(W | V, h_1, d_1, d_2) P(d_1 | V) P(d_2 | h_1)
wherein the confounding factors D_1 and D_2 usually introduce observation bias through P(d_1 | V) and P(d_2 | h_1); the causal intervention P(W | do(V), do(h_1)) is used instead of the conventional image description training objective P(W | V, h_1), thereby eliminating the causal effect of D_1 on the visual feature V and of D_2 on the attended word feature h_1, so that the two back-door paths V ← D_1 → W and h_1 ← D_2 → W are blocked and the spurious associations are eliminated; assuming that the confounding factors D_1 and D_2 can each be stratified, P(W | do(V), do(h_1)) can be adjusted by the back-door criterion as:
P(W | do(V), do(h_1)) = Σ_{d_1} Σ_{d_2} P(W | V, h_1, d_1, d_2) P(d_1) P(d_2)
thus, according to the adjustment in the above formula, the image caption generation model (P(W | do(V), do(h_1)) being the predicted output probability of the model) is forced to learn the true causal effect V → W rather than the spurious associations induced by the visual confounding factor D_1 and the language confounding factor D_2; since D_1 and D_2 are both unobserved and lie beyond the objective of the image caption generation, an approximate visual confounder dictionary D_1 and a language confounder dictionary D_2 need to be constructed: a visual matrix V_r ∈ R^(c×d_v) is constructed by setting each entry to the average RoI feature of the objects of the corresponding class in the image classification dataset, where c is the number of classes in the training dataset and d_v denotes the dimension of each RoI feature; at the same time, the d_e-dimensional word embeddings W_e ∈ R^(N×d_e) of a predefined vocabulary are used to construct a semantic space, where N is the length of the vocabulary and d_e is the word feature dimension; the description model is then trained to learn two linear projections P_v and P_w, which convert the visual matrix V_r and the word embeddings W_e into D_1 and D_2 respectively by the formulas D_1 = V_r P_v and D_2 = W_e P_w; the prediction is calculated with the NWGM approximation:
P(W | do(V), do(h_1)) ≈ Softmax{ g(h_2, E_{D_1}[D_1], E_{D_2}[D_2]) },
wherein g (-) represents the fully connected layer,
Figure RE-FDA0003857911710000046
and
Figure RE-FDA0003857911710000047
by setting D 1 And D 2 By fusing the features h 2 Conditional on increasing the representation capability of the interferometric Transformer decoder; do (h) 1 ) P (W | do (V)) represents the probability P (W | do (V), do (h)) of predicting the generated word after eliminating the visual confusion feature 1 ) ) represents the predicted output probability after eliminating the language context confusion and visual confusion features.
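For illustration, a minimal PyTorch sketch of a causal intervention module CI in the spirit of the formula above follows; the class name CIModule, the attention-style estimation of the conditional expectations and the dimensions are assumptions made for the sketch, not the exact patented design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CIModule(nn.Module):
    """Computes approximately Softmax{ g(h2, E_D1[D1], E_D2[D2]) } as the next-word distribution."""
    def __init__(self, d_model, d1_dim, d2_dim, vocab_size):
        super().__init__()
        self.q1 = nn.Linear(d_model, d1_dim)   # query the visual confounder dictionary with h2
        self.q2 = nn.Linear(d_model, d2_dim)   # query the language confounder dictionary with h2
        self.g = nn.Linear(d_model + d1_dim + d2_dim, vocab_size)  # fully connected layer g(.)

    def forward(self, h2, D1, D2):
        # h2: (B, d_model); D1: (c, d1_dim) visual dictionary; D2: (N, d2_dim) language dictionary
        w1 = F.softmax(self.q1(h2) @ D1.t(), dim=-1)   # P(d1 | h2), shape (B, c)
        w2 = F.softmax(self.q2(h2) @ D2.t(), dim=-1)   # P(d2 | h2), shape (B, N)
        e_d1 = w1 @ D1                                  # E_D1[D1] conditioned on h2
        e_d2 = w2 @ D2                                  # E_D2[D2] conditioned on h2
        logits = self.g(torch.cat([h2, e_d1, e_d2], dim=-1))
        return F.softmax(logits, dim=-1)                # approx. P(W | do(V), do(h1))
```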
4. The causal inference and Transformer-based image title automatic generation method of claim 2, wherein:
the unaligned feature Transformer encoder UFT comprises a FASTER R-CNN, an interventional object detector IOD and a standard Transformer encoder, the standard Transformer encoder comprising a multi-head attention layer, a residual Add & Norm layer and a feed-forward neural network layer;
the interventional Transformer decoder is formed by inserting a causal intervention CI module after the feed-forward neural network layer module of a standard Transformer decoder, the standard Transformer decoder comprising a masked attention layer, a multi-head attention layer, a residual Add & Norm layer and a feed-forward neural network layer;
the part of the unaligned feature Transformer encoder consisting of the multi-head attention layer, the residual Add & Norm layer and the feed-forward neural network layer is stacked L times; the part of the interventional Transformer decoder consisting of the masked attention layer, the multi-head attention layer, the residual Add & Norm layer, the feed-forward neural network layer and the causal intervention CI module is stacked L times;
both the Transformer decoder and the Transformer encoder comprise a multi-head attention layer, a residual Add & Norm layer and a feed-forward neural network layer; through the visual dictionary D_1 and the language dictionary D_2, the causal intervention module CI of the interventional Transformer decoder combines the fused feature h_2 with the visual confounding factor D_1 and the language confounding factor D_2 to predict the next word at each time step; a begin marker is input as text at the start of prediction, and at every subsequent time step the last generated word is input as text; that is, through the back-door adjustment the links from the confounding factors are cut off, and the unobserved confounding factors are effectively eliminated to realize the causal intervention;
first, bottom-up features are extracted from the input image through FASTER R-CNN, and de-confounded object features are extracted from the RoI candidate regions through the interventional object detector IOD; the UFT encoder takes the bottom-up features and the IOD features as input and performs the alignment and fusion operations; the interventional Transformer decoder takes the integrated visual features as input and combines them with the input word information at each time step; the output of the last decoder layer is projected to an N-dimensional space by a linear embedding layer, where N is the size of the vocabulary; finally, the probabilities of the words in the vocabulary are predicted with a softmax operation to generate the final predicted word; during training the word at each time step comes from the ground-truth annotated sentence, whereas at inference the output word of the previous time step is taken as input.
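A minimal sketch of the step-wise generation loop described above is given below; model.encode, model.decode_step and the BOS/EOS token ids are assumed names standing in for the CIIC components, and greedy selection is used here only to keep the sketch short (beam search is described in claim 5).

```python
import torch

@torch.no_grad()
def generate_caption(model, bu_feats, iod_feats, max_len=20, bos_id=1, eos_id=2):
    vis = model.encode(bu_feats, iod_feats)        # UFT-encoded visual features
    words = [bos_id]                               # begin marker as the first text input
    for _ in range(max_len):
        tokens = torch.tensor(words).unsqueeze(0)  # (1, t) words generated so far
        probs = model.decode_step(vis, tokens)     # (1, t, N) next-word distributions
        next_word = int(probs[0, -1].argmax())     # greedy choice at this time step
        if next_word == eos_id:
            break
        words.append(next_word)                    # last generated word fed back as input
    return words[1:]
```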
5. The causal inference and Transformer-based image title automatic generation method of claim 4, wherein the causal reasoning image caption CIIC framework is pre-trained as follows:
firstly, pre-training is carried out with a word-level cross-entropy loss; the training set comprises images and their corresponding description sentences, and the loss function is:
L_XE(θ) = - Σ_{t=1}^{T} log p_θ( w*_t | w*_{1:t-1} )
wherein θ denotes all parameters of the causal reasoning image caption CIIC framework model, including the weights and biases, and w*_{1:T} is the target ground-truth sequence; the non-differentiable evaluation metrics of the model are then optimized through reinforcement learning RL, adopting a variant of self-critical sequence training SCST on beam-search-sampled sequences and minimizing the negative expected score:
L_RL(θ) = - E_{w_{1:T} ~ p_θ} [ r(w_{1:T}) ]
wherein the reward r(·) is the CIDEr-D score;
testing the trained causal reasoning image caption CIIC framework: beam search is used to generate the sentence word by word; the image to be recognized is input into the trained model and, after a series of processing steps, into the decoder; at the first decoding step the top k candidates are considered; k second words are generated for the k first words, and the top k [first word, second word] combinations are selected according to the obtained scores; k third words are then selected for the k second words and the top k [first word, second word, third word] combinations are kept; each decoding step is repeated in this way, and after the k sequences are finished, the sequence with the best composite score, i.e. the sequence with the highest probability in the final beam, is selected.
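A simplified beam-search sketch matching the decoding procedure above follows; model.decode_step, vis and the BOS/EOS ids are assumed names, and refinements such as length normalization are omitted.

```python
import torch

@torch.no_grad()
def beam_search(model, vis, k=3, max_len=20, bos_id=1, eos_id=2):
    beams = [([bos_id], 0.0)]                       # (word ids, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for words, score in beams:
            tokens = torch.tensor(words).unsqueeze(0)
            logprobs = model.decode_step(vis, tokens)[0, -1].log()   # next-word log-probs
            top_lp, top_ids = logprobs.topk(k)                       # k continuations per beam
            for lp, wid in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((words + [wid], score + lp))
        # keep the top-k [first word, ..., current word] combinations by score
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for words, score in candidates[:k]:
            (finished if words[-1] == eos_id else beams).append((words, score))
        if not beams:                                # all k sequences have ended
            break
    best = max(finished or beams, key=lambda c: c[1])  # highest composite score in the final beam
    return best[0][1:]
```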
CN202210661517.XA 2022-06-13 2022-06-13 Image title automatic generation method based on causal reasoning Pending CN115239944A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210661517.XA CN115239944A (en) 2022-06-13 2022-06-13 Image title automatic generation method based on causal reasoning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210661517.XA CN115239944A (en) 2022-06-13 2022-06-13 Image title automatic generation method based on causal reasoning

Publications (1)

Publication Number Publication Date
CN115239944A true CN115239944A (en) 2022-10-25

Family

ID=83669874

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210661517.XA Pending CN115239944A (en) 2022-06-13 2022-06-13 Image title automatic generation method based on causal reasoning

Country Status (1)

Country Link
CN (1) CN115239944A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116502958A (en) * 2023-05-10 2023-07-28 浙江大学 Spinning processing quality diagnosis method for aerospace thin-walled part based on causal analysis
CN116502958B (en) * 2023-05-10 2023-11-10 浙江大学 Spinning processing quality diagnosis method for aerospace thin-walled part based on causal analysis
CN116307274A (en) * 2023-05-18 2023-06-23 北京航空航天大学 Urban area energy consumption prediction method considering causal intervention
CN116307274B (en) * 2023-05-18 2023-08-18 北京航空航天大学 Urban area energy consumption prediction method considering causal intervention
CN116612365A (en) * 2023-06-09 2023-08-18 匀熵智能科技(无锡)有限公司 Image subtitle generating method based on target detection and natural language processing
CN116612365B (en) * 2023-06-09 2024-01-23 匀熵智能科技(无锡)有限公司 Image subtitle generating method based on target detection and natural language processing
CN116912851A (en) * 2023-07-25 2023-10-20 京东方科技集团股份有限公司 Image processing method, device, electronic equipment and readable storage medium
CN116738345A (en) * 2023-08-15 2023-09-12 腾讯科技(深圳)有限公司 Classification processing method, related device and medium
CN116738345B (en) * 2023-08-15 2024-03-01 腾讯科技(深圳)有限公司 Classification processing method, related device and medium
CN116862000A (en) * 2023-09-01 2023-10-10 浪潮电子信息产业股份有限公司 Causal thinking chain generation method, device and equipment for generating artificial intelligence
CN116862000B (en) * 2023-09-01 2024-01-23 浪潮电子信息产业股份有限公司 Causal thinking chain generation method, device and equipment for generating artificial intelligence
CN117273151A (en) * 2023-11-21 2023-12-22 杭州海康威视数字技术股份有限公司 Scientific instrument use analysis method, device and system based on large language model
CN117273151B (en) * 2023-11-21 2024-03-15 杭州海康威视数字技术股份有限公司 Scientific instrument use analysis method, device and system based on large language model
CN117648633A (en) * 2024-01-29 2024-03-05 西南石油大学 Sensitive information identification method and system based on enhanced graph attention network
CN117648633B (en) * 2024-01-29 2024-04-19 西南石油大学 Sensitive information identification method and system based on enhanced graph attention network

Similar Documents

Publication Publication Date Title
CN115239944A (en) Image title automatic generation method based on causal reasoning
CN110909673B (en) Pedestrian re-identification method based on natural language description
CN111783705B (en) Character recognition method and system based on attention mechanism
CN112633364A (en) Multi-modal emotion recognition method based on Transformer-ESIM attention mechanism
CN113486669B (en) Semantic recognition method for emergency rescue input voice
US20220076100A1 (en) Multi-Dimensional Deep Neural Network
CN116341562A (en) Similar problem generation method based on Unilm language model
CN113946706A (en) Image description generation method based on reference preposition description
CN113657115A (en) Multi-modal Mongolian emotion analysis method based on ironic recognition and fine-grained feature fusion
CN113919319A (en) Script event prediction method based on action scene reinforcement
CN110570484B (en) Text-guided image coloring method under image decoupling representation
CN115718815A (en) Cross-modal retrieval method and system
CN113656569A (en) Generating type dialogue method based on context information reasoning
CN116958700A (en) Image classification method based on prompt engineering and contrast learning
CN116108856B (en) Emotion recognition method and system based on long and short loop cognition and latent emotion display interaction
CN116595407A (en) Event argument detection method and system based on label sequence consistency modeling
CN115329776B (en) Semantic analysis method for network security co-processing based on less-sample learning
CN116452688A (en) Image description generation method based on common attention mechanism
CN114972795B (en) Garment image subtitle generating method combining attribute detection and visual perception
CN115828142A (en) Interpretable text classification system based on double-path encoder
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof
CN111813924A (en) Category detection algorithm and system based on extensible dynamic selection and attention mechanism
CN117746441B (en) Visual language understanding method, device, equipment and readable storage medium
CN117173530B (en) Target abnormality detection method and device
CN112765955B (en) Cross-modal instance segmentation method under Chinese finger representation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination