CN116912599A - Image diversified description method based on conditional variation self-coding and contrast learning - Google Patents

Image diversified description method based on conditional variation self-coding and contrast learning

Info

Publication number
CN116912599A
CN116912599A (application CN202311009413.1A)
Authority
CN
China
Prior art keywords
image
description
hidden variable
model
variation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311009413.1A
Other languages
Chinese (zh)
Inventor
刘明明
刘兵
徐静
张海燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Institute of Architectural Technology
Original Assignee
Jiangsu Institute of Architectural Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Institute of Architectural Technology
Priority to CN202311009413.1A
Publication of CN116912599A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image diversified description method based on conditional variational auto-encoding and contrastive learning, which comprises the following steps: inputting each image-description pair in the training set into a single-branch conditional variational auto-encoder and decoder network model for pre-training, to obtain a pre-trained sequence hidden variable; introducing another single-branch conditional variational auto-encoder and decoder network, calculating the variational lower-bound loss function, and obtaining the encoded sequence hidden variable from the conditional variational encoder of this branch; introducing a word-level bidirectional contrastive learning loss function to improve the word-level discrimination capability of the sequence hidden variable z_c; obtaining a global contrastive learning objective function; and obtaining a joint optimization objective that improves the hidden-representation discrimination capability and the fine-grained description-sentence decoding capability of the image diversified description model, thereby obtaining high-quality diversified description sentences.

Description

Image diversified description method based on conditional variation self-coding and contrast learning
Technical Field
The invention relates to digital data processing technology, and in particular to an image diversified description method based on conditional variational auto-encoding and contrastive learning.
Background
The image description task is a fundamental multi-modal task at the intersection of computer vision and natural language processing. Early image description models ignored the diversity of the generated descriptions and focused only on accuracy, so the generated descriptions were simple and repetitive and did not reflect the richness of human language. Image diversified description has therefore become a research hotspot for more and more researchers; the task is to generate, for a given image, multiple description sentences with varied words and sentence patterns while still guaranteeing that accurate descriptions are produced. The patent with application number 202211628528.4 provides an interpretable text classification system based on a two-way encoder, which fuses the semantic representations of multi-head attention and a bidirectional gated recurrent unit to solve the problem of mismatch between the query and the attention result. However, each hidden variable in that patent is generated directly by a different single branch, so the hidden variables produced by such models cannot distinguish paired from unpaired image-description combinations, and even descriptions generated for different images are hard to distinguish from one another.
Disclosure of Invention
The invention aims to provide an image diversified description method based on conditional variational auto-encoding and contrastive learning, which comprises the following steps:
Step S100, inputting each image-description pair in the training set into a single-branch conditional variational auto-encoder and decoder network model for pre-training, to obtain a pre-trained sequence hidden variable z_b;
Step S200, introducing another single-branch conditional variational auto-encoder and decoder network, calculating the variational lower-bound loss function L_1(x, I), and obtaining the encoded sequence hidden variable z_c from the conditional variational encoder of this branch;
Step S300, introducing a word-level bidirectional contrastive learning loss function L_r(z_b, z_c) to improve the word-level discrimination capability of the sequence hidden variable z_c;
Step S400, generating the corresponding description sentence C_N using the pre-trained sequence hidden variable z_b, generating the corresponding description sentence C_G using the sequence hidden variable z_c, and performing sentence-level global contrastive learning on C_G using the ground-truth sentence C_T and a cross-entropy loss function, to obtain the global contrastive learning objective function L_2(x, I);
Step S500, obtaining the joint optimization objective L_total = λ_1 L_1(x, I) + λ_2 L_2(x, I) + λ_3 L_r(z_b, z_c), improving the hidden-representation discrimination capability and fine-grained description-sentence decoding capability of the image diversified description model, thereby obtaining high-quality diversified description sentences, where λ_1, λ_2 and λ_3 are weight parameters that balance the losses of each part.
Further, in step S200, the variational lower-bound loss function L_1(x, I) is the variational lower bound whose first term is the expectation of the log conditional likelihood of the sentences generated by the decoder and whose second term is the KL divergence, over all time steps, between the posterior model and the prior model p_θ(z_t | z_<t, x_<t, I), where θ is a model parameter, x denotes a description of length T, x_t is the word generated at step t, z is the hidden variable, z_t is the hidden variable at step t, t ∈ T, and I denotes the image.
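Since the formula itself is given only as a figure in the original publication, the following is a reconstruction consistent with the definitions above, assuming the standard sequential conditional VAE lower bound and writing the posterior encoder as q_φ (an assumed symbol, since the text does not name it):

L_1(x, I) = \mathbb{E}_{q_\phi(z \mid x, I)}\Bigl[\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, z_{\le t}, I)\Bigr] - \sum_{t=1}^{T} \mathrm{KL}\bigl(q_\phi(z_t \mid z_{<t}, x_{\le t}, I) \,\|\, p_\theta(z_t \mid z_{<t}, x_{<t}, I)\bigr)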
Further, in step S300, the word-level bidirectional contrastive learning loss function L_r(z_b, z_c) is obtained by contrasting paired and unpaired hidden variables, where m denotes the positive margin, (z_b, z_c) denotes the paired hidden variables obtained by encoding the same image-description pair in the same batch with the conditional variational encoders of the two branches, and (z_b', z_c) and (z_b, z_c') are unpaired hidden variables from the same batch.
Further, in step S400, the global contrastive learning objective function L_2(x, I) is obtained using the cross-entropy loss, where L_XE denotes the cross-entropy loss, K denotes the batch size, and α is a hyper-parameter.
Compared with the prior art, the invention has the following advantages: the dual-branch conditional variational auto-encoder provided by the invention combines sequential variational auto-encoding with contrastive learning, and significantly improves the diversity of the generated descriptions while guaranteeing their accuracy.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention.
FIG. 2 is a schematic diagram of a graphical model of word generation in sequence hidden space in accordance with the present invention.
FIG. 3 is a schematic diagram qualitatively comparing the descriptions generated by different image description models.
FIG. 4 is a schematic diagram of qualitative examples of descriptions generated by DS-CVAE on the test set.
Detailed Description
Referring to FIG. 1, an image diversified description method based on conditional variational auto-encoding and contrastive learning comprises the following steps:
Step S100, inputting each image-description pair in the training set into a single-branch conditional variational auto-encoder and decoder network model for pre-training, to obtain a pre-trained sequence hidden variable z_b;
Step S200, introducing another single-branch conditional variational auto-encoder and decoder network, calculating the variational lower-bound loss function L_1(x, I), and obtaining the encoded sequence hidden variable z_c from the conditional variational encoder of this branch;
Step S300, inputting the image-description pairs in the same training batch into the conditional variational encoders of the two branches respectively to obtain the sequence hidden variables z_b and z_c, and introducing a word-level bidirectional contrastive learning loss function L_r(z_b, z_c) to improve the word-level discrimination capability of the sequence hidden variable z_c;
Step S400, generating the corresponding description sentence C_N using the pre-trained sequence hidden variable z_b, generating the corresponding description sentence C_G using the sequence hidden variable z_c, and performing sentence-level global contrastive learning on C_G using the ground-truth sentence C_T and a cross-entropy loss function, to obtain the global contrastive learning objective function L_2(x, I);
Step S500, obtaining the joint optimization objective L_total = λ_1 L_1(x, I) + λ_2 L_2(x, I) + λ_3 L_r(z_b, z_c), where λ_1, λ_2 and λ_3 are weight parameters that balance the losses of each part; through the word-level contrastive loss, the sentence-level contrastive loss and the variational lower-bound loss function, the hidden-representation discrimination capability and the fine-grained description-sentence decoding capability of the image diversified description model are improved, thereby obtaining high-quality diversified description sentences.
In a concrete embodiment, the object of the invention is, for a given image I, to generate a set X of multiple different descriptions from the encoder. This goal is achieved by maximizing the conditional probability p_θ(x, z | I), where θ is a model parameter, x represents a description of length T, x_t is the word generated at step t, z is the hidden variable, z_t is the hidden variable at step t, and t ∈ T.
In practice, the conditional probability distribution p_θ(x | I) is modeled with a long short-term memory network (LSTM), factorizing it over time steps into the product of p_θ(x_t | x_<t, I). The hidden state h_{t-1} encodes all the words x_<t before the current time step and, together with x_{t-1}, directly determines the generation of the current word x_t. However, using only the word sequence generated before the current time step and the conditional probability distribution p_θ(x_t | x_<t, I) realizes a one-to-one mapping between images and descriptions and cannot generate diversified description sentences. For this reason, this embodiment introduces the hidden variable z_t, which gives the model more varied word choices at each time step. At the same time, the variational lower bound required by the conditional variational auto-encoder is given in equation (5).
In equation (5), the first term represents the expected log conditional likelihood of the sentences generated by the decoder, and the second term represents the KL divergence, over all time steps, between the posterior model and the prior model p_θ(z_t | z_<t, x_<t, I). The prior model and the posterior model correspond respectively to the encoding networks of the two branches of the dual-branch conditional variational auto-encoder.
In step S100, the conditional variational encoder and decoder networks are trained by maximizing the variational lower-bound loss function L_1(x, I) of the conditional probability distribution, so that the prior network approximates the posterior network, and the pre-trained hidden variable z_b is output. Given an image I, the prior model can be parameterized as a product of conditional factors p_θ(z_t | z_<t, x_<t, I). In the test phase, this embodiment proposes sampling the hidden variable z_t from the prior model of the conditional variational auto-encoder (DS-CVAE) as the condition for the decoder to generate the corresponding word. Specifically, at each time step t, the hidden variable z_t is sampled according to the x_<t and z_<t of all previous time steps, and the sampled hidden variables, together with all previous words x_<t, jointly predict the word x_t of the current time step. Thus, DS-CVAE can sample a series of diverse hidden variables z_t from p_θ(z_t | z_<t, x_<t, I) and input them to the decoder p_θ(x_t | x_<t, z_≤t, I), realizing the diversity of the generated descriptions, as shown in FIG. 2.
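This sampling-and-decoding loop can be sketched as follows. The sketch is illustrative only: the module structure, the Gaussian parameterization of the prior, the greedy word choice, and the default dimensions (2048-dimensional image features and a 128-dimensional hidden variable, taken from the embodiment below) are assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn

class PriorNet(nn.Module):
    """Sequential prior p_theta(z_t | z_<t, x_<t, I), assumed Gaussian."""
    def __init__(self, vocab_size, emb_dim=300, img_dim=2048, z_dim=128, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTMCell(emb_dim + img_dim + z_dim, hid)
        self.to_mu, self.to_logvar = nn.Linear(hid, z_dim), nn.Linear(hid, z_dim)

    def step(self, word, img_feat, z_prev, state):
        inp = torch.cat([self.embed(word), img_feat, z_prev], dim=-1)
        h, c = self.rnn(inp, state)
        return self.to_mu(h), self.to_logvar(h), (h, c)

class Decoder(nn.Module):
    """Decoder p_theta(x_t | x_<t, z_<=t, I): word + image + hidden variable -> next-word logits."""
    def __init__(self, vocab_size, emb_dim=300, img_dim=2048, z_dim=128, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTMCell(emb_dim + img_dim + z_dim, hid)
        self.out = nn.Linear(hid, vocab_size)

    def step(self, word, img_feat, z_t, state):
        inp = torch.cat([self.embed(word), img_feat, z_t], dim=-1)
        h, c = self.rnn(inp, state)
        return self.out(h), (h, c)

@torch.no_grad()
def sample_description(prior, decoder, img_feat, bos_id, max_len=20):
    """Sample one diverse description: draw z_t from the prior at each step, then decode x_t."""
    B, z_dim = img_feat.size(0), prior.to_mu.out_features
    word = torch.full((B,), bos_id, dtype=torch.long)
    z_prev = torch.zeros(B, z_dim)
    p_state = d_state = None
    words = []
    for _ in range(max_len):
        mu, logvar, p_state = prior.step(word, img_feat, z_prev, p_state)
        z_t = mu + (0.5 * logvar).exp() * torch.randn_like(mu)   # reparameterized sample from the prior
        logits, d_state = decoder.step(word, img_feat, z_t, d_state)
        word = logits.argmax(dim=-1)                             # greedy choice of the current word x_t
        words.append(word)
        z_prev = z_t
    return torch.stack(words, dim=1)
```

Because a fresh z_t is drawn at every step, repeated calls to sample_description for the same image yield different, diverse sentences while the decoding itself stays greedy.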
Further, the pre-trained hidden variable z_b lacks distinguishability between paired and unpaired image-descriptions. Therefore, this embodiment inputs each image-description pair in the training set into the dual-branch conditional variational auto-encoder for encoding, and improves the diversity and distinguishability of the hidden variables through contrastive learning between the two branches. On this basis, step S300 adopts formula (7) for joint training,
where m denotes the positive margin; (z_b, z_c) denotes the paired hidden variables obtained by encoding the same image-description pair in the same batch with the conditional variational encoders of the two branches, and (z_b', z_c) and (z_b, z_c') are unpaired hidden variables from the same batch. In the first max term of formula (7), z_c is sampled from the posterior model of the lower-branch Seq-CVAE in FIG. 2, while z_b and z_b' are sampled from the posterior model of the pre-trained upper-branch Seq-CVAE; z_b denotes the positive sample coming from the same image-description pair as z_c, and z_b' is randomly sampled from unpaired image-description pairs in the current batch. The latter max term of formula (7) is the opposite.
The contrastive learning of step S300 is essentially a local contrastive learning, aimed at improving the distinguishability, in the hidden space, of the hidden variable corresponding to each word.
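A minimal sketch of this word-level bidirectional contrastive loss is given below. Cosine similarity as the pairing score and rolling the batch dimension to obtain the unpaired variables z_b' and z_c' are assumptions; the patent only specifies that positives come from the same image-description pair and that negatives come from other pairs in the same batch.

```python
import torch.nn.functional as F

def word_level_bidirectional_contrastive_loss(z_b, z_c, margin=0.2):
    """Sketch of L_r(z_b, z_c).

    z_b : (K, T, D) hidden variables from the pre-trained (upper) branch
    z_c : (K, T, D) hidden variables from the trainable (lower) branch
    Rows with the same batch index are paired (same image-description pair);
    unpaired samples z_b' and z_c' are formed by rolling the batch dimension.
    """
    zb = F.normalize(z_b, dim=-1)
    zc = F.normalize(z_c, dim=-1)
    zb_neg = zb.roll(shifts=1, dims=0)    # z_b': upper-branch variables of a different pair
    zc_neg = zc.roll(shifts=1, dims=0)    # z_c': lower-branch variables of a different pair

    pos   = (zb * zc).sum(dim=-1)         # per-word similarity of paired variables
    neg_b = (zb_neg * zc).sum(dim=-1)     # z_b' against z_c  (first max term)
    neg_c = (zb * zc_neg).sum(dim=-1)     # z_b  against z_c' (second max term)

    loss = F.relu(margin - pos + neg_b) + F.relu(margin - pos + neg_c)
    return loss.mean()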
In order to further eliminate repeated words and phrases in the image descriptions and to alleviate the deviation between the training and testing stages, this embodiment performs global contrastive learning with a cross-entropy loss. The specific process of step S400 is as follows:
Step S401, the descriptions generated by the decoder's predictions are regarded as positive samples, and the descriptions sampled from the pre-trained decoder are defined as negative samples;
Step S402, using the positive and negative sample pairs and combining global contrastive learning with the cross-entropy loss, the global contrastive loss L_2(x, I) is designed for global contrastive learning (a sketch is given after this paragraph);
where L_XE denotes the cross-entropy loss, K denotes the batch size, α is a hyper-parameter, C_T and C_N denote, respectively, the ground-truth sentence of each given image I and the negative sample sentence generated by the pre-trained model, and C_G is the description sentence generated by the current model's decoder through a greedy strategy. The greedy strategy feeds the generated words to the decoder one by one and, according to the cross-entropy loss, sequentially predicts the next word with the maximum probability value. At the sentence level, the hybrid training objective L_2(x, I) reduces the decoder's generation of common words while driving the decoder to generate the accurate and diverse words contained in the positive samples.
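A minimal sketch of how this sentence-level objective could be assembled from the cross-entropy terms described above follows. The hinge-style combination and the role of α as a margin are assumptions, since the formula for L_2(x, I) is given only as a figure in the original publication; only the meaning of L_XE, C_T, C_N and C_G is taken from the text.

```python
import torch.nn.functional as F

def global_contrastive_loss(logits_G, C_T, C_N, alpha=1.5, pad_id=0):
    """Sketch of the sentence-level objective L_2(x, I) of step S400.

    logits_G : (K, T, V) decoder logits for the greedily generated sentences C_G
    C_T      : (K, T)    ground-truth sentences (positive targets)
    C_N      : (K, T)    sentences sampled from the pre-trained decoder (negative targets)
    K is the batch size; alpha is the hyper-parameter of the objective
    (treated here as a margin, which is an assumption).
    """
    V = logits_G.size(-1)
    flat = logits_G.reshape(-1, V)
    xe_pos = F.cross_entropy(flat, C_T.reshape(-1), ignore_index=pad_id)  # L_XE(C_G, C_T)
    xe_neg = F.cross_entropy(flat, C_N.reshape(-1), ignore_index=pad_id)  # L_XE(C_G, C_N)
    # pull the generated sentence toward the accurate, diverse words of C_T
    # and push it away from the common words of C_N
    return F.relu(alpha + xe_pos - xe_neg)
```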
Example 1
DS-CVAE uses Faster R-CNN to extract 2048-dimensional target features for each image. The Seq-CVAE encoding network and decoder are then pre-trained for 1630 iterations. In the joint training phase, the dual-branch Seq-CVAE is trained for 20375 iterations. The dimension of the hidden variable z is set to 128. At each time step the decoder takes as input the concatenation of the image features and the sequence hidden variable. DS-CVAE uses an SGD optimizer with the learning rate set to 0.015, the momentum set to 0.9 and the weight decay set to 0.001. To further highlight the effect of global contrastive learning, m in equation (6) is set to 0.2 and α in equation (7) is set to 1.5. λ_1, λ_2 and λ_3 in the joint optimization objective function are set to 1, 1 and 1.2, respectively.
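The following self-contained sketch shows how these hyper-parameters could be wired together; the stand-in model and the joint_objective helper are assumed names for illustration, and only the numeric values (learning rate, momentum, weight decay, m, α, λ_1, λ_2, λ_3) come from the text above.

```python
import torch
import torch.nn as nn

# hyper-parameter values taken from the embodiment above
lambda_1, lambda_2, lambda_3 = 1.0, 1.0, 1.2   # weights of L_1, L_2 and L_r
margin, alpha = 0.2, 1.5                        # m (word-level loss) and alpha (global loss)

# stand-in parameterized module so that the optimizer call below is runnable;
# in the method it would bundle both CVAE branches and the decoder
model = nn.Linear(2048, 128)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.015, momentum=0.9, weight_decay=0.001)

def joint_objective(L1: torch.Tensor, L2: torch.Tensor, Lr: torch.Tensor) -> torch.Tensor:
    """L_total = lambda_1 * L_1(x, I) + lambda_2 * L_2(x, I) + lambda_3 * L_r(z_b, z_c)."""
    return lambda_1 * L1 + lambda_2 * L2 + lambda_3 * Lr
```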
(1) Best-1 accuracy
Table 1 compares the Best-1 accuracy of DS-CVAE on the "M-RNN" test split of the MSCOCO dataset after oracle re-ranking (20 samples versus 100 samples). With 100 samples, DS-CVAE performs significantly better than current image diversified description methods, and with 20 samples it is comparable to the best-performing baseline model, COS-CVAE. On the SPICE metric, DS-CVAE achieves the best results among the compared models, reaching scores of 0.294 and 0.337 when sampling 20 and 100 descriptions respectively, indicating that DS-CVAE tends to generate fine-grained and distinctive words or phrases rather than common ones.
TABLE 1
(2) Diversity evaluation
To evaluate the DS-CVAE model more comprehensively, this example compares it with other models on diversity metrics, as shown in Table 2. For each diversity evaluation metric, the best five sentences selected by consensus re-ranking are used.
As can be seen from Table 2, DS-CVAE obtains better diversity evaluation results. DS-CVAE outperforms the AG-CVAE and POS methods when sampling both 20 and 100 descriptions. When sampling 20 descriptions, DS-CVAE is better than the baseline models Seq-CVAE and COS-CVAE on all metrics except mBleu. When sampling 100 descriptions, DS-CVAE achieves the best results on three of the five diversity metrics. In particular, DS-CVAE performs significantly better than all other models on the Div-1 and Div-2 metrics: for 20 and 100 samples, DS-CVAE improves the Div-2 score by 25% and 10%, respectively, compared with the COS-CVAE model. The Div-1 and Div-2 scores indicate that DS-CVAE can generate finer-grained and more diverse words and word combinations.
TABLE 2
(3) Ablation experimental analysis
This example analyzes the effectiveness of each component of the DS-CVAE model. Specifically, the bidirectional contrastive learning of the encoding stage and the global contrastive learning of the decoding stage are verified experimentally in turn. Tables 3 and 4 give the accuracy and diversity results of the ablation experiments with 20 and 100 samples, respectively, where the baseline model is the pre-trained Seq-CVAE model without contrastive learning.
TABLE 3
TABLE 4
As can be seen from the tables, the bidirectional contrastive learning adopted in the encoding stage brings clear improvements over the baseline model on both the accuracy and the diversity metrics, which demonstrates that bidirectional contrastive learning is crucial for improving the diversity and distinguishability of the hidden variables. Global contrastive learning in the decoding stage slightly reduces the accuracy at 20 and 100 samples, but the diversity increases significantly. Thus, DS-CVAE with global contrastive learning achieves a balance between accuracy and diversity.
(4) Qualitative analysis of model performance
In order to evaluate the diversity and accuracy of the DS-CVAE model more comprehensively, this embodiment compares it with the descriptions generated by existing image diversified description methods.
As shown in FIG. 3, for the first image, COS-CVAE generates a common n-gram such as "on the top of" but does not describe the details of the image, for example the colour of the cat. Although AG-CVAE describes some detailed information of the image, such as "small gray and white", it cannot avoid generating common words or phrases. Div-BS, Seq-CVAE and POS likewise tend to generate accurate but common words and phrases, and fail to strike a balance between diversity and accuracy.
The descriptions generated by the DS-CVAE method proposed in this embodiment are not only grammatically correct but also contain more details of the image, such as "white cat" and "red suitcase". In particular, DS-CVAE generates verbs that occur less frequently in the dataset but are more meaningful, such as "resting" and "lining", rather than more frequent verbs such as "sit" and "play". For the second image in FIG. 3, some repeated phrases appear in the descriptions generated by Seq-CVAE, such as "some water near water", and descriptions of the image details are missing. POS and Div-BS describe the image very similarly. The DS-CVAE model proposed in this embodiment not only describes the image with correct grammar but also generates more diverse and accurate descriptions. In particular, DS-CVAE even generates compound sentences, such as "that" clauses, which are clearly more consistent with human language habits. Furthermore, the descriptions generated by DS-CVAE usually cover the important objects and details in the image; for example, the DS-CVAE model describes the "red beans" that do not appear in any baseline model's description. By virtue of its flexible grammar and word usage, DS-CVAE outperforms the baseline models in balancing diversity and accuracy.
FIG. 4 further shows some examples of descriptions generated by DS-CVAE for various test-set images. As can be seen from the figure, most of the descriptions generated by DS-CVAE for a given image are natural, fluent and diverse. DS-CVAE can describe specific details of a given image, such as "blue jacket", "holding a pair of skis", and so on. For the aircraft images, DS-CVAE gives correspondingly diversified and detailed descriptions, for example "Black air land", "military airplane", "small fighter jet" and "Black Fighter". The different descriptions of the same object are also more consistent with the diversity characteristics of human natural language. Furthermore, as in the example of FIG. 3, DS-CVAE generates more meaningful and flexible verbs, such as "holding", "threading", "limit" and "range", which make the descriptions more distinctive. Overall, both the quantitative and the qualitative experimental analyses show that the DS-CVAE proposed in this embodiment is superior to existing image diversified description methods in terms of diversity and accuracy.

Claims (5)

1. An image diversified description method based on conditional variational auto-encoding and contrastive learning, characterized by comprising the following steps:
Step S100, inputting each image-description pair in the training set into a single-branch conditional variational auto-encoder and decoder network model for pre-training, to obtain a pre-trained sequence hidden variable z_b;
Step S200, introducing another single-branch conditional variational auto-encoder and decoder network, calculating the variational lower-bound loss function L_1(x, I), and obtaining the encoded sequence hidden variable z_c from the conditional variational encoder of this branch;
Step S300, introducing a word-level bidirectional contrastive learning loss function L_r(z_b, z_c) to improve the word-level discrimination capability of the sequence hidden variable z_c;
Step S400, generating the corresponding description sentence C_N using the pre-trained sequence hidden variable z_b, generating the corresponding description sentence C_G using the sequence hidden variable z_c, and performing sentence-level global contrastive learning on C_G using the ground-truth sentence C_T and a cross-entropy loss function, to obtain the global contrastive learning objective function L_2(x, I);
Step S500, obtaining the joint optimization objective L_total = λ_1 L_1(x, I) + λ_2 L_2(x, I) + λ_3 L_r(z_b, z_c), improving the hidden-representation discrimination capability and fine-grained description-sentence decoding capability of the image diversified description model, thereby obtaining high-quality diversified description sentences, where λ_1, λ_2 and λ_3 are weight parameters that balance the losses of each part.
2. The method according to claim 1, wherein in step S200, the variational lower-bound loss function L_1(x, I) is the variational lower bound whose first term is the expectation of the log conditional likelihood of the sentences generated by the decoder and whose second term is the KL divergence, over all time steps, between the posterior model and the prior model p_θ(z_t | z_<t, x_<t, I), where θ is a model parameter, x denotes a description of length T, x_t is the word generated at step t, z is the hidden variable, z_t is the hidden variable at step t, t ∈ T, and I denotes the image.
3. The method according to claim 2, wherein in step S300, the word-level bidirectional contrastive learning loss function L_r(z_b, z_c) is obtained by contrasting paired and unpaired hidden variables, where m denotes the positive margin, (z_b, z_c) denotes the paired hidden variables obtained by encoding the same image-description pair in the same batch with the conditional variational encoders of the two branches, and (z_b', z_c) and (z_b, z_c') are unpaired hidden variables from the same batch.
4. The method according to claim 3, wherein in step S400, the global contrastive learning objective function L_2(x, I) is obtained using the cross-entropy loss, where L_XE denotes the cross-entropy loss, K denotes the batch size, and α is a hyper-parameter.
5. The method according to claim 4, wherein the joint optimization objective is
L_total = λ_1 L_1(x, I) + λ_2 L_2(x, I) + λ_3 L_r(z_b, z_c),
where λ_1, λ_2 and λ_3 are weight parameters that balance the losses of each part.
CN202311009413.1A 2023-08-11 2023-08-11 Image diversified description method based on conditional variation self-coding and contrast learning Pending CN116912599A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311009413.1A CN116912599A (en) 2023-08-11 2023-08-11 Image diversified description method based on conditional variation self-coding and contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311009413.1A CN116912599A (en) 2023-08-11 2023-08-11 Image diversified description method based on conditional variation self-coding and contrast learning

Publications (1)

Publication Number Publication Date
CN116912599A true CN116912599A (en) 2023-10-20

Family

ID=88354836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311009413.1A Pending CN116912599A (en) 2023-08-11 2023-08-11 Image diversified description method based on conditional variation self-coding and contrast learning

Country Status (1)

Country Link
CN (1) CN116912599A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015389A (en) * 2023-10-30 2024-05-10 江苏建筑职业技术学院 Diversified image description generation method based on mixed condition variation self-coding


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021058710A1 (en) * 2019-09-25 2021-04-01 Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH) Modelling method using a conditional variational autoencoder
CN112765317A (en) * 2021-01-19 2021-05-07 东南大学 Method and device for generating image by introducing text of class information
CN114896983A (en) * 2022-05-12 2022-08-12 支付宝(杭州)信息技术有限公司 Model training method, text processing device and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Qiqi (李琪琦): "Image Caption Generation Based on Deep Learning and Multi-Metric Reinforcement Learning" (基于深度学习和多指标强化学习的图像描述生成), China Master's Theses Full-text Database, no. 2021, 15 May 2021 (2021-05-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015389A (en) * 2023-10-30 2024-05-10 江苏建筑职业技术学院 Diversified image description generation method based on mixed condition variation self-coding
CN118015389B (en) * 2023-10-30 2024-06-25 江苏建筑职业技术学院 Diversified image description generation method based on mixed condition variation self-coding

Similar Documents

Publication Publication Date Title
Zhang et al. Neural machine translation with deep attention
CN106569618B (en) Sliding input method and system based on Recognition with Recurrent Neural Network model
CN109492202A (en) A kind of Chinese error correction of coding and decoded model based on phonetic
CN107076567A (en) Multilingual image question and answer
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN111651589B (en) Two-stage text abstract generation method for long document
CN110457661B (en) Natural language generation method, device, equipment and storage medium
CN113408430B (en) Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
CN116912599A (en) Image diversified description method based on conditional variation self-coding and contrast learning
CN117708339B (en) ICD automatic coding method based on pre-training language model
CN114238636A (en) Translation matching-based cross-language attribute level emotion classification method
Yang et al. ATT-BM-SOM: a framework of effectively choosing image information and optimizing syntax for image captioning
CN112906820A (en) Method for calculating sentence similarity of antithetical convolution neural network based on genetic algorithm
CN116910272B (en) Academic knowledge graph completion method based on pre-training model T5
Huang et al. Summarization with self-aware context selecting mechanism
CN116956940A (en) Text event extraction method based on multi-directional traversal and prompt learning
CN111414762A (en) Machine reading understanding method based on DCU (distributed channel Unit) coding and self-attention mechanism
CN113392629B (en) Human-term pronoun resolution method based on pre-training model
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof
CN113657125B (en) Mongolian non-autoregressive machine translation method based on knowledge graph
CN114972907A (en) Image semantic understanding and text generation based on reinforcement learning and contrast learning
Cai et al. MTBC-BioNER: Multi-task Learning Using BioBERT and CharCNN for Biomedical Named Entity Recognition
CN114004238A (en) Chinese-transcendental neural machine translation quality estimation method integrating language differentiation characteristics
CN113779298A (en) Medical vision question-answering method based on composite loss
CN112185567B (en) Method and system for establishing traditional Chinese medicine clinical auxiliary syndrome differentiation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination