CN116912599A - Image diversified description method based on conditional variation self-coding and contrast learning - Google Patents

Image diversified description method based on conditional variation self-coding and contrast learning

Info

Publication number
CN116912599A
CN116912599A (application CN202311009413.1A)
Authority
CN
China
Prior art keywords
image
description
hidden variable
model
variation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311009413.1A
Other languages
Chinese (zh)
Inventor
刘明明
刘兵
徐静
张海燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Institute of Architectural Technology
Original Assignee
Jiangsu Institute of Architectural Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Institute of Architectural Technology
Priority to CN202311009413.1A
Publication of CN116912599A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • G06N3/0442Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image diversified description method based on conditional variational auto-encoding and contrastive learning, which comprises the following steps: inputting each image-description pair in the training set into a single-branch conditional variational auto-encoder and decoder network model for pre-training, to obtain a pre-trained sequence hidden variable; introducing another single-branch conditional variational auto-encoder and decoder network, calculating the variational lower-bound loss function, and obtaining the encoded sequence hidden variable from the conditional variational encoder of this branch; introducing a word-level bidirectional contrastive learning loss function to improve the word-level discrimination capability of the sequence hidden variable z_c; obtaining a global contrastive learning objective function; and obtaining a joint optimization objective that improves the hidden-representation discrimination capability and the fine-grained description-sentence decoding capability of the image diversified description model, thereby obtaining high-quality diversified description sentences.

Description

Image diversified description method based on conditional variation self-coding and contrast learning
Technical Field
The invention relates to digital data processing technology, and in particular to an image diversified description method based on conditional variational auto-encoding and contrastive learning.
Background
The image description task is a fundamental multi-modal task at the intersection of computer vision and natural language processing. Early image description models ignored the diversity of the generated descriptions and focused only on accuracy, so the generated descriptions were simple and repetitive and did not reflect the richness of human language. Image diversified description has therefore become a research hotspot for more and more researchers; the task is to generate, for a given image, multiple description sentences with varied words and sentence patterns while still guaranteeing that accurate descriptions are produced. The patent with application number 202211628528.4 provides an interpretable text classification system based on a two-way encoder, which fuses the semantic representations of multi-head attention and a bidirectional gated recurrent unit to solve the problem of mismatch between the query and the attention result. However, each hidden variable in that patent is generated directly by a different single branch, so the hidden variables produced by such models cannot distinguish paired from unpaired image-description combinations, and even descriptions generated for different images are hard to distinguish from one another.
Disclosure of Invention
The invention aims to provide an image diversified description method based on conditional variational auto-encoding and contrastive learning, which comprises the following steps:
Step S100, inputting each image-description pair in the training set into a single-branch conditional variational auto-encoder and decoder network model for pre-training, to obtain a pre-trained sequence hidden variable z_b;
Step S200, introducing another single-branch conditional variational auto-encoder and decoder network, calculating the variational lower-bound loss function L_1(x, I), and obtaining the encoded sequence hidden variable z_c from the conditional variational encoder of this branch;
Step S300, introducing a word-level bidirectional contrastive learning loss function L_r(z_b, z_c) to improve the word-level discrimination capability of the sequence hidden variable z_c;
Step S400, generating the corresponding description sentence C_N using the pre-trained sequence hidden variable z_b, generating the corresponding description sentence C_G using the sequence hidden variable z_c, and performing sentence-level global contrastive learning on C_G using the ground-truth sentence C_T and a cross-entropy loss function, to obtain the global contrastive learning objective function L_2(x, I);
Step S500, obtaining the joint optimization objective L_total = λ_1 L_1(x, I) + λ_2 L_2(x, I) + λ_3 L_r(z_b, z_c), improving the hidden-representation discrimination capability and fine-grained description-sentence decoding capability of the image diversified description model, thereby obtaining high-quality diversified description sentences, where λ_1, λ_2 and λ_3 are weight parameters that balance the losses of each part.
Further, in step S200, the variational lower-bound loss function L_1(x, I) is the variational lower bound whose first term is the expectation of the log conditional likelihood of the sentences generated by the decoder and whose second term is the KL divergence, over all time steps, between the posterior model and the prior model p_θ(z_t | z_<t, x_<t, I), where θ is a model parameter, x denotes a description of length T, x_t is the word generated at step t, z is the hidden variable, z_t is the hidden variable at step t, t ∈ T, and I denotes the image.
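Since the formula itself is given only as a figure in the original publication, the following is a reconstruction consistent with the definitions above, assuming the standard sequential conditional VAE lower bound and writing the posterior encoder as q_φ (an assumed symbol, since the text does not name it):

L_1(x, I) = \mathbb{E}_{q_\phi(z \mid x, I)}\Bigl[\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}, z_{\le t}, I)\Bigr] - \sum_{t=1}^{T} \mathrm{KL}\bigl(q_\phi(z_t \mid z_{<t}, x_{\le t}, I) \,\|\, p_\theta(z_t \mid z_{<t}, x_{<t}, I)\bigr)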
Further, in step S300, the word-level bidirectional contrastive learning loss function L_r(z_b, z_c) is obtained by contrasting paired and unpaired hidden variables, where m denotes the positive margin, (z_b, z_c) denotes the paired hidden variables obtained by encoding the same image-description pair in the same batch with the conditional variational encoders of the two branches, and (z_b', z_c) and (z_b, z_c') are unpaired hidden variables from the same batch.
Further, in step S400, the global contrastive learning objective function L_2(x, I) is obtained using the cross-entropy loss, where L_XE denotes the cross-entropy loss, K denotes the batch size, and α is a hyper-parameter.
Compared with the prior art, the invention has the following advantages: the dual-branch conditional variational auto-encoder provided by the invention combines sequential variational auto-encoding with contrastive learning, and significantly improves the diversity of the generated descriptions while guaranteeing their accuracy.
Drawings
FIG. 1 is a schematic diagram of the method of the present invention.
FIG. 2 is a schematic diagram of a graphical model of word generation in sequence hidden space in accordance with the present invention.
FIG. 3 is a schematic diagram qualitatively comparing the descriptions generated by different image description models.
FIG. 4 is a schematic diagram of qualitative examples of descriptions generated by DS-CVAE on the test set.
Detailed Description
Referring to FIG. 1, an image diversified description method based on conditional variational auto-encoding and contrastive learning comprises the following steps:
Step S100, inputting each image-description pair in the training set into a single-branch conditional variational auto-encoder and decoder network model for pre-training, to obtain a pre-trained sequence hidden variable z_b;
Step S200, introducing another single-branch conditional variational auto-encoder and decoder network, calculating the variational lower-bound loss function L_1(x, I), and obtaining the encoded sequence hidden variable z_c from the conditional variational encoder of this branch;
Step S300, inputting the image-description pairs in the same training batch into the conditional variational encoders of the two branches respectively to obtain the sequence hidden variables z_b and z_c, and introducing a word-level bidirectional contrastive learning loss function L_r(z_b, z_c) to improve the word-level discrimination capability of the sequence hidden variable z_c;
Step S400, generating the corresponding description sentence C_N using the pre-trained sequence hidden variable z_b, generating the corresponding description sentence C_G using the sequence hidden variable z_c, and performing sentence-level global contrastive learning on C_G using the ground-truth sentence C_T and a cross-entropy loss function, to obtain the global contrastive learning objective function L_2(x, I);
Step S500, obtaining the joint optimization objective L_total = λ_1 L_1(x, I) + λ_2 L_2(x, I) + λ_3 L_r(z_b, z_c), where λ_1, λ_2 and λ_3 are weight parameters that balance the losses of each part; through the word-level contrastive loss, the sentence-level contrastive loss and the variational lower-bound loss function, the hidden-representation discrimination capability and the fine-grained description-sentence decoding capability of the image diversified description model are improved, thereby obtaining high-quality diversified description sentences.
In a concrete embodiment, the object of the invention is, for a given image I, to generate a set X of multiple different descriptions from the encoder. This goal is achieved by maximizing the conditional probability p_θ(x, z | I), where θ is a model parameter, x represents a description of length T, x_t is the word generated at step t, z is the hidden variable, z_t is the hidden variable at step t, and t ∈ T.
In practice, the conditional probability distribution p_θ(x | I) is modeled with a long short-term memory network (LSTM), factorizing it over time steps into the product of p_θ(x_t | x_<t, I). The hidden state h_{t-1} encodes all the words x_<t before the current time step and, together with x_{t-1}, directly determines the generation of the current word x_t. However, using only the word sequence generated before the current time step and the conditional probability distribution p_θ(x_t | x_<t, I) realizes a one-to-one mapping between images and descriptions and cannot generate diversified description sentences. For this reason, this embodiment introduces the hidden variable z_t, which gives the model more varied word choices at each time step. At the same time, the variational lower bound required by the conditional variational auto-encoder is given in equation (5).
In equation (5), the first term represents the expected log conditional likelihood of the sentences generated by the decoder, and the second term represents the KL divergence, over all time steps, between the posterior model and the prior model p_θ(z_t | z_<t, x_<t, I). The prior model and the posterior model correspond respectively to the encoding networks of the two branches of the dual-branch conditional variational auto-encoder.
In step S100, the conditional variational encoder and decoder networks are trained by maximizing the variational lower-bound loss function L_1(x, I) of the conditional probability distribution, so that the prior network approximates the posterior network, and the pre-trained hidden variable z_b is output. Given an image I, the prior model can be parameterized as a product of conditional factors p_θ(z_t | z_<t, x_<t, I). In the test phase, this embodiment proposes sampling the hidden variable z_t from the prior model of the conditional variational auto-encoder (DS-CVAE) as the condition for the decoder to generate the corresponding word. Specifically, at each time step t, the hidden variable z_t is sampled according to the x_<t and z_<t of all previous time steps, and the sampled hidden variables, together with all previous words x_<t, jointly predict the word x_t of the current time step. Thus, DS-CVAE can sample a series of diverse hidden variables z_t from p_θ(z_t | z_<t, x_<t, I) and input them to the decoder p_θ(x_t | x_<t, z_≤t, I), realizing the diversity of the generated descriptions, as shown in FIG. 2.
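This sampling-and-decoding loop can be sketched as follows. The sketch is illustrative only: the module structure, the Gaussian parameterization of the prior, the greedy word choice, and the default dimensions (2048-dimensional image features and a 128-dimensional hidden variable, taken from the embodiment below) are assumptions rather than details fixed by the patent.

```python
import torch
import torch.nn as nn

class PriorNet(nn.Module):
    """Sequential prior p_theta(z_t | z_<t, x_<t, I), assumed Gaussian."""
    def __init__(self, vocab_size, emb_dim=300, img_dim=2048, z_dim=128, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTMCell(emb_dim + img_dim + z_dim, hid)
        self.to_mu, self.to_logvar = nn.Linear(hid, z_dim), nn.Linear(hid, z_dim)

    def step(self, word, img_feat, z_prev, state):
        inp = torch.cat([self.embed(word), img_feat, z_prev], dim=-1)
        h, c = self.rnn(inp, state)
        return self.to_mu(h), self.to_logvar(h), (h, c)

class Decoder(nn.Module):
    """Decoder p_theta(x_t | x_<t, z_<=t, I): word + image + hidden variable -> next-word logits."""
    def __init__(self, vocab_size, emb_dim=300, img_dim=2048, z_dim=128, hid=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTMCell(emb_dim + img_dim + z_dim, hid)
        self.out = nn.Linear(hid, vocab_size)

    def step(self, word, img_feat, z_t, state):
        inp = torch.cat([self.embed(word), img_feat, z_t], dim=-1)
        h, c = self.rnn(inp, state)
        return self.out(h), (h, c)

@torch.no_grad()
def sample_description(prior, decoder, img_feat, bos_id, max_len=20):
    """Sample one diverse description: draw z_t from the prior at each step, then decode x_t."""
    B, z_dim = img_feat.size(0), prior.to_mu.out_features
    word = torch.full((B,), bos_id, dtype=torch.long)
    z_prev = torch.zeros(B, z_dim)
    p_state = d_state = None
    words = []
    for _ in range(max_len):
        mu, logvar, p_state = prior.step(word, img_feat, z_prev, p_state)
        z_t = mu + (0.5 * logvar).exp() * torch.randn_like(mu)   # reparameterized sample from the prior
        logits, d_state = decoder.step(word, img_feat, z_t, d_state)
        word = logits.argmax(dim=-1)                             # greedy choice of the current word x_t
        words.append(word)
        z_prev = z_t
    return torch.stack(words, dim=1)
```

Because a fresh z_t is drawn at every step, repeated calls to sample_description for the same image yield different, diverse sentences while the decoding itself stays greedy.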
Further, the pre-trained hidden variable z_b lacks distinguishability between paired and unpaired image-descriptions. Therefore, this embodiment inputs each image-description pair in the training set into the dual-branch conditional variational auto-encoder for encoding, and improves the diversity and distinguishability of the hidden variables through contrastive learning between the two branches. On this basis, step S300 adopts formula (7) for joint training,
where m denotes the positive margin; (z_b, z_c) denotes the paired hidden variables obtained by encoding the same image-description pair in the same batch with the conditional variational encoders of the two branches, and (z_b', z_c) and (z_b, z_c') are unpaired hidden variables from the same batch. In the first max term of formula (7), z_c is sampled from the posterior model of the lower-branch Seq-CVAE in FIG. 2, while z_b and z_b' are sampled from the posterior model of the pre-trained upper-branch Seq-CVAE; z_b denotes the positive sample coming from the same image-description pair as z_c, and z_b' is randomly sampled from unpaired image-description pairs in the current batch. The latter max term of formula (7) is the opposite.
The contrastive learning of step S300 is essentially a local contrastive learning, aimed at improving the distinguishability, in the hidden space, of the hidden variable corresponding to each word.
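A minimal sketch of this word-level bidirectional contrastive loss is given below. Cosine similarity as the pairing score and rolling the batch dimension to obtain the unpaired variables z_b' and z_c' are assumptions; the patent only specifies that positives come from the same image-description pair and that negatives come from other pairs in the same batch.

```python
import torch.nn.functional as F

def word_level_bidirectional_contrastive_loss(z_b, z_c, margin=0.2):
    """Sketch of L_r(z_b, z_c).

    z_b : (K, T, D) hidden variables from the pre-trained (upper) branch
    z_c : (K, T, D) hidden variables from the trainable (lower) branch
    Rows with the same batch index are paired (same image-description pair);
    unpaired samples z_b' and z_c' are formed by rolling the batch dimension.
    """
    zb = F.normalize(z_b, dim=-1)
    zc = F.normalize(z_c, dim=-1)
    zb_neg = zb.roll(shifts=1, dims=0)    # z_b': upper-branch variables of a different pair
    zc_neg = zc.roll(shifts=1, dims=0)    # z_c': lower-branch variables of a different pair

    pos   = (zb * zc).sum(dim=-1)         # per-word similarity of paired variables
    neg_b = (zb_neg * zc).sum(dim=-1)     # z_b' against z_c  (first max term)
    neg_c = (zb * zc_neg).sum(dim=-1)     # z_b  against z_c' (second max term)

    loss = F.relu(margin - pos + neg_b) + F.relu(margin - pos + neg_c)
    return loss.mean()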
In order to further eliminate repeated words and phrases in the image descriptions and to alleviate the deviation between the training and testing stages, this embodiment performs global contrastive learning with a cross-entropy loss. The specific process of step S400 is as follows:
Step S401, the descriptions generated by the decoder's predictions are regarded as positive samples, and the descriptions sampled from the pre-trained decoder are defined as negative samples;
Step S402, using the positive and negative sample pairs and combining global contrastive learning with the cross-entropy loss, the global contrastive loss L_2(x, I) is designed for global contrastive learning (a sketch is given after this paragraph);
where L_XE denotes the cross-entropy loss, K denotes the batch size, α is a hyper-parameter, C_T and C_N denote, respectively, the ground-truth sentence of each given image I and the negative sample sentence generated by the pre-trained model, and C_G is the description sentence generated by the current model's decoder through a greedy strategy. The greedy strategy feeds the generated words to the decoder one by one and, according to the cross-entropy loss, sequentially predicts the next word with the maximum probability value. At the sentence level, the hybrid training objective L_2(x, I) reduces the decoder's generation of common words while driving the decoder to generate the accurate and diverse words contained in the positive samples.
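A minimal sketch of how this sentence-level objective could be assembled from the cross-entropy terms described above follows. The hinge-style combination and the role of α as a margin are assumptions, since the formula for L_2(x, I) is given only as a figure in the original publication; only the meaning of L_XE, C_T, C_N and C_G is taken from the text.

```python
import torch.nn.functional as F

def global_contrastive_loss(logits_G, C_T, C_N, alpha=1.5, pad_id=0):
    """Sketch of the sentence-level objective L_2(x, I) of step S400.

    logits_G : (K, T, V) decoder logits for the greedily generated sentences C_G
    C_T      : (K, T)    ground-truth sentences (positive targets)
    C_N      : (K, T)    sentences sampled from the pre-trained decoder (negative targets)
    K is the batch size; alpha is the hyper-parameter of the objective
    (treated here as a margin, which is an assumption).
    """
    V = logits_G.size(-1)
    flat = logits_G.reshape(-1, V)
    xe_pos = F.cross_entropy(flat, C_T.reshape(-1), ignore_index=pad_id)  # L_XE(C_G, C_T)
    xe_neg = F.cross_entropy(flat, C_N.reshape(-1), ignore_index=pad_id)  # L_XE(C_G, C_N)
    # pull the generated sentence toward the accurate, diverse words of C_T
    # and push it away from the common words of C_N
    return F.relu(alpha + xe_pos - xe_neg)
```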
Example 1
DS-CVAE uses Faster R-CNN to extract 2048-dimensional target features for each image. The Seq-CVAE encoding network and decoder are then pre-trained for 1630 iterations. In the joint training phase, the dual-branch Seq-CVAE is trained for 20375 iterations. The dimension of the hidden variable z is set to 128. At each time step the decoder takes as input the concatenation of the image features and the sequence hidden variable. DS-CVAE uses an SGD optimizer with the learning rate set to 0.015, the momentum set to 0.9 and the weight decay set to 0.001. To further highlight the effect of global contrastive learning, m in equation (6) is set to 0.2 and α in equation (7) is set to 1.5. λ_1, λ_2 and λ_3 in the joint optimization objective function are set to 1, 1 and 1.2, respectively.
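The following self-contained sketch shows how these hyper-parameters could be wired together; the stand-in model and the joint_objective helper are assumed names for illustration, and only the numeric values (learning rate, momentum, weight decay, m, α, λ_1, λ_2, λ_3) come from the text above.

```python
import torch
import torch.nn as nn

# hyper-parameter values taken from the embodiment above
lambda_1, lambda_2, lambda_3 = 1.0, 1.0, 1.2   # weights of L_1, L_2 and L_r
margin, alpha = 0.2, 1.5                        # m (word-level loss) and alpha (global loss)

# stand-in parameterized module so that the optimizer call below is runnable;
# in the method it would bundle both CVAE branches and the decoder
model = nn.Linear(2048, 128)
optimizer = torch.optim.SGD(model.parameters(),
                            lr=0.015, momentum=0.9, weight_decay=0.001)

def joint_objective(L1: torch.Tensor, L2: torch.Tensor, Lr: torch.Tensor) -> torch.Tensor:
    """L_total = lambda_1 * L_1(x, I) + lambda_2 * L_2(x, I) + lambda_3 * L_r(z_b, z_c)."""
    return lambda_1 * L1 + lambda_2 * L2 + lambda_3 * Lr
```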
(1) Best-1 accuracy
Table 1 compares the Best-1 accuracy of DS-CVAE on the "M-RNN" test split of the MSCOCO dataset after oracle re-ranking (20 samples versus 100 samples). With 100 samples, DS-CVAE performs significantly better than current image diversified description methods, and with 20 samples it is comparable to the best-performing baseline model, COS-CVAE. On the SPICE metric, DS-CVAE achieves the best results among the compared models, reaching scores of 0.294 and 0.337 when sampling 20 and 100 descriptions respectively, indicating that DS-CVAE tends to generate fine-grained and distinctive words or phrases rather than common ones.
TABLE 1
(2) Diversity evaluation
To evaluate the DS-CVAE model more comprehensively, this example compares it with other models on diversity metrics, as shown in Table 2. For each diversity evaluation metric, the best five sentences selected by consensus re-ranking are used.
As can be seen from Table 2, DS-CVAE obtains better diversity evaluation results. DS-CVAE outperforms the AG-CVAE and POS methods when sampling both 20 and 100 descriptions. When sampling 20 descriptions, DS-CVAE is better than the baseline models Seq-CVAE and COS-CVAE on all metrics except mBleu. When sampling 100 descriptions, DS-CVAE achieves the best results on three of the five diversity metrics. In particular, DS-CVAE performs significantly better than all other models on the Div-1 and Div-2 metrics: for 20 and 100 samples, DS-CVAE improves the Div-2 score by 25% and 10%, respectively, compared with the COS-CVAE model. The Div-1 and Div-2 scores indicate that DS-CVAE can generate finer-grained and more diverse words and word combinations.
TABLE 2
(3) Ablation experimental analysis
This example analyzes the effectiveness of each component of the DS-CVAE model. Specifically, the bidirectional contrastive learning of the encoding stage and the global contrastive learning of the decoding stage are verified experimentally in turn. Tables 3 and 4 give the accuracy and diversity results of the ablation experiments with 20 and 100 samples, respectively, where the baseline model is the pre-trained Seq-CVAE model without contrastive learning.
TABLE 3
TABLE 4
As can be seen from the tables, the bidirectional contrastive learning adopted in the encoding stage brings clear improvements over the baseline model on both the accuracy and the diversity metrics, which demonstrates that bidirectional contrastive learning is crucial for improving the diversity and distinguishability of the hidden variables. Global contrastive learning in the decoding stage slightly reduces the accuracy at 20 and 100 samples, but the diversity increases significantly. Thus, DS-CVAE with global contrastive learning achieves a balance between accuracy and diversity.
(4) Qualitative analysis of model performance
In order to evaluate the diversity and accuracy of the DS-CVAE model more comprehensively, this embodiment compares it with the descriptions generated by existing image diversified description methods.
As shown in FIG. 3, for the first image, COS-CVAE generates a common n-gram such as "on the top of" but does not describe the details of the image, for example the colour of the cat. Although AG-CVAE describes some detailed information of the image, such as "small gray and white", it cannot avoid generating common words or phrases. Div-BS, Seq-CVAE and POS likewise tend to generate accurate but common words and phrases, and fail to strike a balance between diversity and accuracy.
The descriptions generated by the DS-CVAE method proposed in this embodiment are not only grammatically correct but also contain more details of the image, such as "white cat" and "red suitcase". In particular, DS-CVAE generates verbs that occur less frequently in the dataset but are more meaningful, such as "resting" and "lining", rather than more frequent verbs such as "sit" and "play". For the second image in FIG. 3, some repeated phrases appear in the descriptions generated by Seq-CVAE, such as "some water near water", and descriptions of the image details are missing. POS and Div-BS describe the image very similarly. The DS-CVAE model proposed in this embodiment not only describes the image with correct grammar but also generates more diverse and accurate descriptions. In particular, DS-CVAE even generates compound sentences, such as "that" clauses, which are clearly more consistent with human language habits. Furthermore, the descriptions generated by DS-CVAE usually cover the important objects and details in the image; for example, the DS-CVAE model describes the "red beans" that do not appear in any baseline model's description. By virtue of its flexible grammar and word usage, DS-CVAE outperforms the baseline models in balancing diversity and accuracy.
FIG. 4 further shows some examples of descriptions generated by DS-CVAE for various test-set images. As can be seen from the figure, most of the descriptions generated by DS-CVAE for a given image are natural, fluent and diverse. DS-CVAE can describe specific details of a given image, such as "blue jacket", "holding a pair of skis", and so on. For the aircraft images, DS-CVAE gives correspondingly diversified and detailed descriptions, for example "Black air land", "military airplane", "small fighter jet" and "Black Fighter". The different descriptions of the same object are also more consistent with the diversity characteristics of human natural language. Furthermore, as in the example of FIG. 3, DS-CVAE generates more meaningful and flexible verbs, such as "holding", "threading", "limit" and "range", which make the descriptions more distinctive. Overall, both the quantitative and the qualitative experimental analyses show that the DS-CVAE proposed in this embodiment is superior to existing image diversified description methods in terms of diversity and accuracy.

Claims (5)

1. An image diversified description method based on conditional variational auto-encoding and contrastive learning, characterized by comprising the following steps:
Step S100, inputting each image-description pair in the training set into a single-branch conditional variational auto-encoder and decoder network model for pre-training, to obtain a pre-trained sequence hidden variable z_b;
Step S200, introducing another single-branch conditional variational auto-encoder and decoder network, calculating the variational lower-bound loss function L_1(x, I), and obtaining the encoded sequence hidden variable z_c from the conditional variational encoder of this branch;
Step S300, introducing a word-level bidirectional contrastive learning loss function L_r(z_b, z_c) to improve the word-level discrimination capability of the sequence hidden variable z_c;
Step S400, generating the corresponding description sentence C_N using the pre-trained sequence hidden variable z_b, generating the corresponding description sentence C_G using the sequence hidden variable z_c, and performing sentence-level global contrastive learning on C_G using the ground-truth sentence C_T and a cross-entropy loss function, to obtain the global contrastive learning objective function L_2(x, I);
Step S500, obtaining the joint optimization objective L_total = λ_1 L_1(x, I) + λ_2 L_2(x, I) + λ_3 L_r(z_b, z_c), improving the hidden-representation discrimination capability and fine-grained description-sentence decoding capability of the image diversified description model, thereby obtaining high-quality diversified description sentences, where λ_1, λ_2 and λ_3 are weight parameters that balance the losses of each part.
2. The method according to claim 1, wherein in step S200, the variational lower-bound loss function L_1(x, I) is the variational lower bound whose first term is the expectation of the log conditional likelihood of the sentences generated by the decoder and whose second term is the KL divergence, over all time steps, between the posterior model and the prior model p_θ(z_t | z_<t, x_<t, I), where θ is a model parameter, x denotes a description of length T, x_t is the word generated at step t, z is the hidden variable, z_t is the hidden variable at step t, t ∈ T, and I denotes the image.
3. The method according to claim 2, wherein in step S300, the word-level bidirectional contrastive learning loss function L_r(z_b, z_c) is obtained by contrasting paired and unpaired hidden variables, where m denotes the positive margin, (z_b, z_c) denotes the paired hidden variables obtained by encoding the same image-description pair in the same batch with the conditional variational encoders of the two branches, and (z_b', z_c) and (z_b, z_c') are unpaired hidden variables from the same batch.
4. The method according to claim 3, wherein in step S400, the global contrastive learning objective function L_2(x, I) is obtained using the cross-entropy loss, where L_XE denotes the cross-entropy loss, K denotes the batch size, and α is a hyper-parameter.
5. The method according to claim 4, wherein the joint optimization objective is
L_total = λ_1 L_1(x, I) + λ_2 L_2(x, I) + λ_3 L_r(z_b, z_c),
where λ_1, λ_2 and λ_3 are weight parameters that balance the losses of each part.
CN202311009413.1A 2023-08-11 2023-08-11 Image diversified description method based on conditional variation self-coding and contrast learning Pending CN116912599A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311009413.1A CN116912599A (en) 2023-08-11 2023-08-11 Image diversified description method based on conditional variation self-coding and contrast learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311009413.1A CN116912599A (en) 2023-08-11 2023-08-11 Image diversified description method based on conditional variation self-coding and contrast learning

Publications (1)

Publication Number Publication Date
CN116912599A true CN116912599A (en) 2023-10-20

Family

ID=88354836

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311009413.1A Pending CN116912599A (en) 2023-08-11 2023-08-11 Image diversified description method based on conditional variation self-coding and contrast learning

Country Status (1)

Country Link
CN (1) CN116912599A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015389A (en) * 2023-10-30 2024-05-10 江苏建筑职业技术学院 Diversified image description generation method based on mixed condition variation self-coding


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021058710A1 (en) * 2019-09-25 2021-04-01 Helmholtz Zentrum München - Deutsches Forschungszentrum für Gesundheit und Umwelt (GmbH) Modelling method using a conditional variational autoencoder
CN112765317A (en) * 2021-01-19 2021-05-07 东南大学 Method and device for generating image by introducing text of class information
CN114896983A (en) * 2022-05-12 2022-08-12 支付宝(杭州)信息技术有限公司 Model training method, text processing device and computer equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Qiqi (李琪琦): "Image Caption Generation Based on Deep Learning and Multi-Metric Reinforcement Learning" (基于深度学习和多指标强化学习的图像描述生成), China Master's Theses Full-text Database, no. 2021, 15 May 2021 (2021-05-15) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118015389A (en) * 2023-10-30 2024-05-10 江苏建筑职业技术学院 Diversified image description generation method based on mixed condition variation self-coding
CN118015389B (en) * 2023-10-30 2024-06-25 江苏建筑职业技术学院 Diversified image description generation method based on mixed condition variation self-coding

Similar Documents

Publication Publication Date Title
Zhang et al. Neural machine translation with deep attention
CN106569618B (en) Sliding input method and system based on Recognition with Recurrent Neural Network model
CN109492202A (en) A kind of Chinese error correction of coding and decoded model based on phonetic
CN107076567A (en) Multilingual image question and answer
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN111651589B (en) Two-stage text abstract generation method for long document
CN110457661B (en) Natural language generation method, device, equipment and storage medium
CN113408430B (en) Image Chinese description system and method based on multi-level strategy and deep reinforcement learning framework
CN116912599A (en) Image diversified description method based on conditional variation self-coding and contrast learning
CN117708339B (en) ICD automatic coding method based on pre-training language model
CN114238636A (en) Translation matching-based cross-language attribute level emotion classification method
Yang et al. ATT-BM-SOM: a framework of effectively choosing image information and optimizing syntax for image captioning
CN112906820A (en) Method for calculating sentence similarity of antithetical convolution neural network based on genetic algorithm
CN116910272B (en) Academic knowledge graph completion method based on pre-training model T5
Huang et al. Summarization with self-aware context selecting mechanism
CN116956940A (en) Text event extraction method based on multi-directional traversal and prompt learning
CN111414762A (en) Machine reading understanding method based on DCU (distributed channel Unit) coding and self-attention mechanism
CN113392629B (en) Human-term pronoun resolution method based on pre-training model
CN115588486A (en) Traditional Chinese medicine diagnosis generating device based on Transformer and application thereof
CN113657125B (en) Mongolian non-autoregressive machine translation method based on knowledge graph
CN114972907A (en) Image semantic understanding and text generation based on reinforcement learning and contrast learning
Cai et al. MTBC-BioNER: Multi-task Learning Using BioBERT and CharCNN for Biomedical Named Entity Recognition
CN114004238A (en) Chinese-transcendental neural machine translation quality estimation method integrating language differentiation characteristics
CN113779298A (en) Medical vision question-answering method based on composite loss
CN112185567B (en) Method and system for establishing traditional Chinese medicine clinical auxiliary syndrome differentiation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination