CN115393692A - Associative text-to-image generation method based on a generative pre-trained language model - Google Patents

Associative text-to-image generation method based on a generative pre-trained language model

Info

Publication number
CN115393692A
CN115393692A (application CN202211095848.8A)
Authority
CN
China
Prior art keywords
data set
sentence
image
text
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211095848.8A
Other languages
Chinese (zh)
Inventor
鲍秉坤
盛业斐
陶明
谭智一
邵曦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202211095848.8A
Publication of CN115393692A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for generating images from associative text based on a generative pre-trained language model, comprising the following steps: fine-tuning the generative pre-trained model on a data set so that the model retains the existing text information with good semantic fidelity, obtaining a fine-tuned pre-trained model; feeding the ten sentences corresponding to each image in the original data set into the fine-tuned pre-trained model to obtain a generated data set output by the model; performing constraint processing and semantic-retention evaluation and selection on the generated data set to obtain an associative text data set; and, based on the associative text data set, generating images whose cross-modal semantic features are consistent with the text by using a DF-GAN-based adversarial generation network model. The invention makes full use of the associative ability and rich semantic information of the generative pre-trained model, and to a certain extent alleviates the imbalance between text information and image information in the text-to-image cross-modal generation task of adversarial generation networks.

Description

Associative text-to-image generation method based on a generative pre-trained language model
Technical Field
The invention relates to the technical field of image generation, and in particular to a method for generating images from associative text based on a generative pre-trained language model.
Background Art
With the development of multimedia technology, human experience of the world has gradually shifted from monomodal to multimodal. In brief, multimodal refers to information in multiple modalities, including text, images, video, audio, and so on. As the name implies, multimodal research concerns the fusion of these different types of data.
Text-to-image generation is a promising and increasingly important task in multimodal machine learning and deep learning. The task has useful applications in image editing, video editing, stylized generation, and personalized customization, and may also assist design work in the future. For example, users can input a description to draw the image they need, or use text to drive specialized tasks such as designing clothing or modifying layouts.
In some existing applications, the training and test sets come from the CUB-2011 bird data set, with ten descriptions per image. A recent method observed that prior approaches select only one text as input to generate a matching target image; since a single piece of text often describes only part of an image, this complicates the generation process and degrades image quality. It is therefore necessary to retrieve text close to the input text as additional input to enrich the text information.
However, expanding the text volume by retrieving similar texts runs counter to the goal of filling in the parts that a single text cannot describe. In addition, the existing method limits the retrieval range to the ten corresponding sentences; yet, as the saying goes, "a picture is worth a thousand words", the texts in the data set also suffer from erroneous and incomplete descriptions, and external knowledge absent from the original text is needed to describe the image better. Moreover, the conventional method only applies a text-level self-attention mechanism and ignores the interaction between image information and text information in the cross-modal task.
Disclosure of Invention
The invention aims to provide a method for generating images from associative text based on a generative pre-trained language model, which introduces the generative pre-trained model to associate and generate richer text information for text-to-image generation, and constructs an adversarial generation network using a complementary fine-tuning method and text-image cross-modal attention; the generated images exhibit better semantic consistency.
Technical scheme: in order to solve the above technical problems, the invention adopts the following technical scheme:
In a first aspect, a method for generating images from associative text based on a generative pre-trained language model is provided, comprising:
S1, fine-tuning a generative pre-trained model on a data set so that the pre-trained model retains the existing text information with good semantic fidelity, obtaining a fine-tuned pre-trained model;
S2, taking the ten sentences corresponding to each image in the original data set as input to the fine-tuned pre-trained model obtained in step S1 to obtain a generated data set output by the model; performing constraint processing and semantic-retention evaluation and selection on the generated data set to obtain an associative text data set;
S3, based on the associative text data set obtained in step S2, using a DF-GAN-based adversarial generation network model to generate images that are consistent in text-image cross-modal semantic features.
In some embodiments, the step S1 comprises:
step S11: acquiring a data set, and compiling ten sentences corresponding to each image in the data set into a sentence string;
step S12: and inputting the sentence string of the data set into a pre-training model for training and fine-tuning to obtain a fine-tuned pre-training model.
In some embodiments, step S11, the arranging ten sentences corresponding to each image in the data set into a sentence string includes: the data set comprises a plurality of images, and each image corresponds to ten sentences; and compiling ten sentences corresponding to each image into sentence strings according to the following rules:
the sentence string is arranged as follows: "$ sentence a # sentence b # sentence c # … # sentence 9 # sentence 10 $";
the sentence string is divided into two parts: the first part is randomly initialized, with sentence a, sentence b and sentence c being three sentences randomly drawn from the ten sentences corresponding to one image;
the second part is the sequential concatenation of the remaining sentences, where "#" and "$" are the separator and the start/end symbol respectively; GPT-2 generates a structured sentence string, the separator facilitates splitting the generated sentence string apart, and the start/end symbol prevents the model from generating sentence strings that are too long or too short (a construction sketch is given below).
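As an illustration only, the following Python sketch shows one possible way to assemble such a sentence string; the separator and start/end symbols follow the rule above, while the helper name and the random-seed handling are assumptions rather than part of the disclosed method.

import random

def build_sentence_string(sentences, seed=None):
    # Assemble ten captions of one image into the structured string
    # "$ s_a # s_b # s_c # ... $" used for GPT-2 fine-tuning.
    # The first three sentences are drawn at random; the rest keep their order.
    assert len(sentences) == 10, "each image is expected to have ten captions"
    rng = random.Random(seed)
    head = rng.sample(sentences, 3)                 # randomly initialized first part
    tail = [s for s in sentences if s not in head]  # remaining sentences in original order (captions assumed distinct)
    # "#" separates sentences, "$" marks the start and end of the whole string
    return "$ " + " # ".join(head + tail) + " $"

# example usage with dummy captions
captions = ["this bird has feature %d" % i for i in range(10)]
print(build_sentence_string(captions, seed=0))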
In some embodiments, step S12 includes:
wherein the pre-training model is a GPT-2 model; the GPT-2 model training and fine tuning method comprises the following steps:
let a given input sentence string be represented as the sentence sequence [x_1, x_2, ..., x_m], where x_m is the m-th sentence in the sentence string;
the loss functions of the GPT-2 model during pre-training and fine-tuning are L_1(X) and L_2(X) respectively, with the following formulas:
L_1(X) = sum_i log P(x_i | x_{i-k}, ..., x_{i-1}; Θ)
L_2(X) = log P(x_1 | x_1, x_2, ..., x_m)
where the pre-training loss function L_1(X) adopts a maximum likelihood function, P(·) denotes a conditional probability, and Θ is the neural network parameter; i takes the values 0, 1, ..., k obtained by traversal; k, which is smaller than m, is the size of the sliding window;
the fine-tuning process adopts supervised learning: a training sample comprises the sentence sequence [x_1, x_2, ..., x_m] with the first sentence x_1 as the label; during fine-tuning of the GPT-2 model, the loss for predicting the class label from the sentence sequence [x_1, x_2, ..., x_m] is L_2(X);
the optimization objective L_3 is a weighted sum of L_1 and L_2:
L_3 = L_2 + λ·L_1
where λ is a hyperparameter, and L_1 and L_2 are the loss functions of the GPT-2 model during pre-training and fine-tuning, respectively.
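A minimal sketch of the combined objective L_3 = L_2 + λ·L_1 is given below, assuming the HuggingFace GPT-2 implementation; masking the prompt tokens with the ignore index -100 is one possible way to realize the supervised restoration loss L_2 and is an assumption, not the exact patented procedure.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

def combined_loss(prompt, full_string, lam=0.5):
    # L3 = L2 + lambda * L1 for one sentence string (sketch)
    ids = tokenizer(full_string, return_tensors="pt").input_ids
    # L1: language-modeling loss over the whole sentence string
    l1 = model(ids, labels=ids).loss
    # L2: loss only on the tokens that must be restored from the prompt,
    # implemented here by masking the prompt tokens with -100 (assumption)
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.size(1)
    labels = ids.clone()
    labels[:, :n_prompt] = -100
    l2 = model(ids, labels=labels).loss
    return l2 + lam * l1

loss = combined_loss("$ this bird has a red crown",
                     "$ this bird has a red crown # its wings are brown $")
loss.backward()  # a fine-tuning step would follow with an optimizer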
In some embodiments, in step S2, the constraint processing is performed on the generated data set, and includes:
the generated data set is processed using the proximity principle, format regularization, and sentence selection, as sketched below.
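As a sketch of what such constraint processing might look like, the snippet below applies a format check (start/end symbol and separators) and a proximity-based truncation; the threshold and helper name are illustrative assumptions.

def constrain(generated, max_sentences=10):
    # Keep a generated string only if it respects the "$ ... $" format,
    # and keep only the front-most sentences (proximity principle).
    text = generated.strip()
    if not (text.startswith("$") and text.endswith("$")):
        return None                      # format regularization: discard and regenerate
    body = text.strip("$").strip()
    sentences = [s.strip() for s in body.split("#") if s.strip()]
    if not sentences:
        return None
    return sentences[:max_sentences]     # proximity principle: front sentences are more reliable

print(constrain("$ a small bird # with blue wings # and a short beak $"))
print(constrain("malformed output without markers"))  # -> None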
In some embodiments, in step S2, the semantic retention evaluation selection is performed on the generated data set, and includes:
the generated data set is evaluated using the BLEU index, which comprises bleu_a, computed over samples of different poses and backgrounds within the same category, bleu_b, computed over samples from markedly different categories, and bleu_c, computed over samples whose visual features are similar but which belong to different categories;
bleu_n = sum_{c ∈ candidates} sum_{n-gram ∈ c} Count_clip(n-gram) / sum_{c′ ∈ candidates} sum_{n-gram′ ∈ c′} Count(n-gram′)
where candidates denotes the sentences of the generated data set, reference denotes the sentences of the original data set, Count denotes a count and Count_clip denotes the clipped count in the numerator; an n-gram is a sequence of n consecutive words from a candidate measured against the reference, and an n-gram′ is a sequence of n consecutive words within a candidate; c and c′ are sentences taken from the candidate set; the sums over c ∈ candidates and c′ ∈ candidates cover all candidates; the sums over n-gram ∈ c and n-gram′ ∈ c′ cover all n-grams of each candidate; Count_clip(n-gram) denotes the clipped number of times an n-gram of a candidate appears in the reference; Count(n-gram′) denotes the number of occurrences of n-gram′ in the candidates;
the three indices bleu_a, bleu_b and bleu_c are calculated for the generated data set and for the original data set respectively;
if the ratios among the three indices of the generated data set are consistent with those of the original data set, the generated data set is considered semantically consistent with the original data set and is selected as the associative text data set (a sketch of the BLEU-2 computation is given below).
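The following sketch computes the BLEU-style modified 2-gram precision described above with plain Python counters; it is a simplified illustration (no brevity penalty, references pooled into one set) rather than the exact evaluation code.

from collections import Counter

def bigrams(tokens):
    return list(zip(tokens, tokens[1:]))

def bleu2(candidates, references):
    # Modified 2-gram precision over a corpus (sketch).
    clipped, total = 0, 0
    ref_counts = Counter(bg for ref in references for bg in bigrams(ref.split()))
    for cand in candidates:
        cand_counts = Counter(bigrams(cand.split()))
        for bg, n in cand_counts.items():
            clipped += min(n, ref_counts.get(bg, 0))   # Count_clip
            total += n                                  # Count
    return clipped / total if total else 0.0

gen = ["this bird has blue wings", "a small bird with a short beak"]
ref = ["this bird has blue wings and a white belly"]
print(round(bleu2(gen, ref), 3))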
In some embodiments, in step S3, the DF-GAN based countermeasure generates a network model comprising: a pre-trained text encoder, a generator and a discriminator;
a text encoder: all texts in the associated text data set are encoded by a text encoder, and output sentence vectors are stored in a text encoding library;
the generator has two inputs: the sentence vector encoded by the text encoder and random noise sampled from a normal distribution; the noise is converted to a set size through a fully connected layer, and image features are generated through a series of deep semantic fusion modules; in the deep semantic fusion module of each layer, the multiple input sentences interact with the feature map of the current level, and a cross-modal attention mechanism is computed to differentiate the weight distribution of the sentences across the generator layers; a convolutional layer then converts the image features into an image; each deep semantic fusion module comprises an upsampling layer, a residual block and a text-image feature fusion block;
in the discriminator, a series of downsampling layers converts the image into image features, which are then concatenated with the sentence vector, and the adversarial loss is computed in one step to ensure visual realism and semantic consistency;
the loss functions of the generator and the discriminator are as follows:
L_D = -E_{x~P_r}[min(0, -1 + D(x, e))] - (1/2)·E_{G(z)~P_g}[min(0, -1 - D(G(z), e))] - (1/2)·E_{x~P_mis}[min(0, -1 - D(x, e))]
L_G = -E_{G(z)~P_g}[D(G(z), e)]
where L_D is the loss function of the discriminator and L_G is the loss function of the generator, both computed as hinge losses; z is a noise vector sampled from a Gaussian distribution; D is the discriminator, G is the generator, G(z) denotes an image produced by the generator, and e is the sentence vector; P_g, P_r and P_mis denote the synthetic data distribution, the real data distribution and the mismatched data distribution respectively; min(0, ·) is the hinge function; x denotes a real image, D(x, e) denotes the discriminator output for an input real image, and D(G(z), e) denotes the discriminator output for an input generated image.
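Under the assumption that the formulas above are the standard DF-GAN hinge losses, a minimal PyTorch sketch of the discriminator and generator objectives is:

import torch

def d_hinge_loss(d_real, d_fake, d_mismatch):
    # Discriminator hinge loss over real image/text pairs, generated images,
    # and real images paired with mismatched sentences (sketch).
    loss_real = torch.relu(1.0 - d_real).mean()
    loss_fake = 0.5 * torch.relu(1.0 + d_fake).mean()
    loss_mis = 0.5 * torch.relu(1.0 + d_mismatch).mean()
    return loss_real + loss_fake + loss_mis

def g_hinge_loss(d_fake):
    # Generator loss: raise the discriminator score of generated images.
    return -d_fake.mean()

# d_* stand in for discriminator outputs D(x, e), D(G(z), e), D(x, e_mismatch) on a batch
d_real, d_fake, d_mis = torch.randn(8), torch.randn(8), torch.randn(8)
print(d_hinge_loss(d_real, d_fake, d_mis).item(), g_hinge_loss(d_fake).item())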
In some embodiments, in step S3, the processing procedure of the generator includes:
the generator adopts an attention mechanism, where α_n is the attention weight corresponding to the n-th sentence;
α_n = exp(s(W(z), W(X_n))) / sum_j exp(s(W(z), W(X_j)))
where X is the set of input sentence vectors, z is the input random noise, s is the attention score function, and W denotes several linear layers that map the sentence vectors into the latent space; W(z) is the feature map of the image at the current generator layer, and given W(z) and X, α_n is computed.
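A minimal sketch of such a cross-modal attention weight computation follows; the dot-product score and the single linear projection are assumptions standing in for the score function s and the linear layers W described above.

import torch
import torch.nn.functional as F

def sentence_attention(feature_map, sentence_vecs, proj):
    # Weight N sentence vectors against the current-layer image features.
    # feature_map: (B, C, H, W); sentence_vecs: (N, D); proj: nn.Linear(D, C).
    b, c, h, w = feature_map.shape
    img = feature_map.mean(dim=(2, 3))            # (B, C) pooled image feature W(z)
    sent = proj(sentence_vecs)                    # (N, C) sentences mapped to the latent space
    scores = img @ sent.t() / c ** 0.5            # (B, N) attention scores s(W(z), W(X_n))
    alpha = F.softmax(scores, dim=-1)             # attention weights alpha_n for this layer
    fused = alpha @ sent                          # (B, C) weighted sentence context
    return alpha, fused

proj = torch.nn.Linear(256, 64)
alpha, ctx = sentence_attention(torch.randn(2, 64, 16, 16), torch.randn(10, 256), proj)
print(alpha.shape, ctx.shape)  # torch.Size([2, 10]) torch.Size([2, 64])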
In a second aspect, the present invention provides an apparatus for generating an associative text to an image based on a generative pre-trained language model, comprising a processor and a storage medium;
the storage medium is to store instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to the first aspect.
In a third aspect, the invention provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of the first aspect.
The invention has the following advantages. A complementary fine-tuning method is used when fine-tuning the pre-trained model, so that the generative pre-trained model associates information absent from the original data set while preserving the original semantic information, improving the generalization ability of the model. In addition, the relevance between the input text and the visual information of the image produced at the current generator layer is taken into account, and a text-image cross-modal attention mechanism is constructed for image generation, so that the generated images exhibit better semantic consistency and are closer to real images. The associative ability and rich semantic information of the generative pre-trained model are exploited to alleviate, to a certain extent, the imbalance between text information and image information in the text-to-image cross-modal generation task of adversarial generation networks. The method supplements the information missing from a single text, adds structured data that does not appear in the training set, and designs a more effective adversarial generation network for image generation, benefiting downstream tasks in natural language processing and cross-modal text generation.
Drawings
FIG. 1 is a flowchart of a method according to an embodiment of the present invention.
FIG. 2 is a flowchart of a process and method for introducing a generative pre-training model according to an embodiment of the present invention;
FIG. 3 is a block flow diagram of a combination generative pre-training model and challenge generation network according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating examples of generative model generation effects and results of index studies according to an embodiment of the present invention;
FIG. 5 shows the comparison of the effect of the current model generation with other models (left: original, right: DF-GAN for the present invention).
Detailed Description
In order to make the technical means, creative features, objectives and effects of the invention easy to understand, the invention is further described below with reference to specific embodiments.
In the description of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding" and the like are understood as excluding the stated number, while "above", "below", "within" and the like are understood as including the stated number. If "first" and "second" are described, they are used only to distinguish technical features and are not to be understood as indicating or implying relative importance, implicitly indicating the number of the indicated technical features, or implicitly indicating the precedence of the indicated technical features.
In the description of the present invention, reference to the description of "one embodiment", "some embodiments", "illustrative embodiments", "examples", "specific examples", or "some examples", etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Example 1
A method for generating images from associative text based on a generative pre-trained language model, comprising:
S1, fine-tuning a generative pre-trained model on a data set so that the pre-trained model retains the existing text information with good semantic fidelity, obtaining a fine-tuned pre-trained model;
S2, taking the ten sentences corresponding to each image in the original data set as input to the fine-tuned pre-trained model obtained in S1 to obtain a generated data set output by the model; performing constraint processing and semantic-retention evaluation and selection on the generated data set to obtain an associative text data set;
S3, based on the associative text data set obtained in step S2, using a DF-GAN-based adversarial generation network model to generate images that are consistent in text-image cross-modal semantic features.
In some embodiments, the step S1 comprises:
step S11: acquiring a data set, and compiling ten sentences corresponding to each image in the data set into sentence strings;
step S12: and inputting the sentence string of the data set into a pre-training model for training and fine-tuning to obtain a fine-tuned pre-training model.
In some embodiments, step S11, the arranging ten sentences corresponding to each image in the data set into a sentence string includes: the data set comprises a plurality of images, and each image corresponds to ten sentences; and compiling ten sentences corresponding to each image into sentence strings according to the following rules:
the sentence string is arranged as follows: "$ sentence a # sentence b # sentence c # … # sentence 9 # sentence 10 $";
the sentence string is divided into two parts: the first part is randomly initialized, with sentence a, sentence b and sentence c being three sentences randomly drawn from the ten sentences corresponding to one image;
the second part is the sequential concatenation of the remaining sentences, where "#" and "$" are the separator and the start/end symbol respectively; GPT-2 generates a structured sentence string, the separator facilitates splitting the generated sentence string apart, and the start/end symbol prevents the model from generating sentence strings that are too long or too short.
In some embodiments, step S12 includes:
wherein the pre-training model is a GPT-2 model; the GPT-2 model training and fine tuning method comprises the following steps:
let a given input sentence string be represented as the sentence sequence [x_1, x_2, ..., x_m], where x_m is the m-th sentence in the sentence string;
the loss functions of the GPT-2 model during pre-training and fine-tuning are L_1(X) and L_2(X) respectively, with the following formulas:
L_1(X) = sum_i log P(x_i | x_{i-k}, ..., x_{i-1}; Θ)
L_2(X) = log P(x_1 | x_1, x_2, ..., x_m)
where the pre-training loss function L_1(X) adopts a maximum likelihood function, P(·) denotes a conditional probability, and Θ is the neural network parameter; i takes the values 0, 1, ..., k obtained by traversal; k, which is smaller than m, is the size of the sliding window;
the fine-tuning process adopts supervised learning: a training sample comprises the sentence sequence [x_1, x_2, ..., x_m] with the first sentence x_1 as the label; during fine-tuning of the GPT-2 model, the loss for predicting the class label from the sentence sequence [x_1, x_2, ..., x_m] is L_2(X);
the optimization objective L_3 is a weighted sum of L_1 and L_2:
L_3 = L_2 + λ·L_1
where λ is a hyperparameter, and L_1 and L_2 are the loss functions of the GPT-2 model during pre-training and fine-tuning, respectively.
In some embodiments, in step S2, the constraint processing is performed on the generated data set, and includes:
and processing the generated data set by adopting a proximity principle, format regularization and sentence selection.
In some embodiments, in step S2, the semantic retention evaluation selection is performed on the generated data set, and includes:
the generated data set is evaluated using the BLEU index, which comprises bleu_a, computed over samples of different poses and backgrounds within the same category, bleu_b, computed over samples from markedly different categories, and bleu_c, computed over samples whose visual features are similar but which belong to different categories;
bleu_n = sum_{c ∈ candidates} sum_{n-gram ∈ c} Count_clip(n-gram) / sum_{c′ ∈ candidates} sum_{n-gram′ ∈ c′} Count(n-gram′)
where candidates denotes the sentences of the generated data set, reference denotes the sentences of the original data set, Count denotes a count and Count_clip denotes the clipped count in the numerator; an n-gram is a sequence of n consecutive words from a candidate measured against the reference, and an n-gram′ is a sequence of n consecutive words within a candidate; c and c′ are sentences taken from the candidate set; the sums over c ∈ candidates and c′ ∈ candidates cover all candidates; the sums over n-gram ∈ c and n-gram′ ∈ c′ cover all n-grams of each candidate; Count_clip(n-gram) denotes the clipped number of times an n-gram of a candidate appears in the reference; Count(n-gram′) denotes the number of occurrences of n-gram′ in the candidates;
the three indices bleu_a, bleu_b and bleu_c are calculated for the generated data set and for the original data set respectively;
if the ratios among the three indices of the generated data set are consistent with those of the original data set, the generated data set is considered semantically consistent with the original data set and is selected as the associative text data set.
In some embodiments, in step S3, the DF-GAN based countermeasure generates a network model comprising: a pre-trained text encoder, a generator and a discriminator;
a text encoder: all texts in the associated text data set are encoded by a text encoder, and output sentence vectors are stored in a text encoding library;
the generator has two inputs: the sentence vector encoded by the text encoder and random noise sampled from a normal distribution; the noise is converted to a set size through a fully connected layer, and image features are generated through a series of deep semantic fusion modules; in the deep semantic fusion module of each layer, the multiple input sentences interact with the feature map of the current level, and a cross-modal attention mechanism is computed to differentiate the weight distribution of the sentences across the generator layers; a convolutional layer then converts the image features into an image; each deep semantic fusion module comprises an upsampling layer, a residual block and a text-image feature fusion block;
the discriminator converts the image into image features using a series of downsampling layers, then concatenates the image features with the sentence vector, and computes the adversarial loss in one step to ensure visual realism and semantic consistency;
the loss functions of the generator and the discriminator are as follows:
L_D = -E_{x~P_r}[min(0, -1 + D(x, e))] - (1/2)·E_{G(z)~P_g}[min(0, -1 - D(G(z), e))] - (1/2)·E_{x~P_mis}[min(0, -1 - D(x, e))]
L_G = -E_{G(z)~P_g}[D(G(z), e)]
where L_D is the loss function of the discriminator and L_G is the loss function of the generator, both computed as hinge losses; z is a noise vector sampled from a Gaussian distribution; D is the discriminator, G is the generator, G(z) denotes an image produced by the generator, and e is the sentence vector; P_g, P_r and P_mis denote the synthetic data distribution, the real data distribution and the mismatched data distribution respectively; min(0, ·) is the hinge function; x denotes a real image, D(x, e) denotes the discriminator output for an input real image, and D(G(z), e) denotes the discriminator output for an input generated image.
In some embodiments, in step S3, the processing procedure of the generator includes:
the generator adopts an attention mechanism, where α_n is the attention weight corresponding to the n-th sentence;
α_n = exp(s(W(z), W(X_n))) / sum_j exp(s(W(z), W(X_j)))
where X is the set of input sentence vectors, z is the input random noise, s is the attention score function, and W denotes several linear layers that map the sentence vectors into the latent space; W(z) is the feature map of the image at the current generator layer, and given W(z) and X, α_n is computed.
As mentioned above, the quality of text-to-image generation is mediocre due to the lack of text information. To solve this problem, the invention introduces a generative pre-trained model to associate and enrich the text and to support text-to-image generation.
In some embodiments, as shown in FIG. 1, the whole pipeline from the generative pre-trained model to the adversarial generation network that finally produces the image works as follows: the original text is input into the pre-trained model, which is then fine-tuned, and the fine-tuned pre-trained model is used to compile the final data set. After the text encodings are obtained, text-to-image generation is performed by an adversarial generation network with a cross-modal attention mechanism.
Thus, the method of the invention comprises three steps. S1: fine-tune the generative pre-trained model on a small-scale data set so that it retains the existing text information with good semantic fidelity. S2: using the obtained pre-trained model, which as a generative model can produce data absent from the original data set and can associate and infer more generalized semantic information, establish an association from the original data set to richer information by means of this rich associative ability, associating richer text information from a single text of the existing data. S3: using an adversarial generation network and an attention mechanism, selectively bias the weights of different sentences at different levels of the generator, so as to generate images consistent in text-image cross-modal semantic features. These steps are described in detail below:
s1: the generative pre-training model is finely adjusted based on a small-scale data set, so that the pre-training model obtains existing text information with good semantic retention
In this method, information related to the corresponding text needs to be selectively retrieved from an external knowledge base, and the generative pre-trained model is trained with a sentence-prompting, fill-in style fine-tuning method; the missing text information is thereby supplemented, and meaningful textual information beyond the data set can be generated.
Step S11: candidate text selection and sentence string editing
Each image in the existing data set corresponds to ten sentences. In this step one text is used as input, the target of the model is to generate a sentence string containing the ten sentences, and the model is considered converged when it preserves the original ten-sentence information.
Viewed as a whole, the sentence string is divided into two parts. The first part is randomly initialized: sentence a, sentence b and sentence c (not labelled) are three sentences randomly drawn from the ten sentences corresponding to one image. Because a model such as GPT-2 can only generate later sentences from earlier ones and cannot recall earlier sentences from later ones, the order of the ten sentences is shuffled, and at each training step the model tries to restore the original semantic information from three different sentences, which weakens the errors the model would otherwise incur from positional information. The second part is the sequential concatenation of the remaining sentences, where "#" and "$" are the separator and start/end symbol. Since the GPT-2 model generates structured text data and the obtained sentence string still needs to be split apart after generation, the separator helps parse the generated sentence string, and the start/end symbol prevents the model from generating sentence strings that are too long or too short.
Step S12: generative pre-training model training process
Fine-tuning the pre-trained model builds on the inference process of the GPT-2 model, and training is divided into two parts. First, the language model undergoes an unsupervised pre-training process; the GPT-2 model adopted in this application has been pre-trained in advance on the Sakebab script data set.
A sentence string is then input, and the model is required to restore the original sentence string from the input sentence to ensure its generation quality. Given an input [x_1, x_2, ..., x_m], the loss functions of the model during pre-training and fine-tuning are L_1(X) and L_2(X) as given above, where L_1(X) is a maximum likelihood function and [x_1, x_2, ..., x_m] is the input sentence sequence. The fine-tuning process adopts supervised learning: a training sample comprises the sentence sequence [x_1, x_2, ..., x_m] with x_1 as the label. During GPT-2 fine-tuning, the loss for predicting the class label from the sentence sequence [x_1, x_2, ..., x_m] is L_2(X).
Step S13: generative pre-training model inference process
After the fine-tuned GPT-2 model is obtained, the model can restore the remaining sentences of the original data set from a given sentence; the GPT-2 model generates structured text data, and the final output retains the separators and start/end symbols.
The data set used herein is the CUB data set, containing 11,788 bird images from 200 different species, each image corresponding to ten sentences. The application mimics the input of a typical adversarial generation network: a sentence is randomly drawn from the ten as the input sentence, and GPT-2 is required to restore the parts of the original description that the single text cannot describe, i.e., generation in a complementary manner.
The input sentences are finally output as sentence strings and stored, to facilitate the subsequent splitting and the training of the adversarial generation network. Note that only the training split of the CUB data set (8,855 images of 150 bird species and their corresponding texts) is used to train the GPT-2 model; when inferring generated text, the test split (2,933 images of 50 bird species and their corresponding texts) is also generated in order to ensure fairness.
S2, according to the obtained pre-training model, establishing association from an original data set to rich information by means of rich association capability of the pre-training model, and associating the rich text information from a single text of the existing data.
The whole data set is taken as input, and the fine-tuned pre-trained model is used to generate new text. The generated text is finally compiled into a new data set; because a large amount of text information is added, pre-compiling it into a data set greatly accelerates the training of the adversarial generation network, while ensuring that the generated data set carries better semantic information.
S21, deducing and generating rich text according to the fine-tuned pre-training model and the original data set
The ten sentences corresponding to each image in the original data set are traversed and fed into the GPT-2 model for generation, finally producing text data sets with richer semantic information, as sketched below.
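A sketch of this traversal and generation step, assuming the HuggingFace generate API and a fine-tuned GPT-2 checkpoint, might look as follows; the sampling parameters are illustrative, and the outputs would then be screened by the three principles described next.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")       # the fine-tuned weights would be loaded here
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def associate(sentences, max_new_tokens=200):
    # Feed each of an image's ten captions into GPT-2 and collect the generated strings.
    results = []
    for s in sentences:                                   # traverse the original ten sentences
        prompt = "$ " + s + " #"
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model.generate(ids, max_new_tokens=max_new_tokens,
                                 do_sample=True, top_k=50,
                                 pad_token_id=tokenizer.eos_token_id)
        results.append(tokenizer.decode(out[0], skip_special_tokens=True))
    return results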
When an output result is selected, the method adopts the following three principles to screen and sort the sentences generated by association:
1. Proximity principle: GPT-2 is a language model that predicts the next word with maximum posterior probability, and as the generated string grows longer, the plausibility of the sentences towards its end decreases. Therefore the application truncates the result generated by the GPT-2 model each time, to varying degrees, ensuring that only the front-most sentences that meet the requirements enter the data set as associative text.
2. Format regularization: owing to its more generalized knowledge base, the sentences generated by the GPT-2 model often deviate considerably from the expected results. At the most basic level the format is constrained: the generated sentences must retain the start/end symbol and separators defined by the application, and any output that does not satisfy this is discarded and regenerated.
3. Sentence selection: because the bird species in the data set often differ greatly, it is difficult for the GPT-2 model to fit all species equally well within the same number of fine-tuning rounds, and under-fitting or over-fitting often occurs. For species with unsatisfactory fit, separate optimization is performed and an L2 distance to the original sentence is added, ensuring the credibility of the generated text in these special cases.
S22, evaluating semantic retention degree of generated data set
Preserving sentence semantics is difficult to measure. In the method of the present application, the data set generated by the generative pre-trained model must be analysed semantically, and it is required to maintain semantic information at the same level as the original data set.
The index adopted in this part is BLEU-2, whose formula is shown below. BLEU measures the relatedness of sentences; in view of the particular nature of the data set text, the generated results are evaluated with BLEU-2, that is, a score is earned whenever two adjacent words match.
bleu_2 = sum_{c ∈ candidates} sum_{2-gram ∈ c} Count_clip(2-gram) / sum_{c′ ∈ candidates} sum_{2-gram′ ∈ c′} Count(2-gram′)
The following three cases were chosen as references:
1 - bleu_a: samples of different poses and backgrounds within the same category;
2 - bleu_b: samples from markedly different categories;
3 - bleu_c: samples whose visual features are similar but which belong to different categories.
Intuitively, sentences corresponding to birds of the same species should be semantically closer; in practice, however, even for images of the same species the final visual characteristics are strongly affected by pose, background, environment and so on, so the application expects bleu_a to be an intermediate value.
For birds whose visual features differ markedly, the corresponding texts usually differ more in semantics, so bleu_b is expected to be a small value. The application also selects similar-looking birds from different species; to verify that sentences corresponding to similar-looking birds are strongly related semantically, bleu_c should exhibit a larger value. The measurement was performed on this basis, and it was judged that the generated data set preserves consistent semantics (see Table 1). As shown in Table 1, all three indices are semantically consistent with the original data set.
TABLE 1
Index      CUB      Multi-CUB
bleu_a     0.230    0.317
bleu_b     0.212    0.298
bleu_c     0.317    0.378
And S3, selectively biasing the weights of different sentences in different levels of the model generator by using an antagonistic generation network and by means of an attention mechanism, so as to generate images consistent in cross-modal semantic features of the text images.
Image generation uses a DF-GAN-based adversarial generation network, a single-stage text-to-image approach that achieves good generation quality with a comparatively small number of parameters. The application feeds the generated associative text data set into the model and adds a text-image cross-modal attention mechanism to optimize the image generation process.
Step S31: text coder pre-training and pre-storage
Since the associative text data set is more than ten times larger than the original data set, the original strategy of encoding input texts one by one would consume a great deal of training time. All texts are therefore processed by the text encoder in advance and the output vectors are stored in an encoding library, which facilitates subsequent model training and optimization, as sketched below.
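A sketch of this pre-encoding step is shown below; the encoder interface is a placeholder assumption (any pre-trained text encoder returning a fixed-size sentence vector would do), and the storage format is illustrative.

import torch

def build_text_encoding_library(sentences, encoder, path="text_codes.pt"):
    # Encode every sentence of the associative text data set once and cache the vectors.
    library = {}
    with torch.no_grad():
        for idx, sent in enumerate(sentences):
            library[idx] = encoder(sent).cpu()   # encoder: text -> sentence vector (assumed interface)
    torch.save(library, path)                    # reused during GAN training instead of re-encoding
    return library

# usage with a dummy encoder standing in for the pre-trained text encoder
dummy_encoder = lambda s: torch.randn(256)
lib = build_text_encoding_library(["a bird with blue wings", "a small brown bird"], dummy_encoder)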
Step S32: text-to-image generation countermeasure network based on multi-text associative generation of generative pre-trained language models
The DF-GAN-based adversarial generation network model abandons the earlier stacked structure and uses only a generator, a discriminator and a pre-trained text encoder.
The generator has two inputs: the sentence vector produced by the text encoder and random noise sampled from a normal distribution. The noise is first fed into a fully connected layer and reshaped to the required size, then image features are generated by a series of DF blocks, each comprising an upsampling layer, a residual block and a text-image feature fusion block; finally a convolutional layer converts the image features into an image.
A series of downsampling layers in the discriminator converts the image into image features, which are concatenated with the sentence vector; the adversarial loss is then computed in one step to guarantee visual realism and semantic consistency. A structural sketch of this generator pipeline is given below.
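The sketch below illustrates the single-stage generator structure just described (fully connected reshaping, stacked DF blocks, final convolution); the layer sizes and the simplified fusion block are assumptions, not the exact DF-GAN configuration.

import torch
import torch.nn as nn

class DFBlock(nn.Module):
    # Upsample + residual conv + (simplified) text-image fusion.
    def __init__(self, ch, txt_dim):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)
        self.fuse = nn.Linear(txt_dim, ch)        # stand-in for the text-image feature fusion block

    def forward(self, x, sent):
        x = self.up(x)
        x = x + torch.relu(self.conv(x))          # residual branch
        return x + self.fuse(sent)[:, :, None, None]  # inject the text condition per channel

class Generator(nn.Module):
    def __init__(self, z_dim=100, txt_dim=256, ch=64, n_blocks=4):
        super().__init__()
        self.fc = nn.Linear(z_dim, ch * 4 * 4)    # reshape noise to an initial 4x4 feature map
        self.blocks = nn.ModuleList(DFBlock(ch, txt_dim) for _ in range(n_blocks))
        self.to_img = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, z, sent):
        x = self.fc(z).view(z.size(0), -1, 4, 4)
        for blk in self.blocks:
            x = blk(x, sent)                      # the sentence interacts with each level's features
        return torch.tanh(self.to_img(x))

img = Generator()(torch.randn(2, 100), torch.randn(2, 256))
print(img.shape)  # torch.Size([2, 3, 64, 64])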
Step S33: text-to-image generator network based on cross-modal attention mechanism
Since the GAN generates images by combining text features with the feature map converted from noise, with W(z) being the feature map at the current generation level, the model adaptively fits the current text-plus-noise vector to the corresponding image features, so that the sentence vectors receive different weights in each generator layer. For example, when the generator creates an image of a bird, fine-grained information such as feathers, beak and colors usually carries more weight in the later layers of the generator, while coarse-grained information such as background and outline carries more weight in the earlier layers.
The effectiveness of the method is demonstrated with an example. The experiment is carried out on the CUB bird data set, in which each image corresponds to ten sentences; when generating the corresponding image, one text is randomly drawn for generation, and the generation results are shown in FIG. 5.
Example 2
In a second aspect, the present embodiment provides an apparatus for generating an association text based on a generative pre-trained language model to an image, including a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to embodiment 1.
Example 3
In a third aspect, the present embodiment provides a storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method of embodiment 1.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be appreciated by those skilled in the art that the invention can be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The embodiments disclosed above are therefore to be considered in all respects as illustrative and not restrictive. All changes which come within the scope of or equivalence to the invention are intended to be embraced therein.

Claims (10)

1. A method for generating images from associative text based on a generative pre-trained language model, comprising:
S1, fine-tuning a generative pre-trained model on a data set so that the pre-trained model retains the existing text information with good semantic fidelity, obtaining a fine-tuned pre-trained model;
S2, taking the ten sentences corresponding to each image in the original data set as input to the fine-tuned pre-trained model obtained in step S1 to obtain a generated data set output by the model; performing constraint processing and semantic-retention evaluation and selection on the generated data set to obtain an associative text data set;
S3, based on the associative text data set obtained in step S2, using a DF-GAN-based adversarial generation network model to generate images that are consistent in text-image cross-modal semantic features.
2. The method for generating associative texts to images based on generative pre-trained language models according to claim 1, wherein the step S1 comprises:
step S11: acquiring a data set, and compiling ten sentences corresponding to each image in the data set into sentence strings;
step S12: and inputting the sentence string of the data set into a pre-training model for training and fine-tuning to obtain a fine-tuned pre-training model.
3. The method as claimed in claim 2, wherein the step S11 of organizing ten sentences corresponding to each image in the data set into a sentence string comprises: the data set comprises a plurality of images, and each image corresponds to ten sentences; and compiling ten sentences corresponding to each image into sentence strings according to the following rules:
the sentence string is arranged as follows: "$ sentence a # sentence b # sentence c # … # sentence 9 # sentence 10 $";
the sentence string is divided into two parts: the first part is randomly initialized, with sentence a, sentence b and sentence c being three sentences randomly drawn from the ten sentences corresponding to one image;
the second part is the sequential concatenation of the remaining sentences, where "#" and "$" are the separator and the start/end symbol respectively; GPT-2 generates a structured sentence string, the separator facilitates splitting the generated sentence string apart, and the start/end symbol prevents the model from generating sentence strings that are too long or too short.
4. The method for generating an associative text according to claim 2, wherein step S12 comprises:
wherein the pre-training model is a GPT-2 model; the GPT-2 model training and fine tuning method comprises the following steps:
let a given input sentence string be represented as the sentence sequence [x_1, x_2, ..., x_m], where x_m is the m-th sentence in the sentence string;
the loss functions of the GPT-2 model during pre-training and fine-tuning are L_1(X) and L_2(X) respectively, with the following formulas:
L_1(X) = sum_i log P(x_i | x_{i-k}, ..., x_{i-1}; Θ)
L_2(X) = log P(x_1 | x_1, x_2, ..., x_m)
where the pre-training loss function L_1(X) adopts a maximum likelihood function, P(·) denotes a conditional probability, and Θ is the neural network parameter; i takes the values 0, 1, ..., k obtained by traversal; k, which is smaller than m, is the size of the sliding window;
the fine-tuning process adopts supervised learning: a training sample comprises the sentence sequence [x_1, x_2, ..., x_m] with the first sentence x_1 as the label; during fine-tuning of the GPT-2 model, the loss for predicting the class label from the sentence sequence [x_1, x_2, ..., x_m] is L_2(X);
the optimization objective L_3 is a weighted sum of L_1 and L_2:
L_3 = L_2 + λ·L_1
where λ is a hyperparameter, and L_1 and L_2 are the loss functions of the GPT-2 model during pre-training and fine-tuning, respectively.
5. The method of claim 1, wherein constraining the generated data set comprises:
and processing the generated data set by adopting a proximity principle, format regularization and sentence selection.
6. The method as claimed in claim 1, wherein the semantic retention evaluation selection of the generated data set comprises:
the generated data set is evaluated using the BLEU index, which comprises bleu_a, computed over samples of different poses and backgrounds within the same category, bleu_b, computed over samples from markedly different categories, and bleu_c, computed over samples whose visual features are similar but which belong to different categories;
bleu_n = sum_{c ∈ candidates} sum_{n-gram ∈ c} Count_clip(n-gram) / sum_{c′ ∈ candidates} sum_{n-gram′ ∈ c′} Count(n-gram′)
where candidates denotes the sentences of the generated data set, reference denotes the sentences of the original data set, Count denotes a count and Count_clip denotes the clipped count in the numerator; an n-gram is a sequence of n consecutive words from a candidate measured against the reference, and an n-gram′ is a sequence of n consecutive words within a candidate; c and c′ are sentences taken from the candidate set; the sums over c ∈ candidates and c′ ∈ candidates cover all candidates; the sums over n-gram ∈ c and n-gram′ ∈ c′ cover all n-grams of each candidate; Count_clip(n-gram) denotes the clipped number of times an n-gram of a candidate appears in the reference; Count(n-gram′) denotes the number of occurrences of n-gram′ in the candidates;
the three indices bleu_a, bleu_b and bleu_c are calculated for the generated data set and for the original data set respectively;
if the ratios among the three indices of the generated data set are consistent with those of the original data set, the generated data set is considered semantically consistent with the original data set and is selected as the associative text data set.
7. The method of generating associated text-to-image based on generative pre-trained language model according to claim 1, wherein the DF-GAN based confrontation generates a network model comprising: a pre-trained text encoder, a generator and a discriminator;
a text encoder: all texts in the associated text data set are encoded by a text encoder, and output sentence vectors are stored in a text encoding library;
the generator has two inputs: the sentence vector encoded by the text encoder and random noise sampled from a normal distribution; the noise is converted to a set size through a fully connected layer, and image features are generated through a series of deep semantic fusion modules; in the deep semantic fusion module of each layer, the multiple input sentences interact with the feature map of the current level, and a cross-modal attention mechanism is computed to differentiate the weight distribution of the sentences across the generator layers; a convolutional layer then converts the image features into an image; each deep semantic fusion module comprises an upsampling layer, a residual block and a text-image feature fusion block;
the discriminator converts the image into image features using a series of downsampling layers, then concatenates the image features with the sentence vector, and computes the adversarial loss in one step to ensure visual realism and semantic consistency;
the loss functions of the generator and the discriminator are as follows:
L_D = -E_{x~P_r}[min(0, -1 + D(x, e))] - (1/2)·E_{G(z)~P_g}[min(0, -1 - D(G(z), e))] - (1/2)·E_{x~P_mis}[min(0, -1 - D(x, e))]
L_G = -E_{G(z)~P_g}[D(G(z), e)]
where L_D is the loss function of the discriminator and L_G is the loss function of the generator, both computed as hinge losses; z is a noise vector sampled from a Gaussian distribution; D is the discriminator, G is the generator, G(z) denotes an image produced by the generator, and e is the sentence vector; P_g, P_r and P_mis denote the synthetic data distribution, the real data distribution and the mismatched data distribution respectively; min(0, ·) is the hinge function; x denotes a real image, D(x, e) denotes the discriminator output for an input real image, and D(G(z), e) denotes the discriminator output for an input generated image.
8. The method as claimed in claim 7, wherein the generator processing comprises:
the generator adopts an attention mechanism, where α_n is the attention weight corresponding to the n-th sentence;
α_n = exp(s(W(z), W(X_n))) / sum_j exp(s(W(z), W(X_j)))
where X is the set of input sentence vectors, z is the input random noise, s is the attention score function, and W denotes several linear layers that map the sentence vectors into the latent space; W(z) is the feature map of the image at the current generator layer, and given W(z) and X, α_n is computed.
9. An apparatus for generating associated text-to-image based on a generative pre-trained language model, comprising a processor and a storage medium;
the storage medium is used for storing instructions;
the processor is configured to operate in accordance with the instructions to perform the steps of the method according to any one of claims 1 to 7.
10. A storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, performs the steps of the method of any one of claims 1 to 7.
CN202211095848.8A 2022-09-08 2022-09-08 Generation formula pre-training language model-based association text-to-image generation method Pending CN115393692A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211095848.8A CN115393692A (en) 2022-09-08 2022-09-08 Generation formula pre-training language model-based association text-to-image generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211095848.8A CN115393692A (en) 2022-09-08 2022-09-08 Generation formula pre-training language model-based association text-to-image generation method

Publications (1)

Publication Number Publication Date
CN115393692A true CN115393692A (en) 2022-11-25

Family

ID=84127207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211095848.8A Pending CN115393692A (en) 2022-09-08 2022-09-08 Generation formula pre-training language model-based association text-to-image generation method

Country Status (1)

Country Link
CN (1) CN115393692A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115953590B (en) * 2022-12-12 2024-01-30 之江实验室 Sectional type fine granularity commodity image description generation method, device and medium
CN115953590A (en) * 2022-12-12 2023-04-11 之江实验室 Segmented fine-grained commodity image description generation method, device and medium
CN116168119A (en) * 2023-02-28 2023-05-26 北京百度网讯科技有限公司 Image editing method, image editing device, electronic device, storage medium, and program product
CN116168119B (en) * 2023-02-28 2024-05-28 北京百度网讯科技有限公司 Image editing method, image editing device, electronic device, storage medium, and program product
CN115953779A (en) * 2023-03-03 2023-04-11 中国科学技术大学 Unsupervised image description generation method based on text countermeasure generation network
CN115953779B (en) * 2023-03-03 2023-06-16 中国科学技术大学 Unsupervised image description generation method based on text countermeasure generation network
CN116071759A (en) * 2023-03-06 2023-05-05 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Optical character recognition method fusing GPT2 pre-training large model
CN116071759B (en) * 2023-03-06 2023-07-18 合肥综合性国家科学中心人工智能研究院(安徽省人工智能实验室) Optical character recognition method fusing GPT2 pre-training large model
CN116431135A (en) * 2023-06-12 2023-07-14 江西五十铃汽车有限公司 Method, system, computer and readable storage medium for writing automobile code
CN116431135B (en) * 2023-06-12 2023-09-22 江西五十铃汽车有限公司 Method, system, computer and readable storage medium for writing automobile code
CN116468131B (en) * 2023-06-19 2023-09-01 成都市奇点软件有限公司 Automatic AI (advanced technology attachment) driven project method and system based on staged retraining
CN116468131A (en) * 2023-06-19 2023-07-21 成都市奇点软件有限公司 Automatic AI (advanced technology attachment) driven project method and system based on staged retraining
CN116863032A (en) * 2023-06-27 2023-10-10 河海大学 Flood disaster scene generation method based on generation countermeasure network
CN116863032B (en) * 2023-06-27 2024-04-09 河海大学 Flood disaster scene generation method based on generation countermeasure network
CN117095083A (en) * 2023-10-17 2023-11-21 华南理工大学 Text-image generation method, system, device and storage medium
CN117095083B (en) * 2023-10-17 2024-03-15 华南理工大学 Text-image generation method, system, device and storage medium
CN117131426A (en) * 2023-10-26 2023-11-28 一网互通(北京)科技有限公司 Brand identification method and device based on pre-training and electronic equipment
CN117131426B (en) * 2023-10-26 2024-01-19 一网互通(北京)科技有限公司 Brand identification method and device based on pre-training and electronic equipment
CN117392284A (en) * 2023-12-08 2024-01-12 江南大学 Self-adaptive condition enhanced text image generation method, system, device and medium
CN117392284B (en) * 2023-12-08 2024-03-08 江南大学 Self-adaptive condition enhanced text image generation method, system, device and medium

Similar Documents

Publication Publication Date Title
CN115393692A (en) Generation formula pre-training language model-based association text-to-image generation method
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
CN108984724B (en) Method for improving emotion classification accuracy of specific attributes by using high-dimensional representation
CN106855853A (en) Entity relation extraction system based on deep neural network
CN108846017A (en) The end-to-end classification method of extensive newsletter archive based on Bi-GRU and word vector
US11645479B1 (en) Method for AI language self-improvement agent using language modeling and tree search techniques
CN108664512B (en) Text object classification method and device
CN108268449A (en) A kind of text semantic label abstracting method based on lexical item cluster
CN112905795A (en) Text intention classification method, device and readable medium
CN112507699A (en) Remote supervision relation extraction method based on graph convolution network
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN112306494A (en) Code classification and clustering method based on convolution and cyclic neural network
CN111930939A (en) Text detection method and device
CN113312480A (en) Scientific and technological thesis level multi-label classification method and device based on graph convolution network
CN114443899A (en) Video classification method, device, equipment and medium
CN116661805B (en) Code representation generation method and device, storage medium and electronic equipment
CN111507093A (en) Text attack method and device based on similar dictionary and storage medium
US20230153335A1 (en) Searchable data structure for electronic documents
US20230014904A1 (en) Searchable data structure for electronic documents
CN107506345A (en) The construction method and device of language model
CN111145914A (en) Method and device for determining lung cancer clinical disease library text entity
CN114443846A (en) Classification method and device based on multi-level text abnormal composition and electronic equipment
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN110909174B (en) Knowledge graph-based method for improving entity link in simple question answering
JP2024012152A (en) Method for identify word corresponding to target word in text information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination