CN117475038A - Image generation method, device, equipment and computer readable storage medium - Google Patents

Image generation method, device, equipment and computer readable storage medium

Info

Publication number
CN117475038A
CN117475038A (application CN202311827290.2A)
Authority
CN
China
Prior art keywords
text
model
image
sample
entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311827290.2A
Other languages
Chinese (zh)
Other versions
CN117475038B (en)
Inventor
范宝余
张润泽
李仁刚
赵雅倩
郭振华
王丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN202311827290.2A
Publication of CN117475038A
Application granted
Publication of CN117475038B
Legal status: Active

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 11/00 - 2D [Two Dimensional] image generation
    • G06T 11/60 - Editing figures and text; Combining figures or text
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/22 - Matching criteria, e.g. proximity measures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/042 - Knowledge-based neural networks; Logical representations of neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G06N 3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0464 - Convolutional networks [CNN, ConvNet]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/0475 - Generative networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/092 - Reinforcement learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/0985 - Hyperparameter optimisation; Meta-learning; Learning-to-learn
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/30 - Noise filtering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/70 - Labelling scene content, e.g. deriving syntactic or semantic representations
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods
    • G06N 3/094 - Adversarial learning
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of artificial intelligence, and in particular discloses an image generation method, an image generation apparatus, an image generation device, and a computer-readable storage medium.

Description

Image generation method, device, equipment and computer readable storage medium
Technical Field
The present invention relates to the field of artificial intelligence, and in particular, to an image generating method, apparatus, device, and computer readable storage medium.
Background
Generative artificial intelligence (Artificial Intelligence Generated Content, AIGC) is a technology that, on the basis of generative techniques such as generative adversarial networks and large-scale pre-trained models, learns and recognizes existing data in order to generate related content with appropriate generalization ability. The core idea of generative artificial intelligence is to use artificial intelligence algorithms to generate content with a certain degree of originality and quality. By training a model on a large amount of data, content can be generated according to the input conditions or instructions. For example, by inputting keywords, descriptions, or samples, matching articles, images, audio, and the like can be generated.
Text-to-image generation is a major branch of generative artificial intelligence: by inputting a piece of text description, it converts the text-modality information into the image modality for display. It has a striking presentation effect and broad application prospects. However, current text-to-image technology produces images of poor quality, which makes it difficult to meet users' expectations.
How to improve the quality of the images generated by a text-to-image model is therefore a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
The object of the present invention is to provide an image generation method, apparatus, and device, and a computer-readable storage medium, for improving the quality of the images generated by a text-to-image model and improving user experience.
In order to solve the above technical problem, the present invention provides an image generation method, including:
acquiring a first sample text, an initial text-to-image model, and an image-to-text model;
starting from the initial text-to-image model, for the intermediate text-to-image model in each round of iterative training: inputting the first sample text into the intermediate text-to-image model and outputting a first intermediate image; inputting the first intermediate image into the image-to-text model and outputting a first predicted text; constructing a reinforcement learning reward function from the text similarity score between the first predicted text and the first sample text; and updating the model parameters of the intermediate text-to-image model using the reinforcement learning reward function, until the iterative training ends, to obtain the text-to-image model;
inputting a text to be processed into the text-to-image model, and outputting a result image.
In some implementations, the inputting the first intermediate image into the image-to-text model and outputting a first predicted text, and the constructing a reinforcement learning reward function from the text similarity score between the first predicted text and the first sample text, include:
inputting the first intermediate image into the image-to-text model for entity detection, to obtain the entities corresponding to the first intermediate image;
generating the first predicted text using the entities corresponding to the first intermediate image, and calculating the similarity between the first predicted text and the first sample text to obtain a first text similarity score;
detecting and segmenting the first intermediate image according to the entities corresponding to the first intermediate image, to obtain the entity masks corresponding to the first intermediate image;
calculating the scene similarity between the entity masks and the first intermediate image, to obtain a scene similarity score;
calculating a second text similarity score from the first text similarity score and the scene similarity score;
constructing the reinforcement learning reward function from the second text similarity score (see the sketch following this list);
wherein the entity types include at least one of an object, a relationship, an attribute, and a behavior.
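By way of illustration only, the following Python sketch shows one possible shape of this fine-grained reward construction. The helper callables (`detect_entities`, `segment_entities`, `caption_from_tags`, `text_similarity`, and `scene_similarity`) are hypothetical stand-ins for the recognition-and-marking, detection-and-segmentation, text generation, and scoring models described above, and the equal-weight combination in the final step is an assumption rather than a rule fixed by the text:

```python
from typing import Callable, List

def fine_grained_reward(
    sample_text: str,
    intermediate_image,                 # a generated first intermediate image
    detect_entities: Callable,          # hypothetical recognition-and-marking wrapper
    segment_entities: Callable,         # hypothetical detection-and-segmentation wrapper
    caption_from_tags: Callable,        # hypothetical text generation wrapper
    text_similarity: Callable[[str, str], float],
    scene_similarity: Callable,
) -> float:
    # 1. Entity detection: objects, relationships, attributes, behaviors.
    entity_tags: List[str] = detect_entities(intermediate_image)

    # 2. Detection and segmentation screen out falsely detected tags and
    #    yield one entity mask per surviving entity.
    masks, screened_tags = segment_entities(intermediate_image, entity_tags)

    # 3. First predicted text from the (screened) entity tags.
    predicted_text = caption_from_tags(screened_tags)
    first_score = text_similarity(predicted_text, sample_text)

    # 4. Scene similarity between the entity masks and the full image.
    scene_score = scene_similarity(masks, intermediate_image)

    # 5. Second text similarity score: add the two scores and normalize
    #    the sum back to [0, 1] (equal weighting assumed).
    return (first_score + scene_score) / 2.0
```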
In some implementations, the generating the first predicted text using the entities corresponding to the first intermediate image includes:
generating the first predicted text using the screened entity tags obtained by detecting and segmenting the first intermediate image according to the entities corresponding to the first intermediate image.
In some implementations, the constructing the reinforcement learning reward function from the second text similarity score includes:
inputting the first intermediate image into a multi-modal conversion model to obtain a second predicted text;
calculating the similarity between the second predicted text and the first sample text to obtain a third text similarity score;
and constructing the reinforcement learning reward function from the second text similarity score and the third text similarity score.
In some implementations, the inputting the first intermediate image into the image-to-text model for entity detection to obtain the entities corresponding to the first intermediate image includes:
inputting the first intermediate image into an image recognition model in the image-to-text model, to obtain the image features of the first intermediate image;
identifying the entities contained in the first intermediate image from the image features of the first intermediate image, using a recognition-and-marking model in the image-to-text model;
marking each entity using the recognition-and-marking model, to generate an entity text tag list corresponding to the first intermediate image;
and the generating the first predicted text using the entities corresponding to the first intermediate image includes:
generating the first predicted text from the entity text tags in the entity text tag list, using a text generation model in the image-to-text model.
In some implementations, the training process of the image-to-text model includes:
acquiring image-text pairs and an initial image-to-text model;
and, starting from the initial image-to-text model, for the intermediate image-to-text model in each round of iterative training: inputting the sample image of an image-text pair into the image recognition model in the intermediate image-to-text model, to obtain the image features of the sample image; identifying the intermediate entities contained in the sample image from the image features of the sample image, using the recognition-and-marking model in the intermediate image-to-text model; marking the intermediate entities, to generate the intermediate entity text tag list corresponding to the sample image; generating a third predicted text from the intermediate entity text tags in the intermediate entity text tag list; and updating the model parameters of the intermediate image-to-text model using the text difference degree between the second sample text corresponding to the sample image and the third predicted text, until the iterative training ends, to obtain the image-to-text model.
In some implementations, for the intermediate image-to-text model in each round of iterative training, the inputting the sample image of the image-text pair into the image recognition model in the intermediate image-to-text model to obtain the image features of the sample image, identifying the intermediate entities contained in the sample image from the image features of the sample image using the recognition-and-marking model in the intermediate image-to-text model, marking the intermediate entities to generate the intermediate entity text tag list corresponding to the sample image, generating the third predicted text from the intermediate entity text tags in the intermediate entity text tag list, and updating the model parameters of the intermediate image-to-text model using the text difference degree between the second sample text corresponding to the sample image and the third predicted text, includes:
using an image feature encoder with frozen parameters as the image recognition model, and inputting the sample image into the image feature encoder to obtain the image feature code of the sample image; extracting the intermediate entity text tags from the second sample text using a contrastive language-image pre-training text encoder;
assigning the feature weights of the intermediate entity text tags to learnable text query parameters;
inputting the assigned learnable text query parameters and the image feature code into the recognition-and-marking model to obtain predicted entity feature vectors and analysis tags, and training the recognition-and-marking model by cross-entropy loss on the predicted entity feature vectors and the analysis tags, with the intermediate entity text tags as the ground truth;
inputting the entity tags and the image feature code into the text generation model and outputting text analysis features, and, with the second sample text as the ground truth, training the text generation model by cross-entropy loss between the text analysis features and the second sample text.
In some implementations, for the different first intermediate images generated in one round of iterative training of the text-to-image model, the generating of the image features of the first intermediate images using the image recognition model, the generating of the entity text tag list using the recognition-and-marking model, and the generating of the first predicted text from the entity text tags in the entity text tag list using the text generation model are executed in parallel across multiple paths.
In some implementations, the calculating the similarity between the first predicted text and the first sample text to obtain a first text similarity score includes:
generating a first scoring task from the similarity between the first predicted text and the first sample text, and inputting the first scoring task into an artificial intelligence large model to obtain the first text similarity score;
the calculating the scene similarity between the entity masks and the first intermediate image to obtain a scene similarity score includes:
generating a second scoring task from the scene similarity between the entity masks and the first intermediate image, and inputting the second scoring task into the artificial intelligence large model to obtain the scene similarity score;
and the calculating the similarity between the second predicted text and the first sample text to obtain a third text similarity score includes:
generating a third scoring task from the similarity between the second predicted text and the first sample text, and inputting the third scoring task into the artificial intelligence large model to obtain the third text similarity score.
In some implementations, the constructing the reinforcement learning reward function from the second text similarity score and the third text similarity score includes:
sorting the second text similarity scores corresponding to the first intermediate images, to obtain the second text similarity score matrix corresponding to the first sample text;
sorting the third text similarity scores corresponding to the first intermediate images, to obtain the third text similarity score matrix corresponding to the first sample text;
and taking the sum of the second text similarity score matrix and the third text similarity score matrix as the reinforcement learning reward function (see the sketch below).
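A minimal sketch of this reward construction follows, assuming each first sample text has produced one second and one third similarity score per random seed; PyTorch tensors stand in for the score matrices (the 1-D case shown generalizes to one row per first sample text):

```python
import torch

def build_reward(second_scores, third_scores):
    # second_scores / third_scores: per-intermediate-image similarity
    # scores, one entry per random seed, as described above.
    second = torch.tensor(sorted(second_scores))   # second-score matrix
    third = torch.tensor(sorted(third_scores))     # third-score matrix
    # The reinforcement learning reward is the sum of the two matrices.
    return second + third
```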
In some implementations, for the intermediate text-to-image model in each round of iterative training, the inputting the first sample text into the intermediate text-to-image model and outputting a first intermediate image, inputting the first intermediate image into the image-to-text model and outputting a first predicted text, constructing a reinforcement learning reward function from the text similarity score between the first predicted text and the first sample text, and updating the model parameters of the intermediate text-to-image model using the reinforcement learning reward function, includes:
sampling from a first sample text set to obtain the first batch of training data for the current round of iterative training;
for each first sample text in the first batch of training data, inputting the first sample text into the intermediate text-to-image model, and outputting the first intermediate image set corresponding to that first sample text;
inputting each first intermediate image of the first intermediate image set into the image-to-text model and outputting the first predicted text, and constructing the reinforcement learning reward function corresponding to the first sample text from the text similarity scores between the first predicted texts corresponding to the first intermediate image set of the first sample text and the first sample text;
and updating the model parameters of the intermediate text-to-image model using the reinforcement learning reward functions corresponding to the first sample texts in the first batch of training data.
In some implementations, the sampling from the first sample text set to obtain the first batch of training data for the current round of iterative training includes:
while the model parameters of the intermediate text-to-image model are being updated based on the first batch of training data in the current round of iterative training, executing in parallel the step of sampling from the first sample text set to obtain the first batch of training data for the next round of iterative training (a sketch of this pipelining follows below).
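A minimal sketch of the pipelining described above, assuming `sample_batch` and `update_step` wrap the sampling step and the parameter-update step respectively; a single worker thread prefetches the next batch while the current update runs:

```python
from concurrent.futures import ThreadPoolExecutor

def train_with_prefetch(sample_set, sample_batch, update_step, num_iters: int):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(sample_batch, sample_set)       # batch for iteration 0
        for _ in range(num_iters):
            batch = future.result()                          # wait for current batch
            future = pool.submit(sample_batch, sample_set)   # prefetch next batch
            update_step(batch)                               # update on current batch
```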
In some implementations, the initial text-to-image model is composed of a text encoder, an image auto-encoder, cross-attention down-sampling layers, cross-attention up-sampling layers, and a cross-attention layer;
the cross-attention down-sampling layers, the cross-attention up-sampling layers, and the cross-attention layer are each provided with an adaptation layer arranged on a fully connected layer of the cross-attention module;
and the adaptation layer is composed of an R down-sampling fully connected module and an R up-sampling fully connected module, wherein the R down-sampling fully connected module is used to reduce the input feature channels to R dimensions, and the R up-sampling fully connected module is used to restore the R-dimensional features to the original input feature dimensions (see the sketch below).
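A PyTorch sketch of such an adaptation layer is given below. The residual wiring around the frozen fully connected layer and the zero initialization of the up-projection are assumptions in the spirit of low-rank adapters, not details fixed by the text above:

```python
import torch
import torch.nn as nn

class AdaptationLayer(nn.Module):
    """R down-sampling / R up-sampling adapter on a frozen base layer."""

    def __init__(self, base_fc: nn.Linear, r: int = 8):
        super().__init__()
        self.base_fc = base_fc                # frozen fully connected layer
        for p in self.base_fc.parameters():
            p.requires_grad = False
        # R down-sampling fully connected module: input channels -> R dims.
        self.down = nn.Linear(base_fc.in_features, r, bias=False)
        # R up-sampling fully connected module: R dims -> original dims.
        self.up = nn.Linear(r, base_fc.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # adapter starts as a no-op

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the trainable low-rank update.
        return self.base_fc(x) + self.up(self.down(x))
```

With this structure, only the `down` and `up` modules need gradients, which matches the parameter-freezing strategy described in the next implementation.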
In some implementations, the updating the model parameters of the intermediate text-to-image model using the reinforcement learning reward function includes:
freezing the model parameters of the network other than the R down-sampling fully connected modules and the R up-sampling fully connected modules in the intermediate text-to-image model, and updating the model parameters of the R down-sampling fully connected modules and of the R up-sampling fully connected modules using the reinforcement learning reward function.
In some implementations, the inputting the first sample text into the intermediate text-to-image model and outputting a first intermediate image includes:
performing quantization processing on the model parameters of the intermediate text-to-image model, to obtain the quantized intermediate text-to-image model;
and inputting the first sample text into the quantized intermediate text-to-image model, and outputting the first intermediate image;
and the updating the model parameters of the intermediate text-to-image model using the reinforcement learning reward function includes:
when performing the back-propagation computation for updating the model parameters of the intermediate text-to-image model using the reinforcement learning reward function, performing inverse quantization processing on the model parameters of the intermediate text-to-image model, to obtain the updated intermediate text-to-image model.
In some implementations, the performing quantization processing on the model parameters of the intermediate text-to-image model to obtain the quantized intermediate text-to-image model includes:
performing four-bit standard floating point quantization on the model parameters of the intermediate text-to-image model, to normalize the model parameters to [-1, 1];
performing double quantization on the four-bit-quantized model parameters and the original 32-bit floating point quantization constants in the intermediate text-to-image model, converting the quantization constants of the intermediate text-to-image model into quantized quantization constants and second-stage quantization constants, to obtain the quantized intermediate text-to-image model;
and, during the quantization processing of the model parameters of the intermediate text-to-image model, managing the storage locations of the intermediate text-to-image model using a paged optimizer (see the configuration sketch below).
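The combination described above (four-bit standard floating point quantization, double quantization of the quantization constants, and a paged optimizer) matches the publicly known QLoRA recipe. Assuming the Hugging Face `transformers` and `bitsandbytes` libraries were used, the configuration might look as follows:

```python
import torch
import bitsandbytes as bnb
from transformers import BitsAndBytesConfig

# 4-bit quantization of the frozen base weights; the 32-bit quantization
# constants are themselves quantized again (double quantization).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",         # 4-bit NormalFloat, normalized to [-1, 1]
    bnb_4bit_use_double_quant=True,    # second-stage quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Paged optimizer: pages optimizer state between GPU and CPU memory to
# absorb memory spikes during back-propagation.
adapter = torch.nn.Linear(768, 768)    # stand-in for the trainable adapter modules
optimizer = bnb.optim.PagedAdamW(adapter.parameters(), lr=1e-4)
```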
In some implementations, the inputting the text to be processed into the text-to-image model and outputting a result image includes:
inputting the text to be processed into the text-to-image model, and outputting a second intermediate image;
acquiring the user's feedback result on the second intermediate image;
and regenerating the second intermediate image with the text-to-image model according to the feedback result and the text to be processed, until a second intermediate image meeting the feedback requirement is acquired, and outputting the second intermediate image meeting the feedback requirement as the result image.
In some implementations, the inputting the text to be processed into the text-to-image model and outputting a second intermediate image includes:
inputting the text to be processed into the text-to-image model, and outputting a plurality of second intermediate images;
the acquiring the user's feedback result on the second intermediate image includes:
listing the second intermediate images whose feedback result is positive in a positive feedback set, and listing the second intermediate images whose feedback result is negative in a negative feedback set;
and the regenerating the second intermediate image with the text-to-image model according to the feedback result and the text to be processed until a second intermediate image meeting the feedback requirement is acquired, and outputting the second intermediate image meeting the feedback requirement as the result image, includes:
generating feedback samples from the positive feedback set and the negative feedback set, and regenerating the second intermediate image using the feedback samples and the text to be processed, until a second intermediate image meeting the feedback requirement is obtained, and outputting the second intermediate image meeting the feedback requirement as the result image.
In some implementations, the text-to-image model is a text-to-image diffusion model;
and the generating feedback samples from the positive feedback set and the negative feedback set, regenerating the second intermediate image with the text-to-image diffusion model using the feedback samples and the text to be processed until a second intermediate image meeting the feedback requirement is obtained, and outputting the second intermediate image meeting the feedback requirement as the result image, includes the following steps (sketched in code after this list):
acquiring random Gaussian noise;
generating a feedback noising vector from the positive feedback set and the negative feedback set;
inputting the feedback noising vector, the current time step variable, and the text to be processed into the text-to-image diffusion model, to obtain a feedback prediction variable;
dividing the feedback prediction variable into positive hidden vectors and negative hidden vectors according to the positive feedback set and the negative feedback set;
taking the features generated after passing through the residual network in each sub-network of the text-to-image diffusion model as extension hidden vectors, and recording the hidden sub-layer features corresponding to each layer of the extension hidden vectors in the text-to-image diffusion model, the hidden sub-layer features being divided into positive-sample hidden sub-layer features and negative-sample hidden sub-layer features according to the positive feedback set and the negative feedback set;
updating the model weight vectors of the text-to-image diffusion model according to the model parameters in the self-attention modules of the text-to-image diffusion model and the features output by the residual modules of the text-to-image diffusion model, generating a positive-sample denoising vector from the positive-sample model weight vectors in the self-attention modules based on the attention mechanism, and generating a negative-sample denoising vector from the negative-sample model weight vectors in the self-attention modules based on the attention mechanism;
generating the current denoising vector from the positive-sample denoising vector and the negative-sample denoising vector;
if the number of denoising passes of the text-to-image diffusion model has not reached the set number of denoising passes, returning, after denoising the Gaussian noise with the current denoising vector, to the step of generating a feedback noising vector from the positive feedback set and the negative feedback set;
and if the number of denoising passes of the text-to-image diffusion model has reached the set number of denoising passes, taking the current denoising vector as the final denoising vector, and inputting the final denoising vector into a decoder to obtain the result image.
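The following sketch outlines one possible reading of this feedback-guided denoising loop. `make_feedback_vector`, `split_by_feedback`, and `denoise_step` are hypothetical helpers standing in for the feedback-noising, positive/negative splitting, and scheduler steps described above, and the guidance weight `w` is an assumption:

```python
import torch

@torch.no_grad()
def feedback_guided_sampling(
    diffusion_model,          # text-to-image diffusion model (assumed callable)
    decoder,                  # latent decoder producing the result image
    text_cond,                # encoded text to be processed
    pos_set, neg_set,         # positive / negative feedback sets
    num_steps: int = 50,      # set number of denoising passes
    latent_shape=(1, 4, 64, 64),
    w: float = 2.0,           # assumed guidance weight
):
    latent = torch.randn(latent_shape)            # random Gaussian noise
    for t in reversed(range(num_steps)):
        # Feedback noising vector built from both feedback sets (hypothetical).
        noised = make_feedback_vector(latent, pos_set, neg_set, t)
        # Feedback prediction variable from the diffusion model.
        pred = diffusion_model(noised, t, text_cond)
        # Split into positive and negative hidden vectors (hypothetical).
        eps_pos, eps_neg = split_by_feedback(pred, pos_set, neg_set)
        # Current denoising vector: move towards the positive samples and
        # away from the negative ones.
        eps = eps_neg + w * (eps_pos - eps_neg)
        latent = denoise_step(latent, eps, t)     # hypothetical scheduler step
    return decoder(latent)                        # decode the final vector
```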
In order to solve the above technical problem, the present invention further provides an image generation apparatus, including:
a first acquisition unit, configured to acquire a first sample text, an initial text-to-image model, and an image-to-text model;
a first training unit, configured to, starting from the initial text-to-image model, for the intermediate text-to-image model in each round of iterative training: input the first sample text into the intermediate text-to-image model and output a first intermediate image; input the first intermediate image into the image-to-text model and output a first predicted text; construct a reinforcement learning reward function from the text similarity score between the first predicted text and the first sample text; and update the model parameters of the intermediate text-to-image model using the reinforcement learning reward function, until the iterative training ends, to obtain the text-to-image model;
and a first output unit, configured to input a text to be processed into the text-to-image model and output a result image.
In order to solve the above technical problem, the present invention further provides an image generation device, including:
a memory for storing a computer program;
and a processor for executing the computer program, wherein the computer program, when executed by the processor, implements the steps of the image generation method according to any one of the above.
In order to solve the above technical problem, the present invention further provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the image generation method according to any one of the above.
According to the image generation method provided by the present invention, starting from the initial text-to-image model, for the intermediate text-to-image model in each round of iterative training, the first sample text is input into the intermediate text-to-image model and a first intermediate image is output; the first intermediate image is input into the image-to-text model and a first predicted text is output; a reinforcement learning reward function is constructed from the text similarity score between the first predicted text and the first sample text; and the model parameters of the intermediate text-to-image model are updated using the reinforcement learning reward function until the iterative training ends, to obtain the text-to-image model. In this way, the text-to-image model can be optimized by training on the first sample text and the image-to-text model alone, without annotated sample images, improving the quality of the generated images.
The present invention further provides an image generation apparatus, an image generation device, and a computer-readable storage medium, which have the above beneficial effects and are not described again here.
Drawings
For a clearer description of the embodiments of the present invention or of the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of an image generation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of fine-grained evaluation scoring according to an embodiment of the present invention;
FIG. 3 is a flowchart of overall evaluation scoring according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an end-to-end recognition, detection, segmentation, and text generation model application according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of the structure of a text-to-image model according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a cross-attention layer model according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of the model structure of a self-attention module adaptation layer according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of the model structure of a feedforward network module adaptation layer according to an embodiment of the present invention;
FIG. 9 is a schematic diagram of the model structure of a cross-attention module adaptation layer according to an embodiment of the present invention;
FIG. 10 is a schematic structural diagram of an image generation apparatus according to an embodiment of the present invention;
FIG. 11 is a schematic structural diagram of an image generation device according to an embodiment of the present invention.
Detailed Description
The core of the present invention is to provide an image generation method, apparatus, and device, and a computer-readable storage medium, for improving the quality of the images generated by a text-to-image model and improving user experience.
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.
The following describes the first embodiment of the present invention.
For ease of understanding, the system architecture to which the present invention applies is first described. The image generation method provided by the embodiments of the present invention can be applied to computing devices equipped with accelerators, as well as to accelerator clusters and heterogeneous accelerator clusters. The accelerator may be, but is not limited to, a graphics processing unit (Graphics Processing Unit, GPU) or a field programmable gate array (Field Programmable Gate Array, FPGA).
On the basis of the above architecture, an image generating method according to an embodiment of the present invention is described below with reference to the accompanying drawings.
The second embodiment of the present invention will be described below.
Fig. 1 is a flowchart of an image generating method according to an embodiment of the present invention.
As shown in FIG. 1, the image generation method provided by the embodiments of the present invention includes:
S101: acquiring a first sample text, an initial text-to-image model, and an image-to-text model.
S102: starting from the initial text-to-image model, for the intermediate text-to-image model in each round of iterative training, inputting the first sample text into the intermediate text-to-image model and outputting a first intermediate image; inputting the first intermediate image into the image-to-text model and outputting a first predicted text; constructing a reinforcement learning reward function from the text similarity score between the first predicted text and the first sample text; and updating the model parameters of the intermediate text-to-image model using the reinforcement learning reward function until the iterative training ends, to obtain the text-to-image model.
S103: inputting the text to be processed into the text-to-image model, and outputting a result image.
In a specific implementation, the image generation method provided by the embodiments of the present invention is suitable for artificial intelligence drawing. The acquired text to be processed may be text input by a user, or text converted from the user's speech by speech recognition; the text-to-image model converts the text to be processed from the text modality to the image modality. When training the text-to-image model, the embodiments of the present invention can perform optimization training on the basis of the initial text-to-image model using the first sample text and the image-to-text model, without relying on sample images.
The text-to-image model adopted by the embodiments of the present invention may be a text-to-image diffusion model. The principle of the text-to-image diffusion model is as follows: given an image $x_0$, the diffusion model first generates a series of Markov-chain hidden vectors $x_1, x_2, \ldots, x_T$ by progressively adding Gaussian noise to the original image $x_0$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\right)$$

where $q(x_t \mid x_{t-1})$ is the formula for progressively adding noise from $x_{t-1}$ to $x_t$; $t$ is the time step; and $\beta_t$, the variance used at each step, lies between 0 and 1. For the text-to-image diffusion model, the per-step variances are set by a variance schedule (noise schedule); in general, later steps adopt larger variances, i.e., $\beta_1 < \beta_2 < \cdots < \beta_T$. Under a well-designed variance schedule, if the number of diffusion steps is sufficiently large, the final $x_T$ completely loses the original data and becomes random noise.
Based on the hidden vector $x_T$, the denoising process is given by:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\!\left(x_{t-1};\ \mu_\theta(x_t, t),\ \Sigma_\theta(x_t, t)\right)$$

where $p_\theta(x_{t-1} \mid x_t)$ is the formula for progressively denoising from $x_t$ to $x_{t-1}$, and $\mu_\theta$ and $\Sigma_\theta$ are the mean and covariance learned by the neural network.
Specifically, during image generation, the initial image $x_T$ is randomly sampled from Gaussian noise $\mathcal{N}(0, \mathbf{I})$, and the noise is gradually reduced to obtain the real image $x_0$. In practice, the deviation of the predicted noise, rather than the real image, is usually adopted as the optimization objective:

$$L_{\text{simple}} = \mathbb{E}_{x_0 \sim q(x_0),\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}\left[\left\|\epsilon - \epsilon_\theta(x_t, t, c)\right\|^2\right]$$

where $\epsilon$ is randomly sampled from the Gaussian noise $\mathcal{N}(0,\mathbf{I})$; $\epsilon_\theta(x_t, t, c)$ is the noise deviation of the noisy image at time step $t$ inferred by the neural network under the input condition $c$; and $\theta$ denotes the parameters learned by the neural network, where $c$ represents the conditional input, which in the embodiments of the present invention is the text description. Here $x_0 \sim q(x_0)$ indicates that the initial image $x_0$ follows the data distribution $q(x_0)$, $\epsilon \sim \mathcal{N}(0,\mathbf{I})$ indicates that the added noise follows a Gaussian distribution, and $t$ is the time step.
In the embodiments of the present invention, the loss function adopted by the text-to-image diffusion model combines the deviation-prediction term with a regularization term over the currently learnable parameters; the overall loss function is:

$$L = \mathbb{E}_{x_0, \epsilon, t}\left[\left\|\epsilon - \epsilon_\theta(x_t, t, c)\right\|^2\right] + \lambda\,\big\|\theta_{\text{learnable}}\big\|^2$$

where the latter term, taken over the currently learnable parameters $\theta_{\text{learnable}}$, serves as a regularization term to reduce the risk of overfitting, especially during model self-evolution training.
Iterative training of the text-to-image model is performed starting from the initial text-to-image model. In each round of iterative training, the first sample text is input into the intermediate text-to-image model and a first intermediate image is output; the first intermediate image is input into the image-to-text model and a first predicted text is output; the text similarity score between the first predicted text and the first sample text is calculated; and the reinforcement learning reward function is constructed from the text similarity score.
In each round of iterative training, the intermediate text-to-image model may employ N random seeds to generate N different first intermediate images, where N may be 32.
The iterative training of the text-to-image model ends when the preset number of iterations is reached, or when the text similarity scores between the first predicted texts and the first sample text reach a text similarity threshold; the model output at that point is the optimized text-to-image model. The text-to-image model then performs text-to-image processing on the text to be processed input by the user and outputs the result image corresponding to the text to be processed; the result image is closer to the text description of the text to be processed and better meets the user's requirements.
According to the image generation method provided by the embodiments of the present invention, for the intermediate text-to-image model in each round of iterative training, the first sample text is input into the intermediate text-to-image model and a first intermediate image is output; the first intermediate image is input into the image-to-text model and a first predicted text is output; a reinforcement learning reward function is constructed from the text similarity score between the first predicted text and the first sample text; and the model parameters of the intermediate text-to-image model are updated using the reinforcement learning reward function until the iterative training ends, yielding the text-to-image model. A sketch of one possible instantiation of this training loop follows.
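The text does not fix a particular reinforcement learning algorithm. A policy-gradient style sketch of the outer loop, with `generate_with_logprobs` as a hypothetical wrapper exposing generated images together with differentiable log-probabilities of their sampling trajectories, might look as follows:

```python
def rl_finetune_step(t2i_model, i2t_model, reward_fn, optimizer,
                     sample_texts, n_seeds=32):
    total_loss = 0.0
    for text in sample_texts:
        # N random seeds yield N different first intermediate images.
        images, logps = t2i_model.generate_with_logprobs(text, n_seeds)
        for image, logp in zip(images, logps):
            predicted_text = i2t_model(image)        # first predicted text
            reward = reward_fn(predicted_text, text, image)
            total_loss = total_loss - reward * logp  # reward-weighted update
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
```

Only the adapter parameters registered with `optimizer` would be updated, consistent with the parameter-freezing and quantization strategies described above.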
The following describes a third embodiment of the present invention.
FIG. 2 is a schematic diagram of fine-grained evaluation scoring according to an embodiment of the present invention; FIG. 3 is a flowchart of overall evaluation scoring according to an embodiment of the present invention.
At present, general model evaluation indexes such as the Contrastive Language-Image Pre-training (CLIP) score are difficult to use for evaluating the matching degree between the real text description and the generated image. Compared with the traditional scoring approach that uses plain text similarity as its basis, the embodiments of the present invention can therefore improve the accuracy of scoring the intermediate images.
On the basis of the above embodiments, the present invention provides a way of generating the text similarity score. In the embodiments of the present invention, one or more first intermediate images may be generated from the first sample text using the intermediate text-to-image model, and for each first intermediate image, one or more first predicted texts may be generated using the image-to-text model. The embodiments of the present invention are described with reference to one of the first predicted texts.
In the embodiments of the present invention, in order to improve the accuracy of the scoring evaluation, the fine-grained composition of the generated image is analyzed from multiple angles: detection, segmentation, recognition, and text generation. The inputting the first intermediate image into the image-to-text model and outputting a first predicted text, and the constructing a reinforcement learning reward function from the text similarity score between the first predicted text and the first sample text, include:
inputting the first intermediate image into the image-to-text model for entity detection, to obtain the entities corresponding to the first intermediate image;
generating a first predicted text using the entities corresponding to the first intermediate image, and calculating the similarity between the first predicted text and the first sample text to obtain a first text similarity score;
detecting and segmenting the first intermediate image according to the entities corresponding to the first intermediate image, to obtain the entity masks corresponding to the first intermediate image;
calculating the scene similarity between the entity masks and the first intermediate image, to obtain a scene similarity score;
calculating a second text similarity score from the first text similarity score and the scene similarity score;
constructing the reinforcement learning reward function from the second text similarity score;
wherein the entity types include at least one of an object, a relationship, an attribute, and a behavior.
In a specific implementation, an embodiment of the present invention provides a fine-grained text similarity scoring method. As shown in FIG. 2, the first sample text generates images through the text-to-image model. Each image first undergoes entity marking by the recognition-and-marking model, yielding n entity tags, where n represents the number of tags. The text generation model may then be used to generate a first predicted text from the entity tags, and the similarity between the first predicted text and the first sample text is calculated to obtain the first text similarity score.
The entity tag list and the first intermediate image are input into the detection-and-segmentation model, which obtains the entity mask of each entity on the first intermediate image. The detection-and-segmentation model also serves to screen the entities: errors may exist in the entity tags identified by the recognition-and-marking model, and only the entity tags that pass the screening of the detection-and-segmentation model actually exist in the generated image. Therefore, after detection and segmentation, the screened entity masks and the corresponding screened entity tags are obtained.
Generating the first predicted text using the entities corresponding to the first intermediate image may include: generating the first predicted text using the screened entity tags obtained by detecting and segmenting the first intermediate image according to the entities corresponding to the first intermediate image. When the first predicted text is generated, the recognition-and-marking model may be used directly for entity marking and the resulting entity tags used for generation, or the screened entity tags may be used for generation.
At this point, two evaluable detection results of the first intermediate image are obtained: one is the first predicted text generated from the entity tags obtained by entity marking with the recognition-and-marking model (or from the screened entity tags), and the other is the screened entity masks obtained by detection and segmentation.
Scoring of these two detection results can then be completed using an artificial intelligence large model.
Calculating the similarity between the first predicted text and the first sample text to obtain a first text similarity score may include: generating a first scoring task from the similarity between the first predicted text and the first sample text, and inputting the first scoring task into the artificial intelligence large model to obtain the first text similarity score.
Calculating the scene similarity between the entity masks and the first intermediate image to obtain a scene similarity score may include: generating a second scoring task from the scene similarity between the entity masks and the first intermediate image, and inputting the second scoring task into the artificial intelligence large model to obtain the scene similarity score.
The artificial intelligence large model may adopt the GPT-4 API; specifically, the GPT-4 API may be called with a prompt word set for each of the two detection results.
For the first scoring task, the text description prompt may be: "Suppose you are a very strict text similarity scoring model. Please score the following two sentences, with a scoring interval between 0 and 1."
For the second scoring task, the entity mask prompt word may be: "Please evaluate the described scene based on the provided prompt, and determine the likelihood that each tag appears in the scene."
The first text similarity score and the scene similarity score returned by the GPT-4 API are added and the sum is normalized to [0, 1], thereby obtaining the fine-grained evaluation score of the first intermediate image, as sketched below.
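A minimal sketch of this scoring step, assuming `llm` is a callable that sends a prompt string to a large model (such as the GPT-4 API) and returns its text reply, and that the reply can be parsed directly as a number:

```python
def text_similarity_score(llm, predicted_text: str, sample_text: str) -> float:
    # Build the first scoring task and query the large model for a score.
    prompt = (
        "Suppose you are a very strict text similarity scoring model. "
        "Please score the following two sentences, with a scoring interval "
        f"between 0 and 1.\nSentence 1: {predicted_text}\nSentence 2: {sample_text}"
    )
    return float(llm(prompt).strip())

def fine_grained_score(first_score: float, scene_score: float) -> float:
    # Add the two scores and normalize the sum back to [0, 1].
    return (first_score + scene_score) / 2.0
```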
In order to further improve the accuracy of the text similarity scoring, the embodiments of the present invention also provide a text similarity score that incorporates the overall text similarity score given by a multi-modal conversion model, i.e., scoring evaluation indexes of two dimensions: the overall evaluation index and the fine-grained evaluation index. Constructing the reinforcement learning reward function from the second text similarity score may include:
inputting the first intermediate image into a multi-modal conversion model to obtain a second predicted text;
calculating the similarity between the second predicted text and the first sample text to obtain a third text similarity score;
and constructing the reinforcement learning reward function from the second text similarity score and the third text similarity score.
In a specific implementation, as shown in FIG. 3, the multi-modal conversion model may adopt a model such as BLIP-2. The first sample text is passed through the intermediate text-to-image model to output a plurality of first intermediate images, and for each first intermediate image, n second predicted texts (second predicted text 1, second predicted text 2, ..., second predicted text n) are generated using the multi-modal conversion model, where n may be 5.
The third text similarity score may also be obtained using the artificial intelligence large model. Calculating the similarity between the second predicted text and the first sample text to obtain a third text similarity score may include: generating a third scoring task from the similarity between the second predicted text and the first sample text, and inputting the third scoring task into the artificial intelligence large model to obtain the third text similarity score. The text description "Now, suppose you are a very strict text similarity scoring model. Please score the following two sentences, with a scoring interval between 0 and 1." can be input into the artificial intelligence large model, causing it to output the text similarity score between the second predicted text and the first sample text. For the n second predicted texts, n third text similarity scores are generated.
After the GPT-4 API is called, n score values are obtained, and their average can be calculated to obtain the overall score of the first intermediate image, denoted $s_g^i$, where the superscript $i$ denotes the $i$-th random seed and the subscript $g$ denotes the overall (global) score.
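Reusing the `text_similarity_score` helper sketched above, the overall score of one first intermediate image can be computed as follows; `i2t_captioner` is an assumed wrapper around the multi-modal conversion model (e.g. BLIP-2) that returns the n second predicted texts:

```python
def overall_score(llm, i2t_captioner, intermediate_image, sample_text,
                  n: int = 5) -> float:
    # n second predicted texts from the multi-modal conversion model.
    captions = i2t_captioner(intermediate_image, num_captions=n)
    scores = [text_similarity_score(llm, c, sample_text) for c in captions]
    return sum(scores) / len(scores)   # average of the n third similarity scores
```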
The fourth embodiment of the present invention will be described below.
FIG. 4 is a schematic diagram of an end-to-end recognition, detection, segmentation, and text generation model application according to an embodiment of the present invention.
In the above embodiments, the first predicted text is generated from the first intermediate image using the image-to-text model, which may adopt an existing model. In order to improve the accuracy of the modality conversion performed by the image-to-text model, and thereby further improve the accuracy of scoring the first intermediate image, the embodiments of the present invention provide an inference principle and a corresponding training method for the image-to-text model.
In executing the scoring flow provided by the third embodiment of the present invention, the models involved include an image recognition model, a recognition-and-marking model, a detection-and-segmentation model, and a text generation model, where the detection-and-segmentation model may adopt an existing model. In order to improve inference efficiency, the embodiments of the present invention innovatively unify the three modules of the image recognition model, the recognition-and-marking model, and the text generation model into one network structure; the overall network structure is shown in FIG. 4, where the solid lines represent the training flow and the broken lines represent the inference flow. The flow adopts a mainstream backbone image feature extraction network and three sub-module branch networks. Each image only needs its image features extracted once, and the three sub-module branch networks have few computation parameters, so the classification, detection-segmentation, and image description functions can be completed at the same time. The classification-recognition, detection-segmentation, and image description tasks interact with one another, improving model robustness; in practical applications, this design balances functionality and efficiency.
In the image-to-text model provided by the embodiments of the present invention, an image feature encoder such as a ViT-G model may be adopted as the image recognition model, and the detection-and-segmentation model may adopt the SAM segmentation model.
In the image generation method provided by the embodiments of the present invention, as shown by the broken lines in FIG. 4, the inputting the first intermediate image into the image-to-text model for entity detection to obtain the entities corresponding to the first intermediate image includes: inputting the first intermediate image into the image recognition model in the image-to-text model, to obtain the image features of the first intermediate image; identifying the entities contained in the first intermediate image from the image features of the first intermediate image, using the recognition-and-marking model in the image-to-text model; and marking each entity using the recognition-and-marking model, to generate the entity text tag list corresponding to the first intermediate image.
Generating the first predicted text using the entities corresponding to the first intermediate image includes: generating the first predicted text from the entity text tags in the entity text tag list, using the text generation model in the image-to-text model.
If both the fine-grained evaluation index and the overall evaluation index provided by the embodiments of the present invention are adopted, the entity masks need to be screened when the fine-grained evaluation index is generated, and the detection-and-segmentation model is applied to obtain the screened entity masks.
The input picture passes through the large-scale image feature encoder to obtain the encoded image features, and subsequent operations do not need to extract the features repeatedly. The image features then pass through three subtask heads: the text generation model, the recognition-and-marking model, and the detection-and-segmentation model.
The training process of the image-to-text model, as shown by the solid lines in FIG. 4, may include:
acquiring image-text pairs and an initial image-to-text model;
starting from the initial image-to-text model, in each round of iterative training: inputting the sample image of an image-text pair into the image recognition model in the intermediate image-to-text model, to obtain the image features of the sample image; identifying the intermediate entities contained in the sample image from the image features of the sample image, using the recognition-and-marking model in the intermediate image-to-text model; marking the intermediate entities to generate the intermediate entity text tag list corresponding to the sample image; generating a third predicted text from the intermediate entity text tags in the intermediate entity text tag list; and updating the model parameters of the intermediate image-to-text model using the text difference degree between the second sample text corresponding to the sample image and the third predicted text, until the iterative training ends, to obtain the image-to-text model.
The image encoder and the detection-and-segmentation model load the model weights of the SAM segmentation model and of an open-set detection algorithm (Grounding-DINO); their frozen model parameters are not updated, so any open-domain detection-and-segmentation model may be adopted.
Inputting a sample image in the image text pair into an image recognition model in the intermediate image text model to obtain image characteristics of the sample image, recognizing intermediate entities contained in the sample image according to the image characteristics of the sample image by using a recognition marking model in the intermediate image text model, marking the intermediate entities to generate an intermediate entity text label list corresponding to the sample image, generating a third predicted text according to the intermediate entity text labels in the intermediate entity text label list, and updating model parameters of the intermediate image text model by using the text difference degree of the second sample text corresponding to the sample image and the third predicted text, wherein the method comprises the following steps:
taking the image feature encoder with frozen parameters as the image recognition model, and inputting the sample image into the image feature encoder to obtain the image feature codes of the sample image; extracting the intermediate entity text labels in the second sample text with a contrastive language-image pre-training (CLIP) text encoder;

assigning the feature weights of the intermediate entity text labels to a learnable text query parameter;

inputting the assigned learnable text query parameters and the image feature codes into the recognition-marking model to obtain a predicted entity feature vector and an analysis tag, and training the recognition-marking model with a cross-entropy loss over the predicted entity feature vector and the analysis tag, taking the intermediate entity text labels as ground truth;

and inputting the entity tags and the image feature codes into the text generation model to output text analysis features, taking the second sample text as ground truth, and training the text generation model with a cross-entropy loss over the text analysis features and the second sample text (a training sketch follows these steps).
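The following is a minimal sketch of one such training step in PyTorch, assuming frozen CLIP-style encoders, multi-hot tag targets parsed from the second sample text, and illustrative head interfaces; a binary cross-entropy over the tag vocabulary stands in for the tag branch's cross-entropy loss.

```python
import torch
import torch.nn.functional as F

def train_step(image, caption_tokens, tag_targets, tag_text_emb, *,
               image_encoder, tag_head, caption_head, text_queries, optimizer):
    """One step; caption_tokens is the tokenized second sample text (B, L),
    tag_targets a multi-hot vector over the tag vocabulary (B, num_tags)."""
    with torch.no_grad():                       # encoders stay frozen
        feats = image_encoder(image)            # image feature codes
    text_queries.data.copy_(tag_text_emb)       # assign label features to queries

    tag_logits, entity_vec = tag_head(text_queries, feats)
    tag_loss = F.binary_cross_entropy_with_logits(tag_logits, tag_targets)

    caption_logits = caption_head(entity_vec, feats)          # (B, L, vocab)
    cap_loss = F.cross_entropy(caption_logits.flatten(0, 1),  # second sample text
                               caption_tokens.flatten())      # as ground truth

    optimizer.zero_grad()
    (tag_loss + cap_loss).backward()
    optimizer.step()
    return tag_loss.item(), cap_loss.item()
```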
The recognition-marking model may be a Transformer-based decoder architecture (a two-layer decoder is used in the embodiment of the present invention).
When training the text-to-image model, the inference process of the recognition-marking model is to input the image features of the first intermediate image obtained by the image recognition model, together with the learned text query parameters, into the recognition-marking model, and to output the entity marking labels corresponding to the first intermediate image.

As shown in fig. 4, for example, the entity tags "building", "snow", "walk", "man", "Christmas", and "square" extracted from the second sample text are input into the recognition-marking model, which then marks the entities identified in the first intermediate image with these entity labels.
The text generation model may employ a 12-layer encoder-decoder architecture. During training, the recognition marking tags parsed from the second sample text and the image feature codes are input into the text generation model, which outputs text analysis features. With the second sample text as the ground truth, the text generation model is optimized with a cross-entropy loss. During inference, the entity tags inferred by the recognition-marking model and the image feature codes are input into the trained text generation model, which infers the third predicted text.
Finally, the overall evaluation index and the fine-grained evaluation index yield the text similarity score of the first intermediate image under each random seed. This score accounts both for the overall semantics of the generated image and for adequate control over its details, and it is more accurate than a Contrastive Language-Image Pre-training (CLIP) score or a Fréchet Inception Distance (FID) score.
To improve inference efficiency when training the text-to-image model, the extraction of image features with the image recognition model, the generation of entity text label lists with the recognition-marking model, and the generation of first predicted texts from those lists with the text generation model are performed in parallel over the different first intermediate images generated in one training iteration, as sketched below.
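A minimal sketch of this parallel evaluation, assuming the image-to-text model interface sketched earlier: the first intermediate images of one iteration are stacked into a single batch so the encoder and the three heads process all of them in one pass.

```python
import torch

@torch.no_grad()
def evaluate_in_parallel(intermediate_images, image_to_text_model):
    batch = torch.stack(intermediate_images)         # (N, C, H, W), one per seed
    tags, texts, masks = image_to_text_model(batch)  # features extracted once,
    return tags, texts, masks                        # heads applied to all N together
```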
The fifth embodiment of the present invention will be described below.
FIG. 5 is a schematic structural diagram of a text-to-image model according to an embodiment of the present invention; FIG. 6 is a schematic diagram of a cross-attention layer model according to an embodiment of the present invention; fig. 7 is a schematic diagram of a model structure of a self-attention module adaptation layer according to an embodiment of the present invention; FIG. 8 is a schematic diagram of a model structure of a feed-forward network module adaptation layer according to an embodiment of the present invention; fig. 9 is a schematic diagram of a model structure of a cross-attention module adaptation layer according to an embodiment of the present invention.
On the basis of the model evaluation and scoring scheme provided by the above embodiments, the training process of the text-to-image model is described further.
First, the process of generating the reinforcement learning reward function is described. In an embodiment of the present invention, constructing the reinforcement learning reward function according to the second text similarity score and the third text similarity score includes:

sorting the second text similarity scores corresponding to the first intermediate images to obtain a second text similarity score matrix corresponding to the first sample text;
sorting the third text similarity scores corresponding to the first intermediate images to obtain a third text similarity score matrix corresponding to the first sample text;
And taking the sum of the second text similarity scoring matrix and the third text similarity scoring matrix as a reinforcement learning reward function.
For the first intermediate images generated in the current training iteration of the text-to-image model, the second text similarity score result $S^{fine}=\{s^{fine}_t\}_{t=1}^{N}$ (i.e., the fine-grained scoring index) can be obtained, where $N$ is the number of random seeds, $t$ is the index, and $x_t$ is the first intermediate image generated from the $t$-th random seed. Meanwhile, the third text similarity score result $S^{overall}=\{s^{overall}_t\}_{t=1}^{N}$ (i.e., the overall scoring index) corresponding to those first intermediate images can be obtained. Sorting $S^{fine}$ and $S^{overall}$ respectively gives $\tilde{S}^{fine}$ and $\tilde{S}^{overall}$.

The final reinforcement learning reward function is:

$$R = \tilde{S}^{fine} + \tilde{S}^{overall}.$$
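A small sketch of this reward computation follows; the wording above leaves the exact matrix construction open, so the sketch reads it as a per-seed combination of the two score vectors followed by a ranking used to pick the best samples (an assumption, not the only possible reading).

```python
import torch

def reward(fine_scores: torch.Tensor, overall_scores: torch.Tensor):
    """fine_scores, overall_scores: shape (N,), one score per random seed."""
    r = fine_scores + overall_scores           # combined reward per seed
    order = torch.argsort(r, descending=True)  # ranking of the N seeds
    return r, order                            # order[:k] selects the best k
```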
After the reinforcement learning reward function is determined, the optimal generated images can be selected according to the ranking, which allows better-quality training samples to be chosen. In the embodiment of the invention, the number of high-quality training samples can be set to be smaller than the number of generated first intermediate images; that is, the top-ranked first intermediate images are selected as the optimal training samples for the next training iteration.
In S102, for the intermediate text-to-image model in each training iteration, inputting the first sample text into the intermediate text-to-image model, outputting the first intermediate image, inputting the first intermediate image into the image-to-text model, outputting the first predicted text, constructing the reinforcement learning reward function according to the text similarity score of the first predicted text and the first sample text, and updating the model parameters of the intermediate text-to-image model with the reinforcement learning reward function may include:

sampling from the first sample set to obtain the first batch of training data for the current training iteration;
inputting, for each first sample text in the first batch of training data, the first sample text into the intermediate text-to-image model and outputting the first intermediate image set corresponding to that first sample text;

inputting each first intermediate image in the first intermediate image set into the image-to-text model, outputting the first predicted texts, and constructing the reinforcement learning reward function corresponding to the first sample text according to the text similarity scores between the first predicted texts corresponding to the first intermediate image set and the first sample text;

and updating the model parameters of the intermediate text-to-image model with the reinforcement learning reward functions corresponding to the first sample texts in the first batch of training data.
Specifically, a batch $D=\{x_i\}_{i=1}^{b}$ of first sample texts is sampled from the first sample set, where $b$ is the batch size.

For each sample $x$ in $D$, the text-to-image model $p_\theta$ generates a sample set $Y=\{y_t\}_{t=1}^{N}$, where $\alpha$ is a temperature coefficient controlling the diversity of the generated images and each generated sample $y$ is regarded as drawn from a conditional distribution on $x$.

For each sample in the set $Y$, a score value is computed according to the reinforcement learning reward function, giving the set $R$.

Ranking the set $R$ yields the high-quality training samples $Y^{*}$ in $Y$.

The model $p_\theta$ is trained with the samples $Y^{*}$ to obtain $p_{\theta'}$.

The above steps are repeated $T$ times until the final text-to-image model is obtained; $T$ may be set to 1000.
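The loop can be condensed as below, with generate, score, and finetune passed in as callables wrapping the text-to-image model, the reward above, and the adapter update; all names and default values other than T are illustrative assumptions.

```python
import random

def self_evolve(model, texts, generate, score, finetune,
                T=1000, N=8, k=2, b=4, alpha=1.0):
    for _ in range(T):                              # T self-evolution rounds
        for x in random.sample(texts, b):           # first batch of training data
            ys = generate(model, x, n=N, temperature=alpha)  # one image per seed
            rewards, order = score(ys, x)           # reinforcement learning reward
            best = [ys[int(i)] for i in order[:k]]  # top-k high-quality samples
            finetune(model, x, best)                # update only adapter weights
    return model
```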
To improve training efficiency and make full use of the computing resources of the device, sampling from the first sample set to obtain the first batch of training data for the current training iteration may include: while the model parameters of the intermediate text-to-image model are being updated based on the first batch of training data in the current training iteration, executing in parallel the sampling step that obtains the first batch of training data for the next training iteration.
As shown in FIG. 5, the initial text-to-image model selected for use in embodiments of the present invention may be composed of a text encoder, an image self-decoder, cross-attention downsampling layers, cross-attention upsampling layers, and cross-attention layers;

the cross-attention downsampling layers, cross-attention upsampling layers, and cross-attention layers are each provided with an adaptation layer arranged on the fully connected layers of their cross-attention modules;

the adaptation layer comprises an R-downsampling fully connected module and an R-upsampling fully connected module, where the R-downsampling fully connected module reduces the input feature channel to R dimensions and the R-upsampling fully connected module restores the R-dimensional features to the original input feature dimension.
In the text-to-image backbone model shown in fig. 5, each cross-attention downsampling layer downsamples the feature map by a factor of two, and each cross-attention upsampling layer upsamples the features by a factor of two.

The network structure of each cross-attention layer may be as shown in fig. 6; that is, the cross-attention downsampling layer, the cross-attention upsampling layer, and the cross-attention layer may each be composed of a residual module, a self-attention module, a feed-forward network module, and a cross-attention module. The residual module may be a basic network module of a residual network (ResNet). The self-attention module performs self-feature learning and updating with the image features serving as both the query and the key-value; the cross-attention module performs it with the image features as the query and the text vectors as the key-value.

As shown in fig. 7, the self-attention module is provided with a self-attention module adaptation layer; as shown in fig. 8, the feed-forward network module is provided with a feed-forward network module adaptation layer; and as shown in fig. 9, the cross-attention module is provided with a cross-attention module adaptation layer. Each of these adaptation layers comprises an R-downsampling fully connected module and an R-upsampling fully connected module arranged on the fully connected layers (the Q, K, and V linear layers), where the R-downsampling fully connected module reduces the input feature channel to R dimensions and the R-upsampling fully connected module restores the R-dimensional features to the original input feature dimension. R may be 4.
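A minimal PyTorch sketch of such an adaptation layer on a frozen linear projection, assuming the usual low-rank-adapter construction; the class name is illustrative.

```python
import torch
import torch.nn as nn

class AdaptedLinear(nn.Module):
    """A frozen base projection (e.g. a Q, K, or V linear layer) plus an
    R-downsampling and R-upsampling fully connected pair (R = 4 by default)."""
    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)                 # frozen weights
        self.down = nn.Linear(base.in_features, r, bias=False) # reduce to R dims
        self.up = nn.Linear(r, base.out_features, bias=False)  # restore dims
        nn.init.zeros_(self.up.weight)       # adapter starts with no effect

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x))  # low-rank residual update
```

Only self.down and self.up carry gradients here, which matches the freezing strategy described next.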
To further improve model training efficiency, updating the model parameters of the intermediate text-to-image model with the reinforcement learning reward function may include: freezing the model parameters of all networks in the intermediate text-to-image model other than the R-downsampling and R-upsampling fully connected modules, and updating only the model parameters of those two modules with the reinforcement learning reward function. That is, in training the text-to-image model, only the parameters of the R-upsampling and R-downsampling fully connected modules are fine-tuned, effectively reducing the resources required for model training.
The image generation method provided by the embodiment of the invention thus offers a self-evolving text-to-image model update scheme that, combined with the low-rank decomposition and adapter optimization module, manages computing and memory resources effectively during both data sampling and training.
The sixth embodiment of the present invention will be described.
The above embodiment provides low-rank decomposition with adapter fine-tuning, reducing the computing and memory resources required to train the text-to-image model. However, although the adapter fine-tuning parameters occupy only about two-thousandths of the whole model, as the scale of text-to-image models grows, further reducing their demand on hardware resources remains a problem to be solved. To further reduce video memory consumption, the embodiment of the invention provides a model optimization scheme based on quantization.
In an embodiment of the present invention, inputting the first sample text into the intermediate text-to-image model in S102 and outputting the first intermediate image may include:

quantizing the model parameters of the intermediate text-to-image model to obtain a quantized intermediate text-to-image model;

inputting the first sample text into the quantized intermediate text-to-image model and outputting the first intermediate image;

and updating the model parameters of the intermediate text-to-image model with the reinforcement learning reward function comprises:

performing, when the model parameters of the intermediate text-to-image model are updated with the reinforcement learning reward function and back-propagation is computed, inverse quantization on the model parameters of the intermediate text-to-image model to obtain the updated intermediate text-to-image model.
In a specific implementation, the embodiment of the invention may adopt four-bit standard floating point number (NormalFloat) quantization, double quantization, and a paged optimizer optimization scheme for the text-to-image model.
The four-bit standard floating point number is an information-theoretically optimal data type obtained through quantile quantization. The model parameters and the data type's quantile levels are normalized to [-1, 1]. In the embodiment of the invention, the original model parameters and data type are FP16 and are mapped to 4 bits by the four-bit standard floating point quantization operation.
Quantizing the model parameters of the intermediate text-to-image model to obtain the quantized intermediate text-to-image model may include:

performing four-bit standard floating point quantization on the model parameters and data-type quantile levels of the intermediate text-to-image model so as to normalize them to [-1, 1];

performing double quantization on the four-bit-quantized model parameters and the original 32-bit floating point quantization constants in the intermediate text-to-image model, converting the quantization constants of the intermediate text-to-image model into quantized quantization constants c2 and second-level quantization constants c1, to obtain the quantized intermediate text-to-image model;

and handling the memory allocation of the intermediate text-to-image model with a paged optimizer while the quantization of its model parameters is executed.
Specifically, the $2^k + 1$ quantiles of a theoretical $N(0,1)$ distribution are estimated to obtain a $k$-bit quantile-quantization data type for normally distributed data.

The data type is normalized into $[-1, 1]$, giving the four-bit standard floating point data type.

The input weight tensor is normalized to the $[-1, 1]$ range by absolute-maximum rescaling; rescaling the standard deviation of the weight tensor matches it to the standard deviation of the four-bit data type. The quantization values $q_i$ of the data type are estimated as

$$q_i = \frac{1}{2}\left(Q_X\!\left(\frac{i}{2^k+1}\right) + Q_X\!\left(\frac{i+1}{2^k+1}\right)\right),$$

where $Q_X(\cdot)$ is the quantile function of the standard normal distribution $N(0,1)$.
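A short sketch of estimating these quantile levels, assuming SciPy is available; the endpoint clamping is an implementation assumption, and the exact four-bit data type additionally reserves an exact zero value, which this simplified symmetric version does not.

```python
import numpy as np
from scipy.stats import norm

def normal_quantile_levels(k: int = 4) -> np.ndarray:
    """Approximate the 2**k quantization values q_i of a k-bit quantile
    data type for N(0, 1), rescaled into [-1, 1]."""
    n = 2 ** k
    p = np.linspace(0.0, 1.0, n + 1)[1:-1]       # interior quantile positions
    p = np.concatenate(([1e-4], p, [1 - 1e-4]))  # clamp the infinite endpoints
    q = norm.ppf(p)                              # quantile function Q_X
    levels = (q[:-1] + q[1:]) / 2                # average neighbouring quantiles
    return levels / np.abs(levels).max()         # normalize into [-1, 1]
```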
Double quantization takes the FP32 quantization constants of the four-bit standard floating point quantization as the input to a second quantization. This second step produces the quantized quantization constants c2 and the second-level quantization constant c1. An 8-bit floating point number with block size 256 is used for the second quantization, since 8-bit quantization causes no performance degradation.
The paged optimizer uses the NVIDIA unified-memory feature, which automatically performs page transfers between the central processing unit (CPU, Central Processing Unit) and the graphics processing unit (GPU, Graphics Processing Unit) when the GPU occasionally runs out of memory, so GPU processing proceeds without errors. The feature is analogous to conventional memory paging between CPU random access memory (RAM, Random Access Memory) and disk. It is used here to allocate paged memory for the optimizer states: when the GPU runs out of memory, the states are automatically moved to CPU RAM, and they are paged back into GPU memory during the optimizer update step.
The seventh embodiment of the present invention will be described.
After the low-rank-decomposition self-evolution training process provided by the above embodiments, the embodiment of the invention can obtain an evolved text-to-image model without consuming much additional compute, one that generates image content whose semantics match the text description. In practical inference applications, interactive feedback from the user side is also an important link: the user is sometimes unsatisfied with the result generated by the text-to-image model and gives further guiding feedback. A traditional text-to-image model can hardly take user feedback as guidance during inference, which greatly degrades the user experience. The embodiment of the invention provides a novel self-evolving inference algorithm for the text-to-image model based on multiple rounds of human evaluation feedback, enabling self-evolving inference driven by human feedback.
In the embodiment of the present invention, inputting the text to be processed into the text-to-image model and outputting the result image in S103 may include:

inputting the text to be processed into the text-to-image model and outputting a second intermediate image;

acquiring a feedback result of a user on the second intermediate image;

and regenerating, with the text-to-image model, a second intermediate image according to the feedback result and the text to be processed, until a second intermediate image meeting the feedback requirement is obtained, and outputting that second intermediate image as the result image.
In implementations, the self-attention mechanism in the text-to-image model can attend to the other pixels in the image, so human feedback evaluations of the generated images can be injected back into the self-attention module. The specific method of the embodiment of the invention is to inject the human-feedback image information into the Key and Value projections of the self-attention modules of the text-to-image model, storing the relevant feature information there. When the inference denoising operation is executed, the flow then carries the human-feedback image information and the corresponding semantic information, and the finally generated image is steered by the human-feedback images, improving user satisfaction with the result image; a sketch of this injection follows.
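A hedged PyTorch sketch of the injection, assuming stored feedback features h are simply concatenated with the current features z before the Key and Value projections (the projection matrices and head count are illustrative):

```python
import torch
import torch.nn.functional as F

def self_attention_with_feedback(z, h, wq, wk, wv, num_heads=8):
    """z: (B, Lz, D) current features; h: (B, Lh, D) stored feedback features."""
    zh = torch.cat([z, h], dim=1)          # feedback extends the K/V context
    q, k, v = z @ wq, zh @ wk, zh @ wv
    d = q.shape[-1] // num_heads
    def split(t):                          # (B, L, D) -> (B, heads, L, d)
        return t.view(t.shape[0], t.shape[1], num_heads, d).transpose(1, 2)
    out = F.scaled_dot_product_attention(split(q), split(k), split(v))
    return out.transpose(1, 2).reshape(z.shape)   # queries come from z only
```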
In the embodiment of the invention, inputting the text to be processed into the text-to-image model and outputting the second intermediate image may comprise:

inputting the text to be processed into the text-to-image model and outputting a plurality of second intermediate images;

obtaining a feedback result of the user on the second intermediate images, including:

listing the second intermediate images with positive feedback results in a positive feedback set, and listing the second intermediate images with negative feedback results in a negative feedback set;

and regenerating, with the text-to-image model, second intermediate images according to the feedback results and the text to be processed until a second intermediate image meeting the feedback requirement is acquired, and outputting it as the result image, which comprises the following steps:

generating feedback samples from the positive feedback set and the negative feedback set, and regenerating second intermediate images with the text-to-image model using the feedback samples and the text to be processed, until a second intermediate image meeting the feedback requirement is obtained and output as the result image.
In a specific implementation, the hyperparameters for testing are set first: $N$ denotes the number of human-feedback iterations, $n$ denotes the number of intermediate pictures produced by the model in each round, and $p_{\theta^{*}}$ denotes the optimized text-to-image model. In the algorithm, $N = 3$ and $n = 4$.
The inference process of the text-to-image model provided by the embodiment of the invention comprises a preparation flow and an inference denoising flow.
The preparation flow mainly obtains the text to be processed input by the user, generates second intermediate images with the text-to-image model $p_{\theta^{*}}$, and collects a positive feedback set $Y^{+}$ and a negative feedback set $Y^{-}$ after human feedback; this step provides the input to the inference denoising flow. The preparation flow specifically comprises the following steps (a condensed sketch follows the list):

acquiring a text description;

with the text description as the condition, generating $n$ second intermediate images by inference with the text-to-image model $p_{\theta^{*}}$;

obtaining human feedback on the $n$ generated second intermediate images and dividing them into the positive feedback set $Y^{+}$ and the negative feedback set $Y^{-}$, where $Y^{+}$ is the set of second intermediate images whose human feedback is positive and $Y^{-}$ is the set whose human feedback is negative; positive means the picture content matches the text description and the generated result meets the user's requirements, and negative means the opposite;

repeating the above steps $N$ times to obtain the final positive feedback set $Y^{+}$ and negative feedback set $Y^{-}$.
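A condensed sketch of this preparation flow, with generate and ask_human as assumed callables (the latter standing in for the human evaluation step):

```python
def prepare_feedback(model, prompt, generate, ask_human, N=3, n=4):
    positive, negative = [], []                    # Y+ and Y-
    for _ in range(N):                             # N rounds of human feedback
        for img in generate(model, prompt, n=n):   # n second intermediate images
            if ask_human(img, prompt):             # matches the text and looks good?
                positive.append(img)
            else:
                negative.append(img)
    return positive, negative
```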
As described in the above embodiments of the present invention, the text-to-image model may employ a text-to-image diffusion model.
In the inference denoising flow of the text-to-image diffusion model, generating feedback samples from the positive feedback set and the negative feedback set, and regenerating second intermediate images with the feedback samples and the text to be processed until a second intermediate image meeting the feedback requirement is obtained and output as the result image, comprises the following steps:
acquiring random Gaussian noise;
generating feedback noise-added vectors according to the positive feedback set and the negative feedback set;

inputting the feedback noise-added vectors, the current time-step variable, and the text to be processed into the text-to-image diffusion model to obtain feedback prediction variables;

dividing the feedback prediction variables into positive hidden vectors and negative hidden vectors according to the positive feedback set and the negative feedback set;

taking the features generated after passing through the residual network in each sub-network of the text-to-image diffusion model as extended hidden vectors, and recording the hidden sub-layer features corresponding to each layer of the extended hidden vectors in the text-to-image diffusion model, the hidden sub-layer features being divided into positive-sample and negative-sample hidden sub-layer features according to the positive feedback set and the negative feedback set;

updating the model weight vectors of the text-to-image diffusion model according to the model parameters in its self-attention modules and the features output by its residual modules, generating positive-sample denoising vectors from the positive-sample model weight vectors in the self-attention module based on the attention mechanism, and generating negative-sample denoising vectors from the negative-sample model weight vectors in the self-attention module based on the attention mechanism;

generating a current denoising vector according to the positive-sample denoising vector and the negative-sample denoising vector;

if the number of denoising steps of the text-to-image diffusion model has not reached the set number, performing Gaussian noise denoising with the current denoising vector and returning to the step of generating feedback noise-added vectors from the positive feedback set and the negative feedback set;

and if the number of denoising steps of the text-to-image diffusion model has reached the set number, taking the current denoising vector as the final denoising vector and inputting it into the decoder to obtain the result image.
In particular, random Gaussian noise $z_T \sim N(0, I)$ is obtained.

The positive feedback set and the negative feedback set are traversed, and for each second intermediate image a feedback noise-added vector is generated according to

$$z_t = \sqrt{\bar\alpha_t}\, y + \sqrt{1-\bar\alpha_t}\, \epsilon, \qquad y \in Y^{+} \cup Y^{-},\ \epsilon \sim N(0, I),$$

where $z_t$ is the feedback noise-added vector, $\bar\alpha_t$ is the vector adjusting the noise intensity according to the time-step variable, $y$ is the feedback sample, $Y^{+}$ is the positive feedback set, $Y^{-}$ is the negative feedback set, $\epsilon$ is a Gaussian random noise sample, $t$ is the time-step variable with $t \le T$, and $T$ is the set number of denoising steps of the text-to-image diffusion model;
obtaining the feedback prediction variables by inputting the feedback noise-added vectors, the current time-step variable, and the text to be processed into the text-to-image diffusion model, storing them in a feedback prediction variable list, and dividing that list into a positive hidden vector list and a negative hidden vector list according to the positive feedback set and the negative feedback set; storing in an extended hidden vector list the features generated after passing through the residual network in each sub-network of the text-to-image diffusion model;

recording the hidden sub-layer features corresponding to the $i$-th layer of the text-to-image diffusion model in the extended hidden vector list, the hidden sub-layer features being divided into positive-sample and negative-sample hidden sub-layer features according to the positive feedback set and the negative feedback set;

recording the model weights in the self-attention module of the $i$-th layer of the text-to-image diffusion model, the model weights comprising the model weights of the Q linear layer, the K linear layer, and the V linear layer in the self-attention module, together with the low-rank decomposition weights corresponding to each of them;
calculating the updated model weight vectors as follows:

$$Q = \left(W_Q + \Delta W_Q\right) z, \qquad K = \left(W_K + \Delta W_K\right)\left[z, h\right], \qquad V = \left(W_V + \Delta W_V\right)\left[z, h\right],$$

$$\epsilon^{+} = \mathrm{Attention}\!\left(Q^{+}, K^{+}, V^{+}\right), \qquad \epsilon^{-} = \mathrm{Attention}\!\left(Q^{-}, K^{-}, V^{-}\right),$$

where $W_Q$, $W_K$, and $W_V$ are the model weights of the Q, K, and V linear layers; $\Delta W_Q$, $\Delta W_K$, and $\Delta W_V$ are the model weights of the low-rank decompositions corresponding to the Q, K, and V linear layers; $Q$, $K$, and $V$ are the model weight vectors of the Q, K, and V linear layers; $z$ is the feature output after the $i$-th residual module of the text-to-image diffusion model; $[z, h]$ is the concatenation of the output features $z$ with the hidden sub-layer features $h$; $\epsilon^{+}$ is the positive-sample denoising vector and $\epsilon^{-}$ the negative-sample denoising vector, obtained from the positive-sample model weight vectors $Q^{+}, K^{+}, V^{+}$ and the negative-sample model weight vectors $Q^{-}, K^{-}, V^{-}$ of the Q, K, and V linear layers respectively; and the attention is computed over the attention heads of the attention mechanism with $\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\!\left(QK^{\top}/\sqrt{d}\right)V$, $d$ being the per-head dimension;
generating the current denoising vector $\epsilon_t$ from the positive-sample denoising vector $\epsilon^{+}$ and the negative-sample denoising vector $\epsilon^{-}$;

if the current time-step variable is not 0, using the denoising vector at time step $t-1$ and returning to traversing the positive feedback set and the negative feedback set to generate a feedback noise-added vector for each second intermediate image;

if the current time-step variable is 0, taking the current denoising vector as the final denoising vector and inputting it into the decoder to obtain the result image (a condensed sketch of this loop follows).
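The flow can be condensed as below, under the standard diffusion forward process assumed earlier; model, decoder, and step_fn are assumed callables, model.abar an assumed handle to the noise schedule, and the way the two denoising vectors are combined is illustrative only, since the exact combination formula is not reproduced here.

```python
import torch

def add_noise(y, t, abar):
    """Noise a feedback sample y to time step t (forward diffusion process)."""
    eps = torch.randn_like(y)                   # Gaussian random noise sample
    return abar[t].sqrt() * y + (1 - abar[t]).sqrt() * eps

@torch.no_grad()
def feedback_denoise(model, decoder, step_fn, prompt, pos, neg, latent_shape, T):
    z = torch.randn(latent_shape)               # random Gaussian noise z_T
    for t in reversed(range(T)):
        eps_pos = model(z, t, prompt, feedback=[add_noise(y, t, model.abar) for y in pos])
        eps_neg = model(z, t, prompt, feedback=[add_noise(y, t, model.abar) for y in neg])
        eps = eps_pos + (eps_pos - eps_neg)     # illustrative combination only
        z = step_fn(z, eps, t)                  # one reverse-diffusion step
    return decoder(z)                           # final denoising vector -> image
```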
The embodiment of the invention thus provides a human-feedback-based self-evolving inference algorithm for text-to-image generation, which overcomes the inability of traditional text-to-image algorithms to incorporate human feedback at inference time, fully accounts for human feedback in practical deployment, and greatly enhances the human-computer interaction capability of the text-to-image model.
The various embodiments corresponding to the image generation method are described in detail above; on this basis, the invention further discloses an image generation apparatus, device, and computer-readable storage medium corresponding to the method.
The eighth embodiment of the present invention will be described.
Fig. 10 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present invention.
As shown in fig. 10, an image generating apparatus provided by an embodiment of the present invention includes:
a first obtaining unit 1001, configured to obtain a first sample text, an initial text-to-image model, and an image-to-text model;
the first training unit 1002 is configured to, starting from the initial text-to-image model, input the first sample text into the intermediate text-to-image model, output the first intermediate image, input the first intermediate image into the image-to-text model, output the first predicted text, construct the reinforcement learning reward function according to the text similarity score between the first predicted text and the first sample text, and update the model parameters of the intermediate text-to-image model with the reinforcement learning reward function until iterative training ends, thereby obtaining the text-to-image model;

a first output unit 1003, configured to input the text to be processed into the text-to-image model and output the result image.
In some implementations, the first training unit 1002 inputting the first intermediate image into the image-to-text model, outputting the first predicted text, and constructing the reinforcement learning reward function with the text similarity score of the first predicted text and the first sample text comprises:

inputting the first intermediate image into the image-to-text model for entity detection to obtain the entities corresponding to the first intermediate image;

generating the first predicted text with the entities corresponding to the first intermediate image, and calculating the similarity between the first predicted text and the first sample text to obtain the first text similarity score;

detecting and segmenting the first intermediate image according to the entities corresponding to the first intermediate image to obtain the entity mask corresponding to the first intermediate image;
calculating scene similarity of the entity mask and the first intermediate image to obtain a scene similarity score;
calculating a second text similarity score according to the first text similarity score and the scene similarity score;
constructing a reinforcement learning reward function according to the second text similarity score;
wherein the type of entity includes at least one of an object, a relationship, an attribute, and a behavior.
In some implementations, the first training unit 1002 generating the first predicted text using the entities corresponding to the first intermediate image includes:

generating the first predicted text with the screened entity labels obtained by detecting and segmenting the first intermediate image according to the entities corresponding to the first intermediate image.
In some implementations, the first training unit 1002 constructs a reinforcement learning reward function from the second text similarity score, including:
inputting the first intermediate image into a multimodal conversion model to obtain the second predicted text;

calculating the similarity between the second predicted text and the first sample text to obtain the third text similarity score;
and constructing a reinforcement learning reward function according to the second text similarity score and the third text similarity score.
In some implementations, the first training unit 1002 inputting the first intermediate image into the image-to-text model for entity detection to obtain the entities corresponding to the first intermediate image comprises:

inputting the first intermediate image into the image recognition model in the image-to-text model to obtain the image features of the first intermediate image;

identifying the entities contained in the first intermediate image from those image features with the recognition-marking model in the image-to-text model;

marking each entity with the recognition-marking model to generate the entity text label list corresponding to the first intermediate image;

and the first training unit 1002 generating the first predicted text with the entities corresponding to the first intermediate image comprises:

generating, with the text generation model in the image-to-text model, the first predicted text according to the entity text labels in the entity text label list.
In some implementations, the training process of the image-to-text model includes:

acquiring an image-text pair and an initial image-to-text model;

and, for the intermediate image-to-text model in each training iteration starting from the initial image-to-text model: inputting a sample image of the image-text pair into the image recognition model in the intermediate image-to-text model to obtain the image features of the sample image; identifying the intermediate entities contained in the sample image from those image features with the recognition-marking model; marking the intermediate entities to generate the intermediate entity text label list corresponding to the sample image; generating a third predicted text according to the intermediate entity text labels in that list; and updating the model parameters of the intermediate image-to-text model using the degree of text difference between the second sample text corresponding to the sample image and the third predicted text, until iterative training ends and the image-to-text model is obtained.
In some implementations, for the intermediate image-to-text model in each training iteration, inputting a sample image of the image-text pair into the image recognition model in the intermediate image-to-text model to obtain the image features of the sample image, identifying the intermediate entities contained in the sample image from those features with the recognition-marking model, marking the intermediate entities to generate the intermediate entity text label list corresponding to the sample image, generating a third predicted text according to the intermediate entity text labels in that list, and updating the model parameters of the intermediate image-to-text model using the degree of text difference between the second sample text corresponding to the sample image and the third predicted text includes:

taking the image feature encoder with frozen parameters as the image recognition model, and inputting the sample image into the image feature encoder to obtain the image feature codes of the sample image; extracting the intermediate entity text labels in the second sample text with a contrastive language-image pre-training (CLIP) text encoder;

assigning the feature weights of the intermediate entity text labels to the learnable text query parameters;

inputting the assigned learnable text query parameters and the image feature codes into the recognition-marking model to obtain a predicted entity feature vector and an analysis tag, and training the recognition-marking model with a cross-entropy loss over the predicted entity feature vector and the analysis tag, taking the intermediate entity text labels as ground truth;

and inputting the entity tags and the image feature codes into the text generation model to output text analysis features, taking the second sample text as ground truth, and training the text generation model with a cross-entropy loss over the text analysis features and the second sample text.
In some implementations, the first training unit 1002 performs in parallel, over the different first intermediate images generated in one training iteration of the text-to-image model, the extraction of image features with the image recognition model, the generation of entity text label lists with the recognition-marking model, and the generation of the first predicted texts from the entity text labels in those lists with the text generation model.
In some implementations, the first training unit 1002 calculating the similarity between the first predicted text and the first sample text to obtain the first text similarity score includes:

generating a first scoring task from the similarity between the first predicted text and the first sample text, and inputting the first scoring task into an artificial intelligence large model to obtain the first text similarity score;

the first training unit 1002 calculating the scene similarity between the entity mask and the first intermediate image to obtain the scene similarity score includes:

generating a second scoring task from the scene similarity between the entity mask and the first intermediate image, and inputting the second scoring task into the artificial intelligence large model to obtain the scene similarity score;

the first training unit 1002 calculating the similarity between the second predicted text and the first sample text to obtain the third text similarity score includes:

generating a third scoring task from the similarity between the second predicted text and the first sample text, and inputting the third scoring task into the artificial intelligence large model to obtain the third text similarity score.
In some implementations, the first training unit 1002 constructs a reinforcement learning reward function from the second text similarity score and the third text similarity score, including:
sorting the second text similarity scores corresponding to the first intermediate images to obtain a second text similarity score matrix corresponding to the first sample text;
sorting the third text similarity scores corresponding to the first intermediate images to obtain a third text similarity score matrix corresponding to the first sample text;
and taking the sum of the second text similarity scoring matrix and the third text similarity scoring matrix as a reinforcement learning reward function.
In some implementations, the first training unit 1002 sampling from the first sample set to obtain the first batch of training data for the current training iteration includes:

while the model parameters of the intermediate text-to-image model are being updated based on the first batch of training data in the current training iteration, executing in parallel the sampling step that obtains the first batch of training data for the next training iteration.
In some implementations, the initial text-to-image model is composed of a text encoder, an image self-decoder, cross-attention downsampling layers, cross-attention upsampling layers, and cross-attention layers;

the cross-attention downsampling layers, cross-attention upsampling layers, and cross-attention layers are each provided with an adaptation layer arranged on the fully connected layers of their cross-attention modules;

and the adaptation layer comprises an R-downsampling fully connected module and an R-upsampling fully connected module, where the R-downsampling fully connected module reduces the input feature channel to R dimensions and the R-upsampling fully connected module restores the R-dimensional features to the original input feature dimension.
In some implementations, the first training unit 1002 updating the model parameters of the intermediate text-to-image model with the reinforcement learning reward function includes:

freezing the model parameters of the networks in the intermediate text-to-image model other than the R-downsampling and R-upsampling fully connected modules, and updating the model parameters of the R-downsampling fully connected module and the R-upsampling fully connected module with the reinforcement learning reward function.
In some implementations, the first training unit 1002 inputting the first sample text into the intermediate text-to-image model and outputting the first intermediate image comprises:

quantizing the model parameters of the intermediate text-to-image model to obtain a quantized intermediate text-to-image model;

inputting the first sample text into the quantized intermediate text-to-image model and outputting the first intermediate image;

and updating the model parameters of the intermediate text-to-image model with the reinforcement learning reward function comprises:

performing, when the model parameters of the intermediate text-to-image model are updated with the reinforcement learning reward function and back-propagation is computed, inverse quantization on the model parameters of the intermediate text-to-image model to obtain the updated intermediate text-to-image model.
In some implementations, the first training unit 1002 quantizing the model parameters of the intermediate text-to-image model to obtain the quantized intermediate text-to-image model includes:

performing four-bit standard floating point quantization on the model parameters and data-type quantile levels of the intermediate text-to-image model so as to normalize them to [-1, 1];

performing double quantization on the four-bit-quantized model parameters and the original 32-bit floating point quantization constants in the intermediate text-to-image model, converting the quantization constants of the intermediate text-to-image model into quantized quantization constants c2 and second-level quantization constants c1, to obtain the quantized intermediate text-to-image model;

and handling the memory allocation of the intermediate text-to-image model with a paged optimizer while the quantization of its model parameters is executed.
In some implementations, the first output unit 1003 inputting the text to be processed into the text-to-image model and outputting the result image comprises:

inputting the text to be processed into the text-to-image model and outputting a second intermediate image;
acquiring a feedback result of a user on the second intermediate image;
and regenerating, with the text-to-image model, a second intermediate image according to the feedback result and the text to be processed until a second intermediate image meeting the feedback requirement is obtained, and outputting that second intermediate image as the result image.

In some implementations, the first output unit 1003 inputting the text to be processed into the text-to-image model and outputting the second intermediate image comprises:

inputting the text to be processed into the text-to-image model and outputting a plurality of second intermediate images;

obtaining a feedback result of the user on the second intermediate images, including:

listing the second intermediate images with positive feedback results in a positive feedback set, and listing the second intermediate images with negative feedback results in a negative feedback set;

and regenerating, with the text-to-image model, second intermediate images according to the feedback results and the text to be processed until a second intermediate image meeting the feedback requirement is acquired, and outputting it as the result image, which comprises the following steps:

generating feedback samples from the positive feedback set and the negative feedback set, and regenerating second intermediate images with the text-to-image model using the feedback samples and the text to be processed, until a second intermediate image meeting the feedback requirement is obtained and output as the result image.
In some implementations, the text-to-image model is a text-to-image diffusion model;

and the first output unit 1003 generating feedback samples from the positive feedback set and the negative feedback set, regenerating second intermediate images with the text-to-image diffusion model using the feedback samples and the text to be processed until a second intermediate image meeting the feedback requirement is acquired, and outputting it as the result image comprises:
acquiring random Gaussian noise;
generating feedback noise-added vectors according to the positive feedback set and the negative feedback set;

inputting the feedback noise-added vectors, the current time-step variable, and the text to be processed into the text-to-image diffusion model to obtain feedback prediction variables;

dividing the feedback prediction variables into positive hidden vectors and negative hidden vectors according to the positive feedback set and the negative feedback set;

taking the features generated after passing through the residual network in each sub-network of the text-to-image diffusion model as extended hidden vectors, and recording the hidden sub-layer features corresponding to each layer of the extended hidden vectors in the text-to-image diffusion model, the hidden sub-layer features being divided into positive-sample and negative-sample hidden sub-layer features according to the positive feedback set and the negative feedback set;

updating the model weight vectors of the text-to-image diffusion model according to the model parameters in its self-attention modules and the features output by its residual modules, generating positive-sample denoising vectors from the positive-sample model weight vectors in the self-attention module based on the attention mechanism, and generating negative-sample denoising vectors from the negative-sample model weight vectors in the self-attention module based on the attention mechanism;

generating a current denoising vector according to the positive-sample denoising vector and the negative-sample denoising vector;

if the number of denoising steps of the text-to-image diffusion model has not reached the set number, performing Gaussian noise denoising with the current denoising vector and returning to the step of generating feedback noise-added vectors from the positive feedback set and the negative feedback set;

and if the number of denoising steps of the text-to-image diffusion model has reached the set number, taking the current denoising vector as the final denoising vector and inputting it into the decoder to obtain the result image.
Since the embodiments of the apparatus portion correspond to those of the method portion, reference is made to the description of the method embodiments for details, which are not repeated here.

The ninth embodiment of the present invention will be described below.
Fig. 11 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present invention.
As shown in fig. 11, an image generating apparatus provided by an embodiment of the present invention includes:
a memory 1110 for storing a computer program 1111;
a processor 1120 for executing a computer program 1111, the computer program 1111 when executed by the processor 1120 implementing the steps of the image generating method according to any one of the embodiments described above.
Processor 1120 may include one or more processing cores, such as a 3-core processor or an 8-core processor. The processor 1120 may be implemented in at least one hardware form among a digital signal processor (DSP, Digital Signal Processing), a field-programmable gate array (FPGA, Field-Programmable Gate Array), and a programmable logic array (PLA, Programmable Logic Array). Processor 1120 may also include a main processor and a coprocessor: the main processor processes data in the awake state and is also called the central processing unit (CPU, Central Processing Unit); the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor 1120 may be integrated with a graphics processing unit (GPU, Graphics Processing Unit), the GPU being responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor 1120 may also include an artificial intelligence (AI, Artificial Intelligence) processor for handling machine-learning computations.

Memory 1110 may include one or more computer-readable storage media, which may be non-transitory. Memory 1110 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices or flash memory storage devices. In this embodiment, the memory 1110 is at least used to store a computer program 1111 which, when loaded and executed by the processor 1120, can implement the relevant steps of the image generation method disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory 1110 may further include an operating system 1112 and data 1113, and the storage may be transient or persistent. The operating system 1112 may be Windows. The data 1113 may include, but is not limited to, data related to the methods described above.
In some embodiments, the image generation device may further include a display 1130, a power source 1140, a communication interface 1150, an input-output interface 1160, a sensor 1170, and a communication bus 1180.
Those skilled in the art will appreciate that the structure shown in fig. 11 does not constitute a limitation of the image generation device, which may include more or fewer components than illustrated.
The image generation device provided by the embodiment of the invention comprises the memory and the processor described above; when the processor executes the program stored in the memory, it can implement the image generation method, with the same effects as described for the method.
The following describes a further embodiment of the present invention.
It should be noted that the apparatus and device embodiments described above are merely exemplary, and for example, the division of modules is merely a logic function division, and there may be other division manners in actual implementation, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms. The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium for performing all or part of the steps of the method according to the embodiments of the present invention.
To this end, an embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the image generation method described above.

The computer-readable storage medium may include: a USB flash disk, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, an optical disk, or various other media capable of storing program code.
The computer program included in the computer-readable storage medium provided in this embodiment can implement the steps of the image generation method described above when executed by a processor, and the same effects are achieved.
The image generation method, apparatus, device, and computer-readable storage medium provided by the present invention are described in detail above. The embodiments in this description are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts among the embodiments, reference may be made to one another. Because the apparatus, device, and computer-readable storage medium embodiments correspond to the method embodiments, their descriptions are relatively brief; for relevant details, refer to the method section. It should be noted that it will be apparent to those skilled in the art that the present invention may be modified and varied without departing from the spirit of the present invention.
It should also be noted that in this specification, relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

Claims (22)

1. An image generation method, comprising:
acquiring a first sample text, an initial text-to-image model, and an image-to-text model;
starting from the initial text-to-image model, for each iteration of training: inputting the first sample text into the intermediate text-to-image model and outputting a first intermediate image; inputting the first intermediate image into the image-to-text model and outputting a first predicted text; constructing a reinforcement learning reward function from the text similarity score of the first predicted text and the first sample text; and updating the model parameters of the intermediate text-to-image model by using the reinforcement learning reward function, until iterative training ends, to obtain the text-to-image model;
inputting a text to be processed into the text-to-image model, and outputting a result image.
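For orientation, the text → image → text training loop of claim 1 can be sketched in a few lines of Python. This is a minimal sketch under loud assumptions: the toy linear modules, embedding sizes, and cosine-similarity reward below stand in for the patented text-to-image model, image-to-text model, and text similarity score, and the policy-gradient update is simplified to direct gradient ascent on the reward.

```python
import torch
import torch.nn as nn

# Toy stand-ins (assumptions, not the patented architectures).
class ToyTextToImage(nn.Module):
    def __init__(self, text_dim=16, img_dim=32):
        super().__init__()
        self.net = nn.Linear(text_dim, img_dim)

    def forward(self, text_emb):
        return torch.tanh(self.net(text_emb))

class ToyImageToText(nn.Module):
    def __init__(self, img_dim=32, text_dim=16):
        super().__init__()
        self.net = nn.Linear(img_dim, text_dim)

    def forward(self, image):
        return torch.tanh(self.net(image))

t2i, i2t = ToyTextToImage(), ToyImageToText()
for p in i2t.parameters():
    p.requires_grad_(False)  # the image-to-text model stays fixed

optimizer = torch.optim.Adam(t2i.parameters(), lr=1e-3)
first_sample_text = torch.randn(4, 16)  # a batch of text embeddings

for step in range(100):  # "until iterative training ends"
    first_intermediate_image = t2i(first_sample_text)
    first_predicted_text = i2t(first_intermediate_image)
    # The text similarity score of predicted vs. sample text is the reward;
    # maximizing it updates only the text-to-image model.
    reward = torch.cosine_similarity(first_predicted_text,
                                     first_sample_text, dim=-1)
    loss = -reward.mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The structure mirrors the claim: the frozen image-to-text model closes the text-image-text loop that produces the reward, and only the text-to-image model is updated.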
2. The image generation method according to claim 1, wherein the inputting the first intermediate image into the image-to-text model and outputting a first predicted text, and the constructing a reinforcement learning reward function from the text similarity score of the first predicted text and the first sample text, comprise:
inputting the first intermediate image into the image-to-text model for entity detection to obtain an entity corresponding to the first intermediate image;
generating the first predicted text by using the entity corresponding to the first intermediate image, and calculating the similarity between the first predicted text and the first sample text to obtain a first text similarity score;
detecting and segmenting the first intermediate image according to the entity corresponding to the first intermediate image to obtain an entity mask corresponding to the first intermediate image;
calculating scene similarity between the entity mask and the first intermediate image to obtain a scene similarity score;
calculating a second text similarity score according to the first text similarity score and the scene similarity score;
constructing the reinforcement learning reward function according to the second text similarity score;
wherein the type of the entity comprises at least one of an object, a relationship, an attribute, and a behavior.
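Once the first text similarity score and the scene similarity score are in hand, the remainder of claim 2 is simple arithmetic. The weighted average below is an assumption made for illustration; the claim states only that the second score is computed from the two inputs and that the reward function is built from it.

```python
def second_text_similarity_score(first_text_score: float,
                                 scene_score: float,
                                 alpha: float = 0.5) -> float:
    # Hypothetical combination rule; `alpha` is not taken from the patent.
    return alpha * first_text_score + (1.0 - alpha) * scene_score

def reward(first_text_score: float, scene_score: float) -> float:
    # The reinforcement learning reward is built from the second score.
    return second_text_similarity_score(first_text_score, scene_score)

print(reward(0.8, 0.6))  # 0.7 with the default weighting
```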
3. The image generation method according to claim 2, wherein the generating the first predicted text using the entity corresponding to the first intermediate image includes:
generating the first predicted text by using the screened entity labels obtained by detecting and segmenting the first intermediate image according to the entity corresponding to the first intermediate image.
4. The image generation method of claim 2, wherein said constructing the reinforcement learning reward function from the second text similarity score comprises:
inputting the first intermediate image into a multi-modal conversion model to obtain a second predicted text;
calculating the similarity between the second predicted text and the first sample text to obtain a third text similarity score;
and constructing the reinforcement learning reward function according to the second text similarity score and the third text similarity score.
5. The image generation method according to any one of claims 2 to 4, wherein the inputting the first intermediate image into the image-to-text model for entity detection to obtain the entity corresponding to the first intermediate image comprises:
inputting the first intermediate image into an image recognition model in the image-to-text model to obtain image features of the first intermediate image;
identifying the entities contained in the first intermediate image according to the image features of the first intermediate image by using a recognition and marking model in the image-to-text model;
marking each entity by using the recognition and marking model to generate an entity text label list corresponding to the first intermediate image;
The generating the first predicted text by using the entity corresponding to the first intermediate image includes:
generating the first predicted text according to the entity text labels in the entity text label list by using a text generation model in the image-to-text model.
6. The image generation method according to claim 5, wherein the training process of the image-to-text model comprises:
acquiring an image-text pair and an initial image-to-text model;
starting from the initial image-to-text model, for each iteration of training: inputting a sample image in the image-text pair into the image recognition model in the intermediate image-to-text model to obtain image features of the sample image; recognizing intermediate entities contained in the sample image according to the image features of the sample image by using the recognition and marking model in the intermediate image-to-text model; marking the intermediate entities to generate an intermediate entity text label list corresponding to the sample image; generating a third predicted text according to the intermediate entity text labels in the intermediate entity text label list; and updating the model parameters of the intermediate image-to-text model by using the text difference degree between a second sample text corresponding to the sample image and the third predicted text, until iterative training ends, to obtain the image-to-text model.
7. The image generation method according to claim 6, wherein, for each iteration of training of the intermediate image-to-text model, the inputting the sample image in the image-text pair into the image recognition model in the intermediate image-to-text model to obtain the image features of the sample image, recognizing the intermediate entities contained in the sample image according to the image features of the sample image by using the recognition and marking model in the intermediate image-to-text model, marking the intermediate entities to generate the intermediate entity text label list corresponding to the sample image, generating the third predicted text according to the intermediate entity text labels in the intermediate entity text label list, and updating the model parameters of the intermediate image-to-text model by using the text difference degree between the second sample text corresponding to the sample image and the third predicted text, comprises:
taking an image feature encoder with frozen parameters as the image recognition model, and inputting the sample image into the image feature encoder to obtain an image feature code of the sample image; and extracting the intermediate entity text labels from the second sample text by using a contrastive language-image pre-training (CLIP) text encoder;
assigning the feature weights of the intermediate entity text labels to learnable text query parameters;
inputting the assigned learnable text query parameters and the image feature code into the recognition and marking model to obtain a predicted entity feature vector and a parsed label, and, taking the intermediate entity text labels as ground truth, training the recognition and marking model with a cross-entropy loss over the predicted entity feature vector and the parsed label;
inputting the entity labels and the image feature code into the text generation model and outputting text parsing features, and, taking the second sample text as ground truth, training the text generation model with a cross-entropy loss between the text parsing features and the second sample text.
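A compressed sketch of claim 7's supervision follows. The frozen image encoder, the additive fusion of queries and image features, and all dimensions are illustrative assumptions; a real recognition and marking model would attend over the image feature code rather than simply add it to the queries.

```python
import torch
import torch.nn as nn

feat_dim, num_labels, num_queries = 64, 10, 8  # illustrative sizes

image_encoder = nn.Linear(128, feat_dim)  # stand-in for the frozen encoder
for p in image_encoder.parameters():
    p.requires_grad_(False)

# Learnable text query parameters (here randomly initialized; the claim
# initializes them from CLIP-encoded entity-label feature weights).
text_query = nn.Parameter(torch.randn(num_queries, feat_dim))
recognition_head = nn.Linear(feat_dim, num_labels)  # recognition-and-marking head
criterion = nn.CrossEntropyLoss()

image_feature_code = image_encoder(torch.randn(num_queries, 128))
logits = recognition_head(image_feature_code + text_query)    # simplified fusion
entity_labels = torch.randint(0, num_labels, (num_queries,))  # ground truth
recognition_loss = criterion(logits, entity_labels)           # cross-entropy loss
recognition_loss.backward()  # updates the queries and the head only
```

The text generation model in the last step would be trained with an analogous cross-entropy loss between its output features and the second sample text.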
8. The image generation method according to claim 5, wherein, for different first intermediate images generated in one iteration of training of the text-to-image model, the generating of the image features of the first intermediate images by the image recognition model, the generating of the entity text label list by the recognition and marking model, and the generating of the first predicted text from the entity text labels in the entity text label list by the text generation model are performed in parallel by a plurality of processes.
9. The image generation method according to claim 4, wherein the calculating the similarity between the first predicted text and the first sample text to obtain the first text similarity score comprises:
generating a first scoring task from the similarity between the first predicted text and the first sample text, and inputting the first scoring task into a large artificial intelligence model to obtain the first text similarity score;
the calculating the scene similarity between the entity mask and the first intermediate image to obtain the scene similarity score comprises:
generating a second scoring task from the scene similarity between the entity mask and the first intermediate image, and inputting the second scoring task into the large artificial intelligence model to obtain the scene similarity score;
the calculating the similarity between the second predicted text and the first sample text to obtain the third text similarity score comprises:
generating a third scoring task from the similarity between the second predicted text and the first sample text, and inputting the third scoring task into the large artificial intelligence model to obtain the third text similarity score.
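In practice, a "scoring task" in claim 9 is a prompt handed to a large model. Both the template wording and the callable interface below are hypothetical; the patent fixes neither.

```python
def build_first_scoring_task(first_predicted_text: str,
                             first_sample_text: str) -> str:
    # Hypothetical prompt template for the first scoring task.
    return (
        "On a scale of 0 to 10, rate the semantic similarity of the texts.\n"
        f"Text A: {first_predicted_text}\n"
        f"Text B: {first_sample_text}\n"
        "Reply with a single number."
    )

def score_with_large_model(task: str, llm) -> float:
    # `llm` is any callable wrapping a large AI model; this signature is
    # an assumption made for the sketch.
    return float(llm(task).strip())

task = build_first_scoring_task("a red car on a road", "a red car driving")
```

The second and third scoring tasks would differ only in the compared inputs (entity mask vs. image scene, and second predicted text vs. sample text).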
10. The image generation method of claim 4, wherein said constructing the reinforcement learning reward function from the second text similarity score and the third text similarity score comprises:
sorting the second text similarity scores corresponding to the first intermediate images to obtain a second text similarity score matrix corresponding to the first sample text;
sorting the third text similarity scores corresponding to the first intermediate images to obtain a third text similarity score matrix corresponding to the first sample text;
and taking the sum of the second text similarity scoring matrix and the third text similarity scoring matrix as the reinforcement learning reward function.
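Claim 10 then reduces to elementwise addition of two score matrices, one row per first sample text and one column per first intermediate image. A toy example with made-up numbers:

```python
import torch

second_score_matrix = torch.tensor([[0.8, 0.6, 0.9],
                                    [0.7, 0.5, 0.4]])
third_score_matrix = torch.tensor([[0.9, 0.4, 0.8],
                                   [0.6, 0.7, 0.3]])
# The sum of the two matrices serves as the reinforcement learning reward.
reward_matrix = second_score_matrix + third_score_matrix
```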
11. The image generation method according to claim 1, wherein, for each iteration of training of the intermediate text-to-image model, the inputting the first sample text into the intermediate text-to-image model and outputting a first intermediate image, inputting the first intermediate image into the image-to-text model and outputting a first predicted text, constructing a reinforcement learning reward function from the text similarity score of the first predicted text and the first sample text, and updating the model parameters of the intermediate text-to-image model by using the reinforcement learning reward function, comprises:
sampling from a first sample text set to obtain a first batch of training data for the current iteration of training;
for each first sample text in the first batch of training data, inputting the first sample text into the intermediate text-to-image model, and outputting a first intermediate image set corresponding to the first sample text;
inputting each first intermediate image in the first intermediate image set into the image-to-text model and outputting the first predicted text, and constructing the reinforcement learning reward function corresponding to the first sample text according to the text similarity scores between the first predicted texts corresponding to the first intermediate image set and the first sample text;
updating the model parameters of the intermediate text-to-image model by using the reinforcement learning reward functions corresponding to the first sample texts in the first batch of training data.
12. The image generation method according to claim 11, wherein the sampling from the first sample text set to obtain the first batch of training data for the current iteration of training comprises:
while the model parameters of the intermediate text-to-image model are being updated based on the first batch of training data in the current iteration of training, executing in parallel the step of sampling from the first sample text set to obtain the first batch of training data for the next iteration of training.
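The pipelining in claim 12 is ordinary batch prefetching: while the current update runs, the next batch is sampled in parallel. A minimal sketch with a worker thread (the sampler and the texts are placeholders):

```python
import random
from concurrent.futures import ThreadPoolExecutor

def sample_batch(sample_texts, batch_size=2):
    return random.sample(sample_texts, batch_size)

sample_texts = ["a red car", "a cat on a sofa", "a snowy street", "two birds"]

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(sample_batch, sample_texts)  # prefetch first batch
    for iteration in range(3):
        batch = future.result()  # batch for the current iteration
        future = pool.submit(sample_batch, sample_texts)  # sample next batch
        # ... the parameter update on `batch` runs here, overlapping with
        # the sampling submitted above ...
        print(iteration, batch)
```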
13. The image generation method according to claim 1, wherein the initial text-to-image model is composed of a text encoder, an image autoencoder, a cross-attention downsampling layer, a cross-attention upsampling layer, and a cross-attention layer;
the cross-attention downsampling layer, the cross-attention upsampling layer, and the cross-attention layer are each provided with an adaptation layer arranged on the fully connected layer of the cross-attention module;
the adaptation layer is composed of an R down-projection fully connected module and an R up-projection fully connected module, wherein the R down-projection fully connected module is used for reducing the input feature channels to R dimensions, and the R up-projection fully connected module is used for restoring the R-dimension features to the original input feature dimension.
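The adaptation layer of claim 13 is a low-rank bottleneck of the LoRA family: project the feature down to R dimensions, then back up, beside the original fully connected layer. A minimal sketch (the zero-initialized up-projection is a common convention, not taken from the patent):

```python
import torch
import torch.nn as nn

class AdaptationLayer(nn.Module):
    """R down-projection followed by an R up-projection that restores
    the original feature dimension."""
    def __init__(self, dim: int, r: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, r, bias=False)  # reduce channels to R
        self.up = nn.Linear(r, dim, bias=False)    # restore original dims
        nn.init.zeros_(self.up.weight)             # adapter starts as a no-op

    def forward(self, x: torch.Tensor, base: nn.Linear) -> torch.Tensor:
        return base(x) + self.up(self.down(x))

base_fc = nn.Linear(64, 64)                 # fully connected layer being adapted
adapter = AdaptationLayer(dim=64, r=8)
out = adapter(torch.randn(2, 64), base_fc)  # shape (2, 64)
```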
14. The image generation method according to claim 13, wherein the updating the model parameters of the intermediate text-to-image model by using the reinforcement learning reward function comprises:
freezing the model parameters of the network other than the R down-projection fully connected module and the R up-projection fully connected module in the intermediate text-to-image model, and updating the model parameters of the R down-projection fully connected module and the R up-projection fully connected module by using the reinforcement learning reward function.
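Claim 14's update is then standard parameter-efficient fine-tuning: freeze everything except the two projection modules. Sketched below with a placeholder loss standing in for the reinforcement learning objective:

```python
import torch
import torch.nn as nn

backbone = nn.Linear(64, 64)  # stands in for everything outside the adapters
down = nn.Linear(64, 8, bias=False)  # R down-projection module
up = nn.Linear(8, 64, bias=False)    # R up-projection module

for p in backbone.parameters():
    p.requires_grad_(False)  # freeze the rest of the network

optimizer = torch.optim.Adam(
    list(down.parameters()) + list(up.parameters()), lr=1e-4
)

x = torch.randn(2, 64)
loss = (backbone(x) + up(down(x))).pow(2).mean()  # placeholder objective
loss.backward()
optimizer.step()  # only the adapter parameters move
```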
15. The image generation method according to claim 1, wherein the inputting the first sample text into the intermediate text-to-image model and outputting a first intermediate image comprises:
quantizing the model parameters of the intermediate text-to-image model to obtain a quantized intermediate text-to-image model;
inputting the first sample text into the quantized intermediate text-to-image model, and outputting the first intermediate image;
the updating the model parameters of the intermediate text-to-image model by using the reinforcement learning reward function comprises:
when performing the back-propagation computation for updating the model parameters of the intermediate text-to-image model by using the reinforcement learning reward function, dequantizing the model parameters of the intermediate text-to-image model to obtain the updated intermediate text-to-image model.
16. The image generation method according to claim 15, wherein the quantizing the model parameters of the intermediate text-to-image model to obtain the quantized intermediate text-to-image model comprises:
performing four-bit standard floating-point (NF4) quantization on the model parameters of the intermediate text-to-image model so as to normalize the model parameters of the intermediate text-to-image model to [-1, 1];
performing double quantization on the four-bit quantized model parameters and the original 32-bit floating-point quantization constants in the intermediate text-to-image model, converting the quantization constants of the intermediate text-to-image model into quantized quantization constants and second-level quantization constants, to obtain the quantized intermediate text-to-image model;
wherein, during the quantization of the model parameters of the intermediate text-to-image model, the storage locations of the intermediate text-to-image model are managed by using a paged optimizer.
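The scheme in claim 16 follows the QLoRA recipe: block-wise 4-bit quantization of the weights, then "double quantization" of the per-block 32-bit constants. The sketch below uses simple absmax rounding rather than true NF4 quantile quantization, and the 8-bit constants are an assumption; it shows the structure, not the exact patented arithmetic.

```python
import torch

def quantize_4bit(w: torch.Tensor, block: int = 64):
    """Block-wise 4-bit absmax quantization: each block is normalized to
    [-1, 1] by its per-block constant, then rounded to signed levels."""
    w = w.flatten()
    pad = (-len(w)) % block
    w = torch.cat([w, w.new_zeros(pad)]).view(-1, block)
    c = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)  # 32-bit constants
    q = torch.round((w / c) * 7).to(torch.int8)            # values in [-7, 7]
    return q, c.squeeze(1)

def double_quantize(c: torch.Tensor):
    """Double quantization: quantize the 32-bit constants themselves,
    yielding quantized constants plus one second-level constant."""
    c2 = c.abs().max().clamp(min=1e-8)  # second-level quantization constant
    qc = torch.round((c / c2) * 127).to(torch.int8)
    return qc, c2

weights = torch.randn(1000)
q, c = quantize_4bit(weights)
qc, c2 = double_quantize(c)
# Dequantization (used during back-propagation, per claim 15):
dequant = (q.float() / 7) * ((qc.float() / 127) * c2).unsqueeze(1)
```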
17. The image generation method according to claim 1, wherein the inputting the text to be processed into the text-to-image model and outputting a result image comprises:
inputting the text to be processed into the text-to-image model, and outputting a second intermediate image;
acquiring a feedback result of a user on the second intermediate image;
regenerating the second intermediate image according to the feedback result and the text to be processed by using the text-to-image model, until a second intermediate image meeting the feedback requirement is acquired, and outputting the second intermediate image meeting the feedback requirement as the result image.
18. The image generation method according to claim 17, wherein the inputting the text to be processed into the text-to-image model and outputting a second intermediate image comprises:
inputting the text to be processed into the text-to-image model, and outputting a plurality of second intermediate images;
the acquiring the feedback result of the user on the second intermediate image comprises:
adding the second intermediate images whose feedback results are positive to a positive feedback set, and adding the second intermediate images whose feedback results are negative to a negative feedback set;
the regenerating the second intermediate image according to the feedback result and the text to be processed by using the text-to-image model, until the second intermediate image meeting the feedback requirement is acquired, and outputting the second intermediate image meeting the feedback requirement as the result image, comprises:
generating a feedback sample according to the positive feedback set and the negative feedback set, and regenerating the second intermediate image by using the feedback sample and the text to be processed, until the second intermediate image meeting the feedback requirement is acquired, and outputting the second intermediate image meeting the feedback requirement as the result image.
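Claims 17-18 amount to a feedback-collection loop around generation. The dict layout and the feedback source below are assumptions made for illustration:

```python
# User verdicts on a batch of second intermediate images (hypothetical data).
candidates = {"img_0": "positive", "img_1": "negative", "img_2": "positive"}

positive_feedback_set = [k for k, v in candidates.items() if v == "positive"]
negative_feedback_set = [k for k, v in candidates.items() if v == "negative"]

# The feedback sample steers regeneration toward the positive set and away
# from the negative set until an image meets the feedback requirement.
feedback_sample = {
    "positive": positive_feedback_set,
    "negative": negative_feedback_set,
}
```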
19. The image generation method according to claim 18, wherein the text-to-image model is a text-to-image diffusion model;
the generating a feedback sample according to the positive feedback set and the negative feedback set, and regenerating the second intermediate image with the text-to-image diffusion model by using the feedback sample and the text to be processed, until the second intermediate image meeting the feedback requirement is acquired, and outputting the second intermediate image meeting the feedback requirement as the result image, comprises:
acquiring random Gaussian noise;
generating a feedback noise-added vector according to the positive feedback set and the negative feedback set;
inputting the feedback noise-added vector, the current time-step variable, and the text to be processed into the text-to-image diffusion model to obtain a feedback prediction variable;
dividing the feedback prediction variable into positive hidden vectors and negative hidden vectors according to the positive feedback set and the negative feedback set;
taking the features generated after passing through the residual network in each sub-network of the text-to-image diffusion model as extended hidden vectors, and recording the hidden sub-layer features corresponding to each layer of the extended hidden vectors in the text-to-image diffusion model, wherein the hidden sub-layer features are divided into positive-sample hidden sub-layer features and negative-sample hidden sub-layer features according to the positive feedback set and the negative feedback set;
updating the model weight vectors of the text-to-image diffusion model according to the model parameters in the self-attention modules of the text-to-image diffusion model and the features output by the residual modules of the text-to-image diffusion model, generating a positive-sample denoising vector from the positive-sample model weight vectors in the self-attention modules based on the attention mechanism, and generating a negative-sample denoising vector from the negative-sample model weight vectors in the self-attention modules based on the attention mechanism;
generating a current denoising vector according to the positive-sample denoising vector and the negative-sample denoising vector;
if the number of denoising steps of the text-to-image diffusion model has not reached the set number of denoising steps, denoising the Gaussian noise by using the current denoising vector and then returning to the step of generating a feedback noise-added vector according to the positive feedback set and the negative feedback set;
if the number of denoising steps of the text-to-image diffusion model has reached the set number of denoising steps, taking the current denoising vector as the final denoising vector, and inputting the final denoising vector into a decoder to obtain the result image.
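The positive/negative guidance in claim 19 resembles classifier-free guidance with the negative-sample prediction in the role of the unconditional branch. Both the combination rule and the toy update below are assumptions; the patent does not disclose the exact formula.

```python
import torch

def combine_denoising(pos_eps: torch.Tensor, neg_eps: torch.Tensor,
                      guidance: float = 2.0) -> torch.Tensor:
    # Push the prediction toward the positive-sample denoising vector and
    # away from the negative-sample one (hypothetical rule).
    return neg_eps + guidance * (pos_eps - neg_eps)

latent = torch.randn(1, 4, 8, 8)  # random Gaussian noise
for t in reversed(range(10)):     # the set number of denoising steps
    pos_eps = torch.randn_like(latent)  # stand-in for the positive prediction
    neg_eps = torch.randn_like(latent)  # stand-in for the negative prediction
    eps = combine_denoising(pos_eps, neg_eps)
    latent = latent - 0.1 * eps   # simplified update; real schedulers differ
final_denoising_vector = latent   # would be fed to the decoder (claim 19)
```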
20. An image generating apparatus, comprising:
a first acquisition unit, used for acquiring a first sample text, an initial text-to-image model, and an image-to-text model;
a first training unit, used for, starting from the initial text-to-image model, inputting the first sample text into the intermediate text-to-image model and outputting a first intermediate image, inputting the first intermediate image into the image-to-text model and outputting a first predicted text, constructing a reinforcement learning reward function from the text similarity score of the first predicted text and the first sample text, and updating the model parameters of the intermediate text-to-image model by using the reinforcement learning reward function, until iterative training ends, to obtain the text-to-image model;
a first output unit, used for inputting a text to be processed into the text-to-image model and outputting a result image.
21. An image generation device, characterized by comprising:
a memory for storing a computer program;
a processor for executing the computer program, wherein the computer program, when executed by the processor, implements the steps of the image generation method according to any one of claims 1 to 19.
22. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the image generation method according to any one of claims 1 to 19.
CN202311827290.2A 2023-12-28 2023-12-28 Image generation method, device, equipment and computer readable storage medium Active CN117475038B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311827290.2A CN117475038B (en) 2023-12-28 2023-12-28 Image generation method, device, equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311827290.2A CN117475038B (en) 2023-12-28 2023-12-28 Image generation method, device, equipment and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN117475038A true CN117475038A (en) 2024-01-30
CN117475038B CN117475038B (en) 2024-04-19

Family

ID=89640118

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311827290.2A Active CN117475038B (en) 2023-12-28 2023-12-28 Image generation method, device, equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN117475038B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
WO2023087597A1 (en) * 2021-11-19 2023-05-25 苏州浪潮智能科技有限公司 Image processing method and system, device, and medium
CN116245097A (en) * 2022-12-21 2023-06-09 阿里巴巴(中国)有限公司 Method for training entity recognition model, entity recognition method and corresponding device
CN117036778A (en) * 2023-07-07 2023-11-10 南京邮电大学 Potential safety hazard identification labeling method based on image-text conversion model
CN117173504A (en) * 2023-08-17 2023-12-05 腾讯科技(深圳)有限公司 Training method, training device, training equipment and training storage medium for text-generated graph model
CN116935169A (en) * 2023-09-13 2023-10-24 腾讯科技(深圳)有限公司 Training method for draft graph model and draft graph method
CN117290515A (en) * 2023-09-15 2023-12-26 北京百度网讯科技有限公司 Training method of text annotation model, method and device for generating text graph

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
LILIANG REN et al.: "HySPA: Hybrid Span Generation for Scalable Text-to-Graph Extraction", ARXIV, 30 June 2021 (2021-06-30), pages 1-24 *
LIU ZEYU et al.: "Image Chinese caption generation method based on multimodal neural networks", Journal of Chinese Information Processing, vol. 31, no. 06, 1 June 2018 (2018-06-01), pages 162-171 *
TAO HONG et al.: ""Text-image intertextuality" in modern news pictorials", Bianji Zhi You (Editorial Friend), vol. 2018, no. 06, 5 June 2018 (2018-06-05), pages 91-97 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117746214A (en) * 2024-02-07 2024-03-22 青岛海尔科技有限公司 Text adjustment method, device and storage medium for generating image based on large model
CN117765133A (en) * 2024-02-22 2024-03-26 青岛海尔科技有限公司 Correction method and device for generated text, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN117475038B (en) 2024-04-19

Similar Documents

Publication Publication Date Title
Esser et al. Imagebart: Bidirectional context with multinomial diffusion for autoregressive image synthesis
CN109918671B (en) Electronic medical record entity relation extraction method based on convolution cyclic neural network
CN111832501B (en) Remote sensing image text intelligent description method for satellite on-orbit application
CN109299237B (en) Cyclic network man-machine conversation method based on actor critic reinforcement learning algorithm
CN117475038B (en) Image generation method, device, equipment and computer readable storage medium
CN113905391B (en) Integrated learning network traffic prediction method, system, equipment, terminal and medium
CN111916067A (en) Training method and device of voice recognition model, electronic equipment and storage medium
CN110321563B (en) Text emotion analysis method based on hybrid supervision model
CN111582397B (en) CNN-RNN image emotion analysis method based on attention mechanism
CN112560432A (en) Text emotion analysis method based on graph attention network
CN112527966B (en) Network text emotion analysis method based on Bi-GRU neural network and self-attention mechanism
Wu et al. Optimized deep learning framework for water distribution data-driven modeling
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
CN114692605A (en) Keyword generation method and device fusing syntactic structure information
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN110867225A (en) Character-level clinical concept extraction named entity recognition method and system
CN113869005A (en) Pre-training model method and system based on sentence similarity
CN116543289B (en) Image description method based on encoder-decoder and Bi-LSTM attention model
CN113204640A (en) Text classification method based on attention mechanism
CN114757310B (en) Emotion recognition model and training method, device, equipment and readable storage medium thereof
CN111259673A (en) Feedback sequence multi-task learning-based law decision prediction method and system
CN115495579A (en) Method and device for classifying text of 5G communication assistant, electronic equipment and storage medium
Rui et al. Data Reconstruction based on supervised deep auto-encoder
CN113901789A (en) Gate-controlled hole convolution and graph convolution based aspect-level emotion analysis method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant