CN117474796B

CN117474796B - Image generation method, device, equipment and computer readable storage medium

Info

Publication number: CN117474796B
Application number: CN202311813617.0A
Authority: CN
Inventors: 张润泽; 李仁刚; 赵雅倩; 郭振华; 范宝余; 刘璐
Original assignee: Inspur Electronic Information Industry Co Ltd
Current assignee: Inspur Electronic Information Industry Co Ltd
Priority date: 2023-12-27
Filing date: 2023-12-27
Publication date: 2024-04-05
Anticipated expiration: 2043-12-27
Also published as: CN117474796A

Abstract

The invention relates to the technical field of image generation, and particularly discloses an image generation method, an image generation device, image generation equipment and a computer readable storage medium, wherein in the process of carrying out denoising processing on a text to be processed for a preset number of times by utilizing a text graph diffusion model, image recognition is carried out on a generated intermediate image in each denoising processing, a first-stage denoising vector of the text graph diffusion model is updated according to an image recognition result and a content error of the text to be processed, a second-stage denoising vector is obtained, and the second-stage denoising vector is used as a denoising vector of the denoising processing; and generating a result image corresponding to the text to be processed by using the final denoising vector of the second stage. Through two-stage denoising, the control of detail information contained in the text to be processed in each denoising process is enhanced, and the generated result image can accurately describe the detail information contained in the text to be processed, so that the accuracy of text image mode conversion is improved.

Description

Image generation method, device, equipment and computer readable storage medium

Technical Field

The present invention relates to the field of image generation technologies, and in particular, to an image generation method, apparatus, device, and computer readable storage medium.

Background

The text-to-image technology is one of image generation technologies, and can convert text mode information into image modes for display by inputting a section of text description, so that the display effect is extremely high. Meanwhile, more and more people are researching the related text-generated graph technology of expansibility and customization, and the field has wide application prospect currently.

The current text-to-image model is also in the development stage, and a large number of generated images are not matched with input text. Improving the accuracy of text image mode conversion is a technical problem which needs to be solved by the person skilled in the art.

Disclosure of Invention

The invention aims to provide an image generation method, an image generation device, image generation equipment and a computer readable storage medium, which are used for improving the accuracy of text image mode conversion.

In order to solve the above technical problems, the present invention provides an image generating method, including:

acquiring a text to be processed;

denoising the text to be processed for preset times by using a text graph diffusion model;

in each denoising process, performing image recognition on an intermediate image, and updating a first-stage denoising vector of the text-to-text diffusion model according to an image recognition result and a content error of the text to be processed to obtain a second-stage denoising vector, wherein the second-stage denoising vector is used as a denoising vector of the denoising process;

And generating a result image corresponding to the text to be processed by utilizing the final denoising vector of the second stage.

In some implementations, the performing image recognition on the intermediate image, updating a first-stage denoising vector of the text to be processed according to an image recognition result and a content error of the text to be processed, to obtain a second-stage denoising vector, including:

performing part-of-speech analysis on the text to be processed to obtain object types and corresponding object numbers;

performing object recognition on the intermediate image according to the object type to obtain a counting result of the intermediate image;

calculating to obtain an object number loss value of the intermediate image according to the counting result of the intermediate image and the number of objects in the text to be processed;

and updating the first-stage denoising vector by using the object quantity loss value to obtain the second-stage denoising vector.

In some implementations, the performing object recognition on the intermediate image according to the object type to obtain a counting result of the intermediate image includes:

calling a text image matching model to identify a text image matching result matched with the text to be processed from the intermediate image;

Calling a segmentation model to carry out segmentation recognition based on the text-to-graph matching result to obtain a first segmentation recognition result;

and counting the number of the objects corresponding to the object type based on the first segmentation recognition result to obtain a counting result of the intermediate image.

In some implementations, further comprising:

performing binarization processing on the text-to-image matching result to obtain a binarized image;

dividing sub-frames in the binarized image to obtain a rectangular frame set;

the calling of the segmentation model to carry out segmentation recognition based on the text-to-graph matching result to obtain a first segmentation recognition result comprises the following steps:

and calling the segmentation model to carry out segmentation recognition based on the rectangular frame set, so as to obtain the first segmentation recognition result.

In some implementations, the invoking the segmentation model to perform segmentation recognition based on the graph-text matching result to obtain a first segmentation recognition result includes:

inputting the text-to-graph matching result into the segmentation model, and outputting a first initial segmentation mask and a first whole graph characteristic;

inputting the first initial segmentation mask and the first whole graph feature into a Hadamard matrix product operation to obtain a first object of interest image feature;

And calculating cosine similarity according to the first object image feature of interest and the first whole graph feature to obtain the first segmentation recognition result.

In some implementations, further comprising:

calling the segmentation model to carry out segmentation recognition on the intermediate image to obtain a second segmentation recognition result of the intermediate image;

the counting of the number of the objects corresponding to the object type based on the first segmentation recognition result is performed to obtain a counting result of the intermediate image, and the counting result comprises the following steps:

and counting the number of the objects corresponding to the object type based on the first segmentation recognition result and the second segmentation recognition result to obtain a counting result of the intermediate image.

In some implementations, the invoking the segmentation model to perform segmentation recognition on the intermediate image to obtain a second segmentation recognition result of the intermediate image includes:

inputting the intermediate image into the segmentation model, and outputting a second initial segmentation mask and a second whole graph characteristic;

inputting the second initial segmentation mask and the second whole graph characteristic into a Hadamard matrix product operation to obtain a second object of interest image;

and calculating cosine similarity according to the second object image of interest and the second whole graph feature to obtain the second segmentation recognition result.

sampling the intermediate image to obtain a sampling result of the intermediate image;

and calling the segmentation model to carry out segmentation recognition according to the first segmentation recognition result and the sampling result to obtain the second segmentation recognition result.

In some implementations, the invoking the segmentation model to perform segmentation recognition according to the first segmentation recognition result and the sampling result to obtain the second segmentation recognition result includes:

performing binarization processing on the first segmentation recognition result to obtain a similarity matrix;

extracting first attention object image features corresponding to the first segmentation recognition result;

and inputting the sampling result, the similarity matrix and the first object image feature of interest into the segmentation model to obtain the second segmentation recognition result.

In some implementations, the sampling the intermediate image to obtain a sampling result of the intermediate image includes:

and sampling the intermediate image according to the principle that the sampling density of the positive sample position with the median value of 1 in the similarity matrix is higher than that of the negative sample position with the median value of 0 in the similarity matrix, so as to obtain the sampling result.

batch sampling is carried out on the intermediate images, and a plurality of batches of sampling results are obtained in sequence;

the calling the segmentation model to carry out segmentation recognition according to the first segmentation recognition result and the sampling result to obtain the second segmentation recognition result comprises the following steps:

sequentially calling the segmentation model to carry out segmentation recognition according to the first segmentation recognition result and the sampling results of each batch to obtain the second segmentation recognition result corresponding to the sampling results of each batch;

and taking the second segmentation recognition result corresponding to the sampling result of each batch as the final second segmentation recognition result.

In some implementations, the batch sampling the intermediate image sequentially obtains a plurality of batches of the sampling results, including:

and determining the number of sampling points of each batch according to the size of a memory divided by the equipment for the text-generated graph task, and carrying out batch sampling on the intermediate image according to the number of the sampling points of each batch to sequentially obtain a plurality of batches of sampling results.

And filtering sampling points of the current batch according to the segmentation mask result of the previous batch to obtain the sampling result of the current batch.

calculating attention force diagrams of the object types by using the culture map diffusion model;

dividing the intermediate image according to the attention map to obtain divided images corresponding to the object types;

identifying and obtaining the object number identification results corresponding to the segmented images;

and taking the object number identification result corresponding to each object type as the counting result of the intermediate image.

In some implementations, the calculating obtains the object number loss value of the intermediate image according to the counting result of the intermediate image and the number of objects in the text to be processed, specifically by the following formula:

；

updating the first-stage denoising vector by using the object quantity loss value to obtain the second-stage denoising vector, wherein the second-stage denoising vector is obtained by the following calculation:

；

wherein,for the object quantity loss value, +. >Is the firstiSaid segmented image of each said object type,C() For the counting model, ++>Is the firstiThe number of objects of each of said object types, +.>For the second stage denoising vector, +.>For said first stage denoising vector, < > and>、/>is super-parameter (herba Cinchi Oleracei)>And (3) the derivative of the Gaussian noise of the current denoising step number for the counting model.

In some implementations, the computing an attention map for each of the object types using the meridional graph diffusion model includes:

inputting Gaussian noise and the current denoising step number into the venturi graph diffusion model, and calculating to obtain an initial first-stage denoising vector and an initial attention map corresponding to each object type in the intermediate image;

re-weighting each initial attention map to obtain the attention map corresponding to each object type;

said re-weighting each of said initial attempts to pay attention is specifically obtained by:

；

wherein,is the firstiThe first degree of the attention map for each of the object typesj，k) The response value in terms of the coordinates,is the firstiA first degree of the initial attention profile of each of the object typesj，k) The response value in terms of the coordinates,min() In order to take the minimum value calculation formula, max() The formula is calculated for taking the maximum value.

In some implementations, the first stage denoising vector and the intermediate image are specifically obtained by:

calculating a loss of attention of the attention profile;

updating the initial first-stage denoising vector according to the attention loss calculation to obtain the first-stage denoising vector;

calculating according to the first-stage denoising vector to obtain a first-stage denoising hidden vector;

decoding and calculating the denoising hidden vector in the first stage to obtain the intermediate image;

wherein, the attention loss of the attention map is calculated by the following formula:

；

the initial first stage denoising vector is updated according to the attention loss calculation, the first stage denoising vector is obtained, and the first stage denoising vector is obtained specifically through the following calculation:

；

the first-stage denoising hidden vector is obtained by calculating according to the first-stage denoising vector, and is specifically obtained by calculating according to the following formula:

；

the first-stage denoising hidden vector is decoded and calculated to obtain the intermediate image, and the intermediate image is obtained by the following calculation:

；

wherein,for said first stage denoising vector, < > and>denoising vector for last denoising step, +. >、/>Is super-parameter (herba Cinchi Oleracei)>Gaussian noise for counting the number of current denoising steps for the model>Is used for the purpose of determining the derivative of (c),Lattention loss for said attention profile, < >>In order to provide the intermediate image with a picture,Dec() For the decoding function +.>To go to the first stageNoise hidden vector.

performing part-of-speech analysis on the text to be processed to obtain an object type;

performing image recognition on the intermediate image to obtain object attributes of the intermediate image;

calculating an object attribute loss value of the intermediate image according to the object type and the object attribute;

and updating the first-stage denoising vector by using the object attribute loss value to obtain the second-stage denoising vector.

In some implementations, the image identifying the intermediate image to obtain the object attribute of the intermediate image includes:

calling a segmentation model to carry out segmentation recognition on the intermediate image to obtain a third segmentation recognition result of the intermediate image;

And carrying out object attribute identification based on the third segmentation identification result to obtain the object attribute of the object contained in the intermediate image.

In order to solve the above technical problem, the present invention further provides an image generating apparatus, including:

the acquisition unit is used for acquiring the text to be processed;

the denoising unit is used for denoising the text to be processed for preset times by using a text-to-text graph diffusion model; in each denoising process, performing image recognition on an intermediate image, and updating a first-stage denoising vector of the text-to-text diffusion model according to an image recognition result and a content error of the text to be processed to obtain a second-stage denoising vector, wherein the second-stage denoising vector is used as a denoising vector of the denoising process;

and the output unit is used for generating a result image corresponding to the text to be processed by utilizing the final denoising vector of the second stage.

a memory for storing a computer program;

a processor for executing the computer program, which when executed by the processor implements the steps of the image generation method according to any one of the above.

To solve the above technical problem, the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the image generation method according to any one of the above.

In the image generation method provided by the invention, in the process of carrying out denoising treatment on a text to be treated by utilizing a text graph diffusion model for a preset number of times, carrying out image recognition on a generated intermediate image in each denoising treatment, updating a first-stage denoising vector of the text graph diffusion model according to an image recognition result and a content error of the text to be treated to obtain a second-stage denoising vector, and taking the second-stage denoising vector as a denoising vector of the denoising treatment; and generating a result image corresponding to the text to be processed by using the final denoising vector of the second stage. Through two-stage denoising, the control of detail information contained in the text to be processed in each denoising process is enhanced, and the generated result image can accurately describe the detail information contained in the text to be processed, so that the accuracy of text image mode conversion is improved.

The invention also provides an image generating device, equipment and a computer readable storage medium, which have the beneficial effects and are not repeated here.

Drawings

For a clearer description of embodiments of the invention or of the prior art, the drawings that are used in the description of the embodiments or of the prior art will be briefly described, it being apparent that the drawings in the description below are only some embodiments of the invention, and that other drawings can be obtained from them without inventive effort for a person skilled in the art.

FIG. 1 is a flowchart of an image generating method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a diffusion model of a text-generated graph according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of a cross-attention layer model according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of the recognition principle of a segmentation model according to an embodiment of the present invention;

FIG. 5 is a flow chart of a counting process for an intermediate image according to an embodiment of the present invention;

FIG. 6 is a flowchart of segmentation recognition of a segmentation model according to an embodiment of the present invention;

fig. 7 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present invention;

fig. 8 is a schematic structural diagram of an image generating apparatus according to an embodiment of the present invention.

Detailed Description

The core of the invention is to provide an image generation method, an image generation device and a computer readable storage medium, which are used for improving the accuracy of text image mode conversion.

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The following describes an embodiment of the present invention.

For ease of understanding, a system architecture to which the present invention is applicable will first be described. The image generation method provided by the embodiment of the invention can be applied to the computing equipment with the accelerator, and can also be applied to the accelerator cluster and the heterogeneous accelerator cluster. The accelerator may employ, but is not limited to, a graphics processor (Graphics Processing Unit, GPU), a field programmable gate array (Field Programmable Gate Array, FPGA), or the like.

On the basis of the above architecture, an image generating method according to an embodiment of the present invention is described below with reference to the accompanying drawings.

The second embodiment of the present invention will be described below.

As shown in fig. 1, the image generating method provided by the embodiment of the invention includes:

s101: and acquiring a text to be processed.

S102: denoising the text to be processed for preset times by using a text graph diffusion model; in each denoising process, image recognition is carried out on the intermediate image, a first-stage denoising vector of the text-to-image diffusion model is updated according to an image recognition result and content errors of the text to be processed, a second-stage denoising vector is obtained, and the second-stage denoising vector is used as a denoising vector of the denoising process.

S103: and generating a result image corresponding to the text to be processed by using the final denoising vector of the second stage.

The image generation method provided by the embodiment of the invention is suitable for artificial intelligent drawing, the acquired text to be processed can be text input by a user or text generated by voice recognition conversion of the user, and the text to be processed is converted from a text mode to an image mode by adopting a text-based image diffusion model.

The principle of the diffusion model of the meristematic map is as follows: given an image, the meridional graph diffusion model first generates a series of Markov chain hidden vectors，……，/>The series of hidden vectors add gaussian noise to the original image progressively, the formula for adding gaussian noise is as follows:

；

Wherein,is a progressive slave->To->Is added with the noise formula;tis a time step; />The variance used for each step (step) is between 0 and 1. For the diffusion model of the venturi, the variance of the desynchronization is set to variance schedule or noise schedule, and in general, the later steps will take a larger variance, i.e. satisfy +.>Under a designed variance schedule, if the number of diffusion steps is sufficiently large, the final +.>The original data is completely lost and becomes a random noise.

Based on hidden vectorsThe specific formula of the denoising process is as follows:

；

wherein,is a progressive slave->To->Is a denoising formula; />Parameters learned for the neural network; />Is the following.

In particular, during the generation of an image, the image is initiallyFrom Gaussian noise->Random sampling, gradually reducing noise to obtain real image +.>. Usually, in a specific practical process, the deviation of the prediction noise is adopted as a specific optimized training function instead of the real image, and the specific function is as follows:

；

wherein,to be from Gaussian noise->Random sampling, < >>Is under the input conditioncInputting time stepstAndtDeviation of noisy image at moment inferred by neural network, < >>Is a parameter learned by the neural network, where cRepresenting various conditional inputs, in embodiments of the present invention, text descriptions are meant; />Representation->Compliance withDistribution of->Indicating that the additive noise profile is gaussian>Representing the initial image +.>Is in accordance with->In the case where the distribution, the added noise distribution is gaussian, and the time step is t.

The text-based graph Diffusion model in the embodiment of the invention can adopt a Stable-Diffusion or other types of text-based graph Diffusion models. The model architecture of Stable-Diffusion is shown in fig. 2, and mainly consists of a text encoder, an image self-decoder, a cross-attention downsampling layer, a cross-attention upsampling layer and a cross-attention layer. Each cross-attention layer network structure is shown in fig. 3, wherein a cross-attention downsampling layer is used for downsampling twice a feature map, a cross-attention upsampling layer is used for upsampling twice a feature, and the network structure is identical in composition and mainly comprises a residual module, a self-attention mechanism module, a feed-forward network module and a cross-attention module. The residual error module adopts a basic network module of a residual error network ResNet. The self-attention module adopts a self-attention network module and a cross-attention network module in a transducer network structure. The text encoder is used for inputting text (sample text or text to be processed).

In the image generation method provided by the embodiment of the invention, two-stage denoising is innovatively adopted when the text-to-image mode conversion is performed by applying the text-to-image diffusion model. The first stage of denoising is to input text (text to be processed in the embodiment of the invention) and random Gaussian noiseGenerating a first stage denoising vector>One-stage denoising hidden vector>One-stage denoising hidden vector +.>The input decoder can obtain intermediate pictures +.>Intermediate image ∈>The method and the device generally do not meet the detail requirements of the number of objects or the attribute of the objects and the like described in the text to be processed, and the denoising vector +_ in the first stage is updated through the image recognition result and the content error of the text to be processed>Obtaining a second-stage denoising vector +.>Through the steps ofTThe round iteration finally generates a second stage denoising vector +.>The second stage denoising vector +.>And inputting the text description into a decoder to obtain a generated image which finally accords with the text description of the text to be processed.

In the image generation method provided by the embodiment of the invention, in the process of carrying out denoising treatment on a text to be treated for a preset number of times by utilizing a text graph diffusion model, carrying out image recognition on a generated intermediate image in each denoising treatment, updating a first-stage denoising vector of the text graph diffusion model according to an image recognition result and a content error of the text to be treated to obtain a second-stage denoising vector, and taking the second-stage denoising vector as a denoising vector of the denoising treatment; and generating a result image corresponding to the text to be processed by using the final denoising vector of the second stage. Through two-stage denoising, the control of detail information contained in the text to be processed in each denoising process is enhanced, and the generated result image can accurately describe the detail information contained in the text to be processed, so that the accuracy of text image mode conversion is improved.

The following describes a third embodiment of the present invention.

On the basis of the embodiment, the embodiment of the invention provides a scheme for optimizing the number of objects of an image generated by a diffusion model of a text-generated graph.

The result image generated by adopting the traditional text-generated graph diffusion model often has the condition that the number of objects is inconsistent with the number of objects described in the text to be processed, for example, ten eggs are input, and only seven eggs are generated in the generated result image.

In contrast, in the image generating method provided by the embodiment of the present invention, in S102, image recognition is performed on the intermediate image, and the first-stage denoising vector of the diffusion model of the text-to-text graph is updated according to the image recognition result and the content error of the text to be processed, so as to obtain the second-stage denoising vector, which may include: part-of-speech analysis is carried out on the text to be processed to obtain object types and corresponding object numbers; performing object recognition on the intermediate image according to the object type to obtain a counting result of the intermediate image; calculating to obtain an object number loss value of the intermediate image according to the counting result of the intermediate image and the number of objects in the text to be processed; and updating the first-stage denoising vector by using the object quantity loss value to obtain a second-stage denoising vector.

In specific implementation, a text image matching model, a segmentation model and a statistics module can be adopted to construct a counting model, objects which accord with the object types described in the generation processor text in the intermediate image are identified through the counting model, and the number of the objects is counted to obtain a counting result of the intermediate image. Object recognition is performed on the intermediate image according to the object type to obtain a counting result of the intermediate image, which may include: calling a text image matching model to identify a text image matching result matched with the text to be processed from the intermediate image; calling a segmentation model to carry out segmentation recognition based on a text-to-graph matching result to obtain a first segmentation recognition result; and counting the number of the objects corresponding to the object type based on the first segmentation recognition result to obtain a counting result of the intermediate image.

Invoking the segmentation model to perform segmentation recognition based on the graph-text matching result to obtain a first segmentation recognition result may include: inputting the text-to-graph matching result into a segmentation model, and outputting a first initial segmentation mask and a first whole graph characteristic; inputting the first initial segmentation mask and the first whole image feature into a Hadamard matrix product operation to obtain a first object image feature; and calculating cosine similarity according to the image features of the first object of interest and the first whole image features to obtain a first segmentation recognition result.

For example, the text to be processed is twenty-five strawberries, the object type is analyzed to be the "strawberries", and the number of the objects is twenty-five. And inputting the strawberry serving as an analysis entity and the intermediate image into a text image matching model, and obtaining a text image matching result, namely a text similarity matching image. On the basis of the text similarity matching graph, the object is segmented by using the segmentation model so as to accurately count. And finally, calling a statistics module to count the segmentation results to obtain a counting result corresponding to the intermediate image.

In order to further improve accuracy of object segmentation, the image generation method provided by the embodiment of the invention may further include: binarization processing is carried out on the text-to-image matching result to obtain a binarized image; and carrying out frame dividing treatment on the binarized image to obtain a rectangular frame set. Calling the segmentation model to perform segmentation recognition based on the text-to-graph matching result to obtain a first segmentation recognition result, including: and calling the segmentation model to carry out segmentation recognition based on the rectangular frame set, so as to obtain a first segmentation recognition result.

The binarization processing is performed on the text-to-image matching result to obtain a binarized image, which may include: and carrying out binarization processing on the text-to-image matching result by adopting a maximum inter-class variance method (OTSU) to obtain a binarized image.

Dividing sub-frames in the binarized image to obtain a rectangular frame set, which may include: selecting a maximum connected subgraph from the binarized image as a basic target area; dividing a sub-graph from a basic target area according to the outline, and dividing a maximum bounding rectangular frame for each sub-graph; and eliminating overlapped redundant frames in each maximum surrounding rectangular frame to obtain a rectangular frame set. The overlapping redundant frames in each maximum bounding rectangle frame are removed, and a Non-maximum value suppression (Non-maximum suppression, NMS) method can be adopted for removing the overlapping redundant frames.

The text Image matching model in embodiments of the present invention may employ contrast language-Image Pre-training (CLIP).

The segmentation model may employ a SAM generic segmentation model. The SAM general segmentation model can generate a high-quality segmentation mask, has stronger zero sample segmentation capability, and also supports the input of various conditions of points, coordinate frames, segmentation masks and texts. The SAM generic segmentation model consists of three parts, as shown in FIG. 4, image encoderf ₁ Condition encoderf ₂ Mask decoderf ₃ 。

Image encoderf ₁ The function is to extract the whole figure characteristic, condition encoder f ₂ The function is to encode the condition features, input the whole image features and the encoded condition features to the mask decoderf ₃ A final segmentation mask is obtained.

Image encoderf ₁ The ViT-H network model weight based on MAE self-supervision pre-training is adopted.

Condition encoderf ₂ Functionally, dense condition encoders and sparse condition encoders can be divided. The dense condition refers to a format in which the input form is a mask, where two convolution layers with a stride=2 convolution kernel of 2×2 are provided. Sparse condition refers to the input of the form of point coordinates, coordinate frames or text input, and the network model adopts the network model of a text encoder in the CLIP networkThe final feature map dimension is 256 dimensions.

Mask decoderf ₃ The network functions to combine the image coding features and the condition coding features and to add a learnable token for predictive masking. The fusion is achieved through two decoding layer networks. The decoding layer network sequentially executes self-attention feature update among the token, the token (token) is used as query feature to perform cross-attention feature update on the image coding feature, the token (token) feature is subjected to feature update through the multi-layer perceptron, and the image coding feature is used for performing attention feature update on the token (token). After passing the decoder, the encoded image features are up-sampled 4 times by two deconvolution layers (the feature map resolution is now one quarter of the original map). The token makes an additional cross-attention feature update to the image-encoded features. Finally, the output token (token) passes through a three-layer multi-layer perceptron (MLP) to be changed into the channel size same as the image coding characteristic, and then dot multiplication operation is carried out on the channel size and the image coding characteristic to obtain a final mask prediction result.

The fourth embodiment of the present invention will be described below.

Based on the above embodiment, in order to further improve accuracy of segmentation recognition of the intermediate image, another segmentation recognition method provided by the embodiment of the present invention is provided.

As shown in fig. 5, in addition to the text image matching model invoked in the above embodiment, a text image matching result matching the text to be processed is identified from the intermediate image; calling a segmentation model to carry out segmentation recognition based on a text-to-graph matching result to obtain a first segmentation recognition result; the image generation method provided by the embodiment of the invention further comprises the following steps of: and calling the segmentation model to carry out segmentation recognition on the intermediate image, and obtaining a second segmentation recognition result of the intermediate image. Counting the number of the objects corresponding to the object type based on the first segmentation recognition result to obtain a counting result of the intermediate image, wherein the counting result comprises the following steps: and counting the number of the objects corresponding to the object types based on the first segmentation recognition result and the second segmentation recognition result to obtain a counting result of the intermediate image.

In a specific implementation, invoking the segmentation model to perform segmentation recognition on the intermediate image to obtain a second segmentation recognition result of the intermediate image may include: inputting the intermediate image into a segmentation model, and outputting a second initial segmentation mask and a second whole graph characteristic; inputting a second initial segmentation mask and a second whole graph characteristic into Hadamard matrix product operation to obtain a second object of interest image; and calculating cosine similarity according to the second object image of interest and the second whole graph feature to obtain a second segmentation recognition result. The adopted segmentation model can be a SAM general segmentation model, and the specific segmentation recognition principle can refer to the description of the third embodiment of the present invention.

In the image generation method provided in the embodiment of the present invention, as shown in fig. 5, object recognition is performed on the intermediate image according to the object type, so as to obtain a counting result of the intermediate image, which may include: calling a text image matching model to identify a text image matching result matched with the text to be processed from the intermediate image; performing binarization processing on the text-to-image matching result by using a maximum inter-class variance method to obtain a binarized image; selecting a maximum connected subgraph from the binarized image as a basic target area; dividing a sub-graph from a basic target area according to the outline, and dividing a maximum bounding rectangular frame for each sub-graph; removing overlapped redundant frames by adopting a non-maximum suppression method to obtain a rectangular frame set; inputting the rectangular frame set into a segmentation model to obtain a first segmentation recognition result; inputting the intermediate image into a segmentation model to obtain a second segmentation recognition result; and inputting the first segmentation recognition result and the second segmentation recognition result into a statistics module, and outputting a counting result of the intermediate image.

As shown in fig. 6, invoking the segmentation model to perform segmentation recognition on the intermediate image to obtain a second segmentation recognition result of the intermediate image may include: sampling the intermediate image to obtain a sampling result of the intermediate image; and calling a segmentation model to carry out segmentation recognition according to the first segmentation recognition result and the sampling result, so as to obtain a second segmentation recognition result. T points can be uniformly sampled for the full view of the intermediate image.

Invoking the segmentation model to perform segmentation recognition according to the first segmentation recognition result and the sampling result to obtain a second segmentation recognition result may include: binarization processing is carried out on the first segmentation recognition result to obtain a similarity matrix; extracting first attention object image features corresponding to the first segmentation recognition results; and inputting the sampling result, the similarity matrix and the image features of the first object of interest into a segmentation model to obtain a second segmentation recognition result.

The similarity matrix can be used as a label of positive and negative samples, so that the number of subsequent point sampling calculation is reduced. A value of 1 in the specific similarity matrix is considered as a positive sample, and a value of 0 is considered as a negative sample. Sampling the intermediate image to obtain a sampling result of the intermediate image may include: and sampling the intermediate image according to the principle that the sampling density of the positive sample position with the median value of 1 in the similarity matrix is higher than that of the negative sample position with the median value of 0 in the similarity matrix, so as to obtain a sampling result.

In order to adapt to the limitation of hardware resources, the sampling of the intermediate image to obtain the sampling result of the intermediate image may include: and carrying out batch sampling on the intermediate images to sequentially obtain a plurality of batch sampling results. Calling the segmentation model to carry out segmentation recognition according to the first segmentation recognition result and the sampling result to obtain a second segmentation recognition result can comprise: sequentially calling a segmentation model to carry out segmentation recognition according to the first segmentation recognition result and each batch of sampling results to obtain a second segmentation recognition result corresponding to each batch of sampling results; and taking the second segmentation recognition result corresponding to the sampling result of each batch as a final second segmentation recognition result. Batch sampling is performed on the intermediate image, and a plurality of batch sampling results are sequentially obtained, which may include: and determining the number of sampling points of each batch according to the size of the memory divided by the equipment for the text-generated graph task, and carrying out batch sampling on the intermediate image according to the number of the sampling points of each batch to sequentially obtain a plurality of batch sampling results.

For a full-image uniform sampling t×t points of the intermediate image, k batches (batch) may be divided in order from top to bottom (or left to right), where the number of sampling points per batch is t×t/k. In practical applications, t may be 30 and k may be 4.

To reduce the calculation consumption, batch sampling is performed on the intermediate image, and a plurality of batch sampling results are sequentially obtained, which may include: and filtering sampling points of the current batch according to the segmentation mask result of the previous batch to obtain the sampling result of the current batch. For example, when a segmentation model is sequentially input to a batch of sampling points from top to bottom to obtain segmentation mask subgraphs corresponding to each batch, in order to reduce the influence caused by repeated computation, the segmentation mask subgraphs generated by the preamble history are used as a priori to filter the sampling points in the current batch when the computation of the current batch is performed, and if the segmentation mask subgraphs generated by the preamble history cover the sampling points in the current batch, the sampling points are filtered.

As shown in fig. 6, in the image generating method provided by the embodiment of the present invention, invoking a segmentation model to perform segmentation recognition on an intermediate image to obtain a second segmentation recognition result of the intermediate image may include:

The text image matching model is called to carry out binarization processing and sub-frame dividing processing according to the text image matching result generated by the intermediate image to obtain a rectangular frame set B and the intermediate image is input into the segmentation model to generate an initial segmentation maskAnd first whole-picture feature->Mask the initial segmentation->And first whole-picture feature->Input Hadamard matrix product operation to obtain first attention object image feature +.>According to the first object of interest image feature +.>And first whole-picture feature->And calculating cosine similarity to obtain a first segmentation recognition result (namely a second-stage similarity matching image).

And carrying out binarization processing on the first segmentation recognition result by adopting a maximum inter-class variance method to obtain a similarity matrix S. The similarity matrix S is used as a label of positive and negative samples, so that the number of subsequent point sampling calculation is reduced. A value of 1 in the specific similarity matrix is considered as a positive sample, and a value of 0 is considered as a negative sample.

And (3) carrying out point sampling on the intermediate image, uniformly sampling t times t points on the whole image, and dividing all sampling points into batches (batch) for sampling from top to bottom to obtain k batches, wherein the number of the sampling points in each batch is t times t/k. In the embodiment of the invention, t=30 and k=4.

In order to reduce the calculation consumption, the sampling point batches are sequentially input into the segmentation model from top to bottom to sequentially obtain corresponding segmentation mask subgraphs. In order to reduce the influence of repeated calculation, the segmentation mask map generated by the preamble history is used as a priori to filter the sampling points in the current batch when the current batch is calculated. The specific criteria are as follows: if the preamble history generated segmentation mask map covers a sampling point in the current batch, the sampling point is considered.

The similarity matrix S, the sampling point P and the image characteristics of the first object of interestf _R Inputting the segmentation model to generate an optimized segmentation mask as a second segmentation recognition result. The specific formula is as follows:

。

if the position coordinate corresponding to the sampling point P is 1, calculating a segmentation mask of the current sampling point; if the corresponding position value of the similarity matrix S is 0, the segmentation mask of the current sampling point is not calculated.

Through the flow, accurate counting of the intermediate images can be completed on the basis of zero samples.

The fifth embodiment of the present invention will be described below.

Based on the counting model described in the above embodiment, the embodiment of the present invention further describes the flow of the image generating method.

In the image generating method provided by the embodiment of the present invention, in S102, image recognition is performed on an intermediate image, and a first-stage denoising vector of a text-generated graph diffusion model is updated according to an image recognition result and a content error of a text to be processed, so as to obtain a second-stage denoising vector, including:

part-of-speech analysis is carried out on the text to be processed to obtain object types and corresponding object numbers;

and updating the first-stage denoising vector by using the object quantity loss value to obtain a second-stage denoising vector.

In specific implementation, part-of-speech analysis is performed on the text to be processed to obtain the object typeC _i And the corresponding number of objectsN _i . For example. The text to be processed is three oranges and four eggs, and can be identifiedC ₁ Is made of orange and is made of the orange,C ₂ is a kind of egg, which is made of chicken,N ₁ =3，N ₂ =4。

object recognition is carried out on the intermediate image according to the object type, and a counting result of the intermediate image is obtained, which can comprise:

and calculating an attention map of each object type by using a meristematic map diffusion model.

Gaussian noise and current number of denoising stepsTAnd inputting a diffusion model of the venturi image, and calculating to obtain an initial first-stage denoising vector and an initial attention map corresponding to each object type in the intermediate image.

From an initial attention diagramMObtaining each object typeiCorresponding initial attention diagram. Wherein, in the first denoising process, gaussian noise +.>Is random Gaussian noise; in the subsequent denoising process, gaussian noise +.>And the Gaussian noise is obtained by sampling the denoising vector corresponding to the current denoising step number.

Re-weighting each initial attention map to obtain an attention map corresponding to each object type; specifically, the method can be obtained by the following formula:

；

wherein,is the firstiAttention diagram of individual object typej，k) Response value on coordinates +_>Is the firstiInitial attention map of individual object typesj，k) The response value in terms of the coordinates,min() In order to take the minimum value calculation formula,max() The formula is calculated for taking the maximum value.

At this point, a first stage denoising vector may be calculated. The denoising vector and the intermediate image in the first stage are specifically obtained through the following steps:

calculating an attention loss of the attention map; the method is specifically calculated by the following formula:

；

updating the initial first-stage denoising vector according to the attention loss calculation to obtain a first-stage denoising vector; the method is specifically calculated by the following formula:

；

Calculating according to the first-stage denoising vector to obtain a first-stage denoising hidden vector; the method is specifically calculated by the following formula:

；/>

decoding and calculating the denoising hidden vector in the first stage to obtain an intermediate image; the method is specifically calculated by the following formula:

；

wherein,for the first stage denoising vector, +.>Denoising vector for last denoising step, +.>、/>、/>Is super-parameter (herba Cinchi Oleracei)>Gaussian noise for counting the number of current denoising steps for the model>Is used for the purpose of determining the derivative of (c),Lto pay attention to the attention loss of the force diagram, +.>As an intermediate image of the object,Dec() For the decoding function +.>The hidden vector is denoised for the first stage. Hyper-parameters->Can be set to 0.1, super parameter +.>May be set to 0.1.

Then, according to the attention diagramFor intermediate image->Performing segmentation processing to obtain object typesiCorresponding segmented images. Types of objectsiThe corresponding segmented image can be used +.>And (3) representing. Wherein (1)>Is the firstiSplit image of individual object type +.>As an intermediate image of the object,Mask() Is a segmentation function.

Recognizing and obtaining object number recognition results corresponding to each divided image; and taking the object number identification result corresponding to each object type as the counting result of the intermediate image.

The object number loss calculation of the intermediate image is first performed based on the attention of each object type. According to the counting result of the intermediate image and the number of objects in the text to be processed, calculating to obtain the object number loss value of the intermediate image, wherein the object number loss value can be specifically calculated by the following formula:

；

Then, updating the first-stage denoising vector by using the object quantity loss value to obtain a second-stage denoising vector, wherein the second-stage denoising vector is obtained by the following calculation:

；

wherein,loss value for the number of objects->Is the firstiA segmented image of the individual object type,C() In order to make the counting model,is the firstiNumber of objects of individual object type, +.>For the second stage denoising vector, +.>For the first stage denoising vector, +.>、/>Is super-parameter (herba Cinchi Oleracei)>Gaussian noise for counting the number of current denoising steps for the model>Is a derivative of (a).

According to the second stage denoising vectorSampling to obtain Gaussian noise->Considered as completing one-step denoisingAnd (5) a flow.

Repeating the above steps for a totalTObtaining the denoising hidden vector of the second stageThe second stage denoising vector +.>Inputting into decoder, the generated image which finally accords with the text description of the text to be processed can be obtained>：

。

The sixth embodiment of the present invention will be described.

In addition to the scheme of optimizing the number of objects of the generated image of the diffusion model of the meridional chart described in the above embodiment, the embodiment of the present invention provides another scheme of optimizing the attributes of objects of the generated image of the diffusion model of the meridional chart.

In the image generating method provided by the embodiment of the present invention, in S103, image recognition is performed on an intermediate image, and a first-stage denoising vector is updated according to an image recognition result and a content error of a text to be processed, so as to obtain a second-stage denoising vector, which may include: performing part-of-speech analysis on the text to be processed to obtain an object type; performing image recognition on the intermediate image to obtain object attributes of the intermediate image; calculating to obtain an object attribute loss value of the intermediate image according to the object type and the object attribute; and updating the first-stage denoising vector by using the object attribute loss value to obtain a second-stage denoising vector.

In implementations, the object properties may include color, texture, shape, and the like.

Image recognition is performed on the intermediate image to obtain object attributes of the intermediate image, which may include: calling a segmentation model to carry out segmentation recognition on the intermediate image to obtain a third segmentation recognition result of the intermediate image; and carrying out object attribute identification based on the third segmentation identification result to obtain object attributes of the object contained in the intermediate image.

Namely, the attribute of the object contained in the intermediate image is obtained by segmentation and identification of the intermediate image. And comparing an attribute detection result of an object contained in the intermediate image with a standard attribute corresponding to the object type described by the text to be processed to obtain an object attribute loss value of the intermediate image, and updating the first-stage denoising vector by using the object attribute loss value to obtain a second-stage denoising vector.

Various embodiments corresponding to the image generation method are described in detail above, and on the basis of the embodiments, the invention also discloses an image generation device, equipment and a computer readable storage medium corresponding to the method.

The seventh embodiment of the present invention will be described.

As shown in fig. 7, an image generating apparatus provided in an embodiment of the present invention includes:

An obtaining unit 701, configured to obtain a text to be processed;

a denoising unit 702, configured to perform denoising processing on a text to be processed for a preset number of times by using a text-to-text graph diffusion model; in each denoising process, performing image recognition on the intermediate image, and updating a first-stage denoising vector of a text-to-image diffusion model according to an image recognition result and a content error of a text to be processed to obtain a second-stage denoising vector, wherein the second-stage denoising vector is used as a denoising vector of the denoising process;

and an output unit 703, configured to generate a result image corresponding to the text to be processed using the final second stage denoising vector.

In some implementations, the denoising unit 702 performs image recognition on the intermediate image, updates a first-stage denoising vector of the text-to-text graph diffusion model according to the image recognition result and the content error of the text to be processed, and obtains a second-stage denoising vector, including:

In some implementations, the denoising unit 702 performs object recognition on the intermediate image according to the object type, to obtain a count result of the intermediate image, including:

calling a segmentation model to carry out segmentation recognition based on a text-to-graph matching result to obtain a first segmentation recognition result;

In some implementations, the denoising unit 702 is further configured to:

binarization processing is carried out on the text-to-image matching result to obtain a binarized image;

dividing sub-frames in the binarized image to obtain a rectangular frame set;

the denoising unit 702 calls the segmentation model to perform segmentation recognition based on the graph-text matching result, to obtain a first segmentation recognition result, including:

and calling the segmentation model to carry out segmentation recognition based on the rectangular frame set, so as to obtain a first segmentation recognition result.

In some implementations, the denoising unit 702 invokes the segmentation model to perform segmentation recognition based on the graph-text matching result, resulting in a first segmentation recognition result, including:

Inputting the text-to-graph matching result into a segmentation model, and outputting a first initial segmentation mask and a first whole graph characteristic;

inputting the first initial segmentation mask and the first whole image feature into a Hadamard matrix product operation to obtain a first object image feature;

and calculating cosine similarity according to the image features of the first object of interest and the first whole image features to obtain a first segmentation recognition result.

In some implementations, the denoising unit 702 is further configured to:

calling a segmentation model to carry out segmentation recognition on the intermediate image to obtain a second segmentation recognition result of the intermediate image;

the denoising unit 702 performs statistics of the number of objects corresponding to the object type based on the first segmentation recognition result, to obtain a count result of the intermediate image, including:

and counting the number of the objects corresponding to the object types based on the first segmentation recognition result and the second segmentation recognition result to obtain a counting result of the intermediate image.

In some implementations, the denoising unit 702 invokes a segmentation model to segment and identify the intermediate image, resulting in a second segmentation identification result of the intermediate image, including:

inputting the intermediate image into a segmentation model, and outputting a second initial segmentation mask and a second whole graph characteristic;

inputting a second initial segmentation mask and a second whole graph characteristic into Hadamard matrix product operation to obtain a second object of interest image;

And calculating cosine similarity according to the second object image of interest and the second whole graph feature to obtain a second segmentation recognition result.

and calling a segmentation model to carry out segmentation recognition according to the first segmentation recognition result and the sampling result, so as to obtain a second segmentation recognition result.

In some implementations, the denoising unit 702 invokes the segmentation model to perform segmentation recognition according to the first segmentation recognition result and the sampling result, resulting in a second segmentation recognition result, including:

binarization processing is carried out on the first segmentation recognition result to obtain a similarity matrix;

extracting first attention object image features corresponding to the first segmentation recognition results;

and inputting the sampling result, the similarity matrix and the image features of the first object of interest into a segmentation model to obtain a second segmentation recognition result.

In some implementations, the denoising unit 702 samples the intermediate image to obtain a sampling result of the intermediate image, including:

and sampling the intermediate image according to the principle that the sampling density of the positive sample position with the median value of 1 in the similarity matrix is higher than that of the negative sample position with the median value of 0 in the similarity matrix, so as to obtain a sampling result.

invoking the segmentation model to conduct segmentation recognition according to the first segmentation recognition result and the sampling result to obtain a second segmentation recognition result, wherein the method comprises the following steps:

sequentially calling a segmentation model to carry out segmentation recognition according to the first segmentation recognition result and each batch of sampling results to obtain a second segmentation recognition result corresponding to each batch of sampling results;

and taking the second segmentation recognition result corresponding to the sampling result of each batch as a final second segmentation recognition result.

In some implementations, the denoising unit 702 performs batch sampling on the intermediate image, and sequentially obtains a plurality of batch sampling results, including:

and determining the number of sampling points of each batch according to the size of the memory divided by the equipment for the text-generated graph task, and carrying out batch sampling on the intermediate image according to the number of the sampling points of each batch to sequentially obtain a plurality of batch sampling results.

calculating to obtain attention force diagram of each object type by using a diffusion model of the meristematic map;

dividing the intermediate image according to the attention diagram to obtain a divided image corresponding to each object type;

recognizing and obtaining object number recognition results corresponding to each divided image;

In some implementations, the denoising unit 702 calculates an object number loss value of the intermediate image according to the counting result of the intermediate image and the number of objects in the text to be processed, specifically by the following formula:

；

the denoising unit 702 updates the first stage denoising vector by using the object number loss value to obtain a second stage denoising vector, which is specifically calculated by the following formula:

；

wherein,loss value for the number of objects->Is the firstiA segmented image of the individual object type,C() In order to make the counting model,is the firstiNumber of objects of individual object type, +.>For the second stage denoising vector, +.>For the first stage denoising vector, +.>、/>Is super-parameter (herba Cinchi Oleracei)>Gaussian noise for counting the number of current denoising steps for the model >Is a derivative of (a).

In some implementations, the denoising unit 702 computes an attention map for each object type using a meridional graph diffusion model, including:

inputting Gaussian noise and the current denoising step number into a text-generated graph diffusion model, and calculating to obtain an initial first-stage denoising vector and an initial attention map corresponding to each object type in an intermediate image;

re-weighting each initial attention map to obtain an attention map corresponding to each object type;

each initial attention attempt is re-weighted, specifically by:

；/>

calculating an attention loss of the attention map;

updating the initial first-stage denoising vector according to the attention loss calculation to obtain a first-stage denoising vector;

decoding and calculating the denoising hidden vector in the first stage to obtain an intermediate image;

Wherein, the attention loss of the attention map is calculated, specifically by the following formula:

；

updating the initial first-stage denoising vector according to the attention loss calculation to obtain a first-stage denoising vector, wherein the first-stage denoising vector is obtained by the following calculation:

；

according to the first stage denoising vector calculation, a first stage denoising hidden vector is obtained, and specifically, the first stage denoising hidden vector is obtained through the following calculation:

；

decoding and calculating the denoising hidden vector in the first stage to obtain an intermediate image, wherein the intermediate image is obtained by the following formula:

；

wherein,for the first stage denoising vector, +.>Denoising vector for last denoising step, +.>、/>、/>Is super-parameter (herba Cinchi Oleracei)>Gaussian noise for counting the number of current denoising steps for the model>Is used for the purpose of determining the derivative of (c),Lto pay attention to the attention loss of the force diagram, +.>As an intermediate image of the object,Dec() For the decoding function +.>The hidden vector is denoised for the first stage.

Calculating to obtain an object attribute loss value of the intermediate image according to the object type and the object attribute;

and updating the first-stage denoising vector by using the object attribute loss value to obtain a second-stage denoising vector.

In some implementations, the denoising unit 702 performs image recognition on the intermediate image to obtain object properties of the intermediate image, including:

and carrying out object attribute identification based on the third segmentation identification result to obtain object attributes of the object contained in the intermediate image.

Since the embodiments of the apparatus portion and the embodiments of the method portion correspond to each other, the embodiments of the apparatus portion are referred to the description of the embodiments of the method portion, and are not repeated herein.

The eighth embodiment of the present invention will be described.

As shown in fig. 8, an image generating apparatus provided by an embodiment of the present invention includes:

a memory 810 for storing a computer program 811;

processor 820 for executing a computer program 811, which computer program 811 when executed by processor 820 realizes the steps of the image generating method according to any of the embodiments described above.

Processor 820 may include one or more processing cores, such as a 3-core processor, an 8-core processor, or the like, among others. Processor 820 may be implemented in hardware in at least one of digital signal processing DSP (Digital Signal Processing), field programmable gate array FPGA (Field-Programmable Gate Array), and programmable logic array PLA (Programmable Logic Array). Processor 820 may also include a main processor, which is a processor for processing data in an awake state, also referred to as central processor CPU (Central Processing Unit), and a coprocessor; a coprocessor is a low-power processor for processing data in a standby state. In some embodiments, the processor 820 may be integrated with an image processor GPU (Graphics Processing Unit), a GPU for use in responsible for rendering and rendering of content required for display by the display screen. In some embodiments, the processor 820 may also include an artificial intelligence AI (Artificial Intelligence) processor for processing computing operations related to machine learning.

Memory 810 may include one or more computer-readable storage media, which may be non-transitory. Memory 810 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory 810 is at least used to store a computer program 811, where the computer program 811, when loaded and executed by the processor 820, is capable of implementing the relevant steps in the image generation method disclosed in any of the foregoing embodiments. In addition, the resources stored by the memory 810 may also include an operating system 812, data 813, and the like, and the storage manner may be transient storage or permanent storage. The operating system 812 may be Windows. The data 813 may include, but is not limited to, data related to the methods described above.

In some embodiments, the image generation device may further include a display 830, a power supply 840, a communication interface 850, an input/output interface 860, a sensor 870, and a communication bus 880.

Those skilled in the art will appreciate that the structure shown in fig. 8 does not constitute a limitation of the image generation apparatus and may include more or less components than illustrated.

The image generating device provided by the embodiment of the invention comprises the memory and the processor, and the processor can realize the image generating method when executing the program stored in the memory, so that the effects are the same.

The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements steps such as an image generation method.

The computer readable storage medium may include: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (ram) RAM (Random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The computer program included in the computer-readable storage medium provided in this embodiment can implement the steps of the image generation method described above when executed by a processor, and the same effects are achieved.

The image generation method, apparatus, device and computer readable storage medium provided by the present invention are described in detail above. In the description, each embodiment is described in a progressive manner, and each embodiment is mainly described by the differences from other embodiments, so that the same similar parts among the embodiments are mutually referred. The apparatus, device and computer readable storage medium of the embodiments are described more simply because they correspond to the methods of the embodiments, and the description thereof will be given with reference to the method section. It should be noted that it will be apparent to those skilled in the art that the present invention may be modified and practiced without departing from the spirit of the present invention.

It should also be noted that in this specification, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.

Claims

1. An image generation method, comprising:

acquiring a text to be processed;

Generating a result image corresponding to the text to be processed by utilizing the final denoising vector of the second stage;

the image recognition is performed on the intermediate image, the first-stage denoising vector of the diffusion model of the text to be processed is updated according to the image recognition result and the content error of the text to be processed, and a second-stage denoising vector is obtained, and the method comprises the following steps:

performing object recognition on the intermediate image according to the object type to obtain a counting result of the intermediate image; the counting result of the intermediate image is the number of the objects which are identified from the intermediate image and accord with the object types in the text to be processed;

updating the first-stage denoising vector by using the object quantity loss value to obtain the second-stage denoising vector;

or alternatively, the first and second heat exchangers may be,

the image recognition is carried out on the intermediate image, the first-stage denoising vector of the diffusion model of the text to be processed is updated according to the image recognition result and the content error of the text to be processed, and the second-stage denoising vector is obtained, and the method comprises the following steps:

2. The image generating method according to claim 1, wherein the performing object recognition on the intermediate image according to the object type to obtain a count result of the intermediate image includes:

3. The image generation method according to claim 2, characterized by further comprising:

Dividing sub-frames in the binarized image to obtain a rectangular frame set;

4. The image generating method according to claim 2, wherein the calling the segmentation model to perform segmentation recognition based on the text-to-image matching result to obtain a first segmentation recognition result includes:

5. The image generation method according to claim 2, characterized by further comprising:

6. The image generating method according to claim 5, wherein the calling the segmentation model to perform segmentation recognition on the intermediate image to obtain a second segmentation recognition result of the intermediate image includes:

7. The image generating method according to claim 5, wherein the calling the segmentation model to perform segmentation recognition on the intermediate image to obtain a second segmentation recognition result of the intermediate image includes:

8. The image generation method according to claim 7, wherein the calling the segmentation model to perform segmentation recognition according to the first segmentation recognition result and the sampling result to obtain the second segmentation recognition result includes:

9. The image generation method according to claim 8, wherein the sampling the intermediate image to obtain a sampling result of the intermediate image includes:

10. The image generation method according to claim 7, wherein the sampling the intermediate image to obtain a sampling result of the intermediate image includes:

11. The method of claim 10, wherein the batch sampling the intermediate image sequentially obtains a plurality of batches of the sampling result, including:

12. The method of claim 10, wherein the batch sampling the intermediate image sequentially obtains a plurality of batches of the sampling result, including:

13. The image generating method according to claim 1, wherein the performing object recognition on the intermediate image according to the object type to obtain a count result of the intermediate image includes:

14. The image generating method according to claim 13, wherein the calculating obtains the object number loss value of the intermediate image according to the counting result of the intermediate image and the number of objects in the text to be processed, specifically by the following formula:

；

wherein,for the object quantity loss value, +.>Is the firstiSaid segmented image of each said object type,C() For the counting model, ++>Is the firstiThe number of objects of each of said object types, +.>For the second stage denoising vector, +.>For said first stage denoising vector, < > and>、/>is super-parameter (herba Cinchi Oleracei)>Gaussian noise for the current number of denoising steps for the counting model>Is a derivative of (a).

15. The image generation method according to claim 13, wherein said calculating an attention map for each of said object types using said meridional graph diffusion model comprises:

；

Wherein,is the firstiThe first degree of the attention map for each of the object typesj，k) Response value on coordinates +_>Is the firstiA first degree of the initial attention profile of each of the object typesj，k) The response value in terms of the coordinates,min() In order to take the minimum value calculation formula,max() The formula is calculated for taking the maximum value.

16. The image generation method according to claim 15, wherein the first stage denoising vector and the intermediate image are obtained specifically by:

calculating a loss of attention of the attention profile;

；

wherein,for said first stage denoising vector, < > and>denoising vector for last denoising step, +.>、/>、/>Is super-parameter (herba Cinchi Oleracei)>Gaussian noise for counting the number of current denoising steps for the model>Is used for the purpose of determining the derivative of (c),Lattention loss for said attention profile, < >>In order to provide the intermediate image with a picture,Dec() For the decoding function +.>The hidden vector is denoised for the first stage.

17. The image generation method according to claim 1, wherein the performing image recognition on the intermediate image to obtain the object attribute of the intermediate image includes:

18. An image generating apparatus, comprising:

the acquisition unit is used for acquiring the text to be processed;

The output unit is used for generating a result image corresponding to the text to be processed by utilizing the final denoising vector of the second stage;

the denoising unit performs image recognition on the intermediate image, updates a first-stage denoising vector of the text-to-text diffusion model according to an image recognition result and a content error of the text to be processed, and obtains a second-stage denoising vector, and the denoising unit comprises:

or alternatively, the first and second heat exchangers may be,

the denoising unit performs image recognition on the intermediate image, updates a first-stage denoising vector of the diffusion model of the text to be processed according to an image recognition result and a content error of the text to be processed, and obtains a second-stage denoising vector, and the denoising unit comprises:

19. An image generating apparatus, characterized by comprising:

a memory for storing a computer program;

a processor for executing the computer program, which when executed by the processor realizes the steps of the image generation method according to any one of claims 1 to 17.

20. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the image generation method according to any one of claims 1 to 17.