CN116630482A - Image generation method based on multi-modal retrieval and contour guidance

- Publication number: CN116630482A
- Application number: CN202310919649.2A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T11/60 - Editing figures and text; combining figures or text
- G06F16/532 - Query formulation, e.g. graphical querying
- G06F16/5846 - Retrieval using metadata automatically derived from the content using extracted text
- G06N3/0464 - Convolutional networks [CNN, ConvNet]
- G06T5/77 - Retouching; inpainting; scratch removal
- G06T7/13 - Edge detection
- G06V10/25 - Determination of region of interest [ROI] or volume of interest [VOI]
- G06V10/761 - Proximity, similarity or dissimilarity measures
- G06V10/764 - Classification, e.g. of video objects
- G06V10/806 - Fusion of extracted features
- G06V10/82 - Recognition using neural networks
- G06T2207/20084 - Artificial neural networks [ANN]
- G06T2207/30196 - Human being; person
- G06T2207/30201 - Face
- Y02D10/00 - Energy efficient computing
Abstract
The invention provides an image generation method based on multimodal retrieval and contour guidance, comprising the following steps. S1: generate an original image by image-text multimodal retrieval: input a positive prompt text (Prompt), perform word segmentation and vectorization on it, and output the gallery images that meet a similarity threshold as original images. S2: text detection. S3: image restoration, removing poorly generated elements from the image. S4: edge detection. S5: guide-text generation. S6: conditional image generation: set up a latent diffusion model that supports external input conditions; the contour map generated in step S4 is input as the external condition, and the final image is conditionally generated in the diffusion model using the guide text generated in step S5 and output. The method has good generality: by detecting the layout structure of an existing image to guide generation, it effectively improves the quality of the generated images.
Description
Technical Field
The invention relates to the technical field of image recognition, and in particular to an image generation method based on multimodal retrieval and contour guidance.
Background
In the prior art, latent Diffusion Models (latent Diffusion model, LDM) generates an image by iterating original noise data in a high-dimensional representation space, then decodes the representation result into a complex and fine image, and significantly reduces the computational complexity of the Diffusion model (Diffusion), so that a high-definition picture can be generated in a short time on a device with lower computational effort by using characters to generate the picture, the threshold of model landing is greatly reduced, and the heat in the field of generating the picture by using characters is also brought. While Stable Diffusion was improved based on Latent Diffusion Models, adding more training data and using more advanced text encoders and larger generation sizes (512 x 512 and 768 x 768) making it dedicated to literally generate picture tasks. Although the existing better-effect Latent Diffusion Model (LDM) generally exceeds the GANs (generated countermeasure model) and the LSGM (generated model based on the latent space) in the FID and Precision andRecall indexes, the existing diffusion model has the following problems if generating an Image (Text to Image) directly according to Text guidance:
(1) The quality of the keywords in the Prompt (positive prompt text) that guides image generation changes the quality of the text encoding result, which in turn markedly affects the quality of the generated image, introducing a degree of randomness and uncontrollability;
(2) Because the training data set covers limited scenes and semantics, at a given number of iterations the generated results for complex scenes are markedly worse than for simple scenes;
(3) In some special scenes, such as generating human faces or text, the generated pictures are often poor and again exhibit randomness and uncontrollability.
Disclosure of Invention
To solve the randomness and uncontrollability of image generation in the prior art, an image generation method based on multimodal retrieval and contour guidance is provided; by reducing this randomness and uncontrollability, the image generation quality is improved.
The specific scheme is as follows:
An image generation method based on multimodal retrieval and contour guidance, comprising:
S1: generating an original image by image-text multimodal retrieval: inputting a positive prompt text (Prompt), performing word segmentation and vectorization on it, computing its similarity with the images in a gallery, and outputting the gallery images that meet a similarity threshold as original images;
S2: text detection: inputting the original images generated in S1, selecting one image from the set of original images, locating any text in the image through text detection, and generating a Mask image;
S3: image restoration: inputting the Mask image generated in S2; if text is detected in the Mask image, erasing the text with an image restoration function and outputting the result as the repaired image; if there is no text, outputting the Mask image directly as the repaired image;
S4: contour condition generation: inputting the repaired image generated in S3, performing edge detection on it, obtaining the layout contour, and generating a contour map that serves as the guiding constraint for image generation;
S5: guide-text generation: using the positive prompt text from S1 to detect the generation scene of the image, and for each scene setting a fixed positive keyword prompt (Prompt) and negative keyword prompt (Negative Prompt) as the guide text;
S6: image contour guidance: setting up a latent diffusion model that supports external input conditions; inputting the guide text generated in S5, taking the contour map generated in S4 as the external condition, performing conditional contour guidance in the latent diffusion model, and generating and outputting the final image.
Preferably, the vectorization method in S1 is: encode the text and the image with a text encoder and an image encoder respectively, and compute the cosine similarity $\cos(\theta)$ between the text vector $x$ and the image vector $y$:

$\cos(\theta) = \dfrac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}}$    (1)
If the cosine similarity exceeds the set threshold, the image meets the similarity condition; all images meeting the condition are gathered into a set, and one image is randomly selected from the set as the original image.
Preferably, the method of image restoration in S3 is: detect text in the image with OCR, mark the text positions, generate a Mask image with the text marked, and use Stable Diffusion inpainting to erase the text so that the erased region blends into the picture background.
Preferably, the method of erasing text by image restoration with the diffusion model is: leave the positive prompt text (Prompt) blank, set the negative prompt text (Negative Prompt) to "text, poster text, painting text", and raise the CFG (classifier-free guidance) scale to strengthen the influence of the text, so that the features obtained from text encoding guide the generation process to keep only the image background and remove the text.
Preferably, the method of edge detection in S4 is: choose an edge detection algorithm according to the scene of the original image, obtain the contour of the image subject, and generate a Mask image containing only grayscale information as the contour map.
Preferably, in S6 the loss function $L_{LDM}$ of the latent diffusion model supporting external input conditions is:

$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right\|_2^2 \right]$    (2)

where $\epsilon \sim \mathcal{N}(0,1)$ represents the Gaussian sampling process; $\epsilon_\theta(z_t, t, \cdot)$ represents the time-conditional denoising autoencoding process; $\tau_\theta$ represents the encoder for the external condition, which maps the external condition $y$ to an intermediate-layer representation.
Preferably, the method of conditional contour guidance in the latent diffusion model in S6 is:
S61: encode the input text and image into features with encoders whose text-feature and image-feature representations are aligned;
S62: concatenate (Concat) the features encoded in S61 into a single feature, and add random noise to it so that the generated image differs each time; if only text features are present, no concatenation is needed;
s63: the features added with random noise in the step S63 are sent to a feature predictor UNet for noise reduction and diffusion and iteration, and features which are closer to a real image are generated in an implicit space; the noise reduction diffusion process of the diffusion model is as follows:
(3)
wherein ,representing a gaussian sampling process; />Indicating that it introduces an implicit encoder->,The method comprises the steps of carrying out a first treatment on the surface of the t= … T is a time-series denoising self-encoding process, which is based on the input +.>Predicting the corresponding denoised result, whereinIs the result of inputting x after adding noise;
s64: features in the implicit space are converted into images in the pixel space by the variational self-encoder VAE and output.
Preferably, in S61 the text and the image are encoded into features as follows:

$z = \mathcal{E}(x)$    (4)

where $\mathcal{E}$ is the introduced latent encoder and the mapping converts $x$ into $z$, so that text or image features can be represented in the latent space. When an image is input to the diffusion model, the model first encodes the input image into the latent feature space with the image encoder, while encoding the input text into the same latent feature space with the text encoder.
Preferably, in S63 the UNet has a symmetrical front-back structure: the first half contains 8 main encoding blocks and the second half contains 8 main decoding blocks; each block contains a convolutional neural network (ResNet) and an image attention network (ViT); the ResNet handles encoding and decoding of global features, while the ViT includes cross-attention and self-attention mechanisms for encoding and decoding local features.
Preferably, in S63, the method of noise reduction and diffusion by the feature predictor UNet is: using a micro-network consisting of 4 cores and 2 x 2 step convolutional layersAs an encoder, the shape is 16×32×64×128, and the image space condition is +.>The encoding is feature mapping:
(5)
the feature map is converted; the network will convert 512 x 512 image conditions into 64 x 64 feature maps.
Preferably, in S63 the influence of the text features is controlled through the numerical variable CFG (classifier-free guidance scale) embedded in the UNet, which governs how strongly the text prompt steers the diffusion process.
The invention provides an image generation method based on multimodal retrieval and contour guidance, comprising: S1: generating an original image by image-text multimodal retrieval: inputting a positive prompt text (Prompt), performing word segmentation and vectorization on it, computing its similarity with the images in a gallery, and outputting the gallery images that meet a similarity threshold as original images; S2: text detection; S3: image restoration, removing poorly generated elements from the image; S4: edge detection; S5: guide-text generation, setting a fixed positive keyword prompt and negative keyword prompt as the guide text for each scene, improving the generation quality in specific scenes; S6: conditional image generation: setting up a latent diffusion model that supports external input conditions; the contour map generated in S4 is input as the external condition, and the final image is conditionally generated in the latent diffusion model using the guide text from S5 and output. This step encodes the contour map into a feature that participates in the cyclic diffusion process and thereby influences generation, ensuring the layout and quality of the generated image. The technical features of the invention yield the following combined effects. First, the invention has good generality: image generation is guided by detecting the layout structure of an existing image, which reduces the randomness and uncontrollability of generation and improves the quality of the generated images.
Second, compared with generating images from text information alone, the invention lowers the difficulty of producing high-quality images: no overly complex and detailed Prompt needs to be designed, and relying on the layout information of an existing image makes the generated result more dependable. Third, the invention optimizes scenes where generation quality is poor, for example guiding the generation of human bodies and faces through contour detection, and making the generated pictures more realistic by repairing text that erases poorly.
Drawings
FIG. 1 is a flow chart of a method of image generation based on multimodal retrieval and contour guidance.
Fig. 2 is a home-scene image generated by a conventional image generation method.
Fig. 3 is a home-scene image generated by the method of the present invention.
FIG. 4 is a library image generated by a conventional image generation method.
FIG. 5 is a library image generated by the method of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
As shown in Fig. 1, an image generation method based on multimodal retrieval and contour guidance comprises:
S1: generating an original image by image-text multimodal retrieval: inputting a positive prompt text (Prompt), performing word segmentation and vectorization on it, computing its similarity with the images in a gallery, and outputting the gallery images that meet a similarity threshold as original images;
S2: text detection: inputting the original images generated in S1, selecting one image from the set of original images, locating any text in the image through text detection, and generating a Mask image;
S3: image restoration: inputting the Mask image generated in S2; if text is detected in the Mask image, erasing the text with an image restoration function and outputting the result as the repaired image; if there is no text, outputting the Mask image directly as the repaired image;
S4: contour condition generation: inputting the repaired image generated in S3, performing edge detection on it, obtaining the layout contour, and generating a contour map that serves as the guiding constraint for image generation;
S5: guide-text generation: using the positive prompt text from S1 to detect the generation scene of the image, and for each scene setting a fixed positive keyword prompt (Prompt) and negative keyword prompt (Negative Prompt) as the guide text;
S6: image contour guidance: setting up a latent diffusion model that supports external input conditions; inputting the guide text generated in S5, taking the contour map generated in S4 as the external condition, performing conditional contour guidance in the latent diffusion model, and generating and outputting the final image.
Fig. 2 shows a home-scene image generated by a conventional method without contour guidance, and Fig. 3 one generated with contour guidance; Fig. 4 is a library image generated by a conventional method, and Fig. 5 one generated by the method of the invention. It can be seen that without the contour map as guidance the generated result easily loses its sense of space, and much of the furniture looks like smeared color blocks lacking realism; with the contour map, far more realistic results are produced.
Preferably, the vectorization method in S1 is: encode the text and the image with a text encoder and an image encoder respectively, and compute the cosine similarity $\cos(\theta)$ between the text vector $x$ and the image vector $y$:

$\cos(\theta) = \dfrac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}}$    (1)
If the cosine similarity exceeds the set threshold, the image meets the similarity condition; all images meeting the condition are gathered into a set, and one image is randomly selected from the set as the original image.
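Equation (1) and the threshold-based retrieval can be written directly, assuming the encoders have already produced the vectors (the encoders themselves are not shown):

```python
import math

def cosine_similarity(x, y):
    """Equation (1): cos(theta) between a text vector x and an image vector y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def retrieve(text_vec, gallery_vecs, threshold=0.8):
    """Return indices of gallery images whose similarity meets the threshold."""
    return [i for i, v in enumerate(gallery_vecs)
            if cosine_similarity(text_vec, v) >= threshold]
```

The threshold value 0.8 is illustrative; the patent only specifies that a threshold is set, not its value.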
Preferably, the method of image restoration in S3 is: detect text in the image with OCR, mark the text positions, generate a Mask image with the text marked, and use Stable Diffusion inpainting to erase the text so that the erased region blends into the picture background.
Preferably, the method of erasing text by image restoration with the diffusion model is: leave the positive prompt text (Prompt) blank, set the negative prompt text (Negative Prompt) to "text, poster text, painting text", and raise the CFG (classifier-free guidance) scale to strengthen the influence of the text, so that the features obtained from text encoding guide the generation process to keep only the image background and remove the text.
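The mask semantics of this step can be illustrated with a crude stand-in: replacing masked pixels with the background mean. This is NOT diffusion inpainting (the patent uses Stable Diffusion with an empty positive prompt, a negative prompt like "text, poster text, painting text" and a raised CFG scale); the toy below only shows what the Mask image selects for erasure.

```python
import numpy as np

def erase_masked_region(image, mask):
    """Toy erase: fill masked (text) pixels with the mean of the unmasked
    background so the erased region blends in. A real system would run
    diffusion inpainting over the masked region instead."""
    out = image.astype(float).copy()
    background_mean = out[~mask].mean()
    out[mask] = background_mean
    return out
```

In practice a diffusion inpainting pipeline receives exactly this pair (image, boolean text mask) plus the prompts described above.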
Preferably, the method of edge detection in S4 is: choose an edge detection algorithm according to the scene of the original image, obtain the contour of the image subject, and generate a Mask image containing only grayscale information as the contour map.
The edge detection algorithms include traditional image edge detection, typically the Canny edge detector, and deep-learning-based image segmentation. The Canny detector's main advantages are a low error rate, good localization of single edge points, and a low rate of duplicate detections; its main steps are:
(1) Apply Gaussian smoothing to the input image, primarily to reduce the error rate;
(2) Estimate the edge gradient magnitude and direction at each pixel;
(3) Apply non-maximum suppression to the gradient magnitudes along the gradient direction;
(4) Detect and connect edges with a double threshold. The double threshold consists of a low threshold and a high threshold: pixels below the low threshold are set to 0, and pixels above the high threshold are marked as edge points and set to 1 (or 255).
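Step (4) can be sketched as follows. This is a simplified one-pass version of the hysteresis stage (real Canny propagates the promotion transitively); the threshold values in the test are arbitrary examples.

```python
import numpy as np

def double_threshold(gradient, low, high):
    """Double-threshold edge linking: pixels below `low` become 0, pixels
    at or above `high` become strong edges (255), and in-between (weak)
    pixels are kept only if a 4-neighbour is a strong edge. One-pass
    simplification of full hysteresis."""
    strong = gradient >= high
    weak = (gradient >= low) & ~strong
    out = np.zeros_like(gradient, dtype=np.uint8)
    out[strong] = 255
    # promote weak pixels adjacent to a strong pixel (up/down/left/right)
    padded = np.pad(strong, 1)
    neighbour_strong = (padded[:-2, 1:-1] | padded[2:, 1:-1] |
                        padded[1:-1, :-2] | padded[1:-1, 2:])
    out[weak & neighbour_strong] = 255
    return out
```

Libraries such as OpenCV expose the full pipeline (smoothing, gradients, non-maximum suppression, hysteresis) in a single call; the sketch only isolates the double-threshold idea described above.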
Deep-learning-based image segmentation generally refers to CNN (convolutional neural network) based algorithms, typically Mask R-CNN, whose main principle is:
(1) Use a convolutional neural network to obtain a feature map of the image;
(2) Place N candidate ROI anchors at each point of the feature map, giving a set of candidate ROI regions, where N is preset and corresponds to the number of classes;
(3) Feed the candidate ROIs into an RPN (region proposal network, which predicts object bounding boxes) for binary classification, filtering out part of the candidates;
(4) Apply the ROIAlign operation to the remaining ROI regions (i.e. put the pixel positions of the original image and the feature map into correspondence);
(5) Classify these ROIs (an N-way classification that yields class information) and generate Masks (inside each ROI region, an FCN performs deconvolution and pooling operations to obtain the segmentation edges of the image entities).
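The correspondence in step (4) is the essence of ROIAlign: ROI coordinates are divided by the backbone stride without rounding, and feature values at the resulting fractional positions are read by bilinear interpolation. A minimal sketch (the stride value and feature map are illustrative):

```python
import math

def roi_to_feature_coords(roi, stride):
    """Map an ROI from original-image pixel coordinates onto the feature
    map. ROIAlign keeps the division exact (floats) instead of rounding
    as ROI Pooling does. `stride` is the CNN downsampling factor."""
    x0, y0, x1, y1 = roi
    return (x0 / stride, y0 / stride, x1 / stride, y1 / stride)

def bilinear_sample(fmap, x, y):
    """Read the feature map at a fractional position, as ROIAlign does."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * fmap[y0][x0] + dx * (1 - dy) * fmap[y0][x0 + 1] +
            (1 - dx) * dy * fmap[y0 + 1][x0] + dx * dy * fmap[y0 + 1][x0 + 1])
```

Avoiding the rounding step is what keeps original-image and feature-map pixel positions in exact correspondence, which matters for pixel-accurate mask edges.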
In practice, for images with complex backgrounds or excessive sharpness, Canny edge detection easily produces overly dense and fragmented edges, which guides Stable Diffusion to generate images with scattered, unrealistic colors; conversely, when the image is not sharp enough, the edges obtained by Canny are incomplete and hard to close, so that the subject color and background color bleed together when guiding Stable Diffusion. Neither case suits traditional edge detection, and both are better handled with deep-learning-based image segmentation.
In S5, the positive prompt text is used to optimize the image, for example by adding modifier words that describe image details: changing "sky, clouds" into "sky glowing fiery red under the sunset, clouds", or "a puppy sitting on a bench" into "a brown curly-haired dog sitting on a park bench, lawn background, cloudy weather". Keywords such as "intricate, fine, …, sharpened" are added to raise generation quality. The result in specific scenes can also be optimized through the negative prompt text (Negative Prompt); for example, for generating faces and human bodies, keywords such as "bad hands, missing fingers, multiple fingers, …, bad face" can be added to the negative prompt to avoid generating unrealistic human anatomy.
The specific scenes here mainly mean portrait scenes and scenes containing complex spatial relationships. For example, in a half-length portrait, a diffusion model easily collapses on face details, hand details and limb poses, producing obviously unrealistic image details; likewise a bedroom view, containing a bed, wardrobe, windows, ceiling lights and so on with clear spatial relationships among them, often comes out of a diffusion model looking fragmented.
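Step S5 amounts to a fixed scene-to-keywords table appended to the user's prompt. The sketch below is a hypothetical table: the keyword strings for "portrait" are examples taken from the description above, while the "interior" entry and the table structure itself are illustrative assumptions.

```python
# Hypothetical per-scene fixed prompts in the spirit of S5.
SCENE_PROMPTS = {
    "portrait": ("intricate, fine detail, sharpened",
                 "bad hands, missing fingers, multiple fingers, bad face"),
    "interior": ("intricate, fine detail, sharpened",
                 "blurry, distorted perspective"),  # illustrative, not from the patent
}

def build_guide_text(user_prompt, scene):
    """Append the scene's fixed positive keywords and return the
    (positive, negative) guide-text pair; unknown scenes pass through."""
    positive, negative = SCENE_PROMPTS.get(scene, ("", ""))
    return (f"{user_prompt}, {positive}".strip(", "), negative)
```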
Preferably, the method of conditional contour guidance in the latent diffusion model in S6 is:
S61: encode the input text and image into features with encoders whose text-feature and image-feature representations are aligned;
S62: concatenate (Concat) the features encoded in S61 into a single feature, and add random noise to it so that the generated image differs each time; if only text features are present, no concatenation is needed;
s63: the features added with random noise in the step S63 are sent to a feature predictor UNet for noise reduction and diffusion and iteration, and features which are closer to a real image are generated in an implicit space;
s64: features in the implicit space are converted into images in the pixel space by the variational self-encoder VAE and output.
Preferably, in S61, when an image is supplied as input to the diffusion model, the model first encodes the input image into an implicit feature space with an image encoder, while encoding the input text into the same implicit feature space with a text encoder.
Preferably, in S63, the UNet has a front-back symmetric structure: the first half contains 8 main encoding blocks and the second half contains 8 main decoding blocks. Each block contains a convolutional neural network ResNet and an image attention network ViT; the ResNet handles encoding and decoding of global features, while the ViT, which includes cross-attention and self-attention mechanisms, handles encoding and decoding of local features.
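The mirror symmetry implies each encoding block's output can be skip-connected to the decoding block at the matching depth. A small sketch of that channel bookkeeping follows; the channel widths are illustrative assumptions, not values from the patent.

```python
# Channel plan for a toy symmetric UNet: 8 encoding blocks widen the
# features, 8 decoding blocks mirror them. Widths are made up for
# illustration only.
enc_channels = [32, 64, 96, 128, 192, 256, 384, 512]
dec_channels = list(reversed(enc_channels))

def skip_pairs(enc, dec):
    """Pair each encoder block with the decoder block that receives its
    skip connection in a mirror-symmetric UNet."""
    return list(zip(enc, reversed(dec)))
```

In a mirror-symmetric layout every skip connection joins two blocks of equal width, which is what makes the concatenation shapes line up.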
Preferably, in S63, the method of noise-reduction diffusion by the feature predictor UNet is: a tiny network $\mathcal{E}(\cdot)$ consisting of four convolution layers with 4×4 kernels and 2×2 strides (channel widths 16, 32, 64, 128) is used as an encoder that encodes the image-space condition $c_i$ into a feature map:

$c_f = \mathcal{E}(c_i)$ (5)

where $c_f$ is the converted feature map; the network converts a 512×512 image condition into a 64×64 feature map.
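The resolution arithmetic of this condition encoder is worth checking: each stride-s layer divides the spatial size by s, so mapping 512×512 down to the stated 64×64 requires a total stride of 8. Note that four stride-2 layers would give a total stride of 16 (512→32), so the layout below, with one stride-1 layer, is only one illustrative arrangement that reaches 64×64; the exact stride schedule is an assumption.

```python
def out_size(in_size, strides):
    """Spatial size after a chain of strided convolutions with
    'same'-style padding (each layer divides the size by its stride)."""
    for s in strides:
        in_size //= s
    return in_size

# Total stride 8 maps the 512x512 image-space condition onto the
# 64x64 latent grid; (2, 2, 2, 1) is one illustrative layout.
down_to_latent = out_size(512, (2, 2, 2, 1))
all_stride_two = out_size(512, (2, 2, 2, 2))
```

Four stride-2 layers overshoot to 32×32, which is why the stride schedule matters when matching the latent resolution.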
Preferably, in S63, the influence of the text features is controlled by the numerical variable CFG (classifier-free guidance scale) embedded in the UNet, which adjusts how strongly the text prompt steers the diffusion process.
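Classifier-free guidance combines an unconditional and a text-conditioned noise prediction at each step; the CFG scale sets how far the combination moves toward the conditioned branch. A minimal sketch of that mixing rule (the standard CFG formula, not code from the patent):

```python
import numpy as np

def cfg_mix(eps_uncond, eps_text, cfg_scale):
    """Classifier-free guidance: push the noise prediction away from the
    unconditional branch toward the text-conditioned one. cfg_scale = 1
    reproduces plain conditional sampling; larger values strengthen the
    prompt's influence."""
    return eps_uncond + cfg_scale * (eps_text - eps_uncond)

u = np.array([0.0, 1.0])   # unconditional prediction (toy values)
c = np.array([1.0, 1.0])   # text-conditioned prediction (toy values)
```

With scale 0 the prompt is ignored; with scale 1 the conditioned prediction is used as-is; with 7.5 the difference between the branches is amplified 7.5×.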
It should be noted that the above-described embodiments enable those skilled in the art to understand the invention more fully, but do not limit it in any way. Therefore, although the invention has been described in detail with reference to the drawings and examples, it will be understood by those skilled in the art that the invention may be modified or substituted with equivalents; in all cases, all technical solutions and modifications that do not depart from the spirit and scope of the invention are intended to be covered by the scope of the claims.
Claims (11)
1. An image generation method based on multi-modal retrieval and contour guidance, characterized in that:
S1: generating an original image by image-text multi-modal retrieval: inputting a forward Prompt text, performing word segmentation and vectorization on the forward Prompt text, calculating its similarity with images in a gallery, and outputting the gallery images that meet a similarity threshold as original images;
S2: text detection: inputting the original images generated in S1, selecting an image from the set of original images, first obtaining the positions of text in the image through text detection, and generating a Mask image;
S3: image restoration: inputting the Mask image generated in S2; if text is detected in the Mask image, erasing the text with an image restoration function and outputting the result as a repaired image; if there is no text, directly outputting the Mask image as the repaired image;
S4: generating contour conditions: inputting the repaired image generated in S3, performing edge detection on it to obtain the layout contour and generate a contour map, and using the contour map as the guiding constraint for image generation;
S5: generating guide text: detecting the generation scene of the image from the forward Prompt text input in S1, and setting a fixed positive keyword Prompt and a Negative keyword Prompt as guide texts for the different scenes;
S6: image contour guidance: setting up an implicit diffusion model that supports external input conditions; inputting the guide text generated in S5, taking the contour map generated in S4 as the external condition, performing conditional contour guidance in the implicit diffusion model, and generating and outputting the final image.
2. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 1, wherein the vectorization processing in S1 is: encoding the text and the image with a text encoder and an image encoder respectively, and calculating the cosine similarity cos(θ) between the text vector x and the image vector y:

$\cos(\theta) = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$ (1)

If the cosine similarity exceeds a set threshold, the image meets the similarity condition; all images meeting the similarity condition are compiled into a set, and one image is randomly selected from the set as the original image.
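Equation (1) and the threshold-then-sample selection of claim 2 can be sketched directly in numpy. The threshold value and the toy two-dimensional vectors are illustrative; real text/image embeddings are high-dimensional.

```python
import numpy as np

def cosine_sim(x, y):
    """cos(theta) between vectors x and y, per equation (1)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def retrieve_original(text_vec, image_vecs, threshold=0.8, rng=None):
    """Collect all gallery images above the threshold, then pick one at
    random as the original image (threshold value is illustrative)."""
    rng = rng or np.random.default_rng(0)
    hits = [i for i, v in enumerate(image_vecs)
            if cosine_sim(text_vec, v) >= threshold]
    return int(rng.choice(hits)) if hits else None

t = np.array([1.0, 0.0])                              # toy text vector
gallery = [np.array([1.0, 0.1]), np.array([0.0, 1.0])]  # toy image vectors
```

The first gallery vector is nearly parallel to the text vector and passes the threshold; the second is orthogonal and is filtered out.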
3. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 1, wherein the method of image restoration in S3 is: detecting text in the image with OCR, marking the positions of the text, generating a Mask image with the text marked, and performing image restoration with the diffusion model Stable Diffusion to erase the text so that the erased regions blend with the picture background.
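The Mask image of claim 3 is a binary map built from the OCR bounding boxes, marking the pixels the inpainting model is allowed to repaint. A minimal numpy sketch (the box format `(x0, y0, x1, y1)` is an assumption; OCR engines differ):

```python
import numpy as np

def text_mask(shape, boxes):
    """Build a binary Mask image from OCR bounding boxes
    (x0, y0, x1, y1); 1 marks pixels to be inpainted."""
    mask = np.zeros(shape, dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1
    return mask

m = text_mask((32, 32), [(4, 4, 12, 8)])  # one 8x4 text region
```

An all-zero mask corresponds to the "no text detected" branch of S3, where the image passes through unchanged.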
4. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 3, wherein the method of image restoration and text erasing by the diffusion model is: setting the forward Prompt text in the diffusion model, setting the Negative Prompt text to "text, poster text, painting text", and raising the strength of the text influence via CFG, so that the features obtained by text encoding guide the generation process to keep only the image background and remove the text.
5. The image generation method based on multi-modal retrieval and contour guidance as defined in claim 1, wherein the method of edge detection in S4 is: selecting an edge detection algorithm according to the scene of the original image, acquiring the contour of the image subject, and generating a Mask image containing only grayscale information as the contour map.
6. The image generation method based on multi-modal retrieval and contour guidance as set forth in claim 1, wherein in S6 the loss function $L_{LDM}$ of the implicit diffusion model supporting external input conditions is set as:

$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\,y,\,\epsilon\sim\mathcal{N}(0,1),\,t}\left[\left\|\epsilon-\epsilon_\theta\!\left(z_t,\,t,\,\tau_\theta(y)\right)\right\|_2^2\right]$ (2)

where $\epsilon\sim\mathcal{N}(0,1)$ represents the Gaussian sampling process; $\epsilon_\theta(z_t, t, \tau_\theta(y))$ represents the time-sequential denoising autoencoding process; $\tau_\theta$ represents the encoder for external conditions, which maps the external condition $y$ to an intermediate-layer representation.
7. The image generation method based on multi-modal retrieval and contour guidance as set forth in claim 1, wherein the method of conditional contour guidance in the implicit diffusion model in S6 is:
S61: encode the input text and image into features with an encoder that aligns the text-feature and image-feature representations;
S62: Concat-connect the features encoded in S61 into a single feature, and add random noise to it so that each generated image differs; if only text features are available, no Concat connection is needed;
S63: feed the noised features from S62 into the feature predictor UNet for iterative noise-reduction diffusion, generating features in the implicit space that are progressively closer to a real image; the noise-reduction diffusion process of the diffusion model is:

$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\,\epsilon\sim\mathcal{N}(0,1),\,t}\left[\left\|\epsilon-\epsilon_\theta(z_t, t)\right\|_2^2\right]$ (3)

where $\epsilon\sim\mathcal{N}(0,1)$ represents the Gaussian sampling process; the implicit encoder $\mathcal{E}$ is introduced, giving $z=\mathcal{E}(x)$; $\epsilon_\theta(z_t, t)$, $t = 1 \dots T$, is the time-sequential denoising autoencoding process, which predicts the corresponding denoised result from the input $z_t$, where $z_t$ is the result of adding noise to the encoded input $x$;
S64: convert the features in the implicit space into a pixel-space image with the variational autoencoder VAE and output it.
8. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 7, wherein in S61 the method of encoding text and images into features by the encoder is:

$z = \mathcal{E}(x)$ (4)

where the implicit encoder $\mathcal{E}$ is introduced; $\mathcal{E}$ converts $x$ into $z$, so that text or image features can be represented in the implicit space. When an image is input to the diffusion model, the model first encodes the input image into the implicit feature space with an image encoder, while encoding the input text into the same implicit feature space with a text encoder.
9. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 7, wherein in S63 the UNet has a front-back symmetric structure: the first half contains 8 main encoding blocks and the second half contains 8 main decoding blocks; each block contains a convolutional neural network ResNet and an image attention network ViT; the ResNet handles encoding and decoding of global features; the ViT includes cross-attention and self-attention mechanisms for encoding and decoding of local features.
10. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 7, wherein in S63 the method of noise-reduction diffusion by the feature predictor UNet is: a tiny network $\mathcal{E}(\cdot)$ consisting of four convolution layers with 4×4 kernels and 2×2 strides (channel widths 16, 32, 64, 128) is used as an encoder that encodes the image-space condition $c_i$ into a feature map:

$c_f = \mathcal{E}(c_i)$ (5)

where $c_f$ is the converted feature map; the network converts a 512×512 image condition into a 64×64 feature map.
11. The image generation method based on multi-modal retrieval and contour guidance as set forth in claim 7, wherein in S63 the degree of influence of the text features is controlled during iteration by the numerical variable CFG embedded in the UNet, which adjusts how strongly the text prompt steers the diffusion process.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310919649.2A CN116630482B (en) | 2023-07-26 | 2023-07-26 | Image generation method based on multi-mode retrieval and contour guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116630482A true CN116630482A (en) | 2023-08-22 |
CN116630482B CN116630482B (en) | 2023-11-03 |
Family
ID=87613903
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310919649.2A Active CN116630482B (en) | 2023-07-26 | 2023-07-26 | Image generation method based on multi-mode retrieval and contour guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116630482B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220005235A1 (en) * | 2020-07-06 | 2022-01-06 | Ping An Technology (Shenzhen) Co., Ltd. | Method and device for text-based image generation |
CN116012488A (en) * | 2023-01-05 | 2023-04-25 | 网易(杭州)网络有限公司 | Stylized image generation method, device, computer equipment and storage medium |
KR20230059524A (en) * | 2021-10-26 | 2023-05-03 | 삼성에스디에스 주식회사 | Method and apparatus for analyzing multimodal data |
CN116452706A (en) * | 2023-04-23 | 2023-07-18 | 中国工商银行股份有限公司 | Image generation method and device for presentation file |
CN116452410A (en) * | 2023-03-10 | 2023-07-18 | 浙江工业大学 | Text-guided maskless image editing method based on deep learning |
Non-Patent Citations (1)
Title |
---|
Zhou Zuowei; Qian Zhenzhen: "Image Editing Using Natural Language Text Descriptions", Electronic Technology & Software Engineering, No. 01, pages 119-121
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117725247A (en) * | 2024-02-07 | 2024-03-19 | 北京知呱呱科技有限公司 | Diffusion image generation method and system based on retrieval and segmentation enhancement |
CN117725247B (en) * | 2024-02-07 | 2024-04-26 | 北京知呱呱科技有限公司 | Diffusion image generation method and system based on retrieval and segmentation enhancement |
Also Published As
Publication number | Publication date |
---|---|
CN116630482B (en) | 2023-11-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||