CN116630482A - Image generation method based on multi-modal retrieval and contour guidance

- Publication number: CN116630482A
- Application number: CN202310919649.2A
- Authority: CN (China)
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T11/60 - Editing figures and text; combining figures or text
- G06F16/532 - Query formulation, e.g. graphical querying
- G06F16/5846 - Retrieval using metadata automatically derived from the content using extracted text
- G06N3/0464 - Convolutional networks [CNN, ConvNet]
- G06T5/77 - Retouching; inpainting; scratch removal
- G06T7/13 - Edge detection
- G06V10/25 - Determination of region of interest [ROI] or volume of interest [VOI]
- G06V10/761 - Proximity, similarity or dissimilarity measures
- G06V10/764 - Classification, e.g. of video objects
- G06V10/806 - Fusion of extracted features
- G06V10/82 - Recognition using neural networks
- G06T2207/20084 - Artificial neural networks [ANN]
- G06T2207/30196 - Human being; person
- G06T2207/30201 - Face
- Y02D10/00 - Energy efficient computing
Abstract
The invention provides an image generation method based on multimodal retrieval and contour guidance, comprising the following steps. S1: generate an original image by image-text multimodal retrieval: input a positive prompt text (Prompt), perform word segmentation and vectorization on it, and output the gallery images that meet a similarity threshold as original images. S2: text detection. S3: image restoration, removing poorly generated elements from the image. S4: edge detection. S5: guide-text generation. S6: conditional image generation: set up a latent diffusion model that supports external input conditions; the contour map generated in step S4 is input as the external condition, and the final image is conditionally generated in the diffusion model using the guide text generated in step S5 and output. The method has good generality: by detecting the layout structure of an existing image to guide generation, it effectively improves the quality of the generated images.
Description
Technical Field
The invention relates to the technical field of image recognition, and in particular to an image generation method based on multimodal retrieval and contour guidance.
Background
In the prior art, latent Diffusion Models (latent Diffusion model, LDM) generates an image by iterating original noise data in a high-dimensional representation space, then decodes the representation result into a complex and fine image, and significantly reduces the computational complexity of the Diffusion model (Diffusion), so that a high-definition picture can be generated in a short time on a device with lower computational effort by using characters to generate the picture, the threshold of model landing is greatly reduced, and the heat in the field of generating the picture by using characters is also brought. While Stable Diffusion was improved based on Latent Diffusion Models, adding more training data and using more advanced text encoders and larger generation sizes (512 x 512 and 768 x 768) making it dedicated to literally generate picture tasks. Although the existing better-effect Latent Diffusion Model (LDM) generally exceeds the GANs (generated countermeasure model) and the LSGM (generated model based on the latent space) in the FID and Precision andRecall indexes, the existing diffusion model has the following problems if generating an Image (Text to Image) directly according to Text guidance:
(1) The quality of the keywords in the Prompt (positive prompt text) that guides image generation changes the quality of the text encoding result, which in turn markedly affects the quality of the generated image, introducing a degree of randomness and uncontrollability;
(2) Because the training data set covers limited scenes and semantics, at a given number of iterations the generated results for complex scenes are markedly worse than for simple scenes;
(3) In some special scenes, such as generating human faces or text, the generated pictures are often poor and again exhibit randomness and uncontrollability.
Disclosure of Invention
To solve the randomness and uncontrollability of image generation in the prior art, an image generation method based on multimodal retrieval and contour guidance is provided; by reducing this randomness and uncontrollability, the image generation quality is improved.
The specific scheme is as follows:
An image generation method based on multimodal retrieval and contour guidance, comprising:
S1: generating an original image by image-text multimodal retrieval: inputting a positive prompt text (Prompt), performing word segmentation and vectorization on it, computing its similarity with the images in a gallery, and outputting the gallery images that meet a similarity threshold as original images;
S2: text detection: inputting the original images generated in S1, selecting one image from the set of original images, locating any text in the image through text detection, and generating a Mask image;
S3: image restoration: inputting the Mask image generated in S2; if text is detected in the Mask image, erasing the text with an image restoration function and outputting the result as the repaired image; if there is no text, outputting the Mask image directly as the repaired image;
S4: contour condition generation: inputting the repaired image generated in S3, performing edge detection on it, obtaining the layout contour, and generating a contour map that serves as the guiding constraint for image generation;
S5: guide-text generation: using the positive prompt text from S1 to detect the generation scene of the image, and for each scene setting a fixed positive keyword prompt (Prompt) and negative keyword prompt (Negative Prompt) as the guide text;
S6: image contour guidance: setting up a latent diffusion model that supports external input conditions; inputting the guide text generated in S5, taking the contour map generated in S4 as the external condition, performing conditional contour guidance in the latent diffusion model, and generating and outputting the final image.
Preferably, the vectorization method in S1 is: encode the text and the image with a text encoder and an image encoder respectively, and compute the cosine similarity $\cos(\theta)$ between the text vector $x$ and the image vector $y$:

$\cos(\theta) = \dfrac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}}$    (1)
If the cosine similarity exceeds the set threshold, the image meets the similarity condition; all images meeting the condition are gathered into a set, and one image is randomly selected from the set as the original image.
Preferably, the method of image restoration in S3 is: detect text in the image with OCR, mark the text positions, generate a Mask image with the text marked, and use Stable Diffusion inpainting to erase the text so that the erased region blends into the picture background.
Preferably, the method of erasing text by image restoration with the diffusion model is: leave the positive prompt text (Prompt) blank, set the negative prompt text (Negative Prompt) to "text, poster text, painting text", and raise the CFG (classifier-free guidance) scale to strengthen the influence of the text, so that the features obtained from text encoding guide the generation process to keep only the image background and remove the text.
Preferably, the method of edge detection in S4 is: choose an edge detection algorithm according to the scene of the original image, obtain the contour of the image subject, and generate a Mask image containing only grayscale information as the contour map.
Preferably, in S6 the loss function $L_{LDM}$ of the latent diffusion model supporting external input conditions is:

$L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[ \left\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right\|_2^2 \right]$    (2)

where $\epsilon \sim \mathcal{N}(0,1)$ represents the Gaussian sampling process; $\epsilon_\theta(z_t, t, \cdot)$ represents the time-conditional denoising autoencoding process; $\tau_\theta$ represents the encoder for the external condition, which maps the external condition $y$ to an intermediate-layer representation.
Preferably, the method of conditional contour guidance in the latent diffusion model in S6 is:
S61: encode the input text and image into features with encoders whose text-feature and image-feature representations are aligned;
S62: concatenate (Concat) the features encoded in S61 into a single feature, and add random noise to it so that the generated image differs each time; if only text features are present, no concatenation is needed;
s63: the features added with random noise in the step S63 are sent to a feature predictor UNet for noise reduction and diffusion and iteration, and features which are closer to a real image are generated in an implicit space; the noise reduction diffusion process of the diffusion model is as follows:
(3)
wherein ,representing a gaussian sampling process; />Indicating that it introduces an implicit encoder->,The method comprises the steps of carrying out a first treatment on the surface of the t= … T is a time-series denoising self-encoding process, which is based on the input +.>Predicting the corresponding denoised result, whereinIs the result of inputting x after adding noise;
s64: features in the implicit space are converted into images in the pixel space by the variational self-encoder VAE and output.
Preferably, in S61 the text and the image are encoded into features as follows:

$z = \mathcal{E}(x)$    (4)

where $\mathcal{E}$ is the introduced latent encoder and the mapping converts $x$ into $z$, so that text or image features can be represented in the latent space. When an image is input to the diffusion model, the model first encodes the input image into the latent feature space with the image encoder, while encoding the input text into the same latent feature space with the text encoder.
Preferably, in S63 the UNet has a symmetrical front-back structure: the first half contains 8 main encoding blocks and the second half contains 8 main decoding blocks; each block contains a convolutional neural network (ResNet) and an image attention network (ViT); the ResNet handles encoding and decoding of global features, while the ViT includes cross-attention and self-attention mechanisms for encoding and decoding local features.
Preferably, in S63, the method of noise reduction and diffusion by the feature predictor UNet is: using a micro-network consisting of 4 cores and 2 x 2 step convolutional layersAs an encoder, the shape is 16×32×64×128, and the image space condition is +.>The encoding is feature mapping:
(5)
the feature map is converted; the network will convert 512 x 512 image conditions into 64 x 64 feature maps.
Preferably, in S63 the influence of the text features is controlled through the numerical variable CFG (classifier-free guidance scale) embedded in the UNet, which governs how strongly the text prompt steers the diffusion process.
The invention provides an image generation method based on multimodal retrieval and contour guidance, comprising: S1: generating an original image by image-text multimodal retrieval: inputting a positive prompt text (Prompt), performing word segmentation and vectorization on it, computing its similarity with the images in a gallery, and outputting the gallery images that meet a similarity threshold as original images; S2: text detection; S3: image restoration, removing poorly generated elements from the image; S4: edge detection; S5: guide-text generation, setting a fixed positive keyword prompt and negative keyword prompt as the guide text for each scene, improving the generation quality in specific scenes; S6: conditional image generation: setting up a latent diffusion model that supports external input conditions; the contour map generated in S4 is input as the external condition, and the final image is conditionally generated in the latent diffusion model using the guide text from S5 and output. This step encodes the contour map into a feature that participates in the cyclic diffusion process and thereby influences generation, ensuring the layout and quality of the generated image. The technical features of the invention yield the following combined effects. First, the invention has good generality: image generation is guided by detecting the layout structure of an existing image, which reduces the randomness and uncontrollability of generation and improves the quality of the generated images.
Second, compared with generating images from text information alone, the invention lowers the difficulty of producing high-quality images: no overly complex and detailed Prompt needs to be designed, and relying on the layout information of an existing image makes the generated result more dependable. Third, the invention optimizes scenes where generation quality is poor, for example guiding the generation of human bodies and faces through contour detection, and making the generated pictures more realistic by repairing text that erases poorly.
Drawings
FIG. 1 is a flow chart of a method of image generation based on multimodal retrieval and contour guidance.
Fig. 2 is a home-scene image generated by a conventional image generation method.
Fig. 3 is a home-scene image generated by the method of the present invention.
FIG. 4 is a library image generated by a conventional image generation method.
FIG. 5 is a library image generated by the method of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
As shown in Fig. 1, an image generation method based on multimodal retrieval and contour guidance comprises:
S1: generating an original image by image-text multimodal retrieval: inputting a positive prompt text (Prompt), performing word segmentation and vectorization on it, computing its similarity with the images in a gallery, and outputting the gallery images that meet a similarity threshold as original images;
S2: text detection: inputting the original images generated in S1, selecting one image from the set of original images, locating any text in the image through text detection, and generating a Mask image;
S3: image restoration: inputting the Mask image generated in S2; if text is detected in the Mask image, erasing the text with an image restoration function and outputting the result as the repaired image; if there is no text, outputting the Mask image directly as the repaired image;
S4: contour condition generation: inputting the repaired image generated in S3, performing edge detection on it, obtaining the layout contour, and generating a contour map that serves as the guiding constraint for image generation;
S5: guide-text generation: using the positive prompt text from S1 to detect the generation scene of the image, and for each scene setting a fixed positive keyword prompt (Prompt) and negative keyword prompt (Negative Prompt) as the guide text;
S6: image contour guidance: setting up a latent diffusion model that supports external input conditions; inputting the guide text generated in S5, taking the contour map generated in S4 as the external condition, performing conditional contour guidance in the latent diffusion model, and generating and outputting the final image.
Fig. 2 shows a home-scene image generated by a conventional method without contour guidance, and Fig. 3 one generated with contour guidance; Fig. 4 is a library image generated by a conventional method, and Fig. 5 one generated by the method of the invention. It can be seen that without the contour map as guidance the generated result easily loses its sense of space, and much of the furniture looks like smeared color blocks lacking realism; with the contour map, far more realistic results are produced.
Preferably, the vectorization method in S1 is: encode the text and the image with a text encoder and an image encoder respectively, and compute the cosine similarity $\cos(\theta)$ between the text vector $x$ and the image vector $y$:

$\cos(\theta) = \dfrac{\sum_{i} x_i y_i}{\sqrt{\sum_{i} x_i^2}\,\sqrt{\sum_{i} y_i^2}}$    (1)
If the cosine similarity exceeds the set threshold, the image meets the similarity condition; all images meeting the condition are gathered into a set, and one image is randomly selected from the set as the original image.
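Equation (1) and the threshold-based retrieval can be written directly, assuming the encoders have already produced the vectors (the encoders themselves are not shown):

```python
import math

def cosine_similarity(x, y):
    """Equation (1): cos(theta) between a text vector x and an image vector y."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def retrieve(text_vec, gallery_vecs, threshold=0.8):
    """Return indices of gallery images whose similarity meets the threshold."""
    return [i for i, v in enumerate(gallery_vecs)
            if cosine_similarity(text_vec, v) >= threshold]
```

The threshold value 0.8 is illustrative; the patent only specifies that a threshold is set, not its value.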
Preferably, the method of image restoration in S3 is: detect text in the image with OCR, mark the text positions, generate a Mask image with the text marked, and use Stable Diffusion inpainting to erase the text so that the erased region blends into the picture background.
Preferably, the method of erasing text by image restoration with the diffusion model is: leave the positive prompt text (Prompt) blank, set the negative prompt text (Negative Prompt) to "text, poster text, painting text", and raise the CFG (classifier-free guidance) scale to strengthen the influence of the text, so that the features obtained from text encoding guide the generation process to keep only the image background and remove the text.
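The mask semantics of this step can be illustrated with a crude stand-in: replacing masked pixels with the background mean. This is NOT diffusion inpainting (the patent uses Stable Diffusion with an empty positive prompt, a negative prompt like "text, poster text, painting text" and a raised CFG scale); the toy below only shows what the Mask image selects for erasure.

```python
import numpy as np

def erase_masked_region(image, mask):
    """Toy erase: fill masked (text) pixels with the mean of the unmasked
    background so the erased region blends in. A real system would run
    diffusion inpainting over the masked region instead."""
    out = image.astype(float).copy()
    background_mean = out[~mask].mean()
    out[mask] = background_mean
    return out
```

In practice a diffusion inpainting pipeline receives exactly this pair (image, boolean text mask) plus the prompts described above.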
Preferably, the method of edge detection in S4 is: choose an edge detection algorithm according to the scene of the original image, obtain the contour of the image subject, and generate a Mask image containing only grayscale information as the contour map.
The edge detection algorithms include traditional image edge detection, typically the Canny edge detector, and deep-learning-based image segmentation. The Canny detector's main advantages are a low error rate, good localization of single edge points, and a low rate of duplicate detections; its main steps are:
(1) Apply Gaussian smoothing to the input image, primarily to reduce the error rate;
(2) Estimate the edge gradient magnitude and direction at each pixel;
(3) Apply non-maximum suppression to the gradient magnitudes along the gradient direction;
(4) Detect and connect edges with a double threshold. The double threshold consists of a low threshold and a high threshold: pixels below the low threshold are set to 0, and pixels above the high threshold are marked as edge points and set to 1 (or 255).
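Step (4) can be sketched as follows. This is a simplified one-pass version of the hysteresis stage (real Canny propagates the promotion transitively); the threshold values in the test are arbitrary examples.

```python
import numpy as np

def double_threshold(gradient, low, high):
    """Double-threshold edge linking: pixels below `low` become 0, pixels
    at or above `high` become strong edges (255), and in-between (weak)
    pixels are kept only if a 4-neighbour is a strong edge. One-pass
    simplification of full hysteresis."""
    strong = gradient >= high
    weak = (gradient >= low) & ~strong
    out = np.zeros_like(gradient, dtype=np.uint8)
    out[strong] = 255
    # promote weak pixels adjacent to a strong pixel (up/down/left/right)
    padded = np.pad(strong, 1)
    neighbour_strong = (padded[:-2, 1:-1] | padded[2:, 1:-1] |
                        padded[1:-1, :-2] | padded[1:-1, 2:])
    out[weak & neighbour_strong] = 255
    return out
```

Libraries such as OpenCV expose the full pipeline (smoothing, gradients, non-maximum suppression, hysteresis) in a single call; the sketch only isolates the double-threshold idea described above.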
Deep-learning-based image segmentation generally refers to CNN (convolutional neural network) based algorithms, typically Mask R-CNN, whose main principle is:
(1) Use a convolutional neural network to obtain a feature map of the image;
(2) Place N candidate ROI anchors at each point of the feature map, giving a set of candidate ROI regions, where N is preset and corresponds to the number of classes;
(3) Feed the candidate ROIs into an RPN (region proposal network, which predicts object bounding boxes) for binary classification, filtering out part of the candidates;
(4) Apply the ROIAlign operation to the remaining ROI regions (i.e. put the pixel positions of the original image and the feature map into correspondence);
(5) Classify these ROIs (an N-way classification that yields class information) and generate Masks (inside each ROI region, an FCN performs deconvolution and pooling operations to obtain the segmentation edges of the image entities).
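The correspondence in step (4) is the essence of ROIAlign: ROI coordinates are divided by the backbone stride without rounding, and feature values at the resulting fractional positions are read by bilinear interpolation. A minimal sketch (the stride value and feature map are illustrative):

```python
import math

def roi_to_feature_coords(roi, stride):
    """Map an ROI from original-image pixel coordinates onto the feature
    map. ROIAlign keeps the division exact (floats) instead of rounding
    as ROI Pooling does. `stride` is the CNN downsampling factor."""
    x0, y0, x1, y1 = roi
    return (x0 / stride, y0 / stride, x1 / stride, y1 / stride)

def bilinear_sample(fmap, x, y):
    """Read the feature map at a fractional position, as ROIAlign does."""
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * fmap[y0][x0] + dx * (1 - dy) * fmap[y0][x0 + 1] +
            (1 - dx) * dy * fmap[y0 + 1][x0] + dx * dy * fmap[y0 + 1][x0 + 1])
```

Avoiding the rounding step is what keeps original-image and feature-map pixel positions in exact correspondence, which matters for pixel-accurate mask edges.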
In practice, for images with complex backgrounds or excessive sharpness, Canny edge detection easily produces overly dense and fragmented edges, which guides Stable Diffusion to generate images with scattered, unrealistic colors; conversely, when the image is not sharp enough, the edges obtained by Canny are incomplete and hard to close, so that the subject color and background color bleed together when guiding Stable Diffusion. Neither case suits traditional edge detection, and both are better handled with deep-learning-based image segmentation.
In S5, the positive prompt text is used to optimize the image, for example by adding modifier words that describe image details: changing "sky, clouds" into "sky glowing fiery red under the sunset, clouds", or "a puppy sitting on a bench" into "a brown curly-haired dog sitting on a park bench, lawn background, cloudy weather". Keywords such as "intricate, fine, …, sharpened" are added to raise generation quality. The result in specific scenes can also be optimized through the negative prompt text (Negative Prompt); for example, for generating faces and human bodies, keywords such as "bad hands, missing fingers, multiple fingers, …, bad face" can be added to the negative prompt to avoid generating unrealistic human anatomy.
The specific scenes here mainly mean portrait scenes and scenes containing complex spatial relationships. For example, in a half-length portrait, a diffusion model easily collapses on face details, hand details and limb poses, producing obviously unrealistic image details; likewise a bedroom view, containing a bed, wardrobe, windows, ceiling lights and so on with clear spatial relationships among them, often comes out of a diffusion model looking fragmented.
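Step S5 amounts to a fixed scene-to-keywords table appended to the user's prompt. The sketch below is a hypothetical table: the keyword strings for "portrait" are examples taken from the description above, while the "interior" entry and the table structure itself are illustrative assumptions.

```python
# Hypothetical per-scene fixed prompts in the spirit of S5.
SCENE_PROMPTS = {
    "portrait": ("intricate, fine detail, sharpened",
                 "bad hands, missing fingers, multiple fingers, bad face"),
    "interior": ("intricate, fine detail, sharpened",
                 "blurry, distorted perspective"),  # illustrative, not from the patent
}

def build_guide_text(user_prompt, scene):
    """Append the scene's fixed positive keywords and return the
    (positive, negative) guide-text pair; unknown scenes pass through."""
    positive, negative = SCENE_PROMPTS.get(scene, ("", ""))
    return (f"{user_prompt}, {positive}".strip(", "), negative)
```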
Preferably, the method of conditional contour guidance in the latent diffusion model in S6 is:
S61: encode the input text and image into features with encoders whose text-feature and image-feature representations are aligned;
S62: concatenate (Concat) the features encoded in S61 into a single feature, and add random noise to it so that the generated image differs each time; if only text features are present, no concatenation is needed;
s63: the features added with random noise in the step S63 are sent to a feature predictor UNet for noise reduction and diffusion and iteration, and features which are closer to a real image are generated in an implicit space;
s64: features in the implicit space are converted into images in the pixel space by the variational self-encoder VAE and output.
Preferably, in S61, when an image is supplied as input to the diffusion model, the model first encodes the input image into an implicit feature space with an image encoder, while encoding the input text into the same implicit feature space with a text encoder.
Preferably, in S63, the UNet has a front-back symmetric structure: the first half contains 8 main encoding blocks and the second half contains 8 main decoding blocks. Each block contains a convolutional neural network ResNet and an image attention network ViT; the ResNet handles encoding and decoding of global features, while the ViT, which includes cross-attention and self-attention mechanisms, handles encoding and decoding of local features.
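The mirror symmetry implies each encoding block's output can be skip-connected to the decoding block at the matching depth. A small sketch of that channel bookkeeping follows; the channel widths are illustrative assumptions, not values from the patent.

```python
# Channel plan for a toy symmetric UNet: 8 encoding blocks widen the
# features, 8 decoding blocks mirror them. Widths are made up for
# illustration only.
enc_channels = [32, 64, 96, 128, 192, 256, 384, 512]
dec_channels = list(reversed(enc_channels))

def skip_pairs(enc, dec):
    """Pair each encoder block with the decoder block that receives its
    skip connection in a mirror-symmetric UNet."""
    return list(zip(enc, reversed(dec)))
```

In a mirror-symmetric layout every skip connection joins two blocks of equal width, which is what makes the concatenation shapes line up.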
Preferably, in S63, the method of noise-reduction diffusion by the feature predictor UNet is: a tiny network $\mathcal{E}(\cdot)$ consisting of four convolution layers with 4×4 kernels and 2×2 strides (channel widths 16, 32, 64, 128) is used as an encoder that encodes the image-space condition $c_i$ into a feature map:

$c_f = \mathcal{E}(c_i)$ (5)

where $c_f$ is the converted feature map; the network converts a 512×512 image condition into a 64×64 feature map.
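The resolution arithmetic of this condition encoder is worth checking: each stride-s layer divides the spatial size by s, so mapping 512×512 down to the stated 64×64 requires a total stride of 8. Note that four stride-2 layers would give a total stride of 16 (512→32), so the layout below, with one stride-1 layer, is only one illustrative arrangement that reaches 64×64; the exact stride schedule is an assumption.

```python
def out_size(in_size, strides):
    """Spatial size after a chain of strided convolutions with
    'same'-style padding (each layer divides the size by its stride)."""
    for s in strides:
        in_size //= s
    return in_size

# Total stride 8 maps the 512x512 image-space condition onto the
# 64x64 latent grid; (2, 2, 2, 1) is one illustrative layout.
down_to_latent = out_size(512, (2, 2, 2, 1))
all_stride_two = out_size(512, (2, 2, 2, 2))
```

Four stride-2 layers overshoot to 32×32, which is why the stride schedule matters when matching the latent resolution.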
Preferably, in S63, the influence of the text features is controlled by the numerical variable CFG (classifier-free guidance scale) embedded in the UNet, which adjusts how strongly the text prompt steers the diffusion process.
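Classifier-free guidance combines an unconditional and a text-conditioned noise prediction at each step; the CFG scale sets how far the combination moves toward the conditioned branch. A minimal sketch of that mixing rule (the standard CFG formula, not code from the patent):

```python
import numpy as np

def cfg_mix(eps_uncond, eps_text, cfg_scale):
    """Classifier-free guidance: push the noise prediction away from the
    unconditional branch toward the text-conditioned one. cfg_scale = 1
    reproduces plain conditional sampling; larger values strengthen the
    prompt's influence."""
    return eps_uncond + cfg_scale * (eps_text - eps_uncond)

u = np.array([0.0, 1.0])   # unconditional prediction (toy values)
c = np.array([1.0, 1.0])   # text-conditioned prediction (toy values)
```

With scale 0 the prompt is ignored; with scale 1 the conditioned prediction is used as-is; with 7.5 the difference between the branches is amplified 7.5×.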
It should be noted that the above-described embodiments enable those skilled in the art to understand the invention more fully, but do not limit it in any way. Therefore, although the invention has been described in detail with reference to the drawings and examples, it will be understood by those skilled in the art that the invention may be modified or substituted with equivalents; in all cases, all technical solutions and modifications that do not depart from the spirit and scope of the invention are intended to be covered by the scope of the claims.
Claims (11)
1. An image generation method based on multi-modal retrieval and contour guidance, characterized in that:
S1: generating an original image by image-text multi-modal retrieval: inputting a forward Prompt text, performing word segmentation and vectorization on the forward Prompt text, calculating its similarity with images in a gallery, and outputting the gallery images that meet a similarity threshold as original images;
S2: text detection: inputting the original images generated in S1, selecting an image from the set of original images, first obtaining the positions of text in the image through text detection, and generating a Mask image;
S3: image restoration: inputting the Mask image generated in S2; if text is detected in the Mask image, erasing the text with an image restoration function and outputting the result as a repaired image; if there is no text, directly outputting the Mask image as the repaired image;
S4: generating contour conditions: inputting the repaired image generated in S3, performing edge detection on it to obtain the layout contour and generate a contour map, and using the contour map as the guiding constraint for image generation;
S5: generating guide text: detecting the generation scene of the image from the forward Prompt text input in S1, and setting a fixed positive keyword Prompt and a Negative keyword Prompt as guide texts for the different scenes;
S6: image contour guidance: setting up an implicit diffusion model that supports external input conditions; inputting the guide text generated in S5, taking the contour map generated in S4 as the external condition, performing conditional contour guidance in the implicit diffusion model, and generating and outputting the final image.
2. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 1, wherein the vectorization processing in S1 is: encoding the text and the image with a text encoder and an image encoder respectively, and calculating the cosine similarity cos(θ) between the text vector x and the image vector y:

$\cos(\theta) = \dfrac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\,\sqrt{\sum_i y_i^2}}$ (1)

If the cosine similarity exceeds a set threshold, the image meets the similarity condition; all images meeting the similarity condition are compiled into a set, and one image is randomly selected from the set as the original image.
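Equation (1) and the threshold-then-sample selection of claim 2 can be sketched directly in numpy. The threshold value and the toy two-dimensional vectors are illustrative; real text/image embeddings are high-dimensional.

```python
import numpy as np

def cosine_sim(x, y):
    """cos(theta) between vectors x and y, per equation (1)."""
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def retrieve_original(text_vec, image_vecs, threshold=0.8, rng=None):
    """Collect all gallery images above the threshold, then pick one at
    random as the original image (threshold value is illustrative)."""
    rng = rng or np.random.default_rng(0)
    hits = [i for i, v in enumerate(image_vecs)
            if cosine_sim(text_vec, v) >= threshold]
    return int(rng.choice(hits)) if hits else None

t = np.array([1.0, 0.0])                              # toy text vector
gallery = [np.array([1.0, 0.1]), np.array([0.0, 1.0])]  # toy image vectors
```

The first gallery vector is nearly parallel to the text vector and passes the threshold; the second is orthogonal and is filtered out.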
3. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 1, wherein the method of image restoration in S3 is: detecting text in the image with OCR, marking the positions of the text, generating a Mask image with the text marked, and performing image restoration with the diffusion model Stable Diffusion to erase the text so that the erased regions blend with the picture background.
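The Mask image of claim 3 is a binary map built from the OCR bounding boxes, marking the pixels the inpainting model is allowed to repaint. A minimal numpy sketch (the box format `(x0, y0, x1, y1)` is an assumption; OCR engines differ):

```python
import numpy as np

def text_mask(shape, boxes):
    """Build a binary Mask image from OCR bounding boxes
    (x0, y0, x1, y1); 1 marks pixels to be inpainted."""
    mask = np.zeros(shape, dtype=np.uint8)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1
    return mask

m = text_mask((32, 32), [(4, 4, 12, 8)])  # one 8x4 text region
```

An all-zero mask corresponds to the "no text detected" branch of S3, where the image passes through unchanged.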
4. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 3, wherein the method of image restoration and text erasing by the diffusion model is: setting the forward Prompt text in the diffusion model, setting the Negative Prompt text to "text, poster text, painting text", and raising the strength of the text influence via CFG, so that the features obtained by text encoding guide the generation process to keep only the image background and remove the text.
5. The image generation method based on multi-modal retrieval and contour guidance as defined in claim 1, wherein the method of edge detection in S4 is: selecting an edge detection algorithm according to the scene of the original image, acquiring the contour of the image subject, and generating a Mask image containing only grayscale information as the contour map.
6. The image generation method based on multi-modal retrieval and contour guidance as set forth in claim 1, wherein in S6 the loss function $L_{LDM}$ of the implicit diffusion model supporting external input conditions is set as:

$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\,y,\,\epsilon\sim\mathcal{N}(0,1),\,t}\left[\left\|\epsilon-\epsilon_\theta\!\left(z_t,\,t,\,\tau_\theta(y)\right)\right\|_2^2\right]$ (2)

where $\epsilon\sim\mathcal{N}(0,1)$ represents the Gaussian sampling process; $\epsilon_\theta(z_t, t, \tau_\theta(y))$ represents the time-sequential denoising autoencoding process; $\tau_\theta$ represents the encoder for external conditions, which maps the external condition $y$ to an intermediate-layer representation.
7. The image generation method based on multi-modal retrieval and contour guidance as set forth in claim 1, wherein the method of conditional contour guidance in the implicit diffusion model in S6 is:
S61: encode the input text and image into features with an encoder that aligns the text-feature and image-feature representations;
S62: Concat-connect the features encoded in S61 into a single feature, and add random noise to it so that each generated image differs; if only text features are available, no Concat connection is needed;
S63: feed the noised features from S62 into the feature predictor UNet for iterative noise-reduction diffusion, generating features in the implicit space that are progressively closer to a real image; the noise-reduction diffusion process of the diffusion model is:

$L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\,\epsilon\sim\mathcal{N}(0,1),\,t}\left[\left\|\epsilon-\epsilon_\theta(z_t, t)\right\|_2^2\right]$ (3)

where $\epsilon\sim\mathcal{N}(0,1)$ represents the Gaussian sampling process; the implicit encoder $\mathcal{E}$ is introduced, giving $z=\mathcal{E}(x)$; $\epsilon_\theta(z_t, t)$, $t = 1 \dots T$, is the time-sequential denoising autoencoding process, which predicts the corresponding denoised result from the input $z_t$, where $z_t$ is the result of adding noise to the encoded input $x$;
S64: convert the features in the implicit space into a pixel-space image with the variational autoencoder VAE and output it.
8. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 7, wherein in S61 the method of encoding text and images into features by the encoder is:

$z = \mathcal{E}(x)$ (4)

where the implicit encoder $\mathcal{E}$ is introduced; $\mathcal{E}$ converts $x$ into $z$, so that text or image features can be represented in the implicit space. When an image is input to the diffusion model, the model first encodes the input image into the implicit feature space with an image encoder, while encoding the input text into the same implicit feature space with a text encoder.
9. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 7, wherein in S63 the UNet has a front-back symmetric structure: the first half contains 8 main encoding blocks and the second half contains 8 main decoding blocks; each block contains a convolutional neural network ResNet and an image attention network ViT; the ResNet handles encoding and decoding of global features; the ViT includes cross-attention and self-attention mechanisms for encoding and decoding of local features.
10. The image generation method based on multi-modal retrieval and contour guidance as claimed in claim 7, wherein in S63 the method of noise-reduction diffusion by the feature predictor UNet is: a tiny network $\mathcal{E}(\cdot)$ consisting of four convolution layers with 4×4 kernels and 2×2 strides (channel widths 16, 32, 64, 128) is used as an encoder that encodes the image-space condition $c_i$ into a feature map:

$c_f = \mathcal{E}(c_i)$ (5)

where $c_f$ is the converted feature map; the network converts a 512×512 image condition into a 64×64 feature map.
11. The image generation method based on multi-modal retrieval and contour guidance as set forth in claim 7, wherein in S63 the degree of influence of the text features is controlled during iteration by the numerical variable CFG embedded in the UNet, which adjusts how strongly the text prompt steers the diffusion process.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310919649.2A CN116630482B (en) | 2023-07-26 | 2023-07-26 | Image generation method based on multi-mode retrieval and contour guidance |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116630482A true CN116630482A (en) | 2023-08-22 |
CN116630482B CN116630482B (en) | 2023-11-03 |
Family
ID=87613903
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310919649.2A Active CN116630482B (en) | 2023-07-26 | 2023-07-26 | Image generation method based on multi-mode retrieval and contour guidance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116630482B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220005235A1 (en) * | 2020-07-06 | 2022-01-06 | Ping An Technology (Shenzhen) Co., Ltd. | Method and device for text-based image generation |
CN116012488A (en) * | 2023-01-05 | 2023-04-25 | 网易(杭州)网络有限公司 | Stylized image generation method, device, computer equipment and storage medium |
KR20230059524A (en) * | 2021-10-26 | 2023-05-03 | 삼성에스디에스 주식회사 | Method and apparatus for analyzing multimodal data |
CN116452706A (en) * | 2023-04-23 | 2023-07-18 | 中国工商银行股份有限公司 | Image generation method and device for presentation file |
CN116452410A (en) * | 2023-03-10 | 2023-07-18 | 浙江工业大学 | Text-guided maskless image editing method based on deep learning |
Non-Patent Citations (1)
Title |
---|
Zhou Zuowei; Qian Zhenzhen: "Image Editing Using Natural Language Text Descriptions", Electronic Technology & Software Engineering, No. 01, pages 119-121
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117725247A (en) * | 2024-02-07 | 2024-03-19 | 北京知呱呱科技有限公司 | Diffusion image generation method and system based on retrieval and segmentation enhancement |
CN117725247B (en) * | 2024-02-07 | 2024-04-26 | 北京知呱呱科技有限公司 | Diffusion image generation method and system based on retrieval and segmentation enhancement |
Also Published As
Publication number | Publication date |
---|---|
CN116630482B (en) | 2023-11-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||