CN116630482A - Image generation method based on multi-mode retrieval and contour guidance - Google Patents

Image generation method based on multi-mode retrieval and contour guidance

Info

Publication number
CN116630482A
CN116630482A
Authority
CN
China
Prior art keywords
image
text
prompt
generating
contour
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310919649.2A
Other languages
Chinese (zh)
Other versions
CN116630482B (en)
Inventor
李昊昱
王洪俊
乔春庚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tols Information Technology Co ltd
Original Assignee
Tols Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tols Information Technology Co ltd filed Critical Tols Information Technology Co ltd
Priority to CN202310919649.2A priority Critical patent/CN116630482B/en
Publication of CN116630482A publication Critical patent/CN116630482A/en
Application granted granted Critical
Publication of CN116630482B publication Critical patent/CN116630482B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/53 Querying
    • G06F16/532 Query formulation, e.g. graphical querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/5846 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using extracted text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/77 Retouching; Inpainting; Scratch removal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/10 Segmentation; Edge detection
    • G06T7/13 Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an image generation method based on multi-modal retrieval and contour guidance, which comprises the following steps: S1: generating an original image by text-image multi-modal retrieval: a forward prompt text (Prompt) is input, segmented into words and vectorized, and the gallery images that meet a similarity threshold are output as original images; S2: text detection; S3: image restoration, removing poorly generated elements from the image; S4: edge detection; S5: guide-text generation; S6: conditional image generation: an implicit diffusion model supporting external input conditions is set up, the contour map generated in step S4 is input as an external condition, and the final image is conditionally generated in the diffusion model using the guide text generated in step S5 and output. The method has good universality: image generation is guided by detecting the layout structure of an existing image, which effectively improves the image generation effect.

Description

Image generation method based on multi-mode retrieval and contour guidance
Technical Field
The invention relates to the technical field of image recognition, in particular to an image generation method based on multi-mode retrieval and contour guidance.
Background
In the prior art, the Latent Diffusion Model (LDM) generates an image by iteratively denoising initial noise in a latent representation space and then decoding the resulting representation into a complex, finely detailed image. This significantly reduces the computational complexity of the diffusion model, so a high-definition picture can be generated from text in a short time on devices with modest compute, greatly lowering the threshold for deploying such models and fueling interest in the text-to-image field. Stable Diffusion improved on the Latent Diffusion Model by adding more training data and using a more advanced text encoder and larger generation sizes (512 x 512 and 768 x 768), making it dedicated to the text-to-image task. Although the better-performing Latent Diffusion Models generally exceed GANs (generative adversarial networks) and LSGM (a latent-space generative model) on the FID and Precision and Recall metrics, existing diffusion models still have the following problems when generating an image directly from text guidance (Text to Image):
(1) The quality of the keywords in the Prompt (forward prompt text) that guides image generation directly changes the quality of the text-encoding result and therefore visibly affects the quality of the generated image, so the method carries a degree of randomness and uncontrollability;
(2) Because the training data set covers limited scenes and semantics, the quality of pictures generated for complex scenes is clearly lower than for simple scenes at a given number of iterations;
(3) In some special scenes, such as generating human faces or text, the generated pictures are often poor and again show a degree of randomness and uncontrollability.
Disclosure of Invention
In order to solve the problem that image generation in the prior art carries a degree of randomness and uncontrollability, an image generation method based on multi-modal retrieval and contour guidance is provided; it reduces the randomness and uncontrollability of image generation and thereby improves the image generation effect.
The specific scheme is as follows:
an image generation method based on multi-modal retrieval and contour guidance,
S1: generating an original image by text-image multi-modal retrieval: a forward prompt text (Prompt) is input, segmented into words and vectorized, its similarity to the images in a gallery is calculated, and the gallery images that meet a similarity threshold are output as original images;
S2: text detection: an original image generated in step S1 is input, one image is selected from the set of original images, the position of the text in the image is first located by text detection, and a Mask image is generated;
S3: image restoration: the Mask image generated in step S2 is input; if text is detected in the Mask image, the text is erased with an image restoration function and the result is output as the restored image; if there is no text, the Mask image is output directly as the restored image;
S4: contour condition generation: the restored image generated in step S3 is input, edge detection is performed on it, the layout contour is obtained and a contour map is generated, and the contour map serves as the guidance and constraint condition for image generation;
S5: guide-text generation: the forward prompt text (Prompt) from step S1 is input to detect the generation scene of the image, and a fixed positive keyword prompt (Prompt) and negative keyword prompt (Negative Prompt) are set as guide texts for the different scenes;
S6: image contour guidance: an implicit diffusion model supporting external input conditions is set up; the guide text generated in step S5 is input, the contour map generated in step S4 is used as the external condition, conditional contour guidance is performed in the implicit diffusion model, and the final image is generated and output.
Preferably, the vectorization method in S1 is: the text and the image are encoded with a text encoder and an image encoder respectively, and the cosine similarity cos(θ) between the text vector x and the image vector y (with components x_i and y_i) is calculated:
\cos(\theta) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\sqrt{\sum_i y_i^2}} (1)
If the cosine similarity exceeds a set threshold, the image meets the similarity condition; all images meeting the similarity condition are compiled into a set, and one image is randomly selected from the set as the original image.
Preferably, the method of image restoration in S3 is: OCR is used to detect the text in the image and mark its position, a Mask image with the text marked is generated, and Stable Diffusion (a diffusion model) is used for image restoration to erase the text so that the erased regions blend into the picture background.
Preferably, the method by which the diffusion model performs image restoration and erases text is: in the diffusion model the positive prompt text (Prompt) is left blank, the negative prompt text (Negative Prompt) is set to text, poster text and painted text, and the CFG value is raised to strengthen the influence of the text, so that the features obtained from text encoding guide the generation process to keep only the image background and remove the text.
Preferably, the method of edge detection in S4 is: an edge detection algorithm is chosen according to the scene of the original image, the contour of the image subject is obtained, and a Mask image containing only grayscale information is generated as the contour map.
Preferably, in S6 the loss function L_LDM of the implicit diffusion model supporting external input conditions is set as:
L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right\|_2^2\right] (2)
where ε ~ N(0,1) represents the Gaussian sampling process; ε_θ(z_t, t, τ_θ(y)) represents the time-sequential denoising auto-encoding process; τ_θ represents the encoder for the external condition, and τ_θ(y) maps the external condition y to an intermediate-layer representation.
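The following is a minimal PyTorch sketch of the conditional loss in equation (2); the denoising network eps_theta and the condition encoder tau_theta are hypothetical stand-ins, and the linear noise schedule is an illustrative assumption, not part of the patented method.

```python
# Minimal sketch of the conditional LDM loss in equation (2).
# `eps_theta` (denoising UNet) and `tau_theta` (external-condition encoder) are
# hypothetical callables; the linear noise schedule is only for illustration.
import torch
import torch.nn.functional as F

def ldm_conditional_loss(eps_theta, tau_theta, z0, y, num_timesteps=1000):
    """z0: latent code E(x); y: external condition (e.g. a contour map)."""
    b = z0.shape[0]
    t = torch.randint(0, num_timesteps, (b,), device=z0.device)    # random timestep per sample
    eps = torch.randn_like(z0)                                      # Gaussian sampling, eps ~ N(0, 1)
    alpha_bar = (1.0 - t.float() / num_timesteps).view(b, 1, 1, 1)  # toy noise schedule
    z_t = alpha_bar.sqrt() * z0 + (1.0 - alpha_bar).sqrt() * eps    # z_t: noised latent
    eps_pred = eps_theta(z_t, t, tau_theta(y))                      # predict the added noise
    return F.mse_loss(eps_pred, eps)                                # || eps - eps_theta(...) ||^2
```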
Preferably, the method of conditional contour guidance in the implicit diffusion model in S6 is:
s61: encoding the input text and image into features by an encoder, and the encoder aligning the text features and image feature representations;
s62: concat connection is carried out on the features coded in the S61, the features are combined into one feature, and random noise is added to the feature so that images generated each time are different; if only text features are available, no Concat connection is required;
s63: the features added with random noise in the step S63 are sent to a feature predictor UNet for noise reduction and diffusion and iteration, and features which are closer to a real image are generated in an implicit space; the noise reduction diffusion process of the diffusion model is as follows:
(3)
wherein ,representing a gaussian sampling process; />Indicating that it introduces an implicit encoder->The method comprises the steps of carrying out a first treatment on the surface of the t= … T is a time-series denoising self-encoding process, which is based on the input +.>Predicting the corresponding denoised result, whereinIs the result of inputting x after adding noise;
s64: features in the implicit space are converted into images in the pixel space by the variational self-encoder VAE and output.
Preferably, in S61, the text and the image are encoded into features by an encoder as:
z = \mathcal{E}(x) (4)
where E denotes the introduced implicit encoder, and z = E(x) converts the input x into the latent representation z, so that text or image features can be represented in the implicit space; when an image is input to the diffusion model, the model first encodes the input image into the implicit feature space through the image encoder, while the input text is encoded into the same implicit feature space through the text encoder.
Preferably, in S63, the UNet has a front-back symmetric structure: the first half contains 8 main encoding blocks and the second half contains 8 main decoding blocks; each block contains a convolutional neural network (ResNet) and an image attention network (ViT); the convolutional neural network ResNet handles the encoding and decoding of global features; the image attention network ViT includes cross-attention and self-attention mechanisms and handles the encoding and decoding of local features.
Preferably, in S63, the method of noise-reduction diffusion by the feature predictor UNet is: a small network E(·) consisting of four convolutional layers with 4 x 4 kernels and 2 x 2 strides, with channel widths 16, 32, 64 and 128, is used as an encoder to encode the image-space condition c_i into a feature map:
c_f = \mathcal{E}(c_i) (5)
This converts the condition into a feature map; the network converts a 512 x 512 image condition into a 64 x 64 feature map.
Preferably, in S63, the influence of the text features is controlled through the numerical variable CFG embedded in the UNet, which adjusts how strongly the text prompt steers the diffusion process.
The invention provides an image generation method based on multi-modal retrieval and contour guidance, which comprises the following steps: S1: generating an original image by text-image multi-modal retrieval: a forward prompt text (Prompt) is input, segmented into words and vectorized, its similarity to the images in a gallery is calculated, and the gallery images that meet a similarity threshold are output as original images; S2: text detection; S3: image restoration, removing poorly generated elements from the image; S4: edge detection; S5: guide-text generation, setting a fixed positive keyword prompt (Prompt) and negative keyword prompt (Negative Prompt) as guide texts for different scenes, which improves the image generation effect in specific scenes; S6: conditional image generation: an implicit diffusion model supporting external input conditions is set up; the contour map generated in step S4 is input as an external condition, and the final image is conditionally generated in the implicit diffusion model using the guide text generated in step S5 and output. This step encodes the contour map into a feature that participates in the cyclic diffusion process and thereby influences generation, with the technical effect of ensuring the layout and quality of the generated image. The technical features of the invention yield the following combined technical effects. First, the invention has good universality: image generation is guided by detecting the layout structure of an existing image, which reduces the randomness and uncontrollability of image generation and improves the image generation effect. Second, compared with techniques that generate images from text information alone, the invention reduces the technical difficulty of generating high-quality images; an overly complex and detailed Prompt does not need to be designed, and relying on the layout information of an existing image makes the generated result more reliable. Third, the invention optimizes scenes in which image generation performs poorly, for example by guiding the generation of bodies and faces through contour detection and by repairing poorly erased text through image restoration, so that the generated pictures are more realistic.
Drawings
FIG. 1 is a flow chart of a method of image generation based on multimodal retrieval and contour guidance.
Fig. 2 is a view of home image effects generated by a conventional image generation method.
Fig. 3 is a view of a home image effect generated based on the method of the present invention.
FIG. 4 is a diagram of library image effects generated by a conventional image generation method.
FIG. 5 is a library image effect map generated based on the method of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and examples.
As shown in fig. 1, an image generation method based on multi-modal retrieval and contour guidance,
S1: generating an original image by text-image multi-modal retrieval: a forward prompt text (Prompt) is input, segmented into words and vectorized, its similarity to the images in a gallery is calculated, and the gallery images that meet a similarity threshold are output as original images;
S2: text detection: an original image generated in step S1 is input, one image is selected from the set of original images, the position of the text in the image is first located by text detection, and a Mask image is generated;
S3: image restoration: the Mask image generated in step S2 is input; if text is detected in the Mask image, the text is erased with an image restoration function and the result is output as the restored image; if there is no text, the Mask image is output directly as the restored image;
S4: contour condition generation: the restored image generated in step S3 is input, edge detection is performed on it, the layout contour is obtained and a contour map is generated, and the contour map serves as the guidance and constraint condition for image generation;
S5: guide-text generation: the forward prompt text (Prompt) from step S1 is input to detect the generation scene of the image, and a fixed positive keyword prompt (Prompt) and negative keyword prompt (Negative Prompt) are set as guide texts for the different scenes;
S6: image contour guidance: an implicit diffusion model supporting external input conditions is set up; the guide text generated in step S5 is input, the contour map generated in step S4 is used as the external condition, conditional contour guidance is performed in the implicit diffusion model, and the final image is generated and output.
Fig. 2 shows a home image generated by a conventional image generation method without contour guidance, and Fig. 3 shows a home image generated with contour guidance; Fig. 4 shows a library image generated by a conventional image generation method, and Fig. 5 shows a library image generated with the method of the present invention. It can be seen that without the contour map as a guide, the generated result easily loses its sense of space, and much of the furniture looks like smeared color blocks and lacks realism; with the contour map, more realistic results are produced.
Preferably, the vectorization method in S1 is: the text and the image are encoded with a text encoder and an image encoder respectively, and the cosine similarity cos(θ) between the text vector x and the image vector y (with components x_i and y_i) is calculated:
\cos(\theta) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\sqrt{\sum_i y_i^2}} (1)
If the cosine similarity exceeds a set threshold, the image meets the similarity condition; all images meeting the similarity condition are compiled into a set, and one image is randomly selected from the set as the original image.
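A minimal sketch of step S1 follows, assuming a CLIP-style dual encoder from the transformers library (the patent only requires a text encoder and an image encoder producing aligned vectors and does not name a model); the checkpoint and the 0.25 threshold are illustrative.

```python
# Sketch of step S1: text-image retrieval by cosine similarity with a
# CLIP-style dual encoder (model choice and threshold are assumptions).
import random
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_original_image(prompt, gallery_images, threshold=0.25):
    """Return one random gallery image whose cosine similarity to the prompt exceeds the threshold."""
    inputs = processor(text=[prompt], images=gallery_images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        x = model.get_text_features(input_ids=inputs["input_ids"],
                                    attention_mask=inputs["attention_mask"])
        y = model.get_image_features(pixel_values=inputs["pixel_values"])
    x = x / x.norm(dim=-1, keepdim=True)             # normalised text vector
    y = y / y.norm(dim=-1, keepdim=True)             # normalised image vectors
    sims = (y @ x.T).squeeze(-1)                     # cos(theta) of equation (1), per gallery image
    candidates = [img for img, s in zip(gallery_images, sims.tolist()) if s > threshold]
    return random.choice(candidates) if candidates else None
```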
Preferably, the method of image restoration in S3 is: OCR is used to detect the text in the image and mark its position, a Mask image with the text marked is generated, and Stable Diffusion (a diffusion model) is used for image restoration to erase the text so that the erased regions blend into the picture background.
Preferably, the method by which the diffusion model performs image restoration and erases text is: in the diffusion model the positive prompt text (Prompt) is left blank, the negative prompt text (Negative Prompt) is set to text, poster text and painted text, and the CFG value is raised to strengthen the influence of the text, so that the features obtained from text encoding guide the generation process to keep only the image background and remove the text.
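The sketch below illustrates steps S2-S3 as described: OCR marks the text regions, a Mask image is built from them, and Stable Diffusion inpainting erases the text with a blank positive prompt, a text-related negative prompt and a raised CFG value. The easyocr reader and the inpainting checkpoint are assumptions; the patent only specifies OCR and Stable Diffusion.

```python
# Sketch of steps S2-S3: OCR-based text detection, Mask generation, and text
# erasure by Stable Diffusion inpainting (library and checkpoint are assumptions).
import numpy as np
import torch
import easyocr
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

reader = easyocr.Reader(["ch_sim", "en"])
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16).to("cuda")

def erase_text(image: Image.Image) -> Image.Image:
    boxes = reader.readtext(np.array(image))          # [(quad, text, confidence), ...]
    if not boxes:
        return image                                  # no text: pass the image through unchanged
    mask = Image.new("L", image.size, 0)
    draw = ImageDraw.Draw(mask)
    for quad, _, _ in boxes:
        xs, ys = zip(*quad)
        draw.rectangle([min(xs), min(ys), max(xs), max(ys)], fill=255)
    return pipe(prompt="",                            # blank positive prompt
                negative_prompt="text, poster text, painted text",
                image=image, mask_image=mask,
                guidance_scale=9.0).images[0]         # raised CFG strengthens the text guidance
```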
Preferably, the method of edge detection in S4 is: an edge detection algorithm is chosen according to the scene of the original image, the contour of the image subject is obtained, and a Mask image containing only grayscale information is generated as the contour map.
The edge detection algorithm is either a traditional image edge detection algorithm, typically the Canny edge detector, or a deep-learning-based image segmentation algorithm. The Canny edge detector has the advantages of a low error rate, good single-point detection and a low rate of repeated detection, among others; its main steps are as follows (a minimal OpenCV sketch is given after the list):
(1) The input image is subjected to Gaussian smoothing, and the main purpose of the Gaussian smoothing is to reduce the error rate;
(2) Estimating the edge gradient and direction of each pixel point;
(3) According to the gradient direction, carrying out non-maximum suppression on the gradient value;
(4) Edges are detected and connected with dual thresholds. Here, the double threshold generally refers to a low threshold and a high threshold, and a pixel point smaller than the low threshold is assigned with 0; pixels above the high threshold are marked as edge points, and a value of 1 or 255 is assigned.
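Following the steps above, a minimal OpenCV sketch of the Canny branch of step S4 (the threshold values are illustrative assumptions):

```python
# Sketch of the Canny branch of step S4: Gaussian smoothing, gradient estimation,
# non-maximum suppression and double-threshold edge linking (the latter three are
# performed inside cv2.Canny). Threshold values are illustrative.
import cv2

def contour_map_canny(image_path: str, low: int = 100, high: int = 200):
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    blurred = cv2.GaussianBlur(gray, (5, 5), 1.4)   # step (1): smoothing to lower the error rate
    edges = cv2.Canny(blurred, low, high)           # steps (2)-(4): gradients, NMS, double threshold
    return edges                                    # grayscale edge mask used as the contour map
```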
The deep-learning-based image segmentation algorithm generally refers to a CNN (convolutional neural network) based segmentation algorithm, typically Mask R-CNN, whose main principle is as follows (a torchvision-based sketch is given after the list):
(1) Acquiring a feature map (feature map) of an image using a convolutional neural network;
(2) Setting N ROI ranges at each point in the feature map to obtain a plurality of candidate ROI areas, wherein N is preset and represents the classification quantity;
(3) Sending the candidate ROI into an RPN network (region proposal network for predicting the entity frame range) for binary classification, and filtering out a part of the candidate ROI;
(4) Performing ROIAlign operation on the rest of the ROI area (namely, corresponding the pixel positions of the original image and the feature image);
(5) These ROIs are classified (N-class classification, classification information is acquired) and Mask generated (deconvolution and pooling operations are performed inside each ROI area using FCNs, obtaining the segmentation edges of the image entities).
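A sketch of the deep-learning branch using torchvision's pretrained Mask R-CNN follows; the COCO checkpoint and the 0.5 thresholds are assumptions, since the patent names Mask R-CNN but no particular implementation.

```python
# Sketch of the Mask R-CNN branch of step S4: instance masks from a pretrained
# torchvision model are merged into a single grayscale mask of the image subjects.
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT").eval()

def contour_map_maskrcnn(image_path: str, score_thresh: float = 0.5):
    image = to_tensor(Image.open(image_path).convert("RGB"))
    with torch.no_grad():
        pred = model([image])[0]                         # dict: boxes, labels, scores, masks
    keep = pred["scores"] > score_thresh
    if not keep.any():
        return None                                      # no confident instances found
    union = (pred["masks"][keep, 0] > 0.5).any(dim=0)    # union of the kept instance masks
    return (union.to(torch.uint8) * 255).numpy()         # grayscale mask of the image subjects
```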
In practical application, for images with complex backgrounds or very sharp detail, Canny edge detection easily produces overly dense and fragmented edges, which leads Stable Diffusion to generate images with scattered and unrealistic colors; conversely, when the image is not sharp enough, the edges obtained by Canny detection are incomplete and difficult to close, so the subject color and the background color blend together when Stable Diffusion is guided to generate the image. Neither case is well suited to conventional edge detection; both are better handled with deep-learning-based image segmentation.
In step S5 the image is further optimized with the forward prompt text, for example by adding modifier words that describe image details: 'sky, clouds' becomes 'fiery red sky under the sunset, clouds', and 'puppy sitting on a bench' becomes 'brown curly-haired dog sitting on a park bench, lawn background, cloudy weather'. Keywords such as 'intricate, fine, …, sharpened' are added to improve generation quality. The result of image generation in a specific scene can also be optimized through the negative prompt text (Negative Prompt); for example, for images of faces and human bodies, keywords such as 'bad hands, missing fingers, extra fingers, …, bad face' can be added to the negative prompt text to avoid generating unrealistic human structures.
The specific scenes mainly refer to portrait generation scenes and scenes containing complex spatial relationships. For example, in a half-body portrait, a diffusion model easily collapses on face details, hand details and limb poses, producing clearly unrealistic image details; likewise, in a bedroom view containing a bed, wardrobe, windows, ceiling lights and so on, the elements have clear spatial relationships to one another, yet the images generated by a diffusion model often feel spatially broken.
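A minimal sketch of step S5 follows: fixed positive and negative guide texts keyed by the detected scene. The scene names and keyword lists are illustrative, following the examples given above, not a fixed part of the method.

```python
# Sketch of step S5: scene-specific fixed guide texts (keywords are illustrative).
GUIDE_TEXTS = {
    "portrait": {
        "prompt": "intricate, fine detail, sharp focus, realistic skin texture",
        "negative_prompt": "bad hands, missing fingers, extra fingers, bad face",
    },
    "indoor_scene": {
        "prompt": "coherent perspective, realistic lighting, fine detail, sharp focus",
        "negative_prompt": "distorted geometry, floating furniture, blurry",
    },
}

def build_guide_text(scene: str, user_prompt: str) -> dict:
    """Combine the user's forward prompt with the fixed keywords for the detected scene."""
    fixed = GUIDE_TEXTS.get(scene, {"prompt": "", "negative_prompt": ""})
    prompt = ", ".join(p for p in (user_prompt, fixed["prompt"]) if p)
    return {"prompt": prompt, "negative_prompt": fixed["negative_prompt"]}
```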
Preferably, the method of conditional contour guidance in the implicit diffusion model in S6 is:
s61: encoding the input text and image into features by an encoder, and the encoder aligning the text features and image feature representations;
s62: concat connection is carried out on the features coded in the S61, the features are combined into one feature, and random noise is added to the feature so that images generated each time are different; if only text features are available, no Concat connection is required;
S63: the features with random noise added in step S62 are sent to the feature predictor UNet for noise-reduction diffusion and iteration, generating features that are closer to the real image in the implicit space;
s64: features in the implicit space are converted into images in the pixel space by the variational self-encoder VAE and output.
Preferably, in S61, when an image is input to the diffusion model, the model first encodes the input image into the implicit feature space through the image encoder, while the input text is encoded into the same implicit feature space through the text encoder.
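As a concrete illustration of the image side of this latent encoding (z = E(x) in equation (4)) and of the decoding in step S64, the sketch below uses the Stable Diffusion VAE from the diffusers library; the checkpoint name is an assumption, since the patent does not name a specific autoencoder.

```python
# Sketch of z = E(x) and the inverse mapping of step S64, using the Stable
# Diffusion VAE from diffusers (checkpoint choice is an assumption).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="vae")

def encode_to_latent(pixels: torch.Tensor) -> torch.Tensor:
    """pixels: (B, 3, 512, 512) in [-1, 1]  ->  latent z: (B, 4, 64, 64)."""
    with torch.no_grad():
        return vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor

def decode_to_pixels(z: torch.Tensor) -> torch.Tensor:
    """latent z  ->  reconstructed image tensor in pixel space (step S64)."""
    with torch.no_grad():
        return vae.decode(z / vae.config.scaling_factor).sample
```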
Preferably, in S63, the UNet has a front-back symmetric structure: the first half contains 8 main encoding blocks and the second half contains 8 main decoding blocks; each block contains a convolutional neural network (ResNet) and an image attention network (ViT); the convolutional neural network ResNet handles the encoding and decoding of global features; the image attention network ViT includes cross-attention and self-attention mechanisms and handles the encoding and decoding of local features.
Preferably, in S63, the method of noise-reduction diffusion by the feature predictor UNet is: a small network E(·) consisting of four convolutional layers with 4 x 4 kernels and 2 x 2 strides, with channel widths 16, 32, 64 and 128, is used as an encoder to encode the image-space condition c_i into a feature map:
c_f = \mathcal{E}(c_i) (5)
This converts the condition into a feature map; the network converts a 512 x 512 image condition into a 64 x 64 feature map.
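The small condition encoder described above can be sketched in PyTorch as follows. Note that four stride-2 convolutions would reduce 512 x 512 to 32 x 32, so this illustrative version uses three stride-2 layers and one stride-1 layer in order to actually reach the 64 x 64 feature map stated in the text; it is an assumption about the intended layout, not the patented network.

```python
# Illustrative condition encoder: channels 16/32/64/128, mapping a 512x512
# image-space condition c_i to a 64x64 feature map c_f = E(c_i).
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    def __init__(self, in_channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=4, stride=2, padding=1), nn.SiLU(),  # 512 -> 256
            nn.Conv2d(16, 32, kernel_size=4, stride=2, padding=1), nn.SiLU(),           # 256 -> 128
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1), nn.SiLU(),           # 128 -> 64
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1), nn.SiLU(),          # 64  -> 64
        )

    def forward(self, c_i: torch.Tensor) -> torch.Tensor:   # c_i: (B, 3, 512, 512)
        return self.net(c_i)                                 # c_f: (B, 128, 64, 64)

c_f = ConditionEncoder()(torch.randn(1, 3, 512, 512))
print(c_f.shape)   # torch.Size([1, 128, 64, 64])
```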
Preferably, in S63, the influence of the text features is controlled through the numerical variable CFG embedded in the UNet, which adjusts how strongly the text prompt steers the diffusion process.
It should be noted that the above-described embodiments enable those skilled in the art to understand the invention more fully but do not limit it in any way. Therefore, although the present invention has been described in detail with reference to the drawings and examples, it will be understood by those skilled in the art that the present invention may be modified or equivalently substituted, and in all cases, all technical solutions and modifications that do not depart from the spirit and scope of the present invention are intended to be included in its scope.

Claims (11)

1. An image generation method based on multi-mode retrieval and contour guidance, characterized in that:
S1: generating an original image by text-image multi-modal retrieval: a forward prompt text (Prompt) is input, segmented into words and vectorized, its similarity to the images in a gallery is calculated, and the gallery images that meet a similarity threshold are output as original images;
S2: text detection: an original image generated in step S1 is input, one image is selected from the set of original images, the position of the text in the image is first located by text detection, and a Mask image is generated;
S3: image restoration: the Mask image generated in step S2 is input; if text is detected in the Mask image, the text is erased with an image restoration function and the result is output as the restored image; if there is no text, the Mask image is output directly as the restored image;
S4: contour condition generation: the restored image generated in step S3 is input, edge detection is performed on it, the layout contour is obtained and a contour map is generated, and the contour map serves as the guidance and constraint condition for image generation;
S5: guide-text generation: the forward prompt text (Prompt) from step S1 is input to detect the generation scene of the image, and a fixed positive keyword prompt (Prompt) and negative keyword prompt (Negative Prompt) are set as guide texts for the different scenes;
S6: image contour guidance: an implicit diffusion model supporting external input conditions is set up; the guide text generated in step S5 is input, the contour map generated in step S4 is used as the external condition, conditional contour guidance is performed in the implicit diffusion model, and the final image is generated and output.
2. The image generating method based on multi-modal searching and contour guiding as claimed in claim 1, wherein the vectorization processing in S1 is as follows: the text and the image are encoded with a text encoder and an image encoder respectively, and the cosine similarity cos(θ) between the text vector x and the image vector y (with components x_i and y_i) is calculated:
\cos(\theta) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2}\sqrt{\sum_i y_i^2}} (1)
If the cosine similarity exceeds a set threshold, the image meets the similarity condition; all images meeting the similarity condition are compiled into a set, and one image is randomly selected from the set as the original image.
3. The image generation method based on multi-mode searching and contour guiding as claimed in claim 1, wherein the method of image restoration in S3 is as follows: and detecting characters in the image by using OCR, marking the positions of the characters, generating a Mask image with marked characters, and performing image restoration by using Stable Diffusion in the Diffusion model to erase the characters so as to integrate the characters with the background of the picture.
4. The image generating method based on multi-mode searching and contour guiding as claimed in claim 3, characterized in that the method by which the diffusion model performs image restoration and erases text is: in the diffusion model the positive prompt text (Prompt) is left blank, the negative prompt text (Negative Prompt) is set to text, poster text and painted text, and the CFG value is raised to strengthen the influence of the text, so that the features obtained from text encoding guide the generation process to keep only the image background and remove the text.
5. The image generating method based on multi-mode searching and contour guiding as defined in claim 1, wherein the method of edge detection in S4 is as follows: and setting an edge detection algorithm based on different scenes where the original image is, acquiring the outline of the image main body, and generating a Mask image with only gray information as an outline map.
6. The method for generating the image based on multi-modal searching and contour guidance as set forth in claim 1, wherein in S6 the loss function L_LDM of the implicit diffusion model supporting external input conditions is set as:
L_{LDM} = \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0,1),\, t}\left[\left\| \epsilon - \epsilon_\theta(z_t, t, \tau_\theta(y)) \right\|_2^2\right] (2)
where ε ~ N(0,1) represents the Gaussian sampling process; ε_θ(z_t, t, τ_θ(y)) represents the time-sequential denoising auto-encoding process; τ_θ represents the encoder for the external condition, and τ_θ(y) maps the external condition y to an intermediate-layer representation.
7. The image generating method based on multi-modal searching and contour guiding as set forth in claim 1, wherein the method of conditional contour guiding in the implicit diffusion model in S6 is:
s61: encoding the input text and image into features by an encoder, and the encoder aligning the text features and image feature representations;
s62: concat connection is carried out on the features coded in the S61, the features are combined into one feature, and random noise is added to the feature so that images generated each time are different; if only text features are available, no Concat connection is required;
s63: the features added with random noise in the step S63 are sent to a feature predictor UNet for noise reduction and diffusion and iteration, and features which are closer to a real image are generated in an implicit space; the noise reduction diffusion process of the diffusion model is as follows:
(3)
wherein ,representing a gaussian sampling process; />Indicating that it introduces an implicit encoder->,/>The method comprises the steps of carrying out a first treatment on the surface of the t= … T is a time-series denoising self-encoding process, which is based on the input +.>Predicting the corresponding denoised result, wherein +.>Is the result of inputting x after adding noise;
s64: features in the implicit space are converted into images in the pixel space by the variational self-encoder VAE and output.
8. The image generating method based on multi-modal retrieval and contour guidance as claimed in claim 7, wherein in S61, the method of encoding text and images into features by an encoder is:
z = \mathcal{E}(x) (4)
where E denotes the introduced implicit encoder, and z = E(x) converts the input x into the latent representation z, so that text or image features can be represented in the implicit space; when an image is input to the diffusion model, the model first encodes the input image into the implicit feature space through the image encoder, while the input text is encoded into the same implicit feature space through the text encoder.
9. The image generating method based on multi-modal searching and contour guiding as claimed in claim 7, wherein in S63 the UNet has a front-back symmetric structure: the first half contains 8 main encoding blocks and the second half contains 8 main decoding blocks; each block contains a convolutional neural network (ResNet) and an image attention network (ViT); the convolutional neural network ResNet handles the encoding and decoding of global features; the image attention network ViT includes cross-attention and self-attention mechanisms and handles the encoding and decoding of local features.
10. The image generating method based on multi-modal searching and contour guiding as claimed in claim 7, wherein in S63 the method of noise-reduction diffusion by the feature predictor UNet is as follows: a small network E(·) consisting of four convolutional layers with 4 x 4 kernels and 2 x 2 strides, with channel widths 16, 32, 64 and 128, is used as an encoder to encode the image-space condition c_i into a feature map:
c_f = \mathcal{E}(c_i) (5)
This converts the condition into a feature map; the network converts a 512 x 512 image condition into a 64 x 64 feature map.
11. The method for generating an image based on multi-modal search and contour guidance as set forth in claim 7, wherein in S63, the degree of influence of text features is controlled by controlling the adjustment degree of text prompts to diffusion process by using numerical variable CFG embedded in the Unet in the iterative process.
CN202310919649.2A 2023-07-26 2023-07-26 Image generation method based on multi-mode retrieval and contour guidance Active CN116630482B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310919649.2A CN116630482B (en) 2023-07-26 2023-07-26 Image generation method based on multi-mode retrieval and contour guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310919649.2A CN116630482B (en) 2023-07-26 2023-07-26 Image generation method based on multi-mode retrieval and contour guidance

Publications (2)

Publication Number Publication Date
CN116630482A true CN116630482A (en) 2023-08-22
CN116630482B CN116630482B (en) 2023-11-03

Family

ID=87613903

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310919649.2A Active CN116630482B (en) 2023-07-26 2023-07-26 Image generation method based on multi-mode retrieval and contour guidance

Country Status (1)

Country Link
CN (1) CN116630482B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725247A (en) * 2024-02-07 2024-03-19 北京知呱呱科技有限公司 Diffusion image generation method and system based on retrieval and segmentation enhancement

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220005235A1 (en) * 2020-07-06 2022-01-06 Ping An Technology (Shenzhen) Co., Ltd. Method and device for text-based image generation
CN116012488A (en) * 2023-01-05 2023-04-25 网易(杭州)网络有限公司 Stylized image generation method, device, computer equipment and storage medium
KR20230059524A (en) * 2021-10-26 2023-05-03 삼성에스디에스 주식회사 Method and apparatus for analyzing multimodal data
CN116452706A (en) * 2023-04-23 2023-07-18 中国工商银行股份有限公司 Image generation method and device for presentation file
CN116452410A (en) * 2023-03-10 2023-07-18 浙江工业大学 Text-guided maskless image editing method based on deep learning

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220005235A1 (en) * 2020-07-06 2022-01-06 Ping An Technology (Shenzhen) Co., Ltd. Method and device for text-based image generation
KR20230059524A (en) * 2021-10-26 2023-05-03 삼성에스디에스 주식회사 Method and apparatus for analyzing multimodal data
CN116012488A (en) * 2023-01-05 2023-04-25 网易(杭州)网络有限公司 Stylized image generation method, device, computer equipment and storage medium
CN116452410A (en) * 2023-03-10 2023-07-18 浙江工业大学 Text-guided maskless image editing method based on deep learning
CN116452706A (en) * 2023-04-23 2023-07-18 中国工商银行股份有限公司 Image generation method and device for presentation file

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHOU Zuowei; QIAN Zhenzhen: "Image Editing Using Natural Language Text Descriptions", Electronic Technology & Software Engineering, no. 01, pages 119-121 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117725247A (en) * 2024-02-07 2024-03-19 北京知呱呱科技有限公司 Diffusion image generation method and system based on retrieval and segmentation enhancement
CN117725247B (en) * 2024-02-07 2024-04-26 北京知呱呱科技有限公司 Diffusion image generation method and system based on retrieval and segmentation enhancement

Also Published As

Publication number Publication date
CN116630482B (en) 2023-11-03

Similar Documents

Publication Publication Date Title
CN107644006B (en) Automatic generation method of handwritten Chinese character library based on deep neural network
CN108875935B (en) Natural image target material visual characteristic mapping method based on generation countermeasure network
CN111861945B (en) Text-guided image restoration method and system
CN115393396B (en) Unmanned aerial vehicle target tracking method based on mask pre-training
CN116630482B (en) Image generation method based on multi-mode retrieval and contour guidance
CN116721221A (en) Multi-mode-based three-dimensional content generation method, device, equipment and storage medium
CN115880762B (en) Human-machine hybrid vision-oriented scalable face image coding method and system
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
US11823432B2 (en) Saliency prediction method and system for 360-degree image
CN116912257B (en) Concrete pavement crack identification method based on deep learning and storage medium
CN117635771A (en) Scene text editing method and device based on semi-supervised contrast learning
CN117218246A (en) Training method and device for image generation model, electronic equipment and storage medium
Lin Comparative Analysis of Pix2Pix and CycleGAN for image-to-image translation
CN116935292B (en) Short video scene classification method and system based on self-attention model
CN116523985B (en) Structure and texture feature guided double-encoder image restoration method
CN116863053A (en) Point cloud rendering enhancement method based on knowledge distillation
CN109035318B (en) Image style conversion method
CN116630369A (en) Unmanned aerial vehicle target tracking method based on space-time memory network
CN112487992B (en) Stream model-based face emotion image generation method and device
CN115115860A (en) Image feature point detection matching network based on deep learning
CN115147317A (en) Point cloud color quality enhancement method and system based on convolutional neural network
CN115019319A (en) Structured picture content identification method based on dynamic feature extraction
CN113223038A (en) Discrete cosine transform-based mask representation instance segmentation method
CN111476867A (en) Hand-drawn sketch generation method based on variational self-coding and generation countermeasure network
CN113674369B (en) Method for improving G-PCC compression by deep learning sampling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant