CN114399668A - Natural image generation method and device based on hand-drawn sketch and image sample constraint - Google Patents

Natural image generation method and device based on hand-drawn sketch and image sample constraint

Info

Publication number: CN114399668A
Application number: CN202111617371.0A
Authority: CN (China)
Prior art keywords: image, content, natural, natural image, training data
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 高成英, 袁梦丽, 许琦
Current assignee: Sun Yat-sen University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original assignee: Sun Yat-sen University
Application filed by Sun Yat-sen University
Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods


Abstract

The invention discloses a natural image generation method and apparatus based on hand-drawn sketch and image sample constraints. The method comprises the following steps: first, acquiring original natural images and category information and constructing a training data set; then constructing a natural image generation model comprising a generator and a multi-task discriminator, wherein the generator generates natural images during training, and the multi-task discriminator judges both whether a generated natural image is real and which category it belongs to; then training the natural image generation model on the training data set and adjusting its parameters to obtain a trained target model; and finally, inputting a target image sample and a target hand-drawn sketch into the target model to generate a natural image constrained by the target hand-drawn sketch and the image sample. The invention improves convenience and controllability, and can be widely applied in the technical field of image processing.

Description

Natural image generation method and device based on hand-drawn sketch and image sample constraint
Technical Field
The invention relates to the technical field of image processing, in particular to a natural image generation method and device based on hand-drawn sketches and image sample constraints.
Background
With the rapid development of conditional generative adversarial networks, a variety of such networks based on different constraints have emerged, for example networks using edge maps, natural images, hand-drawn sketches, or semantic segmentation maps as constraints. However, these networks can only control content attributes of the generated image such as pose and shape, and the controllability of the generated image is not high enough.
More recently, methods have appeared that use key points, edge maps, or natural images alone to control the generated image content. However, key points are too abstract to express user intent well, and with an edge map or a natural image the user often cannot find a suitable input to express their intent, so these approaches are not convenient enough.
Disclosure of Invention
In view of this, embodiments of the present invention provide a natural image generation method and apparatus based on hand-drawn sketch and image sample constraints, which are highly convenient and controllable.
The invention provides a natural image generation method based on hand-drawn sketch and image sample constraint, which comprises the following steps:
acquiring an original natural image and category information, and constructing a training data set; wherein the training data set comprises content images and image samples;
constructing a natural image generation model, wherein the natural image generation model comprises a generator and a multi-task discriminator, the generator is used for generating natural images in the training process, and the multi-task discriminator is used for judging whether the generated natural images are real or not in the training process and judging the category of the generated natural images;
training the natural image generation model through the training data set, and adjusting parameters of the natural image generation model to obtain a trained target model;
and inputting the target image sample and the target hand-drawn sketch into the target model, and generating a natural image based on the target hand-drawn sketch and the image sample constraint.
Optionally, the acquiring the original natural image and the category information to construct a training data set includes:
acquiring an edge map of the original natural image through an edge map extraction algorithm;
pairing the edge map with its corresponding natural image to form one kind of training data pair in the data set, and also pairing the edge map with a random natural image to form another kind of training data pair in the data set;
and constructing the training data set from the two kinds of training data pairs, wherein the edge map serves as the content image and the natural image serves as the image sample.
Optionally, the constructing a natural image generation model includes:
constructing a content encoder, wherein the content encoder comprises five convolution modules and two residual modules, and is used for extracting content characteristics in input data;
constructing a style encoder, wherein the style encoder is used for extracting style characteristics of an image sample in input data;
and constructing a content decoder, wherein the content decoder is used for acquiring affine transformation parameters according to the style characteristics and generating pictures according to the content characteristics and the affine transformation parameters.
Optionally, the content feature extraction formula is:
Z_content = E_content(X_edge or sketch)
where Z_content is the content feature, E_content is the content encoder, and X_edge or sketch is the edge map or hand-drawn sketch serving as the content image.
Optionally, the method further includes a step of performing style migration using adaptive instance normalization (AdaIN), which specifically includes:
processing the style features through three fully connected layers to obtain affine transformation parameters;
injecting the style features, according to the affine transformation parameters, into the AdaIN Resblock modules of the content decoder, which perform style migration using adaptive instance normalization;
wherein adaptive instance normalization is calculated as:
AdaIN(z_content, z_reference) = σ(z_reference) * (z_content - μ(z_content)) / σ(z_content) + μ(z_reference)
where AdaIN(z_content, z_reference) is the result of adaptive instance normalization, z_content is the content feature, z_reference is the style feature, μ denotes the mean, and σ denotes the standard deviation.
Optionally, in the step of training the natural image generation model through the training data set, and adjusting parameters of the natural image generation model to obtain a trained target model,
the reconstruction loss function of the training process is:
L_rec(G) = ||G(X_edge, Y_image) - Y_image||_1
where L_rec(G) is the reconstruction loss; X_edge is the edge map in a training data pair; Y_image is the natural image in the training data pair; G(·) is the generator, which during training takes an edge map and a natural image as input and generates the target picture.
the multi-tasking discriminant loss function is:
L_GAN(D, G) = E_X[-log D_{c-1}(Y_{c-1})] + E_{X,Y}[log(1 - D_{c-1}(G(X_edge, Y_{c-1})))]
where L_GAN(D, G) is the multi-task discrimination loss, G is the generator network, and D is the discriminator network; E_X denotes the real data distribution, and D_{c-1}(Y_{c-1}) is the output of the discriminator when a real sample is input; E_{X,Y} denotes the generated sample distribution, and D_{c-1}(G(X_edge, Y_{c-1})) is the output of the discriminator when a generated sample is input; G(X_edge, Y_{c-1}) is a sample generated by the generator network; the subscript c-1 denotes the category.
The knowledge distillation loss function is:
L_Distill(G_S) = Σ_{i=1}^{N} ||G_T^i(x) - G_S^i(x)||
where L_Distill(G_S) is the knowledge distillation loss; N is the number of selected intermediate layers of the generator network; G_T^i(x) is the activation output of the i-th intermediate layer of the teacher generator network; and G_S^i(x) is the activation output of the i-th intermediate layer of the student generator network.
Another aspect of the embodiments of the present invention provides a natural image generation apparatus based on hand-drawn sketches and image sample constraints, including:
a first module, configured to acquire original natural images and category information and construct a training data set, wherein the training data set comprises content images and image samples;
a second module, configured to construct a natural image generation model, wherein the natural image generation model comprises a generator and a multi-task discriminator, the generator generates natural images during training, and the multi-task discriminator judges whether a generated natural image is real and which category it belongs to;
a third module, configured to train the natural image generation model on the training data set and adjust its parameters to obtain a trained target model;
and a fourth module, configured to input a target image sample and a target hand-drawn sketch into the target model and generate a natural image based on the target hand-drawn sketch and the image sample constraint.
Another aspect of the embodiments of the present invention provides an electronic device, including a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
Another aspect of the embodiments of the present invention provides a computer-readable storage medium storing a program, the program being executed by a processor to implement the method as described above.
Another aspect of embodiments of the present invention provides a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
In the method, original natural images and category information are first acquired to construct a training data set; a natural image generation model is then constructed, comprising a generator that generates natural images during training and a multi-task discriminator that judges whether a generated natural image is real and which category it belongs to; the model is then trained on the training data set, and its parameters are adjusted to obtain a trained target model; finally, a target image sample and a target hand-drawn sketch are input into the target model to generate a natural image constrained by the target hand-drawn sketch and the image sample. The invention thereby improves convenience and controllability.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a flowchart illustrating the overall steps provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of edge map extraction from a natural image according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating two training data pairs according to an embodiment of the present invention;
FIG. 4 is a diagram of a natural image generation model according to an embodiment of the present invention;
FIG. 5 is a diagram of natural image generation results based on hand-drawn sketch and image sample constraints according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
To address the problems in the prior art, the invention combines the advantages of using both a hand-drawn sketch and an image sample as input: the hand-drawn sketch offers the user convenience and controls the content of the generated image, while the image sample controls styles such as the texture of the generated image. The invention provides a fine-grained generation framework that is based on a hand-drawn sketch and uses an image sample as an additional constraint. Within this framework, to address the difficulty of constructing a mapping between hand-drawn sketches and natural images, a content encoder is designed that extracts general semantic features from cross-domain images; that is, it extracts correct semantic features whether an edge map or a hand-drawn sketch is given as input. In addition, drawing on the idea of knowledge distillation, a knowledge distillation loss is introduced into the image generation process, improving image generation quality without changing the network architecture.
The following detailed description of the specific implementation principles of the present invention is made with reference to the accompanying drawings:
As shown in FIG. 1, the invention discloses a natural image generation method based on hand-drawn sketch and image sample constraints. First, natural images and hand-drawn sketch data are collected to make a data set; then a natural image generation model is constructed and trained using the loss functions; finally, the user provides a hand-drawn sketch and an image sample as input, and the generator in the natural image generation model produces a natural image whose pose follows the hand-drawn sketch and whose texture follows the image sample, yielding the final result. Specifically, the method mainly comprises the following steps:
Step 1: collect natural images and hand-drawn sketches to make the data sets, where the edge maps and sketches are the content images and the natural images are the image samples. This comprises the following steps 1-1 and 1-2:
Step 1-1: collect natural images and their corresponding category information. First obtain an edge map of each natural image using an edge map extraction algorithm; the result of extracting an edge map from a natural image is shown in FIG. 2. Pair each edge map with its corresponding natural image, and also pair the edge map with a random natural image; these two kinds of training data pairs are used to construct the training data set, as shown in FIG. 3.
Step 1-2: collect hand-drawn sketches and randomly assign 10 natural images to each sketch as image samples to construct a test data set.
Step 2: the method comprises the following steps of constructing a natural image generation model, wherein the model comprises a generator and a multi-task discriminator, the generator consists of a content encoder which is responsible for extracting input content characteristics, a style encoder which is responsible for extracting input image sample characteristics and a decoder, and the multi-task discriminator consists of a plurality of two-classification discriminators and comprises the following steps of 2-1 to 2-5:
Step 2-1: construct the content encoder, which is responsible for extracting the content features of the input content image. It is designed as a seven-layer convolutional network comprising five convolution modules Conv-64, Conv-128, Conv-256, Conv-512 and two residual modules Resblock-512 and Resblock-512. The process of extracting the content features of the content image is expressed as:
Z_content = E_content(X_edge or sketch)
where Z_content is the content feature, E_content is the content encoder, and X_edge or sketch is the content image (an edge map or a sketch); in the network structure, "Conv-" denotes a convolution block, "Resblock-" denotes a residual block, and the number denotes the number of output feature channels.
Step 2-2: construct the style encoder, which is responsible for extracting the style features of the input image sample. It is designed as a seven-layer network comprising Conv-64, Conv-128, Conv-256, Conv-512, Conv-1024, AvgPooling and Conv-8. The process of extracting the style features of the image sample is expressed as:
Z_reference = E_reference(Y_{c-1})
where Z_reference is the style feature, E_reference is the style encoder, and Y_{c-1} is the input image sample, with the subscript c-1 denoting the class of the input image sample; in the network structure, "Conv-" denotes a convolution block, AvgPooling denotes an average pooling layer, and the number denotes the number of output feature channels.
Step 2-3: construct the content decoder, which takes the content features as input and performs style migration with the style features using adaptive instance normalization (AdaIN) to obtain the final generated picture. The content decoder is designed as two AdaIN Resblock-512 modules and five convolution modules Conv-512, Conv-256, Conv-128, Conv-64 and Conv-3. The process by which the content decoder obtains the final generated picture from the content features and style features is expressed as:
Ŷ_{c-1} = Decoder(Z_content, Z_reference)
where Z_content is the content feature, Z_reference is the style feature, Decoder is the content decoder, and Ŷ_{c-1} is the final generated picture, with the subscript c-1 denoting the category of the generated picture; in the network structure, "Conv-" denotes a convolution block, AdaIN Resblock denotes an AdaIN residual block, and the number denotes the number of output feature channels.
The specific steps of the content decoder for style migration using adaptive instance normalization (AdaIN) are as follows:
step 2-4: obtaining affine transformation parameters required by the ith AdaIN reblock module after the style characteristics pass through the first three full-connection layers
Figure BDA0003436958740000083
The specific calculation is as follows
Figure BDA0003436958740000084
Wherein, WTAnd b is the offset of the full connection layer, and the full connection layer converts the output into a vector form to realize feature transformation.
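The mapping in step 2-4 is an affine (fully connected) transform of the style feature into per-channel scale and shift parameters. A minimal NumPy sketch follows; the style dimension and channel count are illustrative assumptions, and a single layer stands in for the three fully connected layers:

```python
import numpy as np

def affine_params(z_reference, W, b):
    """Predict the AdaIN affine parameters (sigma_i, mu_i) from a style feature.

    z_reference : (d,) style feature vector
    W           : (d, 2*c) fully connected weight
    b           : (2*c,) fully connected bias
    Returns (sigma_i, mu_i), each of shape (c,): one scale and one shift
    per feature channel of the i-th AdaIN Resblock.
    """
    out = W.T @ z_reference + b   # the W^T z_reference + b transform of step 2-4
    c = out.shape[0] // 2
    return out[:c], out[c:]

rng = np.random.default_rng(0)
d, c = 8, 512                     # assumed style dim (Conv-8 output) and channel count
W = rng.normal(size=(d, 2 * c))
b = np.zeros(2 * c)
sigma_i, mu_i = affine_params(rng.normal(size=d), W, b)
```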
Step 2-5: using affine transformation parameters
Figure BDA0003436958740000085
And (3) injecting the style characteristics into an AdaIN Resblock module of a decoder model, wherein the injection method is specifically calculated as follows
Figure BDA0003436958740000091
Wherein σi(zreference) And ui(zreference) Respectively representing predicted affine transformation parameters according to style characteristics
Figure BDA0003436958740000092
x is a content characteristic Zcontent,zreferenceFor the style characteristics, μ represents the mean and σ represents the variance.
Step 3: train the natural image generation model on the training data, adjusting the parameters of the natural image generation model with the loss functions in each training round. The specific loss functions are as follows:
Reconstruction loss function:
L_rec(G) = ||G(X_edge, Y_image) - Y_image||_1
where X_edge is the content image, Y_image is the image sample, and G is the generator network.
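The reconstruction loss is a plain L1 distance between the picture generated from the pair (X_edge, Y_image) and Y_image itself. A NumPy check with toy arrays standing in for G's output and the target; averaging the norm over pixels is an implementation choice the text leaves open:

```python
import numpy as np

def l1_reconstruction_loss(generated, target):
    """L_rec(G) = ||G(X_edge, Y_image) - Y_image||_1, taken here as mean absolute error."""
    return np.abs(generated - target).mean()

target = np.ones((3, 4, 4))                 # toy stand-in for Y_image
generated = np.full((3, 4, 4), 0.75)        # toy stand-in for G(X_edge, Y_image)
loss = l1_reconstruction_loss(generated, target)
```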
Multi-task discrimination loss function:
L_GAN(D, G) = E_X[-log D_{c-1}(Y_{c-1})] + E_{X,Y}[log(1 - D_{c-1}(G(X_edge, Y_{c-1})))]
where G is the generator network and D is the discriminator network; E_X denotes the real data distribution, and D_{c-1}(Y_{c-1}) is the output of the discriminator when a real sample is input; E_{X,Y} denotes the generated sample distribution, and D_{c-1}(G(X_edge, Y_{c-1})) is the output of the discriminator when a generated sample is input; G(X_edge, Y_{c-1}) is a sample generated by the generator network; the subscript c-1 denotes the category.
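For a single class c-1, the multi-task discrimination loss reduces to two scalar terms computed from the discriminator's probability outputs for a real sample and a generated sample. The sketch below evaluates the formula term by term; the probability values are arbitrary illustrative inputs:

```python
import math

def multitask_d_loss(d_real, d_fake):
    """One-class term of L_GAN(D, G):
    -log D_{c-1}(Y_{c-1}) + log(1 - D_{c-1}(G(X_edge, Y_{c-1}))).

    d_real, d_fake are discriminator probability outputs in (0, 1).
    """
    return -math.log(d_real) + math.log(1.0 - d_fake)

# Illustrative values: the discriminator is fairly sure the real sample is
# real (0.8) and fairly sure the generated sample is fake (0.3).
loss = multitask_d_loss(d_real=0.8, d_fake=0.3)
```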
Knowledge distillation loss function:
L_Distill(G_S) = Σ_{i=1}^{N} ||G_T^i(x) - G_S^i(x)||
where G_T^i(x) and G_S^i(x) are the activation outputs of the i-th intermediate layers of the teacher generator network and the student generator network respectively; both networks share the generator network structure; N is the total number of selected intermediate layers, here N = 6.
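The knowledge distillation loss sums a per-layer distance between matching intermediate activations of the teacher and student generators over the N = 6 selected layers. A NumPy sketch; the choice of an elementwise L1 norm is an assumption, since the text does not specify which norm is used:

```python
import numpy as np

def distillation_loss(teacher_acts, student_acts):
    """L_Distill(G_S) = sum_i ||G_T^i(x) - G_S^i(x)|| over N intermediate layers."""
    assert len(teacher_acts) == len(student_acts)
    return sum(np.abs(t - s).sum() for t, s in zip(teacher_acts, student_acts))

rng = np.random.default_rng(0)
teacher = [rng.normal(size=(8, 4, 4)) for _ in range(6)]   # N = 6 intermediate layers
student = [t + 0.1 for t in teacher]                        # student off by 0.1 everywhere
loss = distillation_loss(teacher, student)
```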
Step 4: input a sketch from the test data set together with any natural image into the trained natural image generation model to realize natural image generation based on the hand-drawn sketch and image sample constraints; the generation result is shown in FIG. 5.
In summary, the invention combines the advantages of using a hand-drawn sketch and an image sample as input: the hand-drawn sketch provides convenience to the user and controls the generated image content, while the image sample controls styles such as the texture of the generated image.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more thorough understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise stated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in a separate physical device or software module. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer, given the nature, function, and internal relationship of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. The natural image generation method based on the hand-drawn sketch and the image sample constraint is characterized by comprising the following steps of:
acquiring an original natural image and category information, and constructing a training data set; wherein the training data set comprises content images and image samples;
constructing a natural image generation model, wherein the natural image generation model comprises a generator and a multi-task discriminator, the generator is used for generating natural images in the training process, and the multi-task discriminator is used for judging whether the generated natural images are real or not in the training process and judging the category of the generated natural images;
training the natural image generation model through the training data set, and adjusting parameters of the natural image generation model to obtain a trained target model;
and inputting the target image sample and the target hand-drawn sketch into the target model, and generating a natural image based on the target hand-drawn sketch and the image sample constraint.
2. The method for generating a natural image based on a hand-drawn sketch and image example constraint according to claim 1, wherein the obtaining of the original natural image and the category information and the construction of the training data set comprise:
acquiring an edge map of the original natural image through an edge map extraction algorithm;
pairing the edge map with its corresponding natural image to form a training data pair in a data set, and also pairing the edge map with a random natural image to form a training data pair in the data set;
and constructing the training data set from the two kinds of training data pairs, wherein the edge map serves as the content image and the natural image serves as the image sample.
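The pairing procedure of claim 2 can be sketched as follows; the gradient-magnitude edge extractor is only a stand-in for the patent's unspecified edge map extraction algorithm, and the helper names are illustrative:

```python
import numpy as np

def edge_map(img):
    """Toy edge extractor: thresholded gradient magnitude of a 2-D
    grayscale image (a stand-in for the real edge extraction algorithm)."""
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    return (mag > mag.mean()).astype(np.uint8)

def build_training_pairs(images, rng=None):
    """Pair each edge map with its own image and with a random image,
    yielding the two kinds of training data pairs described in claim 2."""
    rng = rng if rng is not None else np.random.default_rng(0)
    pairs = []
    for img in images:
        e = edge_map(img)
        pairs.append((e, img))                                 # matched pair
        pairs.append((e, images[rng.integers(len(images))]))   # random pair
    return pairs
```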
3. The method for generating natural images based on sketching and image sample constraints as claimed in claim 1, wherein the constructing a natural image generation model comprises:
constructing a content encoder, wherein the content encoder comprises five convolution modules and two residual modules, and is used for extracting content characteristics in input data;
constructing a style encoder, wherein the style encoder is used for extracting style characteristics of an image sample in input data;
and constructing a content decoder, wherein the content decoder is used for acquiring affine transformation parameters according to the style characteristics and generating pictures according to the content characteristics and the affine transformation parameters.
4. The method of generating natural images based on sketching and image sample constraints as recited in claim 3,
the extraction formula of the content features is as follows:
$$Z_{content} = E_{content}(X_{edge\ or\ sketch})$$
wherein $Z_{content}$ is the content feature, $E_{content}$ is the content encoder, and $X_{edge\ or\ sketch}$ is the edge map or hand-drawn sketch serving as the content image.
5. The method for generating natural images based on sketching and image sample constraints as claimed in claim 3, wherein said method further comprises a step of performing style migration using adaptive instance normalization, said step comprising:
processing the style characteristics through three full connection layers to obtain affine transformation parameters;
inputting the style characteristics into an AdaIN Resblock module of the content decoder according to the affine transformation parameters, and performing style migration by using self-adaptive example normalization through the AdaIN Resblock module;
wherein the calculation expression of the adaptive instance normalization is as follows:
$$\mathrm{AdaIN}(z_{content}, z_{reference}) = \sigma(z_{reference})\,\frac{z_{content} - \mu(z_{content})}{\sigma(z_{content})} + \mu(z_{reference})$$
wherein $\mathrm{AdaIN}(z_{content}, z_{reference})$ represents the result of the adaptive instance normalization, $z_{content}$ represents the content feature, $z_{reference}$ represents the style feature, $\mu(\cdot)$ denotes the mean, and $\sigma(\cdot)$ denotes the standard deviation.
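Adaptive instance normalization as used in claim 5 can be sketched in NumPy; the (C, H, W) layout and the per-channel spatial statistics are conventional assumptions rather than details stated in the patent:

```python
import numpy as np

def adain(z_content, z_reference, eps=1e-5):
    """AdaIN(z_c, z_r) = sigma(z_r) * (z_c - mu(z_c)) / sigma(z_c) + mu(z_r).

    Statistics are taken per channel over the spatial axes; inputs are
    (C, H, W) feature maps, and eps keeps the division stable.
    """
    mu_c = z_content.mean(axis=(1, 2), keepdims=True)
    sigma_c = z_content.std(axis=(1, 2), keepdims=True) + eps
    mu_r = z_reference.mean(axis=(1, 2), keepdims=True)
    sigma_r = z_reference.std(axis=(1, 2), keepdims=True) + eps
    return sigma_r * (z_content - mu_c) / sigma_c + mu_r
```

The content feature keeps its spatial structure while adopting the style feature's channel-wise mean and standard deviation, which is how the AdaIN Resblock performs style migration.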
6. The method according to claim 1, wherein, in the step of training the natural image generation model on the training data set and adjusting parameters of the natural image generation model to obtain the trained target model,
the reconstruction loss function of the training process is:
$$L_{rec}(G) = \left\| G(X_{edge}, Y_{image}) - Y_{image} \right\|_1$$
wherein $L_{rec}(G)$ represents the reconstruction loss; $X_{edge}$ represents the edge map in a training data pair; $Y_{image}$ represents the natural image in the training data pair; $G(\cdot)$ represents the generator model, which during training takes an edge map and a natural image as input and generates a target picture;
the multi-tasking discriminant loss function is:
$$L_{GAN}(D, G) = \mathbb{E}_{X}\left[-\log D_{c-1}(Y_{c-1})\right] + \mathbb{E}_{X,Y}\left[\log\left(1 - D_{c-1}(G(X_{edge}, Y_{c-1}))\right)\right]$$
wherein $L_{GAN}(D, G)$ represents the multi-task discrimination loss, $G$ represents the generator network, and $D$ represents the discriminator network; $\mathbb{E}_X$ represents the expectation over the real data distribution, and $D_{c-1}(Y_{c-1})$ represents the output of the discriminator when a real sample is input; $\mathbb{E}_{X,Y}$ represents the expectation over the generated sample distribution, and $D_{c-1}(G(X_{edge}, Y_{c-1}))$ represents the output of the discriminator when a generated sample is input; $G(X_{edge}, Y_{c-1})$ represents a sample generated by the generator network; the subscript $c-1$ indicates the category.
The knowledge distillation loss function is:
$$L_{Distill}(G_S) = \sum_{i=1}^{N} \left\| a_T^{(i)} - a_S^{(i)} \right\|_1$$
wherein $L_{Distill}(G_S)$ represents the knowledge distillation loss; $N$ represents the number of selected intermediate layers of the generator network; $a_T^{(i)}$ represents the activation output of the $i$-th intermediate layer of the teacher generator network; $a_S^{(i)}$ represents the activation output of the $i$-th intermediate layer of the student generator network.
7. A natural image generating apparatus based on hand-drawn sketches and image sample constraints, comprising:
a first module, configured to acquire an original natural image and category information and to construct a training data set, wherein the training data set comprises content images and image samples;
a second module, configured to construct a natural image generation model, wherein the natural image generation model comprises a generator and a multi-task discriminator, the generator being used for generating natural images in the training process, and the multi-task discriminator being used for judging whether a generated natural image is real and for judging the category of the generated natural image;
a third module, configured to train the natural image generation model through the training data set, adjusting parameters of the natural image generation model to obtain a trained target model;
and a fourth module, configured to input a target image sample and a target hand-drawn sketch into the target model and to generate a natural image based on the target hand-drawn sketch and the image sample constraint.
8. An electronic device comprising a processor and a memory;
the memory is used for storing programs;
the processor executing the program realizes the method of any one of claims 1 to 6.
9. A computer-readable storage medium, characterized in that the storage medium stores a program, which is executed by a processor to implement the method according to any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the method of any of claims 1 to 6 when executed by a processor.
CN202111617371.0A 2021-12-27 2021-12-27 Natural image generation method and device based on hand-drawn sketch and image sample constraint Pending CN114399668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111617371.0A CN114399668A (en) 2021-12-27 2021-12-27 Natural image generation method and device based on hand-drawn sketch and image sample constraint


Publications (1)

Publication Number Publication Date
CN114399668A true CN114399668A (en) 2022-04-26

Family

ID=81228403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111617371.0A Pending CN114399668A (en) 2021-12-27 2021-12-27 Natural image generation method and device based on hand-drawn sketch and image sample constraint

Country Status (1)

Country Link
CN (1) CN114399668A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115358917A (en) * 2022-07-14 2022-11-18 北京汉仪创新科技股份有限公司 Method, device, medium and system for transferring non-aligned faces in hand-drawing style
CN115358917B (en) * 2022-07-14 2024-05-07 北京汉仪创新科技股份有限公司 Method, equipment, medium and system for migrating non-aligned faces of hand-painted styles
CN115496824A (en) * 2022-09-27 2022-12-20 北京航空航天大学 Multi-class object-level natural image generation method based on hand drawing
CN115496824B (en) * 2022-09-27 2023-08-18 北京航空航天大学 Multi-class object-level natural image generation method based on hand drawing
CN116542321A (en) * 2023-07-06 2023-08-04 中科南京人工智能创新研究院 Image generation model compression and acceleration method and system based on diffusion model
CN116542321B (en) * 2023-07-06 2023-09-01 中科南京人工智能创新研究院 Image generation model compression and acceleration method and system based on diffusion model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination