CN115496843A - Local realistic cartoon style migration system and method based on GAN - Google Patents


Info

Publication number
CN115496843A
Authority
CN
China
Prior art keywords
image
background
network
convolution
cartoon
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110608065.4A
Other languages
Chinese (zh)
Inventor
黄国方
周宁宁
孙天鹏
张静
单超
周兴俊
刘晓铭
郝永奇
钟亮民
廖志勇
陈向志
杨明鑫
彭奕
谢芬
王文政
谢永麟
甘志坚
张丛丛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NARI Group Corp
Nanjing University of Posts and Telecommunications
Nari Technology Co Ltd
State Grid Electric Power Research Institute
Original Assignee
NARI Group Corp
Nanjing University of Posts and Telecommunications
Nari Technology Co Ltd
State Grid Electric Power Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NARI Group Corp, Nanjing University of Posts and Telecommunications, Nari Technology Co Ltd, State Grid Electric Power Research Institute
Priority to CN202110608065.4A
Publication of CN115496843A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
        • G06T13/00 Animation
        • G06T13/20 3D [Three Dimensional] animation
        • G06T13/40 3D animation of characters, e.g. humans, animals or virtual beings
        • G06T5/00 Image enhancement or restoration
        • G06T5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
        • G06T2207/00 Indexing scheme for image analysis or image enhancement
        • G06T2207/20 Special algorithmic details
        • G06T2207/20081 Training; Learning
        • G06T2207/20084 Artificial neural networks [ANN]
        • G06T2207/20212 Image combination
        • G06T2207/20216 Image averaging
        • G06T2207/20221 Image fusion; Image merging
        • G06T2207/30 Subject of image; Context of image processing
        • G06T2207/30196 Human being; Person
        • G06T2207/30201 Face
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
        • G06N3/00 Computing arrangements based on biological models
        • G06N3/02 Neural networks
        • G06N3/08 Learning methods

Abstract

The invention discloses a GAN-based local realistic cartoon style migration system and method. The method uses ExpressionGAN to generate a portrait global migration map, SceneryGAN to generate a background global migration map, and a Deeplabv3+ model to generate a portrait mask map and a background mask map from the image requiring local style migration; the portrait local migration map and the background local migration map are finally obtained by fusion. The invention introduces a squeeze-and-excitation residual block to strengthen important features, which greatly improves the focus of training and the recovery of easily lost detail textures; with distribution-shifting convolution, lower memory usage and higher speed are achieved by storing integer values in a variable quantization kernel.

Description

Local realistic cartoon style migration system and method based on GAN
Technical Field
The invention relates to the technical field of image processing, and in particular to a GAN-based local realistic cartoon style migration system and method.
Background
Cartoons are currently a very popular form of artistic expression, widely used across many areas of society, including advertising, games, film and television, and photography. Most young people today are influenced by Japanese comics, which indeed have great influence worldwide; however, because comics are drawn by hand and then rendered by computer, they are time-consuming and labor-intensive to produce, and people without drawing skills cannot create them. It is therefore desirable to automatically convert real-world pictures into cartoon-style pictures by computer, while also allowing the style of the portrait or the background to be controlled separately to meet the needs of different users.
At present, image style transfer based on deep learning has achieved fairly good results, so deep learning has become a common approach to image-to-image translation. These methods learn the style of a style image and apply it to an input content image, generating a new image that combines the content of the content image with the style of the style image. They mainly exploit correlations between deep features and encode the visual style of images using optimization-based methods.
In 2016, Gatys et al. proposed an image style transfer method based on deep learning, which lets a computer distinguish and learn artistic styles by simulating human visual processing combined with the training of a multilayer convolutional neural network, giving the original image an artistic quality. The method achieves style transfer well, but the transfer effect is stiff, content distortion occurs, and generation is slow.
In 2015, A. Radford, L. Metz et al. proposed an unsupervised learning method based on deep convolutional generative adversarial networks, providing a new research direction for image style transfer. However, such networks need paired datasets, and acquiring the transferred counterpart images is very difficult, making the model impractical. To solve this problem, the cycle-consistent generative adversarial network (CycleGAN) was subsequently proposed: an image translation architecture that can be trained on unpaired data, solving the mismatch problem of many training datasets. But CycleGAN stylization cannot capture cartoon patterns well, and the output image does not sufficiently preserve the semantic content of the input image.
In 2018, Yang Chen et al. proposed CartoonGAN (cartoon generative adversarial network) on the basis of generative adversarial networks. It adopts a novel network architecture that can be trained with unpaired datasets and exhibits the stylistic characteristics of cartoons to the greatest extent. However, the portrait regions of images generated by CartoonGAN suffer from severely blurred color patches.
In 2019, Jie Chen et al. improved on this and proposed AnimeGAN (animation generative adversarial network), which introduces grayscale images and changes the loss function of the original CartoonGAN to eliminate the blurred color patches on characters. However, to ensure color fidelity, it loses details of the portrait and landscape parts during style transfer, including many important texture features of the face.
Disclosure of Invention
The invention aims to provide a GAN-based local realistic cartoon style migration system and method, which improve AnimeGAN and CartoonGAN with ExpressionGAN (an expression generative adversarial network) and SceneryGAN (a background generative adversarial network) respectively, and perform edge optimization on the mask maps generated by the Deeplabv3+ model to realize local realistic cartoon style migration.
In order to achieve the purpose, the invention adopts the technical scheme that:
the invention provides a local realistic cartoon style migration system based on GAN, which comprises:
the method comprises the following steps that an expression generation countermeasure network, a background generation countermeasure network, a Deeplabv3+ network and an image fusion module are adopted;
the expression generation countermeasure network is used for generating a portrait global migration diagram based on the real character image;
the background generation countermeasure network is used for generating a background global migration map based on the real background image;
the Deeplabv3+ network is used for generating a portrait mask image and a background mask image from the image which needs to be subjected to local style migration;
the image fusion module is used for fusing the real person image, the portrait mask image and the portrait global migration image to obtain a portrait local migration image; and fusing the real background image, the background mask image and the background global migration image to obtain a background local migration image.
Furthermore, the expression generative adversarial network introduces a squeeze-and-excitation residual block and a cartoon face detection module on the basis of the animation generative adversarial network (AnimeGAN);
the cartoon face detection module is used to screen the input real person images and detect images containing faces;
the squeeze-and-excitation residual block is used to enhance facial features.
Furthermore, the background generative adversarial network adopts distribution-shifting convolution instead of standard convolution on the basis of the cartoon generative adversarial network (CartoonGAN).
The invention also provides a GAN-based local realistic cartoon style migration method, which comprises the following steps:
acquiring an original dataset and dividing it into a training set and a test set; the original dataset comprises real person images, grayscale images of the real person images, real background images, cartoon images, and edge-smoothed (line-removed) cartoon images;
training the expression generative adversarial network and the background generative adversarial network with the training set;
inputting the test set images into the trained expression generative adversarial network to obtain the portrait global migration map, inputting the test set images into the trained background generative adversarial network to obtain the background global migration map, and inputting the real person image into the Deeplabv3+ network to generate the portrait mask map and the background mask map;
fusing the real person image, the portrait mask map, and the portrait global migration map to obtain the portrait local migration map; and fusing the real background image, the background mask map, and the background global migration map to obtain the background local migration map.
Further, the grayscale image of the real person image is obtained by converting the real person image via a Gram matrix;
the edge-smoothed cartoon image is obtained by applying Gaussian smoothing to the cartoon image.
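As an illustrative sketch of this preprocessing (the Gram-matrix conversion is not specified in detail here, so a plain luminance grayscale stands in for it; the function names are assumptions):

```python
import cv2
import numpy as np

def to_grayscale(img_bgr: np.ndarray) -> np.ndarray:
    # Stand-in for the Gram-matrix-based conversion described above:
    # a plain luminance grayscale replicated to three channels so the
    # network input shape is unchanged.
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)

def smooth_lines(cartoon_bgr: np.ndarray, ksize: int = 5) -> np.ndarray:
    # Gaussian smoothing of a cartoon image to produce the
    # edge-smoothed (line-removed) training images.
    return cv2.GaussianBlur(cartoon_bgr, (ksize, ksize), 0)
```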
Further, training the expression generative adversarial network with the training set comprises:
inputting the real person images in the training set into the expression generative adversarial network, screening them with the cartoon face detection module, and feeding the images in which faces are detected into the expression GAN generator;
in the generator, sequentially performing a flat convolution (7×7 kernels, 64 kernels, stride 1), a down-convolution (3×3 kernels, 128 kernels, stride 2), and a down-convolution (3×3 kernels, 256 kernels, stride 1);
after the down-convolutions, performing the facial feature enhancement operation with the squeeze-and-excitation residual block;
after the facial features are enhanced, performing two up-convolutions (3×3 kernels, 256 kernels, stride 1/2; then 3×3 kernels, 64 kernels, stride 1), followed by a standard convolution (7×7 kernels, 3 kernels, stride 1) to obtain the output image;
inputting the grayscale image of the real person image, the expression GAN generator output image, the cartoon image, and the edge-smoothed cartoon image into the expression GAN discriminator; the discriminator is a pretrained VGG-19 network;
and training the expression GAN generator through iterative learning until a termination condition is reached.
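A minimal PyTorch sketch of the generator trunk just described; the kernel sizes, channel counts, and strides come from the text, while the module layout, the tanh output, and the number of SE blocks are assumptions (the SEResidualBlock it uses is sketched after the squeeze-and-excitation formulas below):

```python
import torch
import torch.nn as nn

class ExpressionGenerator(nn.Module):
    # Flat conv -> two down-convs -> SE residual blocks -> up-convs
    # -> 7x7 output conv, following the layer list above.
    def __init__(self, n_se_blocks: int = 8):  # block count is an assumption
        super().__init__()
        self.flat = nn.Sequential(                       # 7x7, 64, stride 1
            nn.Conv2d(3, 64, 7, stride=1, padding=3),
            nn.InstanceNorm2d(64), nn.ReLU(inplace=True))
        self.down = nn.Sequential(
            nn.Conv2d(64, 128, 3, stride=2, padding=1),  # 3x3, 128, stride 2
            nn.InstanceNorm2d(128), nn.ReLU(inplace=True),
            nn.Conv2d(128, 256, 3, stride=1, padding=1), # 3x3, 256, stride 1
            nn.InstanceNorm2d(256), nn.ReLU(inplace=True))
        self.se = nn.Sequential(*[SEResidualBlock(256)
                                  for _ in range(n_se_blocks)])
        self.up = nn.Sequential(
            nn.ConvTranspose2d(256, 256, 3, stride=2,    # "stride 1/2" up-conv
                               padding=1, output_padding=1),
            nn.InstanceNorm2d(256), nn.ReLU(inplace=True),
            nn.Conv2d(256, 64, 3, stride=1, padding=1),  # 3x3, 64, stride 1
            nn.InstanceNorm2d(64), nn.ReLU(inplace=True))
        self.out = nn.Conv2d(64, 3, 7, stride=1, padding=3)  # 7x7, 3, stride 1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(self.out(self.up(self.se(self.down(self.flat(x))))))
```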
Further, the facial feature enhancement operation comprises:
performing a standard convolution after the image has been down-convolved;
globally average-pooling the output of the standard convolution and performing the squeeze computation:

Z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i,j)

wherein F_{sq}(\cdot) denotes the squeeze operation, Z_c is the squeeze result, the subscript c is the channel index, u_c denotes the c-th two-dimensional feature map, W is the width of the image, and H is the height of the image;
then performing the excitation computation:

S_c = \mathrm{Sigmoid}(W_2 \cdot \mathrm{ReLU}(W_1 Z_c))

wherein S_c is the excitation result, Sigmoid is the Sigmoid function, ReLU is the ReLU activation function, W_1 is a fully connected layer parameter, W_2 has dimension C/r, C is the number of channels, and r is a scaling coefficient;
then computing:

\tilde{x}_c = F_{scale}(u_c, S_c) = S_c \cdot u_c

and enhancing and regulating the facial features through the \tilde{x}_c values.
Further, training the background generative adversarial network with the training set comprises:
inputting the real background images in the training set into the background GAN generator network, performing two down-convolutions after three flat convolutions, then two up-convolutions after eight identical residual blocks, and finally obtaining the output image through a flat convolution;
inputting the background GAN generator output image, the cartoon image, and the edge-smoothed cartoon image into the background GAN discriminator, where the discriminator is a pretrained VGG-19 network;
and training the background GAN generator network through iterative learning until a termination condition is reached.
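A corresponding sketch of the background generator trunk (three flat convolutions, two down-convolutions, eight identical residual blocks, two up-convolutions, and a flat output convolution); the channel widths follow the CartoonGAN convention and are assumptions, and plain convolutions stand in for the DSConv layers discussed later:

```python
import torch.nn as nn

def conv_block(cin: int, cout: int, k: int, s: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=s, padding=k // 2),
        nn.InstanceNorm2d(cout), nn.ReLU(inplace=True))

class ResidualBlock(nn.Module):
    def __init__(self, c: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            conv_block(c, c, 3, 1),
            nn.Conv2d(c, c, 3, padding=1), nn.InstanceNorm2d(c))

    def forward(self, x):
        return x + self.body(x)

class BackgroundGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            conv_block(3, 64, 7, 1),                   # three flat convolutions
            conv_block(64, 64, 3, 1),
            conv_block(64, 64, 3, 1),
            conv_block(64, 128, 3, 2),                 # two down-convolutions
            conv_block(128, 256, 3, 2),
            *[ResidualBlock(256) for _ in range(8)],   # eight residual blocks
            nn.ConvTranspose2d(256, 128, 3, stride=2,  # two up-convolutions
                               padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 3, stride=2,
                               padding=1, output_padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 7, padding=3), nn.Tanh()) # flat output convolution

    def forward(self, x):
        return self.net(x)
```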
Further, the method also comprises the following steps:
the portrait mask map and the background mask map generated by the Deeplabv3+ network are convolved by 5 × 5 convolution so that the edges thereof are blurred.
Further,
the portrait and the background of the portrait mask map are set to (0, 1) respectively and then fused to obtain the portrait local migration map;
and the portrait and the background of the background mask map are inverted and then fused to obtain the background local migration map.
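The fusion itself then reduces to a mask-weighted blend; a sketch assuming the blurred mask is 1 on the region to be stylized and 0 elsewhere:

```python
import torch

def fuse(real: torch.Tensor, stylized: torch.Tensor,
         mask: torch.Tensor) -> torch.Tensor:
    # Keep stylized pixels where mask ~ 1 and original pixels where
    # mask ~ 0; passing (1 - mask) instead gives the complementary
    # (background-local) migration map.
    return mask * stylized + (1.0 - mask) * real
```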
The invention achieves the following beneficial effects:
The method of the invention introduces SE-Residual-Block and a cartoon face detection module on the basis of AnimeGAN. SE-Residual-Block avoids the loss of feature information caused by brute-force max-pooling selection; by modeling the correlations between feature channels, it strengthens important features, which greatly improves the focus of training and the recovery of easily lost detail textures.
The method of the invention replaces standard convolution with distribution-shifting convolution on the basis of CartoonGAN, achieving lower memory usage and higher speed by storing integer values in a variable quantization kernel, while maintaining the same output as the original convolution by applying kernel-based and channel-based distribution shifts.
Drawings
FIG. 1 is a diagram of the local realistic cartoon model of the present invention.
FIG. 2 is a diagram of the ExpressionGAN network architecture in the present invention.
FIG. 3 is a diagram showing the structure of SE-Residual-Block in the present invention.
FIG. 4 is a flow chart of the SE-Residual-Block operation in the present invention.
FIG. 5 is a network diagram of the SceneryGAN generator of the present invention.
Detailed Description
The invention is further described below. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
As shown in FIG. 1, the present invention proposes a new local realistic cartoon model, which is composed of ExpressionGAN, an expression generative adversarial network improved from AnimeGAN (animation generative adversarial network); SceneryGAN, a background generative adversarial network improved from CartoonGAN (cartoon generative adversarial network); and the Deeplabv3+ network model.
ExpressionGAN introduces SE-Residual-Block (squeeze-and-excitation residual block) and a cartoon face detection module on the basis of AnimeGAN. SE-Residual-Block replaces the Inverted-Residual-Block in the original AnimeGAN, which avoids the loss of feature information caused by brute-force max-pooling selection; by modeling the correlations between feature channels, SE-Residual-Block strengthens important features, which greatly improves the focus of training and the recovery of easily lost detail textures.
SceneryGAN is obtained by replacing the original Conv (standard convolution) with DSConv (distribution-shifting convolution) on the basis of CartoonGAN. DSConv achieves lower memory usage and higher speed by storing integer values in a Variable Quantization Kernel (VQK), while maintaining the same output as the original convolution by applying kernel-based and channel-based distribution shifts.
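A deliberately simplified sketch of the distribution-shifting idea (a frozen low-bit integer kernel, the VQK, rescaled by learned floating-point shifts); the full DSConv operator also applies block-wise channel shifts, which are omitted here, so this is an illustration rather than the operator itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSConv2dSketch(nn.Module):
    # Toy illustration of DSConv: integer weights live in a frozen
    # Variable Quantization Kernel; a learned per-kernel scale (the
    # distribution shift) recovers an effective floating-point filter.
    def __init__(self, cin: int, cout: int, k: int,
                 bits: int = 2, stride: int = 1, padding: int = 0):
        super().__init__()
        qmax = 2 ** (bits - 1) - 1
        vqk = torch.randint(-qmax - 1, qmax + 1, (cout, cin, k, k)).float()
        self.register_buffer("vqk", vqk)                      # integer-valued
        self.shift = nn.Parameter(torch.ones(cout, 1, 1, 1))  # per-kernel scale
        self.stride, self.padding = stride, padding

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.vqk * self.shift  # rescaled integer kernel
        return F.conv2d(x, weight, stride=self.stride, padding=self.padding)
```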
Deeplabv3+ is used to generate the mask maps, and convolution blocks are applied to the masks generated by Deeplabv3+ for edge optimization.
As an embodiment of the present invention, a GAN-based local realistic cartoon style migration method is provided, comprising the following specific steps:
step 1: downloading 5890 real pictures containing the characters from a Flicker website under the flag of Yahoo for expressongGAN training; 6153 real pictures with the size of 256 multiplied by 256 are downloaded from a website and are used in scenergaN, wherein 5402 pictures are used as a training set, and the rest pictures are used as a testing set; finally, image interception is carried out on the Miyagashi Jun movie in a keyword interception mode, 4500 cartoon images are intercepted, and the cartoon images serve as a data set shared by the expresson GAN and the SceneryGAN.
Step 2: Apply Gaussian smoothing to the cartoon images to obtain edge-smoothed cartoon images, and convert the real images into grayscale images using a Gram matrix.
Step 3: Input the obtained cartoon images into an OpenCV cascade classifier for training (e.g., via the opencv_traincascade tool).
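Once trained, the cascade can serve as the cartoon face detection module; a sketch using OpenCV's standard cascade API (the model filename is hypothetical):

```python
import cv2

# "cartoon_face.xml" is a hypothetical output of the cascade training step.
detector = cv2.CascadeClassifier("cartoon_face.xml")

def contains_face(img_bgr) -> bool:
    # Screen an input picture: keep it only if at least one face is found.
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return len(faces) > 0
```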
Step 4: Input the 5890 real person images, the grayscale maps of the 5890 real person images, the 4500 cartoon images, and the 4500 edge-smoothed cartoon images into ExpressionGAN to train the network.
Step 5: Input the 5402 real background images, the 4500 cartoon images, and the 4500 edge-smoothed cartoon images into SceneryGAN to train the network.
Step 6: Input the test set images from Step 1 into the trained ExpressionGAN and SceneryGAN to generate the global cartoon style migration maps.
Step 7: Input the real person picture into Deeplabv3+ to obtain the portrait mask map and the background mask map.
Step 8: Perform edge optimization on the portrait mask map and the background mask map by convolution.
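A sketch of Steps 7 and 8; torchvision's pretrained DeepLabv3 and the VOC 'person' class index 15 are stand-ins for the patent's own trained Deeplabv3+ model:

```python
import torch
from torchvision.models.segmentation import deeplabv3_resnet101

seg_model = deeplabv3_resnet101(weights="DEFAULT").eval()

@torch.no_grad()
def portrait_mask(img: torch.Tensor) -> torch.Tensor:
    # img: (1, 3, H, W), normalized as the backbone expects.
    logits = seg_model(img)["out"]
    mask = (logits.argmax(1, keepdim=True) == 15).float()  # 15 = 'person'
    return blur_mask_edges(mask)  # 5x5 edge blur from the earlier sketch
```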
Step 9: Fuse the global cartoon style migration map generated by ExpressionGAN, the 5402 real images, and the edge-optimized portrait mask map to obtain the portrait local migration map. Fuse the global cartoon style migration map generated by SceneryGAN, the real image, and the edge-optimized background mask map to obtain the background local migration map.
As another embodiment of the present invention, a GAN-based local realistic cartoon style migration method comprises the following specific steps:
(1) The 5890 real person images, the grayscale maps of the 5890 real person images, the 4500 cartoon images, and the 4500 edge-smoothed cartoon images are input into the network, with the 5890 real person images input into the ExpressionGAN generator network, and the grayscale maps, the 4500 cartoon images, and the 4500 edge-smoothed cartoon images input into the ExpressionGAN discriminator network. The generator and discriminator structure is shown in FIG. 2.
(2) The 5890 real person images are input into the ExpressionGAN generator network; the pictures in the dataset are first screened by the cartoon face detection module, and the images in which faces are detected are fed into the generator.
(3) The picture entering the ExpressionGAN generator network first undergoes a flat convolution with 7×7 kernels (64 kernels, stride 1), followed by a down-convolution with 3×3 kernels (128 kernels, stride 2) and a down-convolution with 3×3 kernels (256 kernels, stride 1).
(4) The down-convolved picture then enters the squeeze-and-excitation residual block SE-Residual-Block, which is composed of a standard convolution block (Conv-Block), global average pooling (Global Pooling), fully connected layers (FC), a Sigmoid function, and an instance normalization layer (Inst-Norm); its specific structure is shown in FIG. 3, and its workflow is shown in FIG. 4.
The output of the standard convolution is globally average-pooled; the squeeze part averages the information at all spatial positions into a single value per channel, because the final scale is applied to the entire channel and therefore requires channel-wide information. The calculation formula for the squeeze is:

Z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i,j)    (1)

wherein the subscript c is the channel index, u_c denotes the c-th two-dimensional feature map, W is the width of the image, and H is the height of the image.
Next, in the excitation part, the result Z obtained from the squeeze process is multiplied by the fully connected layer W_1 (r is taken as 16 in the present embodiment), passed through the ReLU activation function with the dimension unchanged, multiplied by W_2 of dimension C/r, and finally output through the Sigmoid function.
The excitation calculation formula is:

S = \mathrm{Sigmoid}(W_2 \cdot \mathrm{ReLU}(W_1 Z))    (2)

where C is the number of channels, r is the scaling coefficient, Z is the output of the squeeze part, and W_1, W_2 have dimension C/r.
After obtaining S, the input U and S are substituted into:

\tilde{x}_c = F_{scale}(u_c, S_c) = S_c \cdot u_c    (3)

The \tilde{x}_c values regulate the features and enhance the facial features to the greatest extent, thereby achieving the effect of restoring facial textures.
(5) After passing through SE-Residual-Block, the image undergoes two up-convolutions (3×3 kernels, 256 kernels, stride 1/2; then 3×3 kernels, 64 kernels, stride 1). Finally, the image is output through a standard convolution with 7×7 kernels (3 kernels, stride 1).
(6) The grayscale map of the real image, the ExpressionGAN generator output image, the cartoon images, and the edge-smoothed cartoon images are input into the ExpressionGAN discriminator; the cartoon face detection module screens out the images containing cartoon faces, and the screened images are fed into the discriminator, which is a pretrained VGG-19 network.
In the embodiment of the invention, ExpressionGAN modifies the AnimeGAN loss function. AnimeGAN converts the original cartoon images into grayscale images using a grayscale matrix, which removes color interference while preserving image texture. AnimeGAN solves the color-patch problem but cannot improve the dark tone of the generated pictures, so ExpressionGAN modifies its color reconstruction loss.
(8) The 5402 real background images are input into the SceneryGAN generator network; the image passes through three flat convolutions, then two down-convolutions, then eight identical residual blocks, then two up-convolutions, and the generated image is finally output through a flat convolution. As shown in FIG. 5, SceneryGAN replaces the original standard convolution block (Conv) with distribution-shifting convolution (DSConv), using the same number of convolution kernels and the same kernel sizes as CartoonGAN.
(9) The images generated by the SceneryGAN generator network, the 4500 cartoon images, and the 4500 edge-smoothed cartoon images are input into the SceneryGAN discriminator.
(11) The image requiring local style migration is input into the trained Deeplabv3+ network to generate its portrait mask map and background mask map.
Convolving the generated portrait and background mask maps with a 5×5 convolution blurs their edges, making the edges more natural after fusion.
(12) The 5890 real person images, the portrait mask map, and the portrait global cartoon style migration map are input into the image fusion algorithm; the portrait and background parts of the portrait mask map are set to (0, 1) respectively, and the images are fused to obtain the portrait local migration map. The 5402 real background images, the background mask map, and the background global migration map are input into the image fusion algorithm; the portrait and background parts of the mask map are inverted, and the background local migration map is finally obtained by fusion.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: although the present invention has been described in detail with reference to the above embodiments, it should be understood by those skilled in the art that: modifications and equivalents may be made to the embodiments of the invention without departing from the spirit and scope of the invention, which is to be covered by the claims.

Claims (10)

1. A GAN-based local realistic cartoon style migration system, characterized by comprising:
an expression generative adversarial network, a background generative adversarial network, a Deeplabv3+ network, and an image fusion module;
wherein the expression generative adversarial network is used to generate a portrait global migration map from the real person image;
the background generative adversarial network is used to generate a background global migration map from the real background image;
the Deeplabv3+ network is used to generate a portrait mask map and a background mask map from the image requiring local style migration;
and the image fusion module is used to fuse the real person image, the portrait mask map, and the portrait global migration map to obtain the portrait local migration map, and to fuse the real background image, the background mask map, and the background global migration map to obtain the background local migration map.
2. The GAN-based local realistic cartoon style migration system according to claim 1, wherein the expression generative adversarial network introduces a squeeze-and-excitation residual block and a cartoon face detection module on the basis of the animation generative adversarial network;
the cartoon face detection module is used to screen the input real person images and detect images containing faces;
and the squeeze-and-excitation residual block is used to enhance facial features.
3. The GAN-based local realistic cartoon style migration system according to claim 1, wherein the background generative adversarial network uses distribution-shifting convolution instead of standard convolution on the basis of the cartoon generative adversarial network.
4. A GAN-based local realistic cartoon style migration method, characterized by comprising the following steps:
acquiring an original dataset and dividing it into a training set and a test set; the original dataset comprises real person images, grayscale images of the real person images, real background images, cartoon images, and edge-smoothed cartoon images;
training the expression generative adversarial network and the background generative adversarial network with the training set;
inputting the test set images into the trained expression generative adversarial network to obtain the portrait global migration map, inputting the test set images into the trained background generative adversarial network to obtain the background global migration map, and inputting the real person image into the Deeplabv3+ network to generate the portrait mask map and the background mask map;
fusing the real person image, the portrait mask map, and the portrait global migration map to obtain the portrait local migration map; and fusing the real background image, the background mask map, and the background global migration map to obtain the background local migration map.
5. The GAN-based local realistic cartoon style migration method according to claim 4, wherein the grayscale image of the real person image is obtained by converting the real person image via a Gram matrix;
and the edge-smoothed cartoon image is obtained by applying Gaussian smoothing to the cartoon image.
6. The GAN-based local realistic cartoon style migration method according to claim 4, wherein training the expression generative adversarial network with the training set comprises:
inputting the real person images in the training set into the expression generative adversarial network, screening them with the cartoon face detection module, and feeding the images in which faces are detected into the expression GAN generator;
in the generator, sequentially performing a flat convolution (7×7 kernels, 64 kernels, stride 1), a down-convolution (3×3 kernels, 128 kernels, stride 2), and a down-convolution (3×3 kernels, 256 kernels, stride 1);
after the down-convolutions, performing the facial feature enhancement operation with the squeeze-and-excitation residual block;
after the facial features are enhanced, performing two up-convolutions (3×3 kernels, 256 kernels, stride 1/2; then 3×3 kernels, 64 kernels, stride 1), followed by a standard convolution (7×7 kernels, 3 kernels, stride 1) to obtain the output image;
inputting the grayscale image of the real person image, the expression GAN generator output image, the cartoon image, and the edge-smoothed cartoon image into the expression GAN discriminator; the discriminator is a pretrained VGG-19 network;
and training the expression GAN generator through iterative learning until a termination condition is reached.
7. The method according to claim 6, wherein the facial feature enhancement operation comprises:
performing a standard convolution after the image has been down-convolved;
globally average-pooling the output of the standard convolution and performing the squeeze computation:

Z_c = F_{sq}(u_c) = \frac{1}{W \times H} \sum_{i=1}^{W} \sum_{j=1}^{H} u_c(i,j)

wherein F_{sq}(\cdot) denotes the squeeze operation, Z_c is the squeeze result, the subscript c is the channel index, u_c denotes the c-th two-dimensional feature map, W is the width of the image, and H is the height of the image;
then performing the excitation computation:

S_c = \mathrm{Sigmoid}(W_2 \cdot \mathrm{ReLU}(W_1 Z_c))

wherein S_c is the excitation result, Sigmoid is the Sigmoid function, ReLU is the ReLU activation function, W_1 is a fully connected layer parameter, W_2 has dimension C/r, C is the number of channels, and r is a scaling coefficient;
then computing:

\tilde{x}_c = F_{scale}(u_c, S_c) = S_c \cdot u_c

and enhancing and regulating the facial features through the \tilde{x}_c values.
8. The GAN-based local realistic cartoon style migration method according to claim 4, wherein training the background generative adversarial network with the training set comprises:
inputting the real background images in the training set into the background GAN generator network, performing two down-convolutions after three flat convolutions, then two up-convolutions after eight identical residual blocks, and finally obtaining the output image through a flat convolution;
inputting the background GAN generator output image, the cartoon image, and the edge-smoothed cartoon image into the background GAN discriminator, where the discriminator is a pretrained VGG-19 network;
and training the background GAN generator network through iterative learning until a termination condition is reached.
9. The GAN-based local realistic cartoon style migration method according to claim 4, further comprising:
convolving the portrait mask map and the background mask map generated by the Deeplabv3+ network with a 5×5 convolution to blur their edges.
10. The GAN-based local realistic cartoon style migration method according to claim 4, wherein
the portrait and the background of the portrait mask map are set to (0, 1) respectively and then fused to obtain the portrait local migration map;
and the portrait and the background of the background mask map are inverted and then fused to obtain the background local migration map.
CN202110608065.4A 2021-06-01 2021-06-01 Local realistic cartoon style migration system and method based on GAN Pending CN115496843A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110608065.4A CN115496843A (en) 2021-06-01 2021-06-01 Local realistic cartoon style migration system and method based on GAN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110608065.4A CN115496843A (en) 2021-06-01 2021-06-01 Local realistic cartoon style migration system and method based on GAN

Publications (1)

Publication Number Publication Date
CN115496843A (en) 2022-12-20

Family

ID=84465477

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110608065.4A CN115496843A (en) Pending 2021-06-01 2021-06-01 Local realistic cartoon style migration system and method based on GAN

Country Status (1)

Country Link
CN (1) CN115496843A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination