CN112365556A

CN112365556A - Image extension method based on perception loss and style loss

Info

Publication number: CN112365556A
Application number: CN202011244337.9A
Authority: CN
Inventors: 李孝杰; 任勇鹏; 吴锡; 任红萍
Original assignee: Chengdu University of Information Technology
Current assignee: Chengdu University of Information Technology
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2021-02-12
Anticipated expiration: 2040-11-10
Also published as: CN112365556B

Abstract

The invention relates to an image expansion method based on perception loss and lattice loss, which comprises the following steps: sending the preprocessed data set into a constructed image expansion network, wherein the image expansion network comprises a reconstruction path and a generation path, the reconstruction path is used for inputting an image of a region to be compensated, acquiring prior information of all components to be compensated of the image, and finally reconstructing an original image; the generation path is used for inputting the missing image, the priori distribution obtained by the reconstruction path is utilized to guide the generation process of the image, in the training process, the perception loss and the lattice loss constraint are emphatically introduced to generate the texture and style of the image, and the distortion fuzzy structure of the traditional method is improved. The perception loss and the style loss respectively acquire semantic information and an overall style of a known region, so that the network is facilitated to grasp the real texture style of the image.

Description

Image extension method based on perception loss and style loss

Technical Field

The invention relates to the field of image processing, in particular to an image expansion method for perception loss and style loss.

Background

In computer vision tasks, image completion based on deep learning is widely used, and particularly in recent years, the image completion task is a remarkable result. Image completion is actually a special task between image editing and image generation. There are several typical methods of processing image completion: the method is a simple texture patching method, only similar pixels are collected from the existing pixels to fill in the missing area, and because the sampling is directly carried out from the picture, the filling part is not naturally connected, and the effect is poor; secondly, a data-driven idea is adopted, and a data set is used for learning related data distribution, so that a logical structure can be generated, but the repair area is fuzzy; and the other generation model is based on deep learning, a neural network is used for extracting high-dimensional spatial features of the image, and finally a new structure is generated for the missing region.

Image expansion is a specific application of image completion, which infers an extension part from a segment of an image to complete the whole picture. The aim is to extend the known area of the image, i.e. to complement the picture beyond the image boundary. The main difficulties of the method are as follows: the unknown area is deduced by using less image adjacent information, and the extension part is required to achieve the aims of real visual effect and reasonable content structure.

Older approaches to image completion using deep learning have been Context coders (Context Encoders), which use a typical encoder-decoder network architecture. Context Encoders mainly fill up an image missing region through image information of adjacent pixels, and is an unsupervised learning method based on Context pixel prediction, wherein high-level features of an image are extracted through an encoder-decoder network structure, then a prediction graph corresponding to a region to be repaired of the image is obtained through decoding, the Context Encoders use simple reconstruction loss to restrict the mean square error of corresponding pixel points of the generated graph and an original graph, and the difference between the corresponding pixel points of the generated image and the original image is directly calculated, so that the reconstructed result lacks image high-frequency information, and the final output result has the problems of fuzzy distortion and even distortion deformation. However, the image complemented by the context encoder has the problems of structural distortion and texture blurring, because the context encoder depends too much on surrounding information, and thus the neighboring information cannot be fully utilized when the image is too much missing.

The scholars propose to adopt two network structures to carry out image expansion, one is a content generation network which is used for directly generating the missing content of the image; the other is a texture generation network, which takes the texture difference between the original image and the generated image into consideration, and uses a VGG network to extract texture feature maps of the original image and the generated image, so as to reduce the distance between the feature maps and improve the texture structure of the content generation network. This work accounts for texture differences based on the context encoder. The defects of the method are as follows: the memory requirement is large, only the internal block information of the picture is learned, the internal block information is excessively dependent on the information around the picture, and when the content of the picture is lost too much, the adjacent information cannot be fully utilized, so that the final output result is fuzzy and unnatural.

The existing image expansion algorithm uses a neural network to extract high-level feature information of an image, so that semantic information of the image is obtained, the semantic naturalness of a final output result is ensured, and certain limitation exists on clear texture. The image expansion task requires that the generated image not only achieves a reasonable target in semantics, but also ensures that the image texture is clear and natural.

More importantly, in the prior art, more researches on inward expansion of images are carried out, less researches on outward expansion of images are carried out, and the difficulty of outward expansion of images is far greater than that of inward expansion of images because the prior information which can be utilized by the inward expansion is more and the uncertainty of the outward expansion is larger.

Therefore, how to improve the performance of image expansion, especially for the situation of external expansion, to make the expanded image have natural texture and clear picture, is a problem that needs to be solved in the field of image expansion.

Disclosure of Invention

Aiming at the defects of the prior art, the image expansion method based on the perception loss and the style loss comprises the following steps:

step 1: preparing an image data set to be expanded, and dividing the image data set into a training data set and a testing data set according to an agreed proportion;

step 2: preprocessing the image data set to be expanded, and segmenting an original image in the data set into a middle image Im and images Ic at the rest periphery;

and step 3: training the constructed image extension network by adopting the preprocessed training data set, wherein the image extension network comprises a generator, two parallel discriminators and a pre-trained VGG19 network, the generator comprises an encoder, two parallel residual modules and a decoder which are operated in sequence, wherein,

the encoder is used for generating a hidden layer feature representation of the image according to the input missing image;

the parallel residual modules comprise a reconstructed residual module and a generated residual module, and are used for realizing the function of deducing a network, and the generated mean value and variance are used for a decoder to sample the characteristics of the hidden layer space;

the decoder is used for generating missing image content according to the input hidden layer feature map;

the parallel discriminators comprise a reconstruction discriminator and a generation discriminator and are used for scoring the reconstructed image and the original image and scoring the generated image and the original image to carry out countermeasure training;

the pre-trained VGG19 network is used for extracting feature maps of a generated image and an original image generated by the generator and calculating the perception loss and the style loss of the two feature maps so as to restrict the generator, so that the network can acquire high-level global information and bottom-level detail information of the image, grasp the overall style trend of the image and improve the expansion performance; in particular, the method comprises the following steps of,

the perception loss means that a range constraint is carried out on the generated image and the original image through the extracted feature maps of the generated image and the original image, the difference between the generated image and the original image is reduced, and the effect of improving the quality of an expansion area is finally achieved;

style loss, namely calculating the gram matrixes of the two characteristic images through the extracted characteristic images of the generated image and the original image, and constraining the gram matrixes of the two characteristic images so as to enable the general styles of the two characteristic images to be close to each other as much as possible, and finally achieving the effect of improving the quality of the generated image;

and 4, step 4: and after the training is finished, testing the trained extension model by using the test data set, removing a reconstruction residual module and a reconstruction discriminator in the extension model during testing, and only inputting the intermediate image Im into the extension model to finally obtain the predicted extension image output by the generator.

According to a preferred embodiment, the method for training an extended network specifically includes:

step 31: splicing the middle image Im and the peripheral image Ic according to channel dimensions, and then sending the spliced middle image Im and the peripheral image Ic into an encoder, wherein the encoder performs feature extraction on input to obtain a first feature diagram, and splitting the first feature diagram according to the channel dimensions to obtain a second feature diagram and a third feature diagram, wherein the second feature diagram and the third feature diagram are respectively feature diagrams of the middle image Im and the peripheral image Ic; the purpose of stitching is that the encoder can process the middle image Im and the peripheral image Ic simultaneously;

step 32: inputting the second feature map and the third feature map into a residual error generation module and a residual error reconstruction module respectively, and calculating the mean value and the variance of the second feature map and the third feature map respectively so as to obtain hidden layer features from normal distribution by sampling, and enabling a decoder to sample the hidden layer features back to the original size of the image;

step 33: splicing the hidden layer features according to channel dimensions, inputting the spliced hidden layer features into a decoder, splitting an image obtained by the decoder into a generated image Igen and a reconstructed image Irec according to the channel dimensions, respectively sending the generated image and the reconstructed image into corresponding discriminators, and simultaneously inputting an original image into two discriminators, wherein the generated discriminator is used for scoring the generated image and the original image, the reconstructed discriminator is used for scoring the reconstructed image and the original image, and countermeasure training is performed according to discrimination results output by the two discriminators;

step 34: inputting the generated image and the original image corresponding to the generated image into a pre-trained VGG19 network, extracting a feature map corresponding to the generated image, and restricting the distance between the feature maps of the generated image and the original image to achieve the purpose of improving the quality of the generated image, specifically, guiding the training of a generator by calculating perception loss and style loss;

step 35: taking 1 batch data as one iteration training to generate a confrontation network, one iteration updating parameters of a discriminator and a generator, and one iteration finishing alternate training of the discriminator and the generator;

step 36: and judging whether the total times of training iteration is reached, if so, finishing the training, otherwise, returning to the step 31.

According to a preferred embodiment, the training method in step 3 further includes calculating a difference of distribution determined by a mean value and a variance of the feature maps corresponding to the middle image Im and the peripheral image Ic by using KL loss, and using semantic information of the peripheral image Ic to make the distribution of the middle image Im and the distribution of the peripheral image Ic close to each other, so as to improve network expansion performance.

Compared with the prior art, the invention has the beneficial effects that:

1. the method has the advantages that the semantic features of the image are extracted by the aid of the feature extractor for perception loss, low-level pixel information and high-level abstract features of the image are grasped, texture styles of the image are further constrained, a real and reasonable structure is finally generated, consistency of semantic information of a known area and an expanded area is guaranteed, boundary blurring is visually eliminated, a natural and attractive visual effect is achieved, and the defect that high-frequency details are lacked due to the fact that reconstruction loss is only adopted in the prior art is overcome.

2. The style loss is used, the integral style of the image is obtained through the feature extractor and the gram matrix, the consistency of the texture styles of the known region and the extended region is ensured, the style of the extended region is real and natural, and the defect that in the prior art, only the information of the internal block of the image is learned, and the output result is fuzzy and unnatural is overcome.

3. Due to the coordination constraint of the perception loss and the style loss with other losses, the method is favorable for improving the expansion imagination of the whole network and improving the network expansion performance, so that the expansion result is further improved qualitatively.

Drawings

FIG. 1 is a flow chart of an image expansion method of the present invention;

FIG. 2 is a network architecture diagram of the image expansion model of the present invention; and

FIG. 3 is a graph comparing the results of the experiment according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the accompanying drawings in conjunction with the following detailed description. It should be understood that the description is intended to be exemplary only, and is not intended to limit the scope of the present invention. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present invention.

Aiming at the defects in the prior art, the invention provides an image expansion method based on perception loss and style loss, and fig. 1 is a flow chart of the image expansion method of the invention, and the expansion method of the invention is firstly elaborated in detail by combining fig. 1.

The invention provides an image expansion method based on perception loss and lattice loss, which specifically comprises the following steps:

the pre-processing comprises processing the raw images in the dataset each to 128 x 128 and normalizing all their pixel values to 0-1, processing the images in the dataset to a middle image Im leaving only the content of the middle region 64 x 64 and a surrounding image Ic leaving the surrounding content.

the pre-trained VGG19 network is used for extracting feature maps of generated images and original images generated by the generator, and introducing the perception loss and the lattice loss into an objective function of the generator, so that the network can acquire high-level global information and bottom-level detail information of the images, the overall style trend of the images is grasped, and the expansion performance is improved.

Fig. 2 is a schematic structural diagram of an image expansion model, and as shown in fig. 2, the image expansion model of the present invention includes two parallel paths, one of which is a reconstruction path, inputs an image of a region to be compensated, obtains prior information of all components to be compensated, and finally reconstructs an original image; the next one is a generation path, missing images are input, and the generation process of the images is guided by using the prior distribution obtained by the reconstruction path. The hidden layer information is extracted through the distribution of the residual module, the hidden layer distribution of the generated path is close to the upper reconstruction path, the network structure balances the variance of the original reconstruction data and the condition distribution, the prior information of the missing part is adopted to guide the generation process of the completion part, therefore, the image closer to the real image can be generated, and the diversity of the input result provides sufficient conditions for selecting the high-quality completion result.

The training method of the extended network specifically comprises the following steps:

specifically, the KL loss is used for calculating the difference of the distribution determined by the mean value and the variance of the corresponding feature images of the middle image Im and the peripheral image Ic, and the semantic information of the peripheral image Ic is used for enabling the distribution of the middle image Im to be close to the distribution of the peripheral image Ic, so that the network expansion performance is improved.

Step 33: the hidden layer features are spliced according to channel dimensions and then input into a decoder, an image obtained through the decoder is split into a generated image Igen and a reconstructed image Irec according to the channel dimensions, the generated image and the reconstructed image are respectively sent into corresponding discriminators, meanwhile, an original image is input into two discriminators, the generated discriminator is used for scoring the generated image and the original image, the reconstructed discriminator is used for scoring the reconstructed image and the original image, and countermeasure training is carried out according to discrimination results output by the two discriminators. The discriminator of the invention adopts a global discriminator to score the whole image, which is helpful for grasping the whole generation quality of the image.

Step 34: and inputting the generated image and the original image corresponding to the generated image into a pre-trained VGG19 network, extracting a feature map corresponding to the generated image, and restricting the distance between the feature maps of the generated image and the original image to achieve the purpose of improving the quality of the generated image.

According to the features extracted from the generated image and the original image in the pre-trained VGG19 network, the distance of the feature map is directly constrained by using the perception loss, and the whole content and style of the generated image are aligned to the original image by using the Gram matrix constraint of the style loss on the feature map, so that the content of the generated image is more natural, harmonious, clear and real.

The general reconstruction loss has obvious defects because the reconstruction result is directly compared with the original image pixel, the reconstruction result really has higher signal-to-noise ratio, but contains less high-frequency information, so that the situation of structural distortion and even picture blurring can occur.

Further, based on the problem of image extension, the invention provides that the training of the generator is guided by calculating the perception loss and the style loss, so that the generated image is more real and natural and is closer to the original image in structure.

And (3) perception loss, performing range constraint on the generated image and the original image through the extracted feature maps of the generated image and the original image, reducing the difference between the generated image and the original image, and finally achieving the effect of improving the quality of an expansion area, wherein the calculation formula is as follows:

wherein, I_gtAnd I_genRespectively an original picture and a generated picture phi_j() represents the j-th level feature graph extracted by the VGG19 network;

the perception loss is to compare the characteristics obtained by the convolution of the original picture in the VGG network with the characteristics obtained by the convolution of the generated picture, so that the high-level characteristic information (such as the content and the structure of the image) of the original picture and the generated picture are as close as possible, which means that the network perceives the image. When the method is used for training the GAN network, the use of the perception loss can enable the feature map of the generated image and the feature map of the original image to be well close, so that the image generation process is assisted, and the final generated image quality is improved.

And style loss, namely calculating the gram matrixes of the feature images of the generated image and the original image through the extracted feature images of the generated image and the original image, and constraining the gram matrixes of the generated image and the original image to further enable the general styles of the generated image and the original image to be as close as possible, and finally achieving the effect of improving the quality of the generated image, wherein the calculation formula is as follows:

wherein, I_gtAnd I_genRespectively an original picture and a generated picture phi_i(. cndot.) represents the i-th layer feature graph extracted by the VGG network, and G (-) is a gram matrix corresponding to the feature graph.

The gram matrix is used for calculating pairwise inner products of any k vectors in the n-dimensional Euclidean space. The gram matrix can be viewed as a covariance matrix without mean reduction between different profiles. In the convolutional network, a shallow network extracts low-level features of an image structure, a deep network extracts abstract high-level features of the image, and the low-level features and the high-level features are combined to be more like the overall style of the image to determine the real attribute of an image.

By calculating the gram matrix of the feature maps, the relationship between each feature vector can be estimated, and the correlation between every two feature vectors can be grasped, so that the general style of the feature maps can be embodied. In actual network training, style loss is added into total loss, and the style of the generated image can be continuously close to the style of the original image by taking the minimum Graham matrix of the characteristics of the optimized generated image and the original image as a target.

When the problem of image external expansion is processed, the difficulty of external expansion is far greater than that of image internal expansion because the prior information which can be utilized by the method is far less than that of the problem of image internal expansion. Therefore, when the generator is trained, besides the calculation of the perception loss and the style loss, the reconstruction loss, the KL loss and the countermeasure loss are added, the generator is further constrained by the prior information of the reconstructed image, and the calculation formula is as follows:

where the superscripts r and g represent the loss of the reconstruction path and the generation path, respectively, L_KLFor constraining the hidden layer distribution, L_appReconstruction loss, L, used to constrain the image_adIt is the competing loss of the network. Alpha is alpha_KL、α_app、α_adIs the weight coefficient of each term.

Step 35: taking 1 batch data as one iteration training to generate a confrontation network, one iteration updating parameters of the discriminator and the generator, and one iteration finishing alternate training of the discriminator and the generator.

Step 36: judging whether the total number of training iterations is reached, if so, finishing the training, otherwise, returning to the step 31;

and 4, step 4: and testing the trained extension model by using the test data set, removing a reconstruction residual module and a reconstruction discriminator in the extension model during testing, and only inputting the intermediate image Im into the extension model to finally obtain the prediction extension image output by the generator.

In order to more intuitively and effectively explain the effect of the image expansion method provided by the invention, the following verification is carried out in both objective and subjective aspects.

The invention adopts two indexes of IS (initiation score) and FID (Freehet initiation distance) to carry out quantitative evaluation, which are two indexes for evaluating the quality of a generated model commonly used and can be used for representing the diversity and the definition of a generated image. Other indicators include L1 loss, RMSE and SSIM.

The image datasets used for this experiment were Places2 and Paris street view, all pictures were uniform in size 128 x 128, and the test inputs were: the image was masked out all around using a mask, leaving a 64 × 64 middle area. The comparison method of the experiment is IOGnet and PICnet, and because the PICnet has a plurality of output results, the picture with the highest mark of the discriminator is taken for comparison.

Fig. 3 is a comparison graph of experimental results, where (a) is an original image, (b) is an image input to an extension network, (c) is an experimental result of a conventional extension method, (d) is an experimental result of the present invention which proposes to introduce a style loss, and (e) is an experimental result of the present invention which proposes to introduce a style loss and a perception loss.

It can be seen from fig. 3 that there are some distortions and even some residual shadows around the existing extension method, but the distortion condition can be obviously improved by the method of the present invention, the residual shadows are basically eliminated, and the overall clarity and naturalness of the extension area are improved, so that the overall effect is closer to that of the Ground route.

Table 1 shows the comparison between the objective evaluation indexes of the conventional PIC method and the method proposed by the present invention, and Table 1 shows the quantitative indexes of 20000 test set pictures of Places 2.

TABLE 1 Objective evaluation index comparison of Experimental results

As can be seen from Table 1, the IS index IS increased by 0.09, and the FID IS decreased by 2.42, which illustrates that the method of the present invention IS favorable for improving the diversity and the definition of the generated image, improving the texture naturalness of the extended area, and improving the image blur. The L1 Loss index is reduced by 0.65, which shows that the overall pixel difference of the generated image is closer to the original image; the improvement of SSIM and PSNR indicates a certain improvement in the quality of the generated image.

The invention provides the texture and style of the image generated by the constraint of the perception loss and the style loss, and improves the distortion fuzzy structure of the traditional method. The perception loss and the style loss respectively acquire semantic information and an overall style of a known region, so that the network is facilitated to grasp the real texture style of the image. The final quality of image extension is improved by using perception loss and style loss, training and testing are carried out on a common image extension data set, and compared with the final experimental test result, the trained model can generate an extended image which is closer to the original image pattern.

It should be noted that the above-mentioned embodiments are exemplary, and that those skilled in the art, having benefit of the present disclosure, may devise various arrangements that are within the scope of the present disclosure and that fall within the scope of the invention. It should be understood by those skilled in the art that the present specification and figures are illustrative only and are not limiting upon the claims. The scope of the invention is defined by the claims and their equivalents.

Claims

1. An image expansion method based on perceptual loss and style loss, the method comprising:

2. The image expansion method according to claim 1, wherein the training method of the expansion network specifically comprises:

3. The image expansion method according to claim 2, wherein the training method in step 3 further includes calculating a difference between distributions determined by means of the mean and variance of the feature maps corresponding to the middle image Im and the peripheral images Ic by using KL loss, and using semantic information of the peripheral images Ic to make the distribution of the middle image Im and the distribution of the peripheral images Ic close to each other, thereby improving network expansion performance.