CN111242841B - Image background style migration method based on semantic segmentation and deep learning


Info

Publication number
CN111242841B
CN111242841B (application CN202010043890.XA)
Authority
CN
China
Prior art keywords
picture
style
content
network
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010043890.XA
Other languages
Chinese (zh)
Other versions
CN111242841A (en)
Inventor
颜成钢
郑鑫磊
孙垚棋
张继勇
张勇东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202010043890.XA
Publication of CN111242841A
Application granted
Publication of CN111242841B
Legal status: Active

Classifications

    • G06T3/04 Context-preserving transformations, e.g. by using an importance map
    • G06N3/045 Combinations of networks
    • G06N3/047 Probabilistic or stochastic networks
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G06T7/11 Region-based segmentation
    • G06T7/194 Segmentation involving foreground-background segmentation
    • G06T2207/10004 Still image; Photographic image
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a picture background style migration method based on semantic segmentation and deep learning. First, a content picture and a style picture are selected and preprocessed; a picture relatively close to the final result is then computed directly from the content picture and the style picture through a ResNet network; style constraints and content constraints are then obtained through a VGG-19 network, gradient descent is performed according to a loss function, and the background style migration result is obtained through multiple iterations; finally, the migration result is put back into the picture. The invention offers a speed improvement of several hundred times, strong extensibility, and style migration of local areas with the main content of the image retained, highlighting the subject and enhancing the artistic expressiveness of the image, together with strong code readability and portability.

Description

Image background style migration method based on semantic segmentation and deep learning
Technical Field
The invention relates to the field of deep learning, in particular to a picture background style migration method based on semantic segmentation and deep learning.
Background
Neural-network-based image style migration was proposed in 2015 by Gatys et al. The article by Gatys makes a single innovation, but one of great significance for style migration and image texture generation: it proposes modeling texture with deep learning. Until then, researchers had hoped to find a way to describe texture with local statistical models, but manual modeling was too complex and generalized extremely poorly. Inspired by papers in the related field of object recognition, Gatys observed that a VGG-19 network can be regarded as a collection of local feature recognizers and verified experimentally that these feature recognizers also perform very well for style migration.
However, the Gatys method has obvious problems and disadvantages. The most important is that the migration speed of the original method is very slow: even on a top GPU of 2019, almost twenty minutes is needed, and style migration of a 512x512 picture takes more than twelve hours on an ordinary CPU.
The other disadvantage is that the Gatys transfer method can only transfer the whole picture; it cannot apply style transfer to a target with a given characteristic while keeping the other objects unchanged.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a picture background style migration method based on semantic segmentation and deep learning. The aim of the invention is to transfer only the image background during style migration, so as to highlight the subject and enhance the artistic expressiveness of the image.
In order to achieve this technical purpose, the technical scheme of the invention is a picture background style migration method based on semantic segmentation and deep learning, comprising the following steps:
Step (1), selecting a content picture and a style picture and preprocessing the pictures;
Select a picture with a clearly defined subject as the content picture, and perform semantic segmentation on the content picture with a U-Net network whose backbone network is ResNet-18. Set the number of output classes of the network to 2 and define the softmax function required for classification:

$$p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))}$$

where $a_k(x)$ denotes the score of feature channel $k$ at pixel $x$, $K$ is the number of classes, and $p_k(x)$ is the class-$k$ classification result for the pixel. The softmax function classifies and outputs the result once the neural network computation is complete: before the softmax, the data is represented as a one-dimensional activation volume; the output after the softmax is mapped to the corresponding class, forming distinct labels for subject and background. After the subject is segmented, it is subtracted from the full image to obtain a content image with the subject removed and only the background remaining;
Select a style picture whose background style is to be obtained; generally, a picture with an obvious, strong style is chosen as the style picture to ensure a good result;
Finally, crop the content picture and the style picture to the same size to obtain the preprocessed content picture and style picture.
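For illustration only, a minimal PyTorch sketch of this segmentation step is given below; the trained `segmentation_model` (a 2-class U-Net with ResNet-18 backbone) and the tensor shapes are assumptions of the sketch, not details fixed by the method.

```python
import torch
import torch.nn.functional as F

def split_subject_background(image, segmentation_model):
    """Separate an image into subject and background with a 2-class U-Net.

    image: float tensor of shape (3, H, W) with values in [0, 1].
    segmentation_model: assumed pre-trained U-Net (ResNet-18 backbone)
    returning per-pixel scores a_k(x) for k in {background, subject}.
    """
    with torch.no_grad():
        logits = segmentation_model(image.unsqueeze(0))  # (1, 2, H, W)
        probs = F.softmax(logits, dim=1)                 # softmax formula above
        subject_mask = (probs[:, 1:2] > 0.5).float()     # (1, 1, H, W)
    subject_mask = subject_mask[0]                       # (1, H, W), broadcasts over RGB
    # subtract the subject from the full image: only the background remains
    background = image * (1.0 - subject_mask)
    subject = image * subject_mask
    return subject, background, subject_mask
```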
Step (2), input the preprocessed content picture and style picture respectively into a ResNet-50 network, and take the outputs of the last pooling layer, $\vec{F}_p$ for the content picture and $\vec{F}_a$ for the style picture. Perform a weighted addition with the formula

$$\vec{F}_0 = w_c \vec{F}_p + w_s \vec{F}_a$$

where $w_c$ and $w_s$ are the weights of the content and style features. Input $\vec{F}_0$ into ResNet-50 for up-sampling to obtain a picture $\vec{x}_0$ of the same size as the input picture.
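A sketch of how this initial estimate might be computed follows; the weighting coefficients `w_c`/`w_s` and the up-sampling `decoder` are assumptions (the text specifies only a weighted addition of the last pooling outputs followed by up-sampling), not values fixed by the method.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# ResNet-50 truncated after its last pooling layer (global average pool)
resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
encoder = nn.Sequential(*list(resnet.children())[:-1])  # output: (1, 2048, 1, 1)

def initial_estimate(content, style, decoder, w_c=0.5, w_s=0.5):
    """Weighted addition of pooled features, then up-sampling to image size.

    content, style: (1, 3, H, W) tensors; decoder: assumed up-sampling head
    mapping the fused (1, 2048, 1, 1) feature back to a (1, 3, H, W) picture.
    """
    with torch.no_grad():
        f_c = encoder(content)         # content feature after last pooling
        f_s = encoder(style)           # style feature after last pooling
        fused = w_c * f_c + w_s * f_s  # the weighted addition of step (2)
        return decoder(fused)          # picture of the same size as the input
```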
Step (3), input the style picture into the VGG-19 network and denote it $\vec{a}$. To obtain a representation of the style of the input image, a feature space designed to capture texture information is employed. This feature space can be built on top of the filter responses of any layer of the network; it consists of the correlations between the different filter responses, where the expectation is taken over the spatial extent of the feature maps. These feature correlations are given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

where $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$, $N_l$ is the number of feature maps in layer $l$, and $M_l$ is their spatial size.

Through this inner product operation, a style representation of the input image can be obtained. The information captured by these style feature spaces built on different layers of the network can be visualised by constructing an image that matches the style representation of a given input image. Performing gradient descent on $\vec{x}_0$ realises the style migration, where the loss function is defined as the mean squared distance between the Gram matrix entries of the style picture and those of the image to be generated.

Let $A^l$ and $G^l$ denote the respective style representations of $\vec{a}$ and $\vec{x}_0$ in layer $l$. The contribution of convolution layer $l$ to the total loss is:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

and the total style loss can be expressed as

$$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l} w_l E_l$$

where $w_l$ is the weight of each layer's contribution to the total loss. With respect to the activations in layer $l$, the derivative of $E_l$ can be computed analytically:

$$\frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \dfrac{1}{N_l^2 M_l^2} \left( (F^l)^\top \left( G^l - A^l \right) \right)_{ji} & \text{if } F^l_{ij} > 0 \\[4pt] 0 & \text{if } F^l_{ij} < 0 \end{cases}$$

The gradient of $E_l$ with respect to the value of each pixel of $\vec{x}$ can then be readily computed using standard error back-propagation. The outputs of the first, third, fifth, ninth, and thirteenth layers of the VGG-19 network are taken as the style constraints.
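The Gram-matrix style loss can be written compactly in PyTorch. The sketch below assumes lists of feature maps already extracted (e.g. via forward hooks) from the chosen VGG-19 layers; the equal layer weights are only an example.

```python
import torch

def gram_matrix(feat):
    """G^l_ij = sum_k F^l_ik F^l_jk over vectorised feature maps."""
    n, c, h, w = feat.shape
    f = feat.reshape(n, c, h * w)
    return torch.bmm(f, f.transpose(1, 2))  # (n, c, c)

def style_loss(feats_x, feats_a, layer_weights):
    """Total style loss sum_l w_l * E_l, with
    E_l = 1 / (4 N_l^2 M_l^2) * sum_ij (G^l_ij - A^l_ij)^2."""
    loss = feats_x[0].new_zeros(())
    for fx, fa, w_l in zip(feats_x, feats_a, layer_weights):
        n_l = fx.shape[1]                # number of feature maps N_l
        m_l = fx.shape[2] * fx.shape[3]  # spatial size M_l
        g, a = gram_matrix(fx), gram_matrix(fa)
        loss = loss + w_l * ((g - a) ** 2).sum() / (4.0 * n_l**2 * m_l**2)
    return loss
```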
Step (4), input the content picture into the VGG-19 network and denote it $\vec{p}$. To visualise the image information encoded in the different convolution layers, a gradient descent operation is performed on the picture $\vec{x}_0$ output in step (2) to find another image that matches the feature responses of the original image. Let $P^l$ denote the feature representation of $\vec{p}$ in layer $l$, and let $F^l$ denote the feature representation of $\vec{x}_0$ in layer $l$, i.e. the activations of all convolution kernels in that layer. The squared-error loss between the two feature representations is then defined as

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$

The derivative of this loss with respect to the activation volume in layer $l$ equals

$$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} \left( F^l - P^l \right)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$$

from which the gradient with respect to the image $\vec{x}$ can be computed using standard error back-propagation. The initial image can thus be modified until it produces, in a certain layer of the convolutional neural network, the same response as $\vec{p}$; here, the initial image is not a completely random noise map but the output $\vec{x}_0$ of step (2). The output of the tenth layer of the VGG-19 network is taken as the content constraint;
Step (5), to transfer the style of the style picture $\vec{a}$ onto the content picture $\vec{p}$, take the output picture $\vec{x}_0$ obtained in step (2) as the image to be synthesised; this image must simultaneously match the style representation of $\vec{a}$ and the content representation of $\vec{p}$. All three are input into the VGG-19 network. The loss function adds, each multiplied by a coefficient, the style loss $\mathcal{L}_{style}$ between the style constraint obtained in step (3) and the output picture $\vec{x}_0$ obtained in step (2), and the content loss $\mathcal{L}_{content}$ between the content constraint obtained in step (4) and the output picture $\vec{x}_0$ obtained in step (2). A gradient descent algorithm is used to optimise the loss function; the overall style migration jointly minimises the Euclidean distances between the feature representations of $\vec{p}$, $\vec{a}$ and $\vec{x}$. The overall loss function is as follows:

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x})$$

where $\alpha$ and $\beta$ are the weighting factors for content and style reconstruction, respectively.
Step (6), repeat step (5) for 10 iterations, so that the output picture $\vec{x}_0$ obtained in step (2) becomes, after iteration, an output $\vec{x}$ that is as similar as possible to the content picture in content and to the style picture in style.
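Steps (5) and (6) amount to a short optimisation loop. The sketch below reuses the `style_loss` and `content_loss` sketches above and assumes a hypothetical `vgg_features` helper that returns the style-layer and content-layer activations of an image; the optimiser choice and the values of `alpha`, `beta` and the learning rate are illustrative assumptions only.

```python
import torch

def background_style_transfer(x0, style_feats, content_feat, vgg_features,
                              alpha=1.0, beta=1e3, layer_weights=(0.2,) * 5,
                              steps=10, lr=0.05):
    """Optimise x (initialised with the step-(2) output x0) so that
    alpha * L_content + beta * L_style decreases; 10 iterations as in step (6)."""
    x = x0.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        feats_x, content_x = vgg_features(x)  # hypothetical feature extractor
        loss = (alpha * content_loss(content_x, content_feat)
                + beta * style_loss(feats_x, style_feats, layer_weights))
        loss.backward()
        optimizer.step()
    return x.detach()
```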
Step (7), finally, put the segmented subject back onto the style-migrated background.
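Putting the subject back is a masked composite; a short sketch using the `subject_mask` from the segmentation sketch of step (1):

```python
def composite(image, stylised_background, subject_mask):
    """Step (7): keep the original subject pixels, take the migrated
    background everywhere else (mask shape broadcasts over RGB)."""
    return image * subject_mask + stylised_background * (1.0 - subject_mask)
```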
The invention has the following beneficial effects:
(1) Compared with the traditional style migration algorithm, this algorithm does not perform gradient descent from a white-noise picture but computes a picture relatively close to the result directly from the content picture and the style picture, and it uses a faster and simpler ResNet network, so the test speed is increased by hundreds of times without an obvious reduction in quality;
(2) Extensibility is strong: deeper image style and content information can be extracted by increasing the depth of the ResNet, with only a slight increase in running time;
(3) Style migration is applied to a local area while the main content of the image is retained, highlighting the subject and enhancing the artistic expressiveness of the image;
(4) Development uses the fastai high-level framework built on PyTorch, so code readability and portability are strong.
Drawings
FIG. 1 is a system flow diagram of the present invention;
FIG. 2 is a content portrait image employed by embodiments of the present invention;
FIG. 3 is the result of the portrait style migration method of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the picture background style migration method of the present invention includes the following steps:
Step (1), selecting a content picture and a style picture and preprocessing the pictures;
Select a picture with a clearly defined subject, such as a portrait or a still-life with a definite theme, as the content picture, and perform semantic segmentation on the content picture with a U-Net network whose backbone network is ResNet-18; because the segmentation task is simple, a shallow network suffices. Set the number of output classes of the network to 2 and define the softmax function required for classification:
$$p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))}$$

where $a_k(x)$ denotes the score of feature channel $k$ at pixel $x$, $K$ is the number of classes, and $p_k(x)$ is the class-$k$ classification result for the pixel. The softmax function classifies and outputs the result once the neural network computation is complete: before the softmax, the data is represented as a one-dimensional activation volume; the output after the softmax is mapped to the corresponding class, forming distinct labels for subject and background. After the subject is segmented, it is subtracted from the full image to obtain a content image with the subject removed and only the background remaining;
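Since the beneficial effects state that development uses the fastai framework on PyTorch, the segmentation network of this embodiment could be set up roughly as follows; the dataset layout, file naming and training schedule are assumptions of the sketch, not part of the patent text.

```python
from pathlib import Path
from fastai.vision.all import (SegmentationDataLoaders, unet_learner,
                               resnet18, get_image_files)

path = Path("data")  # assumed dataset root with images/ and labels/ folders
dls = SegmentationDataLoaders.from_label_func(
    path, bs=8,
    fnames=get_image_files(path / "images"),
    label_func=lambda f: path / "labels" / f"{f.stem}_mask.png",
    codes=["background", "subject"])  # 2 output classes, as in the method

learn = unet_learner(dls, resnet18)   # U-Net with a ResNet-18 backbone
learn.fine_tune(3)                    # short schedule; illustrative only
```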
Select a style picture whose background style is to be obtained; for a better result, a picture with an obvious, strong style, such as an Impressionist or abstract oil painting, is generally chosen as the style picture;
Finally, crop the content picture and the style picture to the same size to obtain the preprocessed content picture and style picture.
Step (2), input the preprocessed content picture and style picture respectively into a ResNet-50 network, and take the outputs of the last pooling layer, $\vec{F}_p$ for the content picture and $\vec{F}_a$ for the style picture. Perform a weighted addition with the formula

$$\vec{F}_0 = w_c \vec{F}_p + w_s \vec{F}_a$$

where $w_c$ and $w_s$ are the weights of the content and style features. Input $\vec{F}_0$ into ResNet-50 for up-sampling to obtain a picture $\vec{x}_0$ of the same size as the input picture.
Step (3), input the style picture into the VGG-19 network and denote it $\vec{a}$. To obtain a representation of the style of the input image, a feature space designed to capture texture information is employed. This feature space can be built on top of the filter responses of any layer of the network; it consists of the correlations between the different filter responses, where the expectation is taken over the spatial extent of the feature maps. These feature correlations are given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

where $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$, $N_l$ is the number of feature maps in layer $l$, and $M_l$ is their spatial size.

Through this inner product operation, a style representation of the input image can be obtained; it captures the texture information of the input image rather than its global arrangement, because the inner product operation discards the spatial correlations of the original image. The information captured by these style feature spaces built on different layers of the network can be visualised by constructing an image that matches the style representation of a given input image. Performing gradient descent on $\vec{x}_0$ realises the style migration, where the loss function is defined as the mean squared distance between the Gram matrix entries of the style picture and those of the image to be generated.

Let $A^l$ and $G^l$ denote the respective style representations of $\vec{a}$ and $\vec{x}_0$ in layer $l$. The contribution of convolution layer $l$ to the total loss is:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

and the total style loss can be expressed as

$$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l} w_l E_l$$

where $w_l$ is the weight of each layer's contribution to the total loss. With respect to the activations in layer $l$, the derivative of $E_l$ can be computed analytically:

$$\frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \dfrac{1}{N_l^2 M_l^2} \left( (F^l)^\top \left( G^l - A^l \right) \right)_{ji} & \text{if } F^l_{ij} > 0 \\[4pt] 0 & \text{if } F^l_{ij} < 0 \end{cases}$$

The gradient of $E_l$ with respect to the value of each pixel of $\vec{x}$ can then be readily computed using standard error back-propagation. The outputs of the first, third, fifth, ninth, and thirteenth layers of the VGG-19 network are taken as the style constraints.
Step (4), input the content picture into the VGG-19 network and denote it $\vec{p}$. To visualise the image information encoded in the different convolution layers, a gradient descent operation is performed on the picture $\vec{x}_0$ output in step (2) to find another image that matches the feature responses of the original image. Let $P^l$ denote the feature representation of $\vec{p}$ in layer $l$, and let $F^l$ denote the feature representation of $\vec{x}_0$ in layer $l$, i.e. the activations of all convolution kernels in that layer. The squared-error loss between the two feature representations is then defined as

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$

The derivative of this loss with respect to the activation volume in layer $l$ equals

$$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} \left( F^l - P^l \right)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$$

from which the gradient with respect to the image $\vec{x}$ can be computed using standard error back-propagation. The initial image can thus be modified until it produces, in a certain layer of the convolutional neural network, the same response as $\vec{p}$; here, the initial image is not a completely random noise map but the output $\vec{x}_0$ of step (2). The output of the tenth layer of the VGG-19 network is taken as the content constraint;
Step (5), to transfer the style of the style picture $\vec{a}$ onto the content picture $\vec{p}$, take the output picture $\vec{x}_0$ obtained in step (2) as the image to be synthesised; this image must simultaneously match the style representation of $\vec{a}$ and the content representation of $\vec{p}$. All three are input into the VGG-19 network. The loss function adds, each multiplied by a coefficient, the style loss $\mathcal{L}_{style}$ between the style constraint obtained in step (3) and the output picture $\vec{x}_0$ obtained in step (2), and the content loss $\mathcal{L}_{content}$ between the content constraint obtained in step (4) and the output picture $\vec{x}_0$ obtained in step (2). A gradient descent algorithm is used to optimise the loss function; the overall style migration jointly minimises the Euclidean distances between the feature representations of $\vec{p}$, $\vec{a}$ and $\vec{x}$. The overall loss function is as follows:

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x})$$

where $\alpha$ and $\beta$ are the weighting factors for content and style reconstruction, respectively.
FIG. 2 is a content portrait image employed by embodiments of the present invention;
FIG. 3 is the result of the portrait style migration method of the present invention.

Claims (4)

1. A picture background style migration method based on semantic segmentation and deep learning is characterized by comprising the following steps:
step (1), selecting a content picture and a style picture and preprocessing the pictures;

step (2), inputting the preprocessed content picture and style picture respectively into a ResNet-50 network, taking the outputs of the last pooling layer, $\vec{F}_p$ for the content picture and $\vec{F}_a$ for the style picture, and performing a weighted addition with the formula

$$\vec{F}_0 = w_c \vec{F}_p + w_s \vec{F}_a$$

where $w_c$ and $w_s$ are the weights of the content and style features; inputting $\vec{F}_0$ into ResNet-50 for up-sampling to obtain a picture $\vec{x}_0$ of the same size as the input picture;
Step (3), inputting the style picture into a VGG-19 network to obtain style constraint;
step (4), inputting the content picture into a VGG-19 network to obtain content constraint;
step (5), in order to transfer the style of the style picture $\vec{a}$ onto the content picture $\vec{p}$, adopting the output picture $\vec{x}_0$ obtained in step (2) as the image to be synthesised, which simultaneously matches the style representation of $\vec{a}$ and the content representation of $\vec{p}$; inputting all three into the VGG-19 network, and calculating with a loss function the style loss $\mathcal{L}_{style}$ between the style picture $\vec{a}$ obtained in step (3) and the output picture $\vec{x}_0$ obtained in step (2), and the content loss $\mathcal{L}_{content}$ between the content picture $\vec{p}$ obtained in step (4) and the output picture $\vec{x}_0$ obtained in step (2); multiplying the two by coefficients and then adding them; using a gradient descent algorithm to optimise the loss function; the overall style migration jointly minimises the Euclidean distances between the feature representations of $\vec{p}$, $\vec{a}$ and $\vec{x}$; the overall loss function is as follows:

$$\mathcal{L}_{total}(\vec{p}, \vec{a}, \vec{x}) = \alpha \mathcal{L}_{content}(\vec{p}, \vec{x}) + \beta \mathcal{L}_{style}(\vec{a}, \vec{x})$$

where $\alpha$ and $\beta$ are the weighting factors for content and style reconstruction, respectively;
step (6), repeating step (5) for 10 iterative computations, so that the output picture $\vec{x}_0$ obtained in step (2) becomes, after iteration, an output $\vec{x}$ that is as similar as possible to the content picture in content and to the style picture in style;
step (7), finally, putting the segmented subject back onto the style-migrated background.
2. The method for migrating the background style of a picture based on semantic segmentation and deep learning according to claim 1, wherein the specific method of step (1) is as follows:
selecting a picture with a clearly defined subject as the content picture, and performing semantic segmentation on the content picture with a U-Net network, wherein the backbone network of the U-Net is ResNet-18; setting the number of output classes of the network to 2 and defining the softmax function required for classification:

$$p_k(x) = \frac{\exp(a_k(x))}{\sum_{k'=1}^{K} \exp(a_{k'}(x))}$$

wherein $a_k(x)$ denotes the score of feature channel $k$ at pixel $x$, $K$ is the number of classes, and $p_k(x)$ is the class-$k$ classification result for the pixel; the softmax function classifies and outputs the result once the neural network computation is complete: before the softmax, the data is represented as a one-dimensional activation volume, and the output after the softmax is mapped to the corresponding class, forming distinct labels for subject and background; after the subject is segmented, it is subtracted from the full image to obtain a content image with the subject removed and only the background remaining;
selecting a style picture whose background style is to be obtained, wherein a picture with an obvious, strong style is chosen as the style picture to obtain a better effect;
and finally, cropping the content picture and the style picture to the same size to obtain the preprocessed content picture and style picture.
3. The method for migrating the background style of a picture based on semantic segmentation and deep learning according to claim 2, wherein the specific method of step (3) is as follows:
inputting the style picture into the VGG-19 network and denoting it $\vec{a}$; to obtain a representation of the style of the input image, a feature space designed to capture texture information is employed; the feature space is built on top of the filter responses of any layer of the network and consists of the correlations between the different filter responses, where the expectation is taken over the spatial extent of the feature maps; the feature correlations are given by the Gram matrix $G^l \in \mathbb{R}^{N_l \times N_l}$:

$$G^l_{ij} = \sum_k F^l_{ik} F^l_{jk}$$

wherein $G^l_{ij}$ is the inner product between the vectorised feature maps $i$ and $j$ in layer $l$, $N_l$ is the number of feature maps in layer $l$, and $M_l$ is their spatial size;

through the inner product operation, a style representation of the input image can be obtained; the information captured by these style feature spaces built on different layers of the network is visualised by constructing an image that matches the style representation of a given input image; performing gradient descent on $\vec{x}_0$ realises the style migration, wherein the loss function is defined as the mean squared distance between the Gram matrix entries of the style picture and those of the image to be generated;

$A^l$ and $G^l$ denote the respective style representations of $\vec{a}$ and $\vec{x}_0$ in layer $l$; the contribution of convolution layer $l$ to the total loss is:

$$E_l = \frac{1}{4 N_l^2 M_l^2} \sum_{i,j} \left( G^l_{ij} - A^l_{ij} \right)^2$$

the total loss can be expressed as

$$\mathcal{L}_{style}(\vec{a}, \vec{x}) = \sum_{l} w_l E_l$$

wherein $w_l$ is the weight of each layer's contribution to the total loss; with respect to the activations in layer $l$, the derivative of $E_l$ can be computed analytically:

$$\frac{\partial E_l}{\partial F^l_{ij}} = \begin{cases} \dfrac{1}{N_l^2 M_l^2} \left( (F^l)^\top \left( G^l - A^l \right) \right)_{ji} & \text{if } F^l_{ij} > 0 \\[4pt] 0 & \text{if } F^l_{ij} < 0 \end{cases}$$

wherein $F^l$ denotes the activated feature representation of all convolution kernels of $\vec{x}_0$ in layer $l$; the gradient of $E_l$ with respect to the value of each pixel of $\vec{x}$ can be readily computed using standard error back-propagation; the outputs of the first, third, fifth, ninth, and thirteenth layers of the VGG-19 network are taken as the style constraints.
4. The method for migrating the background style of a picture based on semantic segmentation and deep learning according to claim 3, wherein the specific method of step (4) is as follows:
inputting the content picture into the VGG-19 network and denoting it $\vec{p}$; to visualise the image information encoded in the different convolution layers, performing a gradient descent operation on the picture $\vec{x}_0$ output in step (2) to find another image that matches the feature responses of the original image; $P^l$ denotes the feature representation of $\vec{p}$ in layer $l$, and $F^l$ denotes the activated feature representation of all convolution kernels of $\vec{x}_0$ in layer $l$; the squared-error loss between the two feature representations is then defined as

$$\mathcal{L}_{content}(\vec{p}, \vec{x}, l) = \frac{1}{2} \sum_{i,j} \left( F^l_{ij} - P^l_{ij} \right)^2$$

the derivative of this loss with respect to the activation volume in layer $l$ equals

$$\frac{\partial \mathcal{L}_{content}}{\partial F^l_{ij}} = \begin{cases} \left( F^l - P^l \right)_{ij} & \text{if } F^l_{ij} > 0 \\ 0 & \text{if } F^l_{ij} < 0 \end{cases}$$

from which the gradient with respect to the image $\vec{x}$ can be computed using standard error back-propagation; the initial image can thus be modified until it produces, in a certain layer of the convolutional neural network, the same response as $\vec{p}$; the initial image is not a completely random noise map but the output $\vec{x}_0$ of step (2); the output of the tenth layer of the VGG-19 network is taken as the content constraint.
CN202010043890.XA 2020-01-15 2020-01-15 Image background style migration method based on semantic segmentation and deep learning Active CN111242841B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010043890.XA CN111242841B (en) 2020-01-15 2020-01-15 Image background style migration method based on semantic segmentation and deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010043890.XA CN111242841B (en) 2020-01-15 2020-01-15 Image background style migration method based on semantic segmentation and deep learning

Publications (2)

Publication Number Publication Date
CN111242841A CN111242841A (en) 2020-06-05
CN111242841B (en) 2023-04-18

Family

ID=70879562

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010043890.XA Active CN111242841B (en) 2020-01-15 2020-01-15 Image background style migration method based on semantic segmentation and deep learning

Country Status (1)

Country Link
CN (1) CN111242841B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111986302A (en) * 2020-07-23 2020-11-24 北京石油化工学院 Image style migration method and device based on deep learning
CN112686906B (en) * 2020-12-25 2022-06-14 山东大学 Image segmentation method and system based on uniform distribution migration guidance
CN112819873B (en) * 2021-02-05 2022-06-24 四川大学 High-generalization cross-domain road scene semantic segmentation method and system
CN113344771B (en) * 2021-05-20 2023-07-25 武汉大学 Multifunctional image style migration method based on deep learning
CN113570500A (en) * 2021-08-04 2021-10-29 光华临港工程应用技术研发(上海)有限公司 Method for realizing real image style migration based on global information guide network
CN113868651B (en) * 2021-09-27 2024-04-26 中国石油大学(华东) Web log-based website anticreeper method
CN114239116B (en) * 2021-12-21 2022-07-12 盈嘉互联(北京)科技有限公司 BIM design recommendation method based on style migration
CN115511700B (en) * 2022-09-15 2024-03-05 南京栢拓视觉科技有限公司 Material style migration system with refined high-quality effect
CN115641253B (en) * 2022-09-27 2024-02-20 南京栢拓视觉科技有限公司 Material nerve style migration method for improving aesthetic quality of content
CN116227428B (en) * 2023-05-08 2023-07-18 中国科学技术大学 Text style migration method based on migration mode perception
CN116452414B (en) * 2023-06-14 2023-09-08 齐鲁工业大学(山东省科学院) Image harmony method and system based on background style migration
CN118096505A (en) * 2024-04-28 2024-05-28 厦门两万里文化传媒有限公司 Commodity display picture generation method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10049297B1 (en) * 2017-03-20 2018-08-14 Beihang University Data driven method for transferring indoor scene layout and color style
CN108805803A (en) * 2018-06-13 2018-11-13 衡阳师范学院 A kind of portrait style moving method based on semantic segmentation Yu depth convolutional neural networks
CN109697690A (en) * 2018-11-01 2019-04-30 北京达佳互联信息技术有限公司 Image Style Transfer method and system
CN110111291A (en) * 2019-05-10 2019-08-09 衡阳师范学院 Based on part and global optimization blending image convolutional neural networks Style Transfer method

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10049297B1 (en) * 2017-03-20 2018-08-14 Beihang University Data driven method for transferring indoor scene layout and color style
CN108805803A (en) * 2018-06-13 2018-11-13 衡阳师范学院 A kind of portrait style moving method based on semantic segmentation Yu depth convolutional neural networks
CN109697690A (en) * 2018-11-01 2019-04-30 北京达佳互联信息技术有限公司 Image Style Transfer method and system
CN110111291A (en) * 2019-05-10 2019-08-09 衡阳师范学院 Based on part and global optimization blending image convolutional neural networks Style Transfer method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
郭美钦; 江健民. Improved algorithm for face image style transfer. Journal of Shenzhen University (Science and Engineering), No. 3, full text. *

Also Published As

Publication number Publication date
CN111242841A (en) 2020-06-05

Similar Documents

Publication Publication Date Title
CN111242841B (en) Image background style migration method based on semantic segmentation and deep learning
CN108230278B (en) Image raindrop removing method based on generation countermeasure network
CN109685819B (en) Three-dimensional medical image segmentation method based on feature enhancement
CN109886881B (en) Face makeup removal method
CN107767328A (en) The moving method and system of any style and content based on the generation of a small amount of sample
CN106462771A (en) 3D image significance detection method
CN107784288A (en) A kind of iteration positioning formula method for detecting human face based on deep neural network
CN110543916B (en) Method and system for classifying missing multi-view data
CN114581356B (en) Image enhancement model generalization method based on style migration data augmentation
CN112101364B (en) Semantic segmentation method based on parameter importance increment learning
CN113256494B (en) Text image super-resolution method
CN112884668A (en) Lightweight low-light image enhancement method based on multiple scales
CN110852935A (en) Image processing method for human face image changing with age
CN108460400A (en) A kind of hyperspectral image classification method of combination various features information
CN107967497A (en) Manuscripted Characters Identification Method based on convolutional neural networks and extreme learning machine
CN109886281A (en) One kind is transfinited learning machine color image recognition method based on quaternary number
CN117788629A (en) Image generation method, device and storage medium with style personalization
CN115953330B (en) Texture optimization method, device, equipment and storage medium for virtual scene image
CN110210562B (en) Image classification method based on depth network and sparse Fisher vector
CN110222222B (en) Multi-modal retrieval method based on deep topic self-coding model
CN116030961A (en) Traditional Chinese medicine constitution identification method and system based on multi-view tongue picture feature fusion
CN114445280A (en) Point cloud down-sampling method based on attention mechanism
WO2021137942A1 (en) Pattern generation
CN110852167A (en) Remote sensing image classification method based on optimal activation model
Fadaeddini et al. Data augmentation using fast converging CIELAB-GAN for efficient deep learning dataset generation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant