CN107767328B

CN107767328B - Migration method and system of any style and content generated based on small amount of samples

Info

Publication number: CN107767328B
Application number: CN201710957685.2A
Authority: CN
Inventors: 张娅; 张烨珣; 蔡文彬; 王延峰
Original assignee: Shanghai Media Intelligence Co ltd
Current assignee: Shanghai Media Intelligence Co ltd
Priority date: 2017-10-13
Filing date: 2017-10-13
Publication date: 2021-12-17
Anticipated expiration: 2037-10-13
Also published as: CN107767328A

Abstract

The invention provides a method and a system for migrating any style and content based on a small amount of sample generation, which comprises the following steps: style feature extraction: for the style reference image, extracting style characteristics of the image by using a depth convolution neural network; a content feature extraction step: extracting content characteristics of the image by using a depth convolution neural network for the content reference image; and combining style and content characteristics: combining the extracted style features and the content features through a bilinear model to obtain the features of the target image; a target image generation step: and generating the target image by the characteristics of the target image through a deconvolution neural network. The method and the device separate styles and contents, realize the image generation of any target content in any target style, can migrate to unseen styles and unseen contents, and extract the characteristics of the unseen styles and contents. The invention only needs a small amount of reference images, and the style characteristic and the content characteristic can be learned from a small amount of data.

Description

Migration method and system of any style and content generated based on small amount of samples

Technical Field

The invention relates to a method in the field of computer vision and image processing, in particular to a method and a system for migrating any style and content generated based on a small number of samples.

Background

In the field of style migration, initial research efforts were directed to progressively transform a noisy image into an image with target style target content in an iterative fashion based on a trained existing network, such as a VGG network. However, this method requires many iterations and is inefficient. Much of the research work that follows is directed to learning a model that can generate images with a certain style directly from a content image. The loss functions used are mostly perceptual loss functions, so that the generated image is an image of the target genre and the target content. However, the deep neural network models used in these works do not separate styles and contents, and the two factors of styles and contents are still fused together, so these networks can only migrate styles appearing in the training set, if migration needs to be performed on other unseen styles, the models need to be retrained, and a large number of training images are used in the process of retraining the models.

The current deep neural network model generally needs a large amount of training data for learning, and human beings have the ability to learn a certain concept from a small amount of samples or even a single sample, for example, when a child is taught to recognize a certain animal, the child can recognize and learn the certain animal according to one or more images of the certain animal, and can recognize and judge the certain animal when the child sees a new image of the certain animal again. In particular, humans also have the ability to generate from a small number of samples, encounter a new image and concept, understand its structure and expressive information, and imagine its deformation in the mind. In addition, the human can automatically fuse the images according to certain styles and certain content images, and imagine the style-transferred images formed by combining the two images.

With the gradual development of the deep neural network in the field of image processing, some algorithms for target recognition based on a small number of samples appear at present, and good effects are obtained. Unlike the generation of images, object recognition based on a small number of samples focuses more on extracting features of an object, and because the number of samples is small, more representative and distinctive features need to be extracted. In the current research, no method for performing style migration by using a small amount of samples is available.

Through retrieval, the chinese patent publication No. CN106651766A discloses an image style migration method based on a deep convolutional neural network, which implements style migration through image input, loss function training, stylization, image enhancement, and image refinement, but has the disadvantage that the above patent can only implement style migration that occurs in a training set, and does not have a style that has already occurred in the training set, and needs retraining, which is time-consuming on one hand and requires a large number of training samples for training on the other hand.

Disclosure of Invention

Aiming at the defects in the prior art, the invention aims to provide a method and a system for transferring any style and content generated based on a small amount of samples, each trained module can be transferred to unseen styles and content, and for unseen styles and content, a small amount of style reference images and content reference images can be used for generation.

According to a first aspect of the present invention, there is provided a migration method of arbitrary styles and contents generated based on a small number of samples, comprising:

style feature extraction: for the style reference image, extracting style characteristics of the image by using a depth convolution neural network;

a content feature extraction step: extracting content characteristics of the image by using a depth convolution neural network for the content reference image;

and combining style and content characteristics: combining the style characteristics obtained in the style characteristic extraction step with the content characteristics obtained in the content characteristic extraction step through a bilinear model to obtain the characteristics of the target image;

a target image generation step: and performing deconvolution on the characteristics of the target image obtained in the step of combining the style and the content characteristics to generate the target image.

The migration method of any style and content generated based on a small amount of samples can carry out end-to-end training.

Preferably, the style feature extraction step, wherein: and combining a small number of style reference images in the style reference set on the image channel dimension by using a deep convolutional neural network model, inputting the combined style reference images into a style feature extraction step, and learning a common part in the reference images, namely the style of the images to obtain style features. The style reference set is an image of different contents in the same style, and provides information of the style to be generated for the style characteristic extraction step.

Preferably, the content feature extraction step, wherein: and combining a small number of content reference images in the content reference set on the image channel dimension by using a deep convolutional neural network model, inputting the combined images into a content feature extraction step, and learning a common part in the reference images, namely the content of the images to obtain content features. The content reference set is images of the same content in different styles, and provides information of the content to be generated for the content feature extraction step.

Preferably, the genre and content characteristics are combined, wherein: the style and content features are combined by a bilinear model. A bilinear model is a two-factor model that has the mathematical property that one factor is fixed and the output of the model is linear to the other factor. It is flexible to separate or combine two factors.

Preferably, the target image generating step specifically includes: and inputting the characteristics of the target image obtained in the step of combining the style and the content characteristics into a deconvolution neural network to generate the target image.

According to a second aspect of the present invention, there is provided a style migration system based on a small number of sample generations, comprising:

a style feature extraction module: for the style reference image, extracting style characteristics of the image by using a depth convolution neural network;

a content feature extraction module: extracting content characteristics of the image by using a depth convolution neural network for the content reference image;

a genre and content feature integration module: combining the style characteristics obtained by the style characteristic extraction module with the content characteristics obtained by the content characteristic extraction module through a bilinear model to obtain the characteristics of a target image;

a target image generation module: and performing deconvolution on the characteristics of the target image obtained by the style and content characteristic combination module to generate the target image.

Compared with the prior art, the invention has the following beneficial effects:

the method respectively extracts the style features and the content features from the style reference images and the content reference images through style feature extraction and content feature extraction, separates the style and the content by utilizing the conditional correlation between the style and the content, and learns the representation modes of the style and the content features, thereby migrating each learned module to any unseen style and content. And then combining the extracted style and content features by using a bilinear model to generate the features of the target image, and further generating the target image by using a target image generation module.

The method only uses a small number of reference images, and the style reference set and the content reference set used in the training process only have a small number of samples, so that the method can extract style and content characteristics from a small number of samples, and only needs to migrate a small number of samples for unseen style and content.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a flow chart of a method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a system according to an embodiment of the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that variations and modifications can be made by persons skilled in the art without departing from the spirit of the invention. All falling within the scope of the present invention.

As shown in fig. 1, which is a flowchart of an embodiment of a migration method of an arbitrary genre and content generated based on a small number of samples according to the present invention, the method processes a genre reference image and a content reference image into a genre feature and a content feature through a genre feature extraction step and a content feature extraction step, combines the genre feature and the content feature into a feature of a target image using a genre and content feature combination step, and finally generates the target image using a target image generation step.

The method and the device can separate the styles from the content by utilizing the conditional correlation of the styles and the content, and realize the image generation of any target content in any target style. Because the feature representation mode is learned, the trained style and content feature extraction module can be transferred to unseen styles and unseen contents, and can extract the features of the unseen styles and contents. In addition, the invention only needs a small amount of reference images, and the style characteristic and the content characteristic can be learned from a small amount of data.

Specifically, with reference to fig. 1, the method comprises the steps of:

Corresponding to the above method, referring to fig. 2, a style migration system based on a small number of sample generation includes:

Specific implementations of the above steps and modules are described in detail below to facilitate understanding of the technical solutions of the present invention.

In some embodiments of the present invention, the style feature extracting step includes: and combining a small number of style reference images in the style reference set on the image channel dimension by using a deep convolutional neural network model, inputting the combined style reference images into a style feature extraction module, and learning a common part in the reference images, namely the style of the images to obtain style features. The style reference set is an image of different contents in the same style, and provides the style information to be generated for the style characteristic extraction step.

In some embodiments of the present invention, the content feature extraction step includes: and combining a small number of content reference images in the content reference set on the image channel dimension by using a deep convolutional neural network model, inputting the combined content reference images into a content feature extraction module, and learning a common part in the reference images, namely the content of the images to obtain content features. The content reference set is images of the same content in different genres, and provides information of the content to be generated to the content feature extraction step.

In some embodiments of the present invention, the style and content features are combined, wherein: the style and content features are combined by a bilinear model. A bilinear model is a two-factor model that has the mathematical property that one factor is fixed and the output of the model is linear to the other factor. It is flexible to separate or combine two factors.

In some embodiments of the present invention, the target image generating step specifically includes: and inputting the characteristics of the target image obtained in the step of combining the style and the content characteristics into a deconvolution neural network to generate the target image.

In a preferred embodiment of the present invention, the target image generating step may specifically include the following operations:

in the training phase, in each iteration, a training target image set D is first defined_tI.e. the set of target images that it is desired to generate; for each target image I_ij∈D_tProviding a referenceImages to determine a target style S to be generated_iAnd target content C_jRandomly selecting k style reference images from the training set for each target image to form a style reference set R_siRandomly selecting k content reference images to form a content reference set R_cjThen is followed by<Style reference image, content reference image, target image>Training in a triple form, wherein a style reference image and a content reference image are respectively input into a style feature extraction step and a content feature extraction step to obtain style features and content features, the obtained style features and content features are combined by a style and content feature combination module to obtain features of a target image, and the features are input into a target image generation module to generate the target image;

in the character style migration task: the generated target image is compared with the real target image and the following L1 loss function is calculated:

wherein, I_ijIs a true style of S_iThe content is C_jThe target image of (a) is displayed,

is generated with a style S_iThe content is C_jA target image;

considering the imbalance of the samples in each training batch, adding a corresponding weight to each target image in a loss function, wherein the weight comprises two aspects, namely the weight related to the size and the thickness of the image character, namely taking the reciprocal of the number of pixel points covered by the character as the weight, and the calculation formula of the weight related to the size and the thickness of the image character is as follows:

wherein N is_ijIs a target image I_ijThe number of pixels covered by the middle character,

is a target image I_ijThe more pixel points are covered, the larger or thicker the character is, and the lower the corresponding weight is;

the other is weight related to the blackness of the image characters, and the calculation method is as follows: calculating the average pixel value of pixel points covered by the characters for each target image, and then setting the loss weight in the form of a softmax function:

wherein mean is_ijIs a target image I_ijThe middle character covers the average pixel value of the pixel,

is a target image I_ijThe lower the average pixel value, the darker the character, the lower the corresponding weight;

therefore, the training targets of the migration method based on the arbitrary style and content generated by a small number of samples are as follows:

where D is the set of all training samples, D_siIs to remove I_ijAll other styles are S_iSet of training samples of (D)_cjIs to remove I_ijAll styles except C_jIs a parameter of the convolutional neural network used in the method and a parameter of the bilinear model,

is the generated target image.

As shown in FIG. 2, the migration system is composed of a style feature extraction module, a content feature extraction module, a style and content feature combination module and a target image generation module, and the whole system framework can be trained end to end.

In the system framework of the embodiment shown in fig. 2, the genre reference image and the content reference image are first combined in the image channel dimension and then input to the genre feature extraction module and the content feature extraction module, respectively. The style feature extraction module and the content feature extraction module are both composed of a series of downsampling modules consisting of convolutional layers + batchnorm layers + leakyrelu layers, wherein the sizes of convolution kernels of the other convolutional layers are 3 × 3 except that the size of a convolution kernel of the first convolutional layer is 5 × 5. The number of channels at the output of each layer is indicated in fig. 2.

In the system framework of the embodiment shown in fig. 2, the style and content feature combination module combines the style feature and the content feature using a bilinear model, and the formula of the combination is as follows:

F_ij＝S_iWC_j

wherein S is_iRepresenting a style feature, is an R-dimensional vector, C_jRepresenting the content features, is a B-dimensional vector, W is an R × K × B-dimensional tensor, F_ijIs the characteristic of the target image, is a K-dimensional vector with the representative style of S_iThe content is C_jThe characteristic of the image of (1). And when the style image is combined, combining the style characteristics obtained from the style reference set and the content characteristics obtained from the content reference set according to the formula to obtain the characteristics of the target image.

In the system framework of the embodiment shown in fig. 2, the target image generation module inputs the features of the target image obtained by the style and content feature combination module into the deconvolution neural network to generate the target image. The structure of the target image generation module is symmetrical to the style feature extraction module and the content feature extraction module, and the target image generation module is composed of an up-sampling module consisting of a series of deconvolution layers, a batchnorm layer and a relu layer, and is used for up-sampling the target image features to enable the generated image to be the same as the real target image in size. Wherein the convolution kernel sizes of the deconvolution layers are 3 × 3, except for the convolution kernel size of the last deconvolution layer being 5 × 5. The number of channels at the output of each layer is indicated in fig. 2.

In the training phase, in each iteration, a training target image set D is first defined_tI.e. the set of target images that it is desired to generate. For each target image I_ij∈D_tProviding a reference image to determine a target style S to be generated_iAnd target content C_jRandomly selecting k style reference images from the training set for each target image to form a style reference set R_siRandomly selecting k content reference images to form a content reference set R_cj. Then is provided with<Style reference image, content reference image, target image>And training in a triple form, wherein the style reference image and the content reference image are respectively input into a style feature extraction module and a content feature extraction module to obtain style features and content features. The obtained style characteristic and the content characteristic are combined by a style and content characteristic combining module to obtain the characteristic of the target image, and the characteristic is input into a target image generating module to generate the target image.

In the character style migration task, the generated target image is compared with the real target image, and the following L1 loss function is calculated:

wherein, I_ijIs a real image of the object and is,

is the generated target image.

Furthermore, because the size, thickness and blackness difference of characters in a batch of target images selected in each iteration is likely to be large, some target images contain larger and thicker characters, and the loss caused by more pixel points covered by the characters is larger than that of other images, the method can focus on learning and optimizing larger and thicker fonts. In addition, there are some images with darker characters, and the method also favors learning over them, and does not yield well for lighter characters. Considering the imbalance of the samples in each training batch, adding corresponding weight to each target image in a loss function, wherein the weight comprises two aspects, namely the weight related to the size and the thickness of the image character, namely taking the reciprocal of the number of pixel points covered by the character as the weight. The calculation formula of the weights for the size and thickness of the image character is as follows:

is a target image I_ijWeight on image character size and thickness. The more pixel points covered, the larger or coarser the character, the lower the corresponding weight.

is a target image I_ijWeight of blackness of the image character. The lower the average pixel value, the darker the character, and the lower the corresponding weight.

is the generated target image.

In summary, the style and the content of the image can be separated by using the conditional correlation between the style and the content, the style characteristic and the content characteristic can be respectively extracted from the style reference image and the content reference image by the style and content characteristic extraction module, then the style characteristic and the content characteristic are combined by the bilinear model, and the obtained characteristic of the target image is input to the target image generation module to generate the target image. By setting the weighting L1 loss function, the imbalance of the selected target image per iteration can be improved.

In addition, the training sample of the triple of the < style reference image, the content reference image and the target image > designed in the invention enables the style and content feature extraction module to learn the feature representation mode, so the style and content feature extraction module obtained by training can be transferred to the unseen style and content, and the features of the unseen style and content can also be extracted. In addition, the style reference set and the content reference set used in training only contain a small number of reference images, so the method can realize the function of generating the target image with unseen style and unseen content by using a small number of reference samples.

The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention.

Claims

1. A migration method of any style and content generated based on a small number of samples is characterized by comprising the following steps:

style feature extraction: extracting style characteristics of the image from the style reference image by using a first depth convolution neural network;

a content feature extraction step: extracting content features of the image from the content reference image by using a second deep convolution neural network;

a target image generation step: performing deconvolution on the characteristics of the target image obtained in the step of combining the style and the content characteristics to generate a target image;

the style and content features combining step, wherein: the bilinear model is a two-factor model, has the mathematical property that one factor is fixed, and the output of the model is linear to the other factor, so that the two factors can be flexibly separated or combined;

the bilinear model is combined, and the formula is as follows:

F_ij＝S_iWC_j

wherein S is_iRepresenting a style feature, is an R-dimensional vector, C_jRepresenting the content features, is a B-dimensional vector, W is an R × K × B-dimensional tensor, F_ijIs the characteristic of the target image, is a K-dimensional vector with the representative style of S_iThe content is C_jThe characteristics of the image of (a); when the image is combined, combining the style characteristics obtained by the style reference set and the content characteristics obtained by the content reference set according to the formula to obtain the characteristics of the target image;

the style feature extraction step, wherein: combining a small number of style reference images in a style reference set on the image channel dimension by using a first depth convolution neural network model, inputting the style reference images together into a style feature extraction step, and learning a common part in the reference images, namely the style of the images to obtain style features; the style reference set is an image with the same style and different contents, and provides information of the style to be generated for the style characteristic extraction step;

the content feature extraction step, wherein: combining a small number of content reference images in the content reference set on the image channel dimension by using a second deep convolution neural network model, inputting the combined content reference images into the content feature extraction step, and learning common parts in the reference images, namely the content of the images to obtain content features; the content reference set is images of the same content and different styles, and provides information of the content to be generated for the content feature extraction step;

the target image generating step, wherein: inputting the characteristics of the target image obtained in the step of combining the style and the content characteristics into a deconvolution neural network to generate a target image, and comparing the generated target image with a real target image through a loss function; the target image generation step specifically comprises the following steps:

in the training phase, in each iteration, a training target image set D is first defined_tI.e. the set of target images that it is desired to generate; for each target image I_ij∈D_tProviding a reference image to determine a target style S to be generated_iAnd target content C_jRandomly selecting k style reference images from the training set for each target image to form a style reference set R_siRandomly selecting k content reference images to form a content reference set R_cjThen is followed by<Style reference image, content reference image, target image>And training in a triple form, wherein the style reference image and the content reference image are respectively input into the style characteristic extraction step and the content characteristic extraction step to obtain style characteristics and content characteristics.

2. The method for migrating arbitrary styles and contents generated based on a small number of samples as claimed in claim 1, wherein said target image generation step further has the following features:

is generated with a style S_iThe content is C_jA target image;

considering the imbalance of the samples in each training batch, a corresponding weight is added to each target image in the loss function, and the weight comprises two aspects: the first is the weight of the size and thickness of the image character, that is, the reciprocal of the number of pixel points covered by the character is taken as the weight, and the calculation formula of the weight of the size and thickness of the image character is as follows:

the training targets of the migration method of any style and content generated based on a small amount of samples are as follows:

is the generated target image.

3. A migration system of arbitrary styles and contents based on small sample generation using the method of any one of claims 1-2, characterized by: