CN110580726A

CN110580726A - Dynamic convolution network-based face sketch generation model and method in natural scene

Info

Publication number: CN110580726A
Application number: CN201910772659.1A
Authority: CN
Inventors: 林倞; 陈景文; 刘凌波; 李冠彬
Original assignee: Sun Yat Sen University
Current assignee: Sun Yat Sen University
Priority date: 2019-08-21
Filing date: 2019-08-21
Publication date: 2019-12-17
Anticipated expiration: 2039-08-21
Also published as: CN110580726B

Abstract

The invention discloses a face sketch generation model and method for natural scenes based on a dynamic convolution network. The method comprises the following steps: step S1, initializing the parameters of all convolution networks and fully connected networks; step S2, acquiring a face image and extracting its hierarchical features using a fully convolutional neural network; step S3, upsampling the obtained features using a transposed convolution network, and mining latent face regions and face deformation information using a deformable convolution network; step S4, dividing the features into multi-scale regions, dynamically calculating adaptive filter weights in each region, convolving the filter weights with the features to obtain new features, and combining the region features of all scales to generate a high-quality face sketch; step S5, updating the model parameters according to a comparison between the generated face sketch and the real face sketch; and step S6, repeating the training of steps S2-S5 over multiple iterations.

Description

Dynamic convolution network-based face sketch generation model and method in natural scene

Technical Field

The invention relates to the technical field of computer vision based on deep learning, in particular to a face sketch generation model and method in a natural scene based on a dynamic convolution network.

Background

Face sketch generation refers to automatically generating the corresponding face sketch from a face photo. It is a classic task in the field of computer vision. Face sketches have wide practical applications, for example in law enforcement and digital entertainment, which has attracted a great deal of research from both academia and industry.

In recent years, the successful application of convolutional neural networks has brought a major breakthrough to face sketch generation. For example, Liliang Zhang et al., "End-to-End Photo-Sketch Generation via Fully Convolutional Representation Learning" (The Annual ACM International Conference on Multimedia Retrieval (ICMR), 2015), and Phillip Isola et al., "Image-to-Image Translation with Conditional Adversarial Networks" (In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017), both model face sketch generation with convolutional neural networks. However, these methods, which rely on convolutional neural networks and generative adversarial networks from deep learning theory, perform well only under constrained conditions: for example, the background of the face image must be processed into a solid color, and the orientation of the face must be restricted.

However, most face images in real natural scenes are captured under unconstrained conditions, so these methods lack generality in such scenes. In the last two years, some face sketch generation methods for unconstrained conditions have also made progress, for example Zhang et al., "Content-Adaptive Sketch Portrait Generation by Decompositional Representation Learning" (IEEE Transactions on Image Processing, 2017), and Jun Yu et al., "Composition-Aided Face Photo-Sketch Synthesis" (2017). To maintain performance under unconstrained conditions, these methods preprocess the image before generating the face sketch; the preprocessing includes removing the cluttered background and parsing the face into different components (such as hair, eyes and mouth). However, such preprocessing is very time-consuming and may even fail in complicated scenes. As a result, existing face sketch generation methods suffer severe performance degradation and generalize poorly in real natural scenes.

Disclosure of Invention

In order to overcome the defects in the prior art, the invention aims to provide a model and a method for generating a face sketch in a natural scene based on a dynamic convolutional network, so as to achieve the purpose of effectively generating the face sketch in the natural scene without depending on any preprocessing method.

In order to achieve the above object, the present invention provides a face sketch generation model in a natural scene based on a dynamic convolution network, including:

an initialization unit for initializing the network parameters of the model;

an encoder unit for acquiring a face image in a natural scene without preprocessing and extracting the hierarchical features of the face image using a fully convolutional neural network;

a decoder unit for progressively upsampling the hierarchical features generated by the encoder unit using a transposed convolution network, and mining latent face regions and face deformation information using a deformable convolution network;

a face sketch generation unit for dividing the features output by the deformable convolution network into multi-scale regions, dynamically calculating adaptive filter weights in each region, convolving the filter weights with the features of the corresponding regions to obtain new features, and combining the features of all regions at multiple scales to generate a high-quality face sketch;

an updating unit for comparing the face sketch generated by the face sketch generation unit with the real face sketch and updating the model parameters by optimizing an objective function;

and an iterative training unit for repeating the training processes of the encoder unit, the decoder unit, the face sketch generation unit and the updating unit over multiple iterations until training converges or the maximum number of iterations is reached, to obtain the final model.

Preferably, the unpreprocessed face images in natural scenes come from a pre-established training set, which is built by the following process:

collecting face images and their face sketches from different data sources, and establishing a face sketch data set containing face images in natural scenes and their corresponding face sketches, without using any preprocessing such as background removal;

and randomly selecting a number of image pairs from the established face sketch data set as the training set, and using the remaining image pairs as the test set.

Preferably, the fully convolutional neural network comprises eight convolutional layers in sequence, each followed by a rectified linear unit; the convolution kernel size of each convolutional layer is 2 × 2, and the numbers of output channels are 64, 64, 128, 128, 192, 192, 256 and 256 respectively.

Preferably, in the fully convolutional neural network, a pooling layer is inserted after each of the second, fourth and sixth convolutional layers for downsampling, and the stride and pooling size of each pooling layer are both 2 × 2.

Preferably, the decoder unit comprises:

a transposed convolution network for acquiring the hierarchical features output by the encoder unit and upsampling them;

a deformable convolution network for acquiring the features upsampled by the transposed convolution network, computing an offset O_p for each pixel p in the features with a convolutional layer, and, for each pixel p, convolving the input feature F_i with the filter weight w of the deformable convolution network under the guidance of the offset O_p to obtain the output feature F_o;

and a splicing module for concatenating, through skip connections, the features upsampled by the transposed convolution network with the features of the same resolution from the fully convolutional neural network of the encoder unit, and adding a standard convolution layer after the concatenated features to reduce the number of channels.

Preferably, the deformable convolution network operates in two steps. The first step generates the position offsets: the feature F_i is input into a convolutional layer with a 3 × 3 convolution kernel and 18 output channels, which computes the offsets O for all pixels in the feature; the offset at each pixel p is denoted O_p. In the second step, for each pixel p, the input feature F_i is convolved with the filter weight w of the deformable convolution network under the guidance of the offset O_p to obtain the output feature F_o.

Preferably, the decoder unit is formed by stacking 3 layers, each comprising a transposed convolution network and a deformable convolution network. After each transposed convolution network, the splicing module concatenates the features of the same resolution from the fully convolutional encoder through a skip connection, and a standard convolution layer may be added after the third concatenated feature to reduce the number of channels.

Preferably, the face sketch generating unit further comprises:

a dividing module for equally dividing the final output feature of the deformable convolution network into n × n regions, each with a resolution of (H/n) × (W/n), the i-th region being denoted R_i;

a mapping module for mapping R_i at each scale to a fixed-size low-dimensional feature using a spatial pooling layer;

a weight calculation module for inputting the pooled features into three consecutive fully connected networks with dimensions of 256, 512 and 18432, the output of the last fully connected network being reshaped to 64 × 3 × 3 × 32 and recorded as the weight w_i of the adaptive convolution network;

a convolution module for convolving the weight w_i obtained from the three fully connected layers with the features of region R_i to obtain new region-specific features;

and a feature combination module for combining the features of all regions and generating a high-quality face sketch.

Preferably, the feature combination module reassembles the region-specific features of all regions at each scale into a feature with a resolution of H × W, concatenates the features of all scales, and inputs the concatenated features into a standard convolution network with a 1 × 1 convolution kernel to generate the final face sketch.

In order to achieve the above object, the present invention further provides a method for generating a face sketch in a natural scene based on a dynamic convolution network, comprising the following steps:

Step S1, initializing the network parameters of all convolution networks and fully connected networks;

Step S2, acquiring a face image in a natural scene without preprocessing, and extracting the hierarchical features of the face image using a fully convolutional neural network;

Step S3, upsampling the hierarchical features obtained in step S2 using a transposed convolution network, and mining latent face regions and face deformation information using a deformable convolution network;

Step S4, dividing the features output by the deformable convolution network into multi-scale regions, dynamically calculating adaptive filter weights in each region, convolving the filter weights with the features to obtain new features, and combining the region features of all scales to generate a high-quality face sketch;

Step S5, updating the parameters of the model according to a comparison between the generated face sketch and the real face sketch;

Step S6, repeating the training of steps S2-S5 over multiple iterations until training converges or the maximum number of iterations is reached, to obtain the final model.

Compared with the prior art, the face sketch generation model and method of the present invention work as follows. A face sketch data set of natural scenes is selected without any preprocessing such as background removal, and the weights of the target model are initialized. A face image in a natural scene is input into a fully convolutional network consisting of consecutive convolutional layers and pooling layers to extract hierarchical features. The hierarchical features are input into a transposed convolution network and a deformable convolution network, which upsample them and compute features containing latent face region information. The features are divided into multi-scale regions, adaptive filter weights are dynamically calculated in each region, and the filter weights are convolved with the features to obtain new features. The region features of the three scales are combined to generate a high-quality face sketch, and the model parameters are updated according to a comparison between the generated face sketch and the real face sketch. During optimization, the method dynamically and adaptively computes the features of face components in regions of different scales, so it can generate high-quality sketches of faces in unconstrained natural environments without any preprocessing. According to the invention, the resulting face sketches in natural scenes surpass those of existing methods under both constrained and unconstrained conditions.

Drawings

FIG. 1 is a system architecture diagram of a human face sketch generation model in a natural scene based on a dynamic convolution network according to the present invention;

FIG. 2 is a schematic diagram of a deformable convolution network in accordance with an embodiment of the present invention;

FIG. 3 is a diagram of a dynamic convolution network framework under non-limiting conditions in an embodiment of the present invention;

FIG. 4 is a diagram of an adaptive convolutional network in an embodiment of the present invention;

FIG. 5 is a flowchart of steps of a method for generating a face sketch in a natural scene based on a dynamic convolutional network according to the present invention.

Detailed Description

The embodiments of the present invention are described below through specific examples with reference to the accompanying drawings, and other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from this disclosure. The invention is capable of other and different embodiments, and its details may be modified in various respects without departing from the spirit and scope of the present invention.

Fig. 1 is a system architecture diagram of a human face sketch generation model in a natural scene based on a dynamic convolution network. As shown in fig. 1, the invention relates to a face sketch generation model in a natural scene based on a dynamic convolution network, which comprises:

The initialization unit 101 is configured to initialize the network parameters of the model. Specifically, the initialization unit 101 randomly initializes the parameters of all convolutional layers and fully connected layers of the model from a normal distribution with a standard deviation of 0.02.
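The initialization described above can be sketched in a few lines. This is only an illustration: the text specifies the standard deviation of 0.02 but not the mean, so a zero mean is assumed here, and the helper name `init_weights` and the example tensor shape are hypothetical.

```python
import random

def init_weights(shape, std=0.02, seed=None):
    """Return nested lists of the given shape filled with N(0, std^2) samples.

    The zero mean is an assumption; the text only states std = 0.02.
    """
    rng = random.Random(seed)

    def build(dims):
        if len(dims) == 1:
            return [rng.gauss(0.0, std) for _ in range(dims[0])]
        return [build(dims[1:]) for _ in range(dims[0])]

    return build(list(shape))

# Illustrative kernel bank: 8 output channels, 4 input channels, 3x3 kernels.
kernel = init_weights((8, 4, 3, 3), seed=0)
```

In a deep learning framework the same effect would normally be obtained with the framework's own normal initializer applied to every convolutional and fully connected layer.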

The encoder unit 102 is configured to acquire a face image in a natural scene without preprocessing and to extract the hierarchical features of the face image using a fully convolutional neural network, which is composed of consecutive convolutional layers and pooling layers.

In an embodiment of the present invention, the unpreprocessed face images in natural scenes may come from a pre-established training set. The training set may be built by the following process:

A face sketch data set containing face images in natural scenes and their corresponding face sketches is established, without using any preprocessing such as background removal. In the embodiment of the invention, face images and their face sketches are collected from two data sources. The first data source is the FaceScrub data set, which comprises face images of 530 individuals; for each individual, one face image and its corresponding face sketch are randomly selected and added to the face sketch data set. The second data source is 270 face images collected from the internet, for each of which a professional was asked to draw a face sketch. The resulting face sketch data set contains 800 face images and their corresponding face sketches.

From the established face sketch data set, 400 pairs of images are randomly selected as the training set, and the remaining 400 pairs are used as the test set (which can be used to evaluate the present invention and other existing methods). The face images in the data set vary widely in background, illumination, age, expression and so on, and therefore better reflect real conditions.
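A random 400/400 split of this kind might look as follows. The file names are placeholders invented for illustration, and the fixed seed is an assumption added only to make the split reproducible.

```python
import random

def split_pairs(pairs, n_train=400, seed=0):
    """Shuffle (photo, sketch) pairs and split them into train and test lists."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder names standing in for the 800 collected photo/sketch pairs.
pairs = [(f"photo_{i:03d}.png", f"sketch_{i:03d}.png") for i in range(800)]
train, test = split_pairs(pairs)
```

Splitting pairs rather than individual images keeps each photo together with its sketch.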

Therefore, the encoder unit 102 acquires an unpreprocessed face image in a natural scene from the training set and extracts its hierarchical features using the fully convolutional neural network. Specifically, the face image is input into the fully convolutional neural network, which comprises eight convolutional layers in sequence, each followed by a rectified linear unit; the convolution kernel size of each convolutional layer is 2 × 2, and the numbers of output channels are 64, 64, 128, 128, 192, 192, 256 and 256 respectively. Preferably, a pooling layer is inserted after each of the second, fourth and sixth convolutional layers for downsampling, and the stride and pooling size of each pooling layer are both 2 × 2.

The fully convolutional neural network thus comprises 8 convolutional layers, 8 rectified linear units and 3 pooling layers. If the resolution of the input image is H × W, the feature output by the fully convolutional neural network has a resolution of H/8 × W/8, since each of the three pooling layers halves the height and the width.
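The resolution arithmetic of the encoder can be traced with a short sketch. It assumes that the convolutions are padded so that they preserve resolution (the text does not state the padding), and the function name is hypothetical.

```python
def encoder_output_shape(height, width):
    """Trace (channels, height, width) through the eight-layer encoder.

    Assumes each convolution keeps the spatial size (padding is not given
    in the text); only the three 2x2, stride-2 pooling layers shrink it.
    """
    channels_per_layer = [64, 64, 128, 128, 192, 192, 256, 256]
    pool_after = {2, 4, 6}  # 1-based indices of conv layers followed by pooling
    channels = None
    for index, out_channels in enumerate(channels_per_layer, start=1):
        channels = out_channels      # convolution + rectified linear unit
        if index in pool_after:
            height //= 2             # pooling halves each spatial side
            width //= 2
    return channels, height, width

print(encoder_output_shape(256, 256))  # (256, 32, 32)
```

For a 256 × 256 input this gives a 256-channel feature at 32 × 32, that is, one eighth of the input resolution on each side.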

The decoder unit 103 progressively upsamples the hierarchical features generated by the encoder unit 102 using a transposed convolution network, and mines latent face regions and face deformation information using a deformable convolution network. Specifically, for an input feature F_i, the decoder unit 103 dynamically calculates the offset positions using a convolutional layer with 18 output channels and a 3 × 3 convolution kernel; for a pixel p, the offsets are organized over the 3 × 3 sampling grid and denoted O_p. In the embodiment of the present invention, the upsampling ratio of the transposed convolution network is set to 2, and the upsampled features are concatenated, through skip connections, with the features of the same resolution in the fully convolutional neural network to generate new face features. Through the deformable convolution network, the decoder unit 103 can suppress cluttered backgrounds and bias the network's computation toward the face region.

Specifically, the decoder unit 103 further includes:

A transposed convolution network for acquiring the hierarchical features output by the encoder unit and upsampling them; that is, the hierarchical features output by the encoder unit are input into the transposed convolution network, which upsamples them. In a specific embodiment of the present invention, the transposed convolution network upsamples a feature by a ratio of 2.

A deformable convolution network for acquiring the features upsampled by the transposed convolution network, computing an offset O_p for each pixel p in the features with a convolutional layer, and, for each pixel p, convolving the input feature F_i with the filter weight w of the deformable convolution network under the guidance of the offset O_p to obtain the output feature F_o. Specifically, the features upsampled by the transposed convolution network are input into the deformable convolution network, whose convolutional layer computes the offset O_p for each pixel in the features. The deformable convolution network is shown in FIG. 2, where F_i is the input feature and F_o is the output feature. The deformable convolution is divided into two steps. The first step generates the position offsets: the feature F_i is input into a convolutional layer with a 3 × 3 convolution kernel and 18 output channels, which computes the offsets O for all pixels in the feature; the offset at each pixel p, organized over the 3 × 3 grid, is denoted O_p. In the second step, for each pixel p, the input feature F_i is convolved with the filter weight w of the deformable convolution network under the guidance of the offset O_p to obtain the output feature F_o. The 3 × 3 grid G is G = {(-1, -1), (-1, 0), ..., (0, 1), (1, 1)}.

The output feature map F_o at pixel p is expressed as follows:
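The equation itself is not reproduced in this text. Under the definitions above (grid G, offset O_p, filter weight w, input feature F_i), the usual deformable convolution form would read as below. The notation O_p(g) for the offset at grid position g is an assumption introduced here, and sampling at fractional positions is typically done by bilinear interpolation.

```latex
F_o(p) = \sum_{g \in G} w(g) \cdot F_i\bigl(p + g + O_p(g)\bigr)
```

With all offsets equal to zero this reduces to an ordinary 3 × 3 convolution.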

A splicing module for concatenating, through skip connections, the features upsampled by the transposed convolution network with the features of the same resolution in the fully convolutional neural network of the encoder unit.

As shown in fig. 3, the FCE on the left is the encoder unit 102, in full the fully convolutional encoder, consisting of 4 standard convolutions (SC); the four convolution types are labeled in fig. 3. The DCD in the middle is the decoder unit 103, consisting of 3 deformable convolution networks (DC) and 3 transposed convolution networks (TC) with an upsampling ratio of 2 (that is, three stacked layers, each comprising one transposed convolution network followed by one deformable convolution network). After each transposed convolution network, the features of the same resolution from the fully convolutional encoder are concatenated through a skip connection, and a standard convolution network (SC) may be added after the third concatenated feature (that is, after the last transposed convolution network TC) to reduce the number of channels from 128 to 64, thereby reducing the computational cost of the model.
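How three upsampling stages with a ratio of 2 restore the input resolution can be checked with the standard output-size formula for a transposed convolution. The kernel size, stride and padding below are assumptions chosen so that each stage exactly doubles the size; the text only states the ratio of 2.

```python
def transposed_conv_size(size, kernel=2, stride=2, padding=0):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel

def decoder_resolutions(height, width, stages=3):
    """List the feature resolution after each decoder upsampling stage."""
    sizes = []
    for _ in range(stages):
        height = transposed_conv_size(height)
        width = transposed_conv_size(width)
        # The deformable convolution and the skip concatenation keep this size.
        sizes.append((height, width))
    return sizes

print(decoder_resolutions(32, 32))  # [(64, 64), (128, 128), (256, 256)]
```

Starting from the H/8 × W/8 encoder output, the three stages therefore return to H × W, which is what allows each stage to be concatenated with the encoder feature of matching resolution.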

The face sketch generation unit 104 divides the features generated by the deformable convolution network into multi-scale regions, dynamically calculates the adaptive filter weights in each region, convolves the filter weights with the features of the corresponding regions to obtain new features, and combines the features of all regions at the three scales to generate a high-quality face sketch; the generated face sketch is denoted Y_F.

Specifically, the face sketch generation unit 104 further includes:

A dividing module for equally dividing the final output feature of the deformable convolution network in the decoder unit 103 into n × n regions, each with a resolution of (H/n) × (W/n), the i-th region being denoted R_i. Since the components of a human face differ in scale, dividing the regions at a single scale cannot cover every face component well. Therefore n is set to 3, 4 and 5, giving three sets of regions at different scales and improving the model's ability to extract features of face components at different scales.
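The multi-scale division can be sketched as a computation of region boundaries. The helper below is hypothetical; it uses integer boundaries so that it also works when H or W is not an exact multiple of n, a case the text does not discuss.

```python
def region_bounds(height, width, n):
    """Return (top, bottom, left, right) for each of the n x n regions, row-major."""
    rows = [i * height // n for i in range(n + 1)]
    cols = [j * width // n for j in range(n + 1)]
    return [
        (rows[i], rows[i + 1], cols[j], cols[j + 1])
        for i in range(n)
        for j in range(n)
    ]

# Three scales as in the embodiment: 9 + 16 + 25 = 50 regions in total.
regions = {n: region_bounds(240, 240, n) for n in (3, 4, 5)}
```

For a 240 × 240 feature the regions are 80 × 80, 60 × 60 and 48 × 48 at the three scales.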

A mapping module for mapping R_i at each scale to a fixed-size low-dimensional feature using a 32 × 32 spatial pooling layer, which reduces the number of model parameters and the computational complexity.
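One way to map regions of different sizes to a fixed size is adaptive pooling, sketched below for a single channel. Two points are assumptions rather than statements of the text: that 32 × 32 is the size of the pooled output, and that the pooling takes an average.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive b."""
    return -(-a // b)

def adaptive_avg_pool(region, out_h=32, out_w=32):
    """Average-pool a 2-D list to a fixed out_h x out_w grid for any input size."""
    in_h, in_w = len(region), len(region[0])
    pooled = []
    for i in range(out_h):
        top, bottom = i * in_h // out_h, ceil_div((i + 1) * in_h, out_h)
        row = []
        for j in range(out_w):
            left, right = j * in_w // out_w, ceil_div((j + 1) * in_w, out_w)
            cells = [region[y][x] for y in range(top, bottom) for x in range(left, right)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled
```

Because the output size does not depend on the input size, regions from the n = 3, 4 and 5 divisions all yield inputs of the same length for the fully connected layers.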

A weight calculation module for inputting the pooled features into three consecutive fully connected networks with dimensions of 256, 512 and 18432 respectively; the output of the last fully connected network is reshaped to 64 × 3 × 3 × 32 and recorded as the weight w_i of the adaptive convolution network (AC). Note that the output of the fully connected networks is computed independently for each region at each scale, so the weight w_i of the adaptive convolution network varies with the feature content of region R_i; this is why the weight calculation is said to be adaptive. That is, the present invention uses the output of the fully connected networks as the weights instead of fixed weights, so that different inputs receive different weights.
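The reshaping step works because 64 × 3 × 3 × 32 = 18432. The sketch below reshapes a flat vector of that length into nested lists; the row-major axis order (input channel, kernel row, kernel column, output channel) is an assumption, since the text gives only the dimensions.

```python
def reshape_filter(flat, shape=(64, 3, 3, 32)):
    """Reshape a flat list into nested lists of the given shape (row-major order)."""
    expected = 1
    for dim in shape:
        expected *= dim
    if len(flat) != expected:
        raise ValueError(f"expected {expected} values, got {len(flat)}")

    def build(values, dims):
        if len(dims) == 1:
            return list(values)
        step = len(values) // dims[0]
        return [build(values[k * step:(k + 1) * step], dims[1:]) for k in range(dims[0])]

    return build(flat, list(shape))

# Stand-in for the 18432-dimensional output of the last fully connected layer.
w_i = reshape_filter([float(k) for k in range(18432)])
```

Reading the shape as 64 input channels, a 3 × 3 kernel and 32 output channels agrees with the 64-channel decoder output and the 32 output channels of the adaptive convolution.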

A convolution module for convolving the weight w_i obtained from the three fully connected layers with the features of region R_i, with 32 output channels, to obtain new region-specific features. The adaptive convolution network described above is shown in fig. 4.
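Applying a dynamically generated filter to a region is an ordinary convolution whose weights happen to come from the fully connected layers. The following plain-Python sketch assumes zero padding and a stride of 1, neither of which is stated in the text, and uses the index order weight[c][ky][kx][o].

```python
def adaptive_conv(region, weight):
    """Convolve region[c][y][x] with weight[c][ky][kx][o] (3x3, zero padding, stride 1)."""
    in_ch, height, width = len(region), len(region[0]), len(region[0][0])
    out_ch = len(weight[0][0][0])
    output = [[[0.0] * width for _ in range(height)] for _ in range(out_ch)]
    for o in range(out_ch):
        for y in range(height):
            for x in range(width):
                acc = 0.0
                for c in range(in_ch):
                    for ky in range(3):
                        for kx in range(3):
                            yy, xx = y + ky - 1, x + kx - 1
                            if 0 <= yy < height and 0 <= xx < width:
                                acc += region[c][yy][xx] * weight[c][ky][kx][o]
                output[o][y][x] = acc
    return output
```

In the embodiment the region would have 64 channels and the weight 32 output channels; each region receives its own weight, so two regions with different content are filtered differently.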

A feature combination module for combining the features of all regions and generating a high-quality face sketch. Specifically, the region-specific features of all regions at each scale are reassembled into a feature with a resolution of H × W, the features of the three scales are concatenated, and the concatenated features are input into a standard convolution network with a 1 × 1 convolution kernel to generate the final face sketch.
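The final fusion can be sketched as a channel-wise concatenation followed by a 1 × 1 convolution, which is a per-pixel weighted sum over channels. A single output channel (a grayscale sketch) and the optional bias are assumptions; the function name is hypothetical.

```python
def fuse_scales(scale_features, weights, bias=0.0):
    """Concatenate per-scale maps along channels, then apply a 1x1 convolution.

    scale_features: list of feature maps indexed [channel][y][x], all H x W.
    weights: one coefficient per concatenated channel (single output channel).
    """
    stacked = [channel for features in scale_features for channel in features]
    if len(stacked) != len(weights):
        raise ValueError("need one weight per concatenated channel")
    height, width = len(stacked[0]), len(stacked[0][0])
    return [
        [bias + sum(w * ch[y][x] for w, ch in zip(weights, stacked)) for x in range(width)]
        for y in range(height)
    ]
```

With three scales of 32 channels each, the concatenated feature would have 96 channels, and the 1 × 1 convolution would reduce it to the sketch image.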

The updating unit 105 is used for comparing the generated face sketch with the real face sketch and updating the model parameters by optimizing an objective function. Specifically, the objective function consists of an adversarial loss function and a Euclidean loss function, and the objective function used to optimize the model is as follows:

where Y is the real face sketch from the training set, Y_F is the generated face sketch, and F is the generation network. D is a discriminator that distinguishes the generated face sketches from the real face sketches one by one. In the above objective function, the left term is the adversarial loss function and the right term is the Euclidean loss function. During training, all model parameters are optimized using the Adam optimization algorithm. The learning rate is set to 2e-4 and the batch size is set to 1.
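The objective function is not reproduced in this text. One form consistent with the description (an adversarial term on the left, a Euclidean term on the right) is sketched below. The exact expression used by the invention may differ, and the balancing weight \lambda is an assumption introduced here.

```latex
\min_{F} \max_{D} \;
\mathbb{E}\bigl[\log D(Y)\bigr] + \mathbb{E}\bigl[\log\bigl(1 - D(Y_F)\bigr)\bigr]
\; + \; \lambda \, \lVert Y - Y_F \rVert_2^2
```

In this reading the discriminator D is trained to tell Y from Y_F, while the generation network F is trained both to fool D and to stay close to the real sketch in Euclidean distance.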

The iterative training unit 106 is configured to repeat the training processes of the encoder unit 102, the decoder unit 103, the face sketch generation unit 104 and the updating unit 105 over multiple iterations until training converges or the maximum number of iterations is reached, to obtain the final model.
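The stopping rule (converge or reach the maximum number of iterations) can be sketched as a generic loop. The convergence test used here, no improvement larger than a tolerance for several consecutive iterations, is an assumption; the text does not define convergence. `step_fn` is a hypothetical callable standing in for one pass of encoding, decoding, sketch generation and parameter update.

```python
def train(step_fn, max_iterations=1000, tolerance=1e-4, patience=5):
    """Run step_fn until the loss stops improving or max_iterations is reached."""
    best = float("inf")
    stale = 0
    history = []
    for iteration in range(1, max_iterations + 1):
        loss = step_fn(iteration)      # one training pass, returns the loss
        history.append(loss)
        if best - loss > tolerance:
            best, stale = loss, 0      # meaningful improvement
        else:
            stale += 1
            if stale >= patience:
                break                  # treated as converged
    return history
```

The returned loss history makes it easy to see which of the two stopping conditions ended training.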

FIG. 5 is a flowchart of steps of a method for generating a face sketch in a natural scene based on a dynamic convolutional network according to the present invention. As shown in fig. 5, the method for generating a face sketch in a natural scene based on a dynamic convolution network of the present invention includes the following steps:

Step S1, initializing the network parameters of all convolutional networks and fully connected networks used. Specifically, the parameters of all convolutional layers and fully connected layers are randomly initialized from a normal distribution with a standard deviation of 0.02.

Step S2, acquiring a face image in a natural scene without preprocessing, and extracting the hierarchical features of the face image using a fully convolutional neural network. The fully convolutional neural network consists of consecutive convolutional layers and pooling layers.

From the established face sketch data set, 400 pairs of images are randomly selected as the training set, and the remaining 400 pairs are used as the test set. The face images in the data set vary widely in background, illumination, age, expression and so on, and therefore better reflect real conditions.

Therefore, in step S2, an unpreprocessed face image in a natural scene is obtained from the training set, and its hierarchical features are extracted using the fully convolutional neural network. Specifically, the face image is input into the fully convolutional neural network, which comprises eight convolutional layers in sequence, each followed by a rectified linear unit; the convolution kernel size of each convolutional layer is 2 × 2, and the numbers of output channels are 64, 64, 128, 128, 192, 192, 256 and 256 respectively. Preferably, a pooling layer is inserted after each of the second, fourth and sixth convolutional layers for downsampling, and the stride and pooling size of each pooling layer are both 2 × 2.

Specifically, step S2 further includes:

Step S200, directly inputting the face image into two convolutional layers with 2 × 2 convolution kernels and 64 and 64 output channels respectively.

Step S201, inserting a rectified linear unit after each convolutional layer of step S200, and, after the second convolutional layer, downsampling its output features using a pooling layer whose stride and size are both 2 × 2.

Step S202, repeating steps S200-S201 on the output of step S201, with the numbers of output channels of the two convolutional layers changed to 128 and 128, 192 and 192, and 256 and 256 in turn, and finally outputting the hierarchical features of the face image. Specifically, the output of step S201 is input into the third and fourth convolutional layers, which have 2 × 2 convolution kernels and 128 and 128 output channels respectively; a rectified linear unit is inserted after each convolutional layer, and after the fourth convolutional layer the output features are downsampled by a pooling layer whose stride and size are both 2 × 2. The output of that pooling layer is input into the fifth and sixth convolutional layers, which have 2 × 2 convolution kernels and 192 and 192 output channels respectively; a rectified linear unit is inserted after each convolutional layer, and after the sixth convolutional layer the output features are downsampled by a pooling layer whose stride and size are both 2 × 2. The output of that pooling layer is input into the seventh and eighth convolutional layers, which have 2 × 2 convolution kernels and 256 and 256 output channels respectively, with a rectified linear unit inserted after each convolutional layer.

After the convolutional layers with different channel numbers are stacked in step S202, the network contains 8 convolutional layers, 8 rectified linear units and 3 pooling layers. Assuming that the resolution of the input image in S100 is H × W, the output feature size after steps S200-S202 is H/8 × W/8.
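The spatial size after the encoder can be checked with a short calculation. The sketch below assumes that every convolution is padded so that only the three pooling layers change the resolution (the text does not state the padding), and that odd sizes are rounded down; the function name `encoder_output_shape` is hypothetical.

```python
def encoder_output_shape(height, width, num_pools=3, pool_stride=2, out_channels=256):
    """Return (channels, height, width) after the encoder's pooling stages.

    Assumes size-preserving convolutions, so only the stride-2 pooling
    layers reduce the resolution (an assumption, not stated in the text).
    """
    for _ in range(num_pools):
        height //= pool_stride
        width //= pool_stride
    return out_channels, height, width


# A 256 x 256 input is reduced by a factor of 2**3 = 8 along each axis.
print(encoder_output_shape(256, 256))
```

Three upsampling stages of scale 2 in the decoder would then restore the original H × W resolution.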

Step S3, the hierarchical features obtained in step S2 are upsampled by a transposed convolution network, and the potential face region and facial deformation information are mined by a deformable convolution network. In step S3, the deformable convolution network suppresses the cluttered background and biases the network's computation toward the face region.

Specifically, step S3 further includes:

Step S300, the features output in step S2 are input into a transposed convolution network, which upsamples them. In a specific embodiment of the present invention, the transposed convolution network upsamples the features by a factor of 2.

Step S301, the features upsampled in step S300 are input into a deformable convolutional neural network, whose convolution layer computes an offset O_p for each pixel point in the features. The deformable convolution is divided into two steps. The first step (i.e., step S301) generates the position offsets: the feature F_i is input into a convolution layer with a convolution kernel size of 3 × 3 and 18 output channels, the convolution layer computes over all pixel points in the feature to obtain the offset O, and the offset of each pixel point p is recorded as O_p (18 = 2 × 3 × 3 values, i.e. a two-dimensional displacement for each position of the 3 × 3 kernel).
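The 18 output channels correspond to 2 coordinates for each of the 9 positions of a 3 × 3 kernel. The sketch below, offered only as an illustration, unpacks one pixel's offset vector into nine (dy, dx) pairs; the function name `split_offsets` and the interleaved channel layout are assumptions, since the text does not specify an ordering.

```python
def split_offsets(offset_vector, kernel_size=3):
    """Unpack a flat offset vector into one (dy, dx) pair per kernel position.

    Assumes an interleaved layout [dy0, dx0, dy1, dx1, ...]; this ordering
    is a hypothetical choice, not something the source specifies.
    """
    positions = kernel_size * kernel_size
    if len(offset_vector) != 2 * positions:
        raise ValueError("expected 2 * kernel_size**2 offset values")
    return [(offset_vector[2 * k], offset_vector[2 * k + 1]) for k in range(positions)]


# One pixel's 18-channel offset O_p, filled with placeholder values.
o_p = [0.1 * k for k in range(18)]
pairs = split_offsets(o_p)
print(len(pairs))
```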

Step S302, for each pixel point p, under the guidance of the offset O_p, the input feature F_i is convolved with the filter weight w of the deformable convolution network to obtain the output feature F_o. The 3 × 3 sampling grid G is G = {(-1, -1), (-1, 0), ..., (0, 1), (1, 1)}.

For pixel point p, the output feature map F_o is expressed as follows:
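The expression itself did not survive in this text. Under the standard deformable convolution formulation, and using the symbols defined above, it would take the following form, where O_p(g) denotes the two-dimensional offset that O_p assigns to grid position g (this indexing notation is an assumption, not taken from the original):

```latex
F_o(p) = \sum_{g \in G} w(g) \cdot F_i\bigl(p + g + O_p(g)\bigr)
```

Because the sampling location p + g + O_p(g) is generally fractional, F_i is evaluated there by bilinear interpolation in the standard formulation.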

Step S303, after each transposed convolution layer, the features of the same resolution from the full convolutional neural network of step S2 are concatenated through a skip connection, and a standard convolution layer is added after the concatenated features to reduce the number of channels from 128 to 64, which lowers the computational cost of the model.

Step S4, the features output by the deformable convolution network are divided into multi-scale regions, adaptive filter weights are dynamically computed in each region, the filter weights are convolved with the features to obtain new features, and the features of all regions at the three scales are combined to generate a high-quality face sketch.

Specifically, step S4 further includes:

Step S400, the final output feature of the deformable convolution network in step S3 is equally divided into n × n regions, the resolution of each region being H/n × W/n, and the i-th region is denoted R_i. Because the components of a human face differ in scale, dividing the regions at a single scale cannot cover every facial component well. Therefore, n is set to 3, 4 and 5 respectively to obtain region partitions at three different scales, which improves the model's ability to extract features of facial components at different scales.
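The multi-scale partition can be illustrated on a single-channel feature map in plain Python. This is a minimal sketch that assumes the feature size is divisible by n (the text does not say how other sizes are handled); the function name `split_into_regions` and the row-major region order are hypothetical.

```python
def split_into_regions(feature, n):
    """Split a 2-D feature map (list of rows) into n x n equal regions.

    Regions are returned in row-major order. Requires both dimensions to be
    divisible by n, which is an assumption made for this sketch.
    """
    height, width = len(feature), len(feature[0])
    if height % n or width % n:
        raise ValueError("feature size must be divisible by n")
    region_h, region_w = height // n, width // n
    regions = []
    for i in range(n):
        for j in range(n):
            rows = feature[i * region_h:(i + 1) * region_h]
            regions.append([row[j * region_w:(j + 1) * region_w] for row in rows])
    return regions


# A 60 x 60 map is divisible by 3, 4 and 5, the three scales used in the text.
feature = [[r * 60 + c for c in range(60)] for r in range(60)]
for n in (3, 4, 5):
    regions = split_into_regions(feature, n)
    print(n, len(regions), len(regions[0]), len(regions[0][0]))
```

With a 60 × 60 input, the three scales yield 9, 16 and 25 regions of 20 × 20, 15 × 15 and 12 × 12 pixels respectively.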

Step S401, a spatial pooling layer of size 32 × 32 maps the regions R_i at different scales to fixed-size low-dimensional features, which reduces the number of model parameters and the computational complexity.

Step S402, the pooled features are input into three consecutive fully-connected networks with dimensions of 256, 512 and 18432 respectively. The output of the last fully-connected network is re-organized to 64 × 3 × 3 × 32 and recorded as the weight w_i of the adaptive convolutional network. Note that the output of the fully-connected network is computed independently for each region at each scale, so the weight w_i of the adaptive convolutional network varies with the feature content of region R_i; the weight calculation process is therefore said to be adaptive.
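The re-organization is consistent because 64 × 3 × 3 × 32 = 18432, the shape given in claim 8. A minimal sketch of the reshape in plain Python follows; reading the four axes as (input channels, kernel height, kernel width, output channels) is an inference from the 64-channel features of step S303 and the 32 output channels of step S403, and the helper name `reshape_weights` is hypothetical.

```python
import math

# Assumed axis order: (input channels, kernel height, kernel width, output channels).
WEIGHT_SHAPE = (64, 3, 3, 32)


def reshape_weights(flat, shape=WEIGHT_SHAPE):
    """Re-organize a flat weight vector into nested lists of the given shape."""
    if len(flat) != math.prod(shape):
        raise ValueError("flat vector length does not match the target shape")

    def build(values, dims):
        if len(dims) == 1:
            return list(values)
        step = len(values) // dims[0]
        return [build(values[k * step:(k + 1) * step], dims[1:]) for k in range(dims[0])]

    return build(flat, shape)


# Placeholder for the 18432-dimensional output of the last fully-connected layer.
flat = list(range(18432))
w_i = reshape_weights(flat)
print(math.prod(WEIGHT_SHAPE), len(w_i), len(w_i[0]), len(w_i[0][0]), len(w_i[0][0][0]))
```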

Step S403, the weight w_i computed by the three-layer fully-connected network is convolved with the features of region R_i, with 32 output channels, to obtain new specialized features. The adaptive convolution network of steps S402-S403 is shown in Fig. 4.
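The region-wise convolution can be sketched in plain Python on toy sizes. The sketch assumes stride 1, zero padding that preserves the region size, the usual deep-learning convention of cross-correlation, and a weight layout of (input channels, kernel height, kernel width, output channels); none of these details is stated in the text, and the function name `adaptive_conv` is hypothetical.

```python
def adaptive_conv(region, weight):
    """Convolve region[c_in][y][x] with weight[c_in][ky][kx][c_out].

    Uses stride 1 and zero padding, so the output out[c_out][y][x] has the
    same spatial size as the input region (assumptions made for this sketch).
    """
    c_in, height, width = len(region), len(region[0]), len(region[0][0])
    k = len(weight[0])               # kernel size (3 in the text)
    c_out = len(weight[0][0][0])     # number of output channels (32 in the text)
    pad = k // 2
    out = [[[0.0] * width for _ in range(height)] for _ in range(c_out)]
    for o in range(c_out):
        for y in range(height):
            for x in range(width):
                total = 0.0
                for c in range(c_in):
                    for ky in range(k):
                        for kx in range(k):
                            yy, xx = y + ky - pad, x + kx - pad
                            if 0 <= yy < height and 0 <= xx < width:
                                total += weight[c][ky][kx][o] * region[c][yy][xx]
                out[o][y][x] = total
    return out


# Toy example: 2 input channels, 1 output channel, a 3 x 3 region of ones.
region = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
weight = [[[[1.0] for _ in range(3)] for _ in range(3)] for _ in range(2)]
out = adaptive_conv(region, weight)
print(out[0][1][1], out[0][0][0])
```

In the model, each region R_i would use its own weight w_i, so the same routine would be called once per region with different weights.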

Step S404, the features of all regions are combined to generate a high-quality face sketch. Specifically, the specialized features of all regions at each scale are reassembled into a feature with a resolution of H × W, the features at the three scales are concatenated, and the concatenated features are input into a standard convolution network with a convolution kernel size of 1 × 1 to generate the final face sketch.

Step S5, the parameters of the model are updated according to the comparison between the generated face sketch and the real face sketch; specifically, the model is optimized with the following objective function:

where Y is the real face sketch, Y_F is the generated face sketch, and F is the generation network. D is a discriminator that judges, one by one, whether a face sketch is generated or real. In the above objective function, the left term is the adversarial loss function and the right term is the Euclidean loss function. During training, all model parameters are optimized with the Adam optimization algorithm. The learning rate is set to 2e-4 and the batch size is set to 1.
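The objective function itself did not survive in this text, so the following is only a sketch of how a two-term objective of this kind is commonly evaluated. It assumes the standard GAN value function for the left term and a squared L2 distance for the right term; the relative weight `euclidean_weight` and all function names are hypothetical and do not come from the source.

```python
import math


def adversarial_term(d_real, d_fake):
    """Standard GAN value: log D(Y) + log(1 - D(Y_F)) (assumed form).

    d_real and d_fake are discriminator probabilities in the open interval (0, 1).
    """
    return math.log(d_real) + math.log(1.0 - d_fake)


def euclidean_term(real, generated):
    """Squared L2 distance between flattened real and generated sketches."""
    return sum((r - g) ** 2 for r, g in zip(real, generated))


def objective(d_real, d_fake, real, generated, euclidean_weight=1.0):
    """Left (adversarial) term plus weighted right (Euclidean) term."""
    return adversarial_term(d_real, d_fake) + euclidean_weight * euclidean_term(real, generated)


# Toy values: an undecided discriminator and a two-pixel sketch.
print(objective(0.5, 0.5, [1.0, 0.0], [0.0, 0.0]))
```

In such a setup the discriminator D is trained to increase the adversarial term while the generation network F is trained to decrease both terms.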

Step S6, the training of steps S2 to S5 is repeated over multiple iterations until the training process converges or the maximum number of iterations is reached, yielding the final model.

In summary, the invention provides a face sketch generation model and method under natural scenes based on a dynamic convolution network. A face sketch data set under natural scenes is selected and the weights of the target model are initialized. Without any preprocessing such as background removal, a face image under a natural scene is input into a full convolution network composed of successive convolutional and pooling layers to extract hierarchical features. The hierarchical features are input into a transposed convolution network and a deformable convolution network, which upsample them and compute features containing information about the potential face region. The features are divided into multi-scale regions, adaptive filter weights are dynamically computed in each region, the filter weights are convolved with the features to obtain new features, and the region features at all three scales are combined to generate a high-quality face sketch. The parameters of the model are updated according to the comparison between the generated face sketch and the real face sketch. During optimization, the method dynamically and adaptively computes the features of face components in regions of different scales, so it can generate high-quality sketches for faces in an unrestricted natural environment without any preprocessing, and its face sketch generation results under both restricted and unrestricted conditions exceed those of all existing methods.

The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims

1. A face sketch generation model under a natural scene based on a dynamic convolution network comprises the following steps:

2. The dynamic convolutional network-based face sketch generation model under natural scenes as claimed in claim 1, characterized in that: the face image under the natural scene without being preprocessed comes from a pre-established training set, and the training set is established through the following processes:

3. The dynamic convolutional network-based face sketch generation model under natural scenes as claimed in claim 1, characterized in that: the full convolutional neural network is sequentially provided with eight convolutional layers, each followed by a rectified linear unit; the convolution kernel size of each convolutional layer is 2 × 2, and the numbers of output channels of the convolutional layers are 64, 64, 128, 128, 192, 192, 256 and 256 in sequence.

4. The dynamic convolutional network-based face sketch generation model under natural scenes as claimed in claim 1, characterized in that: in the full convolutional neural network, a pooling layer is inserted after each of the second, fourth and sixth convolutional layers for downsampling, and the stride and pooling size of each pooling layer are both 2 × 2.

5. The dynamic convolutional network-based face sketch generation model in natural scene as claimed in claim 1, wherein said decoder unit further comprises:

The deformable convolution neural network is used for acquiring the features upsampled by the transposed convolution network, computing each pixel point in the features with its convolution layer to obtain the offset O_p, and, for each pixel point p, under the guidance of the offset O_p, convolving the input feature F_i with the filter weight w of the deformable convolution neural network to obtain the output feature F_o;

6. The model of claim 5, wherein the deformable convolutional neural network operates in two steps: the first step generates a position offset, in which the feature F_i is input into a convolution layer with a convolution kernel size of 3 × 3 and 18 output channels, the convolution layer computes over all pixel points in the feature to obtain the offset O, and the offset of each pixel point p is recorded as O_p; in the second step, for each pixel point p, under the guidance of the offset O_p, the input feature F_i is convolved with the filter weight w of the deformable convolution network to obtain the output feature F_o.

7. The dynamic convolutional network-based face sketch generation model under natural scenes as claimed in claim 5, characterized in that: the decoder unit is formed by stacking 3 layers of networks, each layer comprising a transposed convolution network and a deformable convolution network; after each transposed convolution layer, the splicing module concatenates the features of the same resolution from the full convolution encoder through a skip connection, and a standard convolution layer may be added after the third concatenated feature to reduce the number of channels.

8. The dynamic convolutional network-based face sketch generation model under natural scenes as claimed in claim 1, wherein said face sketch generating unit further comprises:

A weight calculation module for inputting the pooled features into three consecutive fully-connected networks with dimensions of 256, 512 and 18432 respectively, the output of the last fully-connected network being reorganized to 64 × 3 × 3 × 32 and recorded as the weight w_i of the adaptive convolutional network;

A convolution module for convolving the weight w_i computed by the three-layer fully-connected network with the features of region R_i to obtain new specialized features;

9. The dynamic convolutional network-based face sketch generation model under natural scenes as claimed in claim 8, characterized in that: the feature combination module reassembles the specialized features of all regions at each scale into features with a resolution of H × W, concatenates the features at all scales, and inputs the concatenated features into a standard convolution network with a convolution kernel size of 1 × 1 to generate the final face sketch.

10. A face sketch generation method under a natural scene based on a dynamic convolution network comprises the following steps:

step S6, repeating the training of steps S2 to S5 over multiple iterations until the training process converges or the maximum number of iterations is reached, to obtain the final model.