CN108229381B - Face image generation method and device, storage medium and computer equipment


Info

Publication number
CN108229381B
CN108229381B (application CN201711480435.0A)
Authority
CN
China
Prior art keywords
network
target detection
face image
layer
super
Prior art date
Legal status
Active
Application number
CN201711480435.0A
Other languages
Chinese (zh)
Other versions
CN108229381A (en)
Inventor
李岸
肖安东
Current Assignee
Hunan Vision Miracle Intelligent Technology Co ltd
Original Assignee
Hunan Vision Miracle Intelligent Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hunan Vision Miracle Intelligent Technology Co ltd filed Critical Hunan Vision Miracle Intelligent Technology Co ltd
Priority to CN201711480435.0A priority Critical patent/CN108229381B/en
Publication of CN108229381A publication Critical patent/CN108229381A/en
Application granted granted Critical
Publication of CN108229381B publication Critical patent/CN108229381B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/0007Image acquisition
    • G06T5/73

Abstract

The invention relates to a face image generation method, apparatus, storage medium, and computer device. The method comprises: constructing a target detection network based on a residual network and Faster R-CNN (a faster region-based network with convolutional neural network features); constructing a generative adversarial network for reconstructing facial features; cascading the target detection network and the generative adversarial network to obtain a super-resolution network; and inputting a face image to be detected into the super-resolution network to obtain a super-resolution face image. The target detection network in the super-resolution network extracts and identifies facial features from the face image to be detected, and the generative adversarial network reconstructs those features to produce the super-resolution face image.

Description

Face image generation method and device, storage medium and computer equipment
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to a method and an apparatus for generating a face image, a storage medium, and a computer device.
Background
In the fields of computer vision, image processing, and pattern recognition, face images are a research hotspot, and face detection and recognition are important components of biometric technology. In video surveillance, defects in camera hardware and the influence of the shooting environment can blur the captured face image to varying degrees, reducing image quality.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a face image generation method, apparatus, storage medium, and computer device that can improve image quality.
A face image generation method comprises the following steps:
constructing a target detection network based on a residual network and Faster R-CNN (a faster region-based network with convolutional neural network features), wherein the target detection network is used for extracting and identifying facial features from a face image to be detected;
constructing a generative adversarial network, wherein the generative adversarial network is used for reconstructing the facial features;
cascading the target detection network and the generative adversarial network to obtain a super-resolution network;
and inputting the face image to be detected into the super-resolution network to obtain a super-resolution face image.
A face image generation apparatus comprising:
a target detection network construction module, used for constructing a target detection network based on a residual network and Faster R-CNN, wherein the target detection network is used for extracting and identifying facial features from a face image to be detected;
an adversarial network construction module, used for constructing a generative adversarial network, which is used for reconstructing the facial features;
a super-resolution network module, used for cascading the target detection network and the generative adversarial network to obtain a super-resolution network;
and a face image output module, used for inputting the face image to be detected into the super-resolution network to obtain a super-resolution face image.
A storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the above-mentioned method.
A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the steps of the above-mentioned method are performed when the program is executed by the processor.
The face image generation method, apparatus, storage medium, and computer device comprise: constructing a target detection network based on a residual network and Faster R-CNN; constructing a generative adversarial network, which is used for reconstructing facial features; cascading the target detection network and the generative adversarial network to obtain a super-resolution network; and inputting a face image to be detected into the super-resolution network to obtain a super-resolution face image. The target detection network in the super-resolution network extracts and identifies facial features from the face image to be detected, and the generative adversarial network reconstructs those features to obtain the super-resolution face image.
Drawings
FIG. 1 is a schematic flow chart of a face image generation method in one embodiment;
FIG. 2 is a diagram illustrating a residual network in the face image generation method according to an embodiment;
FIG. 3 is a flow diagram illustrating one step of a face image generation method according to an embodiment;
FIG. 4 is a schematic structural diagram of a face image generation apparatus according to an embodiment;
FIG. 5 is a diagram illustrating a target detection network in the face image generation method according to an embodiment;
FIG. 6 is a schematic diagram of a generative adversarial network in the face image generation method in one embodiment.
Detailed Description
As shown in fig. 1, a face image generation method includes:
s100, constructing a target detection network based on the residual error network and the fast regional network with the convolutional neural network characteristics, wherein the target detection network is used for extracting and identifying the human face characteristics from the human face image to be detected.
A Residual Network (ResNet) provides a reference for the input of each layer and learns a residual function, which is easier to optimize and allows the number of network layers to be greatly increased. In computer vision, image features become higher-level as the network grows deeper, and depth is an important factor in achieving good image-processing results. However, gradient vanishing or explosion becomes an obstacle to training deep networks and can prevent convergence. With normalized initialization and per-layer input normalization, networks tens of layers deep can converge, but they then begin to degrade: adding more layers leads to larger errors. By stacking identity-mapping layers (y = x) on top of a shallow network, the network can grow in depth without degradation.
ResNet learns the residual function F(x) = H(x) − x without introducing extra parameters or computational complexity, and the residual function typically exhibits smaller response fluctuations. In practice, to limit computational cost, the residual block is optimized by replacing two 3 × 3 convolutional layers with a 1 × 1 + 3 × 3 + 1 × 1 bottleneck, as shown in FIG. 2: a 1 × 1 convolutional layer first reduces the dimension, the middle 3 × 3 convolutional layer then operates at this reduced dimension, and another 1 × 1 convolutional layer restores it, which both maintains accuracy and reduces the amount of computation.
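To make the computational saving concrete, the following sketch compares the parameter counts of a plain block (two 3 × 3 convolutions) with the 1 × 1 + 3 × 3 + 1 × 1 bottleneck; the channel widths of 256 and 64 are illustrative values typical of ResNet, not figures taken from this description.

```python
# Parameter-count comparison between a plain residual block (two 3x3
# convolutions) and a 1x1 + 3x3 + 1x1 bottleneck block. The channel
# widths (256 outer, 64 inner) are assumed for illustration.
def conv_params(kernel, c_in, c_out):
    """Number of weights in one conv layer, ignoring bias terms."""
    return kernel * kernel * c_in * c_out

plain = 2 * conv_params(3, 256, 256)          # two 3x3 convs at full width
bottleneck = (conv_params(1, 256, 64)         # 1x1 reduces the dimension
              + conv_params(3, 64, 64)        # 3x3 at the reduced width
              + conv_params(1, 64, 256))      # 1x1 restores the dimension

print(plain, bottleneck)  # the bottleneck is roughly 17x cheaper
```

The bottleneck keeps the block's input/output width while doing the expensive 3 × 3 convolution at a quarter of the channel count, which is where the saving comes from.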
Faster R-CNN (Faster Regions with Convolutional Neural Networks) works as follows: a test image is input, and the whole picture is passed through a CNN (Convolutional Neural Network) for feature extraction; an RPN (Region Proposal Network) generates proposal windows, about 300 per picture; the proposal windows are mapped onto the last convolutional feature map of the CNN; an RoI (Region of Interest) pooling layer generates a fixed-size feature map for each proposal; finally, the classification probability and the bounding-box regression are jointly trained using a Softmax loss and a Smooth L1 loss.
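A minimal sketch of two of the steps above: mapping a proposal window onto the backbone feature map, and choosing the fixed-size bins used by RoI pooling. The feature stride of 16 and the 7 × 7 output size are assumed, typical values, not figures given in this description.

```python
# Map a proposal box from image coordinates onto the CNN's final feature
# map (assumed total stride: 16), then compute the integer bin boundaries
# RoI pooling would use to reduce one axis to a fixed number of cells.
def map_box_to_feature(box, stride=16):
    x1, y1, x2, y2 = box
    return (x1 // stride, y1 // stride, x2 // stride, y2 // stride)

def bin_edges(length, bins=7):
    """Integer boundaries that split a window axis into `bins` cells."""
    return [length * i // bins for i in range(bins + 1)]
```

For example, a 288 × 416 proposal at image offset (32, 64) lands on a small window of the feature map, and each axis of that window is then pooled into a fixed 7-cell grid regardless of its size.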
Like an ordinary neural network, a convolutional neural network is composed of neurons arranged in a hierarchical structure, and the weights and biases between neurons are obtained through training. The input data are combined with the weights, the result is passed through the neurons' activations, and an output is produced. In general, the network uses a score function to compute final category scores for image data supplied at the pixel level, and the optimal weights are then obtained by minimizing a loss function.
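The score-function-plus-loss idea above can be sketched as a softmax over class scores followed by a cross-entropy loss; the score values below are illustrative.

```python
import math

# Softmax turns raw class scores into probabilities; cross-entropy is the
# loss minimized during training. Scores here are illustrative numbers.
def softmax(scores):
    m = max(scores)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, true_index):
    """Negative log-probability assigned to the true class."""
    return -math.log(softmax(scores)[true_index])
```

A confident, correct score vector yields a small loss; the same scores paired with the wrong label yield a large one, which is what drives the weight updates.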
S200, constructing a generative adversarial network, wherein the generative adversarial network is used for reconstructing the facial features.
The generative adversarial network (GAN) is inspired by the two-player zero-sum game in game theory; the two players in a GAN are a generative model and a discriminative model. The generative model G captures the distribution of the sample data and uses noise z drawn from some distribution (uniform, Gaussian, etc.) to generate samples resembling the real training data; the closer to real samples, the better. The discriminative model D is a binary classifier that estimates the probability that a sample comes from the training data (rather than from the generator): if the sample comes from real training data, D outputs a large probability; otherwise, D outputs a small probability. During training, one side is fixed while the network weights of the other side are updated, and the two sides alternate; each optimizes its own network as far as possible, forming a competitive adversarial process, until the two reach a dynamic balance (Nash equilibrium). At this point the generative model G has recovered the distribution of the training data (it creates samples indistinguishable from real data), the discriminative model can no longer tell them apart, and its accuracy is 50%, equivalent to random guessing. With G fixed, the optimization of D can be understood as follows: when real data are input, D optimizes its network to output 1; when generated data are input, D optimizes its network to output 0. With D fixed, G optimizes its own network so that its output samples are as close to real data as possible and D assigns them a high probability after judging them.
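At the equilibrium described above, the optimal discriminator outputs the ratio of the data density to the total density; a tiny sketch of that fixed point:

```python
# For a fixed generator, the optimal discriminator's output at a point x
# is p_data(x) / (p_data(x) + p_gen(x)). When the generator matches the
# data distribution, this is exactly 0.5 everywhere - random guessing.
def optimal_discriminator(p_data, p_gen):
    return p_data / (p_data + p_gen)
```

Before equilibrium the discriminator still carries signal (for example, it outputs 0.8 where real data are four times as dense as generated data), and that signal is what the generator's updates chase.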
S300, cascading the target detection network and the generative adversarial network to obtain the super-resolution network.
The trained target detection network is connected with the generative adversarial network to obtain the super-resolution network; after a low-resolution face image is input into the super-resolution network, a high-definition face image and its facial-feature characteristics are obtained.
S400, inputting the face image to be detected into the super-resolution network to obtain a super-resolution face image.
The super-resolution is to improve the resolution of the original image, and the process of obtaining a high-resolution image through a low-resolution image is super-resolution reconstruction. In a number of electronic image applications, high resolution images are often desired. High resolution means that the density of pixels in the image is high, providing more detail that is essential in many practical applications.
The face image generation method comprises: constructing a target detection network based on a residual network and Faster R-CNN; constructing a generative adversarial network, which is used for reconstructing facial features; cascading the target detection network and the generative adversarial network to obtain a super-resolution network; and inputting the face image to be detected into the super-resolution network to obtain a super-resolution face image. The target detection network in the super-resolution network extracts and identifies facial features from the face image to be detected, and the generative adversarial network reconstructs those features to obtain the super-resolution face image.
In one embodiment, as shown in fig. 3, step S100 of constructing the target detection network based on the residual network and Faster R-CNN comprises:
S120, constructing, based on a residual network, a feature extraction layer for extracting facial features from the face image;
S140, connecting the feature extraction layer with a target detection layer in Faster R-CNN to form a sample target detection network, wherein the sample target detection network is used for identifying the facial features in the face image;
and S160, training the sample target detection network to obtain the target detection network.
Specifically, a residual network (ResNet-101) can be built with 5 blocks in total; apart from the first input block, the numbers of convolutional cycles in the other four blocks are 3, 4, 23, and 3 respectively. The network is trained on data from the ImageNet database until the parameters fit; after training, the top-1 accuracy can reach 72.6% and the top-5 accuracy 93.7%. With picture input of 321 × 321 pixels, the last convolutional layer of each cycle in the middle blocks is extracted; the sizes of the four resulting feature maps are 160 × 160 × 256, 80 × 80 × 512, 40 × 40 × 1024, and 20 × 20 × 2048, and the parameters of a convolutional layer can be represented by (s, n), where n is the number of convolution kernels and s is the kernel size. Starting from the bottommost convolutional layer, the number of features is first reduced by one convolutional layer, then the map is enlarged 2× by a transposed convolutional layer and connected with the next-shallower convolutional layer; this finally yields a 160 × 160 × 256 output layer, and this residual network serves as the feature extraction network. The trained 1000-class ImageNet-based ResNet is then adjusted so that its number of classes becomes 6, a data set of faces and 4 other categories of images is fed in, and the parameters are retrained; the accuracy can reach 100% without overfitting. When this first training stage is finished, the required feature extraction network is obtained.
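The top-down merging described above can be checked arithmetically: each transposed convolution doubles the side length of the deeper map so that it matches the next-shallower one. A sketch using the four feature-map side lengths quoted in this embodiment:

```python
# Side lengths of the four extracted feature maps (160, 80, 40, 20) and
# the top-down pass that doubles each level with a transposed convolution
# before merging it with the next-shallower level.
bottom_up = [160, 80, 40, 20]     # spatial sizes from shallow to deep

size = bottom_up[-1]              # start from the deepest (20 x 20) map
merged = []
for shallower in reversed(bottom_up[:-1]):
    size *= 2                     # transposed conv doubles length and width
    assert size == shallower      # now it matches the shallower map
    merged.append(size)

print(merged[-1])
```

The pass ends at a 160-pixel side length, matching the 160 × 160 × 256 output layer stated above.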
After the trained feature extraction network is obtained, it is connected with the subsequent target detection layer in the standard Faster R-CNN manner, and Faster R-CNN is used to detect the facial features and the face; the six categories to be classified are eyes, eyebrows, nose, mouth, face, and other.
In one embodiment, the step of training the sample target detection network to obtain the target detection network comprises: inputting face images carrying feature annotations into the sample target detection network, and outputting the positions and classification categories of the features in the face images, wherein the features comprise the facial features; and correcting the sample target detection network through a loss function to obtain the target detection network, wherein the loss function is the Euclidean distance between the annotated feature positions and the output feature positions. The loss function can be minimized by iterative gradient descent, yielding the minimized loss and the model parameter values. Euclidean distance is a commonly used definition of distance: the true distance between two points in m-dimensional space, or the natural length of a vector (i.e., the distance of the point from the origin).
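A minimal sketch of the Euclidean-distance loss over annotated versus predicted feature positions (the coordinates below are illustrative):

```python
# Mean Euclidean distance between predicted and annotated facial-feature
# positions - the loss function described above. Points are (x, y) pairs.
def euclidean_loss(predicted, annotated):
    total = 0.0
    for (px, py), (ax, ay) in zip(predicted, annotated):
        total += ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
    return total / len(predicted)
```

The loss is zero exactly when every predicted position coincides with its annotation, so driving it down with gradient descent pulls the detections onto the labeled positions.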
In one embodiment, the step of constructing a generative adversarial network in the face image generation method comprises: building a convolutional network and a pixel-shift (sub-pixel) network, and connecting the convolutional network with the pixel-shift network to form the generative adversarial network. In another embodiment, this step comprises: building convolutional layers and transposed convolutional layers, connecting the convolutional layers with the transposed convolutional layers to form a convolutional network, and connecting the transposed convolutional layers in the convolutional network with the pixel-shift network to form the generative adversarial network. Here, the step of building and connecting the convolutional and transposed convolutional layers comprises: building a first, a second, and a third convolutional layer connected in sequence; building a first and a second transposed convolutional layer connected in sequence; and connecting the third convolutional layer with the first transposed convolutional layer to form the convolutional network.
Specifically, the generative adversarial network may adopt 5 convolutional layers plus 1 pixel-shift layer. The first three layers are convolutional layers with a stride of 2, reducing the picture 8×; the next two are transposed convolutional layers with a stride of 2, enlarging the picture 4×. The number of feature maps is 132; the final pixel-shift layer rearranges these 132 feature maps into 3 output channels while enlarging the length and width of the picture 8×, yielding a shallow generative adversarial network whose output length and width are each enlarged 4× overall. The network is then trained until it fits.
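The pixel-shift layer rearranges groups of r² feature maps into the spatial positions of fewer, larger output maps. The following is a plain-Python sketch of that rearrangement with an illustrative upscaling factor r = 2; the layer in this embodiment maps 132 feature maps to 3 output channels, but the mechanism is the same.

```python
# Pixel-shift (sub-pixel) rearrangement: C*r*r feature maps of size H x W
# become C maps of size rH x rW. Maps are nested lists; r = 2 here is an
# illustrative factor, not the one used in the embodiment.
def pixel_shuffle(maps, r):
    crr = len(maps)
    assert crr % (r * r) == 0
    c_out = crr // (r * r)
    h, w = len(maps[0]), len(maps[0][0])
    out = []
    for c in range(c_out):
        plane = [[0] * (w * r) for _ in range(h * r)]
        for dy in range(r):
            for dx in range(r):
                # each source map fills one (dy, dx) phase of the output
                src = maps[c * r * r + dy * r + dx]
                for y in range(h):
                    for x in range(w):
                        plane[y * r + dy][x * r + dx] = src[y][x]
        out.append(plane)
    return out
```

Because it only reorders values, the layer enlarges the picture without interpolation; the preceding convolutions are responsible for producing the extra channels it consumes.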
In one embodiment, the step of cascading the target detection network and the generative adversarial network to obtain the super-resolution network further comprises: reducing a network loss function by gradient descent to correct the super-resolution network, wherein the network loss function is the Euclidean distance between the facial-image features output by the super-resolution network and the facial-image features of a preset sample. The step of inputting the face image to be detected into the super-resolution network to obtain the super-resolution face image then comprises: inputting the face image to be detected into the corrected super-resolution network to obtain the super-resolution face image.
In machine learning algorithms, the loss function can be minimized by iterative gradient descent, yielding the minimized loss function and the model parameter values.
In one embodiment, a storage medium is further provided, on which a computer program is stored, wherein the program is executed by a processor to implement any one of the face image generation methods in the above embodiments. The storage medium may be an optical disc, a read-only memory, a random access memory, or the like.
The storage medium and the stored computer program can obtain super-resolution face images by implementing the processes of the embodiments of the face image generation methods, and the processing can effectively improve the definition of input images, thereby improving the image quality of the input images.
In one embodiment, a computer device is further provided, which includes a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the face image generation method as in any one of the above embodiments. When a processor of the computer device executes a program, the super-resolution face image can be obtained by implementing any one of the face image generation methods in the embodiments, and the processing can effectively improve the definition of the input image, thereby improving the image quality of the input image.
In one embodiment, a face image generation apparatus, as shown in fig. 4, includes:
the target detection network construction module 100 is configured to construct a target detection network based on a residual network and Faster R-CNN, where the target detection network is used to extract and identify facial features from a face image to be detected;
the adversarial network construction module 200 is configured to construct a generative adversarial network, which is used to reconstruct the facial features;
the super-resolution network module 300 is configured to cascade the target detection network and the generative adversarial network to obtain a super-resolution network;
the face image output module 400 is configured to input the face image to be detected into the super-resolution network to obtain a super-resolution face image.
The face image generation apparatus comprises the target detection network construction module 100, the adversarial network construction module 200, the super-resolution network module 300, and the face image output module 400. The target detection network construction module 100 constructs a target detection network based on a residual network and Faster R-CNN; the adversarial network construction module 200 constructs a generative adversarial network used for reconstructing facial features; the super-resolution network module 300 cascades the target detection network and the generative adversarial network to obtain a super-resolution network; and the face image output module 400 inputs the face image to be detected into the super-resolution network to obtain the super-resolution face image. The target detection network in the super-resolution network extracts and identifies the facial features from the face image to be detected, and the generative adversarial network reconstructs them to obtain the super-resolution face image.
In one embodiment, a face image generation method specifically comprises: 1. Building the feature extraction network. A residual network (ResNet-101) is built with 5 blocks in total; apart from the first input block, the numbers of convolutional cycles in the other four blocks are 3, 4, 23, and 3 respectively. The network is trained on data from the ImageNet database until the parameters fit; after training, the top-1 accuracy can reach 72.6% and the top-5 accuracy 93.7%. With picture input of 321 × 321 pixels, the last convolutional layer of each cycle in the middle blocks is extracted; the sizes of the four feature maps are 160 × 160 × 256, 80 × 80 × 512, 40 × 40 × 1024, and 20 × 20 × 2048, and a convolutional layer's parameters can be represented by (s, n), where n is the number of convolution kernels and s is the kernel size. Starting from the bottommost convolutional layer, the number of features is first reduced by one convolutional layer, then the map is enlarged 2× by a transposed convolutional layer and connected with the next-shallower convolutional layer, finally yielding a 160 × 160 × 256 output layer.
After the trained feature extraction layer is obtained, Faster R-CNN is used to detect the facial features and the face; the six categories to be classified are eyes, eyebrows, nose, mouth, face, and other. The trained 1000-class ImageNet-based ResNet therefore needs to be adjusted so that its number of classes becomes 6; a data set of faces and 4 other categories of images is fed in and the parameters are retrained, with the accuracy reaching 100% without overfitting. The first training stage is then finished, and the required feature extraction network is obtained.
2. Constructing the target detection network. The feature extraction network is connected with the subsequent target detection layer in the standard Faster R-CNN manner to form the target detection network, as shown in fig. 5. Face pictures with facial-feature annotations are input into the target detection network, which outputs the position and classification category of each facial feature; the loss function is the Euclidean distance between the annotated position and the predicted position. Based on the geometry of the human face, the aspect ratio of the detection frames for the eyes and the mouth is 2:1, and the aspect ratio of the detection frame for the nose is 1:2. The trained target detection network can then accurately detect the positions and categories of the facial features.
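Given an anchor area, the stated aspect ratios determine the detection-frame shapes; a small sketch (the base area of 128 is an assumed, illustrative value):

```python
# Width and height of a detection frame with a given area and width:height
# ratio: 2:1 for eyes and mouth, 1:2 for the nose, as stated above. The
# base area (128) is an illustrative assumption.
def anchor_wh(area, ratio):
    """Return (width, height) with width/height == ratio and w*h == area."""
    h = (area / ratio) ** 0.5
    return ratio * h, h

eye_w, eye_h = anchor_wh(128.0, 2.0)    # wide frame for eyes and mouth
nose_w, nose_h = anchor_wh(128.0, 0.5)  # tall frame for the nose
```

Holding the area fixed while varying the ratio is the usual anchor construction, so wide and tall frames cover the same amount of image at different shapes.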
3. Constructing the generative adversarial network. Two kinds of generative adversarial network are possible: one adopts a deep network to extract better features; the other adopts a shallow network and retains more features of the original image. As shown in fig. 6, 5 convolutional layers plus 1 pixel-shift layer can be used. The first three layers are convolutional layers with a stride of 2, reducing the picture 8×; the next two are transposed convolutional layers with a stride of 2, enlarging the picture 4×. The number of feature maps is 132; the final pixel-shift layer rearranges these 132 feature maps into 3 output channels while enlarging the length and width of the picture 8×, yielding a shallow generative adversarial network whose output length and width are each enlarged 4× overall. The network is then trained until it fits.
4. Cascading the generative adversarial network with the target detection network. The trained target detection network is connected with the generative adversarial network; after a low-resolution picture is input into the connected network, a high-definition picture and its facial features are obtained. Meanwhile, the average of the high-definition face pictures in the database is input to obtain comprehensive facial-feature values. For the corresponding features, the L2 norm is used as a loss function; the loss function is differentiated with respect to the high-definition picture, and gradient descent is used to reduce the loss, improving the accuracy of the facial features in the high-definition picture.
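A toy sketch of this final correction step: gradient descent on a squared-L2 loss between the picture's feature values and the target values (the vectors, learning rate, and step count are illustrative):

```python
# Gradient descent on a squared L2-norm loss between feature values and a
# target vector, as in the cascade's correction step. All numbers here
# are toy, illustrative values.
def l2_loss(x, target):
    return sum((a - b) ** 2 for a, b in zip(x, target))

def descend(x, target, lr=0.1, steps=100):
    for _ in range(steps):
        grad = [2 * (a - b) for a, b in zip(x, target)]  # d(loss)/dx
        x = [a - lr * g for a, g in zip(x, grad)]
    return x
```

Each step moves the features a fraction of the way toward the target, so the loss shrinks geometrically and the corrected features converge on the comprehensive feature values.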
The technical features of the embodiments described above may be combined arbitrarily; for brevity, not all possible combinations are described, but any combination involving no contradiction should be considered within the scope of this specification.
The above-mentioned embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and improvements without departing from the inventive concept, and these fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (10)

1. A face image generation method, characterized by comprising the following steps:
constructing a target detection network based on a residual network and a faster region-based convolutional neural network (Faster R-CNN), wherein the target detection network is used for extracting and identifying face features from a face image to be detected, and is obtained by connecting a feature extraction layer built on the residual network with the target detection layer of the Faster R-CNN;
constructing a generative adversarial network, wherein the generative adversarial network is used for reconstructing the face features;
cascading the target detection network with the generative adversarial network to obtain a super-resolution network;
inputting the face image to be detected into the super-resolution network to obtain a super-resolution face image;
wherein the step of constructing the target detection network based on the residual network and the Faster R-CNN comprises:
constructing, based on the residual network, a feature extraction layer for extracting face features from a face image, wherein the feature extraction layer is the output layer obtained by starting from the bottommost convolutional layer of the residual network, reducing the number of feature channels through one convolutional layer, enlarging the result with a transposed convolutional layer, and then connecting it with the convolutional layer of the previous level;
connecting the feature extraction layer with the target detection layer of the Faster R-CNN to form a sample target detection network, wherein the sample target detection network is used for identifying the face features in the face image; and
training the sample target detection network to obtain the target detection network.
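The feature extraction layer recited in claim 1 (reduce channels with a convolution, enlarge with a transposed convolution, then connect with the previous layer) can be sketched roughly as follows. Nearest-neighbour upsampling stands in for the learned stride-2 transposed convolution, and all shapes and weights are illustrative assumptions:

```python
import numpy as np

def reduce_channels(fmap, w):
    """1x1 convolution: a per-pixel linear map over channels.
    fmap: (C, H, W); w: (C_out, C)."""
    C, H, W = fmap.shape
    return (w @ fmap.reshape(C, H * W)).reshape(w.shape[0], H, W)

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling, standing in for a stride-2
    transposed convolution (same output shape; learned weights omitted)."""
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def fuse(deep_fmap, shallow_fmap, w):
    """Reduce channels, enlarge, then connect (concatenate) with the
    previous layer's feature map, as the claim describes."""
    up = upsample2x(reduce_channels(deep_fmap, w))
    return np.concatenate([up, shallow_fmap], axis=0)

deep = np.random.rand(64, 8, 8)       # bottommost (deepest) feature map
shallow = np.random.rand(32, 16, 16)  # previous layer's feature map
w = np.random.rand(32, 64)            # 1x1 conv weights: 64 -> 32 channels
out = fuse(deep, shallow, w)          # shape (64, 16, 16)
```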
2. The face image generation method according to claim 1, wherein the step of training the sample target detection network to obtain the target detection network comprises:
inputting a face image carrying feature annotations into the sample target detection network, and outputting the positions and classification categories of the features in the face image, wherein the features comprise the facial features; and
correcting the sample target detection network through a loss function to obtain the target detection network, wherein the loss function is the Euclidean distance between the annotated feature positions and the output feature positions.
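The Euclidean-distance loss of claim 2 can be illustrated with a small example; the five landmark coordinates below are invented for illustration:

```python
import numpy as np

def landmark_loss(pred, target):
    """Euclidean distance between predicted and annotated facial-landmark
    positions, summed over the landmarks."""
    return float(np.sum(np.linalg.norm(pred - target, axis=1)))

# Five (x, y) landmarks: eyes, nose tip, mouth corners (illustrative).
target = np.array([[30, 40], [70, 40], [50, 60], [38, 80], [62, 80]], float)
pred = target + np.array([3.0, 4.0])  # every point off by a (3, 4) shift
loss = landmark_loss(pred, target)    # 5 landmarks, each at distance 5
```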
3. The face image generation method according to claim 1, wherein the step of constructing the generative adversarial network comprises:
building a convolutional network and a pixel displacement network; and
connecting the convolutional network with the pixel displacement network to form the generative adversarial network.
4. The face image generation method according to claim 3, wherein the step of constructing the generative adversarial network comprises:
building a convolutional layer and a transposed convolutional layer, and connecting the convolutional layer with the transposed convolutional layer to form the convolutional network; and
connecting the transposed convolutional layer in the convolutional network with the pixel displacement network to form the generative adversarial network.
5. The face image generation method according to claim 4, wherein the step of building a convolutional layer and a transposed convolutional layer and connecting them to form the convolutional network comprises:
building a first convolutional layer, a second convolutional layer and a third convolutional layer which are connected in sequence;
building a first transposed convolutional layer and a second transposed convolutional layer which are connected in sequence; and
connecting the third convolutional layer with the first transposed convolutional layer to form the convolutional network.
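The "pixel displacement network" of claims 3 to 5 appears to correspond to the pixel-shuffle (sub-pixel convolution) rearrangement commonly used in super-resolution; under that assumption, a minimal numpy sketch:

```python
import numpy as np

def pixel_shuffle(fmap, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r),
    as in sub-pixel convolution super-resolution."""
    Cr2, H, W = fmap.shape
    C = Cr2 // (r * r)
    out = fmap.reshape(C, r, r, H, W)   # split channels into r x r sub-pixel blocks
    out = out.transpose(0, 3, 1, 4, 2)  # reorder axes to (C, H, r, W, r)
    return out.reshape(C, H * r, W * r)

x = np.arange(4 * 2 * 2, dtype=float).reshape(4, 2, 2)  # 4 channels, 2x2 map
y = pixel_shuffle(x, 2)                                 # 1 channel, 4x4 map
```

Each output 2x2 block interleaves one pixel from each of the four input channels, trading channel depth for spatial resolution.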
6. The face image generation method according to claim 1,
wherein the step of cascading the target detection network with the generative adversarial network to obtain the super-resolution network further comprises:
reducing a network loss function by gradient descent to correct the super-resolution network, wherein the network loss function is the Euclidean distance between the face image features output by the super-resolution network and the face image features of a preset sample;
and wherein the step of inputting the face image to be detected into the super-resolution network to obtain the super-resolution face image comprises:
inputting the face image to be detected into the corrected super-resolution network to obtain the super-resolution face image.
7. A face image generation apparatus, characterized by comprising:
a target detection network construction module, configured to construct a target detection network based on a residual network and a faster region-based convolutional neural network (Faster R-CNN), wherein the target detection network is used for extracting and identifying face features from a face image to be detected, and is obtained by connecting a feature extraction layer built on the residual network with the target detection layer of the Faster R-CNN;
a generative adversarial network construction module, configured to construct a generative adversarial network for reconstructing the face features;
a super-resolution network module, configured to cascade the target detection network with the generative adversarial network to obtain a super-resolution network; and
a face image output module, configured to input the face image to be detected into the super-resolution network to obtain a super-resolution face image;
wherein the target detection network construction module is further configured to construct, based on the residual network, a feature extraction layer for extracting face features from the face image, the feature extraction layer being the output layer obtained by starting from the bottommost convolutional layer of the residual network, reducing the number of feature channels through one convolutional layer, enlarging the result with a transposed convolutional layer, and then connecting it with the convolutional layer of the previous level;
to connect the feature extraction layer with the target detection layer of the Faster R-CNN to form a sample target detection network, the sample target detection network being used for identifying the face features in the face image; and
to train the sample target detection network to obtain the target detection network.
8. The face image generation apparatus according to claim 7, wherein the target detection network construction module is further configured to input a face image carrying feature annotations into the sample target detection network, and output the positions and classification categories of the features in the face image, wherein the features comprise the facial features; and
to correct the sample target detection network through a loss function to obtain the target detection network, wherein the loss function is the Euclidean distance between the annotated feature positions and the output feature positions.
9. A storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
10. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 6.
CN201711480435.0A (filed 2017-12-29): Face image generation method and device, storage medium and computer equipment. Status: Active. Granted as CN108229381B.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711480435.0A CN108229381B (en) 2017-12-29 2017-12-29 Face image generation method and device, storage medium and computer equipment


Publications (2)

Publication Number Publication Date
CN108229381A CN108229381A (en) 2018-06-29
CN108229381B true CN108229381B (en) 2021-01-08

Family

ID=62646179



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106683067A (en) * 2017-01-20 2017-05-17 福建帝视信息科技有限公司 Deep learning super-resolution reconstruction method based on residual sub-images
CN107154023A (en) * 2017-05-17 2017-09-12 电子科技大学 Face super-resolution reconstruction method based on generation confrontation network and sub-pix convolution


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Ouyang Ning. Super-resolution reconstruction based on parallel convolutional neural networks. Journal of Computer Applications (《计算机应用》), 2017. *



Legal Events

Date Code Title Description
DD01 Delivery of document by public notice

Addressee: Hunan happy Technology Co., Ltd.

Document name: Notification of Approving Refund

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant