CN112509144A - Face image processing method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN112509144A
Authority
CN
China
Prior art keywords
face
image
dimensional
network
reconstruction
Prior art date
Legal status
Pending
Application number
CN202011428827.4A
Other languages
Chinese (zh)
Inventor
王杉杉
胡文泽
王孝宇
Current Assignee
Shenzhen Intellifusion Technologies Co Ltd
Original Assignee
Shenzhen Intellifusion Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Intellifusion Technologies Co Ltd
Priority to CN202011428827.4A
Publication of CN112509144A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G06T17/20 Finite element generation, e.g. wire-frame surface description, tesselation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 3D [Three Dimensional] image rendering
    • G06T15/005 General purpose rendering architectures
    • G06T5/70
    • G06T5/73
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/30 Subject of image; Context of image processing
    • G06T2207/30196 Human being; Person
    • G06T2207/30201 Face

Abstract

The embodiment of the invention provides a face image processing method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: acquiring a preset number of continuous face frame images; inputting the continuous face frame images into a pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction, and rendering the resulting three-dimensional face into a two-dimensional face image; and denoising and deblurring the continuous face frame images through the two-dimensional face image and a pre-trained image enhancement network to obtain a target face image, wherein the three-dimensional reconstruction network and the image enhancement network are arranged in cascade. Because the two-dimensional face image is obtained by three-dimensionally reconstructing the continuous face frame images and rendering the result, it has clear features and can serve as prior information; after the image enhancement network denoises and deblurs the continuous face frame images against this prior information, a face image of higher visual quality is obtained.

Description

Face image processing method and device, electronic equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, in particular to a face image processing method and device, electronic equipment and a storage medium.
Background
As research on artificial intelligence deepens, image recognition technology is increasingly deployed in practice. Image quality is an important basis of image recognition and directly affects the accuracy and effectiveness of a recognition system; face image recognition in particular requires images of relatively high visual quality. In current deployment scenarios, scene images are mostly collected by cameras, and for reasons of cost and space the camera is often installed at a relatively high position to obtain a larger collection range. As a result, although the collection range is larger, the image of the target object in the captured picture often has low visual quality, with considerable blur or noise. The most direct remedy is to improve the resolution and imaging quality of the camera, which is costly. Traditional denoising and deblurring algorithms, in turn, are effective only when the blur parameters and the noise distribution type causing the degradation are known, but recovering them is an ill-posed problem with infinitely many solutions and is therefore very difficult. Consequently, face image recognition suffers from low face image quality.
Disclosure of Invention
The embodiment of the invention provides a face image processing method that can denoise and deblur a face image, thereby improving the quality of face images used in face image recognition.
In a first aspect, an embodiment of the present invention provides a face image processing method, including:
acquiring a preset number of continuous face frame images;
inputting the continuous face frame images into a pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction, and rendering the three-dimensional face images into two-dimensional face images;
carrying out image denoising and deblurring on the continuous face frame images through the two-dimensional face images and a pre-trained image enhancement network to obtain target face images;
the three-dimensional reconstruction network and the image enhancement network are in cascade arrangement.
Optionally, the three-dimensional reconstruction network includes a parameter extraction network and a parameter reconstruction network, and the inputting of the continuous face frame images into the pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction and the rendering of the obtained three-dimensional face into a two-dimensional face image include:
extracting the face reconstruction parameters of the continuous face frame images through the parameter extraction network;
reconstructing the face reconstruction parameters through the parameter reconstruction network to obtain a three-dimensional face;
and performing two-dimensional rendering on the three-dimensional face to obtain a two-dimensional face image.
Optionally, the image enhancement network includes a denoising convolution kernel and a deblurring convolution kernel, and the performing of image denoising and deblurring on the continuous face frame images through the two-dimensional face image and the pre-trained image enhancement network to obtain the target face image includes:
inputting the two-dimensional face image into the image enhancement network as prior information of the continuous face frame image;
taking the prior information as a reference, denoising the continuous face frame images through the denoising convolution kernel; and
taking the prior information as a reference, deblurring the continuous face frame images through the deblurring convolution kernel;
and outputting to obtain a target face image.
Optionally, the three-dimensional reconstruction network and the image enhancement network are jointly trained, where the joint training includes:
constructing a sample set, wherein the sample set comprises a sample face image, a face key point position label, a pixel mean square error label and a reconstruction parameter label which correspond to the sample face image;
inputting the sample set into the three-dimensional reconstruction network and the image enhancement network which are cascaded to train the three-dimensional reconstruction network and the image enhancement network;
calculating a first loss function between the predicted reconstruction parameters and the reconstruction parameter labels in the three-dimensional reconstruction network; and
calculating a second loss function between the positions of the face key points in the two-dimensional face image and the face key point position labels in the three-dimensional reconstruction network; and
calculating a third loss function between the predicted face image and the pixel mean square error labels in the image enhancement network;
and calculating the total loss from the first loss function, the second loss function, and the third loss function, adjusting the parameters of the three-dimensional reconstruction network and the image enhancement network through back propagation, and training iteratively to minimize the total loss, thereby obtaining the trained three-dimensional reconstruction network and image enhancement network.
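As a concrete illustration, the total-loss computation described above can be sketched as follows. This is a minimal numpy sketch assuming mean-square-error losses with equal weights; the patent does not fix the loss forms or weights, and all function names are hypothetical:

```python
import numpy as np

def mse(pred, target):
    """Mean square error over parameters, key points, or pixels."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def total_loss(pred_params, param_label,
               pred_keypoints, keypoint_label,
               pred_image, image_label):
    """Sum of the three losses used in the joint training."""
    l1 = mse(pred_params, param_label)        # first loss: reconstruction parameters
    l2 = mse(pred_keypoints, keypoint_label)  # second loss: face key point positions
    l3 = mse(pred_image, image_label)         # third loss: enhanced image vs. label
    return l1 + l2 + l3
```

Back propagation would then minimize this scalar over the parameters of both cascaded networks.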
Optionally, the sample set includes a plurality of groups of sample face images, and the constructing the sample set includes:
acquiring a preset number of continuous face frame images as a group of to-be-processed sample face images;
randomly selecting at least one blur kernel and at least one noise kernel from pre-prepared blur kernels and noise kernels, and adding blur and noise to the current to-be-processed sample face image to obtain a processed sample face image;
acquiring, from the to-be-processed sample face image, the face key point position label, the pixel mean square error label, and the reconstruction parameter label corresponding to it;
and adding the processed sample face image, together with the face key point position label, the pixel mean square error label, and the reconstruction parameter label corresponding to the to-be-processed sample face image, into the sample set as a group of sample face images.
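The blur-and-noise degradation used to build each sample group can be sketched as follows; the mean-blur kernel, the noise level, and the function names are hypothetical stand-ins for the pre-prepared blur and noise kernels:

```python
import numpy as np

rng = np.random.default_rng(0)

def degrade(image, blur_kernel, noise_sigma):
    """Blur a clean sample face image with the given kernel, then add
    Gaussian noise to produce the processed (degraded) sample."""
    k = blur_kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    blurred = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            blurred[i, j] = np.sum(padded[i:i + k, j:j + k] * blur_kernel)
    return blurred + rng.normal(0.0, noise_sigma, image.shape)

box_blur = np.full((3, 3), 1.0 / 9.0)  # one hypothetical blur kernel
degraded = degrade(np.ones((8, 8)), box_blur, noise_sigma=0.01)
```

The clean image supplies the labels; the degraded copy is what the cascaded networks see as input.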
Optionally, the sample set includes a plurality of groups of sample face images, and the constructing the sample set includes:
acquiring 2n+1 continuous face frame images as a group of to-be-processed sample face images;
randomly selecting at least one blur kernel and at least one noise kernel from the prepared blur kernels and noise kernels, and adding blur and noise to the current to-be-processed sample face images to obtain 2n+1 processed sample face images;
performing channel connection on the 2n+1 processed sample face images to obtain a target sample face image;
acquiring, from the to-be-processed sample face images, the face key point position label, the pixel mean square error label, and the reconstruction parameter label corresponding to the nth to-be-processed sample face image;
and adding the target sample face image, together with the face key point position label, the pixel mean square error label, and the reconstruction parameter label corresponding to the nth to-be-processed sample face image, into the sample set as a group of sample face images.
In a second aspect, an embodiment of the present invention provides a face image processing apparatus, including:
the acquisition module is used for acquiring a preset number of continuous face frame images;
the first processing module is used for inputting the continuous face frame images into a pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction, and rendering the three-dimensional face images into two-dimensional face images;
the second processing module is used for carrying out image denoising and deblurring on the continuous face frame images through the two-dimensional face images and a pre-trained image enhancement network to obtain target face images;
the three-dimensional reconstruction network and the image enhancement network are in cascade arrangement.
In a third aspect, an embodiment of the present invention provides an electronic device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the face image processing method provided by the embodiment of the present invention.
In a fourth aspect, the embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps in the face image processing method provided by the embodiment of the present invention.
In the embodiment of the invention, a preset number of continuous face frame images are acquired; the continuous face frame images are input into a pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction, and the resulting three-dimensional face is rendered into a two-dimensional face image; the continuous face frame images are then denoised and deblurred through the two-dimensional face image and a pre-trained image enhancement network to obtain a target face image, the three-dimensional reconstruction network and the image enhancement network being arranged in cascade. Because the two-dimensional face image is obtained by three-dimensional face reconstruction and rendering, its image quality is very high; it therefore has clear features and can serve as prior information, and after the image enhancement network denoises and deblurs the continuous face frame images according to this prior information, a face image of higher visual quality is obtained.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a face image processing method according to an embodiment of the present invention;
fig. 2 is a flowchart of a three-dimensional face reconstruction method according to an embodiment of the present invention;
fig. 3 is a flowchart of two-dimensional face acquisition according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating denoising and deblurring of a face image according to an embodiment of the present invention;
FIG. 5 is a flowchart of a three-dimensional reconstruction network and image enhancement network joint training method according to an embodiment of the present invention;
FIG. 6 is a flow chart of constructing a sample set according to an embodiment of the present invention;
FIG. 7 is a flowchart of another method for jointly training a three-dimensional reconstruction network and an image enhancement network according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a face image processing apparatus according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of another face image processing apparatus according to an embodiment of the present invention;
fig. 10 is a schematic structural diagram of another face image processing apparatus according to an embodiment of the present invention;
fig. 11 is a schematic structural diagram of another face image processing apparatus according to an embodiment of the present invention;
fig. 12 is a schematic structural diagram of another face image processing apparatus according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of another face image processing apparatus according to an embodiment of the present invention;
fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art from the given embodiments without creative effort shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a flowchart of a face image processing method according to an embodiment of the present invention, as shown in fig. 1, including the following steps:
101. Acquiring a preset number of continuous face frame images.
The continuous face frame images can be captured by a camera and sent in real time, or extracted from a video. The camera can be a three-dimensional depth camera capable of capturing face frame images with depth information.
Continuous can be understood as continuous in the frame sequence: if the camera captures 30 frames in one second, those 30 frames are continuous in the frame sequence. The continuous face frame images can also be understood as continuous in time: for example, if the camera captures one frame every millisecond, the continuous face frame images can be 5 face frame images captured over 5 milliseconds; in some possible cases, a higher-quality face frame image can also be selected every 3 milliseconds, with 5 face frame images in total selected as the continuous face frame images.
In an embodiment of the present invention, the preset number may be 2n+1, where n is an integer greater than or equal to 1. When the camera is shooting in real time, the face quality of the currently captured face frame image is computed in real time; when the face quality is higher than a preset threshold, the current frame is recorded as the (n+1)th frame, and n face frame images are taken before it and n after it, 2n+1 in total. For example, if the currently captured frame is the 256th frame of a video and its face quality is computed to be above the preset threshold, the 256th frame is recorded as the (n+1)th frame (i.e., the 3rd of the continuous face frame images when n = 2), and 2 frames are taken before it and 2 after it, giving 5 continuous face frame images. Of course, since shooting is in real time, if one of the 2 following frames has higher face quality, the selection restarts from that frame as the new (n+1)th frame. If the continuous face frame images are extracted from a video, the frame with the highest face quality is selected as the (n+1)th frame, and then n face frame images are taken before it and n after it, 2n+1 in total. In this way the input quality of the continuous face frame images can be improved. Among the 2n+1 face frame images, the (n+1)th may also be called the face frame image representative.
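The 2n+1 window selection described above can be sketched as follows; the face-quality scoring itself is left abstract, and the score values and threshold are hypothetical examples:

```python
def select_window(quality_scores, n, threshold):
    """Return 2n+1 consecutive frame indices centred on the first frame
    whose face-quality score exceeds the threshold, or None."""
    for center in range(n, len(quality_scores) - n):
        if quality_scores[center] > threshold:
            # n frames before the centre, the centre frame, n frames after
            return list(range(center - n, center + n + 1))
    return None

# n = 2 gives a 5-frame window around the first sufficiently good frame
window = select_window([0.2, 0.4, 0.9, 0.5, 0.6, 0.3, 0.1], n=2, threshold=0.8)
```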
Of course, the preset number may also be 2n, where n is greater than or equal to 2; in that case a face frame image representative needs to be selected from the 2n face frame images. For example, if n is 2 and 2n is 4, the 2nd (nth) or 3rd ((n+1)th) of the 4 face frame images may be selected as the representative.
102. Inputting the continuous face frame images into a pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction, and rendering the three-dimensional face into a two-dimensional face image.
In this step, the three-dimensional reconstruction network may be a multi-input convolutional neural network to satisfy the input of the continuous face frame images. Further, the three-dimensional reconstruction network in the embodiment of the present invention may be a full convolution neural network, and may support images of different sizes as input.
Three-dimensional face reconstruction can be understood as the process of recovering a three-dimensional face from a two-dimensional face image, or as representing any face by a set of face reconstruction parameters. For example, in the BFM2017 database, the 2217th vertex of the face shape basis represents the left outer eye corner of the face; by performing feature extraction on a two-dimensional face image, feature information indexing the point cloud or the face patches can be obtained, which is equivalent to obtaining the corresponding face reconstruction parameters, so that every textured three-dimensional face can be represented by its face reconstruction parameters. It can further be understood that every three-dimensional face can be represented by face reconstruction shape parameters, face reconstruction texture parameters, and face motion parameters.
In an embodiment of the present invention, the three-dimensional face reconstruction network includes a parameter extraction network and a parameter reconstruction network.
Specifically, as shown in fig. 2, fig. 2 is a flowchart of a three-dimensional face reconstruction method according to an embodiment of the present invention, and the flowchart includes:
201. Extracting the face reconstruction parameters of the continuous face frame images through the parameter extraction network.
The face reconstruction parameters may include: face reconstruction shape parameters, face reconstruction texture parameters, face motion parameters, face pose parameters, face angle parameters, face key point parameters, and the like.
In this step, the continuous face frame images are input to the parameter extraction network. Since the parameter extraction network is a pre-trained convolutional neural network, the face reconstruction parameters contained in the continuous face frame images can be extracted through the trained convolution kernels and corresponding activation functions; the face reconstruction parameters include: face reconstruction shape parameters, face reconstruction texture parameters, and face motion parameters.
Specifically, the face reconstruction parameters reconstruction_params output by the parameter extraction network are expressed as a vector of four components [_3dface_shape_params, _3dface_texture_params, (R, T)], where _3dface_shape_params denotes the face reconstruction shape parameters, _3dface_texture_params denotes the face reconstruction texture parameters, and (R, T) denotes the face motion parameters.
The face reconstruction shape parameters represent the spatial position of each point of the three-dimensional face to be reconstructed, and the face reconstruction texture parameters represent the pixel value (color) of each point; among the face motion parameters, R denotes a rotation matrix and T denotes the displacement of the feature points.
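Assuming hypothetical basis sizes (the patent does not state how many shape or texture coefficients the network predicts), the flat reconstruction_params vector can be split into its four components like this:

```python
import numpy as np

# Hypothetical basis sizes; the real sizes come from the baseline model.
N_SHAPE, N_TEX = 80, 80

def split_reconstruction_params(params):
    """Split the flat reconstruction_params vector into
    (_3dface_shape_params, _3dface_texture_params, R, T)."""
    shape_params = params[:N_SHAPE]
    tex_params = params[N_SHAPE:N_SHAPE + N_TEX]
    R = params[N_SHAPE + N_TEX:N_SHAPE + N_TEX + 9].reshape(3, 3)  # rotation matrix
    T = params[N_SHAPE + N_TEX + 9:]                               # displacement
    return shape_params, tex_params, R, T

shape_p, tex_p, R, T = split_reconstruction_params(np.zeros(N_SHAPE + N_TEX + 9 + 3))
```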
In a possible embodiment, the parameter extraction network may be a convolutional neural network based on resnet18, which has 18 weighted convolutional layers plus a fully-connected layer, i.e., a depth of 18. Because resnet18 uses residual connections and identity mappings, the degradation problem that deep convolutional networks suffer from is alleviated, and more accurate features can be extracted from the continuous face frame images.
202. Reconstructing the face reconstruction shape parameters, face reconstruction texture parameters, and face motion parameters through the parameter reconstruction network to obtain the three-dimensional face.
In this step, the parameter reconstruction network may be a reconstruction network based on a baseline model. The face reconstruction shape parameters, face reconstruction texture parameters, and face motion parameters extracted in step 201 are input to the parameter reconstruction network, which reconstructs the three-dimensional face.
Specifically, the three-dimensional face can be reconstructed by the following formula:
Face_shape3d = s_mean + Σ(i=1..m) _3dface_shape_params_i · s_i (formula 1)
Face_tex3d = t_mean + Σ(i=1..m) _3dface_texture_params_i · t_i (formula 2)
Face3d(shape, texture) = (Face_shape3d, Face_tex3d) (formula 3)
where s_mean denotes the average shape parameter, t_mean denotes the average texture parameter, s_i and t_i are respectively the principal components of the face shape and the face texture in the baseline model, and m is the number of face key points. Since s_mean, t_mean, s_i, t_i, and m are known parameters of the baseline model, Face_shape3d in formula 1 is the three-dimensional face shape and Face_tex3d in formula 2 is the three-dimensional face texture in the baseline model, and _3dface_shape_params, _3dface_texture_params, and (R, T) are the face reconstruction parameters extracted by the parameter extraction network and are also known. Therefore, in the parameter reconstruction network, Face3d(shape, texture) can be obtained according to formula 3, where Face3d(shape, texture) represents the corresponding three-dimensional face, comprising a shape (the spatial position of each point) and a texture (the color of each point).
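Formulas 1 and 2 are linear combinations of the baseline-model bases; a toy numpy sketch with made-up, tiny dimensions is:

```python
import numpy as np

def reconstruct_face(mean_shape, mean_tex, shape_basis, tex_basis,
                     shape_params, tex_params):
    """Formulas 1-3 as linear combinations of the baseline-model bases."""
    face_shape3d = mean_shape + shape_basis @ shape_params  # formula 1
    face_tex3d = mean_tex + tex_basis @ tex_params          # formula 2
    return face_shape3d, face_tex3d                         # formula 3

# Toy example: 2 vertices (6 coordinates) and 2 basis components
mean = np.zeros(6)
basis = np.eye(6)[:, :2]
shape3d, tex3d = reconstruct_face(mean, mean, basis, basis,
                                  np.array([1.0, 2.0]), np.array([0.5, 0.5]))
```

A real baseline model would supply much larger mean vectors and basis matrices; only the linear-combination structure is shown here.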
In the embodiment of the present invention, after the three-dimensional face is obtained in step 202, it is rendered into a two-dimensional face by a rendering component, for example OpenGL (Open Graphics Library). During rendering, the face orientation feature of the face frame image representative among the continuous face frame images can be extracted, and the three-dimensional face is rendered into a two-dimensional face with that orientation, so that the two-dimensional face has the same orientation as the face in the face frame image representative.
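The geometric core of this rendering step, applying the face motion parameters (R, T) and projecting onto the image plane, can be sketched as follows; this is a simplified orthographic stand-in for a full OpenGL render, and the scale parameter is an assumption:

```python
import numpy as np

def project_vertices(vertices3d, R, T, scale=1.0):
    """Apply the face motion parameters (rotation R, displacement T) and
    orthographically project the vertices onto the image plane."""
    transformed = vertices3d @ R.T + T  # rigid transform, shape (N, 3)
    return scale * transformed[:, :2]   # drop depth to land in 2D

verts = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0]])
points2d = project_vertices(verts, np.eye(3), np.zeros(3))
```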
In a possible embodiment, the number of the continuous face frame images is 2n+1. Step 102 is specifically shown in fig. 3, a flowchart of two-dimensional face acquisition according to an embodiment of the present invention. Taking 2n+1 = 3 as an example, the method specifically includes:
301. Acquiring continuous face frame images im_0, im_1, and im_2.
im_0, im_1, and im_2 can be obtained by the method provided in step 101, where im_1 is the face frame image representative.
It should be noted that, for convenience of description, the number of continuous face frame images in this embodiment is 3 by way of example, which should not be regarded as limiting the invention; the number may also be 5, 7, and so on. The number may even be even, but since the three-dimensional face reconstruction network in the embodiment of the present invention is a multi-input single-output network, when the number of continuous face frame images is even, the face frame image representative should be determined in advance.
302. Connecting im_0, im_1, and im_2 on the channel dimension.
In this step, im_0, im_1, and im_2 are connected along the channel dimension to obtain a multi-channel image, which is input to the three-dimensional face reconstruction network to predict the face reconstruction parameters.
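The channel connection in this step can be sketched with numpy (the image sizes are hypothetical):

```python
import numpy as np

# Three hypothetical 64x64 RGB face frames, im_0 / im_1 / im_2
im_0, im_1, im_2 = (np.zeros((64, 64, 3)) for _ in range(3))

# Connecting on the channel axis gives one 9-channel input for the network
multi_channel = np.concatenate([im_0, im_1, im_2], axis=-1)
```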
303. Inputting the connected im_0, im_1, and im_2 into the parameter extraction network of the three-dimensional face reconstruction network to extract the face reconstruction parameters.
In this step, the face reconstruction parameters include: face reconstruction shape parameters, face reconstruction texture parameters, and face motion parameters.
304. And reconstructing the extracted face reconstruction parameters through a parameter extraction network in the three-dimensional face reconstruction network to obtain a reconstructed three-dimensional face.
In this step, the three-dimensional face reconstruction step may refer to step 202, and the three-dimensional face is reconstructed according to the face reconstruction shape parameter, the face reconstruction texture parameter, and the face motion parameter.
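The patent does not specify the parametric face model; purely as an illustrative sketch, a common linear-basis (3DMM-style) combination of the shape, texture and action (expression) parameters could look like this, with all basis sizes hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices = 1000  # hypothetical mesh size

# Hypothetical linear bases; a real morphable model would provide these.
mean_shape  = rng.standard_normal(n_vertices * 3)
shape_basis = rng.standard_normal((n_vertices * 3, 40))  # face reconstruction shape
expr_basis  = rng.standard_normal((n_vertices * 3, 10))  # face action (expression)
mean_tex    = rng.standard_normal(n_vertices * 3)
tex_basis   = rng.standard_normal((n_vertices * 3, 40))  # face reconstruction texture

def reconstruct_face(shape_p, expr_p, tex_p):
    """Combine predicted parameters into 3D vertices and per-vertex colors."""
    vertices = mean_shape + shape_basis @ shape_p + expr_basis @ expr_p
    texture = mean_tex + tex_basis @ tex_p
    return vertices.reshape(-1, 3), texture.reshape(-1, 3)

verts, tex = reconstruct_face(np.zeros(40), np.zeros(10), np.zeros(40))
print(verts.shape, tex.shape)  # (1000, 3) (1000, 3)
```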
305. And performing two-dimensional rendering on the reconstructed three-dimensional face to obtain a two-dimensional face.
Wherein the two-dimensional face and the representative face frame image im_1 have similar face key points. The rendered face image carries very clear prior information, such as face key points, face shape parameters and face texture parameters, so the face key points can be used as prior information in the subsequent steps; of course, the prior information in the subsequent steps can be any one or a combination of the face key points, the face shape parameters and the face texture parameters.
103. And carrying out image denoising and deblurring on the continuous face frame images through the two-dimensional face images and the pre-trained image enhancement network to obtain the target face image.
In this step, the image enhancement network may be a multi-input convolutional neural network to satisfy the input of the continuous face frame image and the two-dimensional face image. Further, the image enhancement network may be a Face enhancement network (Face Enhance Net).
In the embodiment of the invention, the two-dimensional face image is used as prior information to form reference information for denoising and deblurring in an image enhancement network.
The image enhancement network comprises a denoising convolution kernel and a deblurring convolution kernel, wherein the denoising convolution kernel and the deblurring convolution kernel are respectively used for denoising and deblurring continuous face frame images, so that the obtained face images are higher in quality. Note that the above-described denoising convolution kernel and deblurring convolution kernel may also be referred to as convolution kernels.
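The denoising and deblurring kernels are learned inside the network and are not given in the source; purely as an illustration, fixed smoothing and sharpening kernels can stand in for them in a naive "same" convolution:

```python
import numpy as np

def conv2d_same(img, kernel):
    """Naive single-channel 'same' convolution, a stand-in for a learned layer."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

# Hypothetical fixed kernels; in the image enhancement network these are learned.
denoise_kernel = np.full((3, 3), 1.0 / 9.0)   # smoothing suppresses noise
deblur_kernel = np.array([[0., -1., 0.],
                          [-1., 5., -1.],
                          [0., -1., 0.]])     # sharpening counteracts blur

img = np.random.default_rng(1).random((8, 8))
enhanced = conv2d_same(conv2d_same(img, denoise_kernel), deblur_kernel)
print(enhanced.shape)  # (8, 8)
```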
Specifically, as shown in fig. 4, fig. 4 is a flowchart for denoising and deblurring a human face image according to an embodiment of the present invention, which specifically includes the following steps:
401. and inputting the two-dimensional face image into an image enhancement network as prior information of the continuous face frame image.
In this step, in the two-dimensional face image, any one or a combination of a face key point, a face shape parameter, and a face texture parameter may be used as the prior information. In the image enhancement network, the prior information of the two-dimensional face image can be extracted through convolution operation.
The image enhancement network is used for performing the denoising operation and the deblurring operation on the consecutive face frame images.
402. Taking the prior information as a reference, performing the denoising operation on the consecutive face frame images through the denoising convolution kernel.
The denoising operation refers to filtering out the noise of the consecutive face frame images through the convolution kernel, such as lighting noise, electrical signal noise, and the like. Further, since the image enhancement network is a multi-input single-output network, the denoising operation can also be understood as filtering out the noise of the representative face frame image through the convolution kernel.
403. Taking the prior information as a reference, performing the deblurring operation on the consecutive face frame images through the deblurring convolution kernel.
The deblurring operation refers to filtering out the blurring factors of the consecutive face frame images through the convolution kernel, such as defocus blur, motion blur, shake blur, and the like. Further, since the image enhancement network is a multi-input single-output network, the deblurring operation can also be understood as filtering out the blurring factors of the representative face frame image through the convolution kernel.
In the above steps 402 and 403, the prior information may be face key points, for example: eyebrows, eyes, nose, mouth, face contour, and the like. The face key points may be extracted when the two-dimensional image is generated in step 102, or when the two-dimensional image is input to the image enhancement network. Because face key points are well-defined and highly robust, they can be used as prior information to guide the denoising and deblurring of the representative face frame image; that is, during denoising and deblurring, the feature values at the corresponding face key points need to be preserved rather than reduced.
Similarly, when the prior information is the face shape parameters and the face texture parameters, the face shape parameters may be described by face key points and the face texture parameters by pixel or color gradient distribution; both may be used as prior information to guide the denoising and deblurring of the representative face frame image, that is, during denoising and deblurring, the corresponding face shape parameters and face texture parameters need to be kept unchanged or within a predetermined range.
It should be noted that the execution sequence of the step 402 and the step 403 may not be sequential, and in some possible embodiments, the denoising convolution kernel and the deblurring convolution kernel may be fused, and after the training of the fused convolution kernel is completed, the denoising and the deblurring may be performed on the continuous face frame images at the same time.
404. And outputting to obtain a target face image.
When the denoising and deblurring of the consecutive face frame images are completed, the image enhancement network outputs the target face image, which can be used for face recognition. Because the target face image has been denoised and deblurred, its image quality is high, which can improve the accuracy of subsequent face recognition.
It should be noted that the three-dimensional reconstruction network and the image enhancement network are arranged in cascade. Cascading the two networks can improve their overall performance; moreover, because the training set is uniform, learning is fast and the training speed of the networks is improved.
In the embodiment of the invention, a preset number of continuous face frame images are obtained; inputting the continuous face frame images into a pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction, and rendering the three-dimensional face images into two-dimensional face images; carrying out image denoising and deblurring on the continuous face frame images through the two-dimensional face images and a pre-trained image enhancement network to obtain target face images; the three-dimensional reconstruction network and the image enhancement network are in cascade arrangement. The two-dimensional face image is obtained by performing three-dimensional face reconstruction on the continuous face frame image and rendering, so that the image quality of the two-dimensional face image is very high, the two-dimensional face image has clearer characteristics and can be used as prior information, and after the image enhancement network performs denoising and deblurring on the continuous face frame image according to the two-dimensional face image used as the prior information, the face image with higher visual quality can be obtained.
It should be noted that the face image processing method provided by the embodiment of the present invention can be applied to devices such as a mobile phone, a monitor, a computer, and a server that can perform face image processing.
Referring to fig. 5, fig. 5 is a flowchart of a three-dimensional reconstruction network and image enhancement network joint training method according to an embodiment of the present invention. As shown in fig. 5, the method includes:
501. and constructing a sample set.
The sample set comprises a sample face image, a face key point position label, a pixel mean square error label and a reconstruction parameter label which correspond to the sample face image.
In this step, the face images in the sample set are consecutive face frame images of a preset number, and further, in the training process and the inference process, the preset number of consecutive face frame images is the same. For example, in the training process, the number of consecutive face frame images in the sample set is 5, and in the inference process, the number of input consecutive face frame images is also 5. It should be noted that the preset number of consecutive face frame images may be used as a group of sample face images, for example, a group of sample face images includes 5 consecutive face frame images.
The human face key point position label, the pixel mean square error label and the reconstruction parameter label corresponding to the sample human face image can be labeled manually or labeled through a corresponding automatic labeling network.
The sample face images are sample face images with blurring factors and noise factors, and can be obtained by adding blurring factors and noise factors to clear face images.
Specifically, as shown in fig. 6, fig. 6 is a flowchart for constructing a sample set according to an embodiment of the present invention, and the flowchart includes:
601. and acquiring a preset number of continuous face frame images as a group of to-be-processed sample face images.
In this step, the sample face image to be processed may be a high-quality face image captured by a camera, a high-quality face image captured from a video, or a high-quality face image generated by a GAN (generative adversarial network) or another face generator. The high-quality face image can be accurately labeled to obtain more precise labels. Of course, for some automatic labeling networks that have learned to label blurry faces, the embodiment of the present invention does not require the sample face image to be processed to be of high quality.
In the embodiment of the invention, the three-dimensional reconstruction network and the image enhancement network are both multi-input neural networks, so that the sample face image can be a group of continuous face frame images. Furthermore, because the three-dimensional reconstruction network and the image enhancement network are both single-output neural networks, one face image can be selected from a group of sample face images as a representative face frame image.
Specifically, the preset number may be 2n+1; that is, in this step, 2n+1 consecutive face frame images are acquired as a group of sample face images to be processed. The middle, (n+1)-th frame can then be selected as the representative face frame image, so that evenly distributed preceding and following frame information is available, which can improve the accuracy of the three-dimensional reconstruction network and the image enhancement network.
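The choice of the middle frame as the representative can be sketched as follows (indices zero-based; frame names are placeholders):

```python
# 2n+1 consecutive frames with n = 2; names are hypothetical.
frames = ["im_0", "im_1", "im_2", "im_3", "im_4"]
n = (len(frames) - 1) // 2
representative = frames[n]  # the middle, (n+1)-th frame in 1-based counting
print(representative)  # im_2
```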
602. And randomly selecting at least one fuzzy kernel and at least one noise kernel from the prepared fuzzy kernels and the noise kernels to perform fuzzy addition and noise addition on the current sample face image to be processed, so as to obtain the processed sample face image.
In this step, the previously prepared blur kernel and noise kernel may be obtained from image blur data and image noise data supplied from corresponding databases. Specifically, a fuzzy kernel library and a noise kernel library can be established according to the image fuzzy data and the image noise data, and the current sample face image to be processed can be processed by randomly selecting or appointing to select a fuzzy kernel and a noise kernel in the fuzzy kernel library and the noise kernel library, so that the sample face image added with the fuzzy factor and the noise factor is obtained and used as the processed sample face image. It should be noted that, because the current sample face image to be processed is a set of sample face frame images with a preset number, the same blurring factor and noise factor are added to a set of sample face frame images with a preset number through the same blurring kernel and noise kernel.
In a possible embodiment, the blurring factor and the noise are added to the current sample face image to be processed, or the processed sample face image may be generated by simultaneously inputting the current sample face image to be processed and the type of the blurring factor and the type of the noise to be added to the image generator.
The sample face image to be processed may be understood as an image F(x, y), and the processed sample face image as an image G(x, y), where G(x, y) = H(F(x, y)) + N, H is the blur kernel and N is the noise kernel.
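A minimal sketch of this degradation model, with a hypothetical 5x5 box-blur kernel standing in for an entry of the blur kernel library and Gaussian noise standing in for the noise kernel:

```python
import numpy as np

def degrade(clean, blur_kernel, noise_sigma, rng):
    """G(x, y) = H(F(x, y)) + N: blur the clean image, then add noise."""
    kh, kw = blur_kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(clean, ((ph, ph), (pw, pw)), mode="edge")
    blurred = np.zeros_like(clean)
    for i in range(clean.shape[0]):
        for j in range(clean.shape[1]):
            blurred[i, j] = np.sum(padded[i:i + kh, j:j + kw] * blur_kernel)
    return blurred + rng.normal(0.0, noise_sigma, clean.shape)

rng = np.random.default_rng(0)
clean = rng.random((16, 16))              # placeholder for a clean face patch
box_blur = np.full((5, 5), 1.0 / 25.0)    # stand-in entry of a blur kernel bank
degraded = degrade(clean, box_blur, noise_sigma=0.05, rng=rng)
print(degraded.shape)  # (16, 16)
```

The same kernel and noise draw would be reused for all 2n+1 frames of a group, matching the requirement that a group receives identical blurring and noise factors.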
603. And acquiring a position label of a key point of the face, a pixel mean square error label and a reconstruction parameter label corresponding to the current sample face image to be processed according to the current sample face image to be processed.
In this step, a human face key point position label, a pixel mean square error label and a reconstruction parameter label corresponding to the current sample human face image to be processed can be obtained through manual labeling or automatic labeling. The human face key point position label, the pixel mean square error label and the reconstruction parameter label can be used as training guidance, so that the reconstruction parameter of a three-dimensional face output by a three-dimensional reconstruction network in the training process is as close as possible to the reconstruction parameter label, the human face key point position of a two-dimensional human face image obtained by rendering the three-dimensional reconstruction network is as close as possible to the human face key point position label, and a target human face image output by an image enhancement network is as close as possible to the pixel mean square error label, so that the output target human face image can be trained to be an expected human face image by the cascaded three-dimensional reconstruction network and the image enhancement network.
604. And adding the processed sample face image, and a face key point position label, a pixel mean square error label and a reconstruction parameter label which correspond to the current sample face image to be processed into a sample set as a group of sample face images.
In the step, fuzzy factors and noise factors are added in the processed sample face images, so that the processed sample face images are closer to the face images acquired by a real camera. It should be noted that a group of sample face images includes a preset number of continuous face frame images, and a face key point position label, a pixel mean square error label, and a reconstruction parameter label corresponding to the face frame image representation.
Furthermore, in the embodiment of the present invention, consecutive face frame images in a group of sample face images may be connected on a channel to form an input whole.
502. And inputting the sample set into a three-dimensional reconstruction network and an image enhancement network which are cascaded to train the three-dimensional reconstruction network and the image enhancement network.
In the step, the three-dimensional reconstruction network and the image enhancement network are in cascade connection, and the input of the two networks comprises the same sample face image.
The joint training includes: (1) enabling the three-dimensional reconstruction network to learn reconstruction parameter prediction for the three-dimensional face and to render the three-dimensional face into a two-dimensional face; and (2) enabling the image enhancement network to learn deblurring and denoising prediction.
In the training process, the error between the predicted result and the expected result (label) can be calculated through a loss function, and parameters of the three-dimensional reconstruction network and the image enhancement network are adjusted through reverse error propagation, so that the training is completed.
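As a toy illustration of this loss-driven update (a single scalar parameter with a squared-error loss, not the actual networks):

```python
# One scalar parameter w, loss L(w) = (w - target)^2, gradient dL/dw = 2(w - target).
w, target, lr = 0.0, 3.0, 0.1
for _ in range(100):
    grad = 2.0 * (w - target)  # "error" propagated back to the parameter
    w -= lr * grad             # parameter adjustment step
print(round(w, 3))  # converges to 3.0
```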
503. And calculating a first loss function of the predicted reconstruction parameters and the reconstruction parameter labels in the three-dimensional reconstruction network.
In this step, the first loss function is expressed by the following equation:
Param_Loss = mse_loss(predicted_param, origin_param) (formula 4)

wherein Param_Loss represents the first loss function, predicted_param represents the reconstruction parameters predicted by the three-dimensional reconstruction network for the current sample face image, and origin_param represents the reconstruction parameter label of the current sample face image.
504. And calculating a second loss function corresponding to the predicted human face key point position and the human face key point position label in the two-dimensional human face image in the three-dimensional reconstruction network.
In this step, the second loss function is expressed by the following equation:
Landmark_Loss = Σ_i ((predicted_mark[i][x] - origin_mark[i][x])^2 + (predicted_mark[i][y] - origin_mark[i][y])^2) / N (formula 5)

wherein Landmark_Loss represents the second loss function, predicted_mark represents the predicted face key point positions, origin_mark represents the face key point position labels, and N represents the total number of sample face images in the sample set; predicted_mark[i][x] represents the x coordinate in the predicted face key point positions of the current sample face image, origin_mark[i][x] represents the x coordinate in the face key point position label of the current sample face image, and similarly, predicted_mark[i][y] and origin_mark[i][y] represent the corresponding y coordinates.
505. And calculating a third loss function of the predicted pixel mean square error and the pixel mean square error label of the target face image in the image enhancement network.
In this step, the third loss function is expressed by the following equation:
ImageMse_Loss = mse_loss(Face_enhanced, im_{n+1}) (formula 6)

wherein ImageMse_Loss represents the third loss function, and mse_loss(Face_enhanced, im_{n+1}) represents the pixel mean square error loss between the predicted face image and the pixel mean square error label; Face_enhanced is the predicted face image output by the image enhancement network, and im_{n+1} represents the representative face frame image when the preset number is 2n+1.
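A sketch of the third loss, assuming the pixel mean square error label is realized as the clean representative frame itself:

```python
import numpy as np

def image_mse_loss(face_enhanced, im_rep):
    """Third loss: pixel mean square error between the enhanced output
    Face_enhanced and the clean representative frame im_{n+1}."""
    return float(np.mean((face_enhanced - im_rep) ** 2))

enhanced = np.zeros((4, 4))          # hypothetical network output
clean_rep = np.full((4, 4), 2.0)     # hypothetical clean representative frame
print(image_mse_loss(enhanced, clean_rep))  # 4.0
```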
506. And calculating the total loss of the first loss function, the second loss function and the third loss function, adjusting parameters of the three-dimensional reconstruction network and the image enhancement network according to the back propagation, and performing iterative training to minimize the total loss to obtain a trained three-dimensional reconstruction network and a trained image enhancement network.
In this step, the total loss of the first loss function, the second loss function, and the third loss function may be the sum of the first loss function, the second loss function, and the third loss function, or may be the weighted sum of the first loss function, the second loss function, and the third loss function, and may be specifically selected or set by the user. For example, when the gradient of the image enhancement network decreases faster, the weighting coefficient of the third loss function can be adjusted to decrease the gradient of the total loss, so that the training of the three-dimensional reconstruction network and the training of the image enhancement network can be synchronized, and the image enhancement network is prevented from generating gradient explosion due to the fact that the gradient of the image enhancement network decreases too fast.
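The weighted total loss described here can be sketched as follows (weight names are hypothetical):

```python
def total_loss(param_loss, landmark_loss, image_mse_loss,
               w1=1.0, w2=1.0, w3=1.0):
    """Total loss: plain sum with equal weights, or a weighted sum."""
    return w1 * param_loss + w2 * landmark_loss + w3 * image_mse_loss

print(total_loss(0.5, 1.0, 2.0))          # 3.5 (plain sum)
print(total_loss(0.5, 1.0, 2.0, w3=0.1))  # down-weights the enhancement loss
```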
And after the total loss is obtained through calculation, calculating a corresponding error through the total loss, and adjusting parameters of the three-dimensional reconstruction network and the image enhancement network in a back propagation mode according to the error.
For further explanation of the embodiment of the present invention, please refer to fig. 7, which is a schematic diagram of joint training further provided by an embodiment of the present invention. As shown in fig. 7, a sample face image is (im0, im1, im2), where im1 is the representative face frame image. im1 is labeled to obtain the corresponding face key point position label, pixel mean square error label and reconstruction parameter label. Blurring factors and noise factors are added to im0, im1 and im2 respectively; im0, im1 and im2 with the blurring and noise factors added are connected, and the connected result is input to the parameter prediction network in the three-dimensional reconstruction network to obtain predicted reconstruction parameters, whose loss against the reconstruction parameter label is calculated by the first loss function. The face is then reconstructed in three dimensions through the parameter reconstruction network in the three-dimensional reconstruction network to obtain a three-dimensional face; the three-dimensional face is rendered into a two-dimensional face image, the predicted face key point positions of the two-dimensional face image are extracted, and their loss against the face key point position label is calculated by the second loss function. Finally, the two-dimensional face image is connected with im0, im1 and im2 with the blurring and noise factors added, the connected result is input to the cascaded image enhancement network for denoising and deblurring, and the image enhancement network outputs a high-quality target face image; the predicted pixel mean square error of the target face image is then calculated, and its loss against the pixel mean square error label is calculated by the third loss function.
It should be noted that, the forward inference is different from the training process described above, and after training, when performing forward inference on continuous face frame images, there is no step of labeling, no step of adding a fuzzy factor and a noise factor, and no step of calculating a loss through a first loss function, a second loss function, and a third loss function, which are shown by dotted lines in fig. 7.
In the embodiment of the invention, the three-dimensional reconstruction network and the image enhancement network both perform image processing on the human face, so that the commonality is high, and the time required by training can be reduced by using combined training. Through the trained three-dimensional reconstruction network and the image enhancement network, the face image with higher image quality can be obtained, and high-quality input is provided for the subsequent recognition network, so that the recognition accuracy of the subsequent recognition network is improved.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a face image processing apparatus according to an embodiment of the present invention, and as shown in fig. 8, the apparatus includes:
an obtaining module 801, configured to obtain a preset number of consecutive face frame images;
the first processing module 802 is configured to input the continuous face frame images into a pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction, and render a three-dimensional face image into a two-dimensional face image;
the second processing module 803 is configured to perform image denoising and deblurring on the continuous face frame image through the two-dimensional face image and a pre-trained image enhancement network to obtain a target face image;
the three-dimensional reconstruction network and the image enhancement network are in cascade arrangement.
Optionally, as shown in fig. 9, the first processing module 802 includes:
a parameter extraction submodule 8021, configured to extract, through the parameter extraction network, face reconstruction parameters of the continuous face frame image;
a parameter reconstruction submodule 8022, configured to reconstruct the face reconstruction parameters through the parameter reconstruction network, so as to obtain a three-dimensional face;
and the rendering submodule 8023 is configured to perform two-dimensional rendering on the three-dimensional face to obtain a two-dimensional face image.
Optionally, the image enhancement network includes a denoising convolution kernel and a deblurring convolution kernel, as shown in fig. 10, the second processing module 803 includes:
a first input sub-module 8031, configured to input the two-dimensional face image to the image enhancement network as prior information of the continuous face frame image;
a first processing sub-module 8032, configured to perform denoising operation on the continuous face frame image through the denoising convolution kernel with the prior information as a reference; and
a second processing sub-module 8033, configured to perform a deblurring operation on the continuous face frame image through the deblurring convolution kernel with the prior information as a reference;
and the output sub-module 8034 is used for outputting the obtained target face image.
Optionally, the three-dimensional reconstruction network and the image enhancement network are jointly trained; as shown in fig. 11, the apparatus further includes a joint training module 804, and the joint training module 804 includes:
a sample submodule 8041, configured to construct a sample set, where the sample set includes a sample face image, and a face key point position label, a pixel mean square error label, and a reconstruction parameter label that correspond to the sample face image;
a second input sub-module 8042, configured to input the sample set into the three-dimensional reconstruction network and the image enhancement network in cascade connection to train the three-dimensional reconstruction network and the image enhancement network;
the first calculating submodule 8043 is configured to calculate a first loss function of a predicted reconstruction parameter and a reconstruction parameter tag in the three-dimensional reconstruction network; and
a second calculating submodule 8044, configured to calculate a second loss function corresponding to the predicted face key point position and the face key point position label in the two-dimensional face image in the three-dimensional reconstruction network; and
a third computing submodule 8045, configured to compute a third loss function of a predicted pixel mean square error and a pixel mean square error label of the target face image in the image enhancement network;
a fourth calculating submodule 8046, configured to calculate total loss of the first loss function, the second loss function, and the third loss function, adjust parameters of the three-dimensional reconstruction network and the image enhancement network according to back propagation, perform iterative training to minimize the total loss, and obtain a trained three-dimensional reconstruction network and a trained image enhancement network.
Optionally, the sample set includes a plurality of groups of sample face images, as shown in fig. 12, the sample submodule 8041 includes:
a first obtaining unit 80411, configured to obtain a preset number of consecutive face frame images as a set of sample face images to be processed;
a first adding unit 80412, configured to randomly select at least one blur kernel and at least one noise kernel from the pre-prepared blur kernels and noise kernels to perform blur addition and noise addition on the current sample face image to be processed, so as to obtain a processed sample face image;
a second obtaining unit 80413, configured to obtain, according to the current sample face image to be processed, a face key point position label, a pixel mean square error label, and a reconstruction parameter label that correspond to the current sample face image to be processed;
a second adding unit 80414, configured to add the processed sample face image, and the face key point position label, the pixel mean square error label, and the reconstruction parameter label corresponding to the current sample face image to be processed, as a group of sample face images, to the sample set.
Optionally, the sample set includes a plurality of groups of sample face images, as shown in fig. 13, the sample submodule 8041 includes:
a third obtaining unit 80415, configured to obtain 2n +1 continuous face frame images as a group of sample face images to be processed;
a third adding unit 80416, configured to randomly select at least one blur kernel and at least one noise kernel from the pre-prepared blur kernels and noise kernels to perform blur addition and noise addition on the current sample face image to be processed, so as to obtain 2n +1 processed sample face images;
a connection unit 80417, configured to perform channel connection on the 2n +1 processed sample face images to obtain target sample face images;
a fourth obtaining unit 80418, configured to obtain, according to the sample face image to be processed, a face key point position label, a pixel mean square error label, and a reconstruction parameter label corresponding to the nth sample face image to be processed;
a fourth adding unit 80419, configured to add the target sample face image, and the face keypoint location label, the pixel mean square error label, and the reconstruction parameter label corresponding to the nth sample face image to be processed to the sample set as a group of sample face images.
It should be noted that the face image processing apparatus provided in the embodiment of the present invention may be applied to a mobile phone, a monitor, a computer, a server, and other devices that can perform face image processing.
The face image processing device provided by the embodiment of the invention can realize each process realized by the face image processing method in the method embodiment, and can achieve the same beneficial effect. To avoid repetition, further description is omitted here.
Referring to fig. 14, fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, as shown in fig. 14, including: a memory 1402, a processor 1401, and a computer program stored on the memory 1402 and executable on the processor 1401, wherein:
the processor 1401 is used for calling the computer program stored in the memory 1402, and executing the following steps:
acquiring a preset number of continuous face frame images;
inputting the continuous face frame images into a pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction, and rendering the three-dimensional face images into two-dimensional face images;
carrying out image denoising and deblurring on the continuous face frame images through the two-dimensional face images and a pre-trained image enhancement network to obtain target face images;
the three-dimensional reconstruction network and the image enhancement network are in cascade arrangement.
Optionally, the three-dimensional reconstruction network includes a parameter extraction network and a parameter reconstruction network, and the step, performed by the processor 1401, of inputting the consecutive face frame images into the pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction and rendering the obtained three-dimensional face into a two-dimensional face image includes:
extracting face reconstruction parameters from the consecutive face frame images through the parameter extraction network;
reconstructing the face reconstruction parameters through the parameter reconstruction network to obtain a three-dimensional face;
and performing two-dimensional rendering on the three-dimensional face to obtain a two-dimensional face image.
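The parameter-extraction / parameter-reconstruction split can be illustrated with a linear morphable-model sketch. The basis, its dimensions, and the orthographic "rendering" below are hypothetical; the patent does not specify the face model or renderer.

```python
import numpy as np

rng = np.random.default_rng(1)

n_vertices, n_params = 100, 8
mean_shape = rng.random((n_vertices, 3))       # mean 3D face (vertex positions)
basis = rng.random((n_vertices, 3, n_params))  # assumed shape/expression basis

def extract_params(frames):
    """Stand-in for the parameter extraction network: map the input frames
    to a reconstruction-parameter vector (a real network would be a CNN)."""
    return np.full(n_params, frames.mean())

def reconstruct_face(params):
    """Parameter reconstruction: mean shape plus a linear combination of
    basis vectors, in the style of 3DMM-like models."""
    return mean_shape + basis @ params

frames = rng.random((5, 64, 64))
params = extract_params(frames)    # face reconstruction parameters
face3d = reconstruct_face(params)  # three-dimensional face (vertices)
face2d = face3d[:, :2]             # trivial orthographic projection as the 2D 'rendering'
print(face3d.shape, face2d.shape)
```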
Optionally, the image enhancement network includes a denoising convolution kernel and a deblurring convolution kernel, and the step, performed by the processor 1401, of performing image denoising and deblurring on the consecutive face frame images by using the two-dimensional face image and the pre-trained image enhancement network to obtain the target face image includes:
inputting the two-dimensional face image into the image enhancement network as prior information for the consecutive face frame images;
denoising the consecutive face frame images through the denoising convolution kernel, with the prior information as a reference;
deblurring the consecutive face frame images through the deblurring convolution kernel, with the prior information as a reference;
and outputting the target face image.
Optionally, the three-dimensional reconstruction network and the image enhancement network are jointly trained, and the processor 1401 is further configured to perform the joint training, which includes:
constructing a sample set, wherein the sample set includes sample face images and the face keypoint position labels, pixel mean square error labels, and reconstruction parameter labels corresponding to the sample face images;
inputting the sample set into the cascaded three-dimensional reconstruction network and image enhancement network to train the two networks;
calculating a first loss function between the reconstruction parameters predicted by the three-dimensional reconstruction network and the reconstruction parameter labels;
calculating a second loss function between the face keypoint positions predicted in the two-dimensional face image by the three-dimensional reconstruction network and the face keypoint position labels;
calculating a third loss function between the predicted pixel mean square error of the target face image output by the image enhancement network and the pixel mean square error labels;
and calculating a total loss from the first, second, and third loss functions, adjusting the parameters of the three-dimensional reconstruction network and the image enhancement network through back propagation, and training iteratively until the total loss is minimized, to obtain the trained three-dimensional reconstruction network and the trained image enhancement network.
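The joint objective above can be sketched as a sum of the three losses. The use of plain mean-square distances, the equal weighting, and the 68-keypoint count are assumptions for illustration; the patent does not fix the exact loss forms.

```python
import numpy as np

rng = np.random.default_rng(2)

def mse(a, b):
    """Mean squared error between a prediction and its label."""
    return float(np.mean((a - b) ** 2))

# Predictions from the cascaded networks and the corresponding labels
# (random placeholders standing in for real network outputs).
pred_params, param_label = rng.random(8), rng.random(8)
pred_kpts, kpt_label = rng.random((68, 2)), rng.random((68, 2))
pred_image, clean_image = rng.random((64, 64)), rng.random((64, 64))

loss1 = mse(pred_params, param_label)  # first loss: reconstruction parameters
loss2 = mse(pred_kpts, kpt_label)      # second loss: face keypoint positions
loss3 = mse(pred_image, clean_image)   # third loss: pixel mean square error

total_loss = loss1 + loss2 + loss3     # total loss driving back-propagation
print(total_loss)
```

Because one scalar total loss is back-propagated through both cascaded networks, the reconstruction network receives gradient from the enhancement objective as well as its own.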
Optionally, the sample set includes a plurality of groups of sample face images, and the constructing of the sample set performed by the processor 1401 includes:
acquiring a preset number of consecutive face frame images as a group of to-be-processed sample face images;
randomly selecting at least one blur kernel and at least one noise kernel from pre-prepared blur kernels and noise kernels, and performing blur addition and noise addition on the current to-be-processed sample face image to obtain a processed sample face image;
acquiring, according to the current to-be-processed sample face image, the face keypoint position label, pixel mean square error label, and reconstruction parameter label corresponding to that image;
and adding the processed sample face image, together with the face keypoint position label, pixel mean square error label, and reconstruction parameter label corresponding to the current to-be-processed sample face image, to the sample set as a group of sample face images.
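Producing a degraded training sample from a clean frame can be sketched as convolving with a randomly chosen blur kernel and then adding noise. The box kernels, Gaussian noise model, and kernel/noise sizes below are illustrative; the patent only says the kernels are pre-prepared and randomly selected.

```python
import numpy as np

rng = np.random.default_rng(3)

def box_kernel(k):
    """A simple k-by-k averaging kernel used as a stand-in blur kernel."""
    return np.ones((k, k)) / (k * k)

def blur(image, kernel):
    """Naive 'same'-size 2D convolution used as the blur-addition step."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad, mode="edge")
    out = np.zeros_like(image)
    h, w = image.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[i:i + k, j:j + k] * kernel)
    return out

blur_kernels = [box_kernel(3), box_kernel(5)]  # pre-prepared blur kernels
noise_sigmas = [0.01, 0.05]                    # stand-ins for pre-prepared noise kernels

clean = rng.random((32, 32))                   # clean to-be-processed sample frame
kernel = blur_kernels[rng.integers(len(blur_kernels))]   # random blur kernel
sigma = noise_sigmas[rng.integers(len(noise_sigmas))]    # random noise level
degraded = blur(clean, kernel) + rng.normal(0.0, sigma, clean.shape)
print(degraded.shape)
```

The clean frame then supplies the labels (keypoints, pixel MSE target, reconstruction parameters) while the degraded frame is the network input.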
Optionally, the sample set includes a plurality of groups of sample face images, and the constructing of the sample set performed by the processor 1401 includes:
acquiring 2n+1 consecutive face frame images as a group of to-be-processed sample face images;
randomly selecting at least one blur kernel and at least one noise kernel from the pre-prepared blur kernels and noise kernels, and performing blur addition and noise addition on the current to-be-processed sample face images to obtain 2n+1 processed sample face images;
performing channel concatenation on the 2n+1 processed sample face images to obtain a target sample face image;
acquiring, according to the to-be-processed sample face images, the face keypoint position label, pixel mean square error label, and reconstruction parameter label corresponding to the nth to-be-processed sample face image;
and adding the target sample face image, together with the face keypoint position label, pixel mean square error label, and reconstruction parameter label corresponding to the nth to-be-processed sample face image, to the sample set as a group of sample face images.
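The channel connection of the 2n+1 processed frames into one target sample can be sketched as concatenation along the channel axis, with the labels taken from the centre (nth) frame. The frame size, n = 2, and RGB channel count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

n = 2                                         # so 2n + 1 = 5 consecutive frames
frames = rng.random((2 * n + 1, 64, 64, 3))   # 5 processed RGB sample face frames

# Channel concatenation: stack the 5 frames' channels into one tensor,
# giving a single (64, 64, 15) target sample face image.
target_sample = np.concatenate(list(frames), axis=-1)

# Labels correspond to the centre (nth, 0-indexed) to-be-processed frame.
centre_frame = frames[n]
print(target_sample.shape, centre_frame.shape)
```

Feeding all 2n+1 frames as channels lets the networks exploit temporal redundancy across the consecutive frames in a single forward pass.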
The electronic device may be a mobile phone, a monitor, a computer, a server, or any other device capable of face image processing.
The electronic device provided by the embodiment of the present invention can implement each process of the face image processing method in the method embodiments and achieve the same beneficial effects; to avoid repetition, details are not repeated here.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements each process of the face image processing method provided in the embodiments of the present invention and achieves the same technical effects; to avoid repetition, details are not repeated here.
It will be understood by those skilled in the art that all or part of the processes of the methods in the above embodiments can be implemented by a computer program; the program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure describes only preferred embodiments of the present invention; it is to be understood that the invention is not limited thereto, and the scope of the invention is defined by the appended claims.

Claims (10)

1. A face image processing method is characterized by comprising the following steps:
acquiring a preset number of consecutive face frame images;
inputting the consecutive face frame images into a pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction to obtain a three-dimensional face, and rendering the three-dimensional face into a two-dimensional face image; and
performing image denoising and deblurring on the consecutive face frame images by using the two-dimensional face image and a pre-trained image enhancement network to obtain a target face image;
wherein the three-dimensional reconstruction network and the image enhancement network are arranged in cascade.
2. The method of claim 1, wherein the three-dimensional reconstruction network comprises a parameter extraction network and a parameter reconstruction network, and the inputting the consecutive face frame images into the pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction to obtain a three-dimensional face and rendering the three-dimensional face into a two-dimensional face image comprises:
extracting face reconstruction parameters from the consecutive face frame images through the parameter extraction network;
reconstructing the face reconstruction parameters through the parameter reconstruction network to obtain the three-dimensional face; and
performing two-dimensional rendering on the three-dimensional face to obtain the two-dimensional face image.
3. The method of claim 2, wherein the image enhancement network comprises a denoising convolution kernel and a deblurring convolution kernel, and the performing image denoising and deblurring on the consecutive face frame images by using the two-dimensional face image and the pre-trained image enhancement network to obtain the target face image comprises:
inputting the two-dimensional face image into the image enhancement network as prior information for the consecutive face frame images;
denoising the consecutive face frame images through the denoising convolution kernel, with the prior information as a reference;
deblurring the consecutive face frame images through the deblurring convolution kernel, with the prior information as a reference; and
outputting the target face image.
4. The method of claim 1, wherein the three-dimensional reconstruction network and the image enhancement network are jointly trained, the joint training comprising:
constructing a sample set, wherein the sample set comprises sample face images and the face keypoint position labels, pixel mean square error labels, and reconstruction parameter labels corresponding to the sample face images;
inputting the sample set into the cascaded three-dimensional reconstruction network and image enhancement network to train the two networks;
calculating a first loss function between the reconstruction parameters predicted by the three-dimensional reconstruction network and the reconstruction parameter labels;
calculating a second loss function between the face keypoint positions predicted in the two-dimensional face image by the three-dimensional reconstruction network and the face keypoint position labels;
calculating a third loss function between the predicted pixel mean square error of the target face image output by the image enhancement network and the pixel mean square error labels; and
calculating a total loss from the first, second, and third loss functions, adjusting the parameters of the three-dimensional reconstruction network and the image enhancement network through back propagation, and training iteratively until the total loss is minimized, to obtain the trained three-dimensional reconstruction network and the trained image enhancement network.
5. The method of claim 4, wherein the sample set comprises a plurality of groups of sample face images, and constructing the sample set comprises:
acquiring a preset number of consecutive face frame images as a group of to-be-processed sample face images;
randomly selecting at least one blur kernel and at least one noise kernel from pre-prepared blur kernels and noise kernels, and performing blur addition and noise addition on the current to-be-processed sample face image to obtain a processed sample face image;
acquiring, according to the current to-be-processed sample face image, the face keypoint position label, pixel mean square error label, and reconstruction parameter label corresponding to that image; and
adding the processed sample face image, together with the face keypoint position label, pixel mean square error label, and reconstruction parameter label corresponding to the current to-be-processed sample face image, to the sample set as a group of sample face images.
6. The method of claim 4, wherein the sample set comprises a plurality of groups of sample face images, and constructing the sample set comprises:
acquiring 2n+1 consecutive face frame images as a group of to-be-processed sample face images;
randomly selecting at least one blur kernel and at least one noise kernel from pre-prepared blur kernels and noise kernels, and performing blur addition and noise addition on the current to-be-processed sample face images to obtain 2n+1 processed sample face images;
performing channel concatenation on the 2n+1 processed sample face images to obtain a target sample face image;
acquiring, according to the to-be-processed sample face images, the face keypoint position label, pixel mean square error label, and reconstruction parameter label corresponding to the nth to-be-processed sample face image; and
adding the target sample face image, together with the face keypoint position label, pixel mean square error label, and reconstruction parameter label corresponding to the nth to-be-processed sample face image, to the sample set as a group of sample face images.
7. A face image processing apparatus, characterized in that the apparatus comprises:
an acquisition module, configured to acquire a preset number of consecutive face frame images;
a first processing module, configured to input the consecutive face frame images into a pre-trained three-dimensional reconstruction network for three-dimensional face reconstruction, and render the obtained three-dimensional face into a two-dimensional face image; and
a second processing module, configured to perform image denoising and deblurring on the consecutive face frame images by using the two-dimensional face image and a pre-trained image enhancement network to obtain a target face image;
wherein the three-dimensional reconstruction network and the image enhancement network are arranged in cascade.
8. The apparatus of claim 7, wherein the three-dimensional reconstruction network comprises a parameter extraction network and a parameter reconstruction network, and the first processing module comprises:
a parameter extraction submodule, configured to extract face reconstruction parameters from the consecutive face frame images through the parameter extraction network;
a parameter reconstruction submodule, configured to reconstruct the face reconstruction parameters through the parameter reconstruction network to obtain a three-dimensional face; and
a rendering submodule, configured to perform two-dimensional rendering on the three-dimensional face to obtain a two-dimensional face image.
9. An electronic device, comprising: memory, processor and computer program stored on the memory and executable on the processor, the processor implementing the steps in the face image processing method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, which, when being executed by a processor, carries out the steps in the face image processing method according to any one of claims 1 to 6.
CN202011428827.4A 2020-12-09 2020-12-09 Face image processing method and device, electronic equipment and storage medium Pending CN112509144A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011428827.4A CN112509144A (en) 2020-12-09 2020-12-09 Face image processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN112509144A true CN112509144A (en) 2021-03-16

Family

ID=74970066

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011428827.4A Pending CN112509144A (en) 2020-12-09 2020-12-09 Face image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112509144A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022226850A1 (en) * 2021-04-28 2022-11-03 Oppo广东移动通信有限公司 Point cloud quality enhancement method, encoding and decoding methods, apparatuses, and storage medium
WO2022237249A1 (en) * 2021-05-10 2022-11-17 上海商汤智能科技有限公司 Three-dimensional reconstruction method, apparatus and system, medium, and computer device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070127787A1 (en) * 2005-10-24 2007-06-07 Castleman Kenneth R Face recognition system and method
CN108629333A (en) * 2018-05-25 2018-10-09 厦门市美亚柏科信息股份有限公司 A kind of face image processing process of low-light (level), device, equipment and readable medium
CN108648163A (en) * 2018-05-17 2018-10-12 厦门美图之家科技有限公司 A kind of Enhancement Method and computing device of facial image
WO2020036782A2 (en) * 2018-08-10 2020-02-20 University Of Connecticut Methods and systems for object recognition in low illumination conditions
CN110956691A (en) * 2019-11-21 2020-04-03 Oppo广东移动通信有限公司 Three-dimensional face reconstruction method, device, equipment and storage medium
CN111222446A (en) * 2019-12-31 2020-06-02 Oppo广东移动通信有限公司 Face recognition method, face recognition device and mobile terminal




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination