CN110009573B - Model training method, image processing method, device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN110009573B
CN110009573B (application CN201910087661.5A)
Authority
CN
China
Prior art keywords
image
face
semantic segmentation
matrix
segmentation result
Prior art date
Legal status
Active
Application number
CN201910087661.5A
Other languages
Chinese (zh)
Other versions
CN110009573A
Inventor
刘思阳 (Liu Siyang)
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201910087661.5A
Publication of CN110009573A
Application granted
Publication of CN110009573B

Classifications

    • G06T Image data processing or generation, in general (G Physics; G06 Computing; calculating or counting)
    • G06T5/00 Image enhancement or restoration → G06T5/73 Deblurring; Sharpening
    • G06T7/00 Image analysis → G06T7/10 Segmentation; Edge detection → G06T7/11 Region-based segmentation
    • G06T2207/00 Indexing scheme for image analysis or image enhancement → G06T2207/10 Image acquisition modality → G06T2207/10004 Still image; Photographic image
    • G06T2207/10 Image acquisition modality → G06T2207/10024 Color image
    • G06T2207/20 Special algorithmic details → G06T2207/20081 Training; Learning
    • G06T2207/20 Special algorithmic details → G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/30 Subject of image; Context of image processing → G06T2207/30196 Human being; Person → G06T2207/30201 Face

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a model training method, an image processing method, a model training device, an image processing device, electronic equipment and a computer-readable storage medium. The training method comprises: acquiring a training sample set, wherein the training sample set comprises a first image and a third image, both with clear face regions belonging to the same user, and a second image obtained by adding noise to the face region of the first image; acquiring a first face semantic segmentation result of the second image and a second face semantic segmentation result of the third image; inputting the second image, the first face semantic segmentation result, the third image and the second face semantic segmentation result into a neural network model to obtain a fourth image; identifying difference data on the face region between the fourth image and the first image; and iteratively updating the neural network model according to the difference data. The iteratively updated neural network model is used for reconstructing the face region of an image with a blurred face region to generate an image with a clear face region.

Description

Model training method, image processing method, device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of machine learning technologies, and in particular, to a method and an apparatus for model training and image processing, an electronic device, and a storage medium.
Background
With the continuous development of computer technology, more and more electronic devices are capable of capturing images. However, for various reasons, the face region in a captured or post-processed image often turns out blurred. Image blur has many causes, such as a low hardware configuration of the electronic device that captures the image, a poor shooting environment, a reduction in resolution caused by post-processing, image damage, and the like.
For an image whose face region is only slightly blurred, the related art mainly applies face deblurring methods to process the blurred image.
However, when the face region is heavily blurred, that is, when the face region is seriously damaged, the deblurring methods in the related art repair the damaged face region in the image poorly.
Disclosure of Invention
The invention provides a model training method, an image processing method, a model training device, an image processing device, an electronic device and a storage medium, so as to solve the problem that face deblurring methods in the related art have a poor repairing effect on images whose face regions are severely damaged.
In order to solve the above problem, according to a first aspect of the present invention, there is disclosed a model training method comprising:
acquiring a training sample set, wherein the training sample set comprises a first image, a second image and a third image, the second image is an image obtained by adding noise to a face region of the first image, the face region of the third image and the face region of the first image belong to the same user, and the first image and the third image both comprise clear face regions;
obtaining a first face semantic segmentation result of the second image;
acquiring a second face semantic segmentation result of the third image;
inputting the second image, the first face semantic segmentation result, the third image and the second face semantic segmentation result into a neural network model to obtain a fourth image;
identifying difference data on a face region between the fourth image and the first image;
iteratively updating the neural network model according to the difference data;
the iteratively updated neural network model is used for reconstructing the face region of any image whose face region is blurred, so as to generate an image with a clear face region.
According to a second aspect of the present invention, there is disclosed an image processing method comprising:
acquiring a first image to be reconstructed, wherein the first image comprises a damaged face region;
acquiring a second image matched with the first image, wherein the face region in the second image and the face region in the first image belong to the same user, and the second image comprises a clear face region;
obtaining a first face semantic segmentation result of the first image;
acquiring a second face semantic segmentation result of the second image;
and inputting the first image, the first face semantic segmentation result, the second image and the second face semantic segmentation result into a face reconstruction model which is trained in advance, so that the face reconstruction model reconstructs a face region of the first image according to the second image, the second face semantic segmentation result and the first face semantic segmentation result, and a third image with a clear face region is generated.
According to a third aspect of the present invention, there is disclosed a model training apparatus comprising:
the first acquisition module is used for acquiring a training sample set, wherein the training sample set comprises a first image, a second image and a third image, the second image is an image obtained by adding noise to the face region of the first image, the face region in the third image and the face region in the first image belong to the same user, and the first image and the third image both comprise clear face regions;
the second acquisition module is used for acquiring a first face semantic segmentation result of the second image;
the third acquisition module is used for acquiring a second face semantic segmentation result of the third image;
the input module is used for inputting the second image, the first face semantic segmentation result, the third image and the second face semantic segmentation result into a neural network model to obtain a fourth image;
the identification module is used for identifying difference data on a face area between the fourth image and the first image;
the updating module is used for carrying out iterative updating on the neural network model according to the difference data;
the iteratively updated neural network model is used for reconstructing the face region of any image whose face region is blurred, so as to generate an image with a clear face region.
According to a fourth aspect of the present invention, there is disclosed an image processing apparatus comprising:
the first acquisition module is used for acquiring a first image to be reconstructed, wherein the first image comprises a damaged face region;
the second acquisition module is used for acquiring a second image matched with the first image, wherein the face area in the second image and the face area in the first image belong to the same user, and the second image comprises a clear face area;
the third acquisition module is used for acquiring a first face semantic segmentation result of the first image;
the fourth acquisition module is used for acquiring a second face semantic segmentation result of the second image;
and the reconstruction module is used for inputting the first image, the first face semantic segmentation result, the second image and the second face semantic segmentation result into a face reconstruction model which is trained in advance, so that the face reconstruction model reconstructs a face region of the first image according to the second image, the second face semantic segmentation result and the first face semantic segmentation result, and a third image with a clear face region is generated.
According to a fifth aspect of the present invention, the present invention also discloses an electronic device, comprising: a memory, a processor, and a model training program or an image processing program stored on the memory and executable on the processor, the model training program implementing the steps of the model training method as described in any one of the above when executed by the processor, the image processing program implementing the steps of the image processing method as described above when executed by the processor.
According to a sixth aspect of the present invention, the present invention also discloses a computer readable storage medium having stored thereon a model training program or an image processing program, the model training program, when executed by a processor, implementing the steps in the model training method according to any one of the above, the image processing program, when executed by the processor, implementing the steps of the image processing method according to the above.
Compared with the prior art, the invention has the following advantages:
When the embodiment of the invention trains the neural network model for face reconstruction, not only is the second image, whose face region is seriously damaged, input into the model, but its face semantic segmentation result is input together with it; furthermore, a third image with a clear face region of the same user as in the second image, together with the face semantic segmentation result of the third image, is also input into the neural network model to be trained, and the neural network model is iteratively updated according to the difference data between the fourth image output by the model and the first image with a clear face region. In this way, the trained neural network model can repair the damaged face region in the second image by referring to the face semantic segmentation result of the damaged second image and the face semantic segmentation result of the undamaged third image of the same user; because a clear face image of the user is referred to during repair, a fourth image with a clear face region can be generated, which improves the repair effect on seriously damaged face regions.
Drawings
FIG. 1 is a flow chart of the steps of one embodiment of a model training method of the present invention;
FIG. 2 is a flow chart of steps in another embodiment of a model training method of the present invention;
FIG. 3 is a flow chart of steps of yet another embodiment of a model training method of the present invention;
FIG. 4 is a flow chart of steps of yet another embodiment of a model training method of the present invention;
FIG. 5 is a flow chart of steps of yet another embodiment of a model training method of the present invention;
FIG. 6 is a flow chart of the steps of an embodiment of an image processing method of the present invention;
FIG. 7 is a block diagram of a model training apparatus according to an embodiment of the present invention;
fig. 8 is a block diagram of an image processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to fig. 1, a flowchart illustrating steps of an embodiment of a model training method according to the present invention is shown, which may specifically include the following steps:
101, acquiring a training sample set, wherein the training sample set comprises a first image, a second image and a third image;
in order to train the neural network model of the embodiment of the present invention, so that the trained neural network model can be used as a face reconstruction model to repair a face region of an image with a damaged face region, a training sample set needs to be obtained, where each group of samples in the training sample set includes three images, which are the first image, the second image, and the third image.
The first image and the third image are both clear images of human face regions, and there are many parameters affecting the image definition, such as hardware configuration parameters of an electronic device for shooting images, shooting environment parameters, post-processing parameters of images, image resolution, image damage degree, and the like.
The aim of training the model in the embodiment of the invention is to enable the trained model to reconstruct the face of a face region with serious damage in an image and output a clear image of the face region.
The second image is an image obtained by performing noise addition processing on the face region of the first image, so that the face region in the second image is seriously damaged (for example, partial image information in the face region is missing or wrong, and the like).
In addition, the first image and the third image both include a clear face region, and the face region in the third image and the face region in the first image belong to the same user, that is, the third image is another clear image of the face region of the user in the first image.
When the training sample set is obtained, a plurality of first images with clear face regions can be acquired; noise is then added to the face region of each first image, yielding the second image matched with each first image; and another image of the user corresponding to the face region in each first image is acquired, yielding a plurality of third images. In this way, a plurality of groups of image samples are obtained, where each group comprises a sharp image and a blurred image of the same face, together with another image in which the face of the corresponding user is sharp.
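As a concrete illustration of this sample-construction step, the sketch below corrupts the face region of a sharp first image to synthesize the matching second image. The Gaussian noise model, the noise_std value and the face_box parameter are assumptions made only for this example; the patent does not fix the type or strength of the added noise.

```python
import numpy as np

def make_damaged_copy(first_image, face_box, noise_std=60.0, seed=None):
    """Synthesize the 'second image' of a training sample by adding heavy noise to the
    face region of a sharp 'first image' (an H x W x 3 uint8 array).
    face_box = (x, y, box_w, box_h) is an assumed bounding box of the face region."""
    rng = np.random.default_rng(seed)
    x, y, box_w, box_h = face_box
    damaged = first_image.astype(np.float32).copy()
    noise = rng.normal(0.0, noise_std, size=damaged[y:y + box_h, x:x + box_w].shape)
    damaged[y:y + box_h, x:x + box_w] += noise      # corrupt only the face region
    return np.clip(damaged, 0, 255).astype(np.uint8)
```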
When the first image and the third image are obtained, a frame of image including a clear face region can be extracted in a mode of extracting a frame from a video stream, and the image including the clear face region can also be directly obtained from an image set.
Optionally, before performing step 102, if the neural network model has a size requirement on the input image, the method of the embodiment of the present invention may further include a step of preprocessing the first image, the third image, and the second image in the training sample set.
The specific preprocessing method may be to stretch, compress, fill (i.e., add a white edge to the outer edge of the image to make the image reach a preset size), and so on, to adjust the size of the image in the training sample set to a preset size (e.g., 400 × 400) required by the neural network model.
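A minimal preprocessing sketch is shown below: it fills the image with a white border to make it square and then resizes it to the preset 400 × 400 size. The use of OpenCV and the particular combination of filling and resizing are illustrative choices; the text allows stretching, compressing or filling as alternatives.

```python
import cv2

def preprocess_to_square(img, target=400):
    """Pad an H x W x 3 image with a white edge to a square, then resize to target x target."""
    h, w = img.shape[:2]
    side = max(h, w)
    top = (side - h) // 2
    bottom = side - h - top
    left = (side - w) // 2
    right = side - w - left
    padded = cv2.copyMakeBorder(img, top, bottom, left, right,
                                cv2.BORDER_CONSTANT, value=(255, 255, 255))
    return cv2.resize(padded, (target, target))
```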
102, acquiring a first face semantic segmentation result of the second image;
Before the neural network model is trained, which face components the neural network model should refer to when repairing a face region can be configured in advance. For example, the pre-configured face components may include, but are not limited to, the nose, eyes, mouth, eyebrows and ears, and the face semantic segmentation result then expresses the position information of these face components in the second image. That is, the face semantic segmentation result includes information indicating which region in the second image is the nose, which region is the eyes, which region is the mouth, which region is the eyebrows, and which region is the ears.
When obtaining the human face semantic segmentation result of an image, the method can be implemented by using a traditional human face semantic segmentation model or a human face semantic segmentation model developed in the future, and can also be implemented by adopting other modes of obtaining the human face semantic segmentation result.
103, acquiring a second face semantic segmentation result of the third image;
the execution principle of this step is similar to that of step 102, and is not described here again.
The execution sequence of steps 102 and 103 is not limited in the present invention.
104, inputting the second image, the first face semantic segmentation result, the third image and the second face semantic segmentation result into a neural network model to obtain a fourth image;
In the embodiment of the present invention, when training the neural network model, not only the second image (hereinafter referred to as the damaged image) is input into the neural network model, but the first face semantic segmentation result corresponding to the damaged image is also input together with it, so that the trained neural network model can refer to the positions of the components in the face region of the damaged image when reconstructing it;
in addition, the embodiment of the present invention may further input the third image (hereinafter referred to as the standard image) and the second face semantic segmentation result corresponding to the standard image into the neural network model together, so that the trained neural network model can reconstruct the damaged image with reference to the positions of the components in the face region of the standard image. Even if the face region in the damaged image is seriously damaged, the neural network model (i.e., the face reconstruction model) trained by the embodiment of the invention can still repair the face region in the damaged image and generate a fourth image (hereinafter referred to as the reconstructed image) with a clear face region.
The network structure of the face reconstruction model in the embodiment of the present invention may adopt any network structure of a neural network model, which is not limited in the present invention.
Step 105, identifying difference data on a face area between the fourth image and the first image;
because a group of image samples in the training sample set also includes a first image with a clear face corresponding to an image with a damaged face region, difference data between a reconstructed image output by the neural network model and a clear image in the group of image samples can be identified, and the difference data can be understood as total loss of the neural network model in the training process.
106, performing iterative update on the neural network model according to the difference data;
The iteratively updated neural network model, namely the face reconstruction model, is used for reconstructing the face region of any image whose face region is blurred (especially seriously damaged), so as to generate an image with a clear face region.
Here, the parameters of each network layer in the face reconstruction model can be iteratively updated by using the total loss of the current round of training.
Then, any one group of image samples in the training sample set may be used to execute the above steps 101 to 106, thereby completing one round of iterative update of the neural network model. During model training, the training sample set is used to perform multiple rounds of iterative updates, i.e., the method shown in fig. 1 is executed in a loop until the difference data converges, that is, until the total loss no longer decreases and remains stable.
The number of iterative updates (i.e., the number of rounds) may be determined from an empirical value; preferably the number of rounds is higher than the empirical value. For example, if the empirical value is 2000 rounds, 3000 rounds may be trained here.
And finally, the neural network model after multi-round iterative updating can realize the restoration of the face area in the image, so that the image with the original seriously damaged face area can be restored by the neural network model.
Therefore, the neural network model updated through multiple rounds of iteration is used for repairing the face region of any image with seriously damaged face region to generate an image with clear face region.
By means of the technical solution of the above embodiment of the present invention, when training the neural network model for face reconstruction, not only is the second image with a seriously damaged face region input into the model, but the face semantic segmentation result of the second image is input together with it; in addition, the third image with a clear face region of the same user as in the second image and the face semantic segmentation result of the third image are also input into the neural network model to be trained, and the neural network model is iteratively updated according to the difference data between the fourth image output by the model and the first image with a clear face region. Therefore, the trained neural network model can repair the damaged face region in the second image by referring to the face semantic segmentation results of both the damaged second image and the undamaged third image of the same user, and a clear face image of the user is referred to during repair, so that a fourth image with a clear face region can be generated, improving the repair effect on images with seriously damaged face regions.
Alternatively, as shown in fig. 2, when step 102 is executed, it may be implemented by S201 and S202:
s201, acquiring a second image matrix matched with the second image;
The second image is an RGB image, so each pixel in the second image includes an R (red) value, a G (green) value and a B (blue) value. For example, if the size of the second image is W × H, i.e., the second image is W pixels wide and H pixels high, then each colour channel of the second image forms a W × H × 1 matrix: the R values form one matrix, the G values form one matrix, and the B values form one matrix, each of width W and height H. The second image matrix of the second image is therefore a W × H × 3 image matrix, i.e., a matrix comprising three layers of W × H.
Thus, an image matrix of an image is the image data expressed in a matrix manner, or the matrix structure of the image.
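For illustration, the image matrix described above can be held in a NumPy array; in NumPy the patent's W × H × 3 matrix is stored with shape (H, W, 3):

```python
import numpy as np

W, H = 400, 400
# One W x H layer for each of the R, G and B values of the second image.
second_image_matrix = np.zeros((H, W, 3), dtype=np.uint8)
print(second_image_matrix.shape)   # (400, 400, 3)
```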
S202, inputting the second image matrix into a pre-trained first face semantic segmentation model to obtain a first global face semantic segmentation matrix matched with a plurality of face parts.
The network structure of the face semantic segmentation model may be any semantic segmentation structure, such as a VGG (Oxford Visual Geometry Group) model.
In one example, as shown in fig. 3, a second image matrix (i.e., a damaged image matrix) may be input to the face semantic segmentation model 1; a third image matrix, i.e. a standard image matrix, may be input into the face semantic segmentation model 2. The network structures of the two face semantic segmentation models may be the same (the network structure of the face semantic segmentation model may refer to the related description later, and will not be described herein again). But the training modes of the two have certain difference.
Specifically, because the human face semantic segmentation model has a poor learning effect on damaged images, in the embodiment of the present invention, when the human face semantic segmentation model 2 is trained, the human face semantic segmentation model 2 may be trained by using a clear image, and after the loss is no longer reduced and remains stable after training, the trained human face semantic segmentation model 2 is obtained.
Then, the parameters of the face semantic segmentation model 2 are optimized by using the damaged image to obtain the face semantic segmentation model 1 of the embodiment of the present invention, so that the finally trained face semantic segmentation model 1 of the embodiment of the present invention can perform relatively accurate face semantic segmentation on the input damaged image, i.e., the second image, and output a first global face semantic segmentation matrix, i.e., the global face semantic segmentation matrix 1 in fig. 3.
After the damaged image matrix is input to the human face semantic segmentation model 1 trained in advance, the human face semantic segmentation model 1 may perform semantic segmentation on the damaged image matrix, and segment each human face component to be segmented, which may be embodied as setting the value of a pixel point of a human face component (e.g., the nose, the eyes, the mouth, the eyebrows, and the ears described in the above step 102) in the damaged image matrix to 1, and setting the value of other pixel points to 0, so as to obtain the global human face semantic segmentation matrix 1 matched with a plurality of human face components (e.g., the nose, the eyes, the mouth, the eyebrows, and the ears described in the above step 102) configured in advance. The pre-configured face components herein are a plurality of face components that the face semantic segmentation model 1 can support segmenting and recognizing after training.
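The sketch below shows one way such a global face semantic segmentation matrix could be assembled from a per-pixel label map. The label convention (0 for background, 1 to 11 for the pre-configured face components) is an assumption made for illustration, not something fixed by the text.

```python
import numpy as np

def to_global_segmentation_matrix(label_map, num_parts=11):
    """label_map: H x W integer array from a face semantic segmentation model,
    assumed to use 0 for background and 1..num_parts for the face components.
    Returns the global face semantic segmentation matrix with shape (H, W, num_parts)."""
    h, w = label_map.shape
    seg = np.zeros((h, w, num_parts), dtype=np.float32)
    for k in range(1, num_parts + 1):
        seg[..., k - 1] = (label_map == k).astype(np.float32)   # 1 inside component k, 0 elsewhere
    return seg
```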
Alternatively, as shown in fig. 2, when step 103 is executed, it may be implemented by S301 and S302:
s301, acquiring a third image matrix matched with the third image;
the specific implementation of S301 is similar to S201, except that the second image in S201 is replaced by the third image, and details are not repeated here.
S302, inputting the third image matrix into a pre-trained second face semantic segmentation model to obtain a second global face semantic segmentation matrix matched with a plurality of face parts;
the specific implementation of S302 is similar to S202, and the difference is only that the second image matrix in S202 is replaced with the third image matrix (i.e., the standard image matrix in fig. 3) here, so as to obtain a second global face semantic segmentation matrix (i.e., the global face semantic segmentation matrix 2 in fig. 3), which is not described herein again.
Alternatively, as shown in fig. 2, when step 104 is executed, it can be realized by S401 to S403:
s401, performing matrix connection processing on the second image matrix and the first global face semantic segmentation matrix to obtain first matrix data;
in one example, as shown in fig. 3, a damaged image matrix of a damaged image and a global face semantic segmentation matrix 1 output by the face semantic segmentation model 1 may be subjected to matrix concatenation. For example, if the number of the face components configured in advance is 11, the global face semantic segmentation matrix is a matrix W × H × 11, and the damaged image matrix (refer to the description of the second image matrix) is a matrix W × H × 3, the first matrix data of W × H × 14 can be obtained by matrix connection.
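A minimal sketch of this matrix connection, assuming 11 pre-configured face components as in the example above: the W × H × 3 damaged image matrix and the W × H × 11 global face semantic segmentation matrix are concatenated along the channel axis to give the W × H × 14 first matrix data.

```python
import numpy as np

h, w = 400, 400
damaged_image_matrix = np.zeros((h, w, 3), dtype=np.float32)   # second image matrix (W x H x 3)
global_seg_matrix = np.zeros((h, w, 11), dtype=np.float32)     # global face semantic segmentation matrix 1
first_matrix_data = np.concatenate([damaged_image_matrix, global_seg_matrix], axis=-1)
print(first_matrix_data.shape)   # (400, 400, 14) -- the W x H x 14 first matrix data
```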
S402, performing matrix connection processing on the third image matrix and the second global face semantic segmentation matrix to obtain second matrix data;
the execution principle of S402 is similar to that of S401, and reference may be specifically made to fig. 3, which is not described herein again.
And S403, inputting the first matrix data and the second matrix data into a neural network model to obtain a fourth image.
As shown in fig. 3, the first matrix data of W × H × 14 and the second matrix data of W × H × 14 are both input to a neural network model to be trained (i.e., the face reconstruction model shown in fig. 3), the face reconstruction model may be divided into two branches, the two sets of matrix data are processed respectively, then the two processed matrix data are combined into one path of data, and a face region in the damaged image is reconstructed and repaired, so as to output a fourth image, i.e., a reconstructed image.
For the face reconstruction model shown in fig. 3, that is, the network structure of the neural network model to be trained according to the embodiment of the present invention, reference may be made to the following description, and details are not repeated here.
In this way, the embodiment of the present invention obtains the RGB matrix of the second image with damaged face region (i.e. the second image matrix), and inputs the RGB matrix to the first face semantic segmentation model to obtain the first global face semantic segmentation matrix matched with the plurality of face components; acquiring an RGB matrix (namely a third image matrix) of a third image with a clear face region, and inputting the RGB matrix into a second face semantic segmentation model to acquire a second global face semantic segmentation matrix matched with a plurality of face parts; and splicing the RGB matrix with damaged face region and the global face semantic segmentation matrix and inputting the RGB matrix with clear face region and the global face semantic segmentation matrix into the neural network model to be trained to train the neural network model so as to obtain a fourth image output by the neural network model, and iteratively updating the neural network model by combining difference data between the fourth image and the first image. In the process of model training, the face part areas in the second image and the third image, the whole second image and the whole third image are expressed in a matrix form, so that the trained neural network model can accurately repair the face part areas in the second image with seriously damaged face areas based on the face semantic segmentation result of the third image with clear face areas, and the face areas in the output image after repair become clear.
Alternatively, as shown in fig. 4, when step 105 is executed, it can be realized by S501 to S504:
s501, identifying first loss data on image features between the fourth image and the first image;
in one example, as shown in fig. 3, the fourth image is the reconstructed image in fig. 3, and the first image is the sharp image in fig. 3, then in this step, the difference of the two images on the high-dimensional feature, i.e. the first loss data, can be calculated. This first loss data expresses the difference in human eye perception between the two images, and therefore, the first loss data herein may be referred to as a perceptual loss.
S502, identifying second loss data between the fourth image and the first image on a target face part according to the first face semantic segmentation result;
as described above, if the number of the face parts configured in advance is 11, the first global face semantic segmentation matrix is a matrix W × H × 11, but the user generally only pays attention to whether some face parts in the 11 face parts are clear, where the face parts that the user pays attention to and need to be clear are target face parts in the 11 face parts configured in advance.
Generally, the user wants the eyes, nose, mouth and eyebrows to be clear, but does not need the cheek area to have high definition; on the contrary, a lower definition (i.e., a higher degree of blurring) of the cheek area can even produce a face-slimming, skin-beautifying visual effect. The first face semantic segmentation result (the first global face semantic segmentation matrix, a W × H × 11 matrix) describes the position information of the 11 pre-configured face components, but what matters here are the four target face components (eyes, nose, mouth, eyebrows), so the total loss between the reconstructed image and the sharp image on the eyes, nose, mouth and eyebrows can be identified based on the face semantic segmentation result of the damaged image. Since these loss data represent the loss on different parts of the face region, the second loss data is called the structural loss.
In one example, as shown in fig. 3, this step may identify the structural loss on the target face part between the sharp image and the reconstructed image, i.e. the above-mentioned second loss data, according to the global face semantic segmentation matrix 1 output by the face semantic segmentation model 1.
S503, identifying third loss data between the fourth image and the first image at a pixel point;
in one example, as shown in fig. 3, this step may also identify the difference between each pixel point of the reconstructed image and the sharp image, and the sum of the differences between all the pixel points of the two images is the third loss data. Since the third loss data expresses a loss at the pixel level, the third loss data may also be referred to as a pixel-level loss. Specifically, the sum of losses between pixel points of the two images one by one can be calculated to obtain a pixel-level loss.
When the loss between the reconstructed image and the sharp image on any two pixel points is calculated, the two pixel points are the pixels at the same position in the reconstructed image and in the sharp image, respectively.
S504, according to preset image feature weight, face part weight and pixel point weight, carrying out weighted summation on the first loss data, the second loss data and the third loss data to obtain difference data between the fourth image and the first image on a face area.
The embodiment of the invention can pre-configure the weight aiming at the three types of losses, and the three weights of the three types of losses are flexibly configured according to the requirement. Optionally, the three weights are greater than zero and less than one such that the sum of the three weights is 1; alternatively, the three weights may each be greater than 1.
The global loss Loss_total between the fourth image (reconstructed image) and the first image (sharp image), i.e., the difference data on the face region, can be calculated by Formula 1:

$$\mathrm{Loss}_{total}=\lambda_{l2}\,\mathrm{Loss}_{l2}+\lambda_{s}\,\mathrm{Loss}_{s}+\lambda_{p}\,\mathrm{Loss}_{p}\qquad\text{(Formula 1)}$$

where Loss_l2 is the third loss data (i.e., the pixel-level L2 loss) and λ_l2 is the preset pixel-point weight; Loss_s is the second loss data (i.e., the structural loss) and λ_s is the preset face-part weight; Loss_p is the first loss data (i.e., the perceptual loss) and λ_p is the preset image-feature weight.
Alternatively, since Loss_s expresses the difference between the two images on the target face parts, Loss_l2 expresses the pixel-level loss between the two images, and Loss_p expresses the feature-level loss between the two images, in order to make the trained neural network model repair the target face components well, the weights can be configured such that λ_s > λ_p and λ_s > λ_l2.
As shown in fig. 3, the face reconstruction model may be iteratively updated using the global penalty calculated by equation 1.
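A minimal sketch of Formula 1 follows; the three losses are combined by a weighted sum. The numeric default weights are placeholders chosen only so that λ_s is the largest, as suggested above.

```python
def total_loss(loss_l2, loss_s, loss_p, lam_l2=0.2, lam_s=0.6, lam_p=0.2):
    """Formula 1: weighted sum of the pixel-level, structural and perceptual losses.
    The default weight values are illustrative; the text only recommends
    lam_s > lam_p and lam_s > lam_l2."""
    return lam_l2 * loss_l2 + lam_s * loss_s + lam_p * loss_p
```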
In this way, when difference data between the clear first image and the restored fourth image is acquired, three types of losses between the two images are respectively identified, the first loss data expresses the difference of the image characteristics of the two images, and the first loss data reflects the difference of human eye perception between human face areas; the second loss data expresses the difference between the two images on a target face part (the face part which is concerned by the user and needs to be improved in definition); the third loss data expresses the difference of the two images at the pixel level; then the neural network model is iteratively updated by weighted summation of the three types of loss data and using the summed global loss. Because the three types of losses comprise the second loss data, loss calculation with different weights can be performed on different human face semantic regions, so that the neural network model can perform high-weight learning on a specified region (namely, a region where a target human face part is located) in a targeted manner, and the repair capability and the repair effect of the neural network model on the human face part concerned by a user in a damaged image are improved.
In addition, the execution order of S501 to S503 is not limited in the present invention. In other optional embodiments, only one or two steps of the above S501 to S503 may be selected to be executed according to the difference between the image restoration requirement and the restoration standard, so as to achieve the purpose of training the neural network model.
Optionally, when S501 is executed to identify the perceptual loss, the image feature data of the fourth image and the image feature data of the first image may be obtained by respectively inputting the fourth image and the first image into a pre-trained image feature extraction model; then, the first loss data on image features between the fourth image and the first image is acquired according to the difference between the image feature data of the fourth image and the image feature data of the first image.
The image feature extraction model trained in advance may include, but is not limited to, one of the following: VGG-16, VGG-19, VGG-Face, and the like.
Specifically, the first loss data (perceptual loss) Loss_p can be obtained according to Formula 2:

$$\mathrm{Loss}_{p}=\sum_{l}\big\|\delta_{l}\big(\alpha(C:\gamma(C))\big)-\delta_{l}(B)\big\|_{2}^{2}\qquad\text{(Formula 2)}$$
where α represents the neural network model; γ represents the pre-trained face semantic segmentation model 1; and δ represents the pre-trained image feature extraction model;
B represents the first image matrix of the input sharp image (i.e., the first image); C represents the second image matrix of the input damaged image (i.e., the second image);
δ_l(x) represents the features extracted from the image x by the l-th layer of the pre-trained image feature extraction model δ; for example, if the image feature extraction model is a VGG model, δ_l(x) represents the image features extracted for the input image x by the l-th layer of the VGG model.
The colon denotes matrix connection; γ(C) represents the global face semantic segmentation matrix of the damaged image; α(C:γ(C)) represents the fourth image (i.e., the reconstructed image) output by the neural network model.
According to Formula 2, the image feature data of the fourth image includes the l-th layer features extracted from the reconstructed image by the image feature extraction model; similarly, the image feature data of the first image includes the l-th layer features extracted from the sharp image by the image feature extraction model.
According to Formula 2, the L2 loss (i.e., mean square error) between the l-th layer features of the reconstructed image and the l-th layer features of the sharp image can be calculated; when l takes a plurality of values (for example, l takes 1, 2 and 3 respectively), the three layers of features of the two images correspond to three L2 losses, and summing these L2 losses according to Formula 2 gives the perceptual loss Loss_p.
Because different layers of the image feature extraction model extract different features from an image, which layer's features are used (i.e., the value of l) can be set flexibly according to actual needs, and the L2 loss is calculated on the features of that layer. In addition, taking the image feature data of the first image as an example, the extracted feature data may come from a single layer or from several layers of the image feature extraction model, that is, l may take one value or several values.
In this way, the embodiment of the present invention obtains the image characteristics of the reconstructed image output by the neural network model, and obtains the image characteristics of the sharp image corresponding to the reconstructed image, so that the differences between the reconstructed image and the sharp image in various image characteristics can be calculated, and the total difference in various image characteristics is used as the first loss data, i.e., the perceptual loss, between the two images in the image characteristics. When the neural network model is trained by using the perception loss, the capability of the trained neural network model for repairing each human face feature in the damaged image can be improved.
Alternatively, in the case where step 102 is implemented by S201 and S202 described above, S502 may be implemented by the method shown in fig. 5:
s601, acquiring a fourth image matrix matched with the fourth image;
the execution principle of this step is similar to that of S201 described above, and is not described here again.
S602, acquiring a first image matrix matched with the first image;
the execution principle of this step is similar to that of S201 described above, and is not described here again.
The subsequent steps of S603 to S606 can be executed with reference to formula 3;
Figure GDA0003252099930000151
the same symbol marks in formula 3 as those in formula 2 are not repeated here, and reference may be made to the related description of formula 2.
Wherein, an indicates a matrix multiplication, i.e., a dot product operation;
Mkand (y) performing a dot product operation on the output result y of the semantic segmentation model trained in advance (i.e. the global face semantic segmentation matrix of the above embodiment) and the kth mask. The value of k is a positive integer, and can take one or more numerical values.
The mask is used for extracting the region of interest, and the region of interest image can be obtained by performing dot product operation on the pre-manufactured mask of the region of interest and the image to be processed. Wherein the image values within the region of interest remain unchanged and the image values outside the region of interest are all 0.
Therefore, in the embodiments of the present invention, masks may be respectively set in advance for target component regions of interest, for example, the target component regions include the above-mentioned eyes, eyebrows, nose, and mouth, four types of masks corresponding to each target component region may be respectively configured, for example, k is 1, and for example, the type 1 mask corresponds to an eye mask, M is a mask corresponding to an eye maskkAnd (y) multiplying the eye mask by the global face semantic segmentation matrix y of the damaged image to obtain a matrix formed by the region where the eye part in the damaged image is located, namely extracting matrix information describing the eye region from the global face semantic segmentation matrix.
S603, acquiring a difference matrix between the fourth image matrix and the first image matrix;
wherein the difference matrix may be obtained by performing a matrix subtraction operation on the fourth image matrix and the first image matrix.
The difference matrix corresponds to α(C:γ(C)) − B in Formula 3.
S604, obtaining a local human face semantic segmentation matrix matched with the target human face component in the first global human face semantic segmentation matrix;
For example, if there are 11 pre-configured face components, the global face semantic segmentation matrix 1 is a W × H × 11 matrix, and the local face semantic segmentation matrix corresponds to one of its 11 channel matrices, e.g., the W × H × 1 matrix of the eye component. The local face semantic segmentation matrix corresponds to M_k(γ(C)) in Formula 3; k takes multiple values if there are multiple target face components, so the difference between the two images can be obtained for each component region of interest.
S605, performing dot multiplication operation on the local face semantic segmentation matrix and the difference matrix to obtain sub-loss data matched with the target face component;
The sub-loss data corresponding to the target face component corresponds to M_k(γ(C)) ⊙ (α(C:γ(C)) − B) in Formula 3.
And S606, summing the plurality of sub-loss data matched with the plurality of target face parts to obtain second loss data on the target face parts between the fourth image and the first image.
When k takes a plurality of values, a plurality of sub-loss data matched with different target face parts need to be summed, that is, the result of formula 3 is the second loss data.
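The sketch below assembles the structural loss of Formula 3 from these pieces. The per-part reduction (a mean of squared masked differences) is an assumption; the text only specifies the dot-product with M_k(γ(C)) and the summation over the target face parts.

```python
import numpy as np

def structural_loss(global_seg, recon, sharp, target_parts):
    """Formula 3 sketch. global_seg: (H, W, 11) segmentation matrix of the damaged image;
    recon, sharp: (H, W, 3) reconstructed and sharp image matrices;
    target_parts: channel indices of the target face components (e.g. eyes, nose, mouth, eyebrows)."""
    diff = recon - sharp                           # alpha(C:gamma(C)) - B
    loss = 0.0
    for k in target_parts:
        mask = global_seg[..., k][..., None]       # M_k(gamma(C)), broadcast over the RGB channels
        loss += np.mean((mask * diff) ** 2)        # sub-loss for face component k
    return loss
```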
Thus, the embodiment of the invention obtains the difference matrix between the two image matrices corresponding to the reconstructed image and the sharp image, extracts the local face semantic segmentation matrix corresponding to each target face component from the global face semantic segmentation matrix of the damaged image, and performs a dot-product operation between the local face semantic segmentation matrix and the difference matrix that represents the overall difference between the two images, thereby obtaining the difference between the reconstructed image and the sharp image on each target face component. Iteratively updating the neural network model with the differences on the target face components of interest strengthens the ability of the trained face reconstruction model to repair each target face component, so that the regions where the target face components are located in a seriously damaged face image can be reconstructed in a targeted manner, improving the definition of the target face components in the repaired fourth image.
It should be noted that the present invention does not limit the execution sequence between S601 and S602, and does not limit the execution sequence between S603 and S604.
Alternatively, when S503 is executed, the third loss data may be calculated according to Formula 4:

$$\mathrm{Loss}_{l2}=\frac{1}{W\times H}\big\|\alpha(C:\gamma(C))-B\big\|_{2}^{2}\qquad\text{(Formula 4)}$$
The same symbol marks in formula 4 as those in formula 2 are not repeated here, and reference may be made to the related description of formula 2.
Formula 4 can calculate the L2 loss between the reconstructed image and the sharp image at each pixel point, and then average the L2 losses between all pixel points to obtain the pixel level loss.
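A minimal sketch of Formula 4 as described: the squared difference at each pixel point, averaged over all pixel values.

```python
import numpy as np

def pixel_loss(recon, sharp):
    """Formula 4 sketch: pixel-level L2 loss between the reconstructed image and the
    sharp image, averaged over all pixel values."""
    return np.mean((recon.astype(np.float32) - sharp.astype(np.float32)) ** 2)
```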
The network structures of the face semantic segmentation model 1 and the face semantic segmentation model 2 are as follows:
the network reads the picture input as a matrix of (w x h x 3), and the network has 38 layers:
layer 1 is a convolutional layer with 128 3 × 3 convolutional kernels, with an input size of w × h × 3 and an output size of w × h × 128.
Layer 2 is a convolutional layer with 128 3 × 3 convolutional kernels, with an input size of w × h × 128 and an output size of w × h × 128.
Layer 3 is a max pooling layer with a 2 × 2 pooling kernel, with an input size of w × h × 128 and an output size of w/2 × h/2 × 128.
The 4 th layer is a convolutional layer with 256 convolution kernels of 3 × 3, with an input size of w/2 × h/2 × 128 and an output size of w/2 × h/2 × 256.
The 5 th layer is a convolutional layer having 256 convolution kernels of 3 × 3, with an input size of w/2 × h/2 × 256 and an output size of w/2 × h/2 × 256.
The 6 th layer is a max pooling layer with a 2 × 2 pooling kernel, with an input size of w/2 × h/2 × 256 and an output size of w/4 × h/4 × 256.
The 7 th layer is a convolutional layer having 512 convolution kernels of 3 × 3, with an input size of w/4 × h/4 × 256 and an output size of w/4 × h/4 × 512.
The 8 th layer is a convolutional layer with 512 convolution kernels of 3 × 3, the input size is w/4 × h/4 × 512, and the output size is w/4 × h/4 × 512.
The 9 th layer is a max pooling layer with a 2 × 2 pooling kernel, with an input size of w/4 × h/4 × 512 and an output size of w/8 × h/8 × 512.
The 10 th layer is a convolutional layer with 1024 3 × 3 convolutional kernels, the input size is w/8 × h/8 × 512, and the output size is w/8 × h/8 × 1024.
The 11 th layer is a convolutional layer having 1024 convolutional kernels of 3 × 3, the input size is w/8 × h/8 × 1024, and the output size is w/8 × h/8 × 1024.
The 12 th layer is a max pooling layer with a 2 × 2 pooling kernel, with an input size of w/8 × h/8 × 1024 and an output size of w/16 × h/16 × 1024.
The 13 th layer is a convolutional layer with 2048 convolution kernels of 3 × 3, with an input size of w/16 × h/16 × 1024 and an output size of w/16 × h/16 × 2048.
The 14 th layer is a convolutional layer with 2048 convolution kernels of 3 × 3, with an input size of w/16 × h/16 × 2048 and an output size of w/16 × h/16 × 2048.
The 15 th layer is a maximum pooling layer with a 2 × 2 pooling core, with an input size of w/16 × h/16 × 2048 and an output size of w/32 × h/32 × 2048.
The 16 th layer is an upsampled layer with a row and column upsampling factor of (2,2), an input size of w/32 × h/32 × 2048, and an output size of w/16 × h/16 × 2048.
The 17 th layer is a convolutional layer with 1024 3 × 3 convolutional kernels, with an input size of w/16 × h/16 × 2048 and an output size of w/16 × h/16 × 1024.
The 18 th layer is a splicing layer, in which the output of the 17 th layer and the output of the 12 th layer are spliced; the input size is two w/16 × h/16 × 1024 and the output size is w/16 × h/16 × 2048.
The 19 th layer is a convolutional layer with 1024 3 × 3 convolutional kernels, the input size is w/16 × h/16 × 2048, and the output size is w/16 × h/16 × 1024.
The 20 th layer is an upsampled layer with a row and column upsampling factor of (2,2), the input size is w/16 × h/16 × 1024, and the output size is w/8 × h/8 × 1024.
The 21 st layer is a convolutional layer with 512 convolution kernels of 3 × 3, the input size is w/8 × h/8 × 1024, and the output size is w/8 × h/8 × 512.
The 22 nd layer is a splicing layer, in which the output of the 21 st layer and the output of the 9 th layer are spliced; the input size is two w/8 × h/8 × 512 and the output size is w/8 × h/8 × 1024.
The 23 rd layer is a convolutional layer with 512 convolution kernels of 3 × 3, the input size is w/8 × h/8 × 1024, and the output size is w/8 × h/8 × 512.
The 24 th layer is an upsampled layer with a row and column sampling factor of (2,2), the input size is w/8 xh/8 × 512, and the output size is w/4 xh/4 × 512.
The 25 th layer is a convolutional layer with 256 convolution kernels of 3 × 3, with an input size of w/4 × h/4 × 512 and an output size of w/4 × h/4 × 256.
The 26 th layer is a splicing layer, in which the output of the 25 th layer and the output of the 6 th layer are spliced; the input size is two w/4 × h/4 × 256 and the output size is w/4 × h/4 × 512.
The 27 th layer is a convolutional layer having 256 convolution kernels of 3 × 3, with an input size of w/4 × h/4 × 512 and an output size of w/4 × h/4 × 256.
The 28 th layer is an upsampled layer with a row and column sampling factor of (2,2), the input size is w/4 × h/4 × 256, and the output size is w/2 × h/2 × 256.
The 29 th layer is a convolutional layer with 128 3 × 3 convolutional kernels, the input size is w/2 × h/2 × 256, and the output size is w/2 × h/2 × 128.
The 30 th layer is a splicing layer, in which the output of the 29 th layer and the output of the 3 rd layer are spliced; the input size is two w/2 × h/2 × 128 and the output size is w/2 × h/2 × 256.
The 31 st layer is a convolutional layer having 128 convolution kernels of 3 × 3, with an input size of w/2 × h/2 × 256 and an output size of w/2 × h/2 × 128.
The 32 nd layer is an upsampled layer with a row and column sampling factor of (2,2), the input size is w/2 × h/2 × 128, and the output size is w × h × 128.
The 33 rd layer is a convolutional layer with 128 3 × 3 convolutional kernels, with an input size of w × h × 128 and an output size of w × h × 128.
The 34 th layer is a splicing layer, in which the output of the 33 rd layer and the output of the 1 st layer are spliced; the input size is two w × h × 128 and the output size is w × h × 256.
The 35 th layer is a convolutional layer having 128 convolution kernels of 3 × 3, with an input size of w × h × 256 and an output size of w × h × 128.
The 36 th layer is a convolutional layer having 64 convolution kernels of 3 × 3, with an input size of w × h × 128 and an output size of w × h × 64.
The 37 th layer is a convolutional layer having 32 convolution kernels of 3 × 3, with an input size of w × h × 64 and an output size of w × h × 32.
The 38 th layer is a convolutional layer with 11 1 × 1 convolutional kernels, with an input size of w × h × 32 and an output size of w × h × 11.
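For reference, the 38-layer description above maps onto a U-Net-style encoder-decoder. The following PyTorch sketch follows the listed layer sizes and skip connections; activation functions are not specified in the text, so the ReLU after each convolution is an assumption, and the sketch should not be read as the exact original implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3(in_c, out_c, k=3):
    # 3 x 3 convolution with "same" padding, so the spatial sizes in the layer table are preserved
    return nn.Conv2d(in_c, out_c, kernel_size=k, padding=k // 2)

class FaceSegNet(nn.Module):
    """Sketch of the 38-layer face semantic segmentation network described above
    (channels-first). ReLU activations are an assumption; the text lists only the layers."""

    def __init__(self, num_parts=11):
        super().__init__()
        self.pool = nn.MaxPool2d(2)            # layers 3, 6, 9, 12, 15
        self.up = nn.Upsample(scale_factor=2)  # layers 16, 20, 24, 28, 32
        # encoder convolutions (layers 1-2, 4-5, 7-8, 10-11, 13-14)
        self.e1a, self.e1b = conv3(3, 128), conv3(128, 128)
        self.e2a, self.e2b = conv3(128, 256), conv3(256, 256)
        self.e3a, self.e3b = conv3(256, 512), conv3(512, 512)
        self.e4a, self.e4b = conv3(512, 1024), conv3(1024, 1024)
        self.e5a, self.e5b = conv3(1024, 2048), conv3(2048, 2048)
        # decoder convolutions (layers 17, 19, 21, 23, 25, 27, 29, 31, 33, 35-37) and 1x1 head (38)
        self.d4a, self.d4b = conv3(2048, 1024), conv3(2048, 1024)
        self.d3a, self.d3b = conv3(1024, 512), conv3(1024, 512)
        self.d2a, self.d2b = conv3(512, 256), conv3(512, 256)
        self.d1a, self.d1b = conv3(256, 128), conv3(256, 128)
        self.d0a, self.d0b = conv3(128, 128), conv3(256, 128)
        self.out36, self.out37 = conv3(128, 64), conv3(64, 32)
        self.head = nn.Conv2d(32, num_parts, kernel_size=1)

    def forward(self, x):
        # x: N x 3 x h x w, with h and w assumed divisible by 32 so pooling and upsampling line up
        r = F.relu
        x1a = r(self.e1a(x))                                     # layer 1
        x1 = r(self.e1b(x1a)); p1 = self.pool(x1)                # layers 2-3
        x2 = r(self.e2b(r(self.e2a(p1)))); p2 = self.pool(x2)    # layers 4-6
        x3 = r(self.e3b(r(self.e3a(p2)))); p3 = self.pool(x3)    # layers 7-9
        x4 = r(self.e4b(r(self.e4a(p3)))); p4 = self.pool(x4)    # layers 10-12
        x5 = r(self.e5b(r(self.e5a(p4)))); p5 = self.pool(x5)    # layers 13-15
        d = r(self.d4a(self.up(p5)))                             # layers 16-17
        d = r(self.d4b(torch.cat([d, p4], dim=1)))               # layers 18-19 (skip from layer 12)
        d = r(self.d3a(self.up(d)))                              # layers 20-21
        d = r(self.d3b(torch.cat([d, p3], dim=1)))               # layers 22-23 (skip from layer 9)
        d = r(self.d2a(self.up(d)))                              # layers 24-25
        d = r(self.d2b(torch.cat([d, p2], dim=1)))               # layers 26-27 (skip from layer 6)
        d = r(self.d1a(self.up(d)))                              # layers 28-29
        d = r(self.d1b(torch.cat([d, p1], dim=1)))               # layers 30-31 (skip from layer 3)
        d = r(self.d0a(self.up(d)))                              # layers 32-33
        d = r(self.d0b(torch.cat([d, x1a], dim=1)))              # layers 34-35 (skip from layer 1)
        d = r(self.out37(r(self.out36(d))))                      # layers 36-37
        return self.head(d)                                      # layer 38: w x h x 11
```

For example, FaceSegNet()(torch.zeros(1, 3, 384, 384)) produces a 1 × 11 × 384 × 384 output, matching the w × h × 11 result of layer 38.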
Network structure of neural network model for face reconstruction:
The network has two inputs: the standard picture information input is a w × h × 14 matrix formed by superimposing the w × h × 3 matrix of the standard picture and the w × h × 11 matrix output for the standard picture by semantic segmentation network 1, denoted IMAGE1; the damaged picture information input is a w × h × 14 matrix formed by superimposing the w × h × 3 matrix of the damaged picture and the w × h × 11 matrix output for the damaged picture by semantic segmentation network 2, denoted IMAGE2.
the network has 51 layers:
Layer 1 is a convolutional layer with 64 3 × 3 convolutional kernels, with an input of IMAGE1, size w × h × 14, and an output of IMAGE1_L1, size w × h × 64.
Layer 2 is a convolutional layer with 64 3 × 3 convolutional kernels, with an input of IMAGE2, size w × h × 14, and an output of IMAGE2_ L2, size w × h × 64.
Layer 3 is a convolutional layer with 128 3 × 3 convolutional kernels, with an input of IMAGE1_ L1, size w × h × 64, and an output of IMAGE1_ L3, size w × h × 128.
Layer 4 is a convolutional layer with 128 3 × 3 convolutional kernels, with an input of IMAGE2_ L2, size w × h × 64, and an output of IMAGE2_ L4, size w × h × 128.
Layer 5 is a convolutional layer with 128 3 × 3 convolutional kernels, with an input of IMAGE1_ L3, of size w × h × 128, and an output of IMAGE1_ L5, of size w × h × 128.
Layer 6 is a convolutional layer with 128 3 × 3 convolutional kernels, with an input of IMAGE2_ L4, of size w × h × 128, and an output of IMAGE2_ L6, of size w × h × 128.
Layer 7 is a max pooling layer with a 2 × 2 pooling kernel, with an input of IMAGE1_L5, size w × h × 128, and an output of IMAGE1_L7, size w/2 × h/2 × 128.
The 8 th layer is a convolutional layer with 256 convolution kernels of 3 × 3, with input of IMAGE1_ L7, size w/2 × h/2 × 128, and output of IMAGE1_ L8, size w/2 × h/2 × 256.
The 9 th layer is a convolutional layer with 256 convolution kernels of 3 × 3, with input of IMAGE1_ L8, size w/2 × h/2 × 256, and output of IMAGE1_ L9, size w/2 × h/2 × 256.
Layer 10 is a max pooling layer with a 2 × 2 pooling kernel, with an input of IMAGE2_L6, size w × h × 128, and an output of IMAGE2_L10, size w/2 × h/2 × 128.
The 11 th layer is a convolutional layer with 256 convolution kernels of 3 × 3, with input of IMAGE2_ L10, size w/2 × h/2 × 128, and output of IMAGE2_ L11, size w/2 × h/2 × 256.
The 12 th layer is a convolutional layer with 256 convolution kernels of 3 × 3, with input of IMAGE2_ L11, size w/2 × h/2 × 256, and output of IMAGE2_ L12, size w/2 × h/2 × 256.
Layer 13 is a max pooling layer with a 2 × 2 pooling kernel, with an input of IMAGE1_L9, size w/2 × h/2 × 256, and an output of IMAGE1_L13, size w/4 × h/4 × 256.
The 14 th layer is a convolutional layer with 512 3 × 3 convolutional kernels, with input of IMAGE1_ L13, size w/4 × h/4 × 256, and output of IMAGE1_ L14, size w/4 × h/4 × 512.
The 15 th layer is a convolutional layer with 512 3 × 3 convolutional kernels, with input of IMAGE1_ L14, size w/4 × h/4 × 512, and output of IMAGE1_ L15, size w/4 × h/4 × 512.
Layer 16 is a max pooling layer with a 2 × 2 pooling kernel, with an input of IMAGE2_L12, size w/2 × h/2 × 256, and an output of IMAGE2_L16, size w/4 × h/4 × 256.
The 17 th layer is a convolutional layer with 512 3 × 3 convolutional kernels, with input of IMAGE2_ L16, size w/4 × h/4 × 256, and output of IMAGE2_ L17, size w/4 × h/4 × 512.
Layer 18 is a convolutional layer with 512 3 × 3 convolutional kernels, with input of IMAGE2_ L17, size w/4 × h/4 × 512, and output of IMAGE2_ L18, size w/4 × h/4 × 512.
Layer 19 is a max pooling layer with a 2 × 2 pooling kernel, with an input of IMAGE1_L15, size w/4 × h/4 × 512, and an output of IMAGE1_L19, size w/8 × h/8 × 512.
The 20 th layer is a convolutional layer with 1024 3 × 3 convolutional kernels, with input of IMAGE1_ L19, size w/8 × h/8 × 512, and output of IMAGE1_ L20, size w/8 × h/8 × 1024.
Layer 21 is a convolutional layer with 1024 convolution kernels of 3 × 3, with an input of IMAGE1_L20, size w/8 × h/8 × 1024, and an output of IMAGE1_L21, size w/8 × h/8 × 1024.
Layer 22 is a max pooling layer with a 2 × 2 pooling kernel, with an input of IMAGE2_L18, size w/4 × h/4 × 512, and an output of IMAGE2_L22, size w/8 × h/8 × 512.
Layer 23 is a convolutional layer with 1024 convolution kernels of 3 × 3, with an input of IMAGE2_L22, size w/8 × h/8 × 512, and an output of IMAGE2_L23, size w/8 × h/8 × 1024.
Layer 24 is a convolutional layer with 1024 convolution kernels of 3 × 3, with an input of IMAGE2_L23, size w/8 × h/8 × 1024, and an output of IMAGE2_L24, size w/8 × h/8 × 1024.
Layer 25 is a max pooling layer with a 2 × 2 pooling kernel, with an input of IMAGE1_L21, size w/8 × h/8 × 1024, and an output of IMAGE1_L25, size w/16 × h/16 × 1024.
The 26 th layer is a convolutional layer with 2048 convolution kernels of 3 × 3, the input is IMAGE1_ L25 with a size of w/16 × h/16 × 1024, and the output is IMAGE1_ L26 with a size of w/16 × h/16 × 2048.
The 27 th layer is a convolutional layer with 2048 convolution kernels of 3 × 3, the input is IMAGE1_ L26 with a size of w/16 × h/16 × 2048, and the output is IMAGE1_ L27 with a size of w/16 × h/16 × 2048.
Layer 28 is a max pooling layer with a 2 × 2 pooling kernel, with an input of IMAGE2_L24, size w/8 × h/8 × 1024, and an output of IMAGE2_L28, size w/16 × h/16 × 1024.
The 29 th layer is a convolutional layer with 2048 convolution kernels of 3 × 3, the input is IMAGE2_ L28, the size is w/16 × h/16 × 1024, the output is IMAGE2_ L29, the size is w/16 × h/16 × 2048.
The 30 th layer is a convolutional layer with 2048 convolution kernels of 3 × 3, the input is IMAGE2_ L29 with a size of w/16 × h/16 × 2048, and the output is IMAGE2_ L30 with a size of w/16 × h/16 × 2048.
The 31 st layer is a splicing layer, the IMAGE1_L27 output by the 27 th layer and the IMAGE2_L30 output by the 30 th layer are spliced, the input size is two w/16 × h/16 × 2048, the output is IMAGE_L31, and the size is w/16 × h/16 × 4096.
The 32 nd layer is a convolutional layer with 4096 3 × 3 convolutional kernels, with input of IMAGE _ L31, size w/16 × h/16 × 4096, and output of IMAGE _ L32, size w/16 × h/16 × 4096.
Layer 33 is an upsampled layer with a row and column sampling factor of (2,2), input IMAGE _ L32, of size w/16 × h/16 × 4096, and output IMAGE _ L33, of size w/8 × h/8 × 4096.
Layer 34 is a convolutional layer with 4096 3 × 3 convolution kernels, with an input of IMAGE_L33, size w/8 × h/8 × 4096, and an output of IMAGE_L34, size w/8 × h/8 × 4096.
The 35 th layer is a splicing layer, the IMAGE_L34 output by the 34 th layer, the IMAGE1_L21 output by the 21 st layer and the IMAGE2_L24 output by the 24 th layer are spliced, the input sizes are w/8 × h/8 × 4096, w/8 × h/8 × 1024 and w/8 × h/8 × 1024, the output is IMAGE_L35, and the size is w/8 × h/8 × 6144.
The 36 th layer is a convolutional layer with 2048 convolution kernels of 3 × 3, input IMAGE _ L35, size w/8 × h/8 × 6144, output IMAGE _ L36, size w/8 × h/8 × 2048.
Layer 37 is an upsampled layer with a row and column upsampling factor of (2,2), input IMAGE _ L36, of size w/8 xh/8 × 2048, and output IMAGE _ L37, of size w/4 xh/4 × 2048.
The 38 th layer is a convolutional layer with 2048 convolution kernels of 3 × 3, with an input of IMAGE_L37, size w/4 × h/4 × 2048, and an output of IMAGE_L38, size w/4 × h/4 × 2048.
The 39 th layer is a splicing layer, the IMAGE_L38 output by the 38 th layer, the IMAGE1_L15 output by the 15 th layer and the IMAGE2_L18 output by the 18 th layer are spliced, the input sizes are w/4 × h/4 × 2048, w/4 × h/4 × 512 and w/4 × h/4 × 512, the output is IMAGE_L39, and the size is w/4 × h/4 × 3072.
Layer 40 is a convolutional layer of 1024 3 × 3 convolutional kernels, with input of IMAGE _ L39, size w/4 × h/4 × 3072, and output of IMAGE _ L40, size w/4 × h/4 × 1024.
Layer 41 is an upsampled layer with a row and column sampling factor of (2,2), with an input of IMAGE_L40, of size w/4 × h/4 × 1024, and an output of IMAGE_L41, of size w/2 × h/2 × 1024.
Layer 42 is a convolutional layer of 1024 3 × 3 convolutional kernels, with input IMAGE _ L41, size w/2 × h/2 × 1024, and output IMAGE _ L42, size w/2 × h/2 × 1024.
The 43 rd layer is a splicing layer, splicing the IMAGE_L42 output by the 42 nd layer, the IMAGE1_L9 output by the 9 th layer and the IMAGE2_L12 output by the 12 th layer, wherein the input sizes are w/2 × h/2 × 1024, w/2 × h/2 × 256 and w/2 × h/2 × 256, the output is IMAGE_L43, and the size is w/2 × h/2 × 1536.
Layer 44 is a convolutional layer with 512 3 × 3 convolutional kernels, with input IMAGE _ L43, size w/2 × h/2 × 1536, and output IMAGE _ L44, size w/2 × h/2 × 512.
Layer 45 is an upsampled layer with a row and column sampling factor of (2,2), input IMAGE _ L44, of size w/2 × h/2 × 512, and output IMAGE _ L45, of size w × h × 512.
Layer 46 is a convolutional layer with 512 3 × 3 convolutional kernels, with an input of IMAGE _ L45, size w × h × 512, and an output of IMAGE _ L46, size w × h × 512.
The 47 th layer is a splicing layer, and the IMAGE_L46 output by the 46 th layer, the IMAGE1_L5 output by the 5 th layer and the IMAGE2_L6 output by the 6 th layer are spliced, wherein the input sizes are w × h × 512, w × h × 128 and w × h × 128, the output is IMAGE_L47, and the size is w × h × 768.
The 48 th layer is a convolutional layer with 256 convolution kernels of 3 × 3, with an input of IMAGE _ L47, size w × h × 768, and an output of IMAGE _ L48, size w × h × 256.
Layer 49 is a convolutional layer with 128 3 × 3 convolutional kernels, with an input of IMAGE _ L48, size w × h × 256, and an output of IMAGE _ L49, size w × h × 128.
Layer 50 is a convolutional layer with 64 3 × 3 convolutional kernels, with an input of IMAGE _ L49, size w × h × 128, and an output of IMAGE _ L50, size w × h × 64.
Layer 51 is a convolutional layer with 3 convolution kernels of 3 × 3, with an input of IMAGE_L50, size w × h × 64, and an output of IMAGE_L51, size w × h × 3.
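For readability, the sketch below compresses the 51-layer description into its overall topology: two w × h × 14 inputs pass through parallel convolution/pooling branches, are fused by a splicing layer at the smallest scale, and are decoded with splices drawn from both branches at every scale, ending in a w × h × 3 output. This is a rough, hedged illustration with reduced filter and layer counts and assumed activations, not the exact 51-layer network of this description:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def conv_block(x, filters):
    # Two 3x3 convolutions at one scale (activations assumed).
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    return layers.Conv2D(filters, 3, padding="same", activation="relu")(x)

def build_face_reconstruction_net(h, w, base=64):
    # h and w are assumed to be multiples of 16.
    image1 = layers.Input((h, w, 14))   # clear picture + its 11-channel segmentation
    image2 = layers.Input((h, w, 14))   # damaged picture + its 11-channel segmentation

    skips1, skips2, x1, x2 = [], [], image1, image2
    for i in range(4):                  # parallel encoders with four pooling stages
        x1, x2 = conv_block(x1, base * 2 ** i), conv_block(x2, base * 2 ** i)
        skips1.append(x1)
        skips2.append(x2)
        x1, x2 = layers.MaxPooling2D(2)(x1), layers.MaxPooling2D(2)(x2)

    x1, x2 = conv_block(x1, base * 16), conv_block(x2, base * 16)
    x = layers.Concatenate()([x1, x2])  # splicing at the smallest scale (cf. layer 31)

    for i in reversed(range(4)):        # decoder with splices from BOTH branches
        x = layers.UpSampling2D(2)(x)
        x = layers.Conv2D(base * 2 ** (i + 1), 3, padding="same", activation="relu")(x)
        x = layers.Concatenate()([x, skips1[i], skips2[i]])
        x = layers.Conv2D(base * 2 ** i, 3, padding="same", activation="relu")(x)

    out = layers.Conv2D(3, 3, padding="same", activation="sigmoid")(x)  # w x h x 3; activation assumed
    return Model(inputs=[image1, image2], outputs=out)
```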
Referring to fig. 6, a flowchart illustrating steps of an embodiment of an image processing method according to the present invention is shown, which may specifically include the following steps:
step 701, acquiring a first image to be reconstructed;
The first image here is different from the first image described in the above embodiments of fig. 1 to 5; it is an image that needs to be reconstructed in practical applications and includes a damaged face region. For example, the face region in the first image may be severely damaged.
Step 702, acquiring a second image matched with the first image;
the face area in the second image and the face area in the first image belong to the same user, and the second image comprises a clear face area;
wherein the second image is different from the second image described above in the embodiment of fig. 1-5, but is similar to the third image (i.e., the standard image) in the embodiment of fig. 1-5.
Step 703, obtaining a first face semantic segmentation result of the first image;
for specific implementation of obtaining the human face semantic segmentation result, reference may be made to the relevant description in the embodiments of fig. 1 to 5, which is not described herein again.
Step 704, obtaining a second face semantic segmentation result of the second image;
for specific implementation of obtaining the human face semantic segmentation result, reference may be made to the relevant description in the embodiments of fig. 1 to 5, which is not described herein again.
Step 705, inputting the first image and the first face semantic segmentation result, and the second image and the second face semantic segmentation result into a face reconstruction model trained in advance, so that the face reconstruction model reconstructs a face region of the first image according to the second image, the second face semantic segmentation result, and the first face semantic segmentation result, and generates a third image with a clear face region.
The face reconstruction model is used for reconstructing a face region of the first image according to the second image, the second face semantic segmentation result and the first face semantic segmentation result to generate a third image with a clear face region.
For the specific structure of the input data of the face reconstruction model, reference may be made to the related descriptions in the embodiments of fig. 1 to 5, which are not described herein again. In this step, since the face reconstruction model has been trained in any one of the alternative embodiments in fig. 1 to fig. 5, the face reconstruction model herein may combine the first face semantic segmentation result of the first image to be repaired and the second face semantic segmentation result of the second image (standard image) with a clear face region to repair the face region of the first image, thereby obtaining a third image with a clear face region.
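Assuming pre-trained segmentation and reconstruction models are available as callable Keras-style objects (the function and variable names below are illustrative, not taken from the patent), steps 701 to 705 can be sketched as follows:

```python
import numpy as np

def reconstruct(first_image, second_image, seg_model, recon_model):
    """first_image: damaged face image to be reconstructed, H x W x 3.
    second_image: clear face image of the same user, H x W x 3.
    seg_model, recon_model: pre-trained Keras-style models (assumed)."""
    # Steps 703-704: face semantic segmentation results, H x W x 11 each.
    seg1 = seg_model.predict(first_image[np.newaxis])[0]
    seg2 = seg_model.predict(second_image[np.newaxis])[0]

    # Combine each image with its segmentation result into an H x W x 14 input.
    in1 = np.concatenate([first_image, seg1], axis=-1)[np.newaxis]
    in2 = np.concatenate([second_image, seg2], axis=-1)[np.newaxis]

    # Step 705: the face reconstruction model produces the third image, H x W x 3.
    third_image = recon_model.predict([in1, in2])[0]
    return third_image
```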
It should be noted that the third image here corresponds to the fourth image in the above embodiment of the model training method, and is different from the third image described in that embodiment.
The face reconstruction model of the embodiment of the invention can reconstruct the damaged face region in the first image with reference to both the face semantic segmentation result of the first image to be repaired and the face semantic segmentation result of the second image, which belongs to the same user and has a clear face region, thereby improving the definition of the face region in the reconstructed image.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Corresponding to the model training method provided in the embodiment of the present invention, referring to fig. 7, a structural block diagram of an embodiment of a model training apparatus according to the present invention is shown, which may specifically include the following modules:
a first obtaining module 801, configured to obtain a training sample set, where the training sample set includes a first image, a second image, and a third image, where the second image is an image obtained by adding noise to a face region of the first image, a face region in the third image and a face region in the first image belong to a same user, and the first image and the third image both include a clear face region;
a second obtaining module 802, configured to obtain a first face semantic segmentation result of the second image;
a third obtaining module 803, configured to obtain a second face semantic segmentation result of the third image;
an input module 804, configured to input the second image and the first face semantic segmentation result, and the third image and the second face semantic segmentation result to a neural network model, so as to obtain a fourth image;
an identifying module 805 configured to identify difference data on a face region between the fourth image and the first image;
an updating module 806, configured to iteratively update the neural network model according to the difference data;
the neural network model after the iterative update is used for reconstructing a face region of any blurred image of the face region to generate a clear image of the face region.
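Taken together, modules 801 to 806 correspond to one training iteration. A compressed sketch of such an iteration is given below; the framework calls, the use of a single segmentation model for both images, and the loss helper are assumptions rather than the patent's own code, and the loss itself is sketched after the identifying-module details further below:

```python
import tensorflow as tf

def train_step(model, optimizer, seg_model, compute_difference, batch):
    """One iteration of the training described by modules 801-806.
    batch: (first_image, second_image, third_image), each N x H x W x 3."""
    first, second, third = batch
    seg2 = seg_model(second, training=False)   # first face semantic segmentation result, N x H x W x 11
    seg3 = seg_model(third, training=False)    # second face semantic segmentation result
    in1 = tf.concat([second, seg2], axis=-1)   # first matrix data, N x H x W x 14
    in2 = tf.concat([third, seg3], axis=-1)    # second matrix data, N x H x W x 14
    with tf.GradientTape() as tape:
        fourth = model([in1, in2], training=True)        # fourth image
        diff = compute_difference(fourth, first, seg2)   # difference data on the face region
    grads = tape.gradient(diff, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return diff
```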
Optionally, the second obtaining module 802 includes:
the first obtaining submodule is used for obtaining a second image matrix matched with the second image;
the first input submodule is used for inputting the second image matrix into a pre-trained first human face semantic segmentation model to obtain a first global human face semantic segmentation matrix matched with a plurality of human face components;
the third obtaining module 803 includes:
the second obtaining submodule is used for obtaining a third image matrix matched with the third image;
the second input submodule is used for inputting the third image matrix into a second human face semantic segmentation model which is trained in advance to obtain a second global human face semantic segmentation matrix matched with a plurality of human face components;
the input module 804 includes:
the first splicing submodule is used for performing matrix connection processing on the second image matrix and the first global human face semantic segmentation matrix to obtain first matrix data;
the second splicing submodule is used for performing matrix connection processing on the third image matrix and the second global face semantic segmentation matrix to obtain second matrix data;
and the third input submodule is used for inputting the first matrix data and the second matrix data into a neural network model to obtain a fourth image.
Optionally, the identifying module 805 includes:
a first identification submodule for identifying first loss data on image features between the fourth image and the first image;
a second identification submodule, configured to identify second loss data on a target face component between the fourth image and the first image according to the first face semantic segmentation result;
a third identifying submodule, configured to identify third loss data between the fourth image and the first image at a pixel point;
and the third obtaining submodule is used for carrying out weighted summation on the first loss data, the second loss data and the third loss data according to preset image characteristic weight, face part weight and pixel point weight so as to obtain difference data between the fourth image and the first image in a face area.
Optionally, the first identification submodule includes:
the first input unit is used for respectively inputting the fourth image and the first image into a pre-trained image feature extraction model to obtain image feature data of the fourth image and image feature data of the first image;
a first obtaining unit configured to obtain first loss data on an image feature between the fourth image and the first image according to a difference between the image feature data of the fourth image and the image feature data of the first image.
Optionally, the second obtaining module 802 includes the first obtaining sub-module and the first input sub-module;
the second identification submodule includes:
a second acquisition unit configured to acquire a fourth image matrix matched with the fourth image;
a third acquisition unit configured to acquire a first image matrix matched with the first image;
a fourth acquiring unit configured to acquire a difference matrix between the fourth image matrix and the first image matrix;
a fifth obtaining unit, configured to obtain a local face semantic segmentation matrix that is matched with the target face component in the first global face semantic segmentation matrix;
the processing unit is used for performing point multiplication operation on the local face semantic segmentation matrix and the difference matrix to obtain sub-loss data matched with the target face component;
and the sixth acquisition unit is used for summing the plurality of sub-loss data matched with the plurality of target face parts to obtain second loss data on the target face parts between the fourth image and the first image.
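As a hedged illustration of how the three loss terms identified above might be combined into the difference data (the feature extractor, the target face component channel indices, the use of absolute differences, and the weight values are all assumptions, not values given in this description):

```python
import tensorflow as tf

def difference_data(fourth, first, seg_first, feature_model,
                    target_parts=(1, 2, 3),          # hypothetical channel indices, e.g. eyes/nose/mouth
                    w_feature=1.0, w_part=1.0, w_pixel=1.0):
    """Weighted sum of the three loss terms described by the identifying module.
    fourth, first: N x H x W x 3 images; seg_first: N x H x W x 11 first global
    face semantic segmentation matrix; feature_model: pre-trained image feature
    extraction model (assumed callable)."""
    # First loss data: difference on image features.
    feat_loss = tf.reduce_mean(tf.abs(feature_model(fourth) - feature_model(first)))

    # Third loss data: pixel-point difference.
    diff_matrix = tf.abs(fourth - first)
    pixel_loss = tf.reduce_mean(diff_matrix)

    # Second loss data: dot product of each local face semantic segmentation
    # matrix (one target face component) with the difference matrix, summed over components.
    part_loss = 0.0
    for c in target_parts:
        local_mask = seg_first[..., c:c + 1]          # local segmentation matrix for component c
        part_loss += tf.reduce_sum(local_mask * diff_matrix)

    # Weighted summation with the preset image feature / face part / pixel weights.
    return w_feature * feat_loss + w_part * part_loss + w_pixel * pixel_loss
```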
Optionally, the updating module 806 is further configured to iteratively update the neural network model according to the difference data until the difference data converges.
Since the apparatus embodiment is substantially similar to the model training method embodiment, it is described relatively briefly; for relevant details, refer to the description of the corresponding method embodiment.
Corresponding to the image processing method provided by the above embodiment of the present invention, referring to fig. 8, a block diagram of an image processing apparatus according to an embodiment of the present invention is shown, and the image processing apparatus may specifically include the following modules:
a first obtaining module 901, configured to obtain a first image to be reconstructed, where the first image includes a damaged face region;
a second obtaining module 902, configured to obtain a second image matched with the first image;
the face area in the second image and the face area in the first image belong to the same user, and the second image comprises a clear face area;
a third obtaining module 903, configured to obtain a first face semantic segmentation result of the first image;
a fourth obtaining module 904, configured to obtain a second face semantic segmentation result of the second image;
the reconstruction module 905 is configured to input the first image and the first face semantic segmentation result, and the second image and the second face semantic segmentation result into a face reconstruction model trained in advance, so that the face reconstruction model reconstructs a face region of the first image according to the second image, the second face semantic segmentation result, and the first face semantic segmentation result, and generates a third image with a clear face region.
Since this apparatus embodiment is substantially similar to the image processing method embodiment, it is described relatively briefly; for relevant details, refer to the description of the corresponding method embodiment.
According to still another embodiment of the present invention, there is also provided an electronic apparatus including: a memory, a processor, and a model training program or an image processing program stored in the memory and executable on the processor, wherein the model training program when executed by the processor implements the steps of the model training method according to any of the above embodiments, and the image processing program when executed by the processor implements the steps of the image processing method according to any of the above embodiments.
According to still another embodiment of the present invention, there is also provided a computer-readable storage medium having stored thereon a model training program or an image processing program, the model training program, when executed by a processor, implementing the steps in the model training method according to any one of the above-mentioned embodiments, and the image processing program, when executed by the processor, implementing the steps in the image processing method according to any one of the above-mentioned embodiments.
The embodiments in the present specification are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal to cause a series of operational steps to be performed on the computer or other programmable terminal to produce a computer implemented process such that the instructions which execute on the computer or other programmable terminal provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications of these embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it should also be noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or terminal that comprises the element.
The present invention provides a model training method, a model training apparatus, an image processing method, an image processing apparatus, an electronic device, and a computer-readable storage medium, which have been described in detail above, and the principles and embodiments of the present invention are explained herein by applying specific examples, and the descriptions of the above examples are only used to help understanding the method and the core ideas of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.

Claims (14)

1. A method of model training, comprising:
acquiring a training sample set, wherein the training sample set comprises a first image, a second image and a third image, the second image is an image obtained by adding noise to a face region of the first image, the face region of the third image and the face region of the first image belong to the same user, and the first image and the third image both comprise clear face regions;
obtaining a first face semantic segmentation result of the second image, wherein the first face semantic segmentation result comprises: specific position information of each human face part in the second image;
obtaining a second face semantic segmentation result of the third image, wherein the second face semantic segmentation result includes: specific position information of each human face part in the third image;
inputting the second image, the first human face semantic segmentation result and the third image and the second human face semantic segmentation result into a neural network model, so that the neural network model simultaneously refers to the positions of all parts of the human face area of the second image and the positions of all parts of the human face area of the third image to reconstruct the second image, and a fourth image is obtained;
identifying difference data between the fourth image and the first image over a face region, comprising: identifying first loss data on image features between the fourth image and the first image; identifying second loss data on a target face part between the fourth image and the first image according to the first face semantic segmentation result; identifying third loss data between the fourth image and the first image at a pixel point; according to preset image feature weight, face part weight and pixel point weight, carrying out weighted summation on the first loss data, the second loss data and the third loss data to obtain difference data of a face area between the fourth image and the first image;
iteratively updating the neural network model according to the difference data;
the neural network model after the iterative update is used for reconstructing a face region of any blurred image of the face region to generate a clear image of the face region.
2. The method of claim 1,
the obtaining of the first face semantic segmentation result of the second image includes:
acquiring a second image matrix matched with the second image;
inputting the second image matrix into a pre-trained first face semantic segmentation model to obtain a first global face semantic segmentation matrix matched with a plurality of face parts;
the obtaining of the second face semantic segmentation result of the third image includes:
acquiring a third image matrix matched with the third image;
inputting the third image matrix into a pre-trained second face semantic segmentation model to obtain a second global face semantic segmentation matrix matched with a plurality of face parts;
inputting the second image, the first face semantic segmentation result, and the third image and the second face semantic segmentation result into a neural network model, so that the neural network model reconstructs the second image by simultaneously referring to the positions of the parts of the face region of the second image and the positions of the parts of the face region of the third image, thereby obtaining a fourth image, including:
performing matrix connection processing on the second image matrix and the first global face semantic segmentation matrix to obtain first matrix data;
performing matrix connection processing on the third image matrix and the second global face semantic segmentation matrix to obtain second matrix data;
and inputting the first matrix data and the second matrix data into a neural network model to obtain a fourth image.
3. The method of claim 1, wherein the identifying first loss data between the fourth image and the first image over image features comprises:
inputting the fourth image and the first image into a pre-trained image feature extraction model respectively to obtain image feature data of the fourth image and image feature data of the first image;
and acquiring first loss data on image characteristics between the fourth image and the first image according to the difference between the image characteristic data of the fourth image and the image characteristic data of the first image.
4. The method of claim 1,
the obtaining of the first face semantic segmentation result of the second image includes:
acquiring a second image matrix matched with the second image;
inputting the second image matrix into a human face semantic segmentation model which is trained in advance to obtain a first global human face semantic segmentation matrix matched with a plurality of human face components;
the identifying second loss data on the target face component between the fourth image and the first image according to the first face semantic segmentation result comprises:
acquiring a fourth image matrix matched with the fourth image;
acquiring a first image matrix matched with the first image;
acquiring a difference matrix between the fourth image matrix and the first image matrix;
acquiring a local face semantic segmentation matrix matched with the target face component in the first global face semantic segmentation matrix;
performing dot product operation on the local face semantic segmentation matrix and the difference matrix to obtain sub-loss data matched with the target face component;
and summing a plurality of sub-loss data matched with a plurality of target face parts to obtain second loss data on the target face parts between the fourth image and the first image.
5. The method of claim 1, wherein the iteratively updating the neural network model based on the difference data comprises:
and iteratively updating the neural network model according to the difference data until the difference data is converged.
6. An image processing method, comprising:
acquiring a first image to be reconstructed, wherein the first image comprises a damaged face region;
acquiring a second image matched with the first image, wherein the face region in the second image and the face region in the first image belong to the same user, and the second image comprises a clear face region;
obtaining a first face semantic segmentation result of the first image, wherein the first face semantic segmentation result comprises: specific position information of each human face part in the first image;
obtaining a second face semantic segmentation result of the second image, wherein the second face semantic segmentation result includes: specific position information of each human face part in the second image;
and inputting the first image, the first face semantic segmentation result and the second image and the second face semantic segmentation result into a face reconstruction model which is trained in advance, so that the face reconstruction model which is trained in advance refers to the positions of all parts of the face region of the first image and the positions of all parts of the face region of the second image at the same time to reconstruct the first image, and a third image with a clear face region is obtained.
7. A model training apparatus, comprising:
the system comprises a first acquisition module, a second acquisition module and a third acquisition module, wherein the training sample set comprises a first image, a second image and a third image, the second image is an image obtained by adding noise to a face area of the first image, the face area in the third image and the face area in the first image belong to the same user, and the first image and the third image both comprise clear face areas;
a second obtaining module, configured to obtain a first face semantic segmentation result of the second image, where the first face semantic segmentation result includes: specific position information of each human face part in the second image;
a third obtaining module, configured to obtain a second face semantic segmentation result of the third image, where the second face semantic segmentation result includes: specific position information of each human face part in the third image;
an input module, configured to input the second image and the first face semantic segmentation result, and the third image and the second face semantic segmentation result into a neural network model, so that the neural network model simultaneously refers to positions of components in a face region of the second image and positions of components in a face region of the third image to reconstruct the second image, thereby obtaining a fourth image;
a recognition module for recognizing difference data on a face region between the fourth image and the first image, the recognition module comprising: a first identification submodule for identifying first loss data on image features between the fourth image and the first image; a second identification submodule, configured to identify second loss data on a target face component between the fourth image and the first image according to the first face semantic segmentation result; a third identifying submodule, configured to identify third loss data between the fourth image and the first image at a pixel point; the third obtaining submodule is used for carrying out weighted summation on the first loss data, the second loss data and the third loss data according to preset image characteristic weight, face part weight and pixel point weight to obtain difference data of the face area between the fourth image and the first image;
the updating module is used for carrying out iterative updating on the neural network model according to the difference data;
the neural network model after the iterative update is used for reconstructing a face region of any blurred image of the face region to generate a clear image of the face region.
8. The apparatus of claim 7,
the second acquisition module includes:
the first obtaining submodule is used for obtaining a second image matrix matched with the second image;
the first input submodule is used for inputting the second image matrix into a pre-trained first human face semantic segmentation model to obtain a first global human face semantic segmentation matrix matched with a plurality of human face components;
the third obtaining module includes:
the second obtaining submodule is used for obtaining a third image matrix matched with the third image;
the second input submodule is used for inputting the third image matrix into a second human face semantic segmentation model which is trained in advance to obtain a second global human face semantic segmentation matrix matched with a plurality of human face components;
the input module includes:
the first splicing submodule is used for performing matrix connection processing on the second image matrix and the first global human face semantic segmentation matrix to obtain first matrix data;
the second splicing submodule is used for performing matrix connection processing on the third image matrix and the second global face semantic segmentation matrix to obtain second matrix data;
and the third input submodule is used for inputting the first matrix data and the second matrix data into a neural network model to obtain a fourth image.
9. The apparatus of claim 8, wherein the first identification submodule comprises:
the first input unit is used for respectively inputting the fourth image and the first image into a pre-trained image feature extraction model to obtain image feature data of the fourth image and image feature data of the first image;
a first obtaining unit configured to obtain first loss data on an image feature between the fourth image and the first image according to a difference between the image feature data of the fourth image and the image feature data of the first image.
10. The apparatus of claim 8,
the second acquisition module includes:
the first obtaining submodule is used for obtaining a second image matrix matched with the second image;
the first input submodule is used for inputting the second image matrix into a pre-trained first human face semantic segmentation model to obtain a first global human face semantic segmentation matrix matched with a plurality of human face components;
the second identification submodule includes:
a second acquisition unit configured to acquire a fourth image matrix matched with the fourth image;
a third acquisition unit configured to acquire a first image matrix matched with the first image;
a fourth acquiring unit configured to acquire a difference matrix between the fourth image matrix and the first image matrix;
a fifth obtaining unit, configured to obtain a local face semantic segmentation matrix that is matched with a target face component in the first global face semantic segmentation matrix;
the processing unit is used for performing point multiplication operation on the local face semantic segmentation matrix and the difference matrix to obtain sub-loss data matched with the target face component;
and the sixth acquisition unit is used for summing the plurality of sub-loss data matched with the plurality of target face parts to obtain second loss data on the target face parts between the fourth image and the first image.
11. The apparatus of claim 7,
the updating module is further configured to iteratively update the neural network model according to the difference data until the difference data converges.
12. An image processing apparatus characterized by comprising:
the system comprises a first acquisition module, a reconstruction module and a second acquisition module, wherein the first acquisition module is used for acquiring a first image to be reconstructed, and the first image comprises a damaged face region;
the second acquisition module is used for acquiring a second image matched with the first image, wherein the face area in the second image and the face area in the first image belong to the same user, and the second image comprises a clear face area;
a third obtaining module, configured to obtain a first face semantic segmentation result of the first image, where the first face semantic segmentation result includes: specific position information of each human face part in the first image;
a fourth obtaining module, configured to obtain a second face semantic segmentation result of the second image, where the second face semantic segmentation result includes: specific position information of each human face part in the second image;
and the reconstruction module is used for inputting the first image, the first face semantic segmentation result and the second image and the second face semantic segmentation result into a face reconstruction model which is trained in advance, so that the face reconstruction model which is trained in advance refers to the positions of all parts of the face region of the first image and the positions of all parts of the face region of the second image at the same time to reconstruct the first image, and a third image with a clear face region is obtained.
13. An electronic device, comprising: memory, a processor and a model training program or an image processing program stored on the memory and executable on the processor, the model training program, when executed by the processor, implementing the steps of the model training method as claimed in any one of claims 1 to 5, the image processing program, when executed by the processor, implementing the steps of the image processing method as claimed in claim 6.
14. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a model training program or an image processing program, which when executed by a processor implements the steps in the model training method of any one of claims 1 to 5, and which when executed by the processor implements the steps of the image processing method of claim 6.
CN201910087661.5A 2019-01-29 2019-01-29 Model training method, image processing method, device, electronic equipment and storage medium Active CN110009573B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910087661.5A CN110009573B (en) 2019-01-29 2019-01-29 Model training method, image processing method, device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910087661.5A CN110009573B (en) 2019-01-29 2019-01-29 Model training method, image processing method, device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110009573A CN110009573A (en) 2019-07-12
CN110009573B true CN110009573B (en) 2022-02-01

Family

ID=67165600

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910087661.5A Active CN110009573B (en) 2019-01-29 2019-01-29 Model training method, image processing method, device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110009573B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144418B (en) * 2019-12-31 2022-12-02 北京交通大学 Railway track area segmentation and extraction method
CN112087272B (en) * 2020-08-04 2022-07-19 中电科思仪科技股份有限公司 Automatic detection method for electromagnetic spectrum monitoring receiver signal
CN112634282B (en) * 2020-12-18 2024-02-13 北京百度网讯科技有限公司 Image processing method and device and electronic equipment
CN113111817B (en) * 2021-04-21 2023-06-27 中山大学 Semantic segmentation face integrity measurement method, system, equipment and storage medium
CN113409207B (en) * 2021-06-15 2023-12-08 广州光锥元信息科技有限公司 Face image definition improving method and device
CN113781500B (en) * 2021-09-10 2024-04-05 中国科学院自动化研究所 Method, device, electronic equipment and storage medium for segmenting cabin image instance

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8254679B2 (en) * 2008-10-13 2012-08-28 Xerox Corporation Content-based image harmonization
US10176388B1 (en) * 2016-11-14 2019-01-08 Zoox, Inc. Spatial and temporal information for semantic segmentation
CN107180430A (en) * 2017-05-16 2017-09-19 华中科技大学 A kind of deep learning network establishing method and system suitable for semantic segmentation
CN108932456A (en) * 2017-05-23 2018-12-04 北京旷视科技有限公司 Face identification method, device and system and storage medium
WO2018232592A1 (en) * 2017-06-20 2018-12-27 Microsoft Technology Licensing, Llc. Fully convolutional instance-aware semantic segmentation
CN107564012A (en) * 2017-08-01 2018-01-09 中国科学院自动化研究所 Towards the augmented reality method and device of circumstances not known
CN108154086A (en) * 2017-12-06 2018-06-12 北京奇艺世纪科技有限公司 A kind of image extraction method, device and electronic equipment
CN108229591A (en) * 2018-03-15 2018-06-29 北京市商汤科技开发有限公司 Neural network adaptive training method and apparatus, equipment, program and storage medium
CN108875752A (en) * 2018-03-21 2018-11-23 北京迈格威科技有限公司 Image processing method and device, computer readable storage medium
CN108520247A (en) * 2018-04-16 2018-09-11 腾讯科技(深圳)有限公司 To the recognition methods of the Object node in image, device, terminal and readable medium
CN108596184A (en) * 2018-04-25 2018-09-28 清华大学深圳研究生院 Training method, readable storage medium storing program for executing and the electronic equipment of image, semantic parted pattern
CN108921161A (en) * 2018-06-08 2018-11-30 Oppo广东移动通信有限公司 Model training method, device, electronic equipment and computer readable storage medium
CN108932693A (en) * 2018-06-15 2018-12-04 中国科学院自动化研究所 Face editor complementing method and device based on face geological information

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CONTEXT AWARENESS IN GRAPH-BASED IMAGE SEMANTIC SEGMENTATION VIA VISUAL WORD DISTRIBUTIONS;Giuseppe Passino等;《WIAMIS》;20091231;第33-36页 *
基于卷积神经网络的图像语义分割;陈鸿翔;《中国优秀硕士学位论文全文数据库(信息科技辑)》;20160715(第07期);第I138-1091页 *

Also Published As

Publication number Publication date
CN110009573A (en) 2019-07-12

Similar Documents

Publication Publication Date Title
CN110008817B (en) Model training method, image processing method, device, electronic equipment and computer readable storage medium
CN110009573B (en) Model training method, image processing method, device, electronic equipment and storage medium
Xu et al. Learning to restore low-light images via decomposition-and-enhancement
Hai et al. R2rnet: Low-light image enhancement via real-low to real-normal network
Chen et al. Hdrunet: Single image hdr reconstruction with denoising and dequantization
Liang et al. Cameranet: A two-stage framework for effective camera isp learning
US20200134787A1 (en) Image processing apparatus and method
JP2020530920A (en) Image lighting methods, devices, electronics and storage media
CN111835983B (en) Multi-exposure-image high-dynamic-range imaging method and system based on generation countermeasure network
Afifi et al. Cie xyz net: Unprocessing images for low-level computer vision tasks
CN112053308B (en) Image deblurring method and device, computer equipment and storage medium
An et al. Single-shot high dynamic range imaging via deep convolutional neural network
CN113284061B (en) Underwater image enhancement method based on gradient network
CN113592726A (en) High dynamic range imaging method, device, electronic equipment and storage medium
Rasheed et al. LSR: Lightening super-resolution deep network for low-light image enhancement
CN113673675A (en) Model training method and device, computer equipment and storage medium
CN112509144A (en) Face image processing method and device, electronic equipment and storage medium
Zhang et al. Deep motion blur removal using noisy/blurry image pairs
CN116612015A (en) Model training method, image mole pattern removing method and device and electronic equipment
Xue et al. TC-net: transformer combined with cnn for image denoising
Yan et al. A lightweight network for high dynamic range imaging
Zheng et al. Windowing decomposition convolutional neural network for image enhancement
CN110557572B (en) Image processing method and device and convolutional neural network system
Liu et al. Halder: Hierarchical attention-guided learning with detail-refinement for multi-exposure image fusion
CN114298942A (en) Image deblurring method and device, computer readable medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant