CN113762117A - Training method of image processing model, image processing model and computer equipment - Google Patents


Info

Publication number
CN113762117A
Authority
CN
China
Prior art keywords: face image, age, image, training, real
Prior art date
Legal status
Granted
Application number
CN202110996242.0A
Other languages
Chinese (zh)
Other versions
CN113762117B (en)
Inventor
陈仿雄
Current Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Original Assignee
Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Shuliantianxia Intelligent Technology Co Ltd filed Critical Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Priority to CN202110996242.0A priority Critical patent/CN113762117B/en
Publication of CN113762117A publication Critical patent/CN113762117A/en
Application granted granted Critical
Publication of CN113762117B publication Critical patent/CN113762117B/en
Legal status: Active

Classifications

    • G06F18/253 Pattern recognition; analysing; fusion techniques of extracted features
    • G06N3/045 Neural networks; architecture; combinations of networks
    • G06N3/047 Neural networks; architecture; probabilistic or stochastic networks
    • G06N3/048 Neural networks; architecture; activation functions
    • G06N3/084 Neural networks; learning methods; backpropagation, e.g. using gradient descent


Abstract

The embodiment of the application relates to the technical field of image processing, and discloses a training method of an image processing model, an image processing model and computer equipment. After multiple rounds of training, the coding network learns the age characteristics of a plurality of persons in multiple age groups and represents the age characteristics of each age group as a code, i.e. the code for each age group is obtained by the coding network learning the age characteristics of the plurality of persons. The generative adversarial network fuses the codes of the same person in each age group with the face images so as to learn the differences in that person's age characteristics across age groups, so that the generated predicted images conform to individual characteristics. Furthermore, the loss function characterizes the coding loss between the first code and the second code, the feature loss between the real face image and the predicted face image, and the adversarial loss, where the feature loss between the real face image and the predicted face image allows the generator in the generative adversarial network to control the degree to which the face features are restored.

Description

Training method of image processing model, image processing model and computer equipment
Technical Field
The embodiment of the application relates to the technical field of image processing, in particular to a training method of an image processing model, the image processing model and computer equipment.
Background
As machine learning techniques continue to mature, the variety of services based on them keeps increasing. For example, a computer device can process a first face image through machine learning to obtain a second face image, where the first age corresponding to the face in the first face image differs from the second age corresponding to the face in the second face image, but both faces belong to the same person. Such services are in wide demand in many scenarios. In criminal investigation, the appearance of a child lost years ago is extrapolated from an existing photograph to aid the search, or the current appearance of a suspect who has been at large for years is predicted from an old photograph taken when the suspect was young. In film and television production, when an actor plays an elderly role, the actor's aged appearance is extrapolated from the actor's current appearance so that a make-up artist can design the elderly character. In leisure and entertainment, people want to recover their childhood appearance from a current photograph.
Based on a face age change operation issued by a user, the computer device usually processes an input first face image through a machine learning model to obtain the second face image. The face age change operation instructs the computer device to make the second age corresponding to the face in the second face image greater than, or less than, the first age corresponding to the face in the first face image. The machine learning model is trained on face images of different persons in different age groups.
When the first face image is processed in this way, the feature change is determined only by the age group, so the feature change is usually the same for every individual: when images of user A and user B, who are of the same age, are changed from the first age to the second age, the feature changes applied to A and B are identical, i.e. the feature change is uniform.
Disclosure of Invention
The technical problem mainly solved by the embodiments of the present application is to provide a training method of an image processing model, an image processing model and a computer device, wherein the image processing model obtained by the training method makes the feature changes caused by age changes conform to individual characteristics, so that aged images can be predicted, or young images traced back, more accurately.
In order to solve the above technical problem, in a first aspect, an embodiment of the present application provides a method for training an image processing model, where the image processing model includes a coding network and a generative adversarial network, and the method includes:

acquiring a real face image, a training face image and an expected age corresponding to the training face image, wherein the training face image and the real face image reflect the face of the same person, the real face image is labeled with an age group, the expected age falls within the age group labeled on the real face image, and the expected age is different from the age corresponding to the training face image;

performing feature coding on the real face image by using the coding network to obtain a first code, wherein the first code reflects the face features of the real face image at the expected age;

performing feature fusion on the first code and the training face image by using the generative adversarial network to obtain a predicted face image, wherein the predicted face image is an image generated by fusing the features of the first code into the training face image;

and iteratively training the image processing model with a loss function, returning to the step of acquiring a real face image, a training face image and an expected age corresponding to the training face image, until the image processing model converges, wherein the loss function characterizes the coding loss between the first code and a second code, the feature loss between the real face image and the predicted face image, and the adversarial loss, the second code is obtained by feature-coding the predicted face image with the coding network, and the adversarial loss is the loss calculated by the generative adversarial network.
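As a reading aid, the following is a minimal sketch of one training iteration implementing the four steps above. The module interfaces (coding network, generator, discriminator), the way the first code is indexed by age group, and the optimizer settings are illustrative assumptions; the concrete loss terms are detailed later in the description.

```python
import torch

def train_one_iteration(coding_net, generator, discriminator, optimizer,
                        real_face, train_face, expected_age_group, loss_fn):
    """One pass over a single (real face, training face, expected age) triple.

    real_face / train_face: image tensors of the same person, shape (1, 3, H, W).
    expected_age_group: index of the age group containing the expected age.
    loss_fn: callable implementing the total loss described later
             (coding loss + feature loss + adversarial loss).
    """
    # Step 2: feature-encode the real face image; keep the row of the
    # N x M code that corresponds to the expected age group (first code).
    first_code = coding_net(real_face)[:, expected_age_group]

    # Step 3: fuse the first code with the training face image to obtain
    # the predicted face image.
    predicted_face = generator(train_face, first_code)

    # The second code is the predicted face image encoded the same way.
    second_code = coding_net(predicted_face)[:, expected_age_group]

    # Step 4: compute the total loss and update the model parameters.
    total_loss = loss_fn(first_code, second_code, real_face, predicted_face,
                         discriminator)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()
```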
In some embodiments, the generative adversarial network comprises a generator, and the generator comprises a plurality of down-sampling layers, a plurality of depth layers and a plurality of up-sampling layers arranged in sequence;

the plurality of down-sampling layers output intermediate feature maps whose resolution decreases layer by layer, the plurality of depth layers output intermediate feature maps of constant resolution, and the plurality of up-sampling layers output intermediate feature maps whose resolution increases layer by layer;

performing feature fusion on the first code and the training face image by using the generator of the generative adversarial network to obtain the predicted face image comprises:

fusing the first code with the intermediate feature maps input into the plurality of up-sampling layers, respectively.
In some embodiments, an up-sampling layer includes a deconvolution (transposed convolution) layer and a fusion layer;

fusing the first code with the intermediate feature maps input into the plurality of up-sampling layers includes:

acquiring the resolution of a target intermediate feature map to be input into a target layer, wherein the target layer is the fusion layer of any one of the up-sampling layers;

performing a linear transformation on the first code according to the resolution of the target intermediate feature map to obtain a parameter matrix;

normalizing the target intermediate feature map to obtain a normalized target intermediate feature map;

and linearly transforming the normalized target intermediate feature map with the parameter matrix to obtain the intermediate feature map, fused with the first code, that is output by the target layer.
In some embodiments, linearly transforming the normalized target intermediate feature map with the parameter matrix to obtain the intermediate feature map fused with the first code output by the target layer includes:

acquiring a variable matrix and an offset matrix from the parameter matrix;

calculating the intermediate feature map, fused with the first code, output by the target layer with the following formula:

Y = (1 + D1) * y + D2;

wherein y is the normalized target intermediate feature map, D1 is the variable matrix, and D2 is the offset matrix.
In some embodiments, the aforementioned loss function is:

L = σ_style * L_style + σ_Ads * L_Ads + σ_res * L_res

wherein L_style is the coding loss, L_Ads is the adversarial loss, L_res is the feature loss, σ_style is the weight of the coding loss, σ_Ads is the weight of the adversarial loss, σ_res is the weight of the feature loss, x is the training face image, T is the age group of the expected age, S(Ys, T) is the first code, S(G(x, S(Ys, T)), T) is the second code, E denotes the expected value over the data distribution, D(x) is the probability that the training face image is judged real, D(G(x, S(Ys, T))) is the probability that the predicted face image is judged real, G(x, S(Ys, T)) is the predicted face image, and Ys is the real face image; mask_G is the label of each pixel in the predicted face image, which is 1 when the pixel lies in the facial-feature region (eyes, eyebrows, nose, mouth, etc.) and 0 otherwise; mask_Y is the label of each pixel in the real face image, which is 1 when the pixel lies in the facial-feature region and 0 otherwise.
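A possible implementation of this weighted sum is sketched below. The L1 form of the coding loss and the non-saturating GAN form of the adversarial loss are assumptions made for illustration, since the description only fixes the feature loss L_res explicitly (see below).

```python
import torch
import torch.nn.functional as F

def total_loss(first_code, second_code, real_face, predicted_face, discriminator,
               mask_real, mask_pred, w_style=1.0, w_adv=1.0, w_res=1.0):
    # Coding loss L_style between first and second code (L1 distance assumed).
    l_style = F.l1_loss(second_code, first_code)

    # Adversarial loss L_Ads from the discriminator (non-saturating GAN form assumed;
    # the discriminator is assumed to output a probability of being real).
    eps = 1e-8
    l_adv = -torch.log(discriminator(predicted_face) + eps).mean()

    # Feature loss L_res: masked L1 over the facial-feature regions.
    l_res = torch.abs(predicted_face * mask_pred - real_face * mask_real).sum()

    # L = sigma_style * L_style + sigma_Ads * L_Ads + sigma_res * L_res
    return w_style * l_style + w_adv * l_adv + w_res * l_res
```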
In some embodiments, before the step of iteratively training the image processing model by using the loss function, the method further includes:
acquiring the facial feature region of the real face image and the facial feature region of the predicted face image, respectively, using a face key-point algorithm;

and determining the feature loss between the real face image and the predicted face image according to the difference between the facial feature region of the real face image and that of the predicted face image.
In some embodiments, determining the feature loss between the real face image and the predicted face image according to the difference between the facial feature region of the real face image and the facial feature region of the predicted face image includes:
calculating the feature loss between the real face image and the predicted face image with the following formula:

L_res = || G(x, S(Ys, T)) * mask_G - Ys * mask_Y ||_1

wherein x is the training face image, T is the age group of the expected age, S(Ys, T) is the first code, G(x, S(Ys, T)) is the predicted face image, and Ys is the real face image; mask_G is the label of each pixel in the predicted face image, which is 1 when the pixel lies in the facial-feature region and 0 otherwise; mask_Y is the label of each pixel in the real face image, which is 1 when the pixel lies in the facial-feature region and 0 otherwise.
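A sketch of this masked feature loss is given below. The use of a third-party face-landmark detector and the way the facial-feature region is rasterized into a binary mask are assumptions for illustration.

```python
import torch

def feature_loss(predicted_face, real_face, mask_pred, mask_real):
    """L_res = || G(x, S(Ys, T)) * mask_G - Ys * mask_Y ||_1.

    mask_pred / mask_real: binary masks with value 1 at pixels inside the
    facial-feature (eyes, eyebrows, nose, mouth) regions obtained with a
    face key-point algorithm, and 0 elsewhere.
    """
    return torch.abs(predicted_face * mask_pred - real_face * mask_real).sum()
```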
In some embodiments, before the step of performing feature coding on the real face image by using the coding network to obtain the first code, the method further includes:
the real face image and the training face image are respectively preprocessed, so that the resolutions of the preprocessed real face image and the preprocessed training face image are both preset resolutions, the preprocessed real face image and the preprocessed training face image are both face region images, and the faces are both front faces.
In order to solve the above technical problem, in a second aspect, an embodiment of the present application provides an image processing method, including:
acquiring a face image to be processed and an expected age;
inputting the face image to be processed and the expected age into an image processing model trained with the method of the first aspect, and outputting an age-changed image in which the age reflected by the face conforms to the expected age.
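For illustration, a possible inference procedure is sketched below. Looking up a stored N x M code by the age group of the expected age follows the description of the coding given later; the function names and data layout are assumptions.

```python
import torch

def age_transform(generator, stored_codes, face_image, expected_age, age_group_bounds):
    """Generate an age-changed image for an input face image and an expected age.

    stored_codes: the trained N x M code (one row per age group), stored with the model.
    age_group_bounds: list of (low, high) pairs defining the N age groups.
    """
    # Find the age group that contains the expected age.
    group_index = next(i for i, (lo, hi) in enumerate(age_group_bounds)
                       if lo <= expected_age < hi)
    code = stored_codes[group_index].unsqueeze(0)   # code for the expected age, batch dim added

    with torch.no_grad():
        age_changed_image = generator(face_image, code)
    return age_changed_image
```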
In order to solve the above technical problem, in a third aspect, an embodiment of the present application provides a computer device, including:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of the first aspect as described above.
In order to solve the above technical problem, in a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for causing a computer device to perform the method according to the first aspect.
The beneficial effects of the embodiments of the application are as follows. Different from the prior art, in the training method of the image processing model provided in the embodiments of the present application, the image processing model includes a coding network and a generative adversarial network. For each training pass, a real face image is acquired, together with a training face image and an expected age corresponding to the training face image; the training face image and the real face image are face images of the same person, the expected age falls within the age group labeled on the real face image, and the expected age is different from the age corresponding to the training face image. For example, the age group labeled on the real face image is 55-60 years old, the expected age is 58 years old, and the person in the training face image is 40 years old. Then the coding network performs feature coding on the real face image to obtain a first code, the generative adversarial network performs feature fusion on the first code and the training face image to obtain a predicted face image, and finally a total loss is calculated with a loss function that characterizes the coding loss between the first code and the second code, the feature loss between the real face image and the predicted face image, and the adversarial loss. The model parameters are adjusted according to the total loss to complete one training pass; the image sample set is then traversed further, the real face image is replaced, and a new training pass is carried out, until the model converges and the image processing model is obtained. The second code is obtained by feature-coding the predicted face image with the coding network, and the adversarial loss is calculated by the generative adversarial network.
After multiple rounds of training, the coding network learns the age characteristics of a plurality of persons in multiple age groups and represents the age characteristics of each age group as a code, i.e. the code for each age group is obtained by the coding network learning the age characteristics of the plurality of persons. The generative adversarial network fuses the codes of the same person in each age group with the face images so as to learn the differences in that person's age characteristics across age groups, so that the generated predicted images conform to individual characteristics. Furthermore, the loss function characterizes the coding loss between the first code and the second code, the feature loss between the real face image and the predicted face image, and the adversarial loss. The feature loss between the real face image and the predicted face image allows the generator in the generative adversarial network to control the degree to which the face features are restored, i.e. the predicted face image and the real face image reflect the same person's identity and the predicted face image is not distorted; only the age features (the features affected by age change) are changed. In other words, the generator learns the age features better, which increases model accuracy.
Therefore, the trained image processing model and the codes corresponding to the plurality of age groups can be stored. During testing or application, the image processing model and the stored codes are loaded, the code corresponding to the input expected age is determined, and that code is fused with the face image to be processed, so that the age features of the face image to be processed change according to the code corresponding to the expected age and an age-changed image is generated. The age feature changes in the age-changed image conform to individual characteristics, so the image processing model can predict aged images, or trace back young images, more accurately.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals denote similar elements; the figures are not drawn to scale unless otherwise specified.
Fig. 1 is a schematic flowchart of a training method of an image processing model according to an embodiment of the present disclosure;
fig. 2 is a schematic network structure diagram of an image processing model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a network structure of an image processing model according to another embodiment of the present application;
FIG. 4 is a schematic diagram of a network structure of an image processing model according to another embodiment of the present application;
fig. 5 is a schematic structural diagram of an upsampling layer fusion first code of a generator according to an embodiment of the present application;
fig. 6 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 7 is a block diagram of a computer device according to an embodiment of the present application.
Detailed Description
The present application will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present application, but are not intended to limit the present application in any way. It should be noted that various changes and modifications can be made by one skilled in the art without departing from the spirit of the application, all of which fall within the scope of protection of the present application.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
It should be noted that, where they do not conflict, the various features of the embodiments of the present application may be combined with each other within the scope of protection of the present application. Additionally, although functional modules are divided in the device schematics and logical orders are shown in the flowcharts, in some cases the steps shown or described may be performed with a different module division or in a different order than shown. Further, the terms "first", "second", "third" and the like used herein do not limit the data or the execution order, and merely distinguish items that are substantially the same or similar in function and effect.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In addition, the technical features mentioned in the embodiments of the present application described below may be combined with each other as long as they do not conflict with each other.
To facilitate understanding of the method provided in the embodiments of the present application, first, terms referred to in the embodiments of the present application will be described:
(1) Neural network
A neural network is composed of neural units and can be understood as a network with an input layer, hidden layers and an output layer; generally the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The operation of each layer in the neural network can be described by the mathematical expression y = a(W·x + b). At a physical level, the operation of each layer can be understood as completing a transformation from an input space to an output space (i.e. from the row space to the column space of a matrix) through five operations on the input space (a set of input vectors): 1. raising/lowering the dimension; 2. enlarging/shrinking; 3. rotating; 4. translating; 5. "bending". Operations 1, 2 and 3 are performed by W·x, operation 4 by +b, and operation 5 by a(). The word "space" is used here because the objects being classified are not single objects but a class of objects, and the space refers to the set of all individuals of that class. W is the weight matrix of a layer of the neural network, and each value in the matrix represents the weight of one neuron of that layer. The matrix W determines the spatial transformation from the input space to the output space described above, i.e. W at each layer controls how the space is transformed. The purpose of training the neural network is ultimately to obtain the weight matrices of all layers of the trained network. The training process of a neural network is therefore essentially learning how to control the spatial transformations, and more specifically, learning the weight matrices.
It should be noted that the model adopted for the machine learning task in the embodiments of the present application is essentially a neural network. Common components of a neural network include convolution layers, pooling layers, normalization layers, deconvolution (transposed convolution) layers and the like; the model is designed by assembling these common components, and when the model parameters (the weight matrices of all layers) are determined such that the error of the model meets a preset condition, or the number of parameter adjustments reaches a preset threshold, the model converges.
A convolution layer is configured with a plurality of convolution kernels, and each convolution kernel has a corresponding stride, with which the convolution operation is applied to the image. The purpose of the convolution operation is to extract different features of the input image; the first convolution layer can only extract low-level features such as edges, lines and corners, while deeper convolution layers can iteratively extract more complex features from these low-level features.

A deconvolution (transposed convolution) layer is used to map a low-dimensional space to a higher-dimensional space while maintaining the connection pattern between them (the connection pattern here refers to the connectivity used during convolution). A deconvolution layer is configured with a plurality of convolution kernels, each with a corresponding stride, with which the deconvolution operation is applied to the image. Frameworks used for designing neural networks (e.g. the PyTorch library) generally also provide a built-in upsample() function, and a low-dimensional to high-dimensional spatial mapping can be realized by calling it.

Pooling is a process that mimics the human visual system: it reduces the size of the data or represents the image with higher-level features. Common pooling operations include max pooling, mean pooling, random pooling, median pooling, combined pooling and the like. Generally, pooling layers are periodically inserted between the convolution layers of a neural network to achieve dimensionality reduction.

A normalization layer performs a normalization operation on all neurons of an intermediate layer to prevent gradient explosion and gradient vanishing.
(2) Loss function
In the process of training a neural network, because the output of the neural network should be as close as possible to the value that is really expected to be predicted, the weight matrices of the layers can be updated according to the difference between the current network's predicted value and the truly desired target value (usually after an initialization process before the first update, i.e. parameters are pre-configured for each layer of the neural network). For example, if the network's prediction is too high, the weight matrices are adjusted to predict lower, and the adjustment continues until the neural network can predict the truly desired target value. It is therefore necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (or objective function), an important equation for measuring that difference. Taking the loss function as an example, a higher output value (loss) indicates a larger difference, so training the neural network becomes a process of reducing this loss as much as possible.
(3) Generative adversarial network
A Generative Adversarial Network (GAN) typically includes a generator (G) and a discriminator (D). Through the game played between the generator and the discriminator, unsupervised learning is realized. The generator takes a random sample from a latent space as input, and its output needs to imitate the real samples in the training set as closely as possible. The input to the discriminator is either a real sample or the output of the generator, and its goal is to distinguish the generator's output from the real samples as well as possible, while the generator should fool the discriminator as much as possible. The generator and the discriminator thus form an adversarial relationship and continuously adjust their parameters, until images indistinguishable from real ones can be generated and the training of the model is complete.
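A minimal sketch of the adversarial game described here, with standard alternating updates; the binary cross-entropy objective, latent dimension and module interfaces are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_batch, latent_dim=512):
    batch = real_batch.size(0)
    ones = torch.ones(batch, 1)    # target label "real"
    zeros = torch.zeros(batch, 1)  # target label "fake"

    # Discriminator step: tell real samples apart from generated ones.
    fake_batch = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = (F.binary_cross_entropy(discriminator(real_batch), ones) +
              F.binary_cross_entropy(discriminator(fake_batch), zeros))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: try to make the discriminator label fakes as real.
    fake_batch = generator(torch.randn(batch, latent_dim))
    g_loss = F.binary_cross_entropy(discriminator(fake_batch), ones)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```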
Before the embodiment of the present application is described, a simple description is first given to a currently common age-related image processing method based on machine learning, so that it is subsequently convenient to understand the embodiment of the present application.
The first method is as follows: several face images of different age groups, each labeled with an age, are used as the training set to train a conditional generative adversarial network that includes an image generator G, an image discriminator D, an age estimation network AEN and an identity recognition network FRN. G is trained to generate aged images; specifically, an aged image is generated automatically and efficiently from an input young image and a preset age condition. D is used to judge whether the generated aged image is a real image, ensuring that the generated aged image is convincing. The AEN is used to reduce the difference between the age of the generated aged image and the preset value, and the FRN is used to keep the identity of the portrait consistent during generation. During model training, the training set is grouped by age group, so that the conditional adversarial network learns the aging characteristics corresponding to each age group and realizes aging changes of the face.
In this first method, the aging characteristics are learned from each group of images, so the aging characteristics are determined only by the age group and are the same for every individual; consequently, the feature change applied to the face is the same for every piece of test data. In real life, however, the way each person's features change with age is influenced by individual factors and is specific to that person. The scheme of the first method therefore does not take individual differences into account and cannot generate aged images accurately.
In view of the above problems, embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
Referring to fig. 1, fig. 1 is a schematic flowchart of a training method of an image processing model provided in an embodiment of the present application, where the image processing model includes a coding network and a generative adversarial network. The method specifically includes the following steps:
S21: acquiring a real face image, a training face image and an expected age corresponding to the training face image, wherein the training face image and the real face image reflect the face of the same person, the expected age falls within the age group corresponding to the real face image, and the expected age is different from the age corresponding to the training face image.
The real face image is a face image of a person at a real age. For repeated training, an image sample set may be prepared, where the image sample set includes face images of several persons in multiple age groups, and each face image is labeled with an age group.

The image sample set includes face images of a plurality of persons in every age group, each face image being an image that contains face information. For example, the ages 10-80 are divided into 14 age groups at 5-year intervals, and face images of 100 persons are collected for each age group, i.e. the images corresponding to each age group include face images of the 100 persons at the corresponding ages; for example, the age group [50,55) includes face images taken when the 100 persons were aged [50,55). In this example, the image sample set contains 100 × 14 = 1400 face images.
Each face image is labeled with an age group indicating the real age of the face in the face image; for example, if face image 1# is a frontal photograph of the person "Zhang San" at age 38, the labeled age group of face image 1# may be "[35,40) years".
In a specific implementation, the frontal photographs of a plurality of individuals in each age group can be obtained from various public face databases to serve as the image sample set, and it is worth explaining that a plurality of frontal photographs of the same person cover each age group, that is, the number of face images of the same person is at least the group number of the age group.
During each training pass, a face image is randomly selected from the image sample set as the real face image, and the model is trained once with (real face image, training face image, expected age) as a set of training data. It is understood that the number of training passes required for model convergence is usually in the hundreds or thousands; when one pass is completed, the image sample set is traversed further, a face image is reselected as a new real face image, and the corresponding new training face image and expected age are obtained, i.e. (real face image', training face image', expected age') is taken as another set of training data for the next pass. This is repeated until the image sample set has been traversed, and the next round of training is performed. In the next round, the same image sample set may still be used as training data, or other separately prepared image sample sets that conform to the above description may be used, until the image processing model converges.
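The sampling procedure described above can be summarized as the following sketch; the dataset record layout and the way a companion image of the same person at a different age is drawn are illustrative assumptions.

```python
import random

def sample_training_triple(image_sample_set):
    """image_sample_set: list of records, each with fields
    person_id, age_group (e.g. (50, 55)), age and image."""
    real = random.choice(image_sample_set)            # real face image
    low, high = real["age_group"]
    expected_age = random.randint(low, high - 1)      # expected age within that age group

    # Training face image: same person, different age group.
    candidates = [r for r in image_sample_set
                  if r["person_id"] == real["person_id"]
                  and r["age_group"] != real["age_group"]]
    train = random.choice(candidates)
    return real["image"], train["image"], expected_age
```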
Since the processing procedure of each set of training data (real face image, training face image, expected age) is the same in one training process, the embodiment of the present application introduces the training procedure of the image processing model with the processing procedure of one set of training data.
In a set of training data, the real face image is a face image in the image sample set, and the real age of the face in the real face image is within the labeled age group; for example, if the real face image is a frontal photograph of the person "Li Si" at 50 years old, the labeled age group may be [50,55).

Then a face image of the person "Li Si" in another age group, for example a photograph of "Li Si" at the age of 32, is acquired as the training face image. It is understood that the training face image and the real face image reflect the face of the same person, i.e. both are photographs of the same person that contain face information. The expected age is used to indicate that a face image of the person at the expected age is to be generated based on the training face image. During training, the expected age is different from the age corresponding to the training face image, and, in order to train the model with the real face image as the real data, the expected age is within the age group corresponding to the real face image; continuing the above example, the expected age for the person "Li Si" may be 50, 51, 52, 53 or 54 years. It is understood that, during testing or use of the model, the expected age is the expected age entered by the user.

It will be appreciated that in the training phase the training face image may be a face image prepared separately by a person skilled in the art, for example an additional photograph of "Li Si" at the age of 32. In some embodiments, a face image of "Li Si" in another age group of the image sample set may also be selected as the training face image; for example, when the real face image is the face image of the person "Li Si" in the age group [50,55), the training face image may be the face image of the person "Li Si" in the age group [30,35).
It is understood that the three items of data (real face image, training face image, expected age) may be acquired in any order. In some embodiments, the training face image and the expected age may be obtained first, and then the real face image of the person in the age group containing the expected age may be found in the image sample set according to the expected age.
It is understood that when the expected age is greater than the age corresponding to the training face image, it is equivalent to using the image processing model to predict the aging face image, and when the expected age is less than the age corresponding to the training face image, it is equivalent to tracing the young face image using the image processing model. In the training process, the expected age can be selected, so that the trained image processing model has the function of predicting the aged face image and/or the function of tracing the young face image.
In some embodiments, the real face image is preprocessed to obtain a preprocessed real face image, and the training face image is preprocessed to obtain a preprocessed training face image, wherein the preprocessed real face image and the preprocessed training face image both have a preset resolution, are both face region images, and both contain frontal faces.

Before training, the real face image and the training face image are each preprocessed so that the two processed images are structured (each image shows approximately the same face part in the same place; for example, the coordinate position of the eyes is approximately the same in each image), which helps the model converge. Specifically, the real face image is preprocessed to obtain the preprocessed real face image, whose resolution (i.e. size) is the preset resolution; the preprocessed real face image is a face region image, i.e. an image that contains only the face region and no background. In addition, the face in the preprocessed real face image is a frontal face, which means the face is not turned left or right: the central axis of the face approximately coincides with the central axis of the image, and the angle between the two axes is approximately zero.

Similarly, the training face image is preprocessed to obtain the preprocessed training face image, whose resolution is also the preset resolution, i.e. the preprocessed training face image has the same size as the preprocessed real face image; the preprocessed training face image is also a face region image, i.e. it contains only the face region and no background, and the face in it is also a frontal face.
In some embodiments, the preset resolution may be 1024 × 1024. In other embodiments, the preset resolution may be set by a person skilled in the art according to actual situations, and is not limited herein, so that the preprocessed training face image and the preprocessed real face image are structured face images.
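A preprocessing sketch under these constraints is shown below; it only crops the face region and resizes it to the preset resolution, and it assumes a face box from any face detector (detection itself, and any rotation needed to make the face frontal, are outside this sketch).

```python
import cv2

PRESET_RESOLUTION = 1024  # one possible choice mentioned above

def preprocess_face(image_bgr, face_box):
    """Crop the face region and resize it to the preset resolution.

    face_box: (x, y, w, h) from any face detector (assumed input).
    """
    x, y, w, h = face_box
    face = image_bgr[y:y + h, x:x + w]                       # keep only the face region
    face = cv2.resize(face, (PRESET_RESOLUTION, PRESET_RESOLUTION),
                      interpolation=cv2.INTER_LINEAR)
    return face
```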
It is to be understood that the real face image in the following description may be a preprocessed real face image, and the training face image in the following description may be a preprocessed training face image, which are collectively referred to as a real face image and a training face image hereinafter for the sake of description.
S22: and performing feature coding on the real face image by adopting a coding network to obtain a first code, wherein the first code reflects the face features of the real face image at an expected age.
The coding network is a neural network used to obtain a vector, i.e. a code, that characterizes the face features of the input image; a neural network that converts an image into a code is therefore called a coding network. Its main roles are feature extraction and dimensionality reduction, so that the code can characterize the face features of the input image (the real face image). In this embodiment, the first code better represents the face features of the real face image at the expected age, which helps the subsequent generator generate the predicted image accurately based on the first code and helps the subsequent discriminator judge the reliability of the predicted image. Because the real face image is the real face image of an individual in a certain age group, under the effect of the loss function the coding network learns the age characteristics of the individual in that age group, so the first code can reflect those age characteristics. Age characteristics are features affected by age change, such as wrinkles and sagging of the apple cheeks. It should be noted that the first codes generated from the same real face image at different expected ages are different, so the feature changes of the face at different ages can be reflected more accurately.
In some embodiments, the real face image may be size-compressed in advance, for example, to 256 × 256, and then input into the encoding network.
In some embodiments, the coding network, having a neural-network structure, includes an input layer, a plurality of hidden layers and an output layer, where the input layer is the first layer of the coding network and the output layer is the last layer. The input layer may be a convolution layer, a hidden layer includes a convolution layer, a pooling layer and a normalization layer, and the output layer may include an activation-function layer or a convolution layer. As described in (1) above, convolution layers, pooling layers and normalization layers are all common components of a neural network. It can be understood that in the coding network the output of one layer is the input of the next, and the real face image undergoes multiple spatial transformations and is gradually reduced in size until the output layer outputs the first code (a vector of a specific size). The first code is obtained by extracting the features of the real face image, and the age group labeled on the real face image covers the expected age, so the first code reflects the face features of the real face image at the expected age.
In some embodiments, the coding network is structured as shown in Table 1, including a 1 × 1 convolution layer, hidden layers composed of convolution layers, pooling layers and normalization layers, LReLU (leaky ReLU) activation-function layers, and a 4 × 4 convolution layer. The output first code has dimensions N × 512, where N is the number of age groups. For example, when there are 14 age groups, N is 14 and the first code is a vector of dimensions 14 × 512. Each age group corresponds to one row of the N × 512 first code, i.e. the code for one age group is a 1 × 512 vector; for example, the first row is the code for the [10,15) age group and the second row is the code for the [15,20) age group. In one training pass mainly one row is adjusted; after the whole image sample set has been traversed, i.e. after the face images of all age groups have been trained on, every row vector in the N × 512 first code has been adjusted, and each row vector can characterize the age characteristics of the plurality of persons of the corresponding age group in the image sample set. Since the persons are the same in every group, each row vector in the N × 512 first code characterizes the age characteristics of the same plurality of persons in one age group. It is understood that the dimension "512" is only an illustrative example and other dimensions are possible; for convenience, the dimensions of the first code are written as N × M, where M can be set flexibly by a person skilled in the art.
It is understood that the first code is an N × M vector, where "first" merely distinguishes it from the subsequent "second" and has no further meaning. In essence, this N × M vector is the N × M code. After training is completed, the trained N × M code is obtained, and each row vector in it characterizes the age characteristics of the same plurality of persons in one age group. The N × M code is stored together with the trained image processing model, i.e. it corresponds to a part of the image processing model, and can be loaded when the trained image processing model is invoked, to help determine the code at a desired age.
TABLE 1: structure of the coding network (the table is rendered as an image in the original publication and its contents are not reproduced here)
Those skilled in the art will appreciate that the configuration of each convolution layer in Table 1 (e.g. the size, number and stride of the convolution kernels) may be set according to the actual situation and is not described in detail here. It is understood that the structure of the coding network in Table 1 is merely an exemplary illustration, and a person skilled in the art can design a neural network themselves, as long as it fulfils the function of obtaining the code from the image through feature extraction.
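The layer types listed above can be assembled as in the following sketch. The channel widths, kernel sizes and number of hidden blocks are assumptions, since Table 1 is only reproduced as an image in the original publication; only the 1 × 1 input convolution, the LReLU activations, the 4 × 4 output convolution and the N × 512 output shape follow the text.

```python
import torch
import torch.nn as nn

class CodingNetwork(nn.Module):
    """Maps a real face image to an N x M code (N age groups, M = 512 here)."""

    def __init__(self, num_age_groups=14, code_dim=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=1),                          # 1 x 1 input convolution
            # hidden blocks: convolution + pooling + normalization + LReLU
            nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
            nn.AvgPool2d(2),
            nn.InstanceNorm2d(128),
            nn.LeakyReLU(0.2),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
            nn.AvgPool2d(2),
            nn.InstanceNorm2d(256),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(4),
            nn.Conv2d(256, num_age_groups * code_dim, kernel_size=4),  # 4 x 4 output convolution
        )
        self.num_age_groups = num_age_groups
        self.code_dim = code_dim

    def forward(self, x):
        out = self.features(x)                      # (B, N*M, 1, 1)
        return out.view(-1, self.num_age_groups, self.code_dim)
```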
S23: and performing feature fusion on the first code and the training face image by adopting a generating type countermeasure network to obtain a predicted face image, wherein the predicted face image is an image generated by fusing the feature of the first code into the training face image.
Based on the first code and the training face image, the generative adversarial network fuses the age characteristics of the real face image reflected by the first code into the training face image, so that the generated predicted face image is close to the real face image.
It is understood that the generative adversarial network comprises a generator, which is also a neural network and has the structure of one, i.e. it includes a plurality of layers that perform spatial transformations; apart from the input layer and the output layer, the other layers may be called intermediate layers, and for these layers the input of each layer is the output of the previous layer; the feature map output by an intermediate layer is called an intermediate feature map. In the process of generating the predicted face image, at least one intermediate feature map is fused with the first code or with a transformation of the first code, where fusion means performing a linear or nonlinear operation between the intermediate feature map and the first code (or its transformation), so that the training face image changes in the feature direction indicated by the first code and the predicted face image is generated. For example, if the expected age is 70, the real face image reflected by the first code is an image of the person in the age group [70,75); the real face image of the person at that age contains wrinkles, so the features represented by the first code also include wrinkles; the training face image is a face image of the person at age 30; fusing the first code, which represents wrinkles, into the training face image makes the training face image develop towards wrinkles, and the generated predicted face image of the person at age 70 also contains wrinkles.
In this embodiment, after the image sample set has been traversed and multiple rounds of training have been performed, the generative adversarial network has fused the codes of the same person in each age group with the face images, and has thus learned the differences in that person's age characteristics across age groups, so the generated predicted image conforms to individual characteristics.
In some embodiments, as shown in fig. 2, the generator includes a plurality of down-sampling layers, a plurality of depth layers and a plurality of up-sampling layers arranged in sequence. The plurality of down-sampling layers output intermediate feature maps whose resolution decreases layer by layer. The down-sampling layers form one part of the generator's layers, part1; each layer of part1 includes a convolution layer, a pooling layer and a normalization layer, and any layer of part1 down-samples the input intermediate feature map to obtain an intermediate feature map of reduced resolution. Because the role of part1 is to perform down-sampling, its layers are called down-sampling layers. Feature extraction is achieved through the plurality of down-sampling layers without losing features.
Another part of the generator's layers, part2, performs feature extraction on the intermediate feature map output by the last down-sampling layer and produces intermediate feature maps of constant resolution. Each layer of part2 includes a convolution layer and a normalization layer; it extracts features without changing the image size, deepening the generator so that it can learn better feature information. Because the role of part2 is to extract features and deepen the generator, its layers are called depth layers. Deepening the network through the plurality of depth layers helps the generator learn the feature information.
Then another part of the generator's layers, part3, up-samples the intermediate feature map output by the last depth layer and outputs intermediate feature maps whose resolution increases layer by layer. Each layer of part3 includes a deconvolution (transposed convolution) layer and a normalization layer; the up-sampling is performed by the deconvolution layer, so the resolution of the intermediate feature map output by each layer of part3 increases layer by layer. Because the role of part3 is up-sampling, i.e. restoring the intermediate feature maps output by the down-sampling layers and the depth layers (restoring the features they have learned), its layers are called up-sampling layers. The features learned by the down-sampling layers and the depth layers are restored through the plurality of up-sampling layers.
In this embodiment, through a plurality of down-sampling layers, a plurality of depth layers and a plurality of up-sampling layers, the training face image is subjected to feature learning and feature restoration in sequence, and in the process, the first encoding is fused, so that the training face image is changed according to the features indicated by the first encoding, and finally, a predicted image is generated.
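A skeleton of the generator with the three groups of layers is sketched below; the channel counts and number of blocks are assumptions, and the fusion layer (CodeFusionLayer) referenced in the up-sampling block is sketched at the end of this section together with the fusion formula.

```python
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    # convolution + pooling + normalization: halves the resolution
    def __init__(self, cin, cout):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(cin, cout, 3, padding=1),
            nn.AvgPool2d(2),
            nn.InstanceNorm2d(cout),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class DepthBlock(nn.Module):
    # convolution + normalization: keeps the resolution, deepens the network
    def __init__(self, c):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1),
            nn.InstanceNorm2d(c),
            nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.body(x)

class UpBlock(nn.Module):
    # transposed convolution + fusion layer: doubles the resolution and
    # fuses the first code into the intermediate feature map
    def __init__(self, cin, cout, code_dim=512):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(cin, cout, 4, stride=2, padding=1)
        self.fuse = CodeFusionLayer(cout, code_dim)   # sketched at the end of this section
    def forward(self, x, code):
        return self.fuse(self.deconv(x), code)

class Generator(nn.Module):
    def __init__(self, code_dim=512):
        super().__init__()
        self.down = nn.ModuleList([DownBlock(3, 64), DownBlock(64, 128), DownBlock(128, 256)])
        self.depth = nn.ModuleList([DepthBlock(256) for _ in range(4)])
        self.up = nn.ModuleList([UpBlock(256, 128, code_dim), UpBlock(128, 64, code_dim),
                                 UpBlock(64, 3, code_dim)])
    def forward(self, x, code):
        for layer in self.down:
            x = layer(x)
        for layer in self.depth:
            x = layer(x)
        for layer in self.up:
            x = layer(x, code)       # first code fused at every up-sampling layer
        return torch.tanh(x)
```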
In this embodiment, the step S23 specifically includes:

S231: fusing the first code with the intermediate feature maps input into the plurality of up-sampling layers, respectively.
Specifically, as shown in fig. 3, during feature restoration the first code is fused with the intermediate feature map input into each up-sampling layer: the intermediate feature map input into each up-sampling layer undergoes a linear or nonlinear operation with the first code (or a transformation of it), so that the intermediate feature map output by each up-sampling layer is fused with the features indicated by part of the first code. It will be appreciated that the granularity of the features reflected by an intermediate feature map matches the granularity of the part of the first code with which it is fused.
In some embodiments, the step S23 specifically includes: fusing the first code with the intermediate feature maps input into the plurality of up-sampling layers and the intermediate feature maps input into at least some of the depth layers, respectively.

In this embodiment, the first code is fused while at least some of the depth layers are learning features and the up-sampling layers are restoring them, as shown in fig. 4. The fusion process of each layer is identical to that described above. It will be appreciated that, given the flexibility of neural-network design, a person skilled in the art can select at least two layers among the depth layers and up-sampling layers for feature fusion.

In this embodiment, fusing the first code with the intermediate feature map input into each up-sampling layer during feature restoration does not affect the generative adversarial network's feature learning on the training face image, and helps output more accurate predicted images.
In some embodiments, as shown in fig. 5, feature fusion is performed in the normalization layer of the upsampling layer, that is, the normalization layer performs linear calculation or nonlinear calculation on the input intermediate feature map and the first encoding or the deformation of the first encoding, so that the intermediate feature map output by the normalization layer is fused with the features indicated by the first encoding. It is to be appreciated that the normalization layer implements a fusion operation, and thus, may be referred to as a fusion layer. That is, in this embodiment, one up-sampling layer includes the reverse convolution layer and the fusion layer.
In this embodiment, the step S231 specifically includes:
and S2311, acquiring the resolution of a target intermediate characteristic diagram for inputting a target layer, wherein the target layer is a fusion layer in any one up-sampling layer.
And S2312, performing linear transformation on the first code according to the resolution of the target intermediate characteristic diagram to obtain a parameter matrix.
And S2313, normalizing the target intermediate characteristic diagram to obtain the normalized target intermediate characteristic diagram.
S2314: and performing linear transformation on the target intermediate characteristic diagram and the parameter matrix after the normalization processing to obtain an intermediate characteristic diagram which is output by the target layer and is fused with the first code.
Here, the fusion process will be described by taking as an example a fusion layer (i.e., a target layer) in any one of the up-sampling layers, and first, a target intermediate feature map V for input to the target layer and the resolution, i.e., the size, of the target intermediate feature map V are acquired. And linearly changing the first code according to the resolution of the target intermediate characteristic diagram V to obtain a parameter matrix. Specifically, the first code is linearly changed using the following formula:
D = S*A^T + b;
wherein S is the first code, A is a variable (A^T is its transpose), b is a deviation value, and D is the parameter matrix. The first code is linearly transformed in this way so that the size of the parameter matrix is adapted to the resolution of the target intermediate feature map.
Next, recall that the normalization layer having the fusion function is referred to as a fusion layer; the fusion layer therefore retains the normalization function and normalizes the target intermediate feature map, that is, normalization can be performed according to the following formula:
y = (V − μ)/(σ + ε);
wherein V is the target intermediate feature map, μ is the mean value of V, σ is the standard deviation of V, and y is the target intermediate feature map after normalization processing. ε is a very small value (1e-5 by default) used to prevent a division-by-zero exception when the standard deviation is 0.
And then, carrying out linear change on the target intermediate characteristic diagram y and the parameter matrix D after the normalization processing to obtain an intermediate characteristic diagram fused with the characteristics indicated by the first codes.
In this embodiment, the normalized target intermediate feature map is fused with the parameter matrix obtained by transforming the first code. Compared with fusing first and then normalizing, this reduces the amount of calculation, because the pixel values of the normalized target intermediate feature map lie between 0 and 1, which is beneficial to improving the operation speed.
In some embodiments, step S2314 specifically includes:
A. acquiring a variable matrix and an offset matrix according to the parameter matrix;
B. calculating the intermediate feature fused with the first code output by the target layer by adopting the following formula:
Y=(1+D1)*y+D2;
wherein Y is the intermediate feature, fused with the first code, output by the target layer, y is the target intermediate feature after the normalization processing, D1 is the variable matrix, and D2 is the offset matrix.
It is to be understood that, in step a, the parameter matrix D is split into the variable matrix D1 and the offset matrix D2, so that the parameter matrix D can be linearly transformed. In some embodiments, a first dimension of the parameter matrix D may be set to be the same as a first dimension of the target intermediate feature map V, and a second dimension of the parameter matrix D is twice the second dimension of the target intermediate feature map V. For example, if the dimension of the target middle feature map V is 18 × 18, the dimension of the parameter matrix D is 18 × 36.
In some embodiments, the first half (i.e., the first 18 × 18) of the parameter matrix D is used as the variable matrix D1, and the second half (i.e., the last 18 × 18) of the parameter matrix D is used as the offset matrix, so that the variable matrix D1, the offset matrix D2, and the normalized target intermediate feature y have the same size.
In this embodiment, the parameter matrix is split into the variable matrix and the offset matrix, and then the above formula is used to perform linear fusion in a simple manner, which is convenient for calculation and processing.
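A minimal sketch of the fusion layer in steps S2311-S2314 follows. It assumes the "linear change" of the first code is implemented by a learned fully connected mapping and that the parameter matrix is split down the middle into D1 and D2; the class name, shapes and this implementation choice are illustrative assumptions, not the definitive implementation.

# Sketch of a fusion layer (steps S2311-S2314) in PyTorch; shapes and the
# use of nn.Linear for the linear change of the first code are assumptions.
import torch.nn as nn

class FusionLayer(nn.Module):
    def __init__(self, code_dim, height, width, eps=1e-5):
        super().__init__()
        self.h, self.w, self.eps = height, width, eps
        # D = S*A^T + b: linear change of the first code S into a parameter
        # matrix D whose second dimension is twice that of the feature map (H x 2W)
        self.to_params = nn.Linear(code_dim, height * 2 * width)

    def forward(self, v, s):
        # v: target intermediate feature map (N, C, H, W); s: first code (N, code_dim)
        # S2313: normalize the target intermediate feature map, y = (V - mean)/(std + eps)
        mean = v.mean(dim=(2, 3), keepdim=True)
        std = v.std(dim=(2, 3), keepdim=True)
        y = (v - mean) / (std + self.eps)
        # S2312: linear change of the first code into the parameter matrix D (N, 1, H, 2W)
        d = self.to_params(s).view(-1, 1, self.h, 2 * self.w)
        # step A: split D into the variable matrix D1 and the offset matrix D2
        d1, d2 = d[..., :self.w], d[..., self.w:]
        # step B / S2314: Y = (1 + D1)*y + D2, broadcast over the channel dimension
        return (1 + d1) * y + d2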
It can be understood that the generative countermeasure network further comprises a discriminator. In the process of fusing the first code with the training face image, the discriminator calculates the countermeasure loss of the predicted face image, and the countermeasure loss is used to represent the degree of similarity between the predicted face image and the training face image.
It can be understood that the countermeasure loss measures whether the predicted face image looks like a training face image (a real face image): when the countermeasure loss is large, the distribution of the predicted face image differs greatly from that of the training face image; when the countermeasure loss is small, the difference is small and the distributions are similar. Here, the distribution of a face image refers to the distribution of the five sense organs, such as the eye distance, the forehead width, the face shape, and the like.
In some embodiments, the discriminator comprises 4 convolutional layers. Each of the first 3 convolutional layers is followed by a normalization layer and an LReLU activation function layer, and the last convolutional layer is followed by a Sigmoid activation function layer, which converts the previously learned features into a score representing the confidence that the predicted face image is a real image; the higher the score, the closer the distribution of the predicted face image is to that of the training face image.
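As a rough, non-authoritative sketch of such a discriminator (the channel widths, kernel sizes and strides below are assumptions):

# Illustrative sketch of the discriminator: 4 convolutional layers, the first
# 3 followed by normalization and LeakyReLU, the last by Sigmoid.
import torch.nn as nn

def build_discriminator(in_ch=3, base_ch=64):
    return nn.Sequential(
        nn.Conv2d(in_ch, base_ch, 4, stride=2, padding=1),
        nn.InstanceNorm2d(base_ch), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base_ch, base_ch * 2, 4, stride=2, padding=1),
        nn.InstanceNorm2d(base_ch * 2), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base_ch * 2, base_ch * 4, 4, stride=2, padding=1),
        nn.InstanceNorm2d(base_ch * 4), nn.LeakyReLU(0.2, inplace=True),
        nn.Conv2d(base_ch * 4, 1, 4, stride=1, padding=1),
        nn.Sigmoid(),  # score: confidence that the input is a real face image
    )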
And in order to calculate the coding loss, the coding network is adopted to carry out feature coding on the predicted face image to obtain a second code.
In order to train the coding network and calculate the coding loss output by the coding network, the coding network is adopted to perform characteristic coding on the predicted face image to obtain a second code. Specifically, the process of encoding the features of the predicted face image by the encoding network is the same as the process of encoding the real face image by the encoding network in step S22, and details are not repeated here.
It can be understood that the closer the predicted image is to the corresponding real face image, the more similar the second encoding is to the first encoding, and the higher the accuracy of the encoding network and the generative countermeasure network.
S24: and performing iterative training on the image processing model by using a loss function, and returning to the step S21 until the image processing model converges, wherein the loss function is used for representing the coding loss between the first coding and the second coding, the characteristic loss between the real face image and the predicted face image, and the countermeasure loss.
It can be understood that the second coding is a coding obtained by performing feature coding on the predicted face image by using a coding network, and the countermeasure loss is a loss calculated by a generative countermeasure network.
The above steps S21-S24 are a training process of the training device for a face image in the image sample set, after obtaining the predicted face image and the second code, the training device further calculates the loss by using a loss function, then adjusts parameters of the image processing model (including parameters of the coding network and parameters of the generative confrontation network) according to the loss, then returns to step S21, repeats steps S21-S24, and performs iterative training until the image processing model converges. The model convergence condition may be that the loss is less than a preset value, or fluctuates within a preset range, or the number of training times reaches a preset number.
And the loss function is used for representing the coding loss between the first coding and the second coding, the characteristic loss between the real face image and the predicted face image and the antagonistic loss, so that the loss calculated by adopting the loss function comprises the coding loss, the characteristic loss and the antagonistic loss.
It will be appreciated that, under the effect of the coding loss, back propagation over multiple training rounds enables the coding network to learn the age characteristics of real face images at an expected age, i.e., characteristics based on the expected age. Because the image sample set comprises the face images of a plurality of people in all age groups, after multiple rounds of training the coding network can learn the age characteristics of the same person across age groups and of different people within one age group, so that the coding network can output N × M codes reflecting the age characteristics of all people in all age groups. The N × M codes are stored together with the trained image processing model, i.e., they correspond to a part of the image processing model, and may be invoked when the trained image processing model is invoked to help determine the code at a desired age.
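For instance, looking up the stored codes by expected age could be as simple as the following sketch; the file name and the age-group boundaries here are hypothetical and only illustrate the idea of selecting one row of the N × M codes.

# Hypothetical lookup of the stored N x M codes by expected age: each of the
# N rows is the learned code (length M) for one age group.
import numpy as np

age_codes = np.load("age_codes.npy")           # assumed file, shape (N, M)
age_bins = [0, 10, 20, 30, 40, 50, 60, 120]     # assumed age-group boundaries

def code_for_expected_age(expected_age):
    # find the age group containing the expected age and return its row vector
    group = np.searchsorted(age_bins, expected_age, side="right") - 1
    group = int(np.clip(group, 0, age_codes.shape[0] - 1))
    return age_codes[group]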
Under the effect of the feature loss, back propagation over multiple training rounds makes the predicted image generated by the generator in the generative countermeasure network similar in features (for example, the five sense organ features) to the corresponding real face image. The degree of restoration of the predicted image relative to the real face image, apart from the age features (for example, the restoration of the five sense organ features), can therefore be controlled more finely: the predicted face image and the real face image reflect the identity of the same person, the predicted face image is not distorted, and only the age features (features affected by age change) are changed. In other words, the generator can better learn the age features, which improves model accuracy.
Under the effect of the countermeasure loss, back propagation over multiple training rounds makes the distribution of the predicted image generated by the generator in the generative countermeasure network similar to that of the training face image, so that the generative countermeasure network can accurately fuse the training face image with the first code. In addition, because the image sample set includes the face images of a plurality of people in all age groups, the generative countermeasure network can learn the feature differences of the same person in different age groups and output an accurate predicted image.
Therefore, in the embodiment of the application, after undergoing multiple rounds of training, the coding network can learn the age characteristics of multiple persons in multiple age groups, and characterize the age characteristics corresponding to each age group in a coding form, that is, the code of each age group is obtained by learning the age characteristics of the multiple persons through the coding network; the generation type countermeasure network fuses the codes of the same person in all age groups with the face images to learn the age characteristic difference of the same person in different age groups, so that the generated predicted images conform to the individual characteristics. Furthermore, the loss function characterizes coding loss between the first coding and the second coding, feature loss between the real face image and the predicted face image, and countermeasure loss, wherein the feature loss between the real face image and the predicted face image enables a generator in the generative countermeasure network to control the degree of reduction of the face feature, i.e., enables the predicted face image and the real face image to reflect the same person identity, the predicted face image is not distorted, only the age feature (feature affected by age change) is changed, i.e., enables the generator to better learn the age feature, and increases model accuracy.
Therefore, the trained image processing model and codes corresponding to a plurality of age groups can be stored, when testing or application is carried out, the image processing model and the codes corresponding to the plurality of age groups are called, the corresponding codes are determined according to the input expected age, the codes corresponding to the expected age are fused with the face image to be processed, so that the face image to be processed is subjected to age characteristic change according to the codes corresponding to the expected age, an age change image is generated, the age characteristic change in the age change image accords with individual characteristics, and the image processing model can predict aging images or traced young images more accurately.
In some embodiments, before step S24, the method further includes:
step S31: and respectively acquiring the five sense organ regions of the real face image and the five sense organ regions of the predicted face image by adopting a face key point algorithm.
Step S32: and determining the characteristic loss between the real face image and the predicted face image according to the difference between the facial features of the real face image and the facial features of the predicted face image.
In this embodiment, the feature loss between the real face image and the predicted face image mainly includes the feature difference of five sense organs. The generator can better learn the facial features of the real face image through the facial feature difference between the real face image and the predicted face image, and therefore the facial features of the predicted face image can approach the facial features of the real face image infinitely.
Specifically, a plurality of key points of the face of the human face can be located according to the human face key point algorithm, wherein the key points comprise points of the areas such as eyebrows, eyes, a nose, a mouth, a face contour and the like. Thus, from these key points, the five sense organ regions are determined. The five sense organ regions include the eye region, the eyebrow region, the nose region, the mouth region, and the ear region. Therefore, the key point calculation is respectively carried out on the real face image and the predicted face image by adopting a face key point algorithm, the facial features region is obtained according to the key point of the real face image, and the facial features region is obtained according to the key point of the predicted face image.
The face key point algorithm may be Active Appearance Models (AAMs), Constrained Local Models (CLMs), Explicit Shape Regression (ESR), or the Supervised Descent Method (SDM).
After the regions of the five sense organs of both are obtained, the feature loss between both is determined according to the difference between the regions of the five sense organs of both.
In this embodiment, the generator can better learn the facial features of the real face image through the facial feature difference between the real face image and the predicted face image, so that the facial features of the predicted face image approach the facial features of the real face image infinitely. Therefore, the trained model can change features only according to the age, namely, only the age features are changed, the learning of the age features by the model is not influenced, and the five sense organs reduction degree can be controlled.
In some embodiments, step S32 specifically includes:
calculating the characteristic loss between the real face image and the predicted face image by adopting the following formula:
Lres = ||G(x, S(Ys, T))*maskG − Ys*maskY||1;
wherein x is the training face image, T is the age group of the expected age, S(Ys, T) is the first code, G(x, S(Ys, T)) is the predicted face image, and Ys is the real face image; maskG is the label of a pixel point in the predicted face image: when a pixel point in the predicted face image is located in a five sense organ region, the corresponding maskG is 1, otherwise it is 0; maskY is the label of a pixel point in the real face image: when a pixel point in the real face image is located in a five sense organ region, the corresponding maskY is 1, otherwise it is 0.
Setting the label maskG for the pixel points in the predicted face image and the label maskY for the pixel points in the real face image makes it possible to distinguish five sense organ regions from non-five-sense-organ regions effectively, which simplifies the calculation.
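A minimal sketch of this masked feature loss is shown below, assuming the binary masks have already been produced by a face key point algorithm; the helper name and the use of a mean absolute error are illustrative assumptions.

# Sketch of the feature (five sense organ) loss Lres: an L1 difference
# restricted to facial-feature regions by binary masks.
import torch.nn.functional as F

def feature_loss(pred_face, real_face, mask_g, mask_y):
    # pred_face, real_face: (N, C, H, W); mask_g, mask_y: (N, 1, H, W), values 0 or 1
    # mean absolute difference between the masked predicted and real faces
    return F.l1_loss(pred_face * mask_g, real_face * mask_y)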
In this embodiment, the loss function is:
L = σstyle*Lstyle + σAds*LAds + σres*Lres;
wherein Lstyle is the coding loss between the first code S(Ys, T) and the second code S(G(x, S(Ys, T)), T); LAds is the countermeasure loss E(log D(x)) + E(log(1 − D(G(x, S(Ys, T))))); Lres is the feature loss; σstyle, σAds and σres are the weights of the coding loss, the countermeasure loss and the feature loss respectively; x is the training face image, T is the age group of the expected age, E represents the expected value of the distribution function, D(x) represents the probability that the training face image is judged to be true or false, and D(G(x, S(Ys, T))) is the probability that the predicted face image is judged to be true or false.
It can be seen that the loss function is a weighted sum of the coding loss, the feature loss and the countermeasure loss. Under the effect of the coding loss, minimizing Lstyle through back propagation over multiple training rounds enables the coding network to learn the age characteristics of the real face image at the expected age. Under the effect of the feature loss, minimizing Lres through back propagation over multiple training rounds makes the predicted image generated by the generator in the generative countermeasure network similar in features (for example, the five sense organ features) to the corresponding real face image, so that the degree of restoration of the predicted image relative to the real face image, apart from the age features, can be controlled more finely. Under the effect of the countermeasure loss, the discriminator D is trained to distinguish real from predicted images while the generator is trained against it, so that over multiple training rounds the distribution of the predicted image generated by the generator becomes similar to that of the training face image. The generative countermeasure network can therefore accurately fuse the training face image with the first code.
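The weighted combination described above might be assembled as in the following sketch. The L1 form of the coding loss, the binary cross-entropy adversarial term and the default weights are assumptions about the exact formulation, not values taken from this application.

# Sketch of the weighted total loss: coding loss + feature loss + adversarial
# loss. d_real = D(x) and d_fake = D(G(x, S(Ys, T))) are discriminator scores.
import torch
import torch.nn.functional as F

def total_loss(first_code, second_code, pred_face, real_face, mask_g, mask_y,
               d_real, d_fake, w_style=1.0, w_res=1.0, w_ads=1.0):
    # coding loss Lstyle between the first code and the second code (assumed L1)
    l_style = F.l1_loss(second_code, first_code)
    # feature loss Lres over the five sense organ regions (see feature_loss above)
    l_res = F.l1_loss(pred_face * mask_g, real_face * mask_y)
    # adversarial loss LAds (assumed standard binary cross-entropy GAN loss)
    l_ads = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
            F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return w_style * l_style + w_res * l_res + w_ads * l_ads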
In some embodiments, the preprocessing the real face image specifically includes:
acquiring center coordinates of left and right eyeballs in a real face image by adopting a face key point algorithm;
calculating an included angle between a central coordinate connecting line of a left eyeball and a right eyeball in a real face image and the horizontal direction;
rotating the real face image according to the included angle by taking the central coordinates of the left eye eyeball and the right eye eyeball in the real face image as a base point;
and intercepting the face area in the rotated real face image, and adjusting the size of the face area to a preset resolution ratio to obtain the preprocessed real face image.
And determining a plurality of key points of the face in the real face image according to a face key point algorithm, wherein the key points comprise points in the areas such as eyebrows, eyes, nose, mouth, face contour and the like. Therefore, from the key points, the central coordinates of the left eyeball and the right eyeball in the real face image can be obtained, and the included angle theta between the central coordinate connecting line of the left eyeball and the right eyeball and the horizontal direction is calculated.
It can be understood that the included angle θ is an angle of the human face deviating from the frontal face, and in order to adjust the human face in the real human face image to the frontal face, the central coordinates of the left and right eyeballs in the real human face image are used as a base point, and the real human face image is rotated according to the included angle θ, so as to obtain the frontal face.
Specifically, the rotated real face image may be calculated by the following formula:
x' = (x − x0)*cos θ + (y − y0)*sin θ + x0;
y' = −(x − x0)*sin θ + (y − y0)*cos θ + y0;
wherein (x, y) are the two-dimensional coordinates of a pixel point in the real face image before rotation, (x', y') are the two-dimensional coordinates of the pixel point in the real face image after rotation, θ is the included angle, and (x0, y0) is the base point determined by the center coordinates of the left and right eyeballs.
And intercepting a face area in the rotated real face image based on the fact that the face in the rotated real face image is a front face, and adjusting the size of the face area to a preset resolution ratio to obtain a preprocessed real face image. Therefore, the size of the preprocessed real face image is the preset resolution, and the preprocessed real face image only includes a front face and does not include other background pixels.
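An illustrative sketch of this alignment step using OpenCV follows; the eye centers are assumed to be provided by a face key point algorithm, and the face-region cropping step is simplified here to a plain resize.

# Sketch of the alignment preprocessing: rotate about the eye-center base
# point so the eye line is horizontal, then resize to the preset resolution.
import cv2
import numpy as np

def align_face(image, left_eye, right_eye, out_size=256):
    # included angle between the eye-center line and the horizontal direction
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))
    # rotate about the midpoint of the two eye centers
    center = ((left_eye[0] + right_eye[0]) / 2.0, (left_eye[1] + right_eye[1]) / 2.0)
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    rotated = cv2.warpAffine(image, rot, (image.shape[1], image.shape[0]))
    # a face detector would normally crop the face region here; this sketch
    # simply resizes the rotated image to the preset resolution
    return cv2.resize(rotated, (out_size, out_size))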
The training face image is processed by adopting the same preprocessing mode, which specifically comprises the following steps:
acquiring central coordinates of left and right eyeballs in a training face image by adopting a face key point algorithm;
calculating an included angle between a central coordinate connecting line of a left eyeball and a right eyeball in a training face image and the horizontal direction;
rotating the training face image according to the included angle by taking the central coordinates of the left eye eyeball and the right eye eyeball in the training face image as a base point;
and intercepting the face area in the rotated training face image, and adjusting the size of the face area to a preset resolution ratio to obtain the preprocessed training face image.
And determining a plurality of key points of the face in the training face image according to a face key point algorithm, wherein the key points comprise points in the areas such as eyebrows, eyes, nose, mouth, face contour and the like. Therefore, the central coordinates of the left eyeball and the right eyeball in the training face image can be obtained from the key points, and the included angle alpha between the central coordinate connecting line of the left eyeball and the right eyeball and the horizontal direction is calculated.
It can be understood that the included angle α is an angle at which the face deviates from the frontal face, and in order to adjust the face in the training face image to the frontal face, the center coordinates of the left and right eyeballs in the training face image are used as a base point, and the training face image is rotated according to the included angle α to obtain the frontal face.
Specifically, the rotated training face image may be calculated by the following formula:
w' = (w − w0)*cos α + (z − z0)*sin α + w0;
z' = −(w − w0)*sin α + (z − z0)*cos α + z0;
wherein (w, z) are the two-dimensional coordinates of a pixel point in the training face image before rotation, (w', z') are the two-dimensional coordinates of the pixel point in the training face image after rotation, α is the included angle, and (w0, z0) is the base point determined by the center coordinates of the left and right eyeballs.
And intercepting a face area in the rotated training face image based on the fact that the face in the rotated training face image is a front face, and adjusting the size of the face area to a preset resolution ratio to obtain the preprocessed training face image. Therefore, the size of the preprocessed training face image is the preset resolution, and the preprocessed training face image only comprises a front face and does not comprise other background pixels.
In this embodiment, the positions of the five sense organs of the preprocessed training face image and the preprocessed real face image are aligned by the preprocessing method, so that the model can better learn characteristics and can be helped to better converge.
After the image processing model is obtained through training of the training method of the image processing model, the image processing model can be used for image processing. Referring to fig. 6, fig. 6 is a schematic flowchart of an image processing method according to an embodiment of the present application, and as shown in fig. 6, the method includes the following steps:
s41: and acquiring a face image to be processed and an expected age.
The face image to be processed may be a certificate photograph of a person at an age, the expected age being different from that age.
In some embodiments, the face image to be processed may be preprocessed to obtain a preprocessed face image to be processed, the preprocessed face image to be processed is a face region image, the resolution of the preprocessed face image to be processed is a preset resolution, and a face in the preprocessed face image to be processed is a front face.
Preprocessing the face image to be processed before it is input into the model makes the preprocessed image structured (each image shows roughly the same face part in the same place; for example, the position of the eyes is approximately the same in every image). This prevents the background area of the face image to be processed from interfering with the image processing model and prevents cluttered facial feature positions from interfering with the image processing model.
Specifically, the face image to be processed is preprocessed to obtain a preprocessed face image to be processed, and the preprocessed face image to be processed is a face region image, namely, the preprocessed face image to be processed only includes a face region and does not include other background regions and the like. The resolution ratio of the preprocessed face image to be processed is a preset resolution ratio, and the preset resolution ratio is consistent with the preset resolution ratio of the preprocessed real face image and the preprocessed training face image. In addition, the human face in the pre-processed human face image to be processed is a front face. It should be noted that the preprocessing is the same as the preprocessing in the training process. It is to be understood that the face image to be processed in the following description may also be a pre-processed face image to be processed, and for convenience of description, the image to be processed is collectively referred to as a face image to be processed.
S42: inputting the face image to be processed and the expected age into an image processing model, and outputting an age change image, wherein the age of a person reflected by the age change image is adaptive to the expected age.
The image processing model in this step refers to the image processing model obtained by training according to the method embodiments of fig. 1 to 5; the image processing model is stored and then called during testing or application. As described above, the image processing model includes a coding network and a generative countermeasure network; the trained image processing model includes the N × M codes learned by the coding network and the trained generative countermeasure network. Each row vector in the N × M codes represents the age characteristics of a plurality of people in one age group.
Therefore, the trained image processing model is called and the expected age is input into it. The image processing model finds the code corresponding to the expected age (namely a certain row of the N × M codes); this code reflects the age characteristics at the expected age. The code and the face image to be processed are then input into the trained generative countermeasure network, which fuses them so that the face image to be processed is changed according to the age characteristics reflected by the code, generating an age change image. The age of the person reflected by the age change image is adapted to the expected age. Specifically, the method for fusing the code and the face image to be processed may refer to the method for fusing the first code and the training face image in the foregoing training method embodiment, and details are not repeated here.
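Purely as an illustration of this call sequence, and reusing the hypothetical helpers sketched earlier (align_face, code_for_expected_age) plus a generator that is assumed to accept the code as a second input:

# Hypothetical end-to-end inference sketch: preprocess the face image, look
# up the code for the expected age, and let the trained generator fuse them.
import torch

def predict_age_change(generator, image, left_eye, right_eye, expected_age):
    face = align_face(image, left_eye, right_eye)                 # preprocessing
    code = torch.from_numpy(code_for_expected_age(expected_age))  # code at expected age
    x = torch.from_numpy(face).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    with torch.no_grad():
        # generator is assumed to take (face, code) and fuse the code internally
        aged = generator(x, code.unsqueeze(0).float())
    return aged.squeeze(0)                                         # age change image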
Based on the above description, in the technical scheme corresponding to fig. 6, the trained image processing model and the codes corresponding to the multiple age groups are stored, when a test or an application is performed, the image processing model and the codes corresponding to the multiple age groups are called, the corresponding codes are determined according to the input expected age, and the codes corresponding to the expected age are fused with the face image to be processed, so that the face image to be processed is subjected to age feature change according to the codes corresponding to the expected age, an age change image is generated, and the age feature change in the age change image conforms to the individual characteristics, that is, the image processing model can predict an aging image or a young image more accurately.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer device provided in an embodiment of the present application, where the computer device 50 includes a processor 501 and a memory 502. The processor 501 is connected to the memory 502, for example, the processor 501 may be connected to the memory 502 through a bus.
The processor 501 is configured to support the computer device 50 to perform the respective functions in the methods of fig. 1-5 or the methods of fig. 6. The processor 501 may be a Central Processing Unit (CPU), a Network Processor (NP), a hardware chip, or any combination thereof. The hardware chip may be an Application Specific Integrated Circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a General Array Logic (GAL), or any combination thereof.
The memory 502 is used to store program codes and the like. The memory 502 may include Volatile Memory (VM), such as Random Access Memory (RAM); the memory 502 may also include a non-volatile memory (NVM), such as a read-only memory (ROM), a flash memory (flash memory), a Hard Disk Drive (HDD) or a solid-state drive (SSD); the memory 502 may also comprise a combination of memories of the kind described above.
In some possible cases, processor 501 may call the program code to perform the following:
acquiring a real face image, a training face image and an expected age corresponding to the training face image, wherein the training face image and the real face image reflect the face of the same person, the real face image is labeled with an age group, the expected age is located in the age group labeled by the real face image, and the expected age is different from the age corresponding to the training face image;
performing feature coding on the real face image by adopting a coding network to obtain a first code, wherein the first code reflects the face features of the real face image at an expected age;
performing feature fusion on the first code and the training face image by adopting a generating type countermeasure network to obtain a predicted face image, wherein the predicted face image is an image generated by fusing the training face image with the features of the first code;
and performing iterative training on the image processing model by using a loss function, and returning to the steps of acquiring a real face image, a training face image and an expected age corresponding to the training face image until the image processing model is converged, wherein the loss function is used for representing the coding loss between a first code and a second code, the characteristic loss between the real face image and a predicted face image and the countermeasure loss, the second code is obtained by performing characteristic coding on the predicted face image by using a coding network, and the countermeasure loss is calculated by using a generative countermeasure network.
In other possible cases, processor 501 may call program code to perform the following:
and acquiring a face image to be processed and an expected age.
Inputting the face image to be processed and the expected age into an image processing model, and outputting an age change image, wherein the age of a person reflected by the age change image is adaptive to the expected age.
Embodiments of the present application also provide a computer-readable storage medium, which stores a computer program, where the computer program includes program instructions, and the program instructions, when executed by a computer, cause the computer to execute the method according to the foregoing embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly can also be implemented by hardware. It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware related to instructions of a computer program, which can be stored in a computer readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; within the context of the present application, where technical features in the above embodiments or in different embodiments can also be combined, the steps can be implemented in any order and there are many other variations of the different aspects of the present application as described above, which are not provided in detail for the sake of brevity; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (11)

1. A method for training an image processing model, wherein the image processing model comprises a coding network and a generative confrontation network, the method comprising:
acquiring a real face image, a training face image and an expected age corresponding to the training face image, wherein the training face image and the real face image reflect the face of the same person, the real face image is labeled with an age group, the expected age is located in the age group labeled by the real face image, and the expected age is different from the age corresponding to the training face image;
performing feature coding on the real face image by adopting the coding network to obtain a first code, wherein the first code reflects the face features of the real face image at the expected age;
performing feature fusion on the first code and the training face image by adopting a generating type countermeasure network to obtain a predicted face image, wherein the predicted face image is an image generated by fusing the feature of the first code with the training face image;
and performing iterative training on the image processing model by using a loss function, and returning to the steps of acquiring a real face image, a training face image and an expected age corresponding to the training face image until the image processing model converges, wherein the loss function is used for representing the coding loss between the first coding and the second coding, the characteristic loss and the countermeasure loss between the real face image and the prediction face image, the second coding is the coding obtained by performing characteristic coding on the prediction face image by using the coding network, and the countermeasure loss is the loss calculated by the generative countermeasure network.
2. The method of claim 1, wherein the generative countermeasure network comprises a generator comprising a plurality of downsampling layers, a plurality of depth layers, and a plurality of upsampling layers arranged in sequence;
the plurality of down-sampling layers are respectively used for outputting intermediate feature maps with reduced resolution layer by layer, the plurality of depth layers are respectively used for outputting intermediate feature maps with consistent resolution, and the plurality of up-sampling layers are respectively used for outputting intermediate feature maps with increased resolution layer by layer;
the feature fusion is performed on the first code and the training face image by using a generator in a generating countermeasure network to obtain a predicted face image, and the method comprises the following steps:
and fusing the first codes with the intermediate characteristic graphs input into the plurality of upsampling layers respectively.
3. The method of claim 2, wherein one of the upsampled layers comprises an inverse convolutional layer and a fusion layer;
the step of fusing the first codes with the intermediate feature maps input to the plurality of upsampling layers respectively includes:
acquiring the resolution of a target intermediate feature map used for inputting a target layer, wherein the target layer is a fusion layer in any one of the up-sampling layers;
performing linear transformation on the first code according to the resolution of the target intermediate characteristic diagram to obtain a parameter matrix;
carrying out normalization processing on the target intermediate characteristic diagram to obtain a normalized target intermediate characteristic diagram;
and performing linear transformation on the target intermediate characteristic diagram after the normalization processing and the parameter matrix to obtain an intermediate characteristic diagram which is output by the target layer and is fused with the first code.
4. The method of claim 3,
the linear transformation is performed on the target intermediate feature after the normalization processing and the parameter matrix to obtain the intermediate feature which is output by the target layer and fused with the first code, and the method comprises the following steps:
acquiring a variable matrix and an offset matrix according to the parameter matrix;
calculating the intermediate feature fused with the first code of the target layer output by adopting the following formula:
Y=(1+D1)*y+D2;
wherein y is the target intermediate feature after the normalization processing, D1 is the variable matrix, and D2 is the offset matrix.
5. The method according to any of claims 1-4, wherein the loss function is:
L = σstyle*Lstyle + σAds*LAds + σres*Lres;
wherein Lstyle is the coding loss, LAds is the countermeasure loss, Lres is the feature loss, σstyle is the weight of the coding loss, σAds is the weight of the countermeasure loss, σres is the weight of the feature loss, x is the training face image, T is the age group of the expected age, S(Ys, T) is the first code, S(G(x, S(Ys, T)), T) is the second code, E represents the expected value of the distribution function, D(x) represents the probability that the training face image is judged to be true or false, D(G(x, S(Ys, T))) is the probability that the predicted face image is judged to be true or false, G(x, S(Ys, T)) is the predicted face image, and Ys is the real face image; maskG is the label of a pixel point in the predicted face image: when a pixel point in the predicted face image is located in a five sense organ region, the corresponding maskG is 1, otherwise it is 0; maskY is the label of a pixel point in the real face image: when a pixel point in the real face image is located in a five sense organ region, the corresponding maskY is 1, otherwise it is 0.
6. The method of claim 1, further comprising, prior to the step of iteratively training the image processing model using a loss function:
respectively acquiring a facial feature region of the real facial image and a facial feature region of the predicted facial image by adopting a facial key point algorithm;
and determining the characteristic loss between the real face image and the predicted face image according to the difference between the facial features of the real face image and the facial features of the predicted face image.
7. The method according to claim 6, wherein the step of determining a feature loss between the real face image and the predicted face image according to a difference between a facial region of the real face image and a facial region of the predicted face image comprises:
calculating a feature loss between the real face image and the predicted face image using the following formula:
Lres = ||G(x, S(Ys, T))*maskG − Ys*maskY||1;
wherein x is the training face image, T is the age group of the expected age, S(Ys, T) is the first code, G(x, S(Ys, T)) is the predicted face image, and Ys is the real face image; maskG is the label of a pixel point in the predicted face image: when a pixel point in the predicted face image is located in a five sense organ region, the corresponding maskG is 1, otherwise it is 0; maskY is the label of a pixel point in the real face image: when a pixel point in the real face image is located in a five sense organ region, the corresponding maskY is 1, otherwise it is 0.
8. The method according to claim 1, wherein before the step of performing feature coding on the real face image by using the coding network to obtain a first code, the method further comprises:
the real face image and the training face image are respectively preprocessed, so that the real face image after preprocessing and the training face image after preprocessing have preset resolutions, the real face image after preprocessing and the training face image after preprocessing are face region images, and the face is a front face.
9. An image processing method, comprising:
acquiring a face image to be processed and an expected age;
inputting the face image to be processed and the expected age into an image processing model obtained by training according to the method of any one of claims 1 to 8, and outputting an age change image, wherein the age change image reflects the age of a person corresponding to the expected age.
10. A computer device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-9.
11. A computer-readable storage medium having stored thereon computer-executable instructions for causing a computer device to perform the method of any one of claims 1-9.
CN202110996242.0A 2021-08-27 2021-08-27 Training method of image processing model, image processing model and computer equipment Active CN113762117B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110996242.0A CN113762117B (en) 2021-08-27 2021-08-27 Training method of image processing model, image processing model and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110996242.0A CN113762117B (en) 2021-08-27 2021-08-27 Training method of image processing model, image processing model and computer equipment

Publications (2)

Publication Number Publication Date
CN113762117A true CN113762117A (en) 2021-12-07
CN113762117B CN113762117B (en) 2024-04-12

Family

ID=78791510

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110996242.0A Active CN113762117B (en) 2021-08-27 2021-08-27 Training method of image processing model, image processing model and computer equipment

Country Status (1)

Country Link
CN (1) CN113762117B (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107977629A (en) * 2017-12-04 2018-05-01 电子科技大学 A kind of facial image aging synthetic method of feature based separation confrontation network
CN109523463A (en) * 2018-11-20 2019-03-26 中山大学 A kind of face aging method generating confrontation network based on condition
WO2020029356A1 (en) * 2018-08-08 2020-02-13 杰创智能科技股份有限公司 Method employing generative adversarial network for predicting face change
CN111898482A (en) * 2020-07-14 2020-11-06 贵州大学 Face prediction method based on progressive generation confrontation network
KR102266165B1 (en) * 2021-03-26 2021-06-17 인하대학교 산학협력단 Method and Apparatus for Editing of Personalized Face Age via Self-Guidance in Generative Adversarial Networks
CN113221645A (en) * 2021-04-07 2021-08-06 深圳数联天下智能科技有限公司 Target model training method, face image generation method and related device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
吴柳玮;孙锐;阚俊松;高隽;: "双重对偶生成对抗网络的跨年龄素描―照片转换", 中国图象图形学报, no. 04, 15 April 2020 (2020-04-15) *
黄菲;高飞;朱静洁;戴玲娜;俞俊;: "基于生成对抗网络的异质人脸图像合成:进展与挑战", 南京信息工程大学学报(自然科学版), no. 06, 28 November 2019 (2019-11-28) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115063813A (en) * 2022-07-05 2022-09-16 深圳大学 Training method and training device of alignment model aiming at character distortion
CN117854156A (en) * 2024-03-07 2024-04-09 腾讯科技(深圳)有限公司 Training method and related device for feature extraction model
CN117854156B (en) * 2024-03-07 2024-05-07 腾讯科技(深圳)有限公司 Training method and related device for feature extraction model

Also Published As

Publication number Publication date
CN113762117B (en) 2024-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant