CN115631524A - Method for training age detection model, age detection method and related device

Info

Publication number: CN115631524A
Application number: CN202211305785.4A (priority application)
Authority: CN (China)
Original language: Chinese (zh)
Inventor: 陈仿雄
Applicant and current assignee: Shenzhen Shuliantianxia Intelligent Technology Co Ltd
Prior art keywords: age, loss, face image, target face
Legal status: Pending

Classifications

    • G06V 40/178: Human faces; estimating age from face image or using age information for improving recognition
    • G06V 40/168: Human faces; feature extraction, face representation
    • G06V 10/761: Proximity, similarity or dissimilarity measures in feature spaces
    • G06V 10/774: Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/82: Image or video recognition or understanding using neural networks
    • G06N 3/08: Computing arrangements based on neural networks; learning methods

Abstract

The embodiment of the application relates to the technical field of age detection, and discloses a method for training an age detection model, an age detection method, an electronic device, and a storage medium. The training method constructs, for each of M face images, a comparison data set containing a positive sample image and N-1 negative sample images, and iteratively trains a neural network on the M face images and the M comparison data sets with a loss function that reflects the similarity loss between the target face image and the positive sample image and the contrast loss between the target face image and the N-1 negative sample images. The contrast loss and the similarity loss constrain the neural network to converge in the direction of reducing the face feature differences within the same age and enlarging the face feature differences between different ages. The age detection model obtained after convergence can therefore better identify representative age features and has higher age detection accuracy.

Description

Method for training age detection model, age detection method and related device
Technical Field
The embodiment of the application relates to the technical field of human face image attribute prediction, in particular to a method for training an age detection model, an age detection method and a related device.
Background
A face image contains many kinds of facial feature information, such as face shape, skin state, expression, facial features, and face age. Among these, face age is an important piece of feature information that is widely used in the field of face image detection. For example, some clients running on mobile devices have an age detection function: the client obtains a face image and, based on it, outputs the detected age as feedback to the user.
For these clients with a face age recognition function, the accuracy of age recognition, i.e. the gap between the recognized age and the real age of the user, is an important concern. At present, in related face age recognition technology, the real age of a face is usually used as a single piece of label information: the real age serves as the label of a face image, a one-to-one correspondence is established between the face image and the real age, and a face age recognition model is then trained on these pairs. Because the identity of each user is unique, the face features of different faces at the same age differ. During training, every input face image is effectively a new class of image for the model, so the model can only memorize the face features of the training data and is easily affected by noise in the training data. When such a model is used for recognition, a newly input face image is again a new image whose face features have not been learned, so the model struggles to adapt to the unique features of the new image, resulting in low accuracy in practical applications.
Disclosure of Invention
In view of this, embodiments of the present application provide a method for training an age detection model, an age detection method, and a related apparatus, where the age detection model trained by the method can accurately detect an age of a face in a face image.
In a first aspect, an embodiment of the present application provides a method for training an age detection model, including:
acquiring M face images, wherein the real age is marked on each face image, and the real ages of the M face images cover 1-N years;
constructing a comparison data set for any one target face image in the M face images, wherein the comparison data set comprises the target face image, a positive sample image and N-1 negative sample images, the real age of the positive sample image is the same as that of the target face image, and the real ages of the N-1 negative sample images are different from that of the target face image;
and performing iterative training on a neural network by adopting the M face images and the M comparison data sets in combination with a loss function to obtain an age detection model, wherein the loss function reflects the similarity loss between the target face image and the positive sample image and the contrast loss between the target face image and the N-1 negative sample images.
In some embodiments, the neural network comprises a comparison network and an age detection network, the age detection network comprises a first feature extraction module and an age predictor, the comparison network comprises a second feature extraction module and a full connection layer, and the first feature extraction module and the second feature extraction module have the same structure and share parameters;
the aforesaid adopts M individual face image and M contrast data set, combines the loss function, carries out iterative training to neural network, obtains age detection model, includes:
inputting the target face image into an age detection network to obtain a predicted age;
inputting a contrast data set corresponding to the target face image into a contrast network to perform feature extraction and spatial mapping conversion to obtain a target face feature vector, a positive sample feature vector and N-1 negative sample feature vectors;
calculating the corresponding loss of the target face image by adopting a loss function, wherein the loss comprises age loss between the predicted age and the real age, similarity loss between the target face image and the positive sample image, and contrast loss between the target face image and the N-1 negative sample images;
and performing iterative training on the neural network according to the loss sum corresponding to the M face images until convergence, and taking the converged age detection network as an age detection model.
In some embodiments, the first feature extraction module includes a plurality of convolutional layers, each configured with an activation function layer and a normalization layer, the plurality of convolutional layers configured with convolutional kernels of the same size and with the same step size.
In some embodiments, the age predictor includes a multi-layer perceptron module.
In some embodiments, the loss function includes an age loss function for calculating the age loss between the predicted age and the real age, a similarity loss function for calculating the similarity loss between the target face image and the positive sample image, and a contrast loss function for calculating the contrast loss between the target face image and the N-1 negative sample images.
In some embodiments, the similarity loss function includes a similarity matrix reflecting the degree of similarity of age features of faces of similar ages.
In some embodiments, the loss function is

Loss = Loss_{L1} + L_{con} + Loss_{dis}

wherein

Loss_{L1} = |Y - T|_{L1}

L_{con} = -log( exp(V_K · V_{K+} / τ) / Σ_{i=1}^{N-1} exp(V_K · V_i^{-} / τ) )

Loss_{dis} = 1 - S(V_K, V_{K+})

wherein Loss is the loss function, Loss_{L1} is the age loss function, L_{con} is the contrast loss function, Loss_{dis} is the similarity loss function, Y is the predicted age, T is the real age, V_K is the target face feature vector, V_{K+} is the positive sample feature vector, i is the index, V_i^{-} is the i-th negative sample feature vector, τ is the temperature coefficient, and S is the similarity matrix (here evaluated between the target and positive sample feature vectors).
In a second aspect, an embodiment of the present application provides an age detection method, including:
acquiring a human face image to be detected;
inputting the face image to be detected into an age detection model for age detection to obtain the age corresponding to the face image to be detected, wherein the age detection model is obtained by training by adopting the method for training the age detection model in the first aspect.
In a third aspect, an embodiment of the present application provides an electronic device, including:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as in the first aspect above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing computer-executable instructions for causing a computer device to perform the method of the first aspect.
The beneficial effects of the embodiments of the application are as follows. Different from the prior art, the method for training an age detection model provided in the embodiment of the present application first obtains M face images, each labeled with a real age, the real ages of the M face images covering 1 to N years. For any one target face image in the M face images, a comparison data set is constructed comprising the target face image, a positive sample image and N-1 negative sample images; the real age of the positive sample image is the same as that of the target face image, and the real ages of the N-1 negative sample images differ from it. Finally, the neural network is iteratively trained on the M face images and the M comparison data sets in combination with a loss function to obtain an age detection model, where the loss function reflects the similarity loss between the target face image and the positive sample image and the contrast loss between the target face image and the N-1 negative sample images.
In this embodiment, a comparison data set having a positive sample image and negative sample images is constructed for each of the M face images, and the neural network is iteratively trained on the M face images and M comparison data sets with a loss function that reflects both the similarity loss and the contrast loss. The network thereby learns to reduce the face feature differences within the same age and to enlarge the face feature differences between different ages, and so learns more representative age feature vectors. Constrained by the contrast loss and the similarity loss, the neural network converges in the direction of reducing same-age face feature differences and enlarging different-age face feature differences. The age detection model obtained after convergence can therefore better identify representative age features and has higher age detection accuracy.
Drawings
One or more embodiments are illustrated by way of example in the accompanying drawings, in which like reference numerals refer to similar elements; the figures are not to scale unless otherwise specified.
FIG. 1 is a schematic diagram of an application scenario of an age detection system in some embodiments of the present application;
FIG. 2 is a schematic diagram of an electronic device according to some embodiments of the present application;
FIG. 3 is a schematic flow chart of a method of training an age detection model according to some embodiments of the present application;
FIG. 4 is a schematic diagram of a neural network used for training in some embodiments of the present application;
fig. 5 is a schematic flow chart of an age detection method according to some embodiments of the present application.
Detailed Description
The present application will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the present application, but are not intended to limit the present application in any way. It should be noted that numerous variations and modifications could be made by those skilled in the art without departing from the spirit of the application, all of which fall within the scope of protection of the present application.
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the application and do not limit it.
It should be noted that, where they do not conflict, the various features of the embodiments of the present application may be combined with each other within the scope of protection of the present application. Additionally, although functional blocks are divided in the apparatus schematics and logical sequences are shown in the flowcharts, in some cases the steps shown or described may be performed in an order different from the block division or the flowchart sequence. Further, the terms "first," "second," "third," and the like used herein do not limit data or execution order, but merely distinguish identical or similar items having substantially the same functions and actions.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the present application. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
In addition, the technical features mentioned in the embodiments of the present application described below may be combined with each other as long as they do not conflict with each other.
To facilitate understanding of the methods provided by the embodiments of the present application, first, terms referred to in the embodiments of the present application will be described:
(1) Neural network
A neural network is composed of neural units and can be understood as a network with an input layer, hidden layers and an output layer: generally the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. A neural network with many hidden layers is called a deep neural network (DNN). The operation of each layer in a neural network can be described by the mathematical expression y = a(W · x + b). At a physical level, the operation of each layer can be understood as a transformation from an input space (a set of input vectors) to an output space (i.e. from the row space to the column space of a matrix) through five operations: 1. raising/lowering dimensions; 2. zooming in/out; 3. rotation; 4. translation; 5. "bending". Operations 2 and 3 are performed by "W · x", operation 4 by "+ b", and operation 5 by "a()". The word "space" is used here because the object being classified is not a single thing but a class of things, and the space is the set of all individuals of that class. W is the weight matrix of a layer of the neural network, and each value in the matrix represents the weight of one neuron of that layer. The matrix W determines the spatial transformation from the input space to the output space described above, i.e. the W of each layer controls how the space is transformed. The purpose of training a neural network is ultimately to obtain the weight matrices of all layers of the trained network. Therefore, the training process of a neural network is essentially learning how to control the spatial transformations, and more specifically, learning the weight matrices.
It should be noted that the model adopted for the machine learning task in the embodiments of the present application is essentially a neural network. Common components of a neural network include convolutional layers, pooling layers, normalization layers, deconvolution (transposed convolution) layers, and the like. A model is designed by assembling these common components; when the model parameters (the weight matrices of each layer) have been determined such that the model error meets a preset condition, or the number of parameter adjustments reaches a preset threshold, the model converges.
The convolution layer is configured with a plurality of convolution kernels, and each convolution kernel is provided with a corresponding step length so as to carry out convolution operation on the image. The purpose of convolution is to extract different features of the input image, the first layer of convolution layer may only extract some low-level features such as edges, lines and corners, and the deeper convolution layer can iteratively extract more complex features from the low-level features.
The deconvolution layer is used to map a low-dimensional space to a high-dimensional space while maintaining the connection relationship/pattern between them (the connection relationship refers to the connection relationship during convolution). The deconvolution layer is configured with a plurality of convolution kernels, each with a corresponding step size, to perform deconvolution operations on the image. In general, frameworks for designing neural networks (e.g., the PyTorch library) provide a built-in upsampling function, and a low-dimensional to high-dimensional spatial mapping can be realized by calling it.
Pooling is a process that mimics the human visual system: it reduces the size of the data or represents an image with higher-level features. Common pooling operations include max pooling, mean pooling, random pooling, median pooling, combined pooling, and the like. Generally, pooling layers are periodically inserted between the convolutional layers of a neural network to achieve dimensionality reduction.
The normalization layer is used to normalize all neurons in the middle layer to prevent gradient explosion and gradient disappearance.
(2) Loss function
In the process of training a neural network, because the output of the network is expected to be as close as possible to the value that is really to be predicted, the weight matrix of each layer can be updated according to the difference between the current predicted value and the truly desired target value (an initialization is usually performed before the first update, i.e. parameters are pre-configured for each layer of the network). For example, if the network's predicted value is too high, the weights are adjusted to make it lower, and the adjustment continues until the network can predict the truly desired target value. It is therefore necessary to define in advance how to compare the difference between the predicted value and the target value; this is the role of the loss function (or objective function), an important equation for measuring that difference. Taking the loss function as an example: the higher its output value (the loss), the larger the difference, so training the neural network becomes a process of reducing this loss as much as possible.
Before the embodiments of the present application are described, a method for training an age detection model, which is known to the inventors of the present application, is briefly described, so that the embodiments of the present application are easy to understand in the following.
In some methods for training an age detection model, training samples are obtained, each being a face image labeled with a corresponding age label; the age label distribution corresponding to the age label of each sample is determined; and the age estimation model is trained on the samples and the age label distributions until the total loss function of the model converges, yielding an age detection model. For a given age value, the distance to the two adjacent endpoint ages is converted into probability values. The total loss function includes a distribution probability loss function.
In this scheme, the final prediction result is still a set of probability values over all ages to be predicted, from which the highest is selected. Because ages follow an approximately normal distribution, with more samples in the middle age groups and fewer at the two ends, the model's probability mass can become too concentrated during training: the middle age groups dominate the predictions while the age groups at the two ends are neglected, which harms the overall applicability of the model.
In view of the above problems, the present application provides a method for training an age detection model, an age detection method, an electronic device, and a storage medium. The training method constructs comparison data sets having a positive sample image and negative sample images, and iteratively trains a neural network on M face images and M comparison data sets with a loss function that reflects the similarity loss between the target face image and the positive sample image and the contrast loss between the target face image and the N-1 negative sample images. The neural network thus learns to reduce the face feature differences within the same age, to enlarge the face feature differences between different ages, and so to learn more representative age feature vectors. Constrained by the contrast loss and the similarity loss, the network converges in the direction of reducing same-age face feature differences and enlarging different-age face feature differences. The age detection model obtained after convergence can therefore better identify representative age features and has higher age detection accuracy.
An exemplary application of the electronic device for training the age detection model or for age detection provided in the embodiment of the present application is described below, and it is understood that the electronic device may train the age detection model, and may also perform age detection on a face image by using the age detection model.
The electronic device provided by the embodiment of the application may be a server, for example a server deployed in the cloud. When the server is used for training the age detection model, it iteratively trains the neural network on training data, where both the training data and the neural network are provided by other devices or by those skilled in the art, and determines the final model parameters; the neural network configured with these final parameters is the age detection model. When the server is used for age detection, it invokes the built-in age detection model and performs the corresponding computation on a face image to be detected provided by another device or by the user to obtain the corresponding age.
The electronic device provided by some embodiments of the present application may be various types of terminals such as a notebook computer, a desktop computer, or a mobile device. When the terminal is used for training the age detection model, a person skilled in the art inputs prepared training data into the terminal, designs a neural network on the terminal, and iteratively trains the neural network by using the training data through the terminal to determine final model parameters, so that the neural network configures the final model parameters, and the age detection model can be obtained. When the terminal is used for age detection, a built-in age detection model is called, corresponding calculation processing is carried out on a face image to be detected input by a user, and the age corresponding to the face image to be detected is obtained.
By way of example, referring to fig. 1, fig. 1 is a schematic view of an application scenario of the age detection system provided in the embodiment of the present application, and the terminal 10 is connected to the server 20 through a network, where the network may be a wide area network or a local area network, or a combination of the two.
The terminal 10 may be used to acquire training data and build neural networks, for example, by those skilled in the art downloading prepared training data at the terminal and building a network structure for the neural network. It is understood that the terminal 10 may also be used to obtain a face image, for example, a user inputs the face image through an input interface, and the terminal automatically obtains the face image after the input is completed; for example, the terminal 10 is provided with a camera through which a face image is captured.
In some embodiments, the terminal 10 locally executes the method for training an age detection model provided in this embodiment: it trains the designed neural network with the training data and determines the final model parameters, so that the neural network configured with those parameters constitutes the age detection model. In some embodiments, the terminal 10 may instead send the training data stored on it and the constructed neural network to the server 20 through the network; the server 20 receives the training data and the neural network, trains the designed neural network with the training data, determines the final model parameters, and then sends them back to the terminal 10, which stores them, so that the neural network configured with the final model parameters constitutes the age detection model.
In some embodiments, the terminal 10 locally executes the age detection method provided in this embodiment to provide an age detection service for the user: it invokes the built-in age detection model and performs the corresponding computation on the face image to be detected to obtain the corresponding age. In some embodiments, the terminal 10 may instead send the face image to be detected, input by the user on the terminal, to the server 20 through the network; the server 20 receives the image, invokes its built-in age detection model to compute the corresponding age, and sends the age back to the terminal 10, which displays it on its own display interface to inform the user.
The structure of the electronic device in the embodiment of the present application is described below. Fig. 2 is a schematic structural diagram of the electronic device 500 in the embodiment of the present application. The electronic device 500 includes at least one processor 510, a memory 550, at least one network interface 520, and a user interface 530. The various components in the electronic device 500 are coupled together by a bus system 540, which enables communication among these components. In addition to a data bus, the bus system 540 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus system 540 in fig. 2.
The processor 510 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), another programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, where the general-purpose processor may be a microprocessor or any conventional processor.
The user interface 530 includes one or more output devices 531 enabling presentation of media content, including one or more speakers and/or one or more visual display screens. The user interface 530 also includes one or more input devices 532, including user interface components to facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
The memory 550 may include volatile memory, nonvolatile memory, or both. The nonvolatile memory may be a read-only memory (ROM), and the volatile memory may be a random access memory (RAM). The memory 550 described in the embodiments herein is intended to comprise any suitable type of memory. Memory 550 optionally includes one or more storage devices physically located remote from processor 510.
In some embodiments, memory 550 can store data to support various operations, examples of which include programs, modules, and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 551 including system programs for processing various basic system services and performing hardware-related tasks, such as a framework layer, a core library layer, a driver layer, etc., for implementing various basic services and processing hardware-based tasks;
a network communication module 552 for communicating with other computing devices via one or more (wired or wireless) network interfaces 520, exemplary network interfaces 520 including Bluetooth, wireless Fidelity (WiFi), and Universal Serial Bus (USB), among others;
a display module 553 for enabling the presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 531 (e.g., a display screen, speakers, etc.) associated with the user interface 530;
an input processing module 554 for detecting one or more user inputs or interactions from one of the one or more input devices 532 and translating the detected inputs or interactions.
As can be understood from the foregoing, the method for training an age detection model and the age detection method provided in the embodiments of the present application may be implemented by various types of electronic devices with computing processing capabilities, such as a smart terminal and a server.
The method for training the age detection model provided in the embodiment of the present application is described below with reference to an exemplary application and implementation of the server provided in the embodiment of the present application. Referring to fig. 3, fig. 3 is a schematic flowchart of a method for training an age detection model according to an embodiment of the present application.
Referring to fig. 3 again, the method S100 may specifically include the following steps:
s10: acquiring M face images, wherein the real age is marked on each face image, and the real ages of the M face images cover 1-N years.
It is understood that each face image includes a face and is labeled with a real age. In some embodiments, N may be 100, i.e. the real ages cover an age range of 1-100 years.
The real age of each face image may be labeled by one-hot encoding. For example, if the age range to be detected is 1 to 100, a 100-dimensional one-hot vector is used as the real-age label of a face image. It will be appreciated that one-hot labeling is common practice for those skilled in the art and will not be described in detail here.
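A minimal sketch of this labeling, assuming NumPy and a 1-100 age range; the function name one_hot_age is illustrative and not from the patent:

```python
import numpy as np

def one_hot_age(age: int, n_ages: int = 100) -> np.ndarray:
    """Encode a real age in [1, n_ages] as an n_ages-dimensional one-hot vector."""
    label = np.zeros(n_ages, dtype=np.float32)
    label[age - 1] = 1.0  # index 0 corresponds to age 1
    return label

# e.g. a face image labeled with real age 25
label_25 = one_hot_age(25)
assert label_25.sum() == 1.0 and label_25[24] == 1.0
```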
In some embodiments, the terminal or the server may normalize the face images in the training set, which helps improve the convergence speed and accuracy of subsequent model training. Specifically, in some embodiments, the face image may be resized to 320 × 320 and its pixel values scaled from the range 0-255 to the range 0-1.
Considering the problem of data imbalance between different age groups when face images are collected in practice, in some embodiments data augmentation such as rotation, flipping, cropping, scaling, and translation may be applied without changing illumination, so that the data of all age groups are balanced and the robustness of the model is improved.
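A minimal preprocessing and augmentation sketch consistent with the two steps above, assuming the torchvision library is used; the specific augmentation parameters are illustrative assumptions:

```python
import torchvision.transforms as T

# Resize to 320x320 and scale pixels to [0, 1] (ToTensor divides by 255),
# with geometry-only augmentation that leaves illumination unchanged.
train_transform = T.Compose([
    T.Resize((320, 320)),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=10),
    T.RandomResizedCrop(320, scale=(0.9, 1.0)),
    T.ToTensor(),  # uint8 [0, 255] -> float [0.0, 1.0]
])
```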
In some embodiments, M is on the order of tens of thousands, for example 20000, which is beneficial for training an accurate general model. The number of face images can be determined by those skilled in the art according to the actual situation.
S20: and constructing a contrast data set aiming at any one target face image in the M personal face images.
The comparison data set comprises a target face image, a positive sample image and N-1 negative sample images, the real age of the positive sample image is the same as that of the target face image, and the real age of the N-1 negative sample images is different from that of the target face image.
It is to be understood that the target face image is any one of the M face images, and a comparison data set is constructed for each face image in the same manner. For example, if a target face image V has a real age of x years, a face image of age x is randomly selected from the M face images as the positive sample image V+, and N-1 face images whose real ages take the N-1 age values different from x are randomly selected as the negative sample images V- = {V_1^-, V_2^-, ..., V_{N-1}^-}. This yields the comparison data set H = {V, V+, V-}.
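A minimal sketch of this construction, assuming the training set is held as a list of (image, age) pairs; the helper name build_comparison_set is illustrative:

```python
import random

def build_comparison_set(dataset, target_idx, n_ages=100):
    """Build H = {V, V+, V-} for the target face image at target_idx.

    dataset: list of (image, real_age) pairs whose ages cover 1..n_ages.
    Returns (target, positive, negatives) with one negative per other age.
    """
    target_img, target_age = dataset[target_idx]

    # positive sample: a different image with the same real age
    positives = [img for i, (img, age) in enumerate(dataset)
                 if age == target_age and i != target_idx]
    positive = random.choice(positives)

    # negative samples: one randomly chosen image for each of the N-1 other ages
    by_age = {}
    for img, age in dataset:
        by_age.setdefault(age, []).append(img)
    negatives = [random.choice(by_age[age])
                 for age in range(1, n_ages + 1) if age != target_age]

    return target_img, positive, negatives
```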
S30: and performing iterative training on the neural network by adopting M face images and M comparison data sets and combining a loss function to obtain an age detection model.
It will be appreciated that there is a comparison data set for each face image; thus, the M face images have M comparison data sets. The neural network is iteratively trained on the M face images and their corresponding M comparison data sets, in combination with a loss function, until the neural network converges, yielding the age detection model. It is understood that the structure and operating mechanism of the "neural network" were introduced in term explanation (1) above and are not repeated here.
In the training process, the neural network learns the age characteristics of the M face images and the positive sample characteristics (the age characteristics of the same age) and the negative sample characteristics (the age characteristics of different ages) reflected by the corresponding M comparison data sets, makes age prediction, and reversely adjusts network parameters based on the loss calculated by the loss function. After multiple times of iterative training and parameter adjustment, the neural network converges to obtain the age detection model.
The loss function reflects the similarity loss between the target face image and the positive sample image and the contrast loss between the target face image and the N-1 negative sample images. The similarity loss evaluates the face feature difference between the target face image and the (same-age) positive sample image; the contrast loss evaluates the face feature difference between the target face image and the (different-age) N-1 negative sample images. The loss function therefore continuously constrains the neural network to learn to reduce the face feature differences within the same age and to enlarge the face feature differences between different ages, so that more representative age feature vectors are learned. Constrained by the contrast loss and the similarity loss, the network converges in the direction of reducing same-age face feature differences and enlarging different-age face feature differences. The age detection model obtained after convergence can therefore better identify representative age features and has higher age detection accuracy.
It is understood that convergence may refer to the fact that under certain model parameters, the sum of the differences between the real ages and the predicted ages in the training set is smaller than a preset threshold or fluctuates within a certain range.
In some embodiments, the Adam algorithm is used to optimize the model parameters. For example, the number of iterations is set to 100,000, the initial learning rate to 0.001, and the weight decay to 0.0005, with the learning rate decayed to 1/10 of its previous value every 1000 iterations. The learning rate and the loss are fed to the Adam optimizer to obtain the adjusted model parameters, which are used for the next training step, until training completes and the model parameters of the converged neural network are output. The converged neural network then serves as the age detection model.
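A minimal PyTorch sketch of this optimization schedule; model, compute_loss, and next_batch are assumed placeholders, and the hyperparameters mirror those above:

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=0.001, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.1)

for step in range(100_000):
    total_loss = compute_loss(model, next_batch())  # age + contrast + similarity losses
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    scheduler.step()  # lr *= 0.1 every 1000 iterations
```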
In this embodiment, during training the neural network learns to reduce the face feature differences within the same age and, by contrast, to enlarge the face feature differences between different ages, thereby learning more representative age feature vectors. Constrained by the contrast loss and the similarity loss, the network converges in the direction of reducing same-age face feature differences and enlarging different-age face feature differences. The age detection model obtained after convergence can therefore better identify representative age features and has higher age detection accuracy.
In some embodiments, referring to fig. 4, the neural network includes a comparison network and an age detection network, the age detection network includes a first feature extraction module and an age predictor, the comparison network includes a second feature extraction module and a full connection layer, and the first feature extraction module and the second feature extraction module have the same structure and share parameters.
It can be understood that the first feature extraction module is configured to perform downsampling on an input image to extract features, and the obtained feature map is input into the age predictor to perform age prediction. The second feature extraction module has the same structure as the first feature extraction module, so that downsampling feature extraction can be performed on an image input into a contrast network, and an obtained feature map is input into a full-connection layer to be subjected to spatial mapping and is converted into a vector.
It is understood that the second feature extraction module or the first feature extraction module includes a plurality of convolutional layers, each convolutional layer is configured with a convolution kernel, and the parameters refer to the weight and deviation of the convolution kernel in each convolutional layer. The first feature extraction module and the second feature extraction module share parameters, which means that weights and deviations of convolution kernels in corresponding convolution layers are the same. Therefore, the comparison age difference characteristics learned by the second characteristic extraction module in the comparison network can influence the first characteristic extraction module, so that the age detection network can learn the comparison age difference characteristics under the condition of not increasing the network complexity. The "contrast age difference feature" refers to the face difference features of N-1 negative sample face images, that is, the face difference features at different ages.
The fully-connected layer is a special convolutional layer commonly used in the field, responsible for converting the two-dimensional feature map output by the convolutions into a one-dimensional vector. Here, each node of the fully-connected layer is connected to all nodes of the last convolutional layer of the second feature extraction module, and the 10 × 10 × 256 feature map output by that convolutional layer is spatially mapped into one feature vector.
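A minimal PyTorch sketch of this two-branch arrangement; parameter sharing between the first and second feature extraction modules is obtained simply by reusing one extractor instance in both branches (the class and attribute names are illustrative, not from the patent):

```python
import torch.nn as nn

class AgeNetwork(nn.Module):
    """Age detection branch and contrast branch sharing one feature extractor."""
    def __init__(self, extractor: nn.Module, n_ages: int = 100, k: int = 1024):
        super().__init__()
        self.extractor = extractor            # used by BOTH branches -> shared weights
        self.age_predictor = nn.Sequential(   # age detection network head
            nn.Flatten(), nn.Linear(10 * 10 * 256, n_ages))
        self.fc = nn.Sequential(              # contrast network head (spatial mapping)
            nn.Flatten(), nn.Linear(10 * 10 * 256, k))

    def forward(self, face, contrast_batch):
        age_logits = self.age_predictor(self.extractor(face))
        vectors = self.fc(self.extractor(contrast_batch))  # V_K, V_K+, V_1-..V_{N-1}-
        return age_logits, vectors
```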
In this embodiment, the step S30 specifically includes:
s31: and inputting the target face image into an age detection network to obtain the predicted age.
It can be understood that after the target face image is subjected to down-sampling feature extraction by the first feature extraction module in the age detection network, the obtained feature map is input into the age predictor for age prediction, and the age predictor outputs the predicted age. In some embodiments, the age predictor may include a softmax classifier.
S32: and inputting a comparison data set corresponding to the target face image into a comparison network for feature extraction and space mapping conversion to obtain a target face feature vector, a positive sample feature vector and N-1 negative sample feature vectors.
It can be understood that the comparison data set comprises N +1 images, the images are input into the comparison network, and each image is respectively subjected to feature extraction and spatial mapping conversion, so that a target face feature vector, a positive sample feature vector and N-1 negative sample feature vectors can be respectively obtained.
In some embodiments, the comparison data set H = {V, V+, V-}, where V- = {V_1^-, ..., V_{N-1}^-}, is input into the contrast network. Each image undergoes feature extraction and spatial mapping conversion, and the network outputs a K-dimensional target face feature vector V_K, a K-dimensional positive sample feature vector V_{K+}, and N-1 K-dimensional negative sample feature vectors V_{K-}.
S33: and calculating the corresponding loss of the target face image by adopting a loss function. The loss includes age loss between the predicted age and the real age, similarity loss between the target face image and the positive sample image, and contrast loss between the target face image and the N-1 negative sample images.
S34: and performing iterative training on the neural network according to the loss sum corresponding to the M face images until convergence, and taking the converged age detection network as an age detection model.
Here, the loss function may be configured in the terminal by a person skilled in the art, the configured loss function is sent to the server along with the neural network, and the server calculates the loss corresponding to each face image by using the loss function after processing the predicted ages corresponding to the M face images. And performing iterative training on the neural network based on the loss corresponding to the M face images until convergence, and taking the converged age detection network as an age detection model. It is understood that the loss function has been described in detail in the above "noun description (2)", and thus, the detailed description thereof is not repeated. Based on the difference between the network structure and the training mode, the structure of the loss function can be set according to the actual situation.
In this embodiment, a loss function is used to calculate the corresponding loss of the target face image, including the age loss between the predicted age and the true age, the similarity loss between the target face image and the positive sample image, and the contrast loss between the target face image and the N-1 negative sample images. From the above, the similarity loss is used to evaluate the difference of the face features between the target face image and the positive sample image (same age); the contrast loss is used to evaluate the difference of the facial features of the target facial image and the N-1 negative sample images (different ages).
As the neural network is iteratively trained, the back-propagated loss continuously constrains the predicted age to approach the real age, while the similarity loss and the contrast loss become smaller and smaller. The network is thus continuously constrained to learn to reduce the face feature differences within the same age and to enlarge the face feature differences between different ages, learning more representative age feature vectors. Constrained by the contrast loss and the similarity loss, the network converges in the direction of reducing same-age face feature differences and enlarging different-age face feature differences. The converged age detection model can therefore better identify representative age features and has higher age detection accuracy. In addition, because the converged age detection network alone is used as the age detection model, the model is structurally simple and lightweight, which helps improve running speed and makes age detection fast and accurate.
In some embodiments, the first feature extraction module includes a plurality of convolution layers each followed by an activation function layer and a normalization layer, the plurality of convolution layers configured with convolution kernels of the same size and the same step size.
Referring again to fig. 4, the first feature extraction module includes 5 convolutional layers, each followed by an activation function layer (a ReLU activation function) and a normalization layer. The numbers of convolution kernels of the 5 convolutional layers are set to 16, 32, 64, 128, and 256 in this order, so that for a 320 × 320 input the 5 convolutional layers output feature maps of sizes 160 × 160 × 16, 80 × 80 × 32, 40 × 40 × 64, 20 × 20 × 128, and 10 × 10 × 256, respectively.
The 5 convolutional layers may use convolution kernels of size 3 × 3, each with the step size set to 2. The uniform step size setting effectively reduces the information loss caused by downsampling and upsampling, and the 3 × 3 convolution kernels help reduce the aliasing effect after feature map fusion.
It is understood that the network structure of the first feature extraction module can be represented by the following mathematical expression:

X_{l+1}^{m} = IN( σ( W * X_{l}^{m} + B ) )

wherein X_{l}^{m} denotes the m-th feature map of layer l, and X_{l+1}^{m} denotes the m-th feature map of layer l+1; W denotes the convolution kernel, whose size is set to 3 × 3 to reduce the aliasing effect after feature map fusion; B denotes a bias term; σ(·) denotes the ReLU activation function; and IN denotes instance normalization.
In this embodiment, the plurality of convolutional layers are configured with convolution kernels of the same size and the same step size: the step size setting effectively reduces the information loss caused by downsampling and upsampling, and the uniform kernel size helps reduce the aliasing effect after feature map fusion.
It can be understood that, as shown in fig. 4, the first feature extraction module and the second feature extraction module have the same structure and share parameters, and thus, the second feature extraction module has the same performance as the first feature extraction module.
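A minimal PyTorch sketch of such a feature extraction module, under the assumptions above (five 3 × 3, stride-2 convolutions with 16-256 kernels, each followed by ReLU and instance normalization):

```python
import torch.nn as nn

def make_feature_extractor() -> nn.Sequential:
    """Five conv layers: 320x320x3 -> 10x10x256, each conv -> ReLU -> InstanceNorm."""
    layers, in_ch = [], 3
    for out_ch in (16, 32, 64, 128, 256):
        layers += [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.InstanceNorm2d(out_ch),
        ]
        in_ch = out_ch
    return nn.Sequential(*layers)
```

Instantiating this extractor once and passing it to both branches (as in the AgeNetwork sketch above) realizes the parameter sharing between the first and second feature extraction modules.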
In some embodiments, the age predictor includes a multi-layer perceptron module.
It is understood that the multi-layer perceptron module includes an input layer, multiple hidden layers, and an output layer, wherein the input layer includes N neurons, each hidden layer includes Q neurons, and the output layer includes K neurons. The operation of each layer may be described by a functional expression, and the functional expression differs from layer to layer.
It will be appreciated that, if the input information feature code is denoted by x, the input layer feeds x to the hidden layer, and the output of the hidden layer may be f(w_1 x + b_1), where w_1 is a weight, b_1 is an offset, and the function f may be a commonly used sigmoid or tanh function. The hidden layer to the output layer is equivalent to a multi-class logistic regression, i.e. softmax regression, so the output of the output layer is softmax(w_2 x_1 + b_2), where x_1 is the hidden-layer output f(w_1 x + b_1).

Thus, the multi-layer perceptron module may be represented by the following formula:

MLP(x) = G( W_{h+1} · S( W_h · ... S( W_1 x + b_1 ) ... + b_h ) + b_{h+1} )

wherein G denotes the softmax activation function, h denotes the number of hidden layers, W_i and b_i denote the weights and offsets of the i-th hidden layer, x denotes the input information feature code, W_1 and b_1 denote the weights and offsets of the input layer, S denotes the activation function, and MLP(x) denotes the target information vector.
In some embodiments, K may be 1024, so that the output layer outputs a one-dimensional vector with length of 1024, i.e. a target information vector with length of 1024.
Each layer of the multi-layer perceptron module uses an activation function, which introduces nonlinear factors into the neurons so that the module can approximate arbitrary nonlinear functions, allowing more nonlinear models to be exploited. The multi-layer perceptron module has good feature extraction capability for discrete information, so the extracted target information vector can fully reflect the features of the information feature code, i.e. the age feature information.
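A minimal PyTorch sketch of such a multi-layer perceptron age predictor, assuming one hidden layer of width 1024 and a softmax output over 100 age classes; the sizes and the choice of tanh are illustrative assumptions:

```python
import torch.nn as nn

class AgePredictor(nn.Module):
    """Multi-layer perceptron head: flattened feature map -> softmax over N ages."""
    def __init__(self, in_dim=10 * 10 * 256, hidden=1024, n_ages=100):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Flatten(),
            nn.Linear(in_dim, hidden), nn.Tanh(),  # S: tanh (or sigmoid)
            nn.Linear(hidden, n_ages),
        )

    def forward(self, feat):
        return self.mlp(feat).softmax(dim=-1)  # G: softmax over the age classes
```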
In some embodiments, the loss function includes an age loss function, a similarity loss function, and a contrast loss function. For example, the loss function may be a weighted sum of an age loss function, a similarity loss function, and a contrast loss function.
The age loss function is used for calculating the age loss between the predicted age and the real age, the similarity loss function is used for calculating the similarity loss between the target face image and the positive sample image, and the contrast loss function is used for calculating the contrast loss between the target face image and the N-1 negative sample images.
As will be understood by those skilled in the art, the age loss function computes the deviation between the predicted age and the real age: a larger age loss indicates a larger deviation between the predicted and real age, and a smaller age loss a smaller deviation.
The similarity loss function computes the face feature difference between the target face image and the (same-age) positive sample image: a larger similarity loss indicates a larger same-age face feature difference, and a smaller loss a smaller one.
The contrast loss computes the face feature difference between the target face image and the (different-age) negative sample images: a larger contrast loss indicates a larger different-age face feature difference, and a smaller loss a smaller one.
In some embodiments, the similarity loss function includes a similarity matrix reflecting degrees of similarity of age features of faces of similar ages. In some embodiments, the similarity matrix may be a product of the target face feature vector, the positive sample feature vector, and the N-1 negative sample feature vectors, reflecting the degree of similarity of face age features of similar ages.
In this embodiment, the similarity loss function uses a similarity matrix, which can further constrain the similarity of the age features of the faces at similar ages, thereby further improving the accuracy of the age prediction.
In some embodiments, the loss function is:
Loss = Loss_L1 + L_con + Loss_dis

wherein:

Loss_L1 = |Y − T|_L1

L_con = −log [ exp(V_K · V_K+ / τ) / ( exp(V_K · V_K+ / τ) + Σ_{i=1}^{N−1} exp(V_K · V_K−^i / τ) ) ]

Loss_dis = −log [ exp(S_1 / τ) / Σ_{j=1}^{N} exp(S_j / τ) ]

S = [ V_K · V_K+, V_K · V_K−^1, …, V_K · V_K−^{N−1} ]

where Loss is the loss function, Loss_L1 is the age loss function, L_con is the contrast loss function, Loss_dis is the similarity loss function, Y is the predicted age, T is the true age, V_K is the target face feature vector, V_K+ is the positive sample feature vector, V_K−^i is the ith negative sample feature vector, i is the index, τ is a coefficient, S is the similarity matrix, and S_j denotes its jth entry (the first entry corresponds to the positive sample).
In this embodiment, the loss of any target face image among the M face images includes a weighted sum of the age loss, the contrast loss, and the similarity loss. In back propagation, this loss guides the neural network to reduce the difference between face features of the same age and to enlarge the difference between face features of different ages, so that it learns more representative age feature vectors. By constraining the neural network with the contrast loss and the similarity loss, the converged age detection model can better identify representative age features and achieves higher age detection accuracy.
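To make the computation concrete, the combined loss might be sketched in PyTorch as follows. The dot product as the similarity measure, the softmax cross-entropy form of Loss_dis, the equal weighting of the three terms, and all function names are assumptions of this sketch, not definitions from the present application.

```python
import torch
import torch.nn.functional as F

def combined_loss(pred_age, true_age, v_k, v_pos, v_negs, tau: float = 0.07):
    """Sum of age loss, contrast loss and similarity loss for one batch.

    pred_age, true_age: (B,)    predicted and labeled ages
    v_k:    (B, D)      target face feature vectors
    v_pos:  (B, D)      positive sample feature vectors (same age)
    v_negs: (B, N-1, D) negative sample feature vectors (other ages)
    """
    # Age loss: L1 deviation between predicted and true age
    loss_l1 = F.l1_loss(pred_age, true_age)

    # Contrast loss: pull v_k toward v_pos, push it away from the negatives
    pos = torch.exp((v_k * v_pos).sum(-1) / tau)                    # (B,)
    neg = torch.exp(torch.einsum('bd,bnd->bn', v_k, v_negs) / tau)  # (B, N-1)
    l_con = -torch.log(pos / (pos + neg.sum(-1))).mean()

    # Similarity matrix S: products of the target vector with the positive
    # sample vector (first column) and the N-1 negative sample vectors
    s = torch.cat([(v_k * v_pos).sum(-1, keepdim=True),
                   torch.einsum('bd,bnd->bn', v_k, v_negs)], dim=-1)  # (B, N)

    # Similarity loss: cross-entropy over S with the positive as target class
    target = torch.zeros(s.size(0), dtype=torch.long, device=s.device)
    loss_dis = F.cross_entropy(s / tau, target)

    return loss_l1 + l_con + loss_dis
```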
In summary, according to the method for training an age detection model provided in the embodiments of the present application, M face images are obtained first, each labeled with a real age, the real ages of the M face images covering 1 to N years. For any target face image among the M face images, a comparison data set is constructed, which includes the target face image, a positive sample image whose real age is the same as that of the target face image, and N-1 negative sample images whose real ages differ from that of the target face image. Finally, the neural network is iteratively trained with the M face images and the M comparison data sets in combination with a loss function, yielding the age detection model, where the loss function reflects the similarity loss between the target face image and the positive sample image and the contrast loss between the target face image and the N-1 negative sample images.
By constructing comparison data sets with positive and negative sample images and iteratively training the neural network on the M face images and M comparison data sets with a loss function that reflects both the similarity loss and the contrast loss, the neural network learns to reduce the face feature differences within the same age, to enlarge the face feature differences across different ages, and thus to learn more representative age feature vectors. The converged age detection model can therefore better identify representative age features and achieves higher age detection accuracy.
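As an illustration of the construction step, a comparison data set for each target face image might be assembled as in the following sketch; the helper name, the dictionary layout, and the random sampling of positive and negative images are assumptions of the example.

```python
import random
from collections import defaultdict

def build_contrast_sets(images, ages, n_ages):
    """For each target image, pick one positive of the same labeled age and
    one negative per different age (up to N-1 negatives for ages 1..N)."""
    by_age = defaultdict(list)
    for img, age in zip(images, ages):
        by_age[age].append(img)

    contrast_sets = []
    for img, age in zip(images, ages):
        # Positive sample: another image with the same real age
        candidates = [p for p in by_age[age] if p is not img]
        positive = random.choice(candidates) if candidates else img
        # Negative samples: one image per different age that has data
        negatives = [random.choice(by_age[a])
                     for a in range(1, n_ages + 1)
                     if a != age and by_age[a]]
        contrast_sets.append({'target': img, 'positive': positive,
                              'negatives': negatives})
    return contrast_sets
```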
After the age detection model has been trained by the method for training an age detection model provided in the embodiments of the present application, it can be applied to age detection. The age detection method provided in the embodiments of the present application can be implemented by various electronic devices with computing capability, such as intelligent terminals and servers.
The following describes the age detection method provided in the embodiments of the present application with reference to exemplary applications and implementations of the terminal. Referring to fig. 5, fig. 5 is a schematic flowchart of an age detection method according to an embodiment of the present application. The method S30 may comprise the following steps:
s31: and acquiring a human face image to be detected.
Here, the face image to be detected refers to a face image whose age is to be detected. It is understood that the face image to be detected contains a human face.
In a specific implementation, application software built into a terminal (for example, a smartphone) acquires the face image to be detected. The face image to be detected may be captured by the terminal or input into the terminal by the user.
S32: and inputting the facial image to be detected into an age detection model for age detection to obtain the age corresponding to the facial image to be detected.
Here, the age detection model refers to the age detection model trained according to the method embodiments of figs. 3 to 4; it has the same structure and function as the age detection model in the embodiments described above, which will not be repeated here.
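A minimal inference sketch under stated assumptions: PyTorch is used, the converged model object was serialized to a checkpoint file, the network regresses a scalar age, and the preprocessing (input resolution, normalization) matches what was used in training; every file name below is hypothetical.

```python
import torch
from PIL import Image
from torchvision import transforms

# Hypothetical preprocessing; it must match the training pipeline
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# Hypothetical checkpoint of the converged age detection network
model = torch.load('age_detection_model.pt', map_location='cpu', weights_only=False)
model.eval()

image = Image.open('face_to_detect.jpg').convert('RGB')
with torch.no_grad():
    predicted_age = model(preprocess(image).unsqueeze(0))
print(f'Predicted age: {predicted_age.item():.1f}')
```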
Embodiments of the present application also provide a computer-readable storage medium storing computer-executable instructions for causing an electronic device to perform the method for training an age detection model provided in the embodiments of the present application, for example the method shown in figs. 3 to 4, or the age detection method provided in the embodiments of the present application, for example the method shown in fig. 5.
In some embodiments, the storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or CD-ROM; or may be various devices including one or any combination of the above memories.
In some embodiments, executable instructions may be written in any form of programming language (including compiled or interpreted languages), in the form of programs, software modules, scripts or code, and may be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.
By way of example, executable instructions may, but need not, correspond to files in a file system, and may be stored in a portion of a file that holds other programs or data, for example in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code).
By way of example, executable instructions may be deployed to be executed on one computing device (including a smart terminal or a server), on multiple computing devices located at one site, or on multiple computing devices distributed across multiple sites and interconnected by a communication network.
Embodiments of the present application also provide a computer-readable storage medium storing a computer program, where the computer program includes program instructions, and when the program instructions are executed by a computer, the computer executes the method for training an age detection model or the age detection method according to the foregoing embodiments.
It should be noted that the above-described device embodiments are merely illustrative, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a general hardware platform, and certainly also by hardware. Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware; the computer program can be stored in a computer-readable storage medium and, when executed, can include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Within the idea of the present application, technical features in the above embodiments or in different embodiments may be combined, the steps may be implemented in any order, and many other variations of the different aspects of the present application exist as described above, which are not provided in detail for the sake of brevity. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or equivalently replace some technical features, and such modifications or replacements do not depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

1. A method of training an age detection model, comprising:
acquiring M face images, wherein each face image is marked with a real age, and the real ages of the M face images cover 1-N years old;
constructing a comparison data set for any target face image among the M face images, wherein the comparison data set comprises the target face image, a positive sample image, and N-1 negative sample images, the real age of the positive sample image is the same as that of the target face image, and the real ages of the N-1 negative sample images are different from that of the target face image;
and performing iterative training on a neural network by using the M face images and the M comparison data sets in combination with a loss function to obtain an age detection model, wherein the loss function reflects the similarity loss between the target face image and the positive sample image and the contrast loss between the target face image and the N-1 negative sample images.
2. The method of claim 1, wherein the neural network comprises a comparison network and an age detection network, wherein the age detection network comprises a first feature extraction module and an age predictor, wherein the comparison network comprises a second feature extraction module and a fully connected layer, and wherein the first feature extraction module and the second feature extraction module are structurally identical and share parameters;
the performing iterative training on the neural network by using the M face images and the M comparison data sets in combination with the loss function to obtain the age detection model comprises:
inputting the target face image into the age detection network to obtain a predicted age;
inputting a contrast data set corresponding to the target face image into the contrast network for feature extraction and spatial mapping conversion to obtain a target face feature vector, a positive sample feature vector and N-1 negative sample feature vectors;
calculating the loss corresponding to the target face image by adopting a loss function, wherein the loss comprises the age loss between the predicted age and the real age, the similarity loss between the target face image and the positive sample image, and the contrast loss between the target face image and the N-1 negative sample images;
and performing iterative training on the neural network according to the sum of the losses corresponding to the M face images until convergence, and taking the converged age detection network as the age detection model.
3. The method of claim 2, wherein the first feature extraction module comprises a plurality of convolutional layers, each configured with an activation function layer and a normalization layer, the plurality of convolutional layers configured with convolutional kernels of the same size and with the same step size.
4. The method of claim 3, wherein the age predictor comprises a multi-tier perceptron module.
5. The method according to any one of claims 2 to 4, wherein the loss function includes an age loss function for calculating an age loss between the predicted age and a true age, a similarity loss function for calculating a similarity loss between the target face image and the positive sample image, and a contrast loss function for calculating a contrast loss between the target face image and the N-1 negative sample images.
6. The method of claim 5, wherein the similarity loss function comprises a similarity matrix reflecting the similarity degree of the age characteristics of the human faces of similar ages.
7. The method of claim 6, wherein the loss function is
Loss = Loss_L1 + L_con + Loss_dis

wherein Loss_L1 = |Y − T|_L1,

L_con = −log [ exp(V_K · V_K+ / τ) / ( exp(V_K · V_K+ / τ) + Σ_{i=1}^{N−1} exp(V_K · V_K−^i / τ) ) ],

Loss_dis = −log [ exp(S_1 / τ) / Σ_{j=1}^{N} exp(S_j / τ) ], S = [ V_K · V_K+, V_K · V_K−^1, …, V_K · V_K−^{N−1} ],

wherein Loss is the loss function, Loss_L1 is the age loss function, L_con is the contrast loss function, Loss_dis is the similarity loss function, Y is the predicted age, T is the true age, V_K is the target face feature vector, V_K+ is the positive sample feature vector, V_K−^i is the ith negative sample feature vector, i is an index, τ is a coefficient, and S is the similarity matrix.
8. An age detection method, comprising:
acquiring a human face image to be detected;
inputting the face image to be detected into an age detection model for age detection to obtain the age corresponding to the face image to be detected, wherein the age detection model is obtained by training by adopting the method for training the age detection model according to any one of claims 1 to 7.
9. An electronic device, comprising:
at least one processor, and
a memory communicatively coupled to the at least one processor, wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
10. A computer-readable storage medium having stored thereon computer-executable instructions for causing a computer device to perform the method of any one of claims 1-8.